AWS Cloud Operations Review

In late 2019, a multinational enterprise asked Akkodis to review its AWS Cloud operations.

The customer had started its AWS Cloud journey in early 2018 and applied a large amount of automation to its VPC environment creation. By the time of the review, there were concerns about how well the environment met several of the AWS Well-Architected principles, particularly around security.

The Challenge

The landscape was growing: the customer had over 150 AWS accounts and over 100 Virtual Private Cloud (VPC) environments. The organisation ran its corporate DNS domain in split-horizon, with an internal (on-premises) view whose names all cloud environments had to be able to resolve.

Following the common practice of 2018, they had established a shared-services VPC with a pair of DNS resolvers running on two small EC2 instances. In each connected VPC, they had updated the DHCP Options to point all clients at these two EC2 instances, which in turn would either query DNS on premises or resolve names in the outside world.

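As they started to use VPC Endpoints for various services in their VPC architectures, they found that the endpoints were not being used in each workload account: because DNS queries were answered from the shared-services VPC, service names did not resolve to the endpoints local to each workload VPC.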

It also became apparent that GuardDuty, enabled for all environments, was not getting visibility of the DNS queries in the originating source environment: they were only appearing in the shared-services environment, which made problem determination more difficult.

Akkodis has had extensive experience with the GuardDuty service since its launch. The security findings that AWS has improved and added over time make it a key piece of workload visibility and security operations.

The Solution

Akkodis routinely makes incremental improvements to its customers' environments when delivering an ongoing DevOps engagement and, as such, pays close attention to changes that fundamentally improve the capability of the AWS Cloud environment.

One key release from November 2018 was the ability to define an outbound DNS resolver rule that supports the split-horizon view for the VPC DNS resolver. Delivered as outbound endpoints and forwarding rules in Amazon Route 53 Resolver, it replaced the customer's existing architecture and removed the need to run (and maintain) their own DNS services for the purpose of split-horizon.

This service also allows the outbound rule and endpoints to be shared across accounts using AWS Resource Access Manager (RAM). Thus, only one Outbound Resolver needs to be defined for an enterprise, with the Resolver having two endpoints that originate queries towards the on-premises resolver from different Availability Zones of the existing shared-services VPC.
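As a rough illustration of the pieces described above, the sketch below builds the request parameters for an outbound endpoint, a forwarding rule and a RAM share. The parameter names follow the Route 53 Resolver and RAM APIs (`create_resolver_endpoint`, `create_resolver_rule`, `create_resource_share`), but the helper functions and every identifier, domain name and IP address passed to them are hypothetical and are not taken from the customer environment. Nothing here calls AWS; the dictionaries would be passed to the corresponding boto3 client methods.

```python
import uuid


def outbound_endpoint_request(name, security_group_id, subnet_ids):
    """Parameters for route53resolver create_resolver_endpoint (outbound)."""
    if len(subnet_ids) < 2:
        # Two subnets in different Availability Zones give the two endpoints
        # described in the text.
        raise ValueError("use at least two subnets in different Availability Zones")
    return {
        "CreatorRequestId": str(uuid.uuid4()),  # idempotency token
        "Name": name,
        "Direction": "OUTBOUND",
        "SecurityGroupIds": [security_group_id],
        "IpAddresses": [{"SubnetId": subnet_id} for subnet_id in subnet_ids],
    }


def forward_rule_request(name, domain, endpoint_id, on_prem_resolver_ips):
    """Parameters for route53resolver create_resolver_rule (forwarding rule)."""
    return {
        "CreatorRequestId": str(uuid.uuid4()),
        "Name": name,
        "RuleType": "FORWARD",
        "DomainName": domain,  # the internal (split-horizon) domain
        "ResolverEndpointId": endpoint_id,
        "TargetIps": [{"Ip": ip, "Port": 53} for ip in on_prem_resolver_ips],
    }


def ram_share_request(name, rule_arn, organization_arn):
    """Parameters for ram create_resource_share, limited to the organisation."""
    return {
        "name": name,
        "resourceArns": [rule_arn],
        "principals": [organization_arn],
        "allowExternalPrincipals": False,
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    endpoint = outbound_endpoint_request(
        "shared-outbound", "sg-0123456789abcdef0", ["subnet-aaaa", "subnet-bbbb"]
    )
    rule = forward_rule_request(
        "corp-forward", "corp.example.com", "rslvr-out-example", ["10.0.0.2", "10.0.0.3"]
    )
    print(endpoint["Direction"], rule["RuleType"], len(rule["TargetIps"]))
```

Keeping the request construction separate from the API calls makes the configuration easy to review before anything is created.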

With this rule shared across all accounts in the AWS Organization, each workload could then attach the rule to its VPC(s), adjust the VPC DHCP Options to reference AmazonProvidedDNS (local per VPC) and wait for hosts to refresh their DHCP leases.
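The per-workload steps above can be sketched as an ordered plan for one VPC. This is a minimal sketch, assuming the shared rule ID and the domain name are already known; the function name, the step labels and all identifiers are hypothetical. The parameter shapes match the `associate_resolver_rule` (Route 53 Resolver) and `create_dhcp_options` (EC2) API calls, and the new DHCP options set would then be attached with `associate_dhcp_options`.

```python
def workload_vpc_plan(vpc_id, shared_rule_id, domain_name):
    """Return the ordered API requests for moving one VPC to the VPC resolver."""
    return [
        # 1. Attach the shared forwarding rule to this VPC.
        ("route53resolver.associate_resolver_rule", {
            "ResolverRuleId": shared_rule_id,
            "VPCId": vpc_id,
            "Name": f"split-horizon-{vpc_id}",
        }),
        # 2. Create DHCP options that point clients at the VPC-local resolver.
        ("ec2.create_dhcp_options", {
            "DhcpConfigurations": [
                {"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]},
                {"Key": "domain-name", "Values": [domain_name]},
            ],
        }),
        # 3. ec2.associate_dhcp_options(DhcpOptionsId=..., VpcId=vpc_id) follows,
        #    using the ID returned by step 2; hosts pick up the change when
        #    their DHCP leases refresh.
    ]


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    for call, params in workload_vpc_plan("vpc-0abc", "rslvr-rr-example", "corp.example.com"):
        print(call, params)
```

Associating the rule before changing the DHCP options means the internal domain already resolves through the VPC resolver by the time clients switch over.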

Security Focus

Much of our operational review was directed at security considerations that had become apparent. Security is a principal focus for the Akkodis AWS Cloud Practice, and a key element of the Well-Architected principles that have stood as a foundation of our cloud capability.

For example, EC2 instances had not been re-baselined or refreshed to newer instance families. On some occasions, this was because the installed operating system was not recent enough to support the newer instance types. This is typical of Red Hat Enterprise Linux versions prior to 7.3, which predate the inclusion of the required Linux kernel patches in the baseline kernel (custom kernels can be compiled, but stock kernel support carries less management overhead). EC2 should also have default encryption for EBS volumes enabled, which will also encrypt EBS snapshots at rest.
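A check for the EBS encryption point can be sketched as a small audit over volume records. The sketch assumes the input has the shape of the `Volumes` list returned by the EC2 `describe_volumes` call (each entry carrying `VolumeId` and `Encrypted`); the function name and the sample data are hypothetical. Encryption by default itself is switched on per Region with the EC2 `enable_ebs_encryption_by_default` API call, which only affects volumes created afterwards.

```python
def unencrypted_volumes(volumes):
    """Return the IDs of volumes that are not encrypted at rest, sorted."""
    return sorted(v["VolumeId"] for v in volumes if not v.get("Encrypted", False))


if __name__ == "__main__":
    # Sample data shaped like ec2.describe_volumes()["Volumes"]; IDs are made up.
    sample = [
        {"VolumeId": "vol-0002", "Encrypted": False},
        {"VolumeId": "vol-0001", "Encrypted": True},
        {"VolumeId": "vol-0003"},  # a missing flag is treated as unencrypted
    ]
    print(unencrypted_volumes(sample))
```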

We discovered that Application Load Balancers (ALBs) did not have TLS configurations strict enough to be considered secure by today’s standards, a tightening of standards that we are likely to see repeated every few years in the industry. The AWS Security Token Service (STS) should be disabled in Regions where it is not intended or required for use. Some instances were running with hard-coded IAM credentials, which do not rotate. We always recommend using EC2 IAM Instance Profiles, one per system role in a solution, with policies locked down to permit only the required API calls (if any).

DNS traffic restrictions were also key. VPC Security Groups should not permit UDP and TCP 53 to egress systems, as instances should only use the VPC-provided DNS Resolver. This helps limit the ability of botnet command-and-control traffic to permeate through the network without detection. With the VPC DNS Resolver being used (as detailed above with split-horizon), Amazon GuardDuty can then inspect and alert on suspicious DNS traffic (or Route 53 Resolver DNS Firewall can block it), including data exfiltration via DNS.
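The DNS egress restriction above lends itself to a simple audit. The following is a minimal sketch, assuming security group records shaped like the `SecurityGroups` list from the EC2 `describe_security_groups` call; the function names and sample groups are hypothetical. It flags any egress rule that would let an instance reach port 53 on an arbitrary destination.

```python
def allows_port(rule, port):
    """True if an egress rule permits TCP or UDP traffic to the given port."""
    protocol = rule.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol not in ("tcp", "udp"):
        return False
    return rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)


def dns_egress_findings(security_groups, port=53):
    """Return (group ID, protocol) pairs for egress rules that allow DNS out."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissionsEgress", []):
            if allows_port(rule, port):
                findings.append((group["GroupId"], rule["IpProtocol"]))
    return findings


if __name__ == "__main__":
    # Sample groups with made-up IDs.
    groups = [
        {"GroupId": "sg-open", "IpPermissionsEgress": [{"IpProtocol": "-1"}]},
        {"GroupId": "sg-dns", "IpPermissionsEgress": [
            {"IpProtocol": "udp", "FromPort": 53, "ToPort": 53},
        ]},
        {"GroupId": "sg-web", "IpPermissionsEgress": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443},
        ]},
    ]
    print(dns_egress_findings(groups))
```

Queries to the VPC-provided resolver are not filtered by security groups, which is why removing port 53 from egress rules does not break name resolution for instances that use it.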

Diving deeper on Security Groups, removing all UDP traffic for ingress and egress is a target, as is ensuring that administrative ports like RDP do not permit horizontal movement in a network. Egress rules should be restricted to just the traffic we expect to establish, for example from an app server to a database server, or to S3. Time (NTP), DNS and instance metadata should all be served by link-local services, such as the Amazon Time Sync Service.

Furthermore, removing the use of RDP and SSH in the network for instance introspection, and using AWS Systems Manager Session Manager (a mouthful to say, admittedly; henceforth, SSM) as the data plane for administrative access, means any detected attempt at direct RDP or SSH access can be treated as an attempted compromise.

For Amazon S3, Block Public Access is a standard that should be universally enforced. Any public serving of objects via HTTP/HTTPS should be routed via Amazon CloudFront, and only over HTTPS, with a valid certificate and possibly AWS WAF (Web Application Firewall) in place to limit access. CloudFront can access static objects served from S3 via an Origin Access Identity, so the objects are not anonymous/public on S3. Access to S3 from within a VPC should be via a VPC Gateway Endpoint for S3, which enables policies that not only ensure objects in S3 are fetched from specific VPC Endpoints (and with credentials), but can also stop EC2 instances from accessing third-party open/anonymous buckets as a way of infiltrating or exfiltrating data.
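The two S3 controls mentioned above (an endpoint policy that limits which buckets can be reached, and a bucket policy that requires a specific VPC Endpoint) can be sketched as policy documents. This is an illustrative sketch only: the function names, bucket names and endpoint ID are hypothetical, and a blanket deny like the one below would also block access paths that do not traverse the endpoint, such as CloudFront, so it would need adjusting for buckets that serve content publicly.

```python
import json


def endpoint_policy(allowed_buckets):
    """Gateway endpoint policy that only permits access to approved buckets."""
    resources = []
    for bucket in allowed_buckets:
        resources.append(f"arn:aws:s3:::{bucket}")
        resources.append(f"arn:aws:s3:::{bucket}/*")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowApprovedBucketsOnly",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": resources,
        }],
    }


def bucket_policy_require_endpoint(bucket, vpc_endpoint_id):
    """Bucket policy that denies requests not arriving via the given endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": vpc_endpoint_id}},
        }],
    }


if __name__ == "__main__":
    # Placeholder bucket name and endpoint ID.
    print(json.dumps(endpoint_policy(["example-app-data"]), indent=2))
    print(json.dumps(bucket_policy_require_endpoint("example-app-data", "vpce-0example"), indent=2))
```

Any bucket not listed in the endpoint policy, including a third-party public bucket, is unreachable through that endpoint because nothing in the policy allows it.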

Any proxy services (Squid, BlueCoat, or other types of proxy) deployed into an environment should be configured to block requests to the IPv4 and IPv6 addresses of the Instance Metadata Service, and every application should guard against Server-Side Request Forgery (SSRF) that could trick it into fetching sensitive credentials from the Instance Metadata Service and handing them back to external clients. It’s worth noting that the original malicious request that the proxy is fetching may use a valid DNS name, such as something.domain.com, which, when resolved, points at a local address like 169.254.169.254. Again, having GuardDuty in the DNS traffic flow can help identify these attempts.
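The point that a harmless-looking name can resolve to the metadata address suggests checking the resolved addresses rather than the hostname. Below is a minimal sketch of such a check, assuming a hypothetical `is_safe_destination` helper that an application or proxy hook would call before fetching a URL. The addresses 169.254.169.254 and fd00:ec2::254 are the IPv4 and IPv6 Instance Metadata Service addresses; the `resolver` parameter exists only so the logic can be exercised without real DNS.

```python
import ipaddress
import socket

METADATA_ADDRESSES = {
    ipaddress.ip_address("169.254.169.254"),  # IMDS over IPv4
    ipaddress.ip_address("fd00:ec2::254"),    # IMDS over IPv6
}


def resolve_all(host):
    """Resolve a hostname to the set of addresses it currently points at."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_blocked_address(address):
    """True for metadata, link-local and loopback addresses."""
    ip = ipaddress.ip_address(address.split("%")[0])  # drop any IPv6 scope ID
    return ip in METADATA_ADDRESSES or ip.is_link_local or ip.is_loopback


def is_safe_destination(host, resolver=resolve_all):
    """Reject a destination if any address it resolves to is blocked."""
    try:
        addresses = resolver(host)
    except OSError:  # includes socket.gaierror for names that do not resolve
        return False
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)


if __name__ == "__main__":
    # A fake resolver stands in for a malicious DNS record.
    print(is_safe_destination("something.domain.com", resolver=lambda h: {"169.254.169.254"}))
```

A check like this is only one layer: the name could resolve differently between the check and the actual connection, so blocking at the proxy and monitoring DNS remain worthwhile.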

When using RDS, ensuring all connections to the database are encrypted, by using the force SSL option in the parameter group, helps ensure that credentials on the wire are encrypted, including any that come in from other sites such as via VPN or Direct Connect. Solutions that interact with the database should regularly obtain the fresh RDS root certificate and be able to validate that the Common Name of the database certificate matches the hostname in the connection string.
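On the client side, the certificate validation described above might look like the following sketch using Python's standard `ssl` module. The function name and the CA bundle path in the usage comment are hypothetical; the bundle would be the RDS root certificate file downloaded from AWS. How the resulting context is handed to a database driver varies by driver and is not shown.

```python
import ssl


def rds_tls_context(ca_bundle_path=None):
    """Build a TLS context that verifies the server certificate and hostname.

    ca_bundle_path: path to the RDS root certificate bundle; when None, the
    system trust store is used (useful for testing the settings only).
    """
    context = ssl.create_default_context(cafile=ca_bundle_path)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # create_default_context already enables both checks; they are set
    # explicitly here so the intent survives later edits.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


if __name__ == "__main__":
    # Hypothetical usage: rds_tls_context("/etc/ssl/rds/rds-ca-bundle.pem")
    context = rds_tls_context()
    print(context.check_hostname, context.verify_mode == ssl.CERT_REQUIRED)
```

With hostname checking enabled, a connection fails if the certificate presented does not match the hostname in the connection string, which is the validation the paragraph above asks for.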

Lastly, on CloudFormation templates: performing routine CloudFormation Drift Detection and ensuring solutions remain compliant, either by backporting local changes into templates or by removing local changes to regain compliance, helps with security visibility. It’s also an opportunity to ensure that administrators take required changes through review and release procedures. Many other recommendations were made (more than 30), but the lesson is quite clear: failing to keep pace with evolving security threats, and to incorporate the corresponding improvements into operational solutions, leaves open exposures that activities like patching alone cannot remediate.
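A routine drift review, as recommended above for CloudFormation, can be reduced to a short summary per stack. The sketch below assumes records shaped like the `StackResourceDrifts` list returned by the CloudFormation `describe_stack_resource_drifts` call (after a `detect_stack_drift` run); the function name and sample resources are hypothetical.

```python
from collections import Counter

DRIFTED_STATUSES = ("MODIFIED", "DELETED")


def summarise_drift(resource_drifts):
    """Summarise drift results for one stack."""
    counts = Counter(d["StackResourceDriftStatus"] for d in resource_drifts)
    drifted = sorted(
        d["LogicalResourceId"]
        for d in resource_drifts
        if d["StackResourceDriftStatus"] in DRIFTED_STATUSES
    )
    return {"counts": dict(counts), "drifted": drifted, "compliant": not drifted}


if __name__ == "__main__":
    # Made-up resources for illustration.
    sample = [
        {"LogicalResourceId": "WebSecurityGroup", "StackResourceDriftStatus": "MODIFIED"},
        {"LogicalResourceId": "AppBucket", "StackResourceDriftStatus": "IN_SYNC"},
        {"LogicalResourceId": "OldQueue", "StackResourceDriftStatus": "DELETED"},
    ]
    print(summarise_drift(sample))
```

Each entry in `drifted` is then a decision point: backport the change into the template, or remove the local change.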

Outcomes and Results

There are several improvements the customer now sees:

  • No maintenance on the DNS server software (the VPC resolver is a fully managed component of the VPC)
  • No maintenance on the DNS server hosts (OS patching)
  • No instance maintenance of the DNS server host (instance family changes)
  • No VPC peering (or other connectivity) from workload accounts to shared services purely for DNS traffic
  • All instances can have their egress Security Group rules cleaned up to remove outbound UDP and TCP 53 completely, for a better security posture
  • GuardDuty now sees security issues in the source account for errant or unwanted DNS traffic
  • Local VPC Endpoints per VPC now resolve correctly
  • Increased DNS capacity: not relying on two DNS servers, but a scalable VPC cache as the EC2 fleet grows across all accounts.
  • No interruption to service during host maintenance; previously, during maintenance/reboot of a DNS host, 50% of DNS queries would hit a 2-second timeout before attempting to query the second DNS host (UDP transport default timeout)
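The maintenance figure in the last bullet can be checked with simple arithmetic. The sketch below assumes that clients spread their first query evenly across the configured resolvers and that a query sent to an unavailable resolver waits for one full timeout before trying the next; both are simplifying assumptions, and the function name is hypothetical.

```python
def maintenance_impact(resolvers, unavailable, timeout_seconds):
    """Return (fraction of queries delayed, average added delay in seconds)."""
    fraction_delayed = unavailable / resolvers
    return fraction_delayed, fraction_delayed * timeout_seconds


if __name__ == "__main__":
    # Two resolvers, one down for maintenance, 2-second timeout (from the text).
    fraction, average_delay = maintenance_impact(2, 1, 2.0)
    print(f"{fraction:.0%} of queries delayed, {average_delay:.1f}s added on average")
```

With two resolvers and one rebooting, half of the queries wait 2 seconds, which averages out to 1 second of added latency per query for the duration of the maintenance.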

The change replaces two EC2 instances with one Route 53 Resolver outbound endpoint configuration of similar cost.