Averting Cloud Misconfiguration at Scale with DIAL


Cloud misconfigurations are a major cause of cloud security breaches, with more than 40% of breaches attributed to them. The constraints on building and establishing a mature cloud-native incident response capability are seen industry-wide. Attackers and defenders are locked in a cat-and-mouse game over ever-growing cloud assets, and this has raised alarms in information security teams across organizations about how effectively they are protecting their customers and securing their data.

Infrastructure drift is becoming increasingly volatile, and the speed at which an organisation’s infrastructure changes raises an important question: “How often do we struggle to address a critical incident in the cloud just because the alerts were delayed?”

The classic cloud misconfigurations (viz. wide-open security groups, internet-exposed PaaS/SaaS services, unpatched/orphaned systems) are a recurring problem for an organization at any scale. The urgency is even higher because cyber criminals continuously scan cloud service providers with bots, scouring for misconfigurations to exploit.

One of the crucial Key Performance Questions (KPQ) for any incident response process is how to continuously reduce Mean Time to Detect (MTTD) from days to seconds. The scale and complexity of this challenge prompted us to take a step back and think about how we could address it. This problem statement led us to create a tool that monitors for such misconfigurations and protects our infrastructure against threats.

This blog introduces DIAL (Did I Alert a Lambda?), the tool we built here at CRED to automatically monitor, detect and alert on misconfigurations in our cloud infrastructure.

meet DIAL

DIAL Overview

DIAL (Did I Alert a Lambda?) is a single-stop solution that helps us monitor all our AWS accounts, detect misconfigurations and keep an eye on our infrastructure in near real time. The security team at CRED uses DIAL to centrally investigate threats across all AWS accounts, with complete control over the granularity of any detected security misconfigurations and threats. Some of the key advantages of using DIAL are –

  • With the modular and stateless master-worker architecture, it is easy to deploy and scale.
  • All alerts are stored in DynamoDB, which acts as a data store for historical incidents and can be used for analytics or long-term storage.
  • These alerts could be forwarded to your SIEM or incident management platform for further analysis or tracking.
  • Severity of different types of alerts can be customized based on your preference and organizational needs.
  • Services Covered by DIAL – EC2, S3, IAM, RDS, DynamoDB, SSM Parameter Store, Secrets Manager, ECR, ES and Route53.
  • Cost efficient as this is invoked only when events of interest are seen.
  • Easily configurable modules that can be used to add or modify rules to cover new AWS services on any number of AWS accounts.
  • Configurable severity based on the location of the worker node.

DIAL has an inbuilt detection mechanism for some of the common misconfigurations that can put your AWS infrastructure at risk. These include the creation of wide-open security groups for compute or databases, the creation of publicly readable data stores, overly permissive IAM access policies and the deletion of multi-factor authentication devices for an IAM user, to name a few. Possible outcomes of such misconfigurations include data leaks, servers being exploited to mine cryptocurrency and servers becoming part of a bot network.
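
To make the idea of a detection rule concrete, here is a minimal sketch of how a wide-open security group could be spotted in a CloudTrail record. This is not DIAL's actual code: the function name and the output fields are our own illustration, and the layout of `requestParameters` is assumed to follow the shape commonly seen in `AuthorizeSecurityGroupIngress` records.

```python
from __future__ import annotations

from typing import Any

# CIDR blocks that mean "reachable from anywhere".
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def find_open_ingress(detail: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the ingress permissions in a CloudTrail record that are open to the world.

    `detail` is the CloudTrail record (the `detail` field of the EventBridge event).
    The nested `items` layout below is an assumption about the record shape.
    """
    if detail.get("eventName") != "AuthorizeSecurityGroupIngress":
        return []

    params = detail.get("requestParameters") or {}
    permissions = (params.get("ipPermissions") or {}).get("items", [])

    findings = []
    for perm in permissions:
        v4 = [r.get("cidrIp") for r in (perm.get("ipRanges") or {}).get("items", [])]
        v6 = [r.get("cidrIpv6") for r in (perm.get("ipv6Ranges") or {}).get("items", [])]
        open_ranges = [cidr for cidr in v4 + v6 if cidr in OPEN_CIDRS]
        if open_ranges:
            findings.append(
                {
                    "group_id": params.get("groupId"),
                    "protocol": perm.get("ipProtocol"),
                    "from_port": perm.get("fromPort"),
                    "to_port": perm.get("toPort"),
                    "open_cidrs": open_ranges,
                }
            )
    return findings
```

A rule like this returns an empty list for events it does not care about, so the worker can run every rule against every event and alert only when something comes back.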

Sample DIAL Alerts

There are a couple of reasons why we chose to build DIAL instead of using open source or commercial solutions. We could have combined a few tools to achieve the goal, but the result would not have been an elegant solution: it would have brought a lot of complexity and operational overhead, with limited scope for enhancements. In particular, we wanted a solution that would:

  • Have an event based approach. All the existing solutions that we reviewed were either based on post actions of an alert or would process logs themselves.
  • Respond to misconfiguration events in an automated fashion.
  • Have a low infrastructure footprint and detection capabilities as close to real time as possible.
  • Apply a different response/severity based on the AWS account in which the alert is triggered.
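
The last requirement, account-dependent severity, can be pictured as a default severity per alert type plus per-account overrides. The sketch below is only an illustration under that assumption: the alert type names, the severity labels and the account IDs are all hypothetical placeholders, not DIAL's real configuration.

```python
from __future__ import annotations

# Hypothetical default severity for each alert type.
DEFAULT_SEVERITY = {
    "rds_public_instance": "high",
    "sg_open_ingress": "medium",
}

# Hypothetical per-account overrides; the account IDs are placeholders.
ACCOUNT_OVERRIDES = {
    "111111111111": {"sg_open_ingress": "critical"},  # e.g. a production account
    "222222222222": {"rds_public_instance": "low"},   # e.g. a sandbox account
}


def resolve_severity(alert_type: str, account_id: str) -> str:
    """Pick the account-specific severity if one exists, else the default."""
    overrides = ACCOUNT_OVERRIDES.get(account_id, {})
    return overrides.get(alert_type, DEFAULT_SEVERITY.get(alert_type, "medium"))
```

With a lookup like this, the same open security group could page someone when it appears in production and only log a low-priority alert in a sandbox.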

breakdown of DIAL: the architecture

DIAL is composed of AWS services including Lambda, API Gateway and EventBridge, deployed in a master-worker architecture, and is generally meant to be used in an AWS Organization. The architecture can be divided into two parts: the worker and the master. The worker nodes process events and raise alerts through SNS/Slack based on detection rules that are either pre-defined or user-created, while the master node aggregates all the alerts sent by the worker nodes and forwards them for further analysis.

DIAL Architecture

The workers use AWS EventBridge to receive events from CloudTrail, which contains the audit logs of all actions taken within our environment. EventBridge filters the events before they reach AWS Lambda, so only the events of interest trigger an invocation. Once the Lambda is invoked, it pulls metadata from the event, such as which principal performed the action, from where and at what time. It then enriches this metadata with geolocation and reputation data to add context to the alert, and takes appropriate action depending on the rules that have been configured.
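
As a rough illustration of this flow, the sketch below shows an EventBridge event pattern that forwards only one CloudTrail API call, followed by a helper that pulls the basic metadata out of the event a worker Lambda would receive. The pattern and the helper are our own example rather than DIAL's actual rules, and the geolocation and reputation lookups are left out.

```python
from __future__ import annotations

from typing import Any

# Example EventBridge event pattern: only CloudTrail records for
# rds:CreateDBInstance are forwarded to the worker Lambda.
RDS_CREATE_PATTERN = {
    "source": ["aws.rds"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["rds.amazonaws.com"],
        "eventName": ["CreateDBInstance"],
    },
}


def extract_metadata(event: dict[str, Any]) -> dict[str, Any]:
    """Collect who did what, from where and when, from an EventBridge event.

    The CloudTrail record sits under the `detail` key of the event.
    Output key names are illustrative.
    """
    detail = event.get("detail") or {}
    identity = detail.get("userIdentity") or {}
    return {
        "account_id": event.get("account"),
        "region": detail.get("awsRegion"),
        "event_name": detail.get("eventName"),
        "event_time": detail.get("eventTime"),
        "principal_arn": identity.get("arn"),
        "source_ip": detail.get("sourceIPAddress"),
        "user_agent": detail.get("userAgent"),
    }
```

The source IP extracted here is the natural input for the GeoIP and reputation enrichment mentioned above.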

Once the event has been processed, its details and the additional metadata are sent to the master node, which writes them to the database and forwards them to the incident management or monitoring system. The workers invoke the master by sending an HTTP request to the API Gateway, and the master acts as an aggregator of all actions taken by the worker nodes in their corresponding AWS accounts.
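
A worker-to-master call of this kind could look roughly like the sketch below. It assumes the alert is posted as JSON and that a pre-shared key travels in a request header; the header name `x-dial-key` and both function names are hypothetical, chosen only for this example.

```python
from __future__ import annotations

import hmac
import json
import urllib.request

# Hypothetical header used to carry the pre-shared key.
API_KEY_HEADER = "x-dial-key"


def build_master_request(master_url: str, pre_shared_key: str, alert: dict) -> urllib.request.Request:
    """Worker side: build (but do not send) the POST request for the master."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        master_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", API_KEY_HEADER: pre_shared_key},
    )


def is_authorized(headers: dict[str, str], expected_key: str) -> bool:
    """Master side: compare the supplied key with the expected one in constant time."""
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(API_KEY_HEADER, "")
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

A worker would send the request with `urllib.request.urlopen(request, timeout=5)`. Using `hmac.compare_digest` on the master side avoids leaking information about the key through timing differences.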

In an AWS Organization, the master can be deployed anywhere within the organization, while the workers need to be deployed in every AWS account and in every region used within that account. When a new AWS account is created, only the worker node needs to be deployed there, with no configuration changes needed in the master.

This helps us scale, as we can keep adding new AWS accounts like Lego building blocks without changing anything else. In our deployment of DIAL, the average detection time is less than 4 seconds and the maximum is 10 seconds.

lifecycle of DIAL alert

Unlike other aspects of IT, security is typically never a finished product, but rather a continuous process. The results of each phase feed into the next phase of the lifecycle, providing continuous monitoring and improvement of security. The alert presented here follows the basic steps of Identify, Assess, Protect and Monitor.

DIAL follows a similar cycle by constantly monitoring all the actions performed within the AWS environment and assessing them against the configured detection rules. DIAL can forward and log all the alerts to a centralized dashboard, which helps us keep track of every alert it raises. This makes it a single-stop solution for our security team to get visibility across all the AWS accounts or to take automated remediation actions based on the defined policies. All of this happens while DIAL keeps monitoring newly generated events.

  • Any action taken within an AWS environment, whether from the AWS CLI, the AWS Console or an AWS SDK, is captured in AWS CloudTrail.

  • Let’s say a developer creates a database instance using an AWS CLI command similar to the snippet below.
    aws rds create-db-instance --db-instance-identifier test-mysql-instance --db-instance-class db.t3.micro --engine mysql --master-username admin --master-user-password '<a-strong-password>' --allocated-storage 20 --publicly-accessible
  • As soon as the event is generated in CloudTrail, it is read by AWS EventBridge, which filters it depending on whether it is an event of interest to us. If it is, the worker is invoked; if not, the event is dropped.

  • The worker node reads metadata such as the IP address for this event and pulls GeoIP and reputation data based on it. The detection rules, which are either pre-configured or written by the user, are then run to detect misconfigurations.

  • If any error occurs while processing an event, the event is sent to an S3 bucket so that the reason for the failed invocation can be analyzed later.

  • Once the event is processed, it triggers an alert called “Database being public”. The alert is sent to us as a Slack message via webhooks, and a notification is sent to the master, which writes it to the database and forwards it to your SIEM or incident management solution.
Slack Alert for internet facing RDS instance being created
  • All requests sent to the API Gateway need to be authenticated with a configurable pre-shared key. The API Gateway does not respond to a request until it is authenticated.

  • This is a sample alert that was parsed and submitted as a ticket to TheHive. All the major metadata pertaining to the event has been parsed, so the alert helps an analyst answer the basic questions about what happened.
Case in TheHive of internet facing RDS MySQL instance being created
  • This ends the lifecycle of an alert. Once an analyst has been notified via Slack or the incident management platform, it is up to the analyst to take appropriate remediation action depending on the business needs and the severity of the issue.
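
The detection step of this lifecycle can be sketched in a few lines. The example below checks a `CreateDBInstance` record for the publicly accessible flag and turns the finding into a Slack incoming-webhook payload. It is an illustration, not DIAL's real rule: the function names are ours, and the `requestParameters` field names (`publiclyAccessible`, `dBInstanceIdentifier`) are assumed from how CloudTrail typically records this call.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def check_public_rds(detail: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return an alert if the CloudTrail record creates a publicly accessible DB."""
    if detail.get("eventName") != "CreateDBInstance":
        return None
    params = detail.get("requestParameters") or {}
    if params.get("publiclyAccessible") is not True:
        return None
    return {
        "title": "Database being public",
        "resource": params.get("dBInstanceIdentifier"),
        "engine": params.get("engine"),
        "principal": (detail.get("userIdentity") or {}).get("arn"),
        "source_ip": detail.get("sourceIPAddress"),
        "region": detail.get("awsRegion"),
    }


def to_slack_payload(alert: dict[str, Any]) -> str:
    """Render the alert as the JSON body of a Slack incoming-webhook message."""
    text = (
        f"{alert['title']}: {alert['resource']} ({alert['engine']}) "
        f"in {alert['region']} by {alert['principal']} from {alert['source_ip']}"
    )
    return json.dumps({"text": text})
```

The same alert dictionary could be posted to the master for storage and forwarding, so Slack and the incident management platform both see identical details.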

road ahead for DIAL

While we are actively working on open sourcing DIAL for everyone to look at and experiment with, we are also looking at other ways to improve it. Some of the things that are already planned are –

  • Adding additional detection for new attack vectors/misconfigurations for the existing services.
  • Extending DIAL to cover more AWS Services.
  • Fine-tuning the DIAL architecture to have a lower infrastructure footprint.
  • Automated remediation of alerts based on the user preference.
  • Adding other integrations, like JIRA, for incident management.

Thanks to Saransh Rana, Divyanshu Mehta and Harsh Varagiya for building this and keeping CRED secure.



