Organizations continue to increase their footprint in the cloud. Yet, applying governance and enforcing policies across multiple cloud environments at scale is a challenging endeavor. Organizations typically have governance policies written in English, but translating them into enforceable code has historically required some custom workarounds. In the same vein as infrastructure-as-code (IaC), DevOps engineers now seek a similar GitOps approach to implement shared policies across cloud-based workloads.
Cloud Custodian is one such toolset that can manage and enforce cloud policies in a standardized format. Using Cloud Custodian, a Cloud Center of Excellence has the building blocks to create policies for security governance, development guardrails and cloud cost optimizations. The free, open source project, which has been steadily evolving for years, recently gained incubation status with the Cloud Native Computing Foundation (CNCF) and to date boasts over 350 contributors on GitHub.
I met with Kapil Thangavelu, Cloud Custodian creator and maintainer and CTO at Stacklet, to learn more about Cloud Custodian and what features we should anticipate from the project in the near future. Below, we’ll provide a comprehensive introduction to Cloud Custodian, exploring its history, how to use it and its core benefits. We’ll also analyze a few sample policies to showcase how Cloud Custodian functions in practice.
What is Cloud Custodian?
Cloud Custodian got its start as an internal tool developed within Capital One toward the end of 2015, explains Thangavelu. At the time, engineers were grappling with the newfound realities of managing the cloud at scale and realized that having one-off, ad-hoc policies across teams was causing friction. “It’s really hard to manage and enforce policies at scale,” says Thangavelu. So, engineers toyed with creating a more GitOps-style approach to governance and security policies that could be run as an internal SaaS solution.
Thus, Cloud Custodian was born. The tool provides a YAML-based DSL for defining rules to manage cloud infrastructure. It can offer real-time enforcement and produce event-based responses. For example, you can create a policy that requires tags on resources or one that turns off instances during off-hours. Or, a policy could prevent a developer without the proper privileges from accidentally creating an open load balancer. Cloud Custodian supports AWS, Azure and GCP, and each policy targets a specific resource type, whether it’s EC2, ASG, Redshift, CosmosDB or a PubSub Topic.
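To make the filter-and-action idea concrete, here is a minimal conceptual sketch in plain Python. It is not Cloud Custodian's implementation or API: the resource dictionaries, the `missing_tag` filter and the `apply_policy` helper are hypothetical names chosen purely for illustration.

```python
# Conceptual sketch only: mimics the "filter, then act" shape of a policy.
# None of these names come from Cloud Custodian itself.

def missing_tag(tag_name):
    """Build a filter that selects resources lacking the given tag."""
    def _filter(resource):
        return tag_name not in resource.get("tags", {})
    return _filter


def apply_policy(resources, filters, action):
    """Run the action on every resource that passes all filters."""
    matched = [r for r in resources if all(f(r) for f in filters)]
    return [action(r) for r in matched]


# Hypothetical inventory of instances.
instances = [
    {"id": "i-001", "tags": {"Owner": "team-a"}},
    {"id": "i-002", "tags": {}},
    {"id": "i-003"},
]

# The "action" here only reports the instance ID; it touches no real infrastructure.
flagged = apply_policy(instances, [missing_tag("Owner")], lambda r: r["id"])
print(flagged)
```

In a real policy, the filter and action would be declared in YAML against a resource type rather than written by hand.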
Thangavelu describes Cloud Custodian as a bunch of Lego bricks that developers can pick up to build guardrails in arbitrary ways. Instead of shipping with canned rulesets, “Cloud Custodian makes customization itself a first-class citizen,” he says. Capital One open sourced the project and contributed Cloud Custodian to CNCF in August 2020.
Example Policies
Cloud Custodian comes with a library of actions and filters. Let’s take a look at a few example policies across the three main cloud providers. But first, to install Cloud Custodian from PyPI, you can run the following commands:
$ python3 -m venv custodian
$ source custodian/bin/activate
(custodian) $ pip install c7n
Let’s consider an example policy that would improve security hygiene on AWS. Below is one such policy that creates a CloudWatch Event to be triggered anytime a user logs in from an invalid IP address. This could be used to ping a security administrator to investigate. With this rule, it’s also possible to turn on an auto-remediation function.
policies:
  - name: invalid-ip-address-login-detected
    resource: account
    description: |
      Notifies on invalid external IP console logins
    mode:
      type: cloudtrail
      events:
        - ConsoleLogin
    filters:
      - not:
          - type: event
            key: 'detail.sourceIPAddress'
            value: |
              '^((158\.103\.|142\.179\.|187\.39\.)([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
              \.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))|(12\.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
              \.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))$'
            op: regex
    actions:
      - type: notify
        template: default.html
        priority_header: 1
        subject: "Login From Invalid IP Detected - [custodian {{ account }} - {{ region }}]"
        violation_desc: "A User Has Logged In Externally From A Invalid IP Address Outside The Company's Range:"
        action_desc: |
          "Please investigate and revoke the invalid session along
          with any other restrictive actions if appropriate"
        to:
          - CloudAdmins@Company.com
          - SecurityTeam@Company.com
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/12345678900/cloud-custodian-mailer
          region: us-east-1
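The regular expression in this policy is dense, so it can help to try it against a few addresses. The sketch below rebuilds the same pattern in Python, assuming the three wrapped lines are joined into a single expression without the surrounding quotes or line breaks, and assuming it is matched from the start of the string. The `is_allowed` helper and the sample addresses are illustrative only.

```python
import re

# Assumption: the policy's wrapped pattern is joined into one expression,
# with the surrounding quotes and line breaks removed.
OCTET = r"([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])"
ALLOWED = (
    r"^((158\.103\.|142\.179\.|187\.39\.)" + OCTET + r"\." + OCTET + r")"
    + r"|(12\." + OCTET + r"\." + OCTET + r"\." + OCTET + r")$"
)


def is_allowed(source_ip):
    """Return True when the address matches one of the listed ranges."""
    # re.match anchors at the start of the string, so both alternatives
    # are tested from the first character.
    return re.match(ALLOWED, source_ip) is not None


for ip in ("158.103.4.7", "12.34.56.78", "8.8.8.8"):
    # The policy wraps the regex in `not`, so it fires on non-matching IPs.
    print(ip, "ok" if is_allowed(ip) else "would trigger the notification")
```

Because the filter sits under `not`, only logins whose source address fails this check reach the `notify` action.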
Next, let’s consider a policy that would add some degree of governance for developers working on Azure. One simple yet popular policy involves tagging individual resources with ownership details. The policy below tags all resource groups with the creator’s email address:
policies:
  - name: azure-auto-tag-creator-resource-groups
    resource: azure.resourcegroup
    description: |
      Tag all existing resource groups with the 'CreatorEmail' tag; looking up to 10 days prior.
    actions:
      - type: auto-tag-user
        tag: CreatorEmail
        days: 10
Finally, let’s consider a policy that could aid in cost optimization on GCP. Below is a sample policy that enforces a minimum CPU utilization target for autoscalers on GCP. Implementing such a policy could help avoid underutilization and curb rising cloud costs.
vars:
  min-utilization-target: &min-utilization-target 0.8
policies:
  - name: gcp-autoscalers-enforced
    resource: gcp.autoscaler
    mode:
      type: gcp-audit
      methods:
        - v1.compute.autoscalers.insert
    filters:
      - type: value
        key: autoscalingPolicy.cpuUtilization.utilizationTarget
        op: less-than
        value: *min-utilization-target
    actions:
      - type: set
        cpuUtilization:
          utilizationTarget: 0.8
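To see which autoscalers this filter would select, the comparison can be sketched in plain Python. This is a conceptual illustration, not Cloud Custodian code: `lookup` is a simplified stand-in for resolving the dotted key, and the autoscaler dictionaries are made-up data. The constant mirrors the `0.8` value that the YAML anchor `&min-utilization-target` defines and the `*min-utilization-target` alias reuses.

```python
# Conceptual sketch: `lookup` is a simplified stand-in for resolving the
# policy's dotted key, and the autoscaler dicts below are made-up data.
MIN_UTILIZATION_TARGET = 0.8
KEY = "autoscalingPolicy.cpuUtilization.utilizationTarget"


def lookup(resource, dotted_key):
    """Walk nested dicts along a dotted key; return None if a part is missing."""
    value = resource
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def below_target(resource):
    """Mirror the `less-than` comparison against the shared threshold."""
    target = lookup(resource, KEY)
    return target is not None and target < MIN_UTILIZATION_TARGET


autoscalers = [
    {"name": "web", "autoscalingPolicy": {"cpuUtilization": {"utilizationTarget": 0.6}}},
    {"name": "batch", "autoscalingPolicy": {"cpuUtilization": {"utilizationTarget": 0.9}}},
]

to_fix = [a["name"] for a in autoscalers if below_target(a)]
print(to_fix)
```

Only the autoscaler whose target falls below the threshold is selected, which is the set the policy's `set` action would then raise to 0.8.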
Benefits of Cloud Custodian
By automating away much of the tedium of policy management, Cloud Custodian could reduce risk and accidents through more streamlined cloud governance. “It solves the natural problems when infrastructure is in everyone’s head,” says Thangavelu. By consolidating ad-hoc scripts and unifying policies across an organization, you could institute new rules immediately, without manually reminding every member of the organization, a process that could take years.
Readers familiar with Open Policy Agent (OPA) may notice some overlap in objectives, as both are engines for enacting cloud-native policies. Compared to OPA, Cloud Custodian has some developer experience perks. For one, you don’t have to use Rego: policies are written in YAML, a familiar configuration language for DevOps engineers. Cloud Custodian also provides abstractions over each cloud provider’s event runtimes. Furthermore, unlike with OPA, you don’t need to bind the engine to a particular problem domain, says Thangavelu, as Cloud Custodian is specifically bounded to cloud governance and management.
Cloud Custodian is also battle-tested. Many big names are using it in production, including Capital One, HBO Max, Intuit Inc, JP Morgan Chase & Co, Siemens and Zapier. Among its adopters, the tool tends to be used by a Cloud Center of Excellence, by security teams to conduct reporting or surface real-time incidents, or by the CFO to direct cost optimizations.
Future of Cloud Custodian
Cloud Custodian maintainers describe the project as a constant firehose. Since the project tracks changes in all cloud provider resources in real time, it’s quite a dynamic toolset. In addition to monthly releases, users can expect many further updates on the horizon, including support for Tencent Cloud as well as additional Kubernetes support. More work is also being done around shift-left capabilities to meet developers where they are, says Thangavelu.
For more information, check out the Quickstart Guide to Cloud Custodian, or view the project roadmap.