Before answering the question let me give you a bit of context:
When we decided to move our platform to the cloud, the goal was to make product development faster and easier and to reduce the effort of running custom infrastructure. These aspects were (and still are) especially important:
- Benefit from external innovation that is happening in the cloud space
- Make teams more autonomous and less dependent on shared infrastructure
- Foster a "You Build It – You Run It" (YBIYRI) culture
- Provide cost transparency and with this induce cost awareness
We talked to other companies that had undertaken similar migrations, companies roughly comparable to us in size and culture, and got a lot of valuable first-hand experience from these exchanges. One thing we learned was that using a single AWS account for production workloads did not work well for them at scale. Many moved to a multi-account setup, while Zalando started with this approach right away.
When we evaluated which approach to take for ImmobilienScout24, the criteria we looked at were:
- Blast radius of incidents such as security breaches, exhausted API and resource limits, and human error
- Auditability of changes
- Clarity about ownership and use case at resource level
- Standardized tooling and workflows for deployment, monitoring, alarming, …
- Freedom to diverge from standards when necessary
- Cost transparency
We checked these criteria against three different scenarios:
- Single account for the whole company
- Shared development, testing and production accounts for all services and teams (company staging)
- At least one account per team and/or product, freedom to create more accounts for finer separation (multi-account)
The result of our evaluation was that the multi-account scenario has clear advantages in limiting the blast radius, in cost transparency, and in supporting YBIYRI.
As newer services often do not support tagging – DynamoDB and CloudFront only got tagging recently – separation by account is the only way of identifying the responsible business unit. While we had some success introducing a DevOps culture on our data center platform, we felt that a clear separation of services would further improve YBIYRI. Each account is owned by a clearly defined group of people, and only they can be responsible for changes.
The multi-account scenario was lacking only in "standardized tooling": it would take more effort to distribute and run commonly used services, and many smaller installations would use resources less efficiently than one large installation. At the same time, the quality of these services would likely be better, because their setup and maintenance would have to be user-friendly and automated.
Furthermore, we decided that we would not provide an inter-account network of private communication channels such as VPN. Instead, all communication should happen over the public internet using secure channels such as HTTPS combined with OAuth2.
This enforces the use of explicitly defined APIs and clear relationships between producer and consumer. In the data center we have lots of unauthenticated backend connections between services just because "it works." Clients often rely on services in ways that were never intended. During our migration towards a microservice architecture, these unclear dependencies proved to be fragile and error-prone. For more details, please read Schlomo Schapiro's blog post on this topic.
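As a rough sketch of this pattern: a consumer first fetches a short-lived token from the OAuth2 provider (a standard client-credentials flow), then calls the producer's API over HTTPS with that token. All URLs and credentials below are hypothetical placeholders, not our actual endpoints:

```python
# Sketch of service-to-service communication over the public internet
# using HTTPS plus an OAuth2 client-credentials flow (stdlib only).
# All URLs and credentials are hypothetical placeholders.
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth.example.com/token"    # hypothetical OAuth2 provider
API_URL = "https://listings.example.com/api/v1"  # hypothetical producer API

def build_token_request(client_id: str, client_secret: str) -> bytes:
    """Form-encoded body for a standard OAuth2 client-credentials grant."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()

def call_api(client_id: str, client_secret: str) -> dict:
    # 1. Fetch a short-lived bearer token from the OAuth2 provider.
    with urllib.request.urlopen(TOKEN_URL,
                                data=build_token_request(client_id, client_secret)) as resp:
        access_token = json.load(resp)["access_token"]
    # 2. Call the producer's explicitly defined API with the bearer token.
    req = urllib.request.Request(API_URL + "/listings",
                                 headers={"Authorization": "Bearer " + access_token})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The point is that both the identity of the caller and the API contract are explicit; there is no implicit trust via network location.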
Since that decision we have integrated more than 40 accounts. We also built tools and services to support the strategy:
- Lightweight OAuth2 provider: authentication and authorization of machines and employees for our own APIs
- AFP: authentication of machines and employees to securely use AWS APIs in multiple accounts by using temporary credentials
- Continuous delivery for software and infrastructure
- Distributed logging and monitoring stack that is used in self-service
- Tools for reproducible account setup
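AFP itself is an internal service, but the underlying mechanism it builds on, handing out only temporary credentials via AWS STS, can be sketched as follows. The role name, account ID, and session duration are hypothetical examples:

```python
# Sketch of what a tool like AFP does under the hood: exchange an
# authenticated identity for *temporary* AWS credentials via STS
# AssumeRole, so no long-lived keys are ever handed out.
# Role names and account IDs below are hypothetical.

def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of the role to assume in a target account."""
    return "arn:aws:iam::%s:role/%s" % (account_id, role_name)

def account_from_arn(arn: str) -> str:
    """Extract the 12-digit account ID from a role ARN."""
    return arn.split(":")[4]

def temporary_credentials(account_id: str, role_name: str, session_name: str):
    # boto3 is imported lazily so the pure helpers above stay dependency-free.
    import boto3
    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName=session_name,
        DurationSeconds=3600,  # credentials expire automatically after one hour
    )
    # Contains AccessKeyId, SecretAccessKey, SessionToken, Expiration.
    return resp["Credentials"]
```

Because every credential expires on its own, a leaked key is only useful for a short window, which is what reduces the threat of "losing" credentials to a minimum.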
After two years, the decision for a multi-account strategy has proved to be a huge success.
Cost transparency is indeed easily achieved: business units have taken over responsibility for the costs their accounts generate.
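Because each account maps to exactly one business unit, grouping the consolidated bill by linked account yields cost per unit directly, with no tagging discipline required. A minimal sketch using the AWS Cost Explorer GetCostAndUsage API (the date range is an arbitrary example):

```python
# Sketch of per-account cost reporting: grouping the AWS bill by linked
# account gives the cost per business unit directly, since in our model
# every account belongs to exactly one business unit.

def costs_per_account(response: dict) -> dict:
    """Flatten a GetCostAndUsage response into {account_id: total_usd}."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            account = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[account] = totals.get(account, 0.0) + amount
    return totals

def fetch_monthly_costs():
    import boto3  # imported lazily; the call requires valid AWS credentials
    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": "2017-01-01", "End": "2017-02-01"},  # example range
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
    )
```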
Reduced Vendor Lock-in
Communication through public APIs with strong security is widely accepted among engineers and easy to set up. It also protects us from total vendor lock-in: each product can not only be built in a separate account but even at another cloud provider. In fact, we already integrate services running in Azure with services running in the data center using the OAuth service. So each product can deeply integrate with the cloud provider offering the best technology. Based on this experience, we believe we have a valid exit strategy should AWS stop being the best solution for us. See another of Schlomo's blog posts on this topic here: http://blog.schlomo.schapiro.org/2015/08/cloud-exit-strategie.html
While AFP makes working with multiple accounts easy and straightforward, like OAuth2 it also enables further security hardening. Because it issues only temporary IAM credentials, the threat of "losing" credentials is reduced to a minimum. It also enforces a model of declarative security that simplifies configuration and compliance.
Autonomy and Responsibility
Teams have full autonomy in using AWS. The self-service character of our infrastructure components means that no further central administration of resources is required once an account has been created. This makes it hard to enforce standardized tooling and workflows, which are important to the company so that engineers can switch teams and teams can contribute to each other's products. Our solution to this dilemma is to make the tooling so good and easy to use for standard tasks that there is simply no need to invent your own. Everybody is encouraged to support and contribute to the common toolset. But keeping this community alive requires continuous effort.
Account owners accept the responsibility for security and compliance. The cloud team supports by providing a platform that is secure by design and consulting when evaluating risks and countermeasures.
The actual sizing and assignment of accounts is not strictly regulated. We started by creating accounts for teams but quickly found that assigning accounts per product or group of related services makes more sense. Teams change members and missions, but products are stable until they are discontinued. Products can also be clearly related to business units. Therefore we introduced a rule of thumb: 1 business unit == 1 + n accounts. This allows us to clearly identify each account with one business unit and gives users the freedom to organize resources at will.
Next Step: Optimize!
One of the biggest cost factors in AWS is EC2. We already use Docker as a developer-friendly packaging format but run each container on a single EC2 instance. Running these containers on clusters will provide higher utilization and lower costs, but only if the clusters reach a reasonable size. Therefore accounts are more likely to grow than to shrink. Another idea is to introduce shared accounts for compute clusters only. Stay tuned!
So, as you have probably anticipated: There is no definitive answer to the original question.