DevOps Chats: Continuous Delivery at Airbnb

Spinnaker Summit 2019 Preview: Airbnb is rapidly moving from a monolithic Ruby on Rails application to a distributed SOA running on Kubernetes. The new architecture uses self-service codified pipelines and easy webhook integrations to scale adoption and collaboration across the company. Even though continuous integration isn’t new to Airbnb, every team now needs to be able to scale CI across hundreds of containerized services in AWS EC2.

Software Engineer Brian Wolfe co-led the decision to move to Spinnaker and build in more automation. One year into the project, Airbnb has 40 services in production with many more to follow.

This episode of DevOps Chats features a preview of Brian’s talk, “Scaling a Migration to Continuous Delivery (Airbnb)”. Brian’s talk is on Saturday, November 16, at 11:00 a.m., at Spinnaker Summit 2019 in San Diego.

As usual, the streaming audio is immediately below, followed by the transcript of our conversation.

Transcript

Mitch Ashley: Hi everyone. This is Mitch Ashley with staging-devopsy.kinsta.cloud, and you’re listening to another DevOps Chat podcast. Today I’m joined by Brian Wolfe, who is a software engineer at Airbnb. Our topic is a talk that he’s gonna be delivering at Spinnaker Summit 2019 in San Diego. The topic of that talk is scaling a migration to continuous delivery. Continuous delivery is a hot topic right now. That’ll be happening on Saturday, November 16th at 11:00 a.m.

Brian, welcome to DevOps Chat.

Brian Wolfe: Thanks so much, Mitch. It’s really an honor to be talking with you today, so.

Ashley: Well I’m more honored to have you on. I appreciate you taking the time. Tell us a little bit about you. Introduce yourself, tell us what you do and a little bit about–I think we know Airbnb–but tell us about what part in Airbnb that you work in.

Wolfe: Cool. Yeah, so I assume most people who listen to the podcast know what Airbnb is, but we’re a worldwide community of hosts and guests. We bring people together to have really local experiences. I’m gonna skip the rest of that little spiel.

Ashley: [Laughs]

Wolfe: But I’ve been at Airbnb for about three-and-a-half years, and most of that time I’ve been kind of on the operational tooling side, and so a lot of that was observability work, so looking at your metrics and traces and logs, and then performance work, so like figure out how to make Airbnb–monitor Airbnb performance, make it faster and that sort of thing. But for the last year I’ve been the tech lead on our continuous delivery team, which is kind of an ambitiously named team. We’re not there yet. But the goal of this team was to really up-level how we deliver software at Airbnb.

And so my experience is mostly looking at how we’re making the transition from a monolith which was written in Ruby on Rails to an SOA over the last two years and kind of continuing on into 2020. And part of that is how do we scale how we deliver our software. It used to be a more core team who understood how to scale our monolith. And now every team kind of needs to know how to deliver that software, how to scale it and how to keep it working.

Ashley: Now, is the move to an SOA, and so we mean service-oriented architecture, correct? Is that what you’re referring to?

Wolfe: Yeah.

Ashley: Is the move to that kind of architecture what prompted you–or rather Airbnb–to invest so heavily in this automated continuous delivery? Or was that kind of happening in parallel? Give us a little bit of an idea of the context of how this kind of came together.

Wolfe: Yeah, so it definitely happened because of the move to SOA. And if you look at what our processes were before, it was very consolidated. Most people were deploying the same code base. And so you kind of could have practices that worked for that one code base. But as we moved to SOA we now have hundreds of services that are being deployed, and you need to be able to scale those best practices across all of those services. And so that requires that you add these automation pieces so that humans can make fewer mistakes.

Ashley: That makes a lot of sense. I imagine you have a kind of continuous integration happening at the front end of this, right, leading into your continuous delivery?

Wolfe: Absolutely. So we’ve had continuous integration for a long time. And so all of our unit tests and that sort of thing and all of our builds are happening in continuous integration. That’s been the case for at least three or four years. And we have fairly mature tooling around that. And we actually did a migration to our new platform there fairly recently that containerizes all of our builds and all of our CI so it kind of runs in a more uniform manner.

Ashley: Mm-hmm. Is that moving to Kubernetes? Is that the change you’re referencing or something else?

Wolfe: So even stuff that’s not running in Kubernetes, so as you can imagine Airbnb is about 10 years old, and so we have a lot of technical history. And so some of that stuff is gonna be running on bare EC2 instances, and some of that is running inside Kubernetes at this point. And so along with the move to SOA we’re also migrating to Kubernetes. And that migration is progressing rapidly. But there’s still gonna be a lot of stuff that has to be built on EC2 and running on EC2. But the builds themselves can happen inside containers.

Ashley: I know we don’t think of Airbnb as something that’s built 10 years ago and evolved over time, right? We think of it as [Crosstalk] – [Laughs].

Wolfe: Yeah, yeah. It has a surprising amount of history for such a ____ company, you know?

Ashley: Well it’s an interesting, you know, it’s an interesting situation that you are involved in, because you’re not talking about a monolith application that was built 30 years ago or 20 years ago. It’s something fairly recent, Ruby on Rails, you know? So it’s still a contemporary programming language, something that everybody is familiar with. But even you have to go through your own kind of architectural and now continuous delivery evolution of that technology.

Wolfe: Absolutely. I think it’s kind of surprising that even at a company that’s ten years old you have what we consider legacy parts of our stack. I think that happens quickly with a field that evolves as rapidly as ours. And if you look at how continuous delivery has evolved in the last ten years, it’s been a pretty big shift.

One of the big things we’ve seen is, you know, we had a really good solution for deployment if you look in 2013-2014 time frame with our system and deploy board. But if you look at it now, it looks a lot like a really good CI system, like a really good continuous integration system. And the parts around actually having an organized pipeline that automates the deployment out, we don’t really have that in house yet. And so that’s really why we’ve been adopting Spinnaker as a solution.

Ashley: Interesting. Well tell us a little bit about were you involved in the choice of bringing in Spinnaker? Was that happening while or before you joined Airbnb?

Wolfe: Mm-hmm. So the decision for adopting Spinnaker happened last year. And so it was a discussion between myself, my manager and then Jing Jing and Jens, who are on the team. And it was largely about do we want to keep investing in deploy board, which was our existing solution. It was deeply integrated with all of Airbnb’s stack. It had a lot of stuff in it for how we deploy MonoRail, which is our big Ruby on Rails application.

Now, do we want to build automation into that, or do we want to bring in something new? And so we did a big research project to look at what are all the available options out there, and what would those give us, and how could we make those work at Airbnb. And at the end of that we decided that we should not invest more in our internal solution and should instead bring in Spinnaker and customize that to kind of encode Airbnb opinions into that via extensions and then use that as our platform for deployment moving forward.

So we’re still early days in that. We’re about one year in. We have about 40 services onboarded right now. These will be all production services, some of the really critical services out of Airbnb. But considering we have thousands of services at Airbnb, there’s some work to do to kind of make it the standard across.

Ashley: Mm-hmm. Excellent. Well, interesting. I know in the description of your talk you say that you’re gonna discuss how self-service codified pipelines and easy webhook integrations help you scale adoption and collaboration across the company. Tell us some more about what that is.

Wolfe: Yeah, so one of the really hard parts about working at a company that’s growing as fast as Airbnb is it’s really hard to match the needs of everyone. And so you want to make everything self-service if you can so that, you know, I’m creating a standard pipeline where, you know, I deploy one thing, I do some A/B comparison, I roll it out to my next environment, I do another maybe canary analysis rollout, and so on.

So I can come up with standard components that people can use. But then you have stuff that people have built over time. And so our search team has regression detection that is really specific to them. And they think it provides really high signal if some functionality regression has occurred because you have things like search ranking and the actual layout of the results in the response package, and you need to make sure those things don’t change if you don’t intend them to change.

And so they built a service to actually do that regression detection, and they want to integrate that with their deploy pipeline. And they want that to just be something that automatically happens every time that they run a deploy. You deploy to staging, you call this service with some arguments, you wait for that regression test to pass, and then if it fails you present some custom UI that says, “Okay, this is what failed. This is why. And this is how you can go learn about it some more.” And you move on to the next stage or you fail the pipeline and figure out what’s going on.

What we wanted to do is make it really easy for teams to plug in services like this, ‘cause they do provide the most value. The way we’re doing that is by integrating, actually, with our interface definition language that we use internally. And so we actually extend Spinnaker to just provide a little webhook stage that people can reuse and call out to their services. But then they get a nice user interface on top of that. And so they can kind of provide just a base level of functionality within Spinnaker. And it’s kind of a minimal amount of work for us to onboard a new type of stage. Does that kind of make sense?
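
The webhook stage Wolfe describes can be sketched in a few lines of Python. This is a minimal illustration, not Airbnb’s extension or Spinnaker’s API: the names run_webhook_stage, run_pipeline, StageResult and FakeRegressionService, the status strings, and the polling limits are all assumptions made up for the example. It shows the flow from the conversation: deploy to staging, call a team-owned service with some arguments, wait for its verdict, then either move on to the next stage or fail the pipeline.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StageResult:
    name: str
    passed: bool
    detail: str = ""


def run_webhook_stage(
    name: str,
    start: Callable[[Dict[str, str]], str],
    poll: Callable[[str], str],
    args: Dict[str, str],
    max_polls: int = 10,
    interval_s: float = 0.0,
) -> StageResult:
    """Start a remote check, then poll until it reports a terminal status."""
    job_id = start(args)
    for _ in range(max_polls):
        status = poll(job_id)
        if status == "PASSED":
            return StageResult(name, True)
        if status == "FAILED":
            # A real stage would surface a link or custom UI here.
            return StageResult(name, False, f"see report for {job_id}")
        time.sleep(interval_s)
    return StageResult(name, False, "timed out waiting for a verdict")


def run_pipeline(stages: List[Callable[[], StageResult]]) -> List[StageResult]:
    """Run stages in order and stop at the first failure."""
    results: List[StageResult] = []
    for stage in stages:
        result = stage()
        results.append(result)
        if not result.passed:
            break
    return results


class FakeRegressionService:
    """Hypothetical stand-in for a team-owned regression service.

    Reports RUNNING on the first poll, then the configured verdict.
    """

    def __init__(self, verdict: str) -> None:
        self.verdict = verdict
        self.polls = 0

    def start(self, args: Dict[str, str]) -> str:
        return "job-1"

    def poll(self, job_id: str) -> str:
        self.polls += 1
        return "RUNNING" if self.polls < 2 else self.verdict


def build_pipeline(service: FakeRegressionService) -> List[Callable[[], StageResult]]:
    """Deploy to staging, run the regression webhook, then deploy to production."""
    return [
        lambda: StageResult("deploy-staging", True),
        lambda: run_webhook_stage(
            "regression-check", service.start, service.poll, {"env": "staging"}
        ),
        lambda: StageResult("deploy-production", True),
    ]


if __name__ == "__main__":
    for result in run_pipeline(build_pipeline(FakeRegressionService("FAILED"))):
        print(result)
```

Passing the start and poll callables in, rather than hard-coding an HTTP client, keeps the sketch runnable without a network. In practice those two functions would wrap whatever interface the team’s service exposes.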

Ashley: Yeah. I was just gonna ask you about that. Sometimes it can be a real challenge to introduce such a fundamental tool, like this is fundamental to your software development and delivery process and what kind of things that you can do to minimize the disruption, lower the barrier to entry for the teams. And a lot of it depends, too, on how much your organization does software process similarly, or is it very decentralized and everybody kind of does their own thing. It’s such an application if it’s been in a monolith kind of state. I imagine there’s a lot more similarities in how people work at Airbnb. But you’ve gotta still figure out how do you make this easy, right?

Wolfe: Yeah. So I think we’re lucky in the sense that right now there’s really one deployment tool that people use, and that’s deploy board. And now we’ve introduced a second deployment tool. And our strategy here is actually we make the experience when you’re first onboarding actually look a lot like deploy board.

And so Spinnaker lets you customize the user interface. And so we actually have a panel within Spinnaker that shows a view that actually looks really similar to the view that you get in our old tool. But then it provides all this power under the hood when you actually start your deploy. And so that’s been pretty valuable for us for onboarding because it doesn’t look that dissimilar, I would say a little rough around the edges, to be honest still.

It’s, you know, we spent a lot of time optimizing that flow for deploy board. People get a little bit confused sometimes. But just having that initial experience look similar has made onboarding a lot easier. When we first proposed using Spinnaker, people were like, “I can’t figure out what’s going on in the UI. I don’t know what I’m doing here. What are you guys thinking?” But just by providing that similar viewpoint people are like, “Oh, I know what to do. I’ll just click on this button, which looks the same as we saw before, and that should start my deploy doing the right thing.”

Ashley: You know, never underestimate the value of an easy-to-use user interface or process or whatever.

Wolfe: Especially just having that familiarity, I think, it’s–Jens on my team has been a continual advocate for really pushing on, like, make this easy to transition from our old tool to our new tool. And I would say that there are some usability gaps that we’ve had to come across. But they’re getting better, I would say.

Ashley: How far along in the rollout process of this are you? Are you in the still kind of beginning third of it, working with some of the early teams that are adopting? Are you now moving into broader adoption of it? Where are you kind of in that process, that maturation process?

Wolfe: Yeah, so we’re kind of in the–we’re calling it a beta stage. So in alpha stage we were really hitting some early adopters, people who were willing to take some risks. Now we’re onboarding kind of the big customers, the ones who are pretty demanding and making sure that this actually works for them. And so what we’re doing is we’re measuring how much do we prevent, and so how many rollbacks are we preventing, how many incidents are we preventing, and making sure that Spinnaker, as the platform we’re currently providing, is actually delivering business value.

We’re limiting our onboarding throughout 2019 to about 50 teams. And at the end of 2019 we’ll have a really good idea of what works and what doesn’t. And in 2020 we’re going to open up the adoption to the full company. Kind of at the same time we’re going through a lot of scaling exercises to make sure that we will actually be able to scale to the full company.

Ashley: And give us an idea of the 50 teams, what are usually the size of those teams? What do they range?

Wolfe: Yeah, so the smallest teams will be probably about six or seven developers deploying a service. The largest team that we have onboarded has about 45 developers. And then we’re aiming for one of our front-end services, which is more monolithic, and so that’ll have several hundred developers working on it. But then there’s kind of an operational expert team that manages what that deployment looks like.

Ashley: Mm-hmm. Well that’s a pretty good range in different sizes of teams. So I’m sure–and parts of the application–so I assume they bring different challenges with them. How do you measure success from your decision of both going down the path of implementing continuous delivery and Spinnaker? You talked about business value. You also talked about preventing how many rollbacks and incidents that also happened. But how do you demonstrate to the people that said we’re gonna spend this money having you go implement this, and tell them that it was worth it?

Wolfe: Totally. That is a great question. And so one of the big motivators for us was actually for regression prevention. There are kind of two aspects for CD. One of them is increased productivity for developers. And then another one is having more rigorous processes for rollout. And we were definitely focused on the rigorous processes for rollout and having automated canary analysis by default for every service. And so that’s kind of the route that we’re pursuing, really, in 2019 in terms of proving the business value.
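
To make “automated canary analysis” concrete, here is a minimal Python sketch of the kind of check such a stage might run. It is an assumption-laden illustration, not the analysis Airbnb or Spinnaker performs: the function names, the choice of error rate as the only metric, the 10 percent tolerance, and the minimum traffic threshold are all hypothetical. The idea is to compare the canary’s error rate against the baseline’s and only promote the release if the canary is not meaningfully worse.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def canary_passes(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_relative_increase: float = 0.10,  # hypothetical tolerance
    min_requests: int = 100,              # hypothetical traffic floor
) -> bool:
    """Return True if the canary's error rate is within tolerance of baseline."""
    if canary_requests < min_requests:
        # Too little traffic to judge, so refuse to promote.
        return False
    baseline = error_rate(baseline_errors, baseline_requests)
    canary = error_rate(canary_errors, canary_requests)
    return canary <= baseline * (1 + max_relative_increase)


if __name__ == "__main__":
    print(canary_passes(10, 1000, 10, 1000))
    print(canary_passes(10, 1000, 30, 1000))
```

A production system would typically look at many metrics and use a statistical comparison rather than a single fixed threshold. The sketch only shows where an automated gate replaces a human judgment call.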

Ashley: Mm-hmm.

Wolfe: In 2020 we’re gonna have more metrics and more ability to say something around how productive we have made our engineers, or at least that we haven’t made the story harder. But right now it’s proven a little bit harder to quantify that. So we get people saying, “Yeah, we love it. We love that we can kind of ignore the deploy process now.” But aside from having a CSAT style score where you’re just saying, you know, rate how productive you are on a scale of one to ten, it’s been pretty hard to quantify the productivity gains we’ve seen with Spinnaker.

Ashley: Well, that is often a challenge, right, for us in the software world in implementing technology, and as you’ve talked about some of the regression testing issues, maybe even incidents in the old process compared to the new, hopefully there are some metrics you can also use from the old methods to the new and to give you some places to start, at least, to benchmark those things.

Wolfe: Certainly for regression detection we have. We have strong evidence that automating this pipeline–and so what I really think of Spinnaker as is a way to automate your runbook. Just the fact that it is automated now has reduced the number of regressions that have gotten out to production. We have very strong evidence of that at this point.

Ashley: Well and hopefully you can sing some of the praises, the good things of what the teams have learned so far with folks that are adopting it as you take on some of the bigger challenges like the larger teams do.

Well now, I appreciate you coming on the podcast. It’s been fascinating learning about what you’re gonna talk about. And I wish you the best in your presentation at Spinnaker Summit.

Wolfe: Thanks so much, Mitch. This has been a pleasure. And I hope to meet you at some point soon.

Ashley: That would be fantastic. I would enjoy that very much. I’d like to thank my guest today, Brian Wolfe, software engineer at Airbnb for sharing with us his experience with Spinnaker and how they are migrating to a continuous delivery model and scaling that up in the organization. Brian’s gonna be talking at the Spinnaker Summit 2019, which is November 15th through the 19th in San Diego. His talk is on Saturday the 16th at 11:00 a.m. And I believe you’re gonna be doing that with Jens Vanderhaeghe? Is that correct?

Wolfe: Yes, that is correct.

Ashley: Awesome. Good to have you both up there. Well thank you everyone for joining us today. I’d like to thank our listeners for taking the time to check out the podcast and hear about this talk. This is Mitch Ashley with staging-devopsy.kinsta.cloud. And you have listened to another DevOps Chat with my guest. Be careful out there.

— Mitchell Ashley