#673 Strange Loop 2017 Talk

Sorry for taking forever to get this up. This talk was presented September 30th, 2017 in St. Louis, Missouri.

Practical Applications of the Dickerson Pyramid

Transcript

(Slightly edited for clarity)

Good morning, everyone. This is Practical Applications of the Dickerson Pyramid. My name is Nat Welch. I'm @icco on GitHub and Twitter. I'm a Lead Site Reliability Engineer at First Look Media in New York. This talk is a history of applying the Dickerson hierarchy through a bunch of different jobs. Among them, I'll be talking about a company called iFixit, which I worked at many years ago in San Luis Obispo, California as a software developer; Punchd, which was a very small startup in San Francisco; Google, which you've probably heard of, where I was a software engineer on Google Offers and a site reliability engineer on two Google Cloud products, Google Cloud Storage and Google Compute Engine. Then, at my most recent past job, I was a site reliability engineer for 2016 at Hillary for America in Brooklyn, New York.

A quick aside as context: site reliability engineering means a lot of things to a lot of people, or a small number of people maybe. I learned SRE at Google. All of my views are heavily tinted by working for a large web company. I'm hoping that you can take some of these practical applications and apply them to whatever sort of software job you do. If you have questions about how to apply this to something more unusual, please talk to me afterwards.

I wanted to start with, what is this Dickerson Pyramid thing? This was originally published in a summary of Nori Heikkinen's (sorry if I mispronounced that) talk from QCon New York 2015. It's originally by Mikey Dickerson, an engineer at Google and then, also, at the United States Digital Service. He worked with Nori, plus many others, on the healthcare.gov thing, and came up with this hierarchy pyramid as a way to explain the things you need to make a service reliable.

I've redrawn it in green, and put a little circle around it because it's my talk. The circle is supposed to represent communication, which, I think, is the main thing that's missing from the pyramid. SRE, as a role, as I've experienced it, requires talking to a lot of people. As a software developer, I could maybe spend 70% of my time coding and 30% of my time talking to people. I find SRE is a much higher ratio, 50/50 or higher, depending on the service and company. I'm hoping to show how communication applies to all of these levels, along with the practical applications I'll be mentioning.

The bottom level, as you saw, was monitoring. Often, that means those little graphs that spike sometimes, for better or for worse. This is all software that I've used at a bunch of different companies. The point here is that you can use whatever tool works for you. At iFixit, we used Cacti and Ganglia, and it worked. At Punchd, we just used Pingdom because, really, all we cared about was whether the software was up and our customers could get to it. Google has a large internal system called Borgmon for monitoring. At Hillary for America, we used a hosted version of Datadog. At First Look Media right now, we use Pingdom and CloudWatch.

How does communication fit into monitoring? Monitoring is the thing that you use to find out if things are up, or working, or doing the thing you expect. Every business is different. What you actually want to watch and pay attention to is different. You need to talk to the people you work with about what monitoring tools fit your business needs, what metrics your business cares about, and also what metrics you care about to actually do your job as a software engineer, SRE, ops person, or whatever. You also need to make sure that not just you, but all of the roles that you have to support, can actually get to those metrics, and that they're being recorded.

My first practical application of monitoring comes from my current job, First Look Media. We're a growing media company. We currently only have 10 or 11 engineers, but we support a bunch of different websites. Right now, by our current metrics, things are up most of the time, but we're starting to realize that we're having outages that we're not noticing. In August, we had three outages where better monitoring would have improved downtime. We had about 10 to 15 minutes of degraded service for each outage, and if we had had monitoring, or better monitoring, we probably would have discovered those outages earlier.

Right now, we're working through plans. The big thing is cost evaluation. We don't have a lot of money because we're relatively young. Figuring out what tools we need versus how much money we're willing to spend, and what solves the business problems, are large conversations that are ongoing. The bigger thing is also figuring out how to change our engineering culture, so that people actually think to look at the metrics instead of just saying, "Well, we could add a cache later," or "We could just scale it." Oftentimes, there is no data to support that, but we're still figuring all of that out. It's a culture shift along with starting to ask the right questions.

Picture of our hierarchy again, monitoring at the bottom. Now that we have monitoring, we can talk about incident response because as soon as you have monitoring, you're going to start seeing the things that are not working as you expect. There are two categories I think of when I think of incident response. Alerting, along the lines of who gets alerted, and when you get alerted, and how quickly does that person need to respond. Incident communication, where is communication done, and who is in charge.

There's a long topic that someone else could probably cover on NIMS, which is how most US incident response groups deal with delegation when they come to a large fire, or a national emergency, or things like that. It's often adapted, and the basic idea is that the first person who shows up is in charge until they delegate it to someone else. Just a thing to think about.

The communication part is, what is an outage for your business? What's worth waking up for? What should you be investigating first thing tomorrow morning? Those are very different topics. Oftentimes, you have to decide whether the business is actually right that you should be waking up for a five-minute outage, or if it's something that can wait until tomorrow. A common problem is, do people know where to go when there is an incident? If something is down, is there a place they can check to know that this is actually being worked on, and that they're not the first person to discover that something is not working?

My example here is from Hillary for America. It was a very dramatically growing organization. In June of 2016, we decided to redo all of our alerts. We had been given a bunch of alerts from the Groundwork, which is an organization we were working with, plus we had written a bunch, and we had changed greatly as an organization. In the summer of 2015, we were six people. By the summer of 2016, we were 60 or 70, maybe 80 engineers. While the alerts were great, they were waking us up probably two or three times a week when we didn't need to be, and firing a lot throughout the day for things that weren't actual issues.

We were now the Democratic Party's nominee. We had much different concerns than we did the previous year. We started documenting all of our existing alerts. We worked on a doc. We had four SRE ops people inside of HFA. We wrote down in the doc, "These are the rules for how we're going to evaluate alerts into our three buckets," which are wake me up, send a message on Slack, or just make sure we have a graph. We went through that. We split up the hundred or so alerts that we had among the team. We came back and discussed each alert, and made sure that the bucket suggested by the person who went through that alert worked.
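
The three buckets can be sketched in a few lines of Python. This is only an illustration: the two yes/no questions, the `Alert` fields, and the example alert names are assumptions made up for the sketch, not the actual rules from the HFA doc, and the output is a suggestion for the team to discuss rather than a final decision.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    PAGE = "wake me up"
    SLACK = "send a message on Slack"
    GRAPH = "just make sure we have a graph"


@dataclass
class Alert:
    name: str
    user_facing: bool      # hypothetical rule: does it affect people using the site?
    needs_human_now: bool  # hypothetical rule: would waiting until morning hurt?


def triage(alert: Alert) -> Bucket:
    """Suggest a bucket; the team still reviews every suggestion together."""
    if alert.user_facing and alert.needs_human_now:
        return Bucket.PAGE
    if alert.user_facing or alert.needs_human_now:
        return Bucket.SLACK
    return Bucket.GRAPH


# Made-up alerts, one per bucket.
alerts = [
    Alert("404 spike on main site", user_facing=True, needs_human_now=True),
    Alert("elevated latency on a landing page", user_facing=True, needs_human_now=False),
    Alert("disk 70% full", user_facing=False, needs_human_now=False),
]
suggestions = {alert.name: triage(alert) for alert in alerts}
```

Writing the rules down as explicit questions is the useful part; whether they live in a doc or in code matters much less.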

We went from pages being about 30% actionable to about 85% actionable. I started sleeping a lot better, which I was a huge fan of. We explicitly left some noisy alerts in because that's what our business needed. We had a few occasions where 404s would spike on our site because people would link to URLs that didn't exist, either due to late-night communication, someone missing a step in the checklist, or people who weren't associated with the campaign making up URLs. We figured out that in pretty much all cases, we'd want to be woken up for that. That tended to be a very noisy alert, as an example. For the most part, we slept a lot better.

Pyramid. Next step is postmortems. We've had incidents. Now, we need to go over what the incident was, and try to make ourselves better. The whole point of postmortems, in my mind, is to document outages and discuss them. Often, something you hear is, "Will there be a postmortem?" I'm a big proponent of, if someone asks, the answer is yes. To also explain this picture, this is my postmortem with a breakfast sandwich. It was delicious.

The communication aspect of postmortems: the only reason a postmortem exists is to communicate. There's a lot here, but you're writing a document for future reference. You're communicating with people you haven't met yet, and very possibly may never work with, but you're trying to document everything that happened, what went down, how you fixed it, and why you made your decisions. It's important to write a document that communicates with all parties associated with your business. Do you have a customer service team? Can they understand it well enough to communicate with external customers? Do you have other engineering teams? Can they understand it, so that they can avoid making similar mistakes? Can your boss understand it, so they know what you're doing all day? Those sorts of things.

My example comes from when I was working on Google Compute Engine. There were a lot of outages in 2013. 2013 was the year Google was changing Google Compute Engine from being a beta service, which had been launched in June of 2012 to the external world, to December of 2013 when it opened up to the full public. There were just a lot of outages. There were about six different product teams and one SRE team. We were running into the issue where escalation chains weren't working. People would have outages, and not tell anyone. Then, another team would notice the outage because they were dependent on it. Then, they would have an outage. There was no collaboration going on. Also, people would break for similar reasons. The storage team would break for one reason. Then, the networking team would break for a very similar reason.

We started doing two things. For the most part, we brought everyone into a biweekly meeting purely on communications. Get everyone in the same room, or at the very least all of the technical leads, site reliability engineers, product managers, and senior leadership, so they know, "This is what you should be telling the people under you about how we communicate, how we do postmortems, and the types of outages that we're having across our different subteams." Then, we would also bring in, usually, the engineers associated with the actual outage. Then, we would share the postmortems widely because this was also a distributed team. We had engineers mainly in Seattle, but also in San Francisco, Mountain View, and a few in New York and Dublin.

Next step is testing and releasing. We've dealt with the immediate fires. Monitoring was to make sure we know what's currently going on or has gone on recently. Incident response is to respond to things that we discover in our monitoring. Postmortems are to go over those incidents that just happened. Now, we can start looking forward. Usually, the way that engineering teams move forward is by writing code and releasing it.

Testing and releasing is my favorite area for automation because there's a lot you can do to improve developers' lives. Oftentimes, people look at how long their build takes, or how long their test run takes. There are many ways to improve that. There's a reason there's an entire release engineering field devoted to making this faster. It's very dependent on your team, I think, more so than any of the other categories, because every team writes software so differently, depending on your stack and the environment that you're deploying to. The overall key is to improve people's lives by catching bugs and other issues before they're available to every single user of your product.

Conversations that you can start having around testing and releasing: are tests being written? Do tests actually test how customers use your product? How often does your team want to release? Can you release without an outage? Does that matter to your organization? What will happen to this service when you release? These are all things to start thinking about because, oftentimes, people just write a slew of unit tests and go, "Code's tested." You have to have those discussions with the engineers that you're working with.

Two quick stories around this, mainly because I've personally made both of these mistakes. The first is about an engineer I worked with at Hillary for America named Amy Hailes. Amy saved the day basically by automating our Fastly config. Fastly is a CDN service. They let you push Varnish config straight to them. You can write VCL, and send it up. As long as it compiles, you're good to go. It's live on all their edge nodes. We were running into the problem where (A), we had different configs in dev, our development environment, and in our production environment. Also, we weren't really testing things.

Amy automated our testing across all of our systems to make sure our Varnish config compiled and met a basic set of tests. Then, she also automated the deployment, so that anything that got deployed to dev also got deployed to prod, and vice versa. We had a consistent config across both environments. She did this in June or July. For those of you who followed the US presidential election cycle, July is a calm moment with lots of people talking. Then, things keep going crazy from there.
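
A rough sketch of the "is the config the same everywhere?" check, in Python. It assumes each environment's VCL has already been fetched into a string by some other step; it does not call Fastly, and the function names and the sample VCL are invented for illustration rather than taken from the automation Amy built.

```python
import hashlib


def fingerprint(vcl_text: str) -> str:
    """Hash a config after stripping trailing whitespace, so purely
    cosmetic differences do not count as drift."""
    normalized = "\n".join(line.rstrip() for line in vcl_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(configs: dict[str, str]) -> list[str]:
    """Return the environments whose config differs from the first one listed."""
    names = list(configs)
    if not names:
        return []
    baseline = fingerprint(configs[names[0]])
    return [name for name in names[1:] if fingerprint(configs[name]) != baseline]


# Placeholder configs: prod carries a hand edit that never reached dev.
dev_vcl = "sub vcl_recv {\n  return (pass);\n}\n"
prod_vcl = "sub vcl_recv {\n  return (lookup);\n}\n"
drifted = find_drift({"dev": dev_vcl, "prod": prod_vcl})
```

A check like this only detects drift. Deploying both environments from the same source is what keeps it from happening in the first place.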

The next is about an engineer named Bill at Google. He was my main code reviewer on a project I worked on called Google Cloud Status. I wrote some very basic unit tests. My build would just be green. I was like, "Bill, let's just auto-deploy on green. It will be great. Every time you review my code, we'll press submit. Then, five minutes later, it will be running for everyone, and it will be amazing." I wanted this because it would allow me to develop faster. Since this was a side project, and not my normal day gig, I wanted to use those 30 minutes here and there that I got to work on it to actually iterate quickly on something.

Bill said no because Bill is a saner person than I am. Bill pointed out that based off of the unit tests, I didn't actually know if the code was working. He actually wrote up an example for me where basically the whole site was broken, but the tests were all green, just to prove that I was wrong, which was nice. It reminded me that I didn't actually know how people were going to be using this because, right then, my internal customers were basically me. We started walking through and actually doing user tests, and started writing up integration tests. Then, we had an integration test that turned green, so that we could automate deployment to get it out there, which was nice because we had also meant to do user tests at some point. It forced our hand. Thanks, Bill Thiede.

Testing and releasing. We know what the impact of our changes is because we're monitoring our system. Also, we know when we're releasing things. The next thing is to figure out how we're going to grow over time. It's often called capacity planning. This is a graph of the growth of cost over time, but the idea of capacity planning is to have a plan for infrastructure growth. It's easy to ignore now that a large number of web companies run on other people's computers, but it's still something you have to think about.

You can help your business out a lot by figuring out the cost of how your infrastructure is going to grow. If you know that over the next year, give or take, it's going to cost $20,000, that's a lot different than $5,000, or a million dollars, or whatever your scale is. It also helps your engineers think about how they're going to engineer, design, and architect things. If you're saying, "We need to stick to only growing our cluster by 10 machines," their infrastructure decisions will, hopefully, be different than if you say, "Yeah, you can have whatever you want." It's just something to think about.

For the communication part, the first step is figuring out how much money you have. In some cases, you don't have a real budget. I've worked at places like Google where there's someone out there managing a budget, but your capacity planning, for the most part, is not there yet. On Google Offers, for example, for a long time, we weren't thinking about that. It was like, "We're just going to push this feature." It didn't really matter how it affected things. But working at a small business, when we were an engineering team of six at Punchd, $50 was dinner for the whole team. That was a lot of money because we were 20-somethings.

How much does the business expect to grow over the next time period? Does that match with your plans? If the business is expecting to double your number of users over the next six months, do you have the money to actually scale to meet that? Are you making the engineering decisions to meet that? You need to have those conversations early, but not so early that you don't have the data to support what you're going to do.

An example of this from Hillary for America is a piece of software called Pogostick. Pogostick was built by Alex Berke. She built a tool that was a URL shortener for tracking email metrics. Political campaigns send a lot of email, at least in the US. We would get large spikes of traffic because people would get this email that's like, "Hillary Clinton wants you to donate as much money as you possibly can to us, so that we can keep the lights on and she can win." That would cause large spikes in traffic as people clicked these links. We would want to monitor for these spikes. These spikes would grow over time as people signed up for our email list.

As the list grew, the email team would reach out to Alex, and basically be like, "Hey, this week, we're planning x number of sends, which is an increase from last week, and our email list has tripled in size," or something like that. Completely arbitrary metrics. She would then ask, "How much growth is that? Are we prepared for that growth?" She actually had all of the metrics. She had built a dashboard that met her needs to explain how the system was working.

Sometimes, she'd be like, "Our last send, which was half of this, used 50% of our CPU," or something like that. "If we double it, and last time we doubled, we saw a similar CPU jump, we should increase the number of cores we have," or something like that. She would then reach out to the ops team saying, "Hey, I'm going to need more machines." We'd be able to have this conversational loop. Since growth was basically based on user signups, she didn't have … wasn't able to build as much of a long-range plan, so it was more about being able to react relatively quickly and say, "Hey, we need to grow this amount because over the past day, we've had to react in this way."
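
That back-of-the-envelope reasoning can be written down as a small Python function. It is a sketch under stated assumptions: CPU use is assumed to scale linearly with send size, the 70% target utilization is an arbitrary headroom choice, and the function name and numbers are made up for illustration, not taken from Pogostick.

```python
import math


def cores_needed(current_cores: int, cpu_used: float, growth: float,
                 target_utilization: float = 0.7) -> int:
    """Estimate cores for the next send.

    current_cores: cores serving traffic today
    cpu_used: fraction of CPU the last send used (0.5 means 50%)
    growth: how much bigger the next send is (2.0 means double)
    target_utilization: how busy we are willing to run (assumed 70%)

    Assumes CPU scales linearly with send size, which should be checked
    against past sends. Never suggests scaling below the current size.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_cores = current_cores * cpu_used * growth
    return max(current_cores, math.ceil(busy_cores / target_utilization))


# Last send used 50% of 8 cores; the next send is twice as large.
estimate = cores_needed(current_cores=8, cpu_used=0.5, growth=2.0)
```

With these example numbers the estimate is 12 cores: doubling a send that kept 4 cores busy needs 8 busy cores, and running those at 70% utilization rounds up to 12.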

She had the tools to basically initiate the conversation. I didn't have to do anything until Alex sent me a PR or said, "I'm going to do this. Is that okay?" I'd be like, "Yeah, fine. Spin up four more m4.xlarges," or whatever she needed to do. That was really great both for lowering the amount of energy we had to focus on each individual application, and for letting her plan for capacity over time based off of conversations she was having with the email team.

Next is, I guess, my favorite topic, which is development. I love the phrase, "Automate yourself out of a job." I think that's why I got really attracted to SRE: I can work on tools so that I don't have to do the same thing again in six months, because my memory is horrible, and I'll often just forget how things work. If I've automated it, I really don't have to worry about it. I have said that quote before, and also quit because of that. I feel like I'm, at least, sticking to my mantra, for whatever that's worth.

This is another communication thing. I think most software engineers, and feel free to correct me if I'm wrong after the talk, want to build the cool thing. That has a cost to it, both in your time and your energy. In most organizations I've worked at, there's been one SRE to every 10 to 12 developers. That's a lot of incoming problems and things to think about, just from those 10 to 12 people.

Is it worth me basically ignoring those people for a month to build the next cool tool that I think will make everyone's lives better versus what's out there? Can I use Terraform instead of writing my own changed config? Can I use a hosted monitoring system instead of inventing my own time series database or whatever? Or is there a script that does this thing instead of spending the 30 minutes to write it? These are conversations you have to have with yourself and your team knowing their priorities.

I go to the sprint planning meetings, in my current job, of all of our engineers. I go to both of them. I basically lose my Wednesday to it because there's lots of planning and people talking about all of the work that they're doing, but it lets me know the priorities of the teams that I'm supporting and what they're doing, and whether or not I can take the time to go work on the tools that I want to work on, or if I need to focus on making sure that they can be successful at their jobs.

Then, I think the other aspect of this is how other companies are solving these problems. SRE isn't done in a vacuum. As I've been mentioning over and over, it's about conversations, but not necessarily always within your company. For networking, I have coffee with someone who works at another startup just down the street, and we talk about all sorts of engineering problems. He's been harping on me for weeks for not checking out this new monitoring system that he thinks is going to solve all my problems. It's a good way to share what other people are doing and how they're solving their problems. This conference and many others are great for figuring out how we can solve these problems in ways that save us time, or energy, or money.

Then, this is my personal problem: I often think that things are much larger problems than most other people do. I think we all have our little pet peeves, things that we want to fix right now. Really, they can often sit for six months, or at least a day. I highly recommend stepping back before diving into your next coding project and asking, "Do we actually need to solve this right now?" I've definitely lost a few days, and then thought, "That was a waste of time."

Examples. These are also from Hillary for America, just two tools that I personally built. They were built in August and September because that was when I felt that we had the other layers taken care of. First was Edgeparty. Once all of our deployment to Fastly was automated, we were running into the problem that a lot of people wanted to add redirects across applications in a weird way. We moved all of that to basically a Slack bot and an API that people could use, so that anyone could write redirects in Slack, assuming they were part of a white-listed group of users.

Then, Pokemon. We wanted to test every single URL because we thought it was a good idea. Pingdom got expensive. We turned off our Pingdom account, and basically just never stopped DoSing ourselves for the next three months, but it provided response metrics for both time and, I guess, validity of all of our URLs, which was really helpful for us in measuring growth over time. Both tools were built in the last quarter of the campaign.
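
The core of a "check every URL" tool fits in a short Python sketch. This is not the code of the tool described above; it is a minimal illustration using only the standard library, and the `CheckResult` type, function names, and example URLs are hypothetical. The `fetch` parameter exists so the timing and reporting logic can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    url: str
    status: int      # HTTP status code, or 0 if the request never completed
    seconds: float   # wall-clock time spent on the request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET request, or 0 on network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def check_urls(urls: list[str],
               fetch: Callable[[str], int] = http_status) -> list[CheckResult]:
    """Time one request per URL and record the status each one returned."""
    results = []
    for url in urls:
        started = time.monotonic()
        status = fetch(url)
        results.append(CheckResult(url, status, time.monotonic() - started))
    return results


def broken(results: list[CheckResult]) -> list[str]:
    """URLs that did not answer with a 2xx or 3xx status."""
    return [r.url for r in results if not 200 <= r.status < 400]


# Offline example: a dict lookup stands in for the real HTTP fetch.
fake_site = {"https://example.test/": 200, "https://example.test/missing": 404}
report = check_urls(list(fake_site), fetch=fake_site.get)
```

Here `broken(report)` lists only the `/missing` URL. Pointed at real traffic, a loop like this needs rate limiting, or it becomes its own source of load.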

We can now talk about, I guess, an area which I feel like SRE teams at much larger companies realize is important, and which seems to be lost on operations and SRE groups at smaller organizations: user experience. I think Google personally talks a lot about how SRE is a customer service role, and that's because I think the organization has gotten to the point where they've solved all of their other problems, I guess.

The point for all of this is to make sure your users have a good experience with your system both in terms of can they load the page, et cetera. We, as people who tend to focus on the infrastructure, we have a lot of effect on that because, sure, there's the feature aspect, but how those features are delivered to our users can be a make or break situation. If your page takes five seconds to load before you can click on anything, it's a much different experience than the instantaneous flash of Twitter, or Google, or something like that.

This is the conversation you'd have. There's, hopefully, someone who knows the answer to these questions at your business, but I've also been in positions where no one has been able to figure that out yet because you had just started monitoring or something like that. How are your users experiencing your product? Are they coming in on mobile? Are they doing it all at once? Do they stay on the page for 20 minutes, or do they only come when you send them an email and they click a link?

Where are your users experiencing your product? Are all of your customers in New Zealand? You have much different network latency from New Zealand to Virginia versus New York to Virginia. When are your users experiencing your product? My classic example of this now is, once again, Hillary for America, because most of our users were in the US, and you could follow the US diurnal cycle. Google Compute Engine is running VMs for people all over the world and does not have that nightly off time.

A way that I've approached this is something I briefly mentioned earlier in the talk, which is status.cloud.google.com. Before Compute Engine had launched, this was something, I think, the SRE team was starting to think about, but no one else had really gotten around to it because, as I was mentioning, in 2013, people were still fractured and working as independent subteams. I went around asking, "Who's going to build status.cloud.google.com?" First of all, at the time, the technical infrastructure organization inside of Google didn't have any designers or anyone thinking about that. There was no person that could even start mocking this out. Then, also, most of the other subteams were working to make sure that they had all of the features that they needed to launch in June.

I, as a relatively new SRE, had some bandwidth. We started basically building up the thing. Building the actual website took, in reality, two months of time, but the next two years were spent talking to engineers, product teams, and customer service teams, deciding who would actually update this thing and how often it gets updated. I'm personally very proud of that because, now, I think Google's postmortems and status updates are some of the best in the industry. It takes a lot of time to make sure that people know this thing exists, so that they can talk together.

That improves the user experience because, as someone who's using one of the cloud services now, you can go and ask, "My website is down. Is it Google's fault?" If the answer is yes, hopefully, there's a little yellow or red dot on the screen. It's not a guarantee, but if it's consistently responding the way that you expect, and you can trust them to be telling the truth, then you can, at least, start focusing on your own service, and not blame other people, and actually have a reasonable answer to why something is not working.

That was the pyramid. Again, I hope that it came across that communication is really important in all of these, and that you can apply it at many different sizes of companies, whether you're six people or 70,000. Please talk to people. I have these slides that we'll put up somewhere, and there are lots of references and further reading. It's lots of stuff.

I'll be here, standing over here, for questions. If you don't feel comfortable talking in person, you can DM me, @icco on Twitter or that's my email address.

Thank you so much.