How we reduced oncall pager load from 300/week to 15/week
A strategy and framework for dealing with high oncall load and tech debt.
When I started on DynamoDB, our team was getting paged ~50 times a day, i.e. 300+ times a week! Instead of the typical weekly oncall rotation, the team initially moved to a daily rotation to deal with the pager load, and then to 12-hour shifts split between the US and Dublin.
Shipping new releases would take months because we did not have reliable automated tests. Some features were way behind. Engineers would constantly get pulled into operational issues in every sprint. Some of our operational metrics were trending the wrong way, and we had one of the biggest outages in AWS.
Over the next year, we were able to get out of that downward spiral. Pager load dropped to 10-15 pages a week, while the service itself scaled 5x (75 million requests/sec, five nines of availability) and we delivered multiple new products like backup/restore. The first time we went a full day without a single page, my manager asked me if the ticketing system was broken!
How did we get out of that oncall spiral?
Many products and services go through phases: a phase of significant innovation is usually also a phase of accumulating ‘debt’. Some amount of debt is healthy; otherwise you may be over-engineering early on. Too much debt, however, eventually becomes a massive drag on the team. DynamoDB had gone through the initial innovation phase and had accumulated ‘debt’.
Diagnosis: The first step is identifying (diagnosing) the state your team is in; that usually dictates what strategy to use. In this situation, we were continually falling behind (i.e. we lacked predictability) and oncall load was unsustainable. Fixes like going to 12-hour shifts only address symptoms, not root causes. The team could not make progress on the feature backlog or address the root causes of issues, so it kept falling further behind. This diagnosis drove the strategy.
Strategy and framework: Given the situation and challenges, our strategy and tactics were largely focused on three areas. These sound simple, but they require business and leadership support, and explicit, hard conversations, to enable.
Create Focus and Predictability:
We had conversations with product and business to identify the must-have business priorities and features. The rest were deprioritized. The product team was supportive, as they recognized that the current situation was not sustainable; in some cases, what they really needed was predictability so they could better manage customer expectations. On the engineering side, running more things in parallel adds cognitive load and consumes leadership (technical, management, product) bandwidth. Doing fewer things at a time helped the team conserve that bandwidth and put more effort behind the most important arrows.
We then expanded the box allocated to ‘keeping the lights on’ in each sprint. For example, instead of just one primary oncall, we created a secondary oncall focused on root causes. This allowed us to fix root causes when issues happened, instead of just trying to get through the 12-hour rotation as quickly as possible. While this allocated more ‘planned’ operational time than before, it made sprints far more predictable than unpredictably pulling people off planned work to deal with operational issues. As we made progress on fixing issues, it had a nonlinear impact in quickly driving pager load down.
Fix root causes: We also analyzed all the categories of issues to create a plan that targeted the areas that would reduce the most pager load. This plan and its progress were reviewed regularly with leadership. Sometimes fixes were simple, like tuning; in other cases they required building new automation or solutions. For example, we found a category of issues that paged on StorageNode failures, so we invested more in our auto-healing process. Similarly, we invested significantly in automating testing, bringing manual release processes down from six weeks to less than a week.
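To make that prioritization concrete, here is a minimal sketch in Python of the kind of analysis behind such a plan. It is not our actual tooling: the CSV export, the ‘category’ column, and the category names are illustrative assumptions. The point is simply to rank pager categories by volume so fixes target the biggest sources of load first.

```python
# Hypothetical sketch: rank pager categories by volume to decide which
# root causes to fix first. The data format and field names are assumptions.
from collections import Counter
import csv

def top_page_categories(path: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent pager categories from an exported CSV of pages.

    Assumes each row has a 'category' column (e.g. 'storage-node-failure',
    'deployment-alarm'); the column name and values are illustrative.
    """
    with open(path, newline="") as f:
        counts = Counter(row["category"] for row in csv.DictReader(f))
    return counts.most_common(limit)

if __name__ == "__main__":
    # e.g. a week of pages exported from the paging/ticketing system
    for category, count in top_page_categories("pages_last_week.csv"):
        print(f"{category}: {count} pages")
```

Even a rough breakdown like this tends to show that a handful of categories generate most of the pages, which is what made it possible to pick a few fixes (like auto-healing for StorageNode failures) with outsized impact.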
Additional staffing: After some analysis, we also identified that the team was significantly understaffed for handling both maintenance and feature development. Running a service like DynamoDB, which every other AWS service depended on, carried a significant operational load. The service had previously lost senior engineers to the high load and struggled to hire new ones, who are understandably reluctant to join a service with a heavy operational burden. The most common question internal transfer candidates asked me was about pager load (more on how we hired for the service in a later post). We did a lot of work to hire and build out the critical areas of the service, which allowed progress in multiple areas simultaneously.
Measuring and rewarding progress: As we made these changes, we tracked progress closely and celebrated wins in this area. In many companies, new feature work gets the attention, while the work of improving system and operational performance does not get the same level of recognition, even though keeping systems running smoothly and sustainably is perhaps more important. We made a conscious effort to reward this work, including in promotions, calibrations, and team communications. If you are interested in building a culture of operational excellence in your teams, this post covers those aspects in more detail.
Through a lot of hard work, the team eventually not only brought the operational load down to a manageable level but also delivered significant innovations.