You've heard the adage "practice makes perfect," but at PagerDuty, an organization focused on helping IT operations and DevOps teams manage incident resolution through software, failure makes perfect. Or at least practicing failure helps this company improve its products and services, keeping customers happy and engineers in control.
This approach started in the fall of 2013 when an engineer at PagerDuty became fed up with the fact that his team couldn't discover product bugs early enough in the production stages. It was difficult for the engineers to be proactive and find solutions to problems before a customer encountered a bug for themselves. So, inspired by Netflix's controlled failure testing, they decided to introduce Failure Fridays.
It might sound counterintuitive, but it's grown into a tradition that has helped his team become better prepared when disaster strikes. And it's become an important ingredient in running a successful business, especially at a company like PagerDuty.
"We were mainly inspired by what Netflix had been doing already, how they tested and prepped for infrastructure resiliency, introducing failure scenarios into [their] infrastructure in a live environment and being able to do that in a controlled and safe manner," says Tim Armandpour, vice president of Engineering at PagerDuty.
Introducing Failure Fridays
Every Friday, Armandpour's team leads get together to figure out what they want to test that week. It can be anything from a newly launched service to taking an entire availability zone or data center out of commission. The goal is to make sure they can actually stay up and running in the face of emergencies, all without affecting customers and clients.
It's about understanding failure scenarios, says Armandpour, and establishing best practices and developing thoughtful strategies for when things go wrong. It's also about fostering a team bond, so that everyone can work together under pressure and remain calm through a "controlled and intentional" approach, he says.
PagerDuty is in the business of digital disaster preparedness. The company offers software to help businesses better approach, handle and remedy the late-night emergencies and disaster scenarios that can occur with technology. For Armandpour, Failure Fridays seemed like a natural extension of the company's overall mission. "We actually started to practice what we preached around how quickly you can get from identifying an issue to actually resolving it," he says. "We want to make sure we're at our best when our customers are at their worst."
What they've learned
Managing three data centers hosted through two different cloud providers, PagerDuty strives for an "always on" environment, so clients are never without their data. And they've gotten close, thanks in part to what they've learned from past Failure Fridays.
For example, Armandpour's team has uncovered scaling issues within their infrastructure, bugs in their products and redundancies in their process. In one instance, they uncovered a bug in Apache ZooKeeper that had been causing a persistent problem for years, and then alerted that community to the issue.
Taking a controlled approach
Of course, Failure Fridays aren't done without care -- a lot of planning, strategy and thought goes into each failure scenario. War rooms are set up, teams are briefed and everyone has a handle on what is about to happen and what part they will play. For example, Armandpour's team will often take down a server -- which is one third of their infrastructure -- and then methodically bring it back up within an hour without a single customer noticing, which is the point. He says it has helped them get better at triaging and implementing fixes, as well as build confidence in the team and even develop practices around fire-drill scenarios.
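For readers who think in code, the shape of such a drill can be sketched as a small simulation. This is only an illustration, not PagerDuty's actual tooling: the `Zone` and `Drill` names, the in-memory health flags and the assumption that service survives as long as two of three zones are healthy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """A hypothetical slice of infrastructure, e.g. one third of capacity."""
    name: str
    healthy: bool = True


@dataclass
class Drill:
    """Simulated failure drill: take one zone down, check service, restore it."""
    zones: list
    min_healthy: int = 2  # assumed threshold: service needs 2 of 3 zones
    log: list = field(default_factory=list)

    def service_up(self) -> bool:
        # Service counts as "up" while enough zones remain healthy.
        return sum(z.healthy for z in self.zones) >= self.min_healthy

    def run(self, target: str) -> bool:
        zone = next(z for z in self.zones if z.name == target)
        # Never start a drill when the service is already degraded.
        if not self.service_up():
            self.log.append("abort: service unhealthy before drill")
            return False
        zone.healthy = False
        self.log.append(f"took down {target}")
        try:
            survived = self.service_up()
            self.log.append(
                "service stayed up" if survived else "customer impact detected"
            )
            return survived
        finally:
            # Always bring the zone back, whatever the drill revealed.
            zone.healthy = True
            self.log.append(f"restored {target}")


if __name__ == "__main__":
    drill = Drill([Zone("zone-a"), Zone("zone-b"), Zone("zone-c")])
    print(drill.run("zone-b"), drill.log)
```

The `finally` clause reflects the "controlled" part of the exercise: the zone is restored even if the check shows customers would have been affected, and the log gives the team something concrete to review afterward.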
"We're big believers in the notion that you need to plan for things that will go wrong, especially those things that aren't in your control," says Armandpour. And, as he points out, when you are relying on third-party cloud infrastructure for part of your business, "you only have so much control."
Each business will have to build its own approach to Failure Fridays to be successful, he points out. There isn't a simple formula for everyone to follow. Some smaller businesses might have some employees dedicated to managing failure scenarios, while a bigger company might treat it as a more centralized fire-drill, he says. However you approach it, you want the goal to be staying two steps ahead, and being proactive instead of reactive -- so you're never left struggling to fix a problem for a customer that could have been prevented.
"Building that super strong culture where you're not panicking in moments of failure, which I think is fairly commonplace, you build a ton of trust and empathy inside your organization that I think is absolutely invaluable, especially as organizations grow and infrastructures get more complex," Armandpour says.