Failure Fridays helps one company build a better customer experience


You've heard the adage "practice makes perfect," but at PagerDuty, an organization focused on helping IT operations and DevOps teams manage incident resolution through software, failure makes perfect. Or at least practicing failure helps this company improve its products and services, keeping customers happy and engineers in control.

This approach started in the fall of 2013 when an engineer at PagerDuty became fed up with the fact that his team couldn't discover product bugs early enough in the production stages. It was difficult for the engineers to be proactive and find solutions to problems before a customer encountered a bug for themselves. So, inspired by Netflix's controlled failure testing, they decided to introduce Failure Fridays.

It might sound counterintuitive, but it's grown into a tradition that has helped his team become better prepared when disaster strikes. And it's become an important ingredient in running a successful business, especially at a company like PagerDuty.

"We were mainly inspired by what Netflix had been doing already, how they tested and prepped for infrastructure resiliency, introducing failure scenarios into [their] infrastructure in a live environment and being able to do that in a controlled and safe manner," says Tim Armandpour, vice president of Engineering at PagerDuty.

Introducing Failure Fridays

Every Friday, Armandpour's team leads get together to figure out what they want to test that week. It can be anything from a new service they launched to taking an entire availability zone or data center down and forcing it out of commission. The goal is to make sure they can actually stay up and running in the face of emergencies, all without affecting customers and clients.

It's about understanding failure scenarios, says Armandpour, and establishing best practices and developing thoughtful strategies for when things go wrong. It's also about fostering a team bond, so that everyone can work together under pressure and remain calm through a "controlled and intentional" approach, he says.

PagerDuty is in the business of digital disaster preparedness. The company offers software to businesses to help them better approach, handle and remedy those late-night emergencies and disaster scenarios that can occur with technology. For Armandpour, Failure Fridays seemed like a natural extension of the company's overall mission. "We actually started to practice what we preached around how quickly you can get from identifying an issue to actually resolving it," he says. "We want to make sure we're at our best when our customers are at their worst."

What they've learned

Managing three data centers hosted through two different cloud providers, PagerDuty strives for an "always on" environment, so clients are never without their data. And they've gotten close, thanks in part to what they've learned from past Failure Fridays.

For example, Armandpour's team has uncovered scaling issues within their infrastructure, bugs in their products and redundancies in their process. And in one instance, they were able to uncover a bug in Apache ZooKeeper that had been causing a persistent problem for years and then alert that community to the issue.

Taking a controlled approach

Of course, Failure Fridays aren't done without care -- a lot of planning, strategy and thought goes into each failure scenario. War rooms are set up, teams are briefed and everyone has a handle on what is about to happen and what part they will play. For example, Armandpour's team will often take down a server -- which is one third of their infrastructure -- and then methodically bring it back up within an hour without a single customer noticing, which is the point. He says it has helped them become better at triaging problems and implementing fixes, as well as build confidence in the team and even develop practices around fire-drill scenarios.

"We're big believers in the notion that you need to plan for things that will go wrong, especially those things that aren't in your control," says Armandpour. And, as he points out, when you are relying on third-party cloud infrastructure for part of your business, "you only have so much control."

Each business will have to build its own approach to Failure Fridays to be successful, he points out. There isn't a simple formula for everyone to follow. Some smaller businesses might have a few employees dedicated to managing failure scenarios, while a bigger company might treat it as a more centralized fire drill, he says. However you approach it, the goal should be to stay two steps ahead and be proactive instead of reactive -- so you're never left struggling to fix a problem for a customer that could have been prevented.

"Building that super strong culture where you're not panicking in moments of failure, which I think is fairly commonplace, you build a ton of trust and empathy inside your organization that I think is absolutely invaluable, especially as organizations grow and infrastructures get more complex," Armandpour says.
