Not awful outages: Telling users fast

Sometimes it takes a while for someone to notice an incident. There are several reasons for that:

  • Few if any users are impacted.
  • The affected feature isn’t used frequently.
  • Most users and SREs are offline, due to the time of day or a holiday.
  • The continuous tests (your own monitoring) didn’t catch the error, so no SRE was paged.

This user journey maps both the user’s view and an SRE’s view of an incident:

Incident user journey mapped to SRE journey

Once you know about an incident, it might still take a while to publish a notification. Given the complexity of cloud offerings, an incident needs some analysis to determine the size of the blast radius: the number of users, the scope of features, and the number of regions affected. It’s not really helpful to users if you post a message that says:

“Uh-oh, something might be wrong. Maybe the whole Internet’s down. Or, it could be just our world-wide data centers. Or, maybe it’s only the AI apps in the Houston region.”

Induce panic or calm the patient


When generic messages like the one above are posted, users tend to worry and immediately start checking their apps. In support channels, we’ve seen people post several messages asking for help, speculate about the extent of the incident, and generally spread more panic among other users. These are the panic behaviors that crowd psychology describes, now played out in the digital realm.

So although it’s important to get a message out quickly, it’s equally important not to panic users. If we scope the impact for them, we don’t waste their time: they won’t lose an afternoon (or worse, an evening) investigating something that might not even apply to their apps. A notification that is worded well and sent promptly can actually calm users down. For what it’s worth, our research has indicated that cloud users expect incredibly fast response times, meaning within minutes.

To get a feel for the user’s perspective during an incident, explore this snippet of activities from an as-is user journey:

Breaking radio silence

When you are ready to break radio silence, do it as quickly as possible. Different teams have varying processes to get messages out. We’ve already talked about templates as one mechanism to speed up the notification creation. Another option is to use APIs or other automation to decrease the time between learning about the incident and publishing its notification.
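To make the template idea concrete, here is a minimal sketch of filling in a notification from a template with Python’s standard `string.Template`. The template wording, the field names, and the `render_notification` helper are assumptions made up for illustration; they are not the templates or tooling that any particular team uses.

```python
from string import Template

# Hypothetical template: the wording and field names are illustrative only.
INCIDENT_TEMPLATE = Template(
    "$service incident: $summary\n"
    "Affected regions: $regions\n"
    "Current status: $status"
)


def render_notification(service, summary, regions, status="Investigating"):
    """Fill the template so the first posting goes out with scoped details."""
    return INCIDENT_TEMPLATE.substitute(
        service=service,
        summary=summary,
        # If the blast radius is not known yet, say so instead of guessing.
        regions=", ".join(regions) if regions else "under investigation",
        status=status,
    )


if __name__ == "__main__":
    print(render_notification("Example Service", "elevated API error rates", ["eu-de"]))
```

Because the wording is fixed ahead of time, the person on call only has to supply the scoped facts, which is where most of the time savings would come from.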

In cloud, you can also use updates to improve a notification iteratively. The first posting might have only simple details. The next update might include the actions that you’ve taken to diagnose the problem and might even scope the impact up or down (“now we’ve discovered that only the European region is affected”). A final update is, of course, needed to indicate that the incident is resolved and to list any actions that users might need to take. For example, users might need to restore from backup or use a different configuration.
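One way to picture this lifecycle is as a small data structure: a first posting, a running list of updates that can rescope the impact, and a final resolution. The sketch below is only an illustration; the `IncidentNotification` class and its field names are hypothetical and do not describe any real notification system.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNotification:
    """One published notification plus its running list of updates."""
    title: str
    regions: set
    updates: list = field(default_factory=list)
    resolved: bool = False

    def add_update(self, text, regions=None):
        """Append an update; pass regions to scope the impact up or down."""
        if self.resolved:
            raise ValueError("incident is already resolved")
        if regions is not None:
            self.regions = set(regions)
        self.updates.append(text)

    def resolve(self, text, user_actions=()):
        """Post the final update, listing any follow-up actions for users."""
        if user_actions:
            text += " Actions for users: " + "; ".join(user_actions)
        self.add_update(text)
        self.resolved = True


if __name__ == "__main__":
    note = IncidentNotification("API errors", {"us-south", "eu-de"})
    note.add_update("Only the European region is affected.", regions={"eu-de"})
    note.resolve("The incident is resolved.", user_actions=["restore from backup"])
    print(note.updates)
```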

How frequently? Preferably, post an update every hour so that your users don’t think you’ve gone “radio silent.” In other words, don’t make your users wonder whether a problem still exists or has been fixed. Otherwise, they might observe a different issue and wrongly assume that it’s related to the published incident.
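If you automate your notifications, the hourly cadence is easy to check in code. This is a rough sketch under the assumption that you record the timestamp of the last published update; the function names and the one-hour default are ours, chosen to match the guideline above.

```python
from datetime import datetime, timedelta, timezone

# Assumed cadence, taken from the "every hour" guideline.
UPDATE_INTERVAL = timedelta(hours=1)


def next_update_due(last_update, interval=UPDATE_INTERVAL):
    """Return the time by which the next update should be published."""
    return last_update + interval


def update_overdue(last_update, now=None, interval=UPDATE_INTERVAL):
    """Return True once a full interval has passed since the last update."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= next_update_due(last_update, interval)


if __name__ == "__main__":
    posted = datetime.now(timezone.utc) - timedelta(minutes=75)
    if update_overdue(posted):
        print("Time to post an update, even if it only says the work continues.")
```

A reminder like this could page the incident owner or prompt a bot, so that an open incident never sits without a fresh status.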

Automating posts to Slack, Twitter, or similar

Slack has become the de facto way that many DevOps teams drive their workflow and communications. It’s great for spurring quick discussions, and it’s also an easy place to get the word out. So we started experimenting with one product, IBM Cloud Kubernetes Service, to see which channels worked better than others. That team and its users were quick to ask for Slack.

For Slack, we’re trying out Slack bots to facilitate communication with the SREs in private channels. The same Slack bot can then also post official notifications in public channels. The following image shows the dutyshift bot being tested in the #test-dutyshift channel.

chatOps example: template command
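If you want to try automated posting yourself, one simple route is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is not how the dutyshift bot works (its implementation isn’t covered here); the helper names are hypothetical, and the webhook URL is a placeholder that you would replace with your own.

```python
import json
import urllib.request


def build_slack_payload(title, body):
    """Build the JSON body for a Slack incoming webhook message.

    Slack renders *asterisks* as bold, so the title stands out.
    """
    return {"text": f"*{title}*\n{body}"}


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    payload = build_slack_payload(
        "Incident update",
        "Only the European region is affected. Next update within the hour.",
    )
    print(json.dumps(payload, indent=2))
    # Placeholder URL (an assumption): uncomment after substituting your own webhook.
    # post_to_slack("https://hooks.slack.com/services/YOUR/WEBHOOK/PATH", payload)
```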

Credibility and authority for posting to social networks

Another thing we observed in support channels was a credibility problem. Social networks have a variety of people in them. Some of them are helpful experts, but they’re not the official SREs responsible for the product. So we’d see seemingly random people posting what looked like official notifications, even though they were not authorized to make public statements about incidents. Unsurprisingly, people in those channels can be a bit skeptical about the information they find on the Internet, so they start to doubt the credibility of the notification. Oops.

To address credibility issues, the people (and bots) who post the incident information must have a complete profile in Slack, Twitter, or the other network, as shown in this example:

  • Full name: Antonia Hernandez
  • Title understandable to users, for example IBM Site Reliability Engineering
  • Humans: Photo of the person (no avatars)
  • Bot: IBM team name and avatar are fine
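If posting is automated, the profile checklist above could be enforced before an account is allowed to publish. The following is a minimal sketch; the `PosterProfile` fields and the wording of the problems are assumptions for illustration and do not map to any real Slack or Twitter profile API.

```python
from dataclasses import dataclass


@dataclass
class PosterProfile:
    """Hypothetical view of an account that may post incident notifications."""
    display_name: str
    title: str
    is_bot: bool
    has_photo: bool = False   # a real photo of the person
    has_avatar: bool = False  # a team logo or other avatar


def profile_problems(profile):
    """Return a list of checklist items that the profile fails."""
    problems = []
    if not profile.display_name.strip():
        problems.append("missing full name or team name")
    if not profile.title.strip():
        problems.append("missing title that users can understand")
    if profile.is_bot:
        if not profile.has_avatar:
            problems.append("bots need a team avatar")
    elif not profile.has_photo:
        problems.append("humans need a real photo, not an avatar")
    return problems


if __name__ == "__main__":
    sre = PosterProfile(
        "Antonia Hernandez", "IBM Site Reliability Engineering",
        is_bot=False, has_photo=True,
    )
    print(profile_problems(sre) or "Profile is complete.")
```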

Product manager for IBM Cloud. Food and travel lover. Sometimes found on the water. Opinions are my own.