Managing SRE

for some of the world’s busiest systems

Ian Miell - Author, Docker in Practice

Story

  • How I got from this

[Image: stress]

Story

To this

[Image: chilled]

Why Give This Talk?

  • Blog post

  • Share experience - 'war story'

  • Real culture change

  • How it really got done

Questions

  • How many have been on call?

  • How many have been unsure whether to escalate? Felt stress?

  • Brought down a live system while trying to fix it?

  • Ever said the phrase 'we need more documentation'?

How We See Our Orgs

[Image: how you see your org]

How We See Other Organisations

[Image: how you see other orgs]

Context

  • How I got there

    • Dev, then tech lead, moved to on-call

    • Originally team of 5, grew to 10, later up to 50 worldwide

Where we were

  • 24/7

  • 1500+ Priority issues per year

  • Live code changes

Numbers

  • 22m bets in 24 hours (Grand National)

  • 60.5m account transactions

  • 173m account transactions over Cheltenham Festival

  • Thousands of substantive 3rd-line support issues per year

  • 6-50 staff handling follow-the-sun 24/7 escalations

Something had to give…

  • Issues kept rising

  • Adding more people was not working

  • Mental health

Trigger

  • Two books

    • The Checklist Manifesto

    • The Goal

The Checklist Manifesto

[Image: The Checklist Manifesto]

The Goal

[Image: The Goal]

Industries

  • Aviation

  • Construction

  • Medicine

  • Production

Key Ideas

  • PRACTICAL process automation (Checklist Manifesto)

    • HumanOps

  • Factory metaphor (The Goal)

    • SRE as knowledge factory

Step 1 - Investment

7 months' writing

[Image: count]

Step 2 - Documentation is not enough

Operational documentation must be:

  • Simple to find

  • Easy to follow

  • TRUSTED (cf. tests)

  • Deeper, more structured docs can live elsewhere

'We need more documentation'

  • No!

  • Need less and better!

  • Need to use it properly

  • Documentation is not an artifact

Simplicity

Step 3 - Process

  • Triage

  • Post-incident Review

What Happened Next

Chronology

  • 4 months in - nothing!

  • 7 months - widespread approval

  • 12 months - processes absorbed by team

    • team self-regulated (mostly)

Benefits (I)

  • Onboarding

    • Smoother on-ramp

  • Calmer escalation

Benefits (II)

  • Discipline

  • Training/knowledge gaps easier to identify

  • More time

Challenges

  • The 'documentation fairy'

    • 'Thanks, [Engineer name]. Fix it then. We should correct KBs as we go when we find something wrong.'

Challenges

  • Within Org

    • Documentation standards

  • Dislike of process

Lessons

  • Numbers

  • Culture

Numbers

  • Triage

    • 10% ongoing cost

    • Like docs, initial setup hard

  • Post-incident review

    • 5-10% ongoing cost

  • Overall effort on docs ~10-20%
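
To get a feel for what these percentages mean in practice, here is a rough sketch that turns them into hours per week. The team size (10) and working week (40 hours) are assumptions picked purely for illustration, and `weekly_hours` is a hypothetical helper, not something from the talk. Each figure is computed on its own; the slide does not say whether the docs figure includes the other two.

```python
# Illustrative only: TEAM_SIZE and HOURS_PER_WEEK are assumed values,
# not figures from the talk. Percentages are the ones on the slide.
TEAM_SIZE = 10        # assumption: engineers on the team
HOURS_PER_WEEK = 40   # assumption: working hours per engineer per week


def weekly_hours(team_size, hours_per_week, percent):
    """Team hours per week consumed by an activity costing `percent` of effort."""
    return team_size * hours_per_week * percent / 100


triage = weekly_hours(TEAM_SIZE, HOURS_PER_WEEK, 10)
review_low = weekly_hours(TEAM_SIZE, HOURS_PER_WEEK, 5)
review_high = weekly_hours(TEAM_SIZE, HOURS_PER_WEEK, 10)
docs_low = weekly_hours(TEAM_SIZE, HOURS_PER_WEEK, 10)
docs_high = weekly_hours(TEAM_SIZE, HOURS_PER_WEEK, 20)

print(f"Triage:               {triage:.0f} h/week")
print(f"Post-incident review: {review_low:.0f}-{review_high:.0f} h/week")
print(f"Docs overall:         {docs_low:.0f}-{docs_high:.0f} h/week")
```

Under those assumptions, triage alone is roughly one engineer's full week, every week, which is why the ongoing cost needs to be budgeted for rather than squeezed in.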

Culture

  • Changing culture is Hard

    • Carrot and stick needed (NHS)

    • Persistence - events will push you back

Thanks!