
Lessons We All Learn

  1. Email is the worst monitoring and alerting mechanism except for all the others.

  2. Your most critical services are kept alive by a handful of people whose job description does not mention those services at all.

  3. Most of your actual work is not covered by your OKRs.

  4. Absence of a signal is itself a signal.

  5. The severity of an incident is measured by the number of rules broken in resolving it.

  6. If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.

  7. There is no cloud, it's just someone else's computer.

  8. Serverless isn't.

  9. If you name "human error" as the root cause, you're doing it wrong.

  10. If you break it, you own it - for now. If you fix it, you own it forever.

  11. "Obsolete" doesn't mean it's not in use and relied on heavily.

  12. If you see a big-name company give a talk at a conference about some cool thing they made, it's probably already been abandoned by that company.

  13. "Prod" is just another name for "staging". In other words, you test in Prod.
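Item 4 above ("Absence of a signal is itself a signal") is often put into practice as a heartbeat or dead man's switch check: instead of waiting for an error to arrive, you alert when something that should be talking has gone quiet. What follows is only a minimal sketch of that idea. The function name `silent_sources`, the job names, and the 24-hour threshold are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def silent_sources(last_seen, now, max_silence):
    """Return the names of sources that have been quiet for longer than max_silence."""
    return sorted(
        name for name, seen_at in last_seen.items()
        if now - seen_at > max_silence
    )


# Hypothetical data: when each job last reported in.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "backup-job": now - timedelta(hours=26),
    "cert-renewal": now - timedelta(minutes=5),
}
print(silent_sources(last_seen, now, max_silence=timedelta(hours=24)))
# prints ['backup-job']
```

The point of the design is that the check runs on a schedule of its own, so a job that never starts still gets noticed.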

Testing in Prod

  1. Your infrastructure uses a lot more self-signed certificates than you think. A lot more. In places that make you weep.

  2. Self-signed certificates beget long-lived certs, which beget lack of certificate validity monitoring, which begets curl -k, which begets a lack of certificate deployment automation, which begets self-signed certificates.

  3. Containers create at least as many problems as they solve.

  4. Kubernetes creates problems that aren't even invented yet.

  5. The source you're looking at is not the code running in production.

  6. One in a million is next Tuesday.

  7. Two is one, and one is none.

  8. Very few operations are truly idempotent.

  9. "Asserting state" beats "monitoring for compliance" any day.

  10. Your network team has a way into the network that your security team doesn't know about.

  11. There are very few network restrictions that creative and determined use of SSH port forwarding can't overcome.

  12. It is tempting to jump right into implementing a solution when the right thing may well be to not do the thing that requires the solution in the first place.

  13. Turning things off permanently is surprisingly difficult.

  14. That "completely automated" solution you set up requires at least three manual steps you didn't document.

  15. Schrödinger's Backup -- "The condition of any backup is unknown until a restore is attempted." -- is overly optimistic. If you’ve never restored from a backup, you don’t actually have backups.

  16. If you’ve never failed over to another region, you don’t actually have failovers.

  17. If you’ve never rolled back a deploy, you don’t have a mature deploy pipeline.

  18. In any organization practicing continuous integration, half of all commits are to fake out CI tests.

  19. There's an xkcd for the precise situation you find yourself in.

  20. Eventual consistency doesn't help when the system you're debugging hasn't converged yet.

  21. Real change can only be implemented above layer 7.
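The cycle in item 2 above has one cheap place to break: certificate validity monitoring. As a rough sketch, assuming you only care about the expiry date, Python's standard library can read a certificate's `notAfter` field and turn it into "days left". The helper names `days_until_expiry` and `fetch_not_after` are illustrative, not taken from any particular tool, and the date used is invented.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by ssl.SSLSocket.getpeercert(),
    e.g. "Jun  1 12:00:00 2030 GMT". A negative result means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch notAfter from a live server, with verification left on."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline example with a made-up expiry date.
now = datetime(2030, 5, 2, 12, 0, 0, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 12:00:00 2030 GMT", now))
# prints 30
```

Because `fetch_not_after` uses the default context, a self-signed certificate that is not in the local trust store raises `ssl.SSLCertVerificationError` rather than being quietly accepted, which is the opposite of curl -k.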

Layer 8

  1. Any sufficiently successful product launch is indistinguishable from a DDoS; any sufficiently advanced user indistinguishable from an attacker.

  2. Your herculean efforts to upgrade the OS across your entire fleet completed just in time for the EOL announcement of the version you upgraded to.

  3. Doubling your time estimate in the hopes of beating expectations won't work because your manager takes your estimate, has a hearty laugh, and then resets it back to what they already promised up the chain.

  4. Management will always happily spend $$$ on outside consultants to tell them what you've been saying for years.

  5. Management would much rather invest in inventing a new, square wheel than in fixing an old round one.

  6. A: However well you understand concurrency, you only need one coworker who doesn't understand concurrency to make your life an unending hell. B: You always have at least one coworker who doesn't completely understand concurrency.

Note: Many of these came from Jan Schaumann's Tweets 👍.
