Almost all of these are now obvious to me in hindsight, but this would’ve been an invaluable read a couple of years ago.
- Distributed systems are different because they fail often
- Writing robust distributed systems costs more than writing robust single-machine systems
- Robust, open source distributed systems are much less common than robust, single-machine systems
- Coordination is very hard
- If you can fit your problem in memory, it’s probably trivial
- “It’s slow” is the hardest problem you’ll ever debug
- Implement backpressure throughout your system
- Find ways to be partially available
- Metrics are the only way to get your job done
- Use percentiles, not averages
- Learn to estimate your capacity
- Choose id spaces wisely
- This is probably the only point here that was non-obvious to me. Essentially, how your data is uniquely identified has implications for storage and retrieval.
Consider version 1 of the Twitter API. All operations to get, create, and delete tweets were done with respect to a single numeric id for each tweet. The tweet id is a simple 64-bit number that is not connected to any other piece of data. As the number of tweets goes up, it becomes clear that creating user tweet timelines and the timeline of other user’s subscriptions may be efficiently constructed if all of the tweets by the same user were stored on the same machine. But the public API requires every tweet be addressable by just the tweet id. To partition tweets by user, a lookup service would have to be constructed. One that knows what user owns which tweet id. Doable, if necessary, but with a non-trivial cost. An alternative API could have required the user id in any tweet look up and, initially, simply used the tweet id for storage until user-partitioned storage came online. Another alternative would have included the user id in the tweet id itself at the cost of tweet ids no longer being k-sortable and numeric.
- Exploit data-locality
Networks have more failures and more latency than pointer dereferences and fread(3).
- Writing cached data back to persistent storage is bad
- Computers can do more than you think they can
- Use the CAP theorem to critique systems
- Extract services