- A lot of great advice; Designing Data-Intensive Applications covers a number of these topics and goes into a lot more detail.
You are lucky if 99.999% of the time network is not a problem.
- Are networks really reliable?
- Networking outages might be affecting more cases even though not all events are making much noise. Cloud customers don’t necessarily have visibility into their problems either.
- Databases might advertise themselves as ACID but might still have different interpretations in edge cases or how they handle “unlikely” events.
Each database has different consistency and isolation capabilities.
- Consistency and isolation are expensive capabilities. They require coordination and are increasing contention in order to keep data consistent.
- When having to horizontally scale among data centers (especially among different geographic regions), the problems become significantly harder.
- The SQL standard only defines four isolation levels even though there are more levels theoretically and practically available.
Clock skews happen between any clock sources
- The most well-hidden secret in computing is that all time APIs lie. Our machines don’t accurately know what the current time is. Our computers all contain a quartz crystal that produces a signal to tick time. But quartz crystals can’t accurately tick and drift in time, either faster or slower than the actual clock. Drift could be up to 20 seconds a day. The time on our computers need to be synchronized by the actual time every now and then for accuracy.
- Atomic and GPS clocks are better sources to determine the current time but they are expensive and need complicated setup that they cannot be installed on every machine. Given the limitations, in data centers, a multi- tiered approach is used. While atomic and/or GPS clocks are providing accurate timing, their time is broadcasted to the rest of the machines via secondary servers. This means every machine will be drifted from the actual current time with some magnitude.
- Google’s TrueTime is following a different approach here.
- TrueTime uses two different sources: GPS and atomic clocks
- TrueTime has an unconventional API. It returns the time as an interval.
- This method adds some latency to the system especially when the uncertainty advertised by masters are high but provides correctness even in a globally distributed situation.
AUTOINCREMENTing can be harmful
- In distributed database systems, auto-incrementing is a hard problem.
- Some databases have partitioning algorithms based on primary keys. Sequential IDs may cause unpredictable hotspots and may overwhelm some partitions while others stay idle.