The intake system needs to be simple to ensure reliability.
They scaled Kafka by partitioning intelligently, but eventually outgrew this approach; they're working on more dynamic partitioning.
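A minimal sketch of what "partitioning intelligently" might mean here (my assumption; the field names and partition count are hypothetical): hash each series' identity so all points for one series land on the same partition.

```python
import hashlib

NUM_PARTITIONS = 128  # hypothetical

def partition_for(customer_id: str, metric_name: str) -> int:
    """Stable hash of the series identity -> partition index."""
    key = f"{customer_id}:{metric_name}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

A static scheme like this is also the kind you outgrow: a hot customer or metric pins load to one partition, which is presumably why they're moving toward dynamic partitioning.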
An individual metric point is typically an 8-byte float. Total intake is estimated at 10TB/day.
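Back-of-envelope check on what that implies (my arithmetic, using decimal TB and ignoring timestamps, tags, and encoding overhead):

```python
# Throughput implied by 10TB/day of 8-byte values.
bytes_per_day = 10 * 10**12
points_per_sec = bytes_per_day / 8 / 86_400
print(f"{points_per_sec:,.0f} points/sec")  # ~14.5 million
```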
Settled on local SSDs (+S3) over EBS. No built-in snapshots, but much cheaper and faster.
Hybrid data storage: recent data lives on lower-latency storage; older data moves to cheaper, higher-latency options.
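A minimal sketch of recency-based tiering (the tier names and cutoffs are my assumptions, not their actual architecture):

```python
import time

HOT_WINDOW = 24 * 3600           # hypothetical: last day stays in memory
WARM_WINDOW = 30 * 24 * 3600     # hypothetical: last month on local SSD

def storage_tier(point_timestamp: float, now: float = None) -> str:
    """Pick a storage backend based on how old the data point is."""
    age = (now or time.time()) - point_timestamp
    if age <= HOT_WINDOW:
        return "memory"      # lowest latency, most expensive
    if age <= WARM_WINDOW:
        return "local-ssd"
    return "s3"              # cheapest, highest latency
```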
Use latency beacons to figure out how much lag exists in the intake pipeline. This granularity is sufficient…more sophisticated coordination options aren't worth the complexity.
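A minimal sketch of the beacon idea, assuming the simplest possible implementation: periodically write a message carrying the current time into the pipeline; when it comes out the other end, `now - sent_at` is the lag.

```python
import json
import time

def make_beacon() -> bytes:
    """Timestamped marker message, injected at the front of the pipeline."""
    return json.dumps({"type": "beacon", "sent_at": time.time()}).encode()

def observe_beacon(payload: bytes) -> float:
    """Pipeline lag in seconds, measured when the beacon is consumed."""
    msg = json.loads(payload)
    return time.time() - msg["sent_at"]
```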
Sketches are data structures used to analyze streams of data in real time.
E.g. HyperLogLog, Bloom filters
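For concreteness, a toy Bloom filter, which illustrates the trade sketches make: constant memory and fast updates in exchange for approximate answers (here, possible false positives but no false negatives). Not production code.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two base hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```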
Used to cache data for percentile/distribution queries…these would otherwise need to touch the entire dataset at read time.
Eventually rolled their own data structure that performs dynamic bucketing and essentially (as far as I can tell) increments the right bucket's counter on each new data point.
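My guess at the mechanism, as a sketch: logarithmically sized buckets with a counter each, so inserting a point just ticks its bucket, and a quantile is read back by walking the cumulative counts. The bucketing scheme and names below are mine, not theirs.

```python
import math
from collections import defaultdict

class LogBucketSketch:
    def __init__(self, relative_accuracy: float = 0.01):
        # Bucket boundaries grow geometrically by gamma, which bounds
        # the relative error of quantile answers.
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.counts = defaultdict(int)
        self.total = 0

    def insert(self, value: float):
        assert value > 0, "sketch shown for positive values only"
        bucket = math.ceil(math.log(value, self.gamma))
        self.counts[bucket] += 1  # "tick the right bucket's counter"
        self.total += 1

    def quantile(self, q: float) -> float:
        """Walk buckets in order until the cumulative count passes rank q."""
        rank = q * (self.total - 1)
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen > rank:
                return self.gamma ** bucket  # bucket's upper bound
        raise ValueError("empty sketch")
```

The appeal versus exact percentiles: memory is proportional to the number of occupied buckets rather than the number of points, and inserts are O(1), which is what makes it viable at intake-stream volumes.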