Building a Real Time Metrics Database at Datadog

Very similar architecture to Heap's ingest/analytics stack
Multiple Kafka consumers, each writing to a different data store

  • The intake system needs to be simple to ensure reliability.

  • Scaling Kafka by partitioning intelligently, but they eventually outgrew this; working on more dynamic partitioning.

  • An individual metric is typically an 8-byte float. Estimated at 10TB/day.

  • Settled on local SSDs (+S3) over EBS. No built in snapshots, but much cheaper and faster.

  • Hybrid data storage: different options (based on latency) depending on recency of the data.

  • Use latency beacons to figure out how much lag exists in the intake pipeline. This granularity is sufficient…more sophisticated coordination options aren't worth the complexity.

  • Sketches are data structures used to analyze streams of data in real time.

    • Eg. HyperLogLog, Bloom filters

    • Used to cache data for percentile/distribution queries…these would otherwise need to touch the entire dataset at read time.

    • Eventually rolled their own DS that performs dynamic bucketing and essentially (as far as I can tell) ticks the right bucket's counter on each new data point.

Performance considerations