When a Microsecond Is an Eternity: High Performance Trading Systems in C++

  • This was a window into a completely different world; it’s almost unbelievable how different engineering priorities are with a HFT system.
  • The speaker works at a “market maker” HFT firm, which seeks to make a profit by buying and selling very quickly, while trying to maintain the “buy low, sell high” mantra. They don’t attempt to build a position or anything like that.
  • “Low latency, not high throughput”; they aren’t trying to squeeze a lot of stuff through their systems. Instead, they have a “hot path” that’s executed ~0.1% of the time, but when it executes, it has to be fast.
  • How fast? Typical wire-to-wire latencies are ~2.5μs (!) for the hot path.
  • Measurement is key, and far more important than memorizing the entire C++ spec.
  • You need to know what your compiler is doing so you can work with it + help it along now and then. Tools like https://godbolt.org/ are very useful here.
  • Hyperthreading can actually slow down a single-threaded workload because the caches (starting at L1) are shared between the hyperthreads.
  • “I saved 100-200ns” ← by changing this code (with a lot of branches) to this code (with far fewer branches)
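The talk didn't show reusable code, but the branch-reduction idea can be sketched as replacing a chain of conditionals with a table lookup. Everything below (the `side_multiplier` functions, the `'B'`/`'S'` encoding) is an illustrative assumption, not the speaker's actual code:

```cpp
#include <array>

// Branchy version: unpredictable branches on the hot path.
int side_multiplier_branchy(char side) {
    if (side == 'B') return 1;
    if (side == 'S') return -1;
    return 0;
}

// Branch-free version: one table lookup, no conditional jumps.
// The table is built once at compile time (C++17 constexpr).
constexpr std::array<int, 256> make_side_table() {
    std::array<int, 256> t{};  // all entries default to 0
    t['B'] = 1;
    t['S'] = -1;
    return t;
}
constexpr auto kSideTable = make_side_table();

int side_multiplier(char side) {
    return kSideTable[static_cast<unsigned char>(side)];
}
```

The lookup costs one (likely cached) load instead of a mispredictable branch, which is where savings on the order of the quoted 100–200ns tend to come from.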
  • Allocations are expensive; preallocate a pool of objects and reuse these.
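A minimal sketch of the preallocation idea, assuming a hypothetical `Order` type (real pools typically use a free list rather than a linear scan):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hot-path object; the real types are firm-specific.
struct Order { int id = 0; double price = 0.0; bool in_use = false; };

// Fixed-size pool: every Order is allocated once, up front.
// The hot path only ever reuses these slots -- no malloc/new.
class OrderPool {
public:
    explicit OrderPool(std::size_t n) : slots_(n) {}

    Order* acquire() {
        for (auto& o : slots_)          // linear scan keeps the sketch short
            if (!o.in_use) { o.in_use = true; return &o; }
        return nullptr;                 // pool exhausted; still no allocation
    }

    void release(Order* o) { o->in_use = false; }

private:
    std::vector<Order> slots_;          // one allocation at startup
};
```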
  • Exceptions in C++ are relatively fast in the context of an exceptional condition (1-2μs), but abysmal as a control flow primitive.
  • Use compile-time polymorphism (templates) to remove conditionals; this makes the hardware branch predictor happy.
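One way this can look in practice (the `Side`/`aggress_price` names are illustrative, not from the talk): hoist the condition into a template parameter, so each instantiation is straight-line code with no branch to predict.

```cpp
enum class Side { Buy, Sell };

// The branch is resolved at compile time (C++17 `if constexpr`), so
// aggress_price<Side::Buy> and aggress_price<Side::Sell> each compile
// down to branch-free code.
template <Side S>
double aggress_price(double best_bid, double best_ask) {
    if constexpr (S == Side::Buy)
        return best_ask;   // buying crosses the ask
    else
        return best_bid;   // selling hits the bid
}
```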
  • “Multithreading is slow and is best avoided” 😯
  • “Google benchmark” is a good tool for microbenchmarks
  • A hash map with a contiguous bucket array is a good data structure for in-memory lookups. (My notes said “OrderedMap”, but that isn’t a standard C++ container; the structure described below is a hash map.)
    • Linked lists (and other node-based containers) are not guaranteed to be allocated in contiguous memory, which is terrible for caching.
    • The map uses a contiguous sequence of buckets (so loading these is cache-friendly) which point directly to the data you’re trying to look up.
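A toy open-addressing version of that idea, to make the cache argument concrete (this is my sketch, not the speaker's implementation, and it assumes the map never fills up):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy open-addressing hash map: all buckets live in one contiguous
// array, so a probe sequence walks adjacent memory. Node-based maps
// chase pointers to scattered heap locations instead.
class FlatMap {
public:
    explicit FlatMap(std::size_t capacity) : buckets_(capacity) {}

    void insert(std::uint64_t key, double value) {
        // Linear probing; assumes capacity is never exhausted.
        for (std::size_t i = slot(key);; i = (i + 1) % buckets_.size()) {
            if (!buckets_[i].used || buckets_[i].key == key) {
                buckets_[i] = {key, value, true};
                return;
            }
        }
    }

    std::optional<double> find(std::uint64_t key) const {
        for (std::size_t i = slot(key);; i = (i + 1) % buckets_.size()) {
            if (!buckets_[i].used) return std::nullopt;  // empty slot ends probe
            if (buckets_[i].key == key) return buckets_[i].value;
        }
    }

private:
    struct Bucket { std::uint64_t key = 0; double value = 0; bool used = false; };
    std::size_t slot(std::uint64_t key) const { return key % buckets_.size(); }
    std::vector<Bucket> buckets_;
};
```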
  • Inlining is not a silver bullet. Specifically, it can pollute the instruction cache and cause misses.
  • Optimize memory layout so pieces of data that need to be loaded adjacently can be fetched in a single cache line.
    • This is non-optimal from the point of view of software engineering / data modelling principles.
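A sketch of the trade-off, with made-up field names: the “well-modelled” struct interleaves hot and cold fields, while the hot-path struct packs the fields read together into a single 64-byte cache line.

```cpp
#include <cstdint>

// "Good modelling" layout: hot and cold fields interleaved, so reading
// the three hot fields can touch multiple cache lines.
struct InstrumentMixed {
    double       best_bid;       // hot
    char         name[64];       // cold: rarely read
    double       best_ask;       // hot
    char         exchange[64];   // cold
    std::int64_t position;       // hot
};

// Hot-path layout: the three hot fields (24 bytes) sit together, and the
// struct is padded/aligned to one 64-byte cache line, so the hot path
// fetches everything it needs with a single line fill.
struct alignas(64) InstrumentHot {
    double       best_bid;
    double       best_ask;
    std::int64_t position;
};
static_assert(sizeof(InstrumentHot) == 64, "one cache line");
```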
  • The hot path runs infrequently, so for a given instance of the hot path executing, the cache has probably been trampled by other things since the last time the hot path executed.
    • They have a very clever approach to fix this.
    • Essentially, they simulate the hot path even when the hot path isn’t actually meant to run.
    • This keeps the caches warm, so when the hot path actually executes, it isn’t starting off with stale caches.
    • This simulation can go as far as sending the data to the network card but not letting it propagate further; this is pretty much an industry-standard practice, so network cards actually have hardware support for this!
    • (This is nuts)
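The warming pattern above might be sketched like this (entirely my reconstruction: the `Mode`/`Network`/`hot_path` names are hypothetical, and a real system relies on the NIC to discard the warm-mode packet rather than an `if`):

```cpp
// Run the full hot path on every event; only occasionally do real work.
// Both modes execute the same instructions and touch the same data, so
// the code and data stay resident in cache for the rare live run.
enum class Mode { Warm, Live };

struct Network {
    int sent = 0;
    void send(Mode mode) {
        // In production the packet can go all the way to the NIC and be
        // dropped pre-wire in warm mode; here we simply skip the last step.
        if (mode == Mode::Live) ++sent;
    }
};

void hot_path(Network& net, Mode mode) {
    // ... identical pricing / decision logic runs in both modes ...
    net.send(mode);
}
```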
  • You need to profile (measure different code paths in your program relative to one another) and benchmark (wire-to-wire execution time).
  • The speaker’s most effective setup uses actual production servers in a lab, with a high-precision switch that appends timestamps to the Ethernet-level packets. 🤯
  • glibc is sometimes an unexpected source of jitter.
    • In particular, std::pow can vary from ~50ns to ~500ms for almost trivially different inputs.
  • The most effective way to ensure that the cache isn’t trampled by things that are not the hot path is to turn off all cores but one, and run a single thread on that core.
    • The speaker actually does this, but I’m not sure how that one thread manages its I/O workload.
  • Profile-guided optimization allows you to record your program’s execution in production, and pass the recording as input to your compiler, which can then generate a better-optimized version of the same program.
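With GCC, the PGO workflow looks roughly like the following (the program name and `--replay` flag are placeholders; Clang has an equivalent via `-fprofile-instr-generate`):

```shell
# 1. Build an instrumented binary that records execution profiles.
g++ -O2 -fprofile-generate trading.cpp -o trading

# 2. Run it on a representative workload to produce .gcda profile files.
./trading --replay captured_market_data

# 3. Rebuild using the recorded profile; the compiler optimizes hot paths
#    and branch layout based on what actually ran.
g++ -O2 -fprofile-use trading.cpp -o trading
```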