When a Microsecond Is an Eternity: High Performance Trading Systems in C++
This talk was a window into a completely different world; it’s almost unbelievable how different the engineering priorities are in an HFT system.
The speaker works at a “market maker” HFT firm, which seeks to make a profit by buying and selling very quickly, while trying to maintain the “buy low, sell high” mantra. They don’t attempt to build a position or anything like that.
“Low latency, not high throughput”; they aren’t trying to squeeze a lot of stuff through their systems. Instead, they have a “hot path” that’s executed ~0.1% of the time, but when it executes, it has to be fast.
How fast? Typical wire-to-wire latencies are ~2.5μs (!) for the hot path.
Measurement is key, and far more important than memorizing the entire C++ spec.
You need to know what your compiler is doing so you can work with it + help it along now and then. Tools like https://godbolt.org/ are very useful here.
Hyperthreading can actually slow down a single-threaded workload because the caches (starting at L1) are shared between the hyperthreads.
“I saved 100-200ns” ← by changing this code (with a lot of branches) to this code (with far fewer branches)
Allocations are expensive; preallocate a pool of objects and reuse these.
Exceptions in C++ are relatively fast in the context of an exceptional condition (1-2μs), but abysmal as a control flow primitive.
Use polymorphism + templating to remove conditionals; this makes the hardware branch predictor happy.
“Multithreading is slow and is best avoided” 😯
“Google benchmark” is a good tool for microbenchmarks
A bucket-based hash map (along the lines of C++’s `std::unordered_map`) is a good data structure for in-memory lookups.
Linked-list nodes are not guaranteed to be contiguous in memory, which is terrible for caching.
The hash map uses a contiguous sequence of buckets (so loading these is cache-friendly) which point directly to the data you’re trying to look up.
Inlining is not a silver bullet. Specifically, it can pollute the instruction cache and cause misses.
Optimize memory layout so pieces of data that need to be loaded adjacently can be fetched in a single cache line.
This is non-optimal from the point of view of software engineering / data modelling principles.
The hot path runs infrequently, so for a given instance of the hot path executing, the cache has probably been trampled by other things since the last time the hot path executed.
They have a very clever approach to fix this.
Essentially, they simulate the hot path even when the hot path isn’t actually meant to run.
This keeps the caches warm, so when the hot path actually executes, it isn’t starting off with stale caches.
This simulation can go as far as sending the data to the network card but not letting it propagate further; this is pretty much an industry-standard practice, so network cards actually have hardware support for this!
(This is nuts)
You need to profile (measure different code paths in your program relative to one another) and benchmark (wire-to-wire execution time).
The speaker’s most effective setup uses actual production servers in a lab, with a high-precision switch that appends timestamps to the Ethernet-level packets. 🤯
glibc is sometimes an unexpected source of jitter.
In particular, std::pow can vary from ~50ns to ~500ms for almost trivially different inputs.
The most effective way to ensure that the cache isn’t trampled by things that are not the hot path is to turn off all cores but one, and run a single thread on that core.
The speaker actually does this, but I’m not sure how that one thread manages its I/O workload.
Profile-guided optimization allows you to record your program’s execution in production, and pass the recording as input to your compiler, which can then generate a better-optimized version of the same program.
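With GCC this is a two-phase build; the sketch below is illustrative (file names are made up), and Clang has an analogous `-fprofile-instr-generate` / `-fprofile-instr-use` workflow.

```shell
# 1. Build an instrumented binary that records execution counts.
g++ -O2 -fprofile-generate trading.cpp -o trading_instrumented

# 2. Run it on representative, production-like input; this writes
#    .gcda profile files next to the objects.
./trading_instrumented < recorded_market_data.bin

# 3. Rebuild using the recorded profile; the compiler now knows which
#    branches are hot and lays out code and inlining decisions accordingly.
g++ -O2 -fprofile-use trading.cpp -o trading
```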