# When a Microsecond Is an Eternity: High Performance Trading Systems in C++

• This was a window into a completely different world; it’s almost unbelievable how different the engineering priorities are in an HFT system.
• The speaker works at a “market maker” HFT firm, which seeks to make a profit by buying and selling very quickly, while trying to maintain the “buy low, sell high” mantra. They don’t attempt to build a position or anything like that.
• “Low latency, not high throughput”; they aren’t trying to squeeze a lot of stuff through their systems. Instead, they have a “hot path” that’s executed ~0.1% of the time, but when it executes, it has to be fast.
• How fast? Typical wire-to-wire latencies are ~2.5μs (!) for the hot path.
• Measurement is key, and far more important than memorizing the entire C++ spec.
• You need to know what your compiler is doing so you can work with it + help it along now and then. Tools like https://godbolt.org/ are very useful here.
• “I saved 100-200ns” by replacing branch-heavy code with an equivalent version containing far fewer branches (shown as before/after snippets on the slides).
• Allocations are expensive; preallocate a pool of objects and reuse these.
• Exceptions in C++ are relatively fast in the context of an exceptional condition (1-2μs), but abysmal as a control flow primitive.
• Use compile-time polymorphism (templates) in place of conditionals and virtual dispatch; removing branches makes the hardware branch predictor happy.
• “Multithreading is slow and is best avoided” 😯
• “Google benchmark” is a good tool for microbenchmarks
• A hash map with contiguous bucket storage is a good data structure for in-memory lookups (as opposed to node-based trees like `std::map`; C++ has no standard “OrderedMap”).
• Linked lists are not guaranteed to be allocated in contiguous memory which is terrible for caching.
• The map uses a contiguous sequence of buckets (so loading these is cache-friendly) which point directly to the data you’re trying to look up.
• Inlining is not a silver bullet. Specifically, it can pollute the instruction cache and cause misses.
• Optimize memory layout so pieces of data that need to be loaded adjacently can be fetched in a single cache line.
• This is non-optimal from the point of view of software engineering / data modelling principles.
• The hot path runs infrequently, so by the time a given instance of it executes, the caches have probably been trampled by everything else that ran since the last execution.
• They have a very clever approach to fix this.
• Essentially, they simulate the hot path even when the hot path isn’t actually meant to run.
• This keeps the caches warm, so when the hot path actually executes, it isn’t starting off with stale caches.
• This simulation can go as far as sending the data to the network card but not letting it propagate further; this is pretty much an industry-standard practice, so network cards actually have hardware support for this!
• (This is nuts)
• You need to profile (measure different code paths in your program relative to one another) and benchmark (wire-to-wire execution time).
• The speaker’s most effective setup uses actual production servers in a lab, with a high-precision switch that appends timestamps to the Ethernet-level packets. 🤯
• glibc is sometimes an unexpected source of jitter.
• In particular, std::pow can vary from ~50ns to ~500ms for almost trivially different inputs.
• The most effective way to ensure that the cache isn’t trampled by things that are not the hot path is to turn off all cores but one, and run a single thread on that core.
• The speaker actually does this, but I’m not sure how that one thread manages its I/O workload.
• Profile-guided optimization allows you to record your program’s execution in production, and pass the recording as input to your compiler, which can then generate a better-optimized version of the same program.