Concurrency in Go (Book)

Katherine Cox-Buday

Summary

Ultimately this book laid out a lot more custom patterns (cough..boilerplate..cough) to get seemingly simple things done concurrently in Go. The last chapter was a good read, diving into scheduler internals. Everything leading up to it was a lot more mundane, unfortunately.

Highlights

Chapter 1. An Introduction to Concurrency

Moore’s Law, Web Scale, and the Mess We’re In

This looked like a clever way to solve the bounding problems of Moore’s law, but computer scientists soon found themselves facing down the limits of another law: Amdahl’s law, named after computer architect Gene Amdahl. Amdahl’s law describes a way in which to model the potential performance gains from implementing the solution to a problem in a parallel manner. Simply put, it states that the gains are bounded by how much of the program must be written in a sequential manner.

And so it is in this world of multiple cores, cloud computing, web scale, and problems that may or may not be parallelizable that we find the modern developer, maybe a bit overwhelmed.

Race Conditions

Concurrent code is notoriously difficult to get right. It usually takes a few iterations to get it working as expected, and even then it’s not uncommon for bugs to exist in code for years before some change in timing (heavier disk utilization, more users logged into the system, etc.) causes a previously undiscovered bug to rear its head…

A race condition occurs when two or more operations must execute in the correct order, but the program has not been written so that this order is guaranteed to be maintained.

Most of the time, this shows up in what’s called a data race, where one concurrent operation attempts to read a variable while at some undetermined time another con‐ current operation is attempting to write to the same variable.

Why Is Concurrency Hard?

Atomicity

The first thing that’s very important is the word “context.” Something may be atomic in one context, but not another.

Memory Access Synchronization

In fact, there’s a name for a section of your program that needs exclusive access to a shared resource. This is called a critical section.

Deadlocks, Livelocks, and Starvation

A deadlocked program is one in which all concurrent processes are waiting on one another. In this state, the program will never recover without outside intervention.

It turns out there are a few conditions that must be present for deadlocks to arise, and in 1971, Edgar Coff‐ man enumerated these conditions in a paper. The conditions are now known as the Coffman Conditions and are the basis for techniques that help detect, prevent, and correct deadlocks.

Livelocks are programs that are actively performing concurrent operations, but these operations do nothing to move the state of the program forward.

Have you ever been in a hallway walking toward another person? She moves to one side to let you pass, but you’ve just done the same. So you move to the other side, but she’s also done the same. Imagine this going on forever, and you understand livelocks.

In my opinion, livelocks are more difficult to spot than deadlocks simply because it can appear as if the program is doing work.

Because synchro‐ nizing access to the memory is expensive, it might be advantageous to broaden our lock beyond our critical sections. On the other hand, by doing so—as we saw—we run the risk of starving other concurrent processes.

Simplicity in the Face of Complexity

As of Go 1.8, garbage collection pauses are generally between 10 and 100 microseconds!

For example, say you write a web server, and you’d like every connection accepted to be handled concurrently with every other connection. In some languages, before your web server begins accepting connections, you’d likely have to create a collection of threads, commonly called a thread pool, and then map incoming connections onto threads. Then, within each thread you’ve created, you’d need to loop over all the con‐ nections on that thread to ensure they were all receiving some CPU time. In addition, you’d have to write your connection-handling logic to be pausable so that it shares fairly with the other connections. Whew! In contrast, in Go you would write a function and then prepend its invocation with the go keyword. The runtime handles everything else we discussed automati‐ cally!

Chapter 2. Modeling Your Code: Communicating Sequential Processes

The Difference Between Concurrency and Parallelism

The first is that we do not write parallel code, only concurrent code that we hope will be run in parallel.

What Is CSP?

But first, let’s take a closer look at where Go got a lot of its ideas—the paper at the root of Go’s concur‐ rency primitives: Tony Hoare’s seminal paper, “Communicating Sequential Processes.”

How This Helps You

This is achieved by a promise Go makes to us: that goroutines are lightweight, and we normally won’t have to worry about creating one. There are appropriate times to con‐ sider how many goroutines are running in your system, but doing so upfront is soundly a premature optimization.

Go’s Philosophy on Concurrency

It is therefore completely understandable to be confused as to why the Go team chose to expose memory access synchronization primitives at all. What may be even more confusing is that you’ll see synchronization primitives commonly out in the wild, see people complain about overuse of channels, and also hear some of the Go team mem‐ bers stating that it’s OK to use them.

Remember the key word here is internal. If you find yourself exposing locks beyond a type, this should raise a red flag. Try to keep the locks constrained to a small lexical scope.

If you find yourself strug‐ gling to understand how your concurrent code works, why a deadlock or race is occurring, and you’re using primitives, this is probably a good indicator that you should switch to channels.

Isn’t it possible for (unbuffered) channels to deadlock too?

This is because channels use memory access synchronization to operate, therefore they can only be slower.

Chapter 3. Go’s Concurrency Building Blocks

Goroutines

They’re not OS threads, and they’re not exactly green threads—threads that are managed by a language’s runtime—they’re a higher level of abstraction known as coroutines.

This doesn’t sound any different than goroutines. How are green threads different?

Go follows a model of concurrency called the fork-join model.

It turns out that goroutines execute within the same address space they were created in, and so our program prints out the word “welcome.”

A newly minted goroutine is given a few kilobytes, which is almost always enough. When it isn’t, the run-time grows (and shrinks) the memory for storing the stack auto‐ matically, allowing many goroutines to live in a modest amount of memory. The CPU overhead averages about three cheap instructions per function call. It is practical to create hundreds of thousands of goroutines in the same address space. If goroutines were just threads, system resources would run out at a much smaller number.

But before we do, we have to cover one interesting thing about goroutines: the garbage collector does nothing to collect goroutines that have been abandoned some‐ how.

Something that might dampen our spirits is context switching, which is when some‐ thing hosting a concurrent process must save its state to switch to running a different concurrent process. If we have too many concurrent processes, we can spend all of our CPU time context switching between them and never get any real work done. At the OS level, with threads, this can be quite costly. The OS thread must save things like register values, lookup tables, and memory maps to successfully be able to switch back to the current thread when it is time. Then it has to load the same information for the incoming thread.

225 ns per context switch, wow! That’s 0.225 μs, or 92% faster than an OS context switch on my machine, which if you recall took 1.467 μs.

The sync Package - WaitGroup

WaitGroup is a great way to wait for a set of concurrent operations to complete when you either don’t care about the result of the concurrent operation, or you have other means of collecting their results.

The sync Package - Cond

This is important because a sig‐ nal on the condition doesn’t necessarily mean what you’ve been waiting for has occurred—only that something has occurred.

This sounds like a very weak guarantee!

Broadcast is arguably the more interesting of the two methods as it provides a way to communicate with multiple goroutines at once. We can trivially reproduce Signal with channels (as we’ll see in the section

“Channels” on page 64), but reproducing the behavior of repeated calls to Broadcast would be more difficult.

Channels

However, there is one thing to be wary of when determining whether or not you should utilize a Pool: if the code that utilizes the Pool requires things that are not roughly homogenous, you may spend more time converting what you’ve retrieved from the Pool than it would have taken to just instantiate it in the first place.

To declare a unidirectional channel, you’ll simply include the <- operator. To both declare and instantiate a channel that can only read, place the <- operator on the left‐ hand side, like so:

And to declare and create a channel that can only send, you place the <- operator on the righthand side, like so:

You don’t often see unidirectional channels instantiated, but you’ll often see them used as function parameters and return types, which is very useful, as we’ll see. This is possible because Go will implicitly convert bidirectional channels to unidirectional channels when needed…

This example works because channels in Go are said to be blocking.

This can cause deadlocks if you don’t structure your program correctly.

To close a channel, we use the close keyword, like so:

Notice that we never placed anything on this channel; we closed it immediately. We were still able to perform a read operation, and in fact, we could continue performing reads on this channel indefinitely despite the channel remaining closed.

The second value returned—here stored in the ok variable—is false, indicating that the value we received is the zero value for int, or 0, and not a value placed on the stream.

The range keyword—used in conjunction with the for statement—supports channels as arguments, and will automatically break the loop when a channel is closed.

Closing a channel is also one of the ways you can signal multiple goroutines simulta‐ neously. If you have n goroutines waiting on a single channel, instead of writing n times to the channel to unblock each goroutine, you can simply close the channel.

somewhat interesting because it means that the goroutine that instantiates a channel controls whether it’s buffered. This suggests that the creation of a channel should probably be tightly coupled to goroutines that will be performing writes on it so that we can reason about its behavior and performance more easily…

Buffered channels can be useful in certain situations, but you should create them with care. As we’ll see in the next chapter, buffered channels can easily become a prema‐ ture optimization and also hide deadlocks by making them more unlikely to happen.

This is an example of an optimization that can be useful under the right conditions: if a goroutine making writes to a channel has knowledge of how many writes it will make, it can be useful to create a buffered channel whose capacity is the number of writes to be made, and then make those writes as quickly as possible. There are, of course, caveats, and we’ll cover them in the next chapter.

The first thing we should do to put channels in the right context is to assign channel ownership. I’ll define ownership as being a goroutine that instantiates, writes, and closes a channel.

Unidirectional channel declarations are the tool that will allow us to distinguish between goroutines that own channels and those that only utilize them: channel owners have a write-access view into the channel (chan or chan<-), and channel utilizers only …

The select Statement

I highly encourage you to do what you can in your programs to keep the scope of channel ownership small so that these things remain obvious. If you have a channel as a member variable of a struct with numerous methods on it, it’s going to quickly become unclear how the channel will behave.

The Go runtime will perform a pseudo- random uniform selection over the set of case statements.

When more than one case is “ready”

If there’s nothing useful you can do when all the channels are blocked, but you also can’t block forever, you may want to time out. Go’s time package pro‐ vides an elegant way to do this with channels that fits nicely within the paradigm of select statements.

Time.after

Like case statements, the select state‐ ment also allows for a default clause in case you’d like to do something if all the channels you’re selecting against are blocking.

Conclusion

In the runtime package, there is a function called GOMAXPROCS.

In my opinion, the name is misleading: people often think this function relates to the number of logical processors on the host machine—and in a roundabout way it does—but really this function controls the number of OS threads that will host so-called “work queues.”

Chapter 4. Concurrency Patterns in Go

Confinement

Some languages support utilization of point‐ ers with explicitly immutable values; however, Go is not among these.

Lexical confinement involves using lexical scope to expose only the correct data and concurrency primitives for multiple concurrent processes to use.

Preventing Goroutine Leaks

The way to successfully mitigate this is to establish a signal between the parent gorou‐ tine and its children that allows the parent to signal cancellation to its children. By convention, this signal is usually a read-only channel named done.

The or-channel

Now that we know how to ensure goroutines don’t leak, we can stipulate a conven‐ tion: If a goroutine is responsible for creating a goroutine, it is also responsible for ensur‐ ing it can stop the goroutine.

This is a fairly concise function that enables you to combine any number of channels together into a single channel that will close as soon as any of its component channels are closed, or written to.

Doesn’t this leak all other goroutines?

Pipelines

Notice how each stage is taking a slice of data and returning a slice of data? These stages are performing what we call batch processing. This just means that they operate on chunks of data all at once instead of one discrete value at a time. There is another type of pipeline stage that performs stream processing. This means that the stage receives and emits one element at a time.

Best Practices for Constructing Pipelines

So in a nutshell, the generator function converts a discrete set of values into a stream of data on a channel. Aptly, this type of function is called a generator.

Some Handy Generators

for num := range take(done, repeat(done, 1), 10) { fmt.Printf("%v “, num) }

This syntax is awkward! Is there a better way, and are any of these part of the stdlib?

Empty interfaces are a bit taboo in Go, but for pipeline stages it is my opinion that it’s OK to deal in channels of interface{} so that you can use a standard library of pipe‐ line patterns.

Fan-Out, Fan-In

Fan-out is a term to describe the process of starting multiple goroutines to handle input from the pipeline, and fan-in is a term to describe the process of combining multiple results into one channel.

Queuing

In this way, the true utility of queues is to decouple stages so that the runtime of one stage has no impact on the runtime of another. Decoupling stages in this manner then cascades to alter the runtime behavior of the system as a whole, which can be either good or bad depending on your system.

If the efficiency of the pipeline drops below a certain critical threshold, the systems upstream from the pipeline begin increasing their inputs into the pipeline, which causes the pipeline to lose more efficiency, and the death-spiral begins.

Wouldn’t upstream systems just block, causing requests to be dropped?

The context Package

It turns out that the need to wrap a done channel with this information is very com‐ mon in systems of any size, and so the Go authors decided to create a standard pat‐ tern for doing so. It started out as an experiment that lived outside the standard library, but in Go 1.7, the context package was brought into the standard library, making this a standard Go idiom to consider when working with concurrent code.

But the Value method looks a little out of place. What’s it for? The Go authors noticed that one of the primary uses of goroutines was programs that serviced requests. Usually in these programs, request-specific information needs to be passed along in addition to information about preemption. This is the purpose of the Value function.

If you look at the methods on the Context interface, you’ll see that there’s nothing present that can mutate the state of the underlying structure.

This raises a question: if a Context is immutable, how do we affect the behavior of cancellations in functions below a current function in the call stack?

This is where the functions in the context package become important.

If your function needs to cancel functions below it in the call-graph in some manner, it will call one of these functions and pass in the Context it was given, and then pass the Context returned into its children. If your function doesn’t need to modify the cancellation behavior, the function simply passes on the Context it was given.

At each stack-frame, a function can affect the entirety of the call stack below it.

The context package is pretty neat, but it hasn’t been uniformly lauded. Within the Go community, the context package has been somewhat controversial. The cancella‐ tion aspect of the package has been pretty well received, but the ability to store arbi‐ trary data in a Context, and the type-unsafe manner in which the data is stored, have caused some divisiveness.

Chapter 5. Concurrency at Scale

Error Propagation

Let’s spend some time here discussing a philosophy of error propagation. What follows is an opinionated framework for handling errors in concurrent systems.

First, let’s create an error type that can contain all of the aspects of a well-formed error we’ve discussed:

Does one not exist in the stdlib?

Timeouts and Cancellation

If you take this line of reasoning to its logical conclusion, we see that we must do two things: define the period within which our concurrent process is preemptable, and ensure that any functionality that takes more time than this period is itself preempta‐ ble. An easy way to do this is to break up the pieces of your goroutine into smaller pieces. You should aim for all nonpreemptable atomic operations to complete in less time than the period you’ve deemed acceptable.

What are the implications for separation of concerns here? Does all cpu-bound code need to accept a channel?

Heartbeats

There are a few different reasons heartbeats are interesting for concurrent code. They allow us insights into our system, and they can make testing the system deterministic when it might otherwise not be.

In practice, how likely is it that heartbeats will fail to be delivered because of preemption delays?

Rate Limiting

Most rate limiting is done by utilizing an algorithm called the token bucket.

Chapter 6. Goroutines and the Go Runtime

Work Stealing

As we discussed in the sections “How This Helps You” on page 29 and “Goroutines” on page 37, Go will handle multiplexing goroutines onto OS threads for you. The algorithm it uses to do this is known as a work stealing strategy.

First, let’s look at a naive strategy for sharing work across many processors, some‐ thing called fair scheduling. In an effort to ensure all processors were equally utilized, we could evenly distribute the load between all available processors. Imagine there are n processors and x tasks to perform. In the fair scheduling strategy, each pro‐ cessor would get x/n tasks:

Unfortunately, there are problems with this approach. If you remember from the sec‐ tion “Goroutines” on page 37, Go models concurrency using a fork-join model. In a fork-join paradigm, tasks are likely dependent on one another, and it turns out naively splitting them among processors will likely cause one of the processors to be underutilized. Not only that, but it can also lead to poor cache locality as tasks that require the same data are scheduled on other processors.

OK, these sound like basic load-balancing problems that maybe a FIFO queue can help with, so let’s try that: work tasks get scheduled into the queue, and our process‐ ors dequeue tasks as they have capacity, or block on joins. This is the first type of work stealing algorithm we’ll look at. Does this solve the problem? The answer is maybe. It’s better than simply dividing the tasks among the processors because it solves the problem with underutilized processors, but we’ve now intro‐ duced a centralized data structure that all the processors must use.

Forks are when goroutines are started, and join points are when two or more goroutines are

synchronized through channels or types in the sync package. The work stealing algo‐ rithm follows a few basic rules. Given a thread of execution: 1. At a fork point, add tasks to the tail of the deque associated with the thread. 2. If the thread is idle, steal work from the head of deque associated with some other random thread. 3. At a join point that cannot be realized yet (i.e., the goroutine it is synchronized with has not completed yet), pop work off the tail of the thread’s own deque. 4. If the thread’s deque is empty, either: a. Stall at a join. b. Steal work from the head of a random thread’s associated deque.

Stealing Tasks or Continuations?

Recall that a thread of execution both pushes and (when necessary) pops from the tail of its work deque. The work sitting on the tail of its deque has a couple of interesting properties: It’s the work most likely needed to complete the parent’s join. Completing joins more quickly means our program is likely to perform better, and also keep fewer things in memory. It’s the work most likely to still be in our processor’s cache. Since it’s the work the thread was last working on prior to its current work, it’s likely that this information remains in the cache of the CPU the thread is execut‐ ing on. This means fewer cache misses!

In Go, goroutines are tasks. Everything after a goroutine is called is the continuation.

Go’s work-stealing algorithm enqueues and steals continuations.

All things considered, stealing continuations are considered to be theoretically supe‐ rior to stealing tasks, and therefore it is best to queue the continuation and not the

goroutine.

Unbounded Out of Order Stalling

Why?

So why don’t all work-stealing algorithms implement continuation stealing? Well, continuation stealing usually requires support from the compiler. Luckily, Go has its own compiler, and continuation stealing is how Go’s work-stealing algorithm is implemented. Languages that don’t have this luxury usually implement task, or so- called “child,” stealing as a library.

Before we analyze those, let’s set the stage by starting to use the Go scheduler’s nomenclature as laid out in the source code. Go’s scheduler has three main concepts: G A goroutine. M An OS thread (also referenced as a machine in the source code). P A context (also referenced as a processor in the source code).

In Go’s runtime, Ms are started, which then host Ps, which then schedule and host Gs:

I bring this up because there is one very important guarantee in the runtime: there will always be at least enough OS threads available to handle hosting every context.

When the goroutine eventually becomes unblocked, the host OS thread attempts to steal back a context from one of the other OS threads so that it can continue execut‐ ing the previously blocked goroutine.

How much of this is controlled by the scheduler v/s the kernel?

In this case, the thread will place its goroutine on a global context, the thread will go to sleep, and it will be put into the runtime’s thread pool for future use (for instance, if a goroutine becomes blocked again).

Should this be: “the scheduler will place the thread’s goroutine”?

The global context we just mentioned doesn’t fit into our prior discussions of abstract work-stealing algorithms. It’s an implementation detail that is necessitated by how Go is optimizing CPU utilization. To ensure that goroutines placed into the global con‐ text aren’t there perpetually, a few extra steps are added into the work-stealing algo‐ rithm. Periodically, a context will check the global context to see if there are any goroutines there, and when a context’s queue is empty, it will first check the global context for work to steal before checking other OS threads’ contexts.

Why a separate global context instead of (say) randomly picking a work queue?

Other than input/output and system calls, Go also allows goroutines to be preempted during any function call.

Appendix A - Race Detection

In Go 1.1, a -race flag was added as a flag for most go commands:

Edit