Simple generic parallelism idiom & C++17 specifics

If you programmed something and it turns out to be too slow, often you’d like to farm out the work to multiple threads. At this point you typically have a few choices:

  1. use some kind of producer/consumer pipe
  2. half-ass it by dividing the work into n chunks and launch n threads and hope for the best
  3. rely on one of the magic parallelization solutions (#pragma omp)
  4. use (newfangled) language features

All of these options are painful to some extent. Writing a high quality producer/consumer pipe is non-trivial, and you are likely to get it wrong. Dividing up work into n chunks only works if each chunk will (more or less) take the same amount of time to run, which you often can’t know.

Magic parallelisation techniques meanwhile may well do what you want, but often also do not. By virtue of their magic, it is hard to tell what is going on.

Finally, some languages have pretty neat primitives for running stuff in parallel. These may however not be universally available. For C++17, it turns out that at least on my computer (gcc version 11.2.0 Ubuntu 11.2.0-7ubuntu2), it works, but only if you hold it just right.

In this post I want to share an extremely simple compiler and language independent idiom that I chanced on a few months ago. It is so simple that once you see it you may wonder why it is even worth blogging about, but the thing is, I would’ve loved to have read about this one neat trick years ago.

At the end of this page, I’ll also show how to achieve the same thing using C++17 primitives, and what you need to do to make it work in practice.

The problem

  vector<Thing> input;

  // gather things ..            

  for(const auto& i : input) {
    doStuff(i);
  }

Here we have a whole bunch of Things that need stuff done to them. As it stands this code is as good as it gets on a single thread. If we now find that doStuff() is too slow, we’d like to run the loop in parallel.

In C++17 using a modern compiler, this is a oneliner (for_each(execution::par, ...), but it comes with some caveats we’ll get to at the end of this page.

Here is a universally applicable (and near-trivial) technique that will work in any language that supports atomics and threads (source):

  vector<Thing> input;

  // gather things ..            

  atomic<size_t> ctr = 0;

  auto worker = [&ctr, &input]() { 
    for(size_t n = ctr++; n < input.size(); n = ctr++)
      doStuff(input.at(n));
  };

In the piece above we define a lambda function called worker. In there, a loop gets ‘work item numbers’ from an atomic counter (ctr), and checks if this number is smaller than the size of the input vector. If so, it does stuff to the item n, and otherwise the work is done and the thread exits.

The key for line is worth some special attention. The first component and last components are identical, unlike a regular for loop in C(++). Both serve the same purpose: get the next number from ctr. If you type this in by hand, chances are enormous that you start with n = 0 and that breaks everything subtly, so do pay attention!

Next up, here is how the threads are fired up in modern C++:

  vector<thread> workers;
  for(int n=0; n < 4; ++n)  // number of threads
    workers.emplace_back(worker);
  
  for(auto& w : workers)
    w.join();

And that’s it! It is not quite a one liner, but the key concept in the for-loop is, and the rest is just mechanics. But without any further effort, this has divided the work fairly over multiple threads, even if some items require more work than others.

Note that this snippet launches 4 threads, so you have full control (and responsibility) over that number.

The modern C++17 solution

The C++ 2017 specification includes a lot of parallel algorithms, and these appear to be pretty well thought out. Even though it is 2022 as I write this, not all deployed compilers support this newfangled stuff, or support it particularly well. In addition, you may be constrained in what compiler versions you can use.

If it works, the equivalent of the idiom described above is:

std::for_each(std::execution::par, input.cbegin(), input.cend(),
                doStuff);

For Linux and gcc, out of the box, your parallelized algorithm code will often compile and run, but it might not actually use multiple threads (!) until you install the OneTBB Threading Building Blocks and recompile and relink with -ltbb. This is quite surprising behaviour.

The OneTBB library itself (GitHub page) has very interesting capabilities, so if you have to depend on it anyway, be sure to check out what else it has to offer. The license is Apache 2.0.

C++17 offers many parallelized algorithms, but note that these may not magically speedup all your code. Often your execution speed will be limited by RAM access latency and multiple threads only make that worse. Nevertheless, I’ve experienced amazing speedups with parallelized sort, for example. Finding out if this helps is as easy as adding std::execution::par to a call:

sort(execution::par, input.begin(), input.end());

But do test - if anything is off, your code will end up running single threaded anyhow. As an example, today in testing, I found out that std::for_each(execution::par) on a std::map will not run in parallel.

The source code to the examples above can be found here.

Summarising

Using a simple atomic counter and a well-crafted for-loop, it is possible to farm out work to multiple threads without too much hassle. This technique is obvious enough that I see it as an idiom. In addition, if you use C++17 and configure your compiler and linker correctly, you may be able to benefit from parallel algorithms provided by the language. This turns the simple construct above into a one-liner.