For a simple multi-core device, we could either use a low-level coarse-grained
threading API, such as Win32 or POSIX threads, or use a data-parallel model such
as OpenMP. Writing a coarse-grained multithreaded version of the same function
would require dividing the work (i.e., loop iterations) between the threads. Because
there may be a large number of loop iterations and the work per iteration is small, we
would need to chunk the loop iterations into a larger granularity (a technique called
strip mining; Cooper and Torczon, 2011).
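
As a rough illustration of this chunking, the following sketch divides the iterations of a loop among a small number of threads, giving each thread one contiguous strip. It assumes the function in question is an element-wise multiply of two arrays; the names mul_chunk and mul_threaded, the use of C++ std::thread in place of raw Win32 or POSIX threads, and the one-strip-per-thread policy are all choices made for this example rather than anything prescribed by the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-element work (an assumption for this sketch):
// c[i] = a[i] * b[i] for every i in the half-open range [begin, end).
void mul_chunk(const float* a, const float* b, float* c,
               std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        c[i] = a[i] * b[i];
    }
}

// Divide n loop iterations into one contiguous strip per thread, so each
// thread does a large block of work instead of a single tiny iteration.
void mul_threaded(const float* a, const float* b, float* c,
                  std::size_t n, unsigned num_threads) {
    assert(num_threads > 0);
    // Ceiling division so that the strips together cover all n iterations.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no iterations left for further threads
        workers.emplace_back(mul_chunk, a, b, c, begin, end);
    }
    // Wait for every strip to finish before returning.
    for (auto& w : workers) w.join();
}
```

A call such as mul_threaded(a, b, c, n, 4) would run up to four strips concurrently. With a data-parallel model such as OpenMP, the same loop could instead be annotated with a parallel-for directive, leaving the division into strips to the runtime.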