The following discussion presents three versions of a function that performs
an element-wise vector addition: a serial C implementation, a threaded C
implementation, and an OpenCL implementation.
The serial C implementation of the vector addition executes a loop with as many
iterations as there are elements to compute. Each loop iteration adds the elements
at corresponding locations in the two input arrays and stores the result at the same
location in the output array:
// Perform an element-wise addition of A and B and store in C.
// There are N elements per array.