Function std.parallelism.TaskPool.reduce.reduce
Parallel reduce on a random access range. Except as otherwise noted,
usage is similar to reduce
. There is
also fold
which does the same thing with a different parameter
order.
auto auto reduce(Args...)
(
Args args
);
This function works by splitting the range to be reduced into work units, which are slices to be reduced in parallel. Once the results from all work units are computed, a final serial reduction is performed on these results to compute the final answer. Therefore, care must be taken to choose the seed value appropriately.
Because the reduction is being performed in parallel, functions
must be associative. For notational simplicity, let # be an
infix operator representing functions
. Then, (a # b) # c must equal
a # (b # c). Floating point addition is not associative
even though addition in exact arithmetic is. Summing floating
point numbers using this function may give different results than summing
serially. However, for many practical purposes floating point addition
can be treated as associative.
Note that, since functions
are assumed to be associative,
additional optimizations are made to the serial portion of the reduction
algorithm. These take advantage of the instruction level parallelism of
modern CPUs, in addition to the thread-level parallelism that the rest
of this module exploits. This can lead to better than linear speedups
relative to reduce
, especially for
fine-grained benchmarks like dot products.
An explicit seed may be provided as the first argument. If
provided, it is used as the seed for all work units and for the final
reduction of results from all work units. Therefore, if it is not the
identity value for the operation being performed, results may differ
from those generated by reduce
or
depending on how many work units are used. The next argument must be
the range to be reduced.
// Find the sum of squares of a range in parallel, using
// an explicit seed.
//
// Timings on an Athlon 64 X2 dual core machine:
//
// Parallel reduce: 72 milliseconds
// Using std.algorithm.reduce instead: 181 milliseconds
auto nums = iota(10_000_000.0f);
auto sumSquares = taskPool .reduce!"a + b"(
0.0, std .algorithm .map!"a * a"(nums)
);
If no explicit seed is provided, the first element of each work unit is used as a seed. For the final reduction, the result from the first work unit is used as the seed.
// Find the sum of a range in parallel, using the first
// element of each work unit as the seed.
auto sum = taskPool .reduce!"a + b"(nums);
An explicit work unit size may be specified as the last argument.
Specifying too small a work unit size will effectively serialize the
reduction, as the final reduction of the result of each work unit will
dominate computation time. If TaskPool
for this instance
is zero, this parameter is ignored and one work unit is used.
// Use a work unit size of 100.
auto sum2 = taskPool .reduce!"a + b"(nums, 100);
// Work unit size of 100 and explicit seed.
auto sum3 = taskPool .reduce!"a + b"(0.0, nums, 100);
Parallel reduce supports multiple functions, like
std
.
// Find both the min and max of nums.
auto minMax = taskPool .reduce!(min, max)(nums);
assert(minMax[0] == reduce!min(nums));
assert(minMax[1] == reduce!max(nums));
Exception Handling:
After this function is finished executing, any exceptions thrown
are chained together via Throwable
and rethrown. The chaining
order is non-deterministic.
See Also
fold
is functionally equivalent to reduce
except the
range parameter comes first and there is no need to use
tuple
for multiple seeds.