Abstract.
In multithreaded execution
on shared-memory multiprocessors, the neural computation algorithms are partitioned into a
number P of threads that run concurrently on P processors. Different grains of
multithreaded parallelization of the neural computation are obtained depending on which
loop of the algorithm is distributed over the threads. These grains correspond to the
different degrees of parallelism exploited in the training and usage phases of the neural
network (neuron parallelism and training-set parallelism), as sketched below.
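
As an illustration only (the abstract gives no code), a minimal C/OpenMP sketch of the two
grains follows; the names forward_neuron_parallel, forward_set_parallel, N, M, and T are
hypothetical.

    #include <omp.h>

    /* Hypothetical sketch (not the paper's code): the same forward pass at
       two grains. N = neurons, M = inputs per neuron, T = training patterns. */

    /* Neuron parallelism: the loop over neurons is distributed over threads. */
    void forward_neuron_parallel(int N, int M, const double w[N][M],
                                 const double x[M], double y[N])
    {
        #pragma omp parallel for        /* one chunk of neurons per thread */
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j < M; j++)
                s += w[i][j] * x[j];
            y[i] = s;
        }
    }

    /* Training-set parallelism: the loop over patterns is distributed instead. */
    void forward_set_parallel(int T, int N, int M, const double w[N][M],
                              const double x[T][M], double y[T][N])
    {
        #pragma omp parallel for        /* one chunk of patterns per thread */
        for (int p = 0; p < T; p++)
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < M; j++)
                    s += w[i][j] * x[p][j];
                y[p][i] = s;
            }
    }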
Analysis of the algorithms reveals several types of data dependencies, each requiring
additional operations for its removal. A data dependency carried across consecutive steps
occurs when a step requires all elements of an array (vector or matrix) that was computed
by multiple threads in the previous step. This dependency is removed by inserting
synchronization points, which force all threads to wait for the completion of the previous
step.
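
A minimal sketch of such a synchronization point, assuming a two-layer forward pass in
C/OpenMP (all names hypothetical); the implicit barrier at the end of the first
worksharing loop plays the role of the synchronization point:

    #include <math.h>

    /* Hypothetical sketch: step 1 fills the hidden vector h[] across all
       threads; step 2 reads the whole h[], so the implicit barrier at the
       end of the first worksharing loop acts as the synchronization point. */
    void two_step_forward(int NH, int NO, int M,
                          const double w[NH][M], const double v[NO][NH],
                          const double x[M], double h[NH], double o[NO])
    {
        #pragma omp parallel
        {
            #pragma omp for             /* step 1: each thread computes part of h[] */
            for (int i = 0; i < NH; i++) {
                double s = 0.0;
                for (int j = 0; j < M; j++)
                    s += w[i][j] * x[j];
                h[i] = tanh(s);
            }
            /* implicit barrier: no thread starts step 2 before h[] is complete */
            #pragma omp for             /* step 2: every o[i] reads all of h[] */
            for (int i = 0; i < NO; i++) {
                double s = 0.0;
                for (int j = 0; j < NH; j++)
                    s += v[i][j] * h[j];
                o[i] = tanh(s);
            }
        }
    }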
A reduction into a shared variable (such as the error measure) is parallelized by
accumulating partial results in thread-local variables and then updating the shared
variable under critical-section protection.
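
A sketch of this reduction pattern under the same assumptions (an OpenMP reduction clause
would achieve the same effect; the explicit form mirrors the description above):

    /* Hypothetical sketch: squared-error reduction over the training set.
       Each thread accumulates a local partial sum; the shared variable is
       updated once per thread inside a critical section. */
    double total_error(int T, int NO, const double o[T][NO],
                       const double tgt[T][NO])
    {
        double err = 0.0;               /* shared error measure */
        #pragma omp parallel
        {
            double local = 0.0;         /* thread-local partial result */
            #pragma omp for
            for (int p = 0; p < T; p++)
                for (int i = 0; i < NO; i++) {
                    double d = o[p][i] - tgt[p][i];
                    local += d * d;
                }
            #pragma omp critical        /* protected update of the shared variable */
            err += local;
        }
        return err;
    }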
A reduction into variables that cannot be shared (such as the weight matrices in
training-set parallelism) requires adding an extra dimension to these variables, yielding
global shared arrays. Each thread owns and computes one element of such an array, while
every thread can read the whole array.
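
A sketch of the extra-dimension technique under the same assumptions: dw gains a leading
thread index, each thread accumulates only its own slice, and after a barrier every thread
reads all slices (the names and the separately precomputed grad array are hypothetical):

    #include <omp.h>

    /* Hypothetical sketch: dw gains a leading thread index, so each thread
       owns and writes only its slice dw[tid], while every thread may read
       the whole array. Caller zero-initializes dw; grad holds per-pattern
       weight gradients (precomputed here only to keep the sketch short). */
    void accumulate_and_apply(int P, int T, int N, int M,
                              double dw[P][N][M],
                              const double grad[T][N][M],
                              double w[N][M], double eta)
    {
        #pragma omp parallel num_threads(P)   /* assumes P threads are granted */
        {
            int tid = omp_get_thread_num();
            #pragma omp for                   /* training set split over threads */
            for (int p = 0; p < T; p++)
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < M; j++)
                        dw[tid][i][j] += grad[p][i][j];
            /* implicit barrier, then all slices are read to update w */
            #pragma omp for
            for (int i = 0; i < N; i++)
                for (int j = 0; j < M; j++) {
                    double s = 0.0;
                    for (int k = 0; k < P; k++)
                        s += dw[k][i][j];
                    w[i][j] -= eta * s;
                }
        }
    }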
For each parallel implementation of neural network training and usage, the theoretical
speedup, efficiency, and scalability function are computed, in order to determine whether
parallel processing offers the desired performance improvement and to guide the choice of
implementation strategy.
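
For reference, the standard definitions presumably behind this analysis (the abstract does
not state the cost model; T(P) denotes execution time on P processors) are:

    % Assumed standard definitions, not quoted from the paper:
    % speedup and efficiency on P processors, with T(P) the execution time.
    S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}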
The estimated and measured results show that training speed is significantly increased by
this approach on shared-memory systems, an important result in the present context of
widely available multiprocessor workstations.