Abstract.
In multithreaded execution
on shared-memory multiprocessors, the neural computation algorithms are partitioned into a
number P of threads that run concurrently on P processors. Different grains of
multithreaded parallelization of the neural computation are obtained depending on which
loop of the algorithm is distributed over the threads. These grains correspond to the
different degrees of parallelism exploited in the training and usage phases of the neural
network (neuron parallelism and training-set parallelism), as sketched below.
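
As an illustration only (the abstract gives no code), a minimal C/OpenMP sketch of the two
grains follows; the names forward_neuron_parallel, forward_set_parallel, N, M, and T are
hypothetical.

    #include <omp.h>

    /* Hypothetical sketch (not the paper's code): the same forward pass at
       two grains. N = neurons, M = inputs per neuron, T = training patterns. */

    /* Neuron parallelism: the loop over neurons is distributed over threads. */
    void forward_neuron_parallel(int N, int M, const double w[N][M],
                                 const double x[M], double y[N])
    {
        #pragma omp parallel for        /* one chunk of neurons per thread */
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j < M; j++)
                s += w[i][j] * x[j];
            y[i] = s;
        }
    }

    /* Training-set parallelism: the loop over patterns is distributed instead. */
    void forward_set_parallel(int T, int N, int M, const double w[N][M],
                              const double x[T][M], double y[T][N])
    {
        #pragma omp parallel for        /* one chunk of patterns per thread */
        for (int p = 0; p < T; p++)
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < M; j++)
                    s += w[i][j] * x[p][j];
                y[p][i] = s;
            }
    }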
Analysis of the algorithms reveals several types of data dependencies, each requiring
additional operations for its removal. A data dependency carried across consecutive steps
occurs when a step requires all elements of an array (vector or matrix) that was computed
by multiple threads in the previous step. This dependency is removed by inserting
synchronization points, which force all threads to wait for the completion of the previous
step.
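
A minimal sketch of such a synchronization point, assuming a two-layer forward pass in
C/OpenMP (all names hypothetical); the implicit barrier at the end of the first
worksharing loop plays the role of the synchronization point:

    #include <math.h>

    /* Hypothetical sketch: step 1 fills the hidden vector h[] across all
       threads; step 2 reads the whole h[], so the implicit barrier at the
       end of the first worksharing loop acts as the synchronization point. */
    void two_step_forward(int NH, int NO, int M,
                          const double w[NH][M], const double v[NO][NH],
                          const double x[M], double h[NH], double o[NO])
    {
        #pragma omp parallel
        {
            #pragma omp for             /* step 1: each thread computes part of h[] */
            for (int i = 0; i < NH; i++) {
                double s = 0.0;
                for (int j = 0; j < M; j++)
                    s += w[i][j] * x[j];
                h[i] = tanh(s);
            }
            /* implicit barrier: no thread starts step 2 before h[] is complete */
            #pragma omp for             /* step 2: every o[i] reads all of h[] */
            for (int i = 0; i < NO; i++) {
                double s = 0.0;
                for (int j = 0; j < NH; j++)
                    s += v[i][j] * h[j];
                o[i] = tanh(s);
            }
        }
    }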
A reduction into a shared variable (such as the error measure) is parallelized by
accumulating partial results in thread-local variables and then updating the shared
variable under critical-section protection.
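
A sketch of this reduction pattern under the same assumptions (an OpenMP reduction clause
would achieve the same effect; the explicit form mirrors the description above):

    /* Hypothetical sketch: squared-error reduction over the training set.
       Each thread accumulates a local partial sum; the shared variable is
       updated once per thread inside a critical section. */
    double total_error(int T, int NO, const double o[T][NO],
                       const double tgt[T][NO])
    {
        double err = 0.0;               /* shared error measure */
        #pragma omp parallel
        {
            double local = 0.0;         /* thread-local partial result */
            #pragma omp for
            for (int p = 0; p < T; p++)
                for (int i = 0; i < NO; i++) {
                    double d = o[p][i] - tgt[p][i];
                    local += d * d;
                }
            #pragma omp critical        /* protected update of the shared variable */
            err += local;
        }
        return err;
    }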
A reduction into variables that cannot be shared (such as the weight matrices in
training-set parallelism) requires adding an extra dimension to these variables, yielding
global shared arrays. Each thread owns and computes one element of such an array, while
every thread can read the whole array.
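
A sketch of the extra-dimension technique under the same assumptions: dw gains a leading
thread index, each thread accumulates only its own slice, and after a barrier every thread
reads all slices (the names and the separately precomputed grad array are hypothetical):

    #include <omp.h>

    /* Hypothetical sketch: dw gains a leading thread index, so each thread
       owns and writes only its slice dw[tid], while every thread may read
       the whole array. Caller zero-initializes dw; grad holds per-pattern
       weight gradients (precomputed here only to keep the sketch short). */
    void accumulate_and_apply(int P, int T, int N, int M,
                              double dw[P][N][M],
                              const double grad[T][N][M],
                              double w[N][M], double eta)
    {
        #pragma omp parallel num_threads(P)   /* assumes P threads are granted */
        {
            int tid = omp_get_thread_num();
            #pragma omp for                   /* training set split over threads */
            for (int p = 0; p < T; p++)
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < M; j++)
                        dw[tid][i][j] += grad[p][i][j];
            /* implicit barrier, then all slices are read to update w */
            #pragma omp for
            for (int i = 0; i < N; i++)
                for (int j = 0; j < M; j++) {
                    double s = 0.0;
                    for (int k = 0; k < P; k++)
                        s += dw[k][i][j];
                    w[i][j] -= eta * s;
                }
        }
    }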
For each parallel implementation of neural network training and usage, the theoretical
speedup, efficiency, and scalability function are computed, in order to determine whether
parallel processing offers the desired performance improvement and to guide the choice of
implementation strategy.
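
For reference, the standard definitions presumably behind this analysis (the abstract does
not state the cost model; T(P) denotes execution time on P processors) are:

    % Assumed standard definitions, not quoted from the paper:
    % speedup and efficiency on P processors, with T(P) the execution time.
    S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}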
The estimated and measured results show that training speed is significantly increased by
this approach on shared-memory systems, an important result in the present context of
widely available multiprocessor workstations.