A parallel program is designed to be run by several processors at the same time. Its advantage is that it can be much faster than a non-parallel equivalent (up to 1,000 times).
R++ focuses on three sources of parallelism that are available on current office computers: multi-core processors, the graphics card (GPU), and the cloud.
The languages and architectures we tested are: R with the MICE package, single-core C, C on the graphics card (CUDA), and multi-core C (6, 8, 10 and 12 cores).
The processors we used are:
For the tests, matrices of different sizes were studied: the number of variables ranges from 100 to 1,000 and the number of observations for each variable ranges from 1,000 to 1,000,000.
For each matrix size, we generated 10% missing values and fixed the number of imputations at 5.
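To make the test setup concrete, here is a minimal sketch of how such a matrix could be generated. It is only an illustration: the function name `make_test_matrix`, the standard normal values, and the choice to remove cells completely at random are assumptions, since the text does not say how the missing values were produced.

```python
import random

def make_test_matrix(n_obs, n_vars, missing_rate=0.10, seed=0):
    """Build an n_obs x n_vars matrix and blank out a fixed share of cells.

    Assumptions (not from the original study): values are standard normal
    draws and the missing cells are chosen completely at random.
    """
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]

    # Pick exactly missing_rate * (number of cells) distinct cells to remove.
    n_missing = round(missing_rate * n_obs * n_vars)
    for cell in rng.sample(range(n_obs * n_vars), n_missing):
        row, col = divmod(cell, n_vars)
        matrix[row][col] = None  # None marks a missing value
    return matrix

# Small example; the study itself goes up to 1,000 variables
# and 1,000,000 observations.
small = make_test_matrix(n_obs=200, n_vars=10)
```

Sampling distinct cell indices, rather than testing each cell against a probability, gives exactly 10% missing values for every matrix size, which keeps the sizes comparable.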
As part of his final-year internship, Chai Anchen compared the efficiency of the bootstrap when using the different languages and architectures.
Multiple imputation is a method that makes it possible to perform statistical analyses on incomplete data sets without underestimating the variance. The principle is:
Then the statistical analysis can be conducted with the completed data set.
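The general idea of multiple imputation can be sketched on a single variable. This is a generic textbook illustration, not the procedure used in the study and not what the MICE package does internally: the helper names `impute_once` and `pooled_mean` are hypothetical, and missing values are simply replaced by random draws from the observed values, which is a simplified stand-in for a proper imputation model. The pooling step follows Rubin's rules, where the total variance adds the between-imputation variance to the average within-imputation variance.

```python
import random
import statistics

def impute_once(values, rng):
    """Return a completed copy: each missing value (None) is replaced
    by a random draw from the observed values (simplified stand-in)."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

def pooled_mean(values, m=5, seed=0):
    """Estimate the mean from m completed data sets and pool the results."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(statistics.fmean(completed))
        # Variance of the sample mean within this completed data set.
        variances.append(statistics.variance(completed) / len(completed))

    q_bar = statistics.fmean(estimates)      # pooled estimate
    within = statistics.fmean(variances)     # average within-imputation variance
    between = statistics.variance(estimates) # between-imputation variance
    total = within + (1 + 1 / m) * between   # Rubin's total variance
    return q_bar, total

data = [1.0, 2.0, None, 4.0, None, 3.0]
estimate, total_variance = pooled_mean(data, m=5)
```

The between-imputation term is what prevents the variance from being underestimated: a single imputation would treat the filled-in values as if they had really been observed.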
Unsurprisingly, C is faster than R.
The difference between the GPU and multi-core versions is less clear-cut. Both outperform the single-core CPU version, but only above a certain volume of data. Below that threshold, splitting up the data and communicating with the graphics card and the cores slow down the entire operation.
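The threshold effect can be illustrated with a very simple cost model. This is a sketch under stated assumptions, not a measurement: it supposes that the sequential time grows linearly with the data size, that the parallel version divides this work perfectly among the workers, and that it pays a fixed overhead for splitting the data and communicating. All names and numeric values below are hypothetical.

```python
def sequential_time(n, cost_per_item):
    """Time of the single-core version under a linear cost model."""
    return n * cost_per_item

def parallel_time(n, cost_per_item, workers, overhead):
    """Time of the parallel version: fixed overhead plus evenly shared work."""
    return overhead + n * cost_per_item / workers

def break_even_size(cost_per_item, workers, overhead):
    """Data size at which both versions take the same time."""
    if workers <= 1:
        raise ValueError("a parallel version needs more than one worker")
    return overhead / (cost_per_item * (1 - 1 / workers))

# Hypothetical values: 1 microsecond per item, 8 workers, 50 ms of overhead.
n_star = break_even_size(cost_per_item=1e-6, workers=8, overhead=0.05)
```

Below `n_star` the fixed overhead dominates and the sequential version wins; above it the shared work dominates and the parallel version wins.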
In the end: