R++ is high-performance statistical analysis software. Simple, fast, efficient. Bringing statistics within everyone's reach.

 

Parallelism

Definition: Parallelism

A parallel program is designed to be run by several processors at the same time. Its advantage is that it can be much faster than a non-parallel equivalent (up to 1000 times).

R++ focuses on three sources of parallelism available on current office computers: multi-core processors, graphics cards and the cloud.

 

  • The multi-core solution consists of using all the CPU cores of a computer simultaneously.
  • Graphics cards (GPUs) are massively parallel electronic circuits with thousands of processors. They have very little memory, which can be a handicap in general but matters little for mathematical calculations such as solving linear systems.
  • Finally, the cloud is a network of connected computers.
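
As an illustration of the multi-core idea, here is a minimal Python sketch (not R++ code): the data are split into chunks, each chunk is processed by a separate worker process, and the partial results are combined. The function names and the sum-of-squares workload are assumptions chosen for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if k < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done independently on one chunk (hypothetical workload)."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Process each chunk in its own worker process, then combine."""
    chunks = split_into_chunks(values, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

When run as a script, `parallel_sum_of_squares(list(range(1_000_000)))` should be called under an `if __name__ == "__main__":` guard, which process-based parallelism requires on some platforms.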

Software and hardware

The languages and architectures we tested are: R with the MICE package, single-core C, C on the graphics card (CUDA) and multi-core C (6, 8, 10 and 12 cores).

The processors we used are:

  • CPU: Intel Xeon E5645 @ 2.4 GHz
  • GPU: Tesla C2050, 3 GB, 1.15 GHz
  • Multi-core: 12-core Intel Xeon E5645 @ 2.4 GHz

 

For the tests, matrices of different sizes were studied: the number of variables ranges from 100 to 1 000 and the number of observations per variable from 1 000 to 1 000 000.

For each matrix size, we generated 10% missing values and set the number of imputations to 5.
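
A test matrix of this kind could be produced along the following lines. This is a sketch under assumptions: the source does not say how the values or the missing cells were generated, so the normally distributed data, the uniformly random placement of missing cells and the name `make_test_matrix` are all hypothetical.

```python
import random


def make_test_matrix(n_obs, n_vars, missing_rate=0.10, seed=0):
    """Build an n_obs x n_vars matrix and blank out a fraction of its cells.

    Missing cells are marked with None. Values are drawn from a standard
    normal distribution (an assumption made only for this illustration).
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]

    n_cells = n_obs * n_vars
    n_missing = round(missing_rate * n_cells)
    # Choose distinct cells uniformly at random and mark them as missing.
    for cell in rng.sample(range(n_cells), n_missing):
        row, col = divmod(cell, n_vars)
        data[row][col] = None
    return data
```

For example, `make_test_matrix(1000, 100)` gives the smallest size mentioned above, with exactly 10% of its cells missing.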

Multiple imputation

As part of his final-year internship, Chai Anchen compared the efficiency of the bootstrap across the different languages and architectures.

 

Multiple imputation is a method for carrying out statistical analysis on incomplete data sets without underestimating the variance. The principle is:

  • Initialise all the missing values. Each missing value is replaced with one of the possible values, which yields a complete data set.
  • “Predict” the missing values of the first variable from all the other variables, using a linear regression. The missing values of the first variable are replaced with the predicted values.
  • Iterate over all the variables, each time replacing the missing values with the predicted values.

Then the statistical analysis can be conducted with the completed data set.
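
The principle above can be sketched in plain Python as follows. This is a simplified illustration, not the code that was benchmarked: missing values are marked with None, the regression is solved through the normal equations, and all function names and the number of sweeps are assumptions. Real implementations such as the MICE package typically also add random draws to the predictions, which this sketch omits.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    No handling of singular systems (e.g. collinear variables)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_regression(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def without(row, j):
    """All values of a row except column j (the predictors for variable j)."""
    return row[:j] + row[j + 1:]


def impute_once(data, n_sweeps=5, seed=0):
    """Return one completed copy of data; the input is left unchanged."""
    rng = random.Random(seed)
    n_vars = len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [row[:] for row in data]

    # Step 1: initialise each missing value with a random observed value
    # of the same variable (assumes every variable has observed values).
    for j in range(n_vars):
        observed = [row[j] for row in data if row[j] is not None]
        for i, row in enumerate(filled):
            if missing[i][j]:
                row[j] = rng.choice(observed)

    # Steps 2 and 3: for each variable in turn, regress it on all the
    # others and replace its missing values with the predictions.
    for _ in range(n_sweeps):
        for j in range(n_vars):
            if not any(flags[j] for flags in missing):
                continue
            train = [row for i, row in enumerate(filled) if not missing[i][j]]
            coef = fit_linear_regression([without(r, j) for r in train],
                                         [r[j] for r in train])
            for i, row in enumerate(filled):
                if missing[i][j]:
                    predictors = without(row, j)
                    row[j] = coef[0] + sum(c * v for c, v in zip(coef[1:], predictors))
    return filled


def multiple_imputation(data, n_imputations=5):
    """Produce several completed data sets, one per random initialisation."""
    return [impute_once(data, seed=s) for s in range(n_imputations)]
```

Each completed data set can then be analysed separately and the results combined. Most of the time goes into the regressions, which amount to solving linear systems.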

Results & Analysis

Unsurprisingly, C is faster than R.

The performance of the GPU and of multi-core is less clear-cut. They outperform a single CPU core, but only above a certain volume of data. Below that volume, splitting up the data and communicating with the graphics card or the cores slows down the entire operation.
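
This threshold effect can be illustrated with a deliberately simple cost model, which is an assumption and was not measured in the study: serial time grows in proportion to the data size, while parallel time pays a fixed overhead (data splitting and communication) and then divides the work among the workers. All names and parameters below are hypothetical.

```python
def serial_time(n_items, time_per_item):
    """Time to process everything on a single core."""
    return n_items * time_per_item


def parallel_time(n_items, time_per_item, n_workers, overhead):
    """Fixed overhead plus the work shared evenly among the workers."""
    return overhead + n_items * time_per_item / n_workers


def break_even_size(time_per_item, n_workers, overhead):
    """Data size at which serial and parallel times are equal.

    Obtained by solving n * t = overhead + n * t / p for n.
    Requires n_workers > 1.
    """
    return overhead / (time_per_item * (1 - 1 / n_workers))
```

Under this model, parallel execution only pays off for data sizes above `break_even_size(...)`; below it, the overhead dominates.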

In the end:

  • For small calculations, single-core C is better (although for one single calculation the difference is negligible: waiting 1 or 10 milliseconds doesn’t matter).
  • For big data sets, the GPU is clearly more effective. It is 1.24 times faster than the 12-core run, 2.4 times faster than the 4-core run, 3.7 times faster than single-core C and 819 times faster than R.
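
Taking the quoted factors at face value, the relative speed of any two other configurations follows by division. The short sketch below does that arithmetic. The dictionary keys are labels chosen for this example, and reading "C" as single-core C is an assumption.

```python
# Speed-up of the fastest configuration over each alternative,
# copied from the figures quoted above.
speedup_over = {
    "12-core C": 1.24,
    "4-core C": 2.4,
    "single-core C": 3.7,
    "R": 819.0,
}


def relative_speed(faster, slower):
    """How many times faster `faster` is than `slower`, by dividing factors."""
    return speedup_over[slower] / speedup_over[faster]


# 12 cores versus a single core, and single-core C versus R.
twelve_vs_one = relative_speed("12-core C", "single-core C")
c_vs_r = relative_speed("single-core C", "R")
```

By this reading, the 12-core run is roughly 3 times faster than single-core C, and single-core C is roughly 220 times faster than R.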

For further information

Download Chai Anchen’s thesis. It explains the methods and results above in detail.