R++ is high-performance statistical analysis software. Simple, fast, effective. Bringing statistics within everyone's reach.


Big data

Definition: Big Data

Big data refers to data sets so voluminous and complex that traditional data-processing software is inadequate to deal with them. Typically, when a matrix is bigger than the available memory it cannot be loaded, so matrix calculations are compromised.

A solution is to perform the calculation piece by piece. Instead of dealing with the entire matrix, it is sliced into submatrices. A submatrix is loaded, the calculations are done and the result is stored; then another submatrix is loaded, and so on. Of course the process is less efficient than processing the matrix unsegmented, but it has the merit of enabling a calculation that could not be done otherwise.
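As a minimal sketch of this block-by-block idea, the following Python code computes the column means of a matrix stored on disk without ever loading it whole, using `numpy.memmap` so that only the current slice of rows is read into memory (file name and block size are illustrative):

```python
import numpy as np

def blockwise_column_means(path, n_rows, n_cols, block_rows=1000):
    """Column means of a matrix stored on disk, one block of rows at a time.

    np.memmap maps the file into virtual memory, so only the rows of the
    slice we actually touch are read from disk.
    """
    X = np.memmap(path, dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    total = np.zeros(n_cols)
    for start in range(0, n_rows, block_rows):
        block = np.asarray(X[start:start + block_rows])  # load one submatrix
        total += block.sum(axis=0)                       # process it, keep the result
    return total / n_rows

# Demonstration: write a matrix to disk, then process it block by block.
rng = np.random.default_rng(0)
M = rng.standard_normal((500, 4))
M.tofile("matrix.bin")
means = blockwise_column_means("matrix.bin", 500, 4, block_rows=64)
```

The result agrees with the in-memory `M.mean(axis=0)` up to floating-point rounding, while peak memory use is bounded by the block size rather than the matrix size.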

Software and hardware

The languages and libraries we tested are: R, Matlab, ScaLAPACK

  • CPU: Intel Xeon E5630 @ 2.53 GHz / Intel Core i5-2520M @ 2.50 GHz × 2
  • Neptune: BULL B510 engine with 128 bi-socket compute nodes

Each compute node has the following features:

  • 2 Intel Sandy Bridge (EP E5-2670) processors
  • 8 cores per processor at 2.6 GHz (8 flops per cycle per core, delivering a peak performance of 330 GFlops/s per node)
  • 32 GB of memory per node (DDR3 at a memory speed of 1066 MHz)
  • 32 KB L1 caches (instructions and data) and a 256 KB L2 cache per core
  • 20 MB L3 cache shared by the 8 cores of each processor

Linear regression

As part of his final-year internship, Zakarié Jorti worked on the linear regression of matrices too big to be loaded into memory. Better still, instead of working with submatrices whose size matched the available memory, he segmented the matrix into much smaller units.
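One standard way to fit a linear regression block by block is to accumulate the normal equations XᵀX and Xᵀy over the submatrices and then solve the small p × p system. This is a sketch of the general technique, not necessarily the exact algorithm of the thesis (QR-based updates are more numerically stable in practice); all names are illustrative:

```python
import numpy as np

def out_of_core_least_squares(blocks):
    """Solve min ||X b - y||^2 where X and y arrive as row blocks (X_i, y_i).

    Only the p x p matrix X^T X and the length-p vector X^T y are kept in
    memory, regardless of the total number of rows.
    """
    XtX, Xty = None, None
    for X_i, y_i in blocks:
        if XtX is None:
            p = X_i.shape[1]
            XtX = np.zeros((p, p))
            Xty = np.zeros(p)
        XtX += X_i.T @ X_i   # accumulate the normal equations
        Xty += X_i.T @ y_i
    return np.linalg.solve(XtX, Xty)

# Check against a known coefficient vector on synthetic data.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
beta = np.arange(5.0)
y = X @ beta
blocks = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 1000, 100)]
b = out_of_core_least_squares(blocks)
```

Here `b` recovers `beta` up to numerical precision, and each block could just as well be read from disk instead of sliced from an in-memory array.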

He could therefore run the calculations and the loading of the submatrices in parallel:

  • The first submatrix is loaded into memory.
  • The first submatrix is processed while, simultaneously, the second submatrix is loaded into memory.
  • The second submatrix is processed while the third is loaded, and so on.

This method is called “out-of-core” processing. It masks the loading time behind the computation.

Results & Analysis

Zakarié worked on a laptop on one side and, on the other, on Neptune, one of the CERFACS supercomputers. He worked on ever larger amounts of data to test the limits of both machines.

Results: on the laptop, using the out-of-core method, he managed to process the same volume of data as that addressable by the Neptune supercomputer.

For further information

Download Zakarié Jorti’s thesis. It explains in detail the methods and results above.