Big data refers to data sets so voluminous and complex that traditional data-processing software is inadequate to deal with them. Typically, when a matrix is bigger than the available memory it cannot be loaded, and matrix calculations are compromised.
A solution is to do the calculation piece by piece. Instead of dealing with the entire matrix, it is sliced into submatrices. A submatrix is loaded, the calculations are done, the result is stored; then another submatrix is loaded, and so on. Of course the process is less efficient than processing the matrix unsegmented, but it has the merit of enabling a calculation that could not be done otherwise.
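To make this concrete, here is a minimal sketch in Python/NumPy of the piece-by-piece idea: a matrix-vector product computed on one submatrix at a time, with the matrix kept on disk. The file name, shapes and block size are illustrative assumptions, not details from the project.

```python
# Block-wise processing of a matrix that does not fit in memory.
# "big_matrix.dat", the shapes and the block size are assumptions for illustration.
import numpy as np

n_rows, n_cols = 1_000_000, 100   # a matrix too large to hold comfortably in RAM
block_rows = 10_000               # rows per submatrix, chosen to fit in memory

# np.memmap gives array-like access to a file on disk without loading it whole.
A = np.memmap("big_matrix.dat", dtype=np.float64, mode="r",
              shape=(n_rows, n_cols))
x = np.random.rand(n_cols)

y = np.empty(n_rows)
for start in range(0, n_rows, block_rows):
    stop = min(start + block_rows, n_rows)
    block = np.asarray(A[start:stop])  # load one submatrix into memory
    y[start:stop] = block @ x          # compute on it, store the result
```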
The languages and libraries we tested are R, Matlab and ScaLAPACK.
As part of his final-year internship, Zakarié Jorti worked on the linear regression of matrices too big to be loaded in memory. Icing on the cake: instead of working with submatrices whose size was close to the available memory, he segmented the matrix into much smaller units.
He could therefore parallelize the computation and the loading of the submatrices.
This method is called “out-of-core”. It makes the most of the loading time by overlapping it with the computation.
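Below is a hedged sketch of what such an out-of-core linear regression could look like, not the actual implementation: the normal equations (XᵀX) β = Xᵀy are accumulated over many small submatrices, and a background thread prefetches the next block so that loading overlaps with computation. The loader, shapes and block size are hypothetical stand-ins.

```python
# Out-of-core least squares: accumulate XᵀX and Xᵀy block by block,
# prefetching the next block in a background thread while computing.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

n_rows, n_cols = 1_000_000, 50
block_rows = 5_000

def load_block(start):
    """Stand-in loader: in practice this would read a slice from disk."""
    stop = min(start + block_rows, n_rows)
    rng = np.random.default_rng(start)
    X = rng.standard_normal((stop - start, n_cols))
    y = X @ np.arange(1.0, n_cols + 1) + 0.01 * rng.standard_normal(stop - start)
    return X, y

XtX = np.zeros((n_cols, n_cols))
Xty = np.zeros(n_cols)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(load_block, 0)            # prefetch the first block
    for start in range(0, n_rows, block_rows):
        X, y = future.result()                     # wait for the current block
        nxt = start + block_rows
        if nxt < n_rows:
            future = pool.submit(load_block, nxt)  # load next while we compute
        XtX += X.T @ X                             # accumulate normal equations
        Xty += X.T @ y

beta = np.linalg.solve(XtX, Xty)  # coefficients, without ever holding X whole
```

The final solve only involves a small n_cols × n_cols system, which is what keeps the memory footprint independent of the number of rows.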
Zakarié works on a laptop on one side and, on the other, on Neptune, one of the CERFACS supercomputers. He is working on increasingly large amounts of data to test the limits of both machines.
Results: with the out-of-core method on a laptop, he managed to process the same volume of data as that addressable by the Neptune supercomputer.