Big data refers to data sets so voluminous and complex that traditional data-processing software is inadequate to deal with them. Typically, a matrix larger than the available memory cannot be loaded in full, so matrix calculations are compromised.
A solution is to perform the calculation piece by piece. Instead of dealing with the entire matrix, it is sliced into submatrices: a submatrix is loaded, its part of the calculation is done, the result is stored, then the next submatrix is loaded, and so on. The process is of course less efficient than processing the matrix unsegmented, but it has the merit of enabling a calculation that could not be done otherwise.
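A minimal sketch of this block-by-block scheme, using a matrix–matrix product as the example calculation (the file name, sizes, and block size below are illustrative, not taken from the article):

```python
import numpy as np

def blockwise_matmul(path_a, n_rows, n_cols, B, block_rows):
    """Compute A @ B while only ever holding `block_rows` rows of A in RAM."""
    # The full matrix A lives on disk; memmap gives slice access without loading it.
    A = np.memmap(path_a, dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    C = np.empty((n_rows, B.shape[1]))
    for s in range(0, n_rows, block_rows):
        block = np.asarray(A[s:s + block_rows])  # load one submatrix from disk
        C[s:s + block.shape[0]] = block @ B      # compute and store its partial result
    return C

# Demo: write a matrix to disk, then multiply it slice by slice.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 16))
B = rng.standard_normal((16, 4))
A.tofile("A.bin")
C = blockwise_matmul("A.bin", 1000, 16, B, block_rows=128)
assert np.allclose(C, A @ B)
```

Here only 128 rows of `A` are resident at any moment, yet the final result matches the in-memory product.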
The languages and libraries we tested are R, Matlab and ScaLAPACK.
Each compute node has the following features:
As part of his final-year internship, Zakarié Jorti worked on the linear regression of matrices too big to be loaded in memory. Icing on the cake: instead of working with submatrices whose size was close to the memory capacity, he segmented the matrix into much smaller units.
He could therefore run the calculation and the loading of the submatrices in parallel.
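One way such an out-of-core linear regression can be sketched is by accumulating the normal equations block by block; this is a simplified illustration under that assumption, not the intern's actual implementation, and all names and sizes are hypothetical:

```python
import numpy as np

def out_of_core_lstsq(path_x, path_y, n_rows, n_cols, block_rows):
    """Solve min ||X b - y|| without ever loading X in full.

    Accumulates the normal equations (X.T X) b = X.T y one row block at a
    time, so memory use is bounded by one block plus an n_cols x n_cols matrix.
    """
    X = np.memmap(path_x, dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    y = np.memmap(path_y, dtype=np.float64, mode="r", shape=(n_rows,))
    G = np.zeros((n_cols, n_cols))   # running sum of X.T X
    c = np.zeros(n_cols)             # running sum of X.T y
    for s in range(0, n_rows, block_rows):
        Xb = np.asarray(X[s:s + block_rows])  # load one small submatrix
        yb = np.asarray(y[s:s + block_rows])
        G += Xb.T @ Xb
        c += Xb.T @ yb
    return np.linalg.solve(G, c)

# Demo on synthetic data with a known coefficient vector.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5))
beta = np.arange(1.0, 6.0)
y = X @ beta
X.tofile("X.bin")
y.tofile("y.bin")
b = out_of_core_lstsq("X.bin", "y.bin", 2000, 5, block_rows=256)
assert np.allclose(b, beta)
```

Because each block contributes an independent partial sum, the per-block loads and updates are exactly the kind of work that can be parallelized.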
This method is called “out-of-core”. By overlapping computation with I/O, it hides much of the loading time.
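The overlap of loading and computing can be sketched with a simple double-buffering pattern: a background thread prefetches the next submatrix while the current one is being processed. This is an illustrative sketch (the prefetch-one-block strategy and all names are assumptions), shown here on a Gram-matrix computation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_block(X, start, block_rows):
    # Simulated I/O: materialise one slice of the on-disk matrix.
    return np.asarray(X[start:start + block_rows])

def overlapped_gram(path_x, n_rows, n_cols, block_rows):
    """Compute X.T @ X while a background thread prefetches the next block."""
    X = np.memmap(path_x, dtype=np.float64, mode="r", shape=(n_rows, n_cols))
    G = np.zeros((n_cols, n_cols))
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(load_block, X, 0, block_rows)   # prefetch first block
        for start in range(0, n_rows, block_rows):
            block = future.result()                          # wait for current block
            nxt = start + block_rows
            if nxt < n_rows:
                # Start loading the next submatrix while we compute on this one.
                future = pool.submit(load_block, X, nxt, block_rows)
            G += block.T @ block
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((1500, 6))
X.tofile("X2.bin")
G = overlapped_gram("X2.bin", 1500, 6, block_rows=200)
assert np.allclose(G, X.T @ X)
```

When a block's load time and its compute time are similar, this overlap can hide most of the I/O cost behind the arithmetic.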
Zakarié worked on a laptop on one side, and on Neptune (one of the CERFACS supercomputers) on the other. He worked on matrices of increasing size in order to explore the limits of the machines. Result: on a laptop, using out-of-core processing, he managed to handle the same volume of data as that addressable by the Neptune supercomputer.