2013: big data starts to take off. Not wanting to die an idiot, I (Christophe Genolini, founder of R++, statistician and computer engineer, R fan) decide to delve deeper into the subject.
Google tells me that big data means data sets too voluminous to be handled by traditional programs. But the definition of “too” varies by domain: in 2013, IT engineers are already capable of processing about 50 TB, whereas statisticians reach their limit around 1 GB…
Why? Why such a difference? Why are we statisticians limited to “farcical big data” when our IT cousins work with volumes 50,000 times bigger?
At the time I didn’t imagine for a second where this small question would lead me.
During the first brainstorming session, grievances are collected. Users have to answer the questions: “What is really hard with the current software? What takes an entire day when you counted on one hour? What is difficult, boring, sensitive, long, risky, a source of error…?”
Opposite is the simplified result of a mini-session about statistical analysis software.
2013 still. Since general-purpose software (Oracle, Python, SQL, designed and used by computer engineers) can process really big data, there is a strong temptation to abandon statistical analysis software (designed and used by statisticians). Unfortunately, general-purpose software is not appropriate.
In conclusion, general-purpose software is not suitable for statistical analysis.
On the other hand, the domain-specific statistical software packages are comprehensive but not efficient. They make no use (or too little use) of the graphics card, networks, out-of-core disk reading, or modern HCI.
Worse, every function call in R duplicates its arguments. When you want to change a single value in a data set, you write `x[i] <- value`, which calls the replacement function `[<-`: R duplicates the data set, changes the value in the copy, and overwrites the original. That is one of the reasons for the 50 TB / 1 GB gap.
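The cost of this copy-on-modify behavior can be sketched in Python as an analogy (this is not R++ or R internals code; the function names are illustrative): one helper mimics R by rebuilding the whole vector for a single-cell update, the other updates in place.

```python
# Analogy for R's copy-on-modify semantics: changing one value forces
# a full copy of the object, an O(n) cost for an O(1) change.

def set_with_copy(data, i, value):
    # Like R's `[<-`: duplicate the whole vector, then modify the copy.
    new = list(data)      # full O(n) copy of every element
    new[i] = value
    return new

def set_in_place(data, i, value):
    # What a memory-aware implementation can do instead: O(1) update.
    data[i] = value
    return data

v = [0] * 1_000_000
w = set_with_copy(v, 42, 7)
assert v[42] == 0 and w[42] == 7   # original untouched: a copy was made
set_in_place(v, 42, 7)
assert v[42] == 7                  # modified in place, no duplication
```

On a 1 GB data set, the copying strategy turns every one-cell edit into a 1 GB memory operation, which is why memory becomes the bottleneck long before the CPU does.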
Hence the project: create a statistical analysis software designed by statisticians (to truly meet users’ needs) and powered by the latest IT innovations (to be truly efficient): R++, the Next Step…
2014. A final-year intern, Chai Anchen, implements a statistical analysis method on the graphics card. In theory, using the GPU should be much faster than traditional CPU programming. We want to check whether practice matches the theory.
We choose a modern analysis method called multiple imputation. Chai implements it in R with the MICE library, in C on a conventional CPU, and in CUDA on the GPU. He runs his programs on matrices of different sizes.
The results are striking! For big matrices, CUDA is up to 800 times faster than R: the program runs in 2 minutes instead of the 26 hours taken by MICE.
The proof is in: the slowness of statistical analysis software is not inevitable!
2015. During his final-year internship, Zakarié Jorti works on linear regression for matrices too big to fit in memory. The idea is to split the matrix into submatrices much smaller than the computer’s memory.
The first submatrix is loaded into memory. It is processed while the second submatrix is simultaneously being loaded. Then the second submatrix is processed while the third is loaded…
This method is called “out-of-core”. By overlapping computation with disk reads, it hides most of the loading time.
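The overlap described above can be sketched in Python with a background loader thread (a minimal double-buffering illustration, not the actual R++ implementation; `load_chunk` and `process` are hypothetical stand-ins for disk reads and the regression step):

```python
# Out-of-core sketch: process chunk i while chunk i+1 is being
# loaded in a background thread (double buffering).
import threading

def load_chunk(i):
    # Stand-in for reading submatrix i from disk.
    return list(range(i * 4, i * 4 + 4))

def process(chunk):
    # Stand-in for the per-submatrix computation.
    return sum(chunk)

def out_of_core_sum(n_chunks):
    total = 0
    current = load_chunk(0)
    for i in range(n_chunks):
        buf, loader = {}, None
        if i + 1 < n_chunks:
            # Start loading the NEXT chunk before processing this one.
            loader = threading.Thread(
                target=lambda j=i: buf.setdefault("chunk", load_chunk(j + 1)))
            loader.start()
        total += process(current)   # computation overlaps with the I/O
        if loader:
            loader.join()           # next chunk is ready when we need it
            current = buf["chunk"]
    return total

print(out_of_core_sum(3))  # sums the values 0..11, i.e. 66
```

When loading and processing take comparable time, this roughly halves the wall-clock time versus loading everything first, and the memory footprint never exceeds two submatrices.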
Zakarié works both on a laptop and on Neptune (the supercomputer of the French national weather bureau). He uses bigger and bigger matrices to explore the limits of both machines.
Result: with the laptop and the out-of-core method, he can process the same amount of data as with Neptune.
2015. The project begins to take shape. But I don’t want to leave anything to chance.
So I get in touch with my former colleague at the LRI, Michel Beaudouin-Lafon, a world specialist in HCI. At the time I don’t really know what HCI is. I just remember an extraordinary demonstration Michel gave for the 20th anniversary of the LRI. I ask him to recommend a book about HCI. Instead, he gives me an invaluable initiation to video prototyping. I fall in love (with the science, not with Michel!). With his team, and then alone, I run a series of user meetings.
Grievances pour out: graph export, database merging, data reading, encoding, alpha channel, outlier spotting… Solutions follow, often graphical, sometimes ingenious, always appropriate.
2016. Clément Dupont arrives for his master’s internship in HCI. Over six months, he develops a first interface for data management. The results are spectacular once again: we process in 5 minutes what would take 1 hour with R.
In short: faster, bigger, easier. In every field, IT experts do better than statisticians. It is time to bundle all this into a single piece of software. But the project is huge, and I still hesitate to throw myself into it.
I get in touch with Bruno Falissard, director of my former research team and something of a mentor to me. He doesn’t have much time but is passing through Toulouse. I meet him at 7 am at the airport and give him a broad outline of the project. I remember his answer verbatim: “It’s a crazy project. It will be a life’s work. But the community needs it. So go ahead!”
So I throw myself into the project. Alone, I don’t stand a chance. I answer a call for proposals for public funding from the ANR (the French national research agency). I’m rejected, twice in a row.
I look for alternatives, and the idea crosses my mind to go private. I get in touch with the region’s business incubator, “just for information”. I get an appointment. They don’t ask a single technical question; their only concern is “How will you make money?”. I had never thought about it.
I write a business plan and apply. I’m rejected. I revise my application and apply again; this time it works.
Three months later, I file the articles of incorporation for Zébrys. Its aim: the creation and commercialization of a high-performance statistical analysis software. The R++ project is born!