Open Moulinette

open-data / civic
CHALLENGE

Every data scientist who builds machine learning models with contextual features knows how difficult it is to gather enough data to fit a model, especially when the subject is people's consumption habits.
Open data is a great way to find large, free datasets covering demographic statistics, public infrastructure reference data, and so on. Most major cities have released their own datasets, but standardizing formats remains a big issue.
To go further, you can read this blog post by Romain about efficient open data.

SOLUTION

We have downloaded the most interesting datasets released by the French National Institute of Statistics and Economic Studies (INSEE).
We built Open Moulinette with Armand Gilles to process all these files (a few hundred):
– link the data to the geographic reference system (IRIS areas)
– convert geolocations to standard projections
– generate a single, unified, easy-to-handle data sheet
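
The linking and unifying steps can be illustrated with a minimal Python sketch. This is not the actual Open Moulinette code: the function name `merge_on_iris`, the `IRIS` key column, the `;` delimiter, the column names, and the sample rows are all assumptions made for illustration. It reads several CSV tables, normalizes the IRIS code to 9 characters (leading zeros are easily lost when files pass through a spreadsheet), and gathers every table's columns into one row per IRIS area.

```python
import csv
import io
from collections import defaultdict


def merge_on_iris(sources, key="IRIS", delimiter=";"):
    """Merge several CSV tables into one dict of rows keyed by IRIS code.

    sources: mapping of a short prefix -> file-like object holding CSV text.
    Each output column is named "<prefix>_<original column>" to avoid clashes.
    """
    merged = defaultdict(dict)
    for prefix, handle in sources.items():
        for row in csv.DictReader(handle, delimiter=delimiter):
            # Assumption: IRIS codes are 9 characters; restore lost leading zeros.
            code = row.pop(key).strip().zfill(9)
            for column, value in row.items():
                merged[code][f"{prefix}_{column}"] = value
    return dict(merged)


# Made-up sample data, for illustration only (not real INSEE figures).
population = io.StringIO("IRIS;P_POP\n10040101;1200\n751010101;800\n")
housing = io.StringIO("IRIS;LOG\n010040101;500\n")

unified = merge_on_iris({"pop": population, "housing": housing})
```

Here `unified` holds one entry per IRIS code, and an area missing from one source simply lacks that source's columns. A real pipeline would also have to handle encodings, differing key column names, and the projection conversion, which this sketch leaves out.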

RESULT

Open Moulinette is up and running, and you can find the source code on GitHub. Feel free to contribute: it’s open source.
Special gift: if you’re interested in building a dashboard with this data (using Docker, Elasticsearch and Kibana), a dedicated tutorial is available.