Every data scientist who uses machine learning with contextual features knows how difficult it is to gather enough data to fit a model, especially when the subject involves people's consumption.
Open data is a great way to find large, free datasets that report demographic statistics, public infrastructure reference data, and so on. Most major cities have released their own datasets, but standardization of formats remains a big issue.
To go further, you can read this blog post by Romain about efficient open data.
We have downloaded the most interesting data released by the French National Institute of Statistics and Economic Studies (INSEE).
We built Open Moulinette with Armand Gilles to process all these files (a few hundred):
– link the data to the geographic reference system (IRIS areas)
– convert geolocations to standard projections
– generate a single, unified, easy-to-handle dataset
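
To make the first and last steps above more concrete, here is a minimal Python sketch of the general idea. It is not the actual Open Moulinette code. It assumes each source file has already been loaded as a list of rows keyed by an IRIS code; the column names (`iris`, `population`, `nb_pharmacies`), the helper names, and the sample values are all hypothetical. Projection conversion is left out because it would need a geospatial library.

```python
import csv
import io


def merge_on_iris(tables, key="iris"):
    """Outer-join several tables (lists of dicts) on the IRIS code."""
    merged = {}
    for rows in tables:
        for row in rows:
            code = row[key]
            # Create the area on first sight, then add this table's columns.
            merged.setdefault(code, {key: code}).update(row)
    return [merged[code] for code in sorted(merged)]


def to_csv(rows, key="iris"):
    """Serialize the merged rows as one wide CSV, key column first."""
    columns = [key] + sorted({c for r in rows for c in r if c != key})
    buffer = io.StringIO()
    # restval="" leaves a cell empty when an area is missing from a source.
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for two source files.
population = [
    {"iris": "751010101", "population": "1200"},
    {"iris": "751010102", "population": "950"},
]
equipment = [{"iris": "751010101", "nb_pharmacies": "2"}]

unified = merge_on_iris([population, equipment])
print(to_csv(unified))
```

An outer join keeps every IRIS area even when one source has no row for it, so gaps show up as empty cells instead of silently dropping areas.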