One of the most important aspects of our work at nam.R is to find, clean, aggregate and organise large datasets of geolocated data found in Open Data portals. Very often the same geographical area or object is described in many different datasets, each containing a different piece of information.
To build a rich description of an object, the real challenge is to join these pieces together. In the absence of a clear and coherent name or index, this process can become quite difficult and noisy.
For example, when trying to aggregate information on a building, we may find its price per square meter in one file, while another file contains information on its heating system or building material. Buildings, however, usually don’t have names, and they are often indexed differently depending on the source, or institution, behind the creation of the dataset. This means that to merge the various sources correctly one has to be creative.
Luckily, more often than not, geographical objects come with the coordinates that describe their geometry, and these can be used to merge the data on twin polygons. By twin polygons we mean polygons describing the same object in different datasets.
That said, it is surprising how many small differences can be found in the coordinates of the same object coming from different sources: the angles, the number of vertices, the length of the edges and of course the geolocation are often slightly different. Moreover, a single building in one dataset can easily be represented as several buildings in another. In the figure above we can see two tricky examples. All this implies that a unique definition of twin polygons does not exist, and we therefore leave this task to an algorithm.
The Machine Learning approach we chose consists of training a Random Forest algorithm to give a similarity score for two polygons. To do this we started by describing polygons through a number of geometrical features: compactness, PCA orientation, eccentricity, convexity and more. A polygon is therefore interpreted as a vector of numbers, as in the figure above. This allowed us to quantify the difference between any given pair of polygons as the difference of these vectors.
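To make the idea of "a polygon as a vector of numbers" concrete, here is a minimal sketch in plain Python. It is an illustration only, not our production code: it covers just two of the features mentioned above, and the exact definitions are assumptions (compactness as the Polsby-Popper ratio 4πA/P², convexity as the polygon area divided by the area of its convex hull, and the pair difference as an element-wise absolute difference). All function names are hypothetical.

```python
import math


def area(poly):
    """Unsigned area (shoelace formula); poly is a list of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def perimeter(poly):
    """Sum of edge lengths, closing the ring back to the first vertex."""
    return sum(math.dist(p, q) for p, q in zip(poly, poly[1:] + poly[:1]))


def compactness(poly):
    """Assumed definition: Polsby-Popper ratio, 1.0 for a circle."""
    return 4.0 * math.pi * area(poly) / perimeter(poly) ** 2


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convexity(poly):
    """Assumed definition: polygon area over convex hull area (1.0 if convex)."""
    return area(poly) / area(convex_hull(poly))


def features(poly):
    """Describe a polygon as a small vector of numbers."""
    return [compactness(poly), convexity(poly)]


def feature_difference(poly_a, poly_b):
    """Element-wise absolute difference of the two feature vectors."""
    return [abs(a - b) for a, b in zip(features(poly_a), features(poly_b))]


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(features(square))
print(feature_difference(square, l_shape))
```

Both features here are invariant to translation and scale, so a small shift in coordinates between two sources leaves them unchanged. That is one reason shape descriptors of this kind are convenient for comparing polygons that are almost, but not exactly, the same.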
To train our Random Forest algorithm we labeled by hand more than 20,000 pairs of polygons, coming from spatial intersections of real datasets, as same or different.
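As a rough illustration of how candidate pairs can be drawn from two datasets, the sketch below keeps every pair of polygons whose bounding boxes overlap. This is an assumption made for simplicity: a bounding-box test is only a cheap stand-in for a true geometric intersection, and the function names are hypothetical. It compares every polygon with every other one, so a spatial index would be needed at real scale.

```python
def bbox(poly):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_overlap(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(dataset_a, dataset_b):
    """Index pairs (i, j) whose bounding boxes overlap: candidates for labeling."""
    boxes_b = [bbox(p) for p in dataset_b]
    pairs = []
    for i, poly_a in enumerate(dataset_a):
        box_a = bbox(poly_a)
        for j, box_b in enumerate(boxes_b):
            if bboxes_overlap(box_a, box_b):
                pairs.append((i, j))
    return pairs


# Made-up example: one building in source A, two in source B.
source_a = [[(0, 0), (1, 0), (1, 1), (0, 1)]]
source_b = [
    [(0.1, 0.1), (1.1, 0.1), (1.1, 1.1), (0.1, 1.1)],  # slightly shifted copy
    [(10, 10), (11, 10), (11, 11), (10, 11)],          # unrelated building
]
print(candidate_pairs(source_a, source_b))
```

Only the overlapping pairs would then be shown to a human annotator and labeled as same or different.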
The algorithm then learned to give a similarity score based on the geometrical features of two polygons. This score can be interpreted as the probability that the two polygons are the same one. By setting a threshold of t = 0.7 on the similarity score, we obtained the following performance on our training set.
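For readers who want to see how such a threshold turns scores into decisions, here is a minimal sketch. The scores and labels are made up for illustration and are not our results. Using precision and recall as the metrics, and treating a score exactly equal to the threshold as a match, are both assumptions.

```python
def classify(scores, threshold=0.7):
    """Label a pair as twins when its similarity score reaches the threshold."""
    return [score >= threshold for score in scores]


def precision_recall(predicted, actual):
    """Precision and recall of the predicted 'same' labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical similarity scores and hand labels (True means "same").
scores = [0.95, 0.80, 0.72, 0.40, 0.10, 0.65]
labels = [True, True, False, False, False, True]

predicted = classify(scores, threshold=0.7)
precision, recall = precision_recall(predicted, labels)
print(predicted)
print(precision, recall)
```

Raising the threshold generally trades recall for precision: fewer pairs are merged, but those that are merged are more likely to be true twins.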
We considered these values good enough, and therefore started using this algorithm to bring some order to the wilderness of the geospatial data universe.
May the merging begin!