One of the most important aspects of our work at nam.R is to find, clean, aggregate and organise large datasets of geo-localised data found in Open Data portals. Very often the same geographical area or object is described in many different datasets, each containing a different piece of information.
The real challenge in building a rich description of an object is to join these pieces together. But in the absence of a clear and coherent name or index, this process can become quite difficult and noisy.
For example, when trying to aggregate information on a building, we may find its price per square meter in one file, while another file may contain information on its heating system or material. However, buildings usually don't have names, and they are often indexed differently according to the source, or institution, behind the creation of the dataset. This means that to correctly merge the various sources one has to be creative.
Luckily, more often than not, geographical objects come with the coordinates that describe their geometry, and these can be used to merge the data on twin polygons. By twin polygons we mean polygons describing the same object in different datasets.
That said, one can be surprised by how many small differences appear in the coordinates of the same object coming from different sources: the angles, the number of vertices, the length of the edges and, of course, the geolocalisation are often slightly different. Moreover, a single building in one dataset can easily be represented as several buildings in another. In the figure above we can see two tricky examples. All this implies that a unique definition of twin polygons does not exist, and we therefore leave this task to an algorithm.
The Machine Learning approach we chose consists of training a Random Forest algorithm to give a similarity score for two polygons. To do this we started by describing each polygon through a number of geometrical features: compactness, PCA orientation, eccentricity, convexity and more. A polygon is therefore interpreted as a vector of numbers, as in the figure above. This allowed us to quantify the difference between any given couple as the difference of these vectors.
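To make the feature-vector idea concrete, here is a minimal sketch of what such a feature extraction could look like in Python with shapely and numpy. The exact feature set we use is richer than this; the definitions below (Polsby-Popper compactness, PCA-based orientation and eccentricity, convex-hull convexity) are standard textbook choices assumed for illustration.

```python
import numpy as np
from shapely.geometry import Polygon

def polygon_features(poly: Polygon) -> np.ndarray:
    """Describe a polygon as a fixed-length vector of geometric features."""
    # Compactness (Polsby-Popper): 4*pi*area / perimeter^2, equal to 1 for a circle.
    compactness = 4 * np.pi * poly.area / poly.length ** 2

    # PCA on the exterior vertices gives the main orientation and eccentricity.
    coords = np.asarray(poly.exterior.coords)[:-1]  # drop the repeated last point
    centered = coords - coords.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))  # ascending eigenvalues
    orientation = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])  # main-axis angle
    eccentricity = np.sqrt(1 - eigvals[0] / eigvals[-1])

    # Convexity: ratio of the polygon's area to the area of its convex hull.
    convexity = poly.area / poly.convex_hull.area

    return np.array([poly.area, compactness, orientation, eccentricity, convexity])

# A couple (p, q) is then described by the difference of the two vectors,
# e.g. np.abs(polygon_features(p) - polygon_features(q)).
```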
To train our Random Forest algorithm we labeled by hand more than 20,000 couples of polygons, coming from spatial intersections of real datasets, as same or different.
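The candidate couples can be generated with a spatial join, which keeps only the pairs of polygons whose geometries intersect. The snippet below is a sketch assuming geopandas (the text only mentions spatial intersections, not a specific library) and hypothetical file names.

```python
import geopandas as gpd

# Hypothetical input files: two Open Data sources describing the same buildings.
source_a = gpd.read_file("buildings_source_a.geojson")
source_b = gpd.read_file("buildings_source_b.geojson")

# One row per intersecting pair: these are the couples to label by hand.
candidates = gpd.sjoin(source_a, source_b, how="inner", predicate="intersects")
```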
The algorithm then learnt to give a similarity score based on the geometrical features of two polygons, which can be interpreted as the probability that the two polygons are the same one. By setting a threshold at t = 0.7 on the similarity score we obtained the following performance on our training set.
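Putting the pieces together, training and scoring could look like the sketch below, e.g. with scikit-learn. The stand-in random data replaces our labeled couples, which are not public; the t = 0.7 threshold is the one quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: in practice each row of X is the feature-vector difference
# of one labeled couple, and y is the hand-made same/different label.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 5))
y = rng.integers(0, 2, size=20_000)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# The similarity score is the predicted probability that a couple describes
# one same object; couples scoring above t = 0.7 are treated as twins.
scores = clf.predict_proba(X)[:, 1]
same_polygon = scores > 0.7
```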
We considered these values good enough, and therefore started using this algorithm to put some order in the wildness of the geospatial data universe.
May the merging begin!