Quality Metrics for OSM: Directions and POI – Transcription
Our next speaker is Jiaqi Meng. And he’s going to be talking about directions and points of interest. Go ahead. And looking at the quality of these in OSM.
It’s working. Sure. Sorry. Notes. All right. Thanks to Rasagy for bringing up that amazing conversation about the quality control of OpenStreetMap. I’m Jiaqi Meng. I’m a UC Berkeley senior student and a research assistant in the school of business. I’m doing research with a professor who gave a great talk yesterday afternoon. We’re doing research on some OpenStreetMap topics. And today I’m going to talk about some quality metrics for OSM. So OpenStreetMap is very popular and used in many different systems and applications. One particular topic that interests us is how good its performance is compared to other maps like Google Maps.
And we feel what’s important is the roads, the points of interest and, of course, the routing engine. We took the two most common types of scenarios: navigation and searching for points of interest.
So the first approach is routing comparison. I want to compare this with a commercial platform like Google Maps. Because I use Google Maps more often than not. So it’s great as a standard. This gives a fairly clear idea of how OSRM does. Our second approach is point of interest completeness. It’s basically when you go somewhere like Boulder and you want to search for pizza. How many pizza houses pop up in OpenStreetMap? And for the first part, navigation, we have two main goals. We want to know how good OSRM is compared to Google Maps, and we want to know if it varies by region. By which I mean different counties or different states.
So the first problem we have is how to find a large set of routes that covers the whole United States. One route is constructed from two single points. One origin, one destination. In order to do this, we use openaddresses.io. It’s a free database that stores millions of addresses across the states. Data is missing in many places, by which I mean the zip code. The zip code column becomes mostly empty in many less populated counties, like rural counties. As a result we wrote a simple program to determine the county ID, AKA FIPS. You feed in one point and it finds the polygon that contains it, which indicates the county you are in.
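[Editor’s sketch] The county lookup described above can be illustrated with a plain ray-casting point-in-polygon test. This is only a minimal sketch, not the speaker’s program: the function names, the toy square "counties" and their placeholder FIPS codes are all assumptions made for illustration, and real county boundaries would come from a boundary dataset.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count the edges crossed by a horizontal ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def county_fips(point: Point, counties: Dict[str, List[Point]]) -> Optional[str]:
    """Return the FIPS code of the first county polygon containing the point."""
    for fips, polygon in counties.items():
        if point_in_polygon(point, polygon):
            return fips
    return None


# Hypothetical toy data: two adjacent unit squares with placeholder codes.
TOY_COUNTIES = {
    "00001": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "00003": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(county_fips((0.5, 0.5), TOY_COUNTIES))
```

A linear scan over every polygon is fine for a sketch; with a few thousand real county polygons, a bounding-box pre-filter or a spatial index would likely be needed.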
And we chose a hundred data points, 50 origin-destination pairs, in each county to construct 50 routes in each county. And we fed all the routes into the OSM routing engine and the Google Distance Matrix API. It takes a long time because there are about a hundred thousand routes. And here is the resulting data. When I was feeding that into the Google Distance Matrix API, I set the starting time to be 2 a.m. PDT to avoid most of the traffic. Here we can see a data snippet. You can see the origin, the origin street, the destination and the FIPS, which is the county ID. How long does Google take to go from one place to the other? And how long does OSM take, and what are their distances? The units of distance are meters and time is in seconds.
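[Editor’s sketch] A minimal illustration of the sampling and routing step, assuming the routing engine is OSRM reached through its HTTP route service. The helper names and the `base_url` of a locally hosted server are assumptions; the code only builds the request URL and parses a response, it does not call any server.

```python
import random
from typing import List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


def sample_od_pairs(addresses: List[Coord], n_pairs: int = 50,
                    seed: int = 0) -> List[Tuple[Coord, Coord]]:
    """Draw 2 * n_pairs distinct addresses and pair them as origin/destination."""
    rng = random.Random(seed)
    picked = rng.sample(addresses, 2 * n_pairs)  # without replacement
    return [(picked[2 * i], picked[2 * i + 1]) for i in range(n_pairs)]


def osrm_route_url(origin: Coord, destination: Coord,
                   base_url: str = "http://localhost:5000") -> str:
    """Build an OSRM route request; OSRM expects lon,lat order."""
    (lon1, lat1), (lon2, lat2) = origin, destination
    return (f"{base_url}/route/v1/driving/"
            f"{lon1},{lat1};{lon2},{lat2}?overview=false")


def parse_osrm_route(response: dict) -> Tuple[float, float]:
    """Return (distance in meters, duration in seconds) of the first route."""
    route = response["routes"][0]
    return route["distance"], route["duration"]


# Usage with made-up coordinates and a made-up response body.
pairs = sample_od_pairs([(float(i), float(i)) for i in range(200)])
print(len(pairs), osrm_route_url((-105.27, 40.01), (-104.99, 39.74)))
```

The distance in meters and duration in seconds returned by the route service match the units mentioned in the talk.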
In this example the distances are around the same, and Google definitely used a much shorter time. So on average, out of about 100,000 routes in the U.S., Google Maps takes 24.839 kilometers while OSM takes 25.6 kilometers. Overall OSM distances are very comparable to Google Maps. The histogram on the left, I’m not sure if you can see the axes, it’s kind of small.
It plots the number of ratios in each interval. The ratio is the Google Maps distance divided by the OSM distance. The higher the ratio, the longer the route Google Maps used, and vice versa. And on the U.S. map, the dark area means that Google Maps uses a shorter distance than OSM. And the light area means OSM uses a shorter distance than Google Maps.
Okay. So, here are the data that I grouped by county. I sorted them by the last column, which is the quotient. Again, the quotient of the Google Maps distance and the OSM distance. You can see the top ten and bottom ten counties, where the OSM distance is shorter and longer than Google Maps respectively. For example, the first entry is in Colorado. The ratio is 1.32. Down below is the bottom ten, where the OSM distances are higher than the Google Maps distances.
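[Editor’s sketch] The grouping and sorting described above might look like the following. Whether the talk averaged per-route ratios or divided summed distances is not stated, so the per-route mean used here is an assumption, as are the function names and the sample numbers.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Route = Tuple[str, float, float]  # (county FIPS, Google meters, OSM meters)


def county_ratios(routes: Iterable[Route]) -> Dict[str, float]:
    """Mean of Google distance / OSM distance per county."""
    by_county = defaultdict(list)
    for fips, google_m, osm_m in routes:
        if osm_m > 0:  # skip routes the OSM engine could not measure
            by_county[fips].append(google_m / osm_m)
    return {fips: mean(ratios) for fips, ratios in by_county.items()}


def ranked(ratios: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort counties from highest ratio (OSM shorter) to lowest (OSM longer)."""
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up routes for two placeholder counties.
sample = [
    ("00001", 1500.0, 1000.0),
    ("00001", 500.0, 1000.0),
    ("00003", 3000.0, 2000.0),
]
print(ranked(county_ratios(sample)))
```

A ratio above 1 means Google Maps used the longer route; below 1 means OSM did.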
Okay. So here is a box plot. Notice that I separate the U.S. into three regions. The west, the mid and the east. The color stands for the regions, basically. There are some spikes. You can see that Massachusetts is high above everyone else. And Indiana and North Carolina are kind of at the bottom. For those of you who are not familiar with a box plot, the middle line inside each rectangle is the median. And the top is the third quartile, 75%, and the bottom is the first quartile, 25%. Each point, if you can see them, is basically a county.
And yeah. It would be more clear if I sorted it. You can see that in the west and east, Google almost always has a higher, I’m sorry, a longer distance than OSM. The ratio is higher. I normalized the Y axis so some of the ratios are below zero.
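[Editor’s sketch] The three lines of each box can be computed from the per-county ratios of a state. This sketch uses Python’s `statistics.quantiles` with the inclusive method, which is one common quartile convention and not necessarily the one behind the slide; `box_stats` is a hypothetical helper name.

```python
from statistics import quantiles
from typing import Dict, List


def box_stats(values: List[float]) -> Dict[str, float]:
    """First quartile, median and third quartile of one state's county ratios."""
    if len(values) == 1:
        # A state with a single county collapses to a single line.
        only = values[0]
        return {"q1": only, "median": only, "q3": only}
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3}


# Made-up county ratios for one state.
print(box_stats([0.9, 1.0, 1.1, 1.2, 1.3]))
```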
So here is an example. On the lefthand side is OSM, on the righthand side is Google Maps. As you can see, OSM takes a shorter route. It’s definitely a lot shorter than Google Maps. And its time is shorter than Google Maps’ as well.
Here is another. I’m not sure why the coordinate is so shifted. The coordinates basically map to the same point. Google Maps on the righthand side takes a large detour and ends up at the same point. And OSM stays where it is, which makes sense.
So our conclusion, comparing in terms of routing quality, is that they are pretty close, and more so in the east and the west. The differences are partly because the routing algorithms are different, and I don’t fully understand the one in the OSM routing engine. And partly because Google Maps has more up-to-date data and more traffic information. Okay.
So, moving on to the second part. It’s points of interest. The ability of a map to locate points of interest can mean a lot for users who want to get to a place quickly. We have two questions: how well can OSM locate POIs, and does that vary by region? To study how well OpenStreetMap can locate POIs, we chose 110,000 restaurants across 3026 counties across the U.S. to test OpenStreetMap. We used a proprietary thirdparty point of interest database to find restaurants. And from the whole dataset we take 50 samples from each county without replacement.
Then we feed the coordinates into the Overpass API. For each coordinate you have a circle with a radius of a hundred meters. You can set it at 500, doesn’t matter. And any property in the circle will be returned. If the property name is quite similar to the point of interest’s expected name, we mark it as found. This is how we check the existence of a point of interest name. And here you can see about five entries of the data in a shortened version, because there are a bunch of unused columns in between.
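[Editor’s sketch] One way to express the radius search and the name check. The Overpass QL `around` filter is real, but the rest is assumption: the talk does not say which similarity measure or threshold was used, so `difflib.SequenceMatcher` and the 0.8 cutoff are stand-ins, and the function names are hypothetical. The code builds the query text and scores already-fetched elements; it makes no network call.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def overpass_around_query(lat: float, lon: float, radius_m: int = 100) -> str:
    """Overpass QL for named nodes and ways within radius_m of a coordinate."""
    return (
        "[out:json];"
        "("
        f'node["name"](around:{radius_m},{lat},{lon});'
        f'way["name"](around:{radius_m},{lat},{lon});'
        ");"
        "out center;"
    )


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_found(expected_name: str, elements: List[Dict],
             threshold: float = 0.8) -> bool:
    """True if any returned element has a name close to the expected one."""
    for element in elements:
        name = element.get("tags", {}).get("name", "")
        if name and name_similarity(expected_name, name) >= threshold:
            return True
    return False


# Made-up response elements in the shape of Overpass JSON output.
elements = [{"type": "node", "tags": {"name": "McDonalds"}}]
print(overpass_around_query(40.0, -105.0))
print(is_found("McDonald's", elements))
```

A fuzzy comparison rather than exact equality tolerates small spelling differences such as a missing apostrophe, at the cost of occasional false matches between similarly named places.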
Alibaba, I believe it’s a county and it’s 400 South Memorial Drive. The last two columns are our columns. Second to last is the matching score. Zero means it doesn’t get matched in OSM. You can try it on OSM right now. There is nothing in there. And Google Maps has it. I just tried last night. When there is a match, there is a matching score, and there will be a JSON response where you can get detailed information about that restaurant.
So here’s a map overview. Out of 110,000 randomly sampled restaurants across about 3,000 counties, only 11%, which is 11,000, are found in OpenStreetMap. I couldn’t say it’s a high score. We found that more densely populated areas tend to carry more matching restaurants. The matching probability is related to the date when the restaurant was first recorded in the yellow pages. The thick book. So the newer the restaurant, the lower the probability that it is on OpenStreetMap, which makes sense. Because, relying solely on community contribution, OSM has a limited ability to keep the point of interest database up to date.
On the left is a U.S. map, again. The color means the matching score. You can see that some rural areas have almost a zero matching score. Here are the best ten counties in point of interest completeness. Some of the states and counties are locations where previous OpenStreetMap conferences were held. Washington, Seattle, D.C., Colorado, El Paso. I believe it’s a few kilometers down below where the border is. And the average score you can just multiply by a hundred. It’s a fraction, not a percentage. And the counties sampled, yeah.
Okay. So here’s another box plot. The axis is, again, by the state number. As you can see, 08 means Colorado. It’s about here. This is Colorado. So the first one is high up there. Because D.C. has only one county, which is D.C. itself. So it’s one thing. It’s high up there. Thanks to the D.C. contributors.
The color, again, means east, mid and west. Sorry, I couldn’t get the colors consistent across the slides. So these are sorted by the matching score on the Y axis. You can see that almost all of the top ten entries from the righthand side are in the east or west.
So one further question we want to ask, which we find pretty interesting, is which fast food restaurants get matched most? Is it KFC? Burger King or McDonald’s or something else? To answer this question, I separated the data into two groups. Basically non-fast food and fast food. Non-fast food is about 80%. The highest bar is, again, D.C. Followed by some states in the west, in red. On the upper left is the mean of each region, west, east and middle.
The middle has the lowest matching score. Which is 7.3. So for fast food chains we have about 20% of these examples. I picked some of the most popular fast food chains. KFC, Arby’s, I have never been there. McDonald’s. I love McDonald’s. And D.C., again, is the highest bar. I’m not sure why D.C. is so high. Anyone know? It’s the highest bar. It’s around 30%. And on average the west has a 23% matching score. The mid has 22%. Sorry, the east has 20% and the mid has 15%. Compared to the last slide, it’s about two times higher for known fast food chains. And the mean for all is 20%.
So guess who’s the highest? Nope. Subway’s actually one of the lowest. The highest is McDonald’s. McDonald’s has a 32% matching score. And the color stands for the number of entries it has in the sample. Basically, it’s pretty small. The legend is over on the very right. McDonald’s has about 5,000 entries and also has a very high matching score, which is 32%. While Subway has around 6,000, or 5500, and has a very low matching score. Second to last.
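[Editor’s sketch] The per-chain numbers above are a count and a match rate for each group. A minimal version of that aggregation follows; the record layout and the sample values are made up for illustration and are not the talk’s data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def match_rate_by_group(
    records: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[int, float]]:
    """Map each group (chain, state or region) to (entries, share found in OSM)."""
    totals = Counter()
    hits = Counter()
    for group, found in records:
        totals[group] += 1
        hits[group] += 1 if found else 0
    return {group: (totals[group], hits[group] / totals[group])
            for group in totals}


# Made-up records: (chain name, whether a matching name was found in OSM).
records = [
    ("Chain A", True), ("Chain A", False), ("Chain A", True), ("Chain A", False),
    ("Chain B", False), ("Chain B", False), ("Chain B", False), ("Chain B", True),
]
for chain, (n, rate) in sorted(match_rate_by_group(records).items(),
                               key=lambda item: item[1][1], reverse=True):
    print(chain, n, f"{rate:.0%}")
```

Keeping the entry count next to the rate matters because, as the talk notes, chains with very few sampled entries give unreliable rates.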
I don’t think I included some, you know, fast food chains that are only on the West Coast. Because there are just too few entries. And also one fact that I found interesting is that, well, OSM contributors probably like burgers more. Because the first entries are McDonald’s, Wendy’s, Burger King. And the last entries are Papa John’s, there’s not a Pizza Hut. Either OpenStreetMap contributors love McDonald’s, or McDonald’s employees are OpenStreetMap contributors.
So this is my conclusion. Let me close it. In the west and east, points of interest have a better matching average. And fast food giants are more likely to appear in OSM’s database. Directions are comparable to commercial providers like Google Maps, but the POI data really needs improvement. Yep.
So just in case anyone found my data to be interesting, you can download the data from GitHub. It took us really quite a long time to gather all the data because you have to call every API or set up the whole database on your AWS server. So it’s really precious data. And thank you so much.
[ Applause ]
Thanks, Jiaqi, we are running a little over time. But I want to have the opportunity for one question. We will munch into the break a little bit. We are going to start the next session at 11:10 to get out for the photo. Anyone have a nice short question for Jiaqi? Or you’re welcome to talk to him later as well.
AUDIENCE: Did you feel that you need to take into account the known algorithm that OSM uses for routing versus the unknown black box that is Google? Did you take that into account at all?
I took that into consideration, but we couldn’t figure out how either of them works. We couldn’t find how Google’s navigation algorithm was written in the backend. And I would have had to find out how OSM created their routing algorithm.
So for that I think it needs further investigation. Because both of them are black boxes. And I can only compare the distance rather than the time. Yeah. Thank you.
Thank you so much.