Who's on First 👊 OpenStreetMap

Who's on First 👊 OpenStreetMap – Transcription

We will invite Aaron Straup Cope from Mapzen to come up.
Hi. I haven’t been able to be back at State of the Map since 2014, so it is lovely to be here, and thank you for inviting me to speak. I wanted to do a quick bit of housekeeping before I get started. The full title of this talk is “Who’s On First, fist bump, colon, OpenStreetMap”, where the fist bump is an emoji short code. There is no fist bump emoji yet, but there will be soon. Right now the left-facing fist and the right-facing fist have been added to Unicode 10; all that is left for the fist bump emoji is a combining character. The last slide that I just showed was an image because I prepared this talk on a different computer, and the laptop that I brought to the conference still cannot display Unicode 10 characters. So somewhere between submitting this talk for consideration and its inclusion in the conference, the title was lost in translation. Instead of newly minted Unicode characters, which would not render on people’s screens, or the fake emoji short code, we used the more readily available oncoming fist character. I’m not bothered by that, except that the title has a slightly more threatening tone than I ever intended. So, for the record, punching OSM is not what I had in mind. This is more of what I had in mind.
And, if anyone does not know what is going on here, these are two people dressed up as the Wonder Twins. They are Saturday morning cartoon characters who battle for justice alongside familiar characters such as Batman and Wonder Woman. The Wonder Twins would activate their super powers by fist bumping each other, and each sibling would transform into complementary objects. The most ridiculous pairing was one twin turning into an eagle and the other into a bucket of water. The ’70s were weird like that, in a way we don’t have to talk about today, except that Who’s On First would like to be the bucket of water to OSM’s eagle. Okay, Who’s On First.
Who’s On First is a gazetteer. If you have not heard the term before, that is a giant phone book, but a phone book of places rather than people. I will spend this talk talking about venues. Who’s On First has been around for two years, and 40,000 words of theory and engineering decisions have already been written, so I’m not going to talk about that. What I would like to do is spend a moment on a high-level overview of the project, to describe the shape of the elephant, so to speak.
To begin with, Who’s On First is an openly licensed dataset. At its most restrictive, data is published under an attribution license; at Mapzen, it is published under a public domain license. Every record in Who’s On First has a stable, permanent, and unique identifier. There are no semantics encoded in those IDs. At rest, every record is stored as a plain-text GeoJSON file. Our goal is to ensure that Who’s On First embodies the principles of portability and durability, with text files as the common method of delivery. And records are supplemented by an arbitrary number of other properties specific to that place. There’s a finite number of place types in Who’s On First, and they all share a common set of ancestors. As with properties, any given record may have as complex a hierarchy as circumstances demand. But what is important is that there’s a shared baseline hierarchy across the entire dataset. Individual records may have multiple geometries, multiple hierarchies, sometimes both. And records can be updated, superseded, and sometimes even deprecated. Once a record is created, it cannot be removed or replaced.
And, most importantly, by design, Who’s On First is meant to accommodate all of the places. Who’s On First is not meant to be a carefully selected list of important or relevant places. It is not meant to be the threshold by which places are measured. It is, we hope, meant to be the raw material from which many thresholds might be created.
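To illustrate the record shape described above, here is a minimal sketch in Python. It is not from the talk: the property names (`wof:id`, `wof:placetype`, `wof:hierarchy`, `wof:superseded_by`) are assumed for illustration, and every ID, name and coordinate is made up.

```python
import json

# Hypothetical record: all IDs, names and coordinates are invented for illustration.
record = {
    "type": "Feature",
    "id": 1001,
    "properties": {
        "wof:id": 1001,              # opaque identifier, no semantics encoded in it
        "wof:name": "Example Venue",
        "wof:placetype": "venue",    # one of a finite set of place types
        # One dict per hierarchy; a record may carry more than one.
        "wof:hierarchy": [
            {"country_id": 1, "region_id": 20, "locality_id": 300, "neighbourhood_id": 4000}
        ],
        "wof:superseded_by": [],     # IDs of records that replace this one, if any
    },
    "geometry": {"type": "Point", "coordinates": [-122.42, 37.76]},
}


def is_superseded(feature):
    """A record is never deleted; it only points at whatever replaced it."""
    return bool(feature["properties"].get("wof:superseded_by"))


def to_text(feature):
    """Serialize a record as plain-text GeoJSON with stable key ordering."""
    return json.dumps(feature, indent=2, sort_keys=True, ensure_ascii=False)
```

Sorting keys on output is a design choice for this sketch only: it keeps text diffs small when a record is edited and stored in version control.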
From Who’s On First’s perspective, the problem is not whether a place is important or relevant, whether the place exists anymore, or whether its existence is disputed. What is relevant for Who’s On First, critically, is that people believe those places to exist or to have existed. This is why I sometimes refer to it as a gazetteer of consensual hallucinations. This will start animating in a moment, hopefully. This is work that my colleague, Stephen Epps, did earlier last year to show how Yugoslavia changed in the late 20th century. One of the ways I like to talk about this is that Sarajevo in 1998 is very different from Sarajevo in 2003, and in 2017.
I will let this run on a loop, and some of you have tweeted about the last slide: specifically, that OpenStreetMap and Who’s On First have different licensing requirements, which prevents data exchange in both directions. Now, I just said license, which means that somebody somewhere has won OSM bingo. But, more seriously, I would like to make one thing clear before I go on: I am not here, as an employee of Mapzen or as an individual, to ask or suggest in any way that OpenStreetMap revisit the decision to adopt the ODbL. My personal feeling is that OpenStreetMap, by having accomplished the impossible in 13 short years, has earned the right to do whatever it wants. If OpenStreetMap chooses to have the ODbL, that’s fine with me. At the same time, I need and want an openly licensed database of locations that can be used and adopted in both commercial and closed projects without any restrictions beyond attribution. This is the space that Who’s On First occupies, and that is why we will not, we cannot, import data from OSM.
So, about this time two years ago, I had a little freak-out at work. We had been discussing the availability and viability of open venue datasets. And the reason I freaked out is that, when you take OSM off the table, for all the reasons discussed, there are effectively no openly licensed datasets for venues.
And I was starting to wonder what we were talking about. Accurate and up-to-date venue data is a difficult and daunting challenge. There are a few companies that have built successful businesses collecting and reselling that data. Other companies, like Facebook, have a similar and superior catalog of data, almost by accident, as a by-product of their day-to-day work. So far, none of these companies have seen fit to share any of that data. That leaves two alternatives. The first is the gaping void of nothingness that we have all come to know and love. The second is to embrace the idea that something is better than nothing, and that something you can improve upon over time is even better than that. To embrace, or accept, that our burden in the near term, and this is a burden we are going to have to carry for the foreseeable future, is one of managing absence. In addition to the hard work of vetting, collecting, and improving data, we need to think about ways that we talk to people about the data that we don’t have, and to develop interfaces for buffering and tempering that absence. This makes an already daunting project exhausting to even think about.
The good news is that there is, in fact, one open database for venues that was available for use in 2015. In 2010, the geo services company SimpleGeo published their places database, containing 21 million business listings, under a Creative Commons Zero public domain license. Now, the company went out of business shortly after that, but not before Jason Scott, of Archive Team fame, managed to grab a copy of the data and put it on the Internet Archive. There’s a cautionary tale there, but that’s a different story. The first thing to know about the SimpleGeo data is that it is a flawed dataset. For starters, it is almost 8 years out of date. Now, that is what we say, but when we say things are out of date, what we really are saying is that there’s no new stuff.
When we say there is no new stuff, what we are often really saying is that there’s no new stuff for a particular kind of business, typically those focused on snacking, grooming, and nesting, and aimed at a particular demographic: 18-to-35-year-olds with disposable income. But a lot of businesses survive longer than 17 years. Think of the butcher that has been in your town for the last 75 or 80 years. We could go on. And, if we’re lucky, sooner or later we all get older and eventually need to call a plumber, an electrician, a notary, that kind of thing. And that is the stuff that is in the SimpleGeo dataset. So to say that it is out of date has always struck me as somewhat inaccurate and unfair.
Now, the data is heavily slanted towards the U.S. Sixty percent of the venues in the SimpleGeo dataset are in the U.S. It claims to cover 60 countries, but that is disingenuous when you consider that all but a few countries have fewer than 100 venues in them. It is weighted towards professional services: a lot of accountants, doctors, and lawyers, which makes it difficult to see the forest for the trees. There’s a lot of bad data, and none of us should be the first to cast a stone; we all know bad data. There is minimal structural, relational data. The only way to find records for a given locality or neighborhood is to load everything into a database and start performing spatial queries. And everything was published as a single line-separated GeoJSON file, all 21 million records. If I had been there at the time, I might have done the same thing. As a consumer, that decision is nothing but 100 percent sad-making.
But it is a good start. The first thing that we did was to explode that 21-million-line text file into 21 million discrete records, each with a new Who’s On First ID. It has forced us to think about how, from an operational point of view, we manage editing, storing, and distributing that much data.
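The “explode” step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project’s actual tooling: the function names are hypothetical, new IDs come from a local counter passed in by the caller, and the nested directory layout derived from the ID is just one common way to avoid putting millions of files in a single directory.

```python
import json
from itertools import count
from pathlib import Path


def id_to_path(record_id):
    """Map an ID to a nested relative path, e.g. 101736545 -> 101/736/545/101736545.geojson."""
    s = str(record_id)
    parts = [s[i:i + 3] for i in range(0, len(s), 3)]
    return Path(*parts) / f"{s}.geojson"


def explode(lines, root, ids):
    """Write each line-delimited GeoJSON feature to its own file under a freshly assigned ID."""
    written = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the source file
        feature = json.loads(line)
        new_id = next(ids)
        props = feature.get("properties") or {}  # GeoJSON allows null properties
        props["wof:id"] = new_id                 # assumed property name
        feature["properties"] = props
        feature["id"] = new_id
        path = Path(root) / id_to_path(new_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(feature), encoding="utf-8")
        written.append(path)
    return written
```

A call such as `explode(open("places.geojson", encoding="utf-8"), "data", count(1001))` would stream the source file one line at a time; the file name and starting ID here are placeholders.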
Without getting into all the details, trying to put 21 million files into a single Git repository is challenging enough as to be impossible. And simply having 21 million files on a single volume in 2017 still causes people to run out of inodes, and try explaining that to someone who doesn’t understand or care about file systems. But there are more than 21 million venues in the world, and we are going to have to figure these things out eventually, so we might as well start now. And this is what it looks like.
And if all the venues in the world weren’t enough, we also want to be able to track historical venues. So we will pretend that Who’s On First existed in 1985, and that a writer published a review of the Palace Steak House, associating the Who’s On First ID with their article. It is important to us that both the author and the readers of that article can rely on that ID forever, with the confidence that its meaning won’t change, even though the Palace Steak House went out of business in 2009, and that space (drop in audio).
By adding the hierarchy, we make it possible to find all the venues in a neighborhood, in this case the Mission, with nothing more than a two-part key-value query. Sometimes I describe Who’s On First as a fancy accounting tool. It does not sound that exciting, but it lowers the barrier to entry for people to do something with the data. It is early days, and we are doing work to track brands, assigning each their own IDs and records. It is primitive right now, but it is something to build on. A medium-term goal we have is to work with actual brands and have them contribute to, and take responsibility for maintaining, their venues inside of Who’s On First. We will see if that works, but it is good to have goals. And, in addition to static downloads, everything is available through the Mapzen Places API. It takes about five minutes to grab all 76,000 venues in San Francisco and dump them out as a GeoJSON feature collection.
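The “two-part key-value query” mentioned above can be sketched like this. It is an in-memory illustration only, not the Places API: the function names are hypothetical, and the `wof:placetype` and `wof:hierarchy` property names and the `neighbourhood_id` key are assumptions.

```python
def venues_in(records, key, place_id):
    """Return venue records whose hierarchies contain key == place_id.

    The query has two parts: an ancestor key (for example "neighbourhood_id")
    and the ID of the place you are looking for. No spatial query is needed.
    """
    matches = []
    for rec in records:
        props = rec.get("properties", {})
        if props.get("wof:placetype") != "venue":
            continue
        # A record may have several hierarchies; any one matching is enough.
        if any(h.get(key) == place_id for h in props.get("wof:hierarchy", [])):
            matches.append(rec)
    return matches


def to_feature_collection(features):
    """Wrap a list of GeoJSON features in a FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}
```

For example, `to_feature_collection(venues_in(records, "neighbourhood_id", 4000))` would produce a single GeoJSON document for one neighborhood, where 4000 is a made-up ID.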
And I could easily spend the entire talk discussing the UI and UX challenges; the OSM community should be proud of what they have done in that respect. We are a small team, and the limiting factor is the short number of hours in a day. This is by another colleague, Dan Phiffer, about who can add tags to Who’s On First. Dan spent last summer in Korea and Taiwan, where we don’t have data, and where tiles and APIs come to a screeching halt when you are traveling on a throttled data plan. So what he does during the day is take pictures of venues, points of interest, and signage, and upload them to the map to create records for Who’s On First.
And finally, we have been working with Al Barrentine, who wrote the address parser. We asked Al to take a first pass at address de-duplication. Address parsing is the first place where you go to give up. And it is clear that the first challenge we have, every single time, is reconciling what we already have in Who’s On First with whatever new offering is on the table. So we don’t have the luxury of giving up on the problem of address de-duplication. The agreement with Al was not to solve all the things, but to build something that can be improved on going forward. When I was building this, I was testing things on the usual suspects in London. One of the benefits of the software is that, when we establish a match between our venues and an official listing, that is, public data from all of these cities, we can flag the venue as current and remove the ambiguity that surrounds it, which is built into the idea of taking a random dataset of 21 million places off of the internet. We can add concordances back to the city and their records. We want to hold hands with everybody. And as we finish, the next phase of work is to de-duplicate the dataset against itself, and to clean up the data even more. The last slide. One of the things that is nice about the software is that it works with fuzzy and approximate locations for venues.
It is possible to use the latitude and longitude of the postal code associated with the address, and the software will still find successful matches. What that means concretely is that basically every health and safety inspection record ever published in the U.S. is now fair game for address de-duplication, which is kind of cool. If you have ever looked at it, that is what they call it. We will get it from the health inspectors, and then we look at it, and there is no coordinate data. So what we can do is validate records we already have and start to input new ones with a high, if not perfect, degree of confidence, and start chipping away at the “no new stuff” problem.
Some of you may be wondering what this has to do with OpenStreetMap. And the honest answer is, I don’t know, but something, I hope. If Who’s On First can’t use OSM data, then we would like to make sure that it is easy for the OpenStreetMap community to use our data, if and when it chooses to. Maybe that means all we do is establish concordances between our venues and your venues, and maybe it means something more. My hope, at the end of this talk, is that I have raised enough interest and questions that we might answer the question together. Thank you.
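The de-duplication idea described in the talk, matching venues on name and address while tolerating only approximate coordinates such as a postal code centroid, can be sketched as follows. This is not the address parser or the matching software mentioned in the talk: it swaps in plain string similarity from Python’s standard library, and the function names, record fields, and thresholds are all assumptions chosen for illustration.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_probable_duplicate(a, b, max_km=5.0, min_similarity=0.85):
    """Compare two venue dicts with 'name', 'address', 'lat' and 'lon' keys.

    The distance threshold is deliberately loose so that a record located
    only by its postal code centroid can still match a precisely located one.
    """
    near = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    name = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    addr = SequenceMatcher(None, normalize(a["address"]), normalize(b["address"])).ratio()
    return near and name >= min_similarity and addr >= min_similarity
```

A real system would parse and normalize addresses far more carefully (abbreviations, unit numbers, transliteration); the point of the sketch is only that a loose location check combined with text similarity can stand in for exact coordinates.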
(Applause).
(Speaker far from mic). The short answer is yes. The question was: Have we looked at OpenAddresses? Yes.

How do you prevent, or do you even care, if there’s, like, a cannibalistic relationship between Who’s On First and OSM, meaning that basically the data that you generate continues to thrive within OSM, and not in Who’s On First? Do you care about this, or see it as a challenge?
So I don’t see it as a challenge. I would love it if the data made its way into OpenStreetMap, because then it would be available to all the people using OpenStreetMap for all of their tools. And again, I want to reiterate, the issue is not asking OpenStreetMap to change the license. It is just that the decision the community made to adopt the ODbL is problematic for some projects, which is where Who’s On First fits in. If we can shuttle the data back to you, great.
Going back to your Yugoslavia slide, are you saying that Yugoslavia would be a record in Who’s On First?
Yes, it would be a record, and it would have a cessation date, and it would be superseded by whichever was the next Yugoslavia.
Would you try to go back to more historical things, like the Holy Roman Empire, or the USSR or something?
The short answer is yes. The long answer is that we don’t have time to work on that day to day, so, from an engineering and an architectural perspective, we’ve tried to set things up so that it will come naturally and be easy for people when we get there. But there’s a bunch of history buffs on the team.
And what about indigenous areas, like the Navajo and Iroquois nations?
Yes.
That’s pretty cool. Thank you very much.
(Applause).
That’s the end of our sessions for today. I just wanted to remind everybody, there will be a social right over here on the rooftop. You can see it through the windows, and the way to get there is to go straight down the hall, through the other meeting room, and there’s a doorway apparently out that way, I have been told. So anyway, hope to see you all at the social tonight, and we will get a chance to catch up and talk more there. Thank you.
Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx