This article was written by Adrian Bona as a draft for a talk at State of the Map US in Boulder, Colorado this past month. The talk did not make it into the program, but the technology lives on as a central part of our OpenStreetMap technology stack here at Telenav. We will continue to deliver weekly Parquet files of OSM data. Adrian has recently moved on from Telenav, but our OSM team is looking forward to hearing from you about this topic! — Martijn
Getting started with OpenStreetMap at a large scale (the entire planet) can be painful. A few years ago we were intrigued to see people waiting hours or even days for a piece of OSM to be imported into PostgreSQL on huge machines. But we said OK … this is not Big Data. Meanwhile, we started working on various geospatial analyses built on a Big Data stack that used OSM, and we were intrigued again: the usual way to handle the OSM data was to run osmosis over the huge PBF planet file and dump CSV files for the various scenarios. Even if this works, it is sub-optimal, so we wrote a converter from OSM to Parquet, a big-data-friendly columnar format. The converter is available at github.com/adrianulbona/osm-parquetizer. Hopefully, this will make the valuable work of so many OSM contributors easily available to the Big Data world.
How fast?
Less than a minute for Romania-latest.osm.pbf and ~3 hours (on a decent laptop with SSD) for the planet-latest.osm.pbf.
Getting started with Apache Spark and OpenStreetMap
The converter mentioned above takes one PBF file and not only converts the data but also splits it into three files, one for each OSM entity type. Each file is essentially a collection of structured data (a table). The schemas of the tables are the following:
node
|-- id: long
|-- version: integer
|-- timestamp: long
|-- changeset: long
|-- uid: integer
|-- user_sid: string
|-- tags: array
| |-- element: struct
| | |-- key: string
| | |-- value: string
|-- latitude: double
|-- longitude: double
way
|-- id: long
|-- version: integer
|-- timestamp: long
|-- changeset: long
|-- uid: integer
|-- user_sid: string
|-- tags: array
| |-- element: struct
| | |-- key: string
| | |-- value: string
|-- nodes: array
| |-- element: struct
| | |-- index: integer
| | |-- nodeId: long
relation
|-- id: long
|-- version: integer
|-- timestamp: long
|-- changeset: long
|-- uid: integer
|-- user_sid: string
|-- tags: array
| |-- element: struct
| | |-- key: string
| | |-- value: string
|-- members: array
| |-- element: struct
| | |-- id: long
| | |-- role: string
| | |-- type: string
Now, loading the data in Apache Spark becomes extremely convenient:
val nodeDF = sqlContext.read.parquet("romania-latest.osm.pbf.node.parquet")
nodeDF.createOrReplaceTempView("nodes")

val wayDF = sqlContext.read.parquet("romania-latest.osm.pbf.way.parquet")
wayDF.createOrReplaceTempView("ways")

val relationDF = sqlContext.read.parquet("romania-latest.osm.pbf.relation.parquet")
relationDF.createOrReplaceTempView("relations")
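Before going further, it can help to check that the files match the schemas listed above; a quick sanity check in the Spark shell (just a sketch using the DataFrames loaded above) might look like this:

// print the node schema (it should match the listing above) and peek at a few rows
nodeDF.printSchema()
nodeDF.select("id", "latitude", "longitude").show(5)

// ways keep their geometry as an ordered array of node references
wayDF.select("id", "nodes").show(5, truncate = false)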
From this point on, the whole Spark world opens up: we can either play around with DataFrames or use the SQL we all know and love. Let’s consider the following task:
For the most active OSM contributors, highlight the distribution of their work over time.
The DataFrames API solution looks like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{count, year}
import org.apache.spark.sql.types.TimestampType
import sqlContext.implicits._ // for the $ column syntax

// the raw timestamp is in milliseconds, so convert it to a proper timestamp column
val nodesWithTime = nodeDF
  .withColumn("created_at", ($"timestamp" / 1000).cast(TimestampType))
nodesWithTime.createOrReplaceTempView("nodes")

// find the ten contributors with the most nodes
val top10Users = nodesWithTime
  .groupBy("user_sid")
  .agg(count($"id").as("node_count"))
  .orderBy($"node_count".desc)
  .limit(10)
  .collect
  .map { case Row(user_sid: String, _) => user_sid }

// count their nodes per year and register the result for later use
nodesWithTime
  .filter($"user_sid".isin(top10Users: _*))
  .groupBy($"user_sid", year($"created_at").as("year"))
  .agg(count("id").as("node_count"))
  .orderBy($"year")
  .createOrReplaceTempView("top10UsersOverTime")
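To quickly inspect what ended up in the registered view (a small sketch, nothing more), it can be queried right away:

// show the per-year node counts for the top contributors
sqlContext.sql("select * from top10UsersOverTime").show(50)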
The Spark SQL solution looks like this:
select user_sid, year(created_at) as year, count(*) as node_count
from nodes
where user_sid in (
  select user_sid from (
    select user_sid, count(*) as c
    from nodes
    group by user_sid
    order by c desc
    limit 10
  )
)
group by user_sid, year(created_at)
order by year
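To actually run the statement from Scala, wrap it in a string and hand it to Spark SQL; a minimal sketch, assuming the query text above is stored in a variable named query:

// `query` holds the SQL statement shown above as a (multi-line) Scala string
val top10UsersOverTimeDF = sqlContext.sql(query)
top10UsersOverTimeDF.show(50)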
Both solutions are equivalent and give the following results:
Even though we have only touched a tiny piece of OSM, nothing stops us from analyzing it further and extracting valuable insights from it in a scalable way.
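For example, here is a sketch of one such analysis, using the way table loaded earlier: it counts the most common highway types by exploding the tags array (column and view names as defined above):

import org.apache.spark.sql.functions.explode

// each way carries its tags as an array of (key, value) structs;
// explode the array and keep only the highway tags
wayDF
  .select(explode($"tags").as("tag"))
  .filter($"tag.key" === "highway")
  .groupBy($"tag.value")
  .count()
  .orderBy($"count".desc)
  .show(10)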
If you are curious about more advanced interaction between OpenStreetMap and Apache Spark, take a look at this Databricks notebook.
OpenStreetMap Parquet files for the entire planet?
Telenav is happy to announce weekly releases of OpenStreetMap Parquet files for the entire planet at osm-data.skobbler.net.

