OpenRefine

OpenRefine transformation using the GeoNames Web Services API

Report March 17, 2015, from Sandra McIntyre, sandra.mcintyre@utah.edu, (801) 585-0969

Over the last couple of weeks, I experimented and learned more about the capabilities of OpenRefine, and once again I am impressed with how powerful it is for cleaning and transforming metadata. From my explorations, I think that OpenRefine will be very useful for people who want to include GeoNames information with their metadata and want to process more than one record at a time without a lot of re-typing. See my notes at the end of this section about using the OpenRefine interface and using GeoNames API.

I demo'd to the Task Force on 3/17/2015 one way I transformed spatial metadata. My example starts with an Excel spreadsheet of typical metadata, including LCSH-based and LCSH-like placenames, like “Ogden (Utah); Weber County (Utah)”. Then I did the following on the Spatial fields in the spreadsheet using OpenRefine scripting that ran on all records at one time:

    1. Cleaned up the text:
      1. Separated multiple values into separate rows, so each can be processed separately.
      2. E.g., changed ”Ogden (Utah); Weber County (Utah)” to one row with “Ogden (Utah)” and one row with “Weber County (Utah)”.
      3. Deleted the parentheses and added a comma to separate city and state.
      4. Result for “Ogden (Utah)”: “Ogden, Utah”.
    2. Used the GeoNames Web Services API to search on the text string and return the GeoNames information on the first result of the search (function: “search”). There are often many results.
    3. Result: long string of many XML fields
    4. [Needs improving: optimize the query so that the first search result is highly likely to be the result you want.] [Needs exploring: returning JSON may be better than XML. I couldn’t get the JSON returned to parse, so I used XML instead.]
    5. Parsed from the XML just the GeoNames ID number for that place (called “geonameId”).
    6. Result: “5779206”
    7. Parsed from the XML just the GeoNames latitude and longitude for that place (called “lat” and “lng”). Concatenated them into one string.
    8. Result: “41.223, -111.97383”
    9. Used the GeoNames Web Services API with the GeoNames number to get the hierarchy of names describing that place (function: “hierarchy” or “hierarchyJSON”).
    10. Result: long string of JSON
    11. [Needs more exploring: I think you can run an additional step to REVERSE the order of the results of the hierarchy, i.e., reverse the array, but I didn’t have time to test it.]
    12. Parsed from the JSON just the GeoNames name (called “name”) in English of each part of the hierarchy.
    13. Result: “Earth, North America, United States, Utah, Weber County, Ogden”
    14. Concatenated the values in #3, #4, and #6 above to create one complete field entry. (I got the order of the three values wrong, but it’s easy to change.)
    15. Result: “41.223, -111.97383; http://sws.geonames.org/5779206/; Earth, North America, United States, Utah, Weber County, Ogden”

The steps above are by no means the only option with OpenRefine, and there are other sequences of steps that should be run if the starting metadata has other characteristics. There are also probably more elegant ways to do this particular sequence than what I did. Please find them! And share with the rest of us! Thanks!

Notes about my experience with using OpenRefine:

    • It is free.
    • The package wasn’t hard to install on the Mac. Someone else installed it on my PC, so I don’t know how hard it was to do that.
    • The interface seems to have some glitchy things on the Macintosh at least. Sort of what you would expect from an open source tool! For example, extracted code wasn’t visible some of the time, even though it was there and I could copy it once I realized it was still there anyway.
    • There are obvious things I would like to do with its interface that weren’t easy to see how to do, like change the width of a column. I felt uncoordinated most of the time, made mistakes, went places I didn’t mean to go. It was time-consuming.
    • Having lots of real estate on your screen is necessary.
    • Documentation: There is one book that gives a good overview but doesn’t have many examples, Using OpenRefine. It’s pretty easy to look examples up online on different sites and groups, but in general I would say that it isn’t well documented, and you have to be prepared to hunt for what you want to do. Fortunately, the interface shows you the results instantly of what you try and gives you errors that are useful.

Notes about my experience with using GeoNames API Web services:

    • It is easy to get an “API key,” which is what you need to run queries. There is a daily limit on free hits, I believe, but I haven’t gone over the limit yet. My experiments have been small, though.
    • The Web Services are strong on certain things you want to do, but other things are missing.
    • The ticket to ride seems to be to get the GeoNames ID number (called “geonameId” in the code). If you get that, then you can retrieve anything else you need.
    • The services for retrieving things based on lat, lng also seem pretty strong.
    • The service is reasonably well documented, but the full set of options for each function take a little digging sometimes.
    • Applying someone else’s recipe in full probably won’t work for you. Each of us will need to commit to learning enough to write our own transforms based on the particular structure of the metadata we are starting with. So I don’t think that our Task Force could say, for example, “Just copy this extracted code into your project and Bob’s your uncle.”

Getting Started with your own OpenRefine experiments

You can get started quickly to experiment yourself:

    • Install OpenRefine on your PC or Mac.
    • Look at book called Using OpenRefine for overview of the tool.
    • Watch the three introductory videos on the OpenRefine.org home page.
    • Register for a username (API key) for GeoNames on the Login page. (Don’t use Sandra’s username. Each username is allowed a fixed number of hits daily.)
    • Look at GeoNames Web Servicea API documentation for information about how to run queries.

Sandra's initial files:

To experiment yourself with this, you are welcome to use these files. The files are attached at the bottom of this page.

  • Geospatial-Experiments.google-refine.tar.gz -- the compressed project file that I created in OpenRefine and then exported. Place this file in a local directory (not a mapped directory on a shared network). Then open it as your project file in OpenRefine. You can Extract the steps I used and apply them to your own project if you like. You will need to register for your own username in GeoNames. Please don't use mine, which is in the file, since each username is allowed only a certain number of free hits daily against the API.
  • Geospatial-Experiments-Sample-Data.xlsx -- the original spreadsheet of metadata that I created by copying five records from MWDL. I chose to include only records with a spatial field that had text strings with placenames.
  • Geospatial-Experiments-Finished_data.xls -- the final product, exported as an Excel file

Post back here below what you figure out so we can all continue to learn about OpenRefine and the GeoNames API. Thanks!!