Getting Started with Open Refine

The purpose of this tutorial is to introduce you to a powerful, yet user-friendly tool that allows you to clean, analyze, and visualize your tabular data. In this example we are interested in looking at the relationship between reported direct emissions and stationary combustion in major facilities across the US.

Exploring the Data: Greenhouse Gas (GHG) Emissions from Large Facilities

The following dataset named GHGRP_data_2015_8_13_16.xlsx has been downloaded and edited from the EPA Flight portal at: https://ghgdata.epa.gov/ghgp/main.do#. It contains all the Greengouse Gas Emissions reported to the EPA from Large Facilities during 2015. Download this table to your computer.

Open the table and spend a few minutes getting familiar with the structure and the meaning of each field. Think about relations, patterns, questions that you would like to solve or visualize with this data. The spreadsheet that we will be working with is called Direct Emitters.

Do you see any problems with the way the data has been collected?

Remember where you saved your table. We will be using it in our next step with Open Refine.

Transforming your data with OpenRefine

OpenRefine is a free, powerful tool that allows you to clean, discover, organize messy data and link it on the fly to other databases.

The main reasons for using OpenRefine are:

- It's free, open source, and your data is stored locally so you can clean your data without any privacy issues

- It combines the power of scripting with the simplicity of spreadsheets, which makes it more interactive and experimental than other tools

- User-friendly design that allows for easily detecting and fixing disparities in your data

File formats supported by OpenRefine are: Comma-Separated Values (CSV), Excel documents (.xls, .xlsx), Open Document Format spreadsheets (.ODS), JSON and XML. If you need other formats for your data, you can add them by way of OpenRefine extensions.

You can find excellent resources at: openrefine.org and download the program here.

Importing data

Launch OpenRefine. This should automatically open your web browser to the correct address: 127.0.0.1:3333.

In OpenRefine, on the right-hand side, select Get data from This Computer and click the Browse... button. Locate the GHG data mentioned above and click Open.

Press the Next button to preview your data.

In the screen that appears, we will make sure that we make sure to:

- Direct Emitters is the worksheet to import

- Check Ignore first 3 line(s) at the beginning of file

- Uncheck Store blank rows (bottom right corner of your screen).

If your data looks correct, press the Create Project button on the top right corner.

Basics of Open Refine

One big difference between OpenRefine and Excel is the unit of interaction. In spreadsheets, cells are the main units of interaction. However, OpenRefine works with columns and rows instead. Editing usually happens one column at a time or across rows that meet a specific criteria.

The second main difference is that spreadsheets are primarily used for performing calculations. The goal of OpenRefine is to explore your data, look for patterns, and transform it.

Let's get familiar with common operations. For example, change the number of records displayed by selecting 50 records on the Show menu at the top center of your screen.

We have decided that the FRS Id column is not important and want to delete the column. For that, click on the drop-down arrow at the left of the column title. Select Edit column > Remove this column.

We want to have the Facility Name as the first column that appears. To move the column, click the drop-down arrow next to it, select Edit column > Move column to beginning. Rename the column Industry Type (sectors) to Sectors_Industry_Type by selecting Edit column > Rename this column. Under Enter new column name, type: Sectors_Industry_Type. Click OK.

On your own: remove the column named Column at the end of the table.

Tip: to remove several columns at the same time, click on All at the top left corner of your table. Select Edit columns > Re-order/remove columns. In the window that appears, drag the columns to the right hand-side that you want to remove. For example,all the Petroleum and Natural Gas Systems columns.

Converting to Title-case

Take a look at the City column. Notice some cities are in uppercase while others are in title case. Let's transform the entire column to title case by clicking on the arrow next to City, select Edit cells > Common transforms > To titlecase. Notice the results. Repeat the same operation for the County column.

Keeping Track of Changes: the Undo/Redo tab

Open Refine keeps track of any operation or change you have made to your table. On the top left corner of the program, select the Undo/Redo tab. Notice each operation has a number and the last one is highlighted. To undo that operation, click on the one before. Notice the results. To redo it, click on the last operation again.

This tab is very useful for two additional reasons:

- You can see the number of cells affected by each operation

- You can see the code used for each operation, extract it and apply it to a different column or dataset.

Let's put this into practice. Undo the last operation on County. Click the Extract button and notice the code for each operation. Leave checked only the text for the column City. Copy that code. Close that window.

Click the Apply button. Paste the code. Change the column name City to County. Click Perform Operations and see the results. Cool way to get started in programming, don't you think?

Transforming your data with GREL

Removing Characters from a String

Look at the County column. Notice some records have the word county added. We want to keep the county names consistent so we will be removing the word county whenever we have it.

To clean the names, click the inverted triangle to the left of County, select Facet > Custom Text Facet. In the window that appears, under Expression, type: value.contains("County"). Click OK. Notice in the Facet/Filter tab that you have 2 choices:

- False - cells that do not contain the word County

- True - cells that contain the word County. These are the ones we want to work with. Select those in the Facet tab and notice only those records appear on the table on the right. How many records fit that condition? Now we will perform the operation of subtracting the word County out of those records.

Click on the inverse triangle to the left of the County column. Select Edit cells > Transform. In the Expression box, type: value.substring(0,-6) and notice the results

This function returns the text of the column minus the last six characters starting from the end.

Notice how you don't have any records that fill the true condition anymore. Remove the County facet by clicking on the X on the top left corner.

Try other combinations such as substring(1) and substring(0,4) to understand how this function works. For more information about working with text functions, refer to the GREL String Functions reference.

How would you subtract the first two characters of each State (Ca)?

GREL is the General Refine Expression Language (GREL) used to write transformation expressions in OpenRefine. However, you can also use Jython (implementation of Python designed to run on the Java platform) and Clojure (dialect of Lisp).

Type again value.substring(2) to get the full names on the State column. Click OK. Undo this last step.

Removing White Spaces

One big problem with geocoding names in the presence of white trailing spaces at the beginning or the end of the name. In your table, hover over the Collin cell and press edit. Notice there is a space after Collin. These are very hard to find, so instead of trying to find them manually, let's clean the entire column with a simple function. Click Cancel.

For expression type: value.trim(). Remember to make a cell transformation, go to Edit cells > Transform.

You cannot notice the difference in this window. Click OK.

To find out how many records have been affected by your last operation, go to the Undo/Redo tab on the top left corner. Here you have a history of all the steps you have made so far.

Notice your last step affected 1866 names.

You can undo any step and redo by simply clicking on the previous step.

If you press Extract, you can get the code for any of these steps in JSON format. This can be very useful if you want to use the same procedure with other datasets that have exactly the same format (perhaps for different election years). Do the same operation to the State and the City columns.

Converting data from string to numeric values

Look at the Primary NAICS Code column. The values are considered text. To change this, go to the column and select Edit cells > Common transformations > To number. Notice how the values are now green (way to represent numeric values in OpenRefine).

Exploring and Transforming your data using Facets: States

Tale a look at the States column. There should be only 54 values. To know exactly how many different values we have, go to this column and select Facet > Text Facet.

You should see 78 different choices for the results. Notice the similarities. Next to each choice you have the number of records in your table with that choice.

To manually change the choices, click on ala and edit. In the box that appears, rename it to AL. Click Apply. Notice how you have only 77 choices now.

If you look at the Text Facet, you still have 55 choices instead of 54. Change manually NYork to NY by selecting the choice on the Text Facet window and typing the new text for each.

Clean the Zip Code and the Industry Type (sector) columns

Notice there are some zipcodes with 9 digits instead of 5. Which functions can we use to remove the last five digits? Should we apply this to the entire column? Try it and see what happens. Undo and try the contains function to select only the zipcodes that fit that criteria. How many cells have zipcodes with 9 characters?

Transforming numeric data

Take a look at the Latitude column. Notice the variety of values. Think for a second how can we convert this column into numbers that are consistent in value.

To make sure you don't have data inconsistencies, create a numeric facet (Facet > Numeric Facet). Notice the big variation in values. Is this expected?

To make it more clear, close the numeric facet and create a log facet (Facet > Customized facets > Numeric log facet). Uncheck the Error ones and look at the top half of the bar.

This clearly shows two values extremely high compared to the rest. Move the handles on the bar for the facet so you can see only two records.

Obviously this is a data entry error. Notice there are two latitude value errors. Fix those manually.

Make sure you convert your results to number.

On your own:

Follow the same procedure to transform the Longitude to proper to numbers.

Scatterplot Facet

Go to the Total reported direct emissions column and select: Facet > Scatterplot Facet. Notice the new window that appears plotting this column against other numeric values in your data.

Click on the graphic that shows Total reported direct emissions vs. CO2 emissions. Notice how it gets inserted in your top left corner. Select the States with the highest values by drawing a rectangle around them and see your data get filtered by your selection.

We are going to cluster the data to combine the choices in a faster, coherent way. In the same column, select Edit cells > Cluster and edit... This tool allows you to find "groups" of cell values in your columns that are similar and cluster them using different algorithms. Examine the clusters. Did it do a good job guessing similar values?Click on Select All. Notice the Merge function checked for each cluster. In the New Cell Value, type the two-letter abbreviation for each state (CA, NM, etc). Click Merge Selected & Re-Cluster. Change the Method to nearest neighbor, distance function levenshtein, Radius = 2 and Block Chars = 2. See you can also merge Utah and Ohio.

Exporting your data and history

Once your data is ready for your next step in other programs such as Google Fusion Tables or Tableau, click the Export button on the top right corner and select Comma Separated Value. Save your File to your computer: GHGRP_data_2015_8_13_16-xlsx.csv

We will use this dataset for the visualization tutorial with Fusion Tables that you can find here.

Tips

    • Look for patterns in your data by identifying the common problem in the cells or rows that need change.

    • Remember to use filters and/or facets to isolate the rows you want to change.

    • Concatenate syntax to produce a single command to make your edits.

    • Extract your operation history so you can:

      • Use the same transformation to similar documents

      • Others can perform the same series of procedures to similar data

OpenRefine Resources

OpenRefine Wiki: GREL Reference: GREL String Functions

OpenRefine Recipes

Tutorial by David Huynh

Merging Datasets with Common Columns by Tony Hirst

Tutorial by Keith Jenkins at Cornell University

OpenRefine Wiki:

- clustering

Other Environmental Data sets

Emissions & Generation Resource Integrated Database (eGRID) - eGRID2014 Data File (XLS)