Next Level - Data Joins
Super VLOOKUPS
When we want to analyze data where our client demographic information is in table A and our program service data is in table B, what do we do? Instead of creating a million VLOOKUPs (and crashing excel), join the two tables together in Tableau Prep.
It is very common for a community's data is dispersed across a range of different tables. Though we try our best to maintain "master" lists that have every bit of information possible, these files are often cumbersome to put together and error-prone in their maintenance. Using joins makes creating combined tables a seamless process, and frees up time and mental energy that can be better used for analysis.
Our community is interested in how many of our outreach clients already have information in our BNL. To help answer this question, we are going to join our targeted outreach data to our BNL.
Skill Highlight: Joining Data
Joining data means combining two tables side-by-side based on a commonly shared field. In this example, table A on the left contains demographic data and table B on the right contains program data. They will be joined on the field they share in common - Client ID. This ties into the broader database concept of primary keys.
There are four different types of joins: inner joins, left joins, right joins, and outer joins. When thinking about these different types of joins, a Venn Diagram is a helpful mental model.
Inner Join - Only rows of data with key data appearing in both the left and right tables.
Only the client IDs that appeared in both the demographic and program tables will show as a result of this inner join.
Left Join - All rows of data from table A will appear but only data for which there is a matching key in table B will appear.
All Client IDs from the demographic table will show but the left join will only return the program data for which there is a match.
Right Join - All rows of data from table B will appear but only data for which there is a matching key in table A will appear.
All Client IDs from the program table will show but the right join will only return the demographic data for which there is a match.
Outer Join - All rows from both table A and table B will appear along with all available data from each table.
All Client IDs both the demographic and program tables are included and as much data is possible is returned to match the Client IDs.
Data joins are one of the most powerful and fundamental concepts in data management but they can take a lot of practice to master! Entire courses can be devoted to joins and database architecture. Start with data you are familiar with and soon you'll see the potential to connect all of your data with efficient joins.
Data Culture
Help your teammates use a single, consistent client ID to represent a single client across all data.
Infrastructure
Include relevant key data in every table (e.g. client IDs)
To maximize the utility of joins, streamline your data sets to only the data necessary and unique to that table and key data.
Tableau
Use Tableau Prep's Join Step or Tableau Desktop's relationship function to join tables together.
The first step towards joining the Targeted Outreach data with other data is to pull in our master BNL into the workflow. Connect to a new data source but instead of a static excel file, we are going to connect to Google Sheets. Find Google Drive and search for "Fellowship Data - Clients". Authenticate your connection and drop the "Client Data" sheet into the workflow.
Add a Clean Step just to take a look a the data.
Just like in Tableau Desktop, many actions in Tableau Prep are simply drag-and-drop.
Drag the new Clean Step on top of the Clean Step right before our output. We want to join our BNL data to our cleaned and reshaped Targeted Outreach data so we want to join after the previous steps.
A new branch of the workflow will appear with the new Join Step.
Click the Join Step to look at the details.
Applied Join Clauses - This designates the common "key" field on which we want to join the two tables. Tableau gives its best guess but in this case, we are interested in seeing what overlap we have across clients so we'll want to join on "Full Name" from the Targeted Outreach table.
Unfortunately there doesn't seem to be a "Full Name" field in the BNL. Any ideas on how to resolve this? Can we create a "Full Name" field in the BNL data from other fields?
Go back to the BNL Clean Step and click the Create Calculated Field button. In the calculated field box, create a formula that would create a full name, something like:
[First Name] + " " + [Last Name]
Save the calculated field as "Full Name" and return to the Join Step.
Now that we can join both tables on "Full Name", let's look at the rest of the settings.
Join Type: Inner - This indicates that only names that have data in both the Targeted Outreach and BNL files will be returned.
Summary of Join Results - This section shows how many rows of data are pulled from each table based on the join type.
With the inner join, it looks like 36 records have a match from the Outreach data and 84 records have no match. On the BNL side, 87 records have a match and 17,000 records do not.
Why do you think there are more matches in the BNL compared to the Targeted Outreach data?
Try changing the join type and see what happens to the summary.
Next to the settings and summaries is a list of every client name highlighted in black or red. Names in black are matched across both tables while red names only appear in one table.
Use this table to quickly scan for unexpected matches or mismatches.
Create a final Clean Step to examine your joined data. You'll now notice that every column from the BNL is now joined to the Targeted Outreach columns.
Create an Output Step and run your workflow to produce your joined dataset!
Joins open up a world of possibility
Joins are one of the most powerful skills in data analysis. The ability to seamlessly pull in data from other tables creates as many potential lines of inquiry as you have data. However, joins only work well when the other data cleaning and shaping fundamentals are followed!
Finally, let's examine data unions where we append new data to the bottom of existing tables.