As a learner, you can choose to complete one or multiple courses in the Google Data Analytics Certificate program. However, to obtain the certificate, you must complete all eight courses in the program and submit all graded assignments. This reading describes what is required to obtain a certificate, as well as best practices for you to have a successful learning experience on Coursera.
To receive your official Google Data Analytics Certificate, you must:
Pass all graded assignments in all eight courses of the certificate program. Each graded assignment is part of a cumulative graded score for the course, and the passing grade for each course is 80%.
AND one of the following:
Pay the course certificate fee,
Be approved for Coursera Financial Aid, or
Complete the certificate through an educational institution, employer, or agency that's sponsoring your participation.
If you’re a learner who has some experience with data analytics, you may wish to bypass this certificate program and proceed directly to a more advanced program such as the Google Advanced Data Analytics Certificate or the Google Business Intelligence Certificate. Determine whether you’re ready for one of these programs in an upcoming reading, Evaluate your current data analytics skills.
Here are some best practices to help you successfully complete the courses in the program:
Plan your time: Setting regular study times and following them each week helps make learning a part of your routine. Use a calendar or timetable to create a schedule, and list what you plan to do each day in order to set achievable goals. Find a space that allows you to focus.
Work at your own pace: Everyone learns differently, so this program has been designed to let you work at your own pace. Although your personalized deadlines start when you enroll, feel free to progress through the program at the speed that works best for you. There is no penalty for late assignments; to earn your certificate, all you have to do is complete all of the work. If you miss two assessment deadlines in a row, or if you miss an assessment deadline by two weeks, you'll see a Reset deadlines option on the Grades page. Click it to switch to a new course schedule with updated deadlines. You can use this option as many times as you need.
Be curious: If you find an idea that gets you excited, act on it! Ask questions, search for more details online, explore the links that interest you, and take notes on your discoveries. The steps you take to support your learning will advance your knowledge, create more opportunities for you in this high-growth field, and help you qualify for jobs.
Follow along with instructors: Each instructional video with an onscreen software or programming demonstration has a corresponding step-by-step guide. This guide is a reading that outlines each step an instructor performs in the video. You’ll also find additional information in the guide that explains the purpose of the concepts covered, why a step is done a certain way, and tips for using the tool.
Take notes: Notes will help you remember information in the future, especially as you’re preparing to enter a new job field. In addition, taking notes is an effective way to make connections between topics and gain a better understanding of those topics.
Review exemplars: Exemplars are completed assignments that fully meet an activity's criteria; many activities in this program have exemplars that you can compare to your own work. Although there are often many different ways to complete an assignment, exemplars offer you guidance and inspiration about how to complete the activity.
Build your career identity: Your career identity is the unique value you bring to the workforce. Watch this video to learn about the key components of career identity and write your own career identity statement. Discovering and defining your own career identity makes you better equipped to choose a career path that aligns with your strengths, values, and goals, and to be more effective in your chosen profession.
Study together: It can be hard to find the time to progress through a Google Career Certificate with all life’s distractions. Register for Study Group – live study sessions hosted Monday through Thursday that offer you dedicated time and space to focus on your coursework alongside a community of your peers. Participation in Study Group is optional for our learners.
Connect with other learners: If you have a question, chances are, you’re not alone. Reach out in the discussion forum to ask for help from other learners in this program. You can also visit Coursera’s private Google Data Analytics Community to expand your network, discuss career journeys, and share experiences. Check out the quick start guide.
Update your profile: Update your profile on Coursera to include your photo, career goals, and more. When other learners find you in the discussion forums, they can click on your name to access your profile and get to know you better.
To complete certain activities in the program, you will need to use digital documents, spreadsheets, presentations, and/or labs. Data analytics professionals use these software tools to collaborate within their teams and organizations. If you need more information about using a particular type of document, refer to these resources:
Microsoft Word: Help and learning: Microsoft Support page for Word
Google Docs: Help Center page for Google Docs
Microsoft Excel: Help and learning: Microsoft Support page for Excel
Google Sheets: Help Center page for Google Sheets
Microsoft PowerPoint: Help and learning: Microsoft Support page for PowerPoint
Google Slides: Help Center page for Google Slides
This program covers a lot of terms and concepts, some of which you may already know and some of which may be unfamiliar to you. To review terms and help you prepare for graded quizzes, refer to the following glossaries:
Module glossaries: At the end of each module’s content, you can review a glossary of terms from that module. The module glossaries are not downloadable; however, all of the terms and definitions are included in the course and certificate glossaries, which are downloadable.
Course glossaries: At the end of each course, you can access and download a glossary that covers all of the terms in that course.
Certificate glossary: The certificate glossary includes all of the terms in the entire certificate program and is a helpful resource that you can reference throughout the program or at any time in the future.
Find the course and certificate glossaries using the course’s Resources tab. Access and download the certificate glossaries, then save them on your computer. To access the Data Analytics Certificate program glossary, click the following link and select Use Template.
Link to the glossary: Google Data Analytics Certificate glossary
OR
If you don’t have a Google account, download the glossary directly from the following attachment.
Data Analytics Certificate glossary
Providing feedback on videos, readings, and other materials is easy. With the resource open in your browser, locate the thumbs-up and thumbs-down symbols.
Click thumbs-up for materials that are helpful.
Click thumbs-down for materials that are not helpful.
If you want to flag a specific issue with an item, click the flag icon, select a category, and enter an explanation in the text box. This feedback goes to the course development team and isn’t visible to other learners. All feedback received helps to create even better certificate programs in the future.
For technical help on the Coursera platform, visit the Learner Help Center.
Following are some data-collection considerations to keep in mind for your analysis:
Decide whether you will collect the data using your own resources or receive it (and possibly purchase it) from another party. Data that you collect yourself is called first-party data.
If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.
Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem. For example, if you are analyzing trends over time, make sure you use time series data, in other words, data that includes dates.
If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects (a short query sketch follows this list). Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.
If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.
If your data collection relies heavily on how much time you have, use the flowchart below to guide your decision:
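If a random sample of existing records is all your project needs, a simple query can pull one. The following is only a minimal sketch: the survey_responses table, the sample size of 100, and the RAND()/LIMIT syntax are assumptions that vary by SQL dialect.

-- A minimal sketch of taking a simple random sample from existing data.
-- The survey_responses table and the sample size are hypothetical, and
-- RAND() / LIMIT syntax differs slightly between SQL dialects.
SELECT *
FROM survey_responses
ORDER BY RAND()
LIMIT 100;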
Question 1
What are cookies?
Programs that enable users to access websites
Types of malware that can damage computers
Pieces of code that store information about a website
Small files stored on computers that contain information about users
Question 2
Fill in the blank: For data analytics projects, _____ data is typically preferred because users know it originated within the organization.
multi-party
first-party
second-party
third-party
Question 3
A grocery store chain purchases customer data from a credit card company. The grocer uses this data to identify its most loyal customers and offer them special promotions and discounts. What type of data is being used in this scenario?
First-party
Multi-party
Second-party
Third-party
Question 4
In data analytics, what term refers to all possible data values in a dataset?
Representation
Source
Population
Sample
Now that you’ve learned about unstructured data and data formats, pause for a moment and think about unstructured data and how it compares to structured data. Most of the data being generated right now is unstructured, and being able to understand it empowers businesses to uncover hidden patterns, trends, and important information that may be inaccessible through traditional, structured data analysis. Answering the questions in this self-reflection will help reinforce what you’ve learned and enable you to make the most of all kinds of data throughout your career.
Review the following scenario. Then complete the instructions.
In this self-reflection, you’ll explore and reflect on the nature of unstructured data by interacting with a crowd-sourced dataset on the website Quick, Draw!
Quick, Draw! is a game in which people create drawings. Quick, Draw! has used these drawings to amass a dataset that contains millions of images separated into categories such as plants, animals, or vehicles. On the Quick, Draw! website, you can examine these drawings or play the game to add your own. These drawings help train computers to recognize objects with artificial intelligence.
In this self-reflection, you’ll explore the Quick, Draw! website and create your own drawings. Then, you’ll reflect on Quick, Draw!’s dataset and what you can deduce about the nature of unstructured versus structured data.
Follow the instructions to complete each step of the activity. Then answer the questions at the end of the activity before going to the next course item.
Visit the Quick, Draw! website.
Click the drop-down arrow featured next to Now visualizing: cloud.
Select a type of doodle to begin.
Click an image to get its details, like the number of doodles drawn. For example, there are more than 100,000 different drawings of elephants.
Scroll through the list and determine if there are any that don’t belong. If you find one that doesn’t appear to match the intended object, click on it and select Flag as inappropriate.
Explore other categories of drawings. Choose three that interest you and examine their doodles.
Explore further. Click Get the data to visit the GitHub page containing the entire dataset. As you become more familiar with data projects and start creating your own, you can return to this dataset and analyze it yourself.
Click Play the game to draw your own doodles and contribute to Quick, Draw!’s dataset.
Data is everywhere and it can be stored in lots of ways. Two general categories of data are:
Structured data: Organized in a certain format, such as rows and columns.
Unstructured data: Not organized in any easy-to-identify way.
For example, when you rate your favorite restaurant online, you're creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you're using unstructured data.
Here's a refresher on the characteristics of structured and unstructured data:
Structured data:
- Defined data types
- Most often quantitative data
- Easy to organize
- Easy to search
- Easy to analyze
- Stored in relational databases
- Contained in rows and columns
- Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history
Unstructured data:
- Varied data types
- Most often qualitative data
- Difficult to search
- Provides more freedom for analysis
- Stored in data lakes and NoSQL databases
- Can't be put in rows and columns
- Examples: Text messages, social media comments, phone call transcriptions, various log files, images, audio, video
As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.
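Because structured data already lives in rows and columns, it can be retrieved with a short query. The sketch below is only an illustration; the transactions table and its column names are hypothetical.

-- A minimal sketch of querying structured (row-and-column) data.
-- The transactions table and its column names are hypothetical.
SELECT
  customer_id,
  transaction_date,
  amount
FROM transactions
WHERE transaction_date >= '2023-01-01'
ORDER BY transaction_date;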
Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.
The lack of structure makes unstructured data difficult to search, manage, and analyze. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you're learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.
This reading introduces you to data modeling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.
Important note: As a junior data analyst, you won't be asked to design a data model. But you might come across existing data models your organization already has in place.
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.
Each level of data modeling has a different level of detail.
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details.
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
More information can be found in this comparison of data models.
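To make the levels concrete, the sketch below shows what a small piece of a physical data model might look like in SQL, with actual table names, column names, and data types spelled out. The product table and its columns are hypothetical, not a model from this program.

-- A minimal sketch of one table from a hypothetical physical data model:
-- the actual table name, column names, and data types are spelled out.
CREATE TABLE product (
  product_id   INT NOT NULL,       -- uniquely identifies each record
  product_name VARCHAR(100),
  list_price   DECIMAL(10, 2),
  created_date DATE,
  PRIMARY KEY (product_id)
);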
There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique.
You can read more about ERD, UML, and data dictionaries in this data modeling techniques article.
Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!
Question 1
What type of data is the height of a skyscraper?
Discrete
Nominal
Qualitative
Continuous
Question 2
In data analytics, what is the term for data that is generated from, and lives, outside of an organization?
Peripheral
External
Outer
Internal
Question 3
What are the key characteristics of unstructured data? Select all that apply.
Fits neatly into rows and columns
Clearly identifiable construction
Unorganized
May have an internal structure
Question 4
Fill in the blank: A data model is used to organize _____ and how they relate to one another.
database structures
spreadsheet fields
data elements
data visualizations
In this reading, you will explore the basics of Boolean logic and learn how to use single and multiple conditions in a Boolean statement. These conditions are created with Boolean operators, including AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as writing queries for searches and checking for conditions when writing programming code.
Imagine you are shopping for shoes, and are considering certain preferences:
You will buy the shoes only if they are any combination of pink and grey
You will buy the shoes if they are entirely pink, entirely grey, or if they are pink and grey
You will buy the shoes if they are grey, but not if they have any pink
These Venn diagrams illustrate your shoe preferences. AND is the center of the Venn diagram, where two conditions overlap. OR includes either condition. NOT includes only the part of the Venn diagram that doesn't contain the exception.
In queries, Boolean logic is represented in a statement written with Boolean operators. An operator is a symbol that names the operation or calculation to be performed. Read on to discover how you can convert your shoe preferences into Boolean statements.
Your condition is “If the color of the shoe has any combination of grey and pink, you will buy them.” The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say: IF (Color = "Grey") AND (Color = "Pink"), then buy them.
The AND operator lets you stack both of your conditions.
Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, only one pair of shoes meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you would buy.
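In a query language such as SQL, the same AND logic might look like the sketch below. Because a single color field can't hold two values at once, the sketch uses two hypothetical yes/no columns, has_grey and has_pink, in a hypothetical shoes table.

-- A minimal sketch of the AND condition in SQL.
-- The shoes table and its has_grey / has_pink columns are hypothetical,
-- and boolean literals (TRUE/FALSE) vary slightly by dialect.
SELECT shoe_name
FROM shoes
WHERE has_grey = TRUE
  AND has_pink = TRUE;   -- only shoes with both colors are returned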
The OR operator lets you move forward if either one of your two conditions is met. Your condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would be IF (Color="Grey") OR (Color="Pink") then buy them.
Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.
Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is "You will buy any grey shoe except for those with any traces of pink in them." Your Boolean statement would be: IF (Color = "Grey") AND (Color = NOT "Pink"), then buy them.
Now, all of the grey shoes that aren't pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.
For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: IF ((Color = "Grey") OR (Color = "Pink")) AND (Waterproof = "True").
Notice that you can use parentheses to group your conditions together.
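In SQL, that combined statement could be sketched as follows; the shoes table and its color and waterproof columns are hypothetical.

-- A minimal sketch of combining conditions with parentheses in SQL.
-- The shoes table and its columns are hypothetical.
SELECT shoe_name
FROM shoes
WHERE (color = 'Grey' OR color = 'Pink')  -- either color qualifies
  AND waterproof = TRUE;                  -- and the shoe must be waterproof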
Operators are symbols that name the operation or calculation to be performed. The operators AND, OR, and NOT can be used to write Boolean statements in programming languages. Whether you are doing a search for new shoes or applying this logic to queries, Boolean logic lets you create multiple conditions to filter your results. Now that you know a little more about Boolean logic, you can start using it!
Learn about who pioneered Boolean logic in this historical article: Origins of Boolean Algebra in the Logic of Classes.
Find more information about using AND, OR, and NOT from these tips for searching with Boolean operators.
This reading outlines the steps the instructor performs in the following video, Meet wide and long data. In this video, the instructor presents wide and long data formats and discusses the types of questions each format can help you answer.
Keep this guide open as you watch the video. It can serve as a helpful reference if you need additional context or clarification while following the video steps. This is not a graded activity, but you can complete these steps to practice the skills demonstrated in the video.
If you would like to access the spreadsheets the instructor uses in this video, select the link to a dataset to create a copy. If you don’t have a Google account, download the data directly from the attachments below.
Link to population datasets:
Population, Latin, and Caribbean Countries, 2010–2019, wide format
Population, Latin, and Caribbean Countries, 2010–2019, long format
OR
Download data:
Population, Latin, and Caribbean Countries, 2010–2019, wide format
Population, Latin, and Caribbean Countries, 2010–2019, long format
Wide data is a dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject. It is helpful for comparing specific attributes across different subjects.
Open the Population, Latin, and Caribbean Countries, 2010–2019, wide format spreadsheet.
Each row contains all population data for one country.
The population data for each year is contained in a column.
Find the annual population of Argentina in row 3.
In this wide format, you can quickly compare the annual population of Argentina to the annual populations of Antigua and Barbuda, Aruba, the Bahamas, or any other country.
Select column E, which contains each country’s 2010 population data.
Right-click column header E and choose Sort Z to A.
Notice that Brazil is now at the top of the list because it had the highest population in the year 2010.
Select column H.
Right-click column header H and choose Sort A to Z.
Notice that the British Virgin Islands are now at the top because they had the lowest population of all countries in 2013.
Long data is data in which each row represents one observation per subject, so each subject will be represented by multiple rows. This data format is useful for comparing changes over time or making other comparisons across subjects.
Open the Population, Latin, and Caribbean Countries, 2010–2019, long format spreadsheet.
Notice the data is no longer organized into columns by year. All of the years are now in one column.
Find Argentina’s population data in rows 12-21. Each row contains one year of Argentina’s population data.
In this reading, you will explore how data is transformed and the differences between wide and long data. Data transformation is the process of changing the data’s format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier for you to analyze it.
Data transformation usually involves:
Adding, copying, or replicating data
Deleting fields or records
Standardizing the names of variables
Renaming, moving, or combining columns in a database
Joining one set of data with another
Saving a file in a different format, such as saving a spreadsheet as a comma-separated values (.csv) file
Goals for data transformation might be:
Data organization: better organized data is easier to use
Data compatibility: different applications or systems can then use the same data
Data migration: data with matching formats can be moved from one system to another
Data merging: data with the same organization can be merged together
Data enhancement: data can be displayed with more detailed fields
Data comparison: apples-to-apples comparisons of the data can then be made
Mario is a plumber who owns a plumbing company. After years in the business, he buys another plumbing company. Mario wants to merge the customer information from his newly acquired company with his own, but the other company uses a different database. So, Mario needs to make the data compatible. To do this, he has to transform the format of the acquired company’s data. Then, he must remove duplicate rows for customers they had in common. When the data is compatible and together, Mario’s plumbing company will have a complete and merged customer database.
To make it easier to create charts, you may also need to transform long data to wide data. Consider the following example of transforming stock prices (collected as long data) to wide data.
Long data is data where each row contains a single data point for a particular item. In the long data example below, individual stock prices (data points) have been collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.
Long data example: Stock prices
Wide data is data where each row contains multiple data points for the particular items identified in the columns.
Wide data example: Stock prices
With data transformed to wide data, you can create a chart comparing how each company's stock changed over the same period of time.
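If the long-format stock data were stored in a database table, one common way to reshape it into wide format is conditional aggregation. The sketch below assumes a hypothetical stock_prices table with stock_date, ticker, and closing_price columns.

-- A minimal sketch of transforming long data to wide data in SQL.
-- The stock_prices table and its columns are hypothetical.
SELECT
  stock_date,
  MAX(CASE WHEN ticker = 'AAPL'  THEN closing_price END) AS aapl_price,
  MAX(CASE WHEN ticker = 'AMZN'  THEN closing_price END) AS amzn_price,
  MAX(CASE WHEN ticker = 'GOOGL' THEN closing_price END) AS googl_price
FROM stock_prices
GROUP BY stock_date
ORDER BY stock_date;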
You might notice that all the data included in the long format is also in the wide format. But wide data is easier to read and understand. That is why data analysts typically transform long data to wide data more often than they transform wide data to long data. The following table summarizes when each format is preferred:
By now, you’ve learned a lot about different data types and data structures. In this activity, you will work with datasets from Kaggle, an online community of people passionate about data. To start this activity, you’ll create a Kaggle account, set up a profile, and explore Kaggle notebooks.
Every data analyst has a data community that they rely on for help, support, and inspiration. Kaggle can help you build your own data community.
Kaggle has millions of users in all stages of their data career, from beginners to data scientists with decades of experience. The Kaggle community brings people together to develop their data analysis skills, share datasets and interactive notebooks, and collaborate on solving real-life data problems.
Check out this brief introductory video to learn more about Kaggle.
By the time you complete this activity, you will be able to use many of Kaggle’s key features. This will enable you to create notebooks and browse data, which is important for completing and sharing data projects in your career as a data analyst.
Follow the instructions to complete each step of the activity. Then answer the questions at the end of the activity before going to the next course item.
To get started, follow these steps to create a Kaggle account.
Note: Kaggle frequently updates its user interface. The latest changes may not be reflected in the screenshots, but the principles in this activity remain the same. Adapting to changes in software updates is an essential skill for data analysts, and we encourage you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
1. Go to kaggle.com
2. Click on the Register button at the top-right of the Kaggle homepage. You can register with your Google credentials or your personal email address.
3. Once you’re registered and logged in to Kaggle, click on the Account icon at the top-right of your screen. In the menu that opens, click the Your Profile button.
4. On your profile page, click on the Edit Profile button. Enter any information you’d like to share with the Kaggle community. Your profile will be public, so only enter the information you’re comfortable sharing.
5. If you want some inspiration, check out the profile of Kaggle’s Community Advocate, Jesse Mostipak!
Now that you’ve created an account and set up your profile, you can check out some notebooks on Kaggle. Kagglers use notebooks to share datasets and data analyses.
First, go to the Navigation bar on the left side of your screen. Then, click on the Code icon. This takes you to the Code home page.
On the Code home page, you’ll notice links to notebooks created by other Kagglers.
To begin, feel free to scroll through the list and click on notebooks that interest you. As you explore, you may come across unfamiliar terms and new information: That’s fine! Kagglers come from diverse backgrounds and focus on different areas of data analysis, data science, machine learning, and deep learning.
Once you’re familiar with the Code home page, you can narrow your search results by typing a word in the search bar or by using the filter feature.
For example, enter Beginner in the search bar to show notebooks tagged as beginner-friendly. Or, click on the Filter icon, the triangle shape on the right side of the search bar. You can filter results by tags, programming language, output, and other options. Filter to Datasets to show notebooks that use one of the tens of thousands of public datasets available on Kaggle.
If you’re looking for specific suggestions, check out the following notebooks:
gganimate by Meg Risdal: https://www.kaggle.com/mrisdal/space-is-the-place
Getting staRted in R by Rachael Tatman
Writing Hamilton Lyrics with TensorFlow/R by Ana Sofia Uzsoy
Dive into dplyr (tutorial #1) by Jesse Mostipak
Spend some time checking out a couple of notebooks to get an idea of the work that Kagglers share online—and that you’ll be able to create by the time you’ve finished this course!
Now, take a look at a specific notebook: Dive into dplyr (tutorial #1) by Jesse Mostipak. Follow these steps to learn how to edit notebooks:
1. Click on the link to open up the notebook. It contains the dataset you’ll work with later on.
2. Click on the Copy and Edit button at the top-right to make a copy of the notebook in your account. Now, the notebook appears in Edit mode. Edit mode lets you make changes to the notebook if you want.
This notebook is private. If you want to share your work, you can choose to make it public. When you copy and edit another Kaggler’s work, always make meaningful changes to the notebook before publishing it. That way, you’re not misrepresenting someone else’s work as your own.
3. Take a moment to explore the Edit mode of the notebook.
Some of this may seem unfamiliar—and that’s just fine. By the end of this course, you’ll know how to create a notebook like this from scratch!
Now, you can check out the data!
In this notebook, you’ll find the data in a box labeled Data at the top-right of your screen. In the box, there’s an input folder with the title: palmer-archipelago-antarctica-penguin-data. Follow these instructions to explore the datasets and learn more about the data within them:
1. Click on this title. Two .csv files appear: penguins_lter.csv and penguins_size.csv. Click on one of them. At the bottom of the notebook, you’ll now find an interactive data table with all the information from the dataset.
2. Click on the other .csv file. This opens a second tab with the second dataset.
3. Take a moment to check out each dataset.
4. Sort the data in each column by clicking on the horizontal bars to the right of each column name.
5. Click on the button that says 10 of 17 columns to change the columns that are visible in the table.
In the dropdown menu, there’s a checkmark next to the name of each column that appears in the table. Checking or unchecking one of these boxes will change what data is presented.
Congratulations! You’ve explored several ways to interact with the dataset. This will help you get familiar with the Kaggle interface. You can save the notebook you worked in for future reference. Coming up, you’ll learn more about other ways you can use Kaggle.
Question 1
Fill in the blank: A data type is a specific kind of data _____ that tells what kind of value the data is.
attribute
frame
model
point
Question 2
What are the key characteristics of a text, or string, data type? Select all that apply.
Sequence of characters and punctuation
Contains textual information
Only two possible values
Has numerical percentages
Question 3
In a data table, where are fields contained?
Rows
Charts
Favorites
Columns
Question 4
When using long data, each subject has data in multiple rows. This is because each row represents what?
One observation per subject
Multiple values
True or false data points
Data in different formats
Agenda: A list of scheduled appointments
Audio file: Digitized audio storage usually in an MP3, AAC, or other compressed format
Boolean data: A data type with only two possible values, usually true or false
Continuous data: Data that is measured and can have almost any numeric value
Cookie: A small file stored on a computer that contains information about its users
Data element: A piece of information in a dataset
Data model: A tool for organizing data elements and how they relate to one another
Digital photo: An electronic or computer-based image usually in BMP or JPG format
Discrete data: Data that is counted and has a limited number of values
External data: Data that lives, and is generated, outside of an organization
Field: A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table
First-party data: Data collected by an individual or group using their own resources
Long data: A dataset in which each row is one time point per subject, so each subject has data in multiple rows
Nominal data: A type of qualitative data that is categorized without a set order
Ordinal data: Qualitative data with a set order or scale
Ownership: The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing
Pixel: In digital imaging, a small area of illumination on a display screen that, when combined with other adjacent areas, forms a digital image
Population: In data analytics, all possible data values in a dataset
Record: A collection of related data in a data table, usually synonymous with row
Sample: In data analytics, a segment of a population that is representative of the entire population
Second-party data: Data collected by a group directly from its audience and then sold
Social media: Websites and applications through which users create and share content or participate in social networking
String data type: A sequence of characters and punctuation that contains textual information (Refer to Text data type)
Structured data: Data organized in a certain format such as rows and columns
Text data type: A sequence of characters and punctuation that contains textual information (also called string data type)
United States Census Bureau: An agency in the U.S. Department of Commerce that serves as the nation’s leading provider of quality data about its people and economy
Unstructured data: Data that is not organized in any easily identifiable manner
Video file: A collection of images, audio files, and other data usually encoded in a compressed format such as MP4, M4V, MOV, AVI, or FLV
Wide data: A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject
Question 1
Fill in the blank: Bias is a _____ preference in favor of or against a person, group of people, or thing.
conscious or subconscious
sensible or insensible
fair or unfair
standard or substandard
Question 2
Which of the following are examples of sampling bias? Select all that apply.
An election poll only interviews people with college degrees.
A teacher gives higher grades to essays written in their own writing style.
A clinical study includes three times more men than women.
A survey of students does not include homeschooled students.
Question 3
What is the term for the tendency to search for or interpret information in a way that validates pre-existing beliefs?
Interpretation bias
Observer bias
Confirmation bias
Sampling bias
Question 4
Which of the following terms are also ways of describing observer bias? Select all that apply.
Experimenter bias
Perception bias
Research bias
Spectator bias
Question 1
Fill in the blank: Data is considered _____ when it is accurate, complete, and unbiased information that has been vetted and proven fit for use.
comprehensive
original
reliable
current
Question 2
Which of the following are usually good data sources? Select all that apply.
Academic papers
Governmental agency data
Vetted public datasets
Social media sites
Question 3
To determine if a data source is cited, ask which of the following questions? Select all that apply.
Who created this dataset?
When was this data last refreshed?
Has this dataset been properly cleaned?
Is this dataset from a credible organization?
Question 4
A junior data analyst learns that the dataset they have been given is six years old. After looking into this further, they also discover that the age of the data is making the information irrelevant to their project. What good data source principle have they used to evaluate the dataset?
Comprehensive
Reliable
Original
Current
You have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person's identity.
Data anonymization is the process of protecting people's private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn't be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it.
Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:
Telephone numbers
Names
License plates and license numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure!
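As a simple illustration, masking is one anonymization technique that can be applied directly in a query before sharing results. The sketch below is hypothetical: the customers table, its columns, and the masking rules are assumptions, and string functions such as SUBSTR and CONCAT vary slightly by SQL dialect.

-- A minimal sketch of masking PII in a query result.
-- The customers table and its columns are hypothetical examples.
SELECT
  customer_id,
  CONCAT(SUBSTR(email, 1, 2), '*****') AS masked_email,          -- keep first 2 characters
  CONCAT('***-***-', SUBSTR(phone_number, 8, 4)) AS masked_phone -- keep last 4 digits
FROM customers;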
Question 1
What are data ethics?
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Established methods for ensuring data is clean, well-organized, and appropriate for a project
Approved strategies data professionals use to safeguard the privacy and security of a dataset
Long-standing techniques for confirming that data is always used to benefit society
Question 2
What concept states that all data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data?
Privacy
Ownership
Currency
Transaction transparency
Question 3
A data analyst removes personally identifying information from a dataset. What task are they performing?
Data anonymization
Data sorting
Data collection
Data visualization
Question 4
Before completing a survey, an individual acknowledges reading information about how and why the data they provide will be used. What is this concept called?
Discretion
Currency
Privacy
Consent
Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a future data analyst, you need to understand the basics to be successful in your new role.
In data analytics, open data is part of data ethics, which has to do with using data ethically. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:
Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute the data
Data can only be considered open when it meets all three of these standards.
One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals being represented by the public, open data, too.
Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets these third parties create audience profiles, which helps them better understand user behavior and target them with more effective advertising.
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.
Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the privacy of individuals.
Luckily for data analysts, there are lots of trustworthy resources available for open data. It is important to remember that even reputable data needs to be constantly evaluated, but these websites are a useful starting point:
U.S. government data site: Data.gov is one of the most comprehensive data sources in the US. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.
U.S. Census Bureau: This open data source offers demographic information from federal, state, and local governments, as well as commercial entities in the U.S.
Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
Dataset Search: Dataset Search is a search engine designed specifically for datasets; you can use it to search for specific datasets.
In the last activity, you got set up on Kaggle and explored the Notebooks feature. In this activity, we will work with a different feature of the Kaggle platform: datasets.
Kaggle has tens of thousands of datasets that are available for public use. Anyone can upload a dataset to Kaggle. If they choose to make it public, other Kagglers can use that dataset to create their own projects.
First, you’ll take a tour of a specific dataset. Then, you’ll have a chance to choose your own datasets to work with. Finally, you’ll use what you’ve learned in this module to determine the kind of data in your datasets, and whether the data is biased or unbiased.
By the time you complete this activity, you will be able to use many of the helpful features Kaggle has to offer. This will enable you to find data for projects and engage with the data community, which is important for developing skills and networking in your career as a data analyst.
Let’s explore the datasets feature!
1. To start, log in to your Kaggle account.
Note: Kaggle frequently updates its user interface. The latest changes may not be reflected in the screenshots, but the principles in this activity remain the same. Adapting to changes in software updates is an essential skill for data analysts, and we encourage you to practice troubleshooting. You can also reach out to your community of learners on the discussion forum for help.
2. Then, click on the Datasets icon in the vertical Navigation bar on the left. This takes you to the Datasets home page. From here, you can create a new dataset or search for datasets created by other Kagglers.
3. Now, check out a specific dataset. Enter Animal Crossing in the search bar to find datasets related to the Nintendo video game Animal Crossing.
4. There’s more than one option, so click on the Animal Crossing New Horizons Catalog. This takes you to the landing page for this dataset.
Header: The header at the top of the page contains the following information about the dataset:
Its title
A brief description of its contents
The name of its creator
When it was last updated
Its current version
Badge: In the top-right corner of the header, you’ll find three more items:
A badge in the shape of a circle
An icon in the shape of a caret symbol ( ^ )
A number
The badge is related to Kaggle’s progression system. If you want, you can read more about it here.
Upvotes: Clicking on the caret lets you “upvote” the dataset. The number shows the number of times this dataset has been upvoted by the Kaggle community.
Tabs: Beneath the header is a bar with three tabs: Data Card, Code, and Discussion. Take a moment to click on each of these tabs and explore their contents. Afterwards, navigate back to the Datasets tab!
Now, you can move down the page. You’ll find a box that contains three terms: Usability, License, and Tags.
Usability shows how complete the dataset webpage is (and not the dataset itself). Kaggle encourages the community to add information to the dataset webpage to make the dataset itself easier to understand. For example, a brief description or a column header. Hover your cursor over the Usability score to discover what the dataset page contains.
Licenses govern how a dataset can be used. Click on the license name to learn more about that specific license.
Expected update frequency reports how often the dataset is updated with new attributes, amendments, or expansions of the data. You may also see when any future updates are scheduled. In this case, there will never be an update scheduled.
The next box down contains a detailed description of the dataset. Kagglers often include information on where the dataset came from and how the dataset was prepared.
And last—but certainly not least—is the Data Explorer!
The Data Explorer menu shows that the Animal Crossing dataset contains 30 .csv files. If you click on a file name, the window to the right will display information from that specific file. Try clicking on umbrellas.csv to check it out!
Notice that the Data Explorer has three viewing options: Detail, Compact, and Column. For now, we’ll focus on the Detail tab.
The description at the top of the Detail tab shows that the umbrellas.csv file contains data on all the umbrellas in the video game. Let’s check out the columns. Each column header has three items:
A small icon on the left that shows the data type
The name of the column
An icon with three bars that lets you sort the data if you click on it
Below each column header is a box that contains a summary of the data. This lets you quickly get an idea of what’s in the dataset. For example, the summary for the Name column shows there are 67 unique values for the umbrella names. The summary for the DIY column shows that 9 of the umbrella recipes are DIY, or “do it yourself.” Take a moment to explore the summaries for the other columns.
And that completes our tour! That’s a lot of information. Feel free to go back and review.
After you’ve explored a dataset, you can link it to a Kaggle notebook or download it to access it for your own use. Linking a dataset to a Kaggle notebook means you create a new notebook from the existing dataset so that it is available for you to use.
Now, you’ll get a chance to choose your own datasets to work with! Use the following steps to find datasets that interest you:
1. When you’re ready, click on the Data icon on the left to return to the Datasets landing page.
2. Note that datasets can exist in a variety of formats. If you want to make sure your dataset is in a .csv format, click on the Filter button on the right side of the Datasets search bar. Then, choose CSV from the menu.
3. Find 2-3 datasets that you’re interested in exploring further.
4. Create notebooks from them, download them, or check them out in the Data Explorer. Keep these datasets in mind for your upcoming reflection.
Here are the options to create a notebook or download the dataset:
Create a Kaggle notebook: To link a dataset to a Kaggle notebook, you click on the New Notebook button in the dataset header. This will create a notebook in your Kaggle account that links to the dataset.
Download the dataset: To download a copy of the dataset to your computer, click on the Download button in the dataset header at the top of the page.
Open the file in Google Sheets: To open a Google Sheets view of the file, click on the download icon at the top-right of the Data Explorer. You can then download the file.
Question 1
Fill in the blank: Openness refers to _____ access, usage, and sharing of data.
limited
disclosed
protected
free
Question 2
What is the preferred method for open data to be made available?
A convenient and modifiable internet download
A secure password-protected file
A compressed file format that keeps file size small
A print copy that is easily shared by anyone
Question 3
What are the main benefits of open data? Select all that apply.
Restricts data access to certain groups of people
Combines data from different fields of knowledge
Good data is more widely available
Increases the amount of data available for purchase
Question 4
What are the key aspects of universal participation? Select all that apply.
No one can place restrictions on data to discriminate against a person or group.
Certain groups of people must share their private data.
All corporations are allowed to sell open data.
Everyone must be able to use, reuse, and redistribute open data.
Bad data source: A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Bias: A conscious or subconscious preference in favor of or against a person, group of people, or thing
Confirmation bias: The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Consent: The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
Cookie: A small file stored on a computer that contains information about its users
Currency: The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
Data anonymization: The process of protecting people's private or sensitive data by eliminating identifying information
Data bias: When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
Data ethics: Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data interoperability: A key factor leading to the successful use of open data among companies and governments
Data privacy: Preserving a data subject’s information any time a data transaction occurs
Ethics: Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
Experimenter bias: The tendency for different people to observe things differently (also called observer bias)
Fairness: A quality of data analysis that does not create or reinforce bias
First-party data: Data collected by an individual or group using their own resources
General Data Protection Regulation of the European Union (GDPR): A regulation in European Union law created to help protect people and their data
Good data source: A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
Interpretation bias: The tendency to interpret ambiguous situations in a positive or negative way
Observer bias: The tendency for different people to observe things differently (also called experimenter bias)
Open data: Data that is available to the public
Openness: The aspect of data ethics that promotes the free access, usage, and sharing of data
Sampling bias: Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
Transaction transparency: The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data
Unbiased sampling: When the sample of the population being measured is representative of the population as a whole
Databases enable analysts to manipulate, store, and process data. This helps them search through data a lot more efficiently to get the best insights.
A relational database is a database that contains a series of tables that can be connected to form relationships. Basically, they allow data analysts to organize and link data based on what the data has in common.
In a non-relational table, all of the variables you might be interested in analyzing are grouped together in a single table, which can make the data hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.
Normalization is the process of organizing data in a relational database, for example by creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
Tables in a relational database are connected by the fields they have in common. You might remember learning about primary and foreign keys before. As a quick refresher, a primary key is an identifier that references a column in which each value is unique. In other words, it's a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example, if customer_id is the primary key for the customer table, no two customers will ever have the same customer_id.
By contrast, a foreign key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organize and connect data across multiple tables in the database.
Some tables don't require a primary key. For example, a revenue table can have multiple foreign keys and not have a primary key. A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key. For example, if customer_id and location_id are two columns of a composite key for a customer table, the values assigned to those fields in any given row must be unique within the entire table.
As you've been learning, Structured Query Language (SQL) is a type of query language that enables data analysts to communicate with a database. So, a data analyst will use SQL to create a query to view the specific data that they want from within a larger dataset. In a relational database, data analysts can write queries to get data from the related tables. SQL is a powerful tool for working with databases—which is why you are going to learn more about it coming up!
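To make these ideas concrete, here is a minimal sketch of how a primary key and a foreign key might be declared in SQL, followed by a query that uses the relationship between the two tables. The customer_id, location_id, and revenue table names come from the examples above; the remaining names are hypothetical, and the exact syntax can vary slightly between database systems.

CREATE TABLE customer (
  customer_id INT PRIMARY KEY,   -- primary key: each customer_id value must be unique
  customer_name VARCHAR(100)
);

CREATE TABLE revenue (
  customer_id INT,               -- foreign key: matches a customer_id in the customer table
  location_id INT,
  amount NUMERIC(10, 2),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);

-- Query the related tables: total revenue per customer
SELECT customer.customer_name, SUM(revenue.amount) AS total_revenue
FROM customer
JOIN revenue ON customer.customer_id = revenue.customer_id
GROUP BY customer.customer_name;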
As a data analyst, you'll use data to answer questions and solve problems. When you analyze data and draw conclusions, you are generating insights that can influence business decisions, drive positive change, and help your stakeholders meet their goals.
Before you begin an analysis, it’s important to inspect your data to determine if it contains the specific information you need to answer your stakeholders’ questions. In any given dataset, it may be the case that:
The data is not there (you have sandwich data, but you need pizza data)
The data is insufficient (you have pizza data for June 1-7, but you need data for the entire month of June)
The data is incorrect (your pizza data lists the cost of a slice as $250, which makes you question the validity of the dataset)
Inspecting your dataset will help you pinpoint what questions are answerable and what data is still missing. You may be able to recover this data from an external source or at least recommend to your stakeholders that another data source be used.
In this reading, imagine you’re a data analyst inspecting spreadsheet data to determine if it’s possible to answer your stakeholders’ questions.
You are a data analyst working for an ice cream company. Management is interested in improving the company's ice cream sales.
The company has been collecting data about its sales—but not a lot. The available data is from an internal data source and is based on sales for 2019. You’ve been asked to review the data and provide some insight into the company’s ice cream sales. Ideally, management would like answers to the following questions:
What is the most popular flavor of ice cream?
How does temperature affect sales?
How do weekends and holidays affect sales?
How does profitability differ for new versus returning customers?
You can download the data to follow along with this reading. To use the template for the sales data, click the link below and select “Use Template.”
Link to template: Ice Cream Sales
OR
If you don’t have a Google account, you can download the spreadsheets directly from the attachments below:
To discover the most popular flavor, you first need to define what is meant by "popular." Is the most popular flavor the one that generated the most revenue in 2019? Or is it the flavor that had the largest number of units sold in 2019? Sometimes your measurement choices are limited by what data you have—you can review your spreadsheet to find out if either of these definitions of “popular” make sense based on the available data.
Click the flavors tab on your spreadsheet to view the relevant data. The flavors sheet has three columns and 209 rows of data. The column headers are week, units sold, and flavor. This dataset did not come with a data description, so you have to figure out the significance of the columns on your own. Based on the data, you deduce that these columns provide information about the number of units sold for each ice cream flavor, by week, in 2019.
In this case, you can discover what the most popular flavor is by using units sold as your measure. In particular, you can use the units sold column to calculate the total number of units sold during the year for each flavor. Unfortunately, the dataset does not provide the annual sales amount by flavor. In this case, your next step would be to ask your stakeholders if the annual sales per flavor data is available from another source. If not, you can add a statement about the current data’s limitations to your analysis.
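For example, assuming the week values are in column A, the units sold values in column B, and the flavor names in column C (matching the column order above), a spreadsheet formula like the one below would total the units sold across the year for a single, hypothetical flavor. You would repeat it once per flavor and compare the totals:

=SUMIF(C:C, "chocolate", B:B)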
To explore your second question, you click the temperatures tab and check out the data. The temperatures sheet has two columns and 366 rows of data. The column headers are temperature and sales. The data may show total 2019 sales per temperature (for instance, the first entry might sum up $39.69 in sales for three separate days that each had a high of 60 degrees). Or, the data may show a snapshot of sales and temperature for each day in 2019 (for instance, the first entry might refer to a single day with a high of 60 degrees and $39.69 in sales).
So, which is it? It’s probably a daily snapshot because there are 365 entries for temperature, and multiple rows with the same temperature and different sales values. This implies that each entry is for a single day and not a summary of multiple days. However, without more information, you can’t be certain. Plus, you don’t know if the current data is listed in consecutive order by date or in a different order. Your next step would be to contact the owner of the dataset for clarification.
If it turns out that temperature does affect sales, you’ll be able to offer your stakeholders an insight such as the following: “When daily highs are above X degrees, average ice cream sales increase by Y amount. So the business should plan on increasing inventory during these times to maximize sales.”
Next, you click on the sales tab to view the data about dates of sale. The sales sheet has two columns and 366 rows of data. The column headers are date and sales. This data is most likely total daily sales in 2019, as sales are recorded for each date in 2019.
You can use this data to determine whether a specific date falls on a weekend or holiday and add a column to your sheet that reflects this information. Then, you can find out whether sales on the weekends and holidays are greater than sales on other days. This will be useful to know for inventory planning and marketing purposes.
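For example, assuming the dates are in column A with a header in row 1, a formula such as the one below, copied down a new column, would label each date as a weekend or a weekday. Holidays would still need to be flagged separately, for instance by checking each date against a list of 2019 holidays.

=IF(WEEKDAY(A2, 2) > 5, "Weekend", "Weekday")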
Your dataset does not contain sales data related to new customers. Without this data, you won’t be able to answer your final question. However, it may be the case that the company collects customer data and stores it in a different data table.
If so, your next step would be to find out how to access the company’s customer data. You can then join the revenue sales data to the customer data table to categorize each sale as from a new or returning customer and analyze the difference in profitability between the two sets of customers. This information will help your stakeholders develop marketing campaigns for specific types of customers to increase brand loyalty and overall profitability.
When working on analytics projects, you won’t always have all the necessary or relevant data at your disposal. In many of these cases, you can turn to other data sources to fill in the gaps.
Despite the limitations of your dataset, it’s still possible to offer your stakeholders some valuable insights. For next steps, your best plan of action will be to take the initiative to ask questions, identify other relevant datasets, or do some research on your own. No matter what data you’re working with, carefully inspecting your data makes a big impact on the overall quality of your analysis.
Question 1
Fill in the blank: A relational database contains a series of _____ that can be connected to form relationships.
1 point
tables
cells
fields
spreadsheets
Question 2
What is the term for an identifier that references a database column in which each value is unique?
1 point
Primary key
Foreign key
Relation
Field
Question 3
What process do data professionals use to eliminate data redundancy, increase data integrity, and reduce complexity in a database?
1 point
Composition
Normalization
Iteration
Manipulation
Question 4
Fill in the blank: When using a relational database, data analysts write _____ to request data from the related tables.
1 point
programs
queries
relationships
keys
Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you’ll learn about metadata and the type of information it can provide. In addition, you’ll explore examples of metadata.
Explore a data file by opening any file on your computer or a document in your home or workplace. What is it? Where did it come from? Is it useful? How do you know? This is where metadata comes in to provide a deeper understanding of the data. To put it simply, metadata is data about data. In database management, metadata provides information about other data and helps data analysts interpret the contents of the data within a database.
Regardless of whether you’re working with a large or small quantity of data, metadata is the mark of a knowledgeable analytics team. Metadata helps people communicate about data across the business and makes it easier to reuse data. In essence, metadata tells the who, what, when, where, which, why, and how of data.
Before examining metadata examples, it’s important to understand what type of information metadata typically provides:
File or document type: What type of file or document are you examining?
Date, time, and creator: When was it created? Who created it? When was it last modified?
Title and description: What is the name of the item you are examining? What type of content does it contain?
Geolocation: If you’re examining a photo, where was it taken?
Tags and categories: What is the general overview of the item that you have? Is it indexed or described in a specific way?
Who last modified it and when: Were any changes made to the file? If yes, when were the most recent modifications made?
Who can access or update it: If you’re examining a dataset, is it public? Are special permissions needed to customize or modify it?
In today’s digital world, metadata is everywhere! Here are some examples—with accompanying images—of where you might find metadata.
Whenever a photo is captured with a camera, metadata such as filename, date, time, geolocation, and the type of device on which it was taken are gathered and saved with it. The metadata of the following photo is displayed as a pop-up alongside the photo.
When an email is sent or received, it contains metadata such as subject line, sender, recipient, date sent, and time sent.
Emails also contain hidden metadata that includes server names, IP addresses, HTML format, and software details. This image includes hidden email metadata such as the message ID and when the email was created.
Spreadsheets and documents are already filled with a considerable amount of data, so it’s no surprise that they also include metadata such as title, author, creation date, number of pages, and user comments. Additionally, spreadsheet metadata includes tab names, tables, and columns. In the following example, the image demonstrates the metadata for an electronically created Google Sheet:
Every web page has a number of standard metadata fields such as tags and categories, the site creator’s name, web page title and description, and time of creation. Results of search engine queries that you might make on a daily basis are metadata!
Non-digital items can have metadata, too! Every book has standard metadata that will inform you of its title, author’s name, a table of contents, publisher information, copyright description, index, and a brief description of the book’s contents. Audiobook metadata also includes this data, as well as metadata specific to the audiobook such as narrator and recording length.
Metadata can be found in photos, emails, spreadsheets, websites, and much more! In your daily life, you use metadata to stay organized. As a data analyst, you’ll use metadata to understand the content and context of your data, as well as how it’s structured. Metadata provides data analysts with information about a data’s file type, title, geolocation, who created it, who last modified it, and who has access to it. As a data analyst, it’s important to keep accurate records of metadata to ensure that you are able to find, use, preserve, and reuse data in the future. Remember, it will be your responsibility to manage and make use of data in its entirety; metadata is as important as the data itself.
As you’re learning, metadata is data about data. It clearly describes how and when data was collected and how it’s organized. Metadata puts data into context and makes the data more understandable. This helps data analysts use data to solve problems and make informed business decisions.
In this reading, you’ll learn more about the benefits of metadata, metadata repositories, and metadata of external databases.
Data analysts use reliable and high-quality data to identify the root causes of any problems that might occur during analysis and to improve their results. If the data being used to solve a problem or to make a data-driven decision is unreliable, there’s a good chance the results will be unreliable as well.
Metadata helps data analysts confirm their data is reliable by making sure it is:
Accurate
Precise
Relevant
Timely
It does this by helping analysts ensure that they’re working with the right data and that the data is described correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily determine if they should use data from a particular file.
Data analysts thrive on consistency and aim for uniformity in their data and databases, and metadata helps make this possible. For example, to use survey data from two different sources, data analysts use metadata to make sure the same collection methods were applied in the survey so that both datasets can be compared reliably.
When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:
Organized: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata.
Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.
Accessed: Users, applications, and systems can efficiently locate and use data.
Together, these benefits empower data analysts to effectively analyze and interpret their data.
Metadata repositories help data analysts ensure their data is reliable and consistent.
Metadata repositories are specialized databases specifically created to store and manage metadata. They can be kept in a physical location or a virtual environment—like data that exists in the cloud.
Metadata repositories describe where the metadata came from and store that data in an accessible form with a common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t use a metadata repository, they would have to select each file to look up its information and compare the data manually, which would waste a lot of time and effort.
Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata repositories do this by describing the state and location of the data, the structure of the tables inside it, and who has accessed the data.
Data analysts use both second-party and third-party data to gain valuable insights and make strategic, data-driven decisions. Second-party data is data that’s collected by a group directly from the group’s audience and then sold. Third-party data is provided by outside sources that didn’t collect it directly. The providers of this data are not its original collectors and do not have a direct relationship with any individuals to whom the data belongs. The outside providers get the data from websites or other programs that pull it from the various platforms where it was originally generated.
Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable. In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been obtained are best practices when using data that comes from another organization.
Metadata helps data analysts make data-driven decisions more quickly and efficiently. It also ensures that data and databases are reliable and consistent.
Metadata repositories are used to store metadata—including data from second-party and third-party companies. These repositories describe the state and location of the metadata, the structure of the tables inside it, and who has accessed the repository. Data analysts use metadata repositories to ensure that they use the right data appropriately.
Question 1
A large company has several databases across its many departments. What kind of metadata describes how many locations contain a certain piece of data?
Representative
Administrative
Descriptive
Structural
1 point
Question 2
A large metropolitan high school gives each of its students an ID number to differentiate them in its database. What kind of metadata are the ID numbers?
Representative
Administrative
Descriptive
Structural
1 point
Question 3
An international nonprofit organization wants to merge third-party data with its own data. Which of the following actions will help make this process successful? Select all that apply.
Use metadata to standardize the datasets.
Use metadata to evaluate the third-party data’s quality and credibility.
Replace the incoming data’s metadata with its own company metadata.
Alter the internal metadata to more closely reflect the incoming metadata.
1 point
Question 4
Fill in the blank: Data _____ is a process data professionals use to ensure the formal management of their organization’s data assets.
storage
sourcing
governance
organization
1 point
In an earlier lesson, you explored comma separated values (.csv) files and how to import them into a Google Sheet or download them from a website. Data analysts use .csv files often, so throughout this course you will use .csv files to transfer data into data analysis programs for further analysis and visualization. .csv files are plain text files with an organized table structure that includes rows and columns. The values in each row are separated by commas. The table structure makes .csv files easy to understand, edit, manipulate, and use for data analysis.
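For example, a small, hypothetical .csv file of daily sales might look like this when opened in a plain text editor. The first line holds the column headers, each following line is one row, and commas mark where one column ends and the next begins:

date,flavor,units_sold
2019-06-01,chocolate,14
2019-06-01,vanilla,9
2019-06-02,chocolate,11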
A major advantage of .csv files is their widespread compatibility. They can be imported and exported by a vast range of data analysis tools and software programs.
To use .csv files and upload them to data analysis programs you will first need to download them to your local device. Downloading a .csv file from a website can vary depending on your operating system or internet browser. Here are some ways you can download a .csv file:
Click the download link or .csv attachment: Locate the link for the .csv file or attachment on the website. Click on it, and the download process will start.
Right-click and Save: Right-click on the data table or element containing the .csv data. Choose Save as… or a similar option. Name the file and make sure the extension on the file is “.csv”.
Force download: You can use the Alt key on your keyboard while clicking the link. This will trigger the download, and you will be able to find the .csv file in your Downloads folder.
Note: When using the Chrome browser or ChromeOS, .csv files may open in a new tab instead of downloading to your machine. If this happens, follow these instructions:
Select File from the menu bar, then select Save as Google Sheets. This will open the .csv file as a Google Sheet.
Select File from the menu bar, then select Download from the dropdown menu, then select Comma Separated Values (.csv).
You will often need to upload .csv files during the data analysis process. Here is how you do this:
Locate the upload option: Each data analysis platform will have a designated button, menu option, or drag and drop area labeled Upload or Import. This is where you will upload your .csv file.
Choose your .csv file: Click Upload or Import on the platform you are using to open your file explorer. Select your .csv file. If you just downloaded a .csv file from the web, it will be located in your computer’s Downloads folder.
Initiate the upload: Once you've selected your .csv file, click Upload or Import. The platform may display a progress bar or message indicating that the upload is complete.
Note: Some platforms have restrictions on the file size or format of .csv files. Make sure your .csv files adhere to these requirements before uploading.
Data analysis programs help us extract insights and knowledge from data. Using .csv files is essential in data analysis. Understanding how to easily download data from the web or add your data to these programs will allow you to complete data cleaning, visualizations, analysis, and so much more!
This reading outlines the steps the instructor performs in the following video, Importing data from spreadsheets and databases. The video teaches you how to import a .csv file into a Google Sheet so you can analyze the data.
Keep this guide open as you watch the video. It can serve as a reference if you need additional context or clarification while following the video steps. This is not a graded activity, but you can complete these steps to practice the skills demonstrated in the video.
To follow along with the examples in this video, open a blank spreadsheet. Additionally, download the following .csv file:
Example Spreadsheet - Entertainment Expenses - Sheet1
Sometimes, you’ll need to import the data from a .csv file into a Google Sheet. A .csv file saves data in a table format. Follow the steps below to bring the data from a .csv file into a new spreadsheet.
Note: The .csv file you import will differ from the .csv file the instructor imports in the video.
In the menu, select File then Import. The Import file window will pop up.
Select Upload then Browse to select the .csv file to import. For this example, select the .csv file you downloaded in the previous section.
Next select the Import location. You can:
Create a new spreadsheet
Insert the .csv data as a new sheet
Replace spreadsheet
Replace current sheet
Append (add) the data to the current spreadsheet
Replace the data starting with a specific cell.
Next, select the Separator type. Google Sheets defaults to automatically detecting the separator, or delimiter, type. To manually set the delimiter type, select the dropdown menu under Separator type and choose the separator.
Next, determine if you would like the text to be imported with or without formatting. In the box next to Convert text to numbers, dates, and formulas, keep the checkmark if you want text data to be formatted.
Select Import data. The data in the .csv file will be loaded into your sheet, and you can begin using it.
You can download public data from the internet, such as data from the World Health Organization and the Global Health Observatory.
Note: The Global Health Observatory’s website has been updated since this video was filmed. Follow these instructions to download the .csv file the instructor uses.
Navigate to the Global Health Observatory workforce statistics database.
Scroll to navigate to the Medical doctors table.
Then, scroll to EXPORT DATA in CSV format in the table. Right-click it, then select Save link as...
The .csv file will download to your computer as data.csv.
Note: If you already have a .csv file named data.csv, your computer will add a number to the file name.
[Screenshot: the Medical doctors table, with columns for Medical doctors (number), General medical practitioners (number), and Specialist medical practitioners (number), along with the EXPORT DATA in CSV format link.]
Follow the steps below to bring the .csv file you downloaded, data.csv, into a new spreadsheet.
Open a blank spreadsheet.
In the menu, select File, then Import. The Import file window will pop up.
Select Upload then Browse to select the .csv file to import.
Next, select the Import location. You can:
Create a new spreadsheet
Insert the .csv data as a new sheet
Replace spreadsheet
Replace current sheet
Append (add) the data to the current spreadsheet
Replace the data starting with a specific cell.
Next, select the Separator type.
Next, determine if you would like the text to be imported with or without formatting. In the box next to Convert text to numbers, dates, and formulas, keep the checkmark if you want text data to be formatted.
Select Import data.
The data in the .csv file will be loaded into your sheet, and you can review and title it.
As you’ve learned, you can import data from some data sources, like .csv files into a Google spreadsheet from the File menu. Keep in mind that, when you use this method, data that is updated in the .csv will not automatically be updated in the Google Sheet. Instead, it will need to be manually—and continually—updated in the Google Sheet. In some situations, such as when you want to be able to keep track of changes you’ve made, this method is ideal. In other situations, you might need to keep the data the same in both places, and using data that doesn’t update automatically can be time-consuming and tedious. Further, trying to maintain the same dataset in multiple places can cause errors later on.
Fortunately, there are tools to help you automate data imports so you don’t need to continually update the data in your current spreadsheet. Take a small general store as an example. The store has three cash registers handled by three clerks. At the end of each day, the owner wants to determine the total sales and the amount of cash in each register. Each clerk is responsible for counting their money and entering their sales total into a spreadsheet. The owner has the spreadsheets set up to import each clerk’s data into another spreadsheet, where the total sales for all three registers are calculated automatically. Without this automation, each clerk would have to take turns entering their data into the owner’s spreadsheet. This is an example of a dynamic method of importing data, which saves the owner and clerks time and energy. When data is dynamic, it is interactive and automatically changes and updates over time.
In the following sections you’ll learn how to import data into Google Sheets dynamically.
In Google Sheets, the IMPORTRANGE function can import all or part of a dataset from another Google Sheet.
To use this function, you need two pieces of information:
The URL of the Google Sheet from which you’ll import data.
The name of the sheet and the range of cells you want to import into your Google Sheet.
Once you have this information, open the Google Sheet into which you want to import data and select the cell into which the first cell of data should be copied. Enter = to indicate you will enter a function, then complete the IMPORTRANGE function with the URL and range you identified in the following manner: =IMPORTRANGE("URL", "sheet_name!cell_range"). Note that an exclamation point separates the sheet name and the cell range in the second part of this function.
An example of this function is:
=IMPORTRANGE("https://docs.google.com/thisisatestabc123", "sheet1!A1:F13")
Note: This URL is for syntax purposes only. It is not meant to be entered into your own spreadsheet.
Once you’ve completed the function, a box will pop up to prompt you to allow access to the Google Sheet from which you’re importing data. You must allow access to the spreadsheet containing the data the first time you import it. Because the example URL above is not a real spreadsheet, replace it with the URL of a spreadsheet you have created yourself so that you can grant access by selecting the Allow access button.
Refer to the Google Help Center's IMPORTRANGE page for more information about the syntax. You’ll also learn more about this later in the program.
Importing HTML tables is a basic method to extract data from public web pages. This process is often called “scraping.” Web scraping made easy introduces how to do this with Google Sheets or Microsoft Excel.
In Google Sheets, you can use the IMPORTHTML function to import the data from an HTML table (or list) on a web page. This function is similar to the IMPORTRANGE function. Refer to the Google Help Center's IMPORTHTML page for more information about the syntax.
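The syntax is =IMPORTHTML(url, query, index), where query is either "table" or "list" and index indicates which table or list on the page to import. An example of this function, with a URL for syntax purposes only, is:

=IMPORTHTML("https://www.example.com/statistics", "table", 1)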
Sometimes data displayed on the web is in the form of a comma- or tab-delimited file.
You can use the IMPORTDATA function in a Google Sheet to import data into a Google Sheet. This function is similar to the IMPORTRANGE function. Refer to Google Help Center's IMPORTDATA page for more information and the syntax.
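An example of this function, with a URL for syntax purposes only, is:

=IMPORTDATA("https://www.example.com/data/sales.csv")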
Open data helps create a lot of public datasets that you can access to make data-driven decisions. Here are some resources you can use to start searching for public datasets on your own:
The Google Cloud Public Datasets allow data analysts access to high-demand public datasets, and make it easy to uncover insights in the cloud.
The Dataset Search can help you find available datasets online with keyword searches.
Kaggle has an Open Data search function that can help you find datasets to practice with.
Finally, BigQuery hosts 150+ public datasets you can access and use.
Global Health Observatory data: You can search for datasets from this page or explore featured data collections from the World Health Organization.
The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset, this data is hosted by the Google Cloud Public Datasets and can be uploaded to BigQuery.
1000 Genomes: This is another dataset from the Google Cloud Public resources that can be uploaded to BigQuery.
National Climatic Data Center: The NCDC Quick Links page has a selection of datasets you can explore.
NOAA Public Dataset Gallery: The NOAA Public Dataset Gallery contains a searchable collection of public datasets.
UNICEF State of the World’s Children: This dataset from UNICEF includes a collection of tables that can be downloaded.
CPS Labor Force Statistics: This page contains links to several available datasets that you can explore.
The Stanford Open Policing Project: This dataset can be downloaded as a .csv file for your own use.
Question 1
What are some key benefits of open-data initiatives? Select all that apply.
1 point
Help educate citizens about important issues
Limit opportunities for collaboration
Make government activities more transparent
Support innovation and economic growth
Question 2
What type of file saves data in a table format?
1 point
Compatible scientific variables (.csv)
Comma-separated values (.csv)
Calculated spreadsheet values (.csv)
Cell-structured variables (.csv)
Question 3
Bringing data from a .csv file into a spreadsheet is an example of what process?
1 point
Filing data
Editing data
Normalizing data
Importing data
Question 4
In Google Sheets, what function enables a data analyst to specify a range of cells in one spreadsheet to be duplicated in another?
1 point
SPECIFY
CELLRANGE
DUPLICATE
IMPORTRANGE
By now, you have been introduced to cleaning data in spreadsheets, as well as core spreadsheet skills such as sorting and filtering. In this activity, you will use sorting and filtering to clean up a dirty dataset.
Data cleaning corrects or removes incorrect, missing, and faulty data. Cleaning data is of critical importance because an analysis based on dirty data can lead to wrong conclusions and bad decisions. The cleaner your data, the better your results.
Review the following scenario. Then complete the step-by-step instructions.
For this activity, imagine you are a data analyst working for the superintendent of a large public school district in Portugal. The superintendent wants to know what factors affect student grades in core subjects and what changes can be made to improve student performance. Your team is going to analyze performance data on high school student achievement in two Portuguese public schools, Gabriel Pereira (GP) and Mouzinho da Silveira (MS). The data was collected by the school district by means of academic reports and student surveys. The data includes information such as:
Student grades
Student background information
Student study time
Student participation in extracurricular activities
However, before analyzing the data, it’s important to make sure the data is clean. Analyzing bad or dirty data could cause the school district to reach the wrong conclusions and implement ineffective changes. Your assignment is to help clean the data.
By the time you complete this activity, you will be able to sort data in different ways, apply filters to remove incorrect data and fill in missing data, and convert text data to number format. Cleaning data is a critical phase of the data analysis process. Sorting and filtering are useful techniques for data cleaning, and are also key skills that you will draw on throughout your career as a data analyst.
Follow the instructions to complete each step of the activity. Then answer the questions at the end of the activity before going to the next course item.
To get started, access the spreadsheet that contains the data. Click the link and make a copy of the spreadsheet.
Or, if you don’t have a Google account, you may download the dataset directly from the attachment below:
It’s important to make sure your data is clean so that your eventual analysis will be correct. The first thing to do is check the values in the columns most relevant to your analysis and find out if there is anything for you to clean. In this example, the superintendent’s main objective is to determine what factors drive student performance. To begin answering this question, the columns you want to focus on first are school, age, reason, Medu, and Fedu. You can use sorting and filtering to clean the data in each of these columns.
Because you have data from two schools, Gabriel Pereira (GP) and Mouzinho da Silveira (MS), you can start by sorting the data by school. Then, you can also sort by age to discover the age ranges of the students for each school. Sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize.
1. To start, rename your spreadsheet. In the upper left corner, click Untitled Spreadsheet and enter a new name. You can use the name student_performance_data or a similar name that describes the data your spreadsheet contains.
2. Now, sort by school. Because you want to sort on multiple columns, you need to select all the data in your spreadsheet. Click the blank rectangle above row 1 and to the left of column A. This lets you select all the data on your sheet.
3. Next, from the menu bar, select Data, then Sort range. (Note: For some versions of Google Sheets, the selection Advanced range sorting options may appear on the Data drop-down menu instead of Sort range).
4. In the pop-up window, select Data has header row. Now you can choose specific column headers to sort by.
5. In the Sort by dropdown, choose the header school. Then, click A → Z to sort in ascending order.
6. You also want to sort for age. Before you can sort by age, you need to click Add another sort column to choose a second column header.
7. In the Sort by dropdown, choose the header age. This time, click Z → A to sort in descending order. This way, the oldest students will be listed first.
The pop-up window should now show two sort conditions: school sorted A → Z and age sorted Z → A.
8. Once both selections have been made, click Sort.
Now, if you scroll through the data, you’ll notice that the age range of the students at Gabriel Pereira (GP) is 15-22 years, and the age range of the students at Mouzinho da Silveira (MS) is 15-20 years. It appears that both schools have similar age ranges, but the GP school has students that are a little older.
By sorting the data, you’ve discovered a potential problem with the data. Because this dataset represents high school student achievement, any age older than 18 may indicate that a mistake was made when entering that student's age. You now know what age data may need to be researched and corrected. Your next step is to ask the superintendent about the legitimate age range for students in public high school. Then, you’ll know what age data is incorrect and should be removed.
The superintendent tells you that the maximum age limit for which public education is provided is 19 years old and that the age range should be 15-19 for both schools. Any student outside this age range should be deleted from the dataset.
To clean your data, you need to remove the ages 20, 21, and 22 from your dataset. You can start by applying a filter to the age column. Filtering is the process of showing only the data that meets a specified criteria while hiding the rest. Filtering makes it easier to find data that you need.
1. First, apply a filter to the age column. Select the age column by clicking the letter at the top of the column (C).
2. Then, from the menu bar, select Data, then Create a filter.
3. You can now inspect the values in the age column by going to the top of the column and clicking the Filter icon.
4. In Google Sheets, there are eight possible values for the field (15, 16, 17, 18, 19, 20, 21, and 22). You may notice that all the values have check marks. Filter this column so that only the ages you need to remove (20, 21, and 22) remain checked by unchecking all the other values (15, 16, 17, 18, and 19).
5. Then, click OK. This will single out the rows that contain the ages 20, 21, and 22. After you apply the filter, there should be nine such rows (seven for the GP school and two for the MS school).
6. To delete the nine rows, first select them by clicking their row numbers.
7. Then, from the menu bar, select Edit and Delete selected rows.
8. Click the Filter icon at the top of the age column to inspect the values once again. Now that you’ve removed the three incorrect ages (20, 21, and 22), there are five ages remaining (15, 16, 17, 18, and 19). The remaining ages are legitimate and can be used for analysis.
9. Finally, turn off the filter. From the menu bar, select Data and Turn off filter.
Filling in missing data is an important part of data cleaning. It’s your job to fill in these blank spaces in your data with accurate values.
The superintendent wants to know what factors influence student performance, and a student’s reason for choosing a specific school will be important to know for analysis. The reason column shows the main reason a student chose to enroll in a specific school, according to their survey response: for example, because of the school’s reputation, or because it offers certain courses, etc. So, you need to make sure the reason column is complete and without blanks.
1. Start by applying a filter across the entire spreadsheet. Click on any cell in the sheet. Then, from the menu bar, select Data and Create a filter.
2. All the cells are now highlighted, and there are filters at the top of every column containing data. Click the Filter icon on the reason column (K).
3. You may notice that the data values in the reason column include blanks. Filter this column for blanks by unchecking all the other values (course, home, reputation).
4. Then, click OK. Now, your sheet shows all the blank rows in the reason column.
5. To clean your data, you need to find a good way to fill in these missing values. In this case, you cannot know what each missing value should be (that is, without a new survey, you can’t discover each student’s reason for choosing a specific school). So, you can replace the missing values with the value none_given. To do this while the column is still filtered for blanks, enter none_given in the first empty cell (K38). Then, press Enter.
6. Select cell K38 again. A small blue square, known as the fill handle, appears in the bottom-right corner of the cell. Double click the fill handle to fill all the other blank cells with the value none_given.
7. Finally, turn off the filter. From the menu bar, select Data and Turn off filter. If you scroll down the reason column, you should find that the value none_given has replaced all the blanks in the reason column.
During the data analysis process, it's sometimes necessary to change text data (words) to numeric data (numbers). For example, some statistical packages like those used to perform machine learning will only accept numeric data values as input.
In this case, the superintendent wants to know if a parent’s education level is a significant factor in student performance. The relevant data is in the Medu and Fedu columns, which refer to the level of education of a student’s mother and father, respectively. Currently, the data is in text format. For the purposes of analysis, it will be useful to know the average education level of each student’s parents. To make this calculation, you first need to convert the data in the Medu and Fedu columns to number format.
To do this, you can match specific number values to the text data in each column. Start with the Medu column. If you click on the Filter icon at the top of the Medu column (G), you’ll notice the column contains the text data shown in the table below. You can use the following numeric codes for each piece of text data:
5. Then, click Replace all.
6. While still in the popup window, repeat this process (steps 4-5) for the other four educational levels: primary education (4th grade), 5th to 9th grade, secondary education, and higher education.
7. After replacing all five educational levels with numeric values, click Done to close the pop-up window.
8. Check out your spreadsheet. All the cells in the Medu column now display numeric values.
9. Change the text data in the Fedu column (H) in the same way.
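Once both columns contain numeric values, you can calculate the average education level of each student’s parents. For example, with Medu in column G and Fedu in column H (as noted above), a formula like the one below, entered in a new column and copied down, would return the average for the student in row 2:

=AVERAGE(G2:H2)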
Be sure to save a copy of the spreadsheet template you used to complete this activity. You can use it for further practice or to help you work through your thought processes for similar tasks in a future data analyst role.
Now that you’ve sorted and filtered data in a spreadsheet and been introduced to databases, pause for a moment to think about what you’re learning. In this self-reflection, you will consider the knowledge you’ve gained about spreadsheets and databases and reflect on their similarities and differences. Then, you’ll respond to several questions.
This self-reflection will help you develop insights into your own learning and prepare you to apply your knowledge of data preparation to situations you might encounter as a data analyst. As you answer questions—and come up with questions of your own—consider the concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the work, so make sure to get the most out of it!
Follow the instructions to complete each step of the activity. Then answer the question at the end of the activity before going to the next course item.
Spreadsheets and databases have many things in common. For example, both can be used to store and organize data. However, databases can be relational, whereas spreadsheets cannot. This means that spreadsheets are better-suited for self-contained data, where the data exists in one place.
Here are some questions to consider about spreadsheet and databases:
How is data stored?
How are each of them used to interact with data?
How powerful are they relative to one another?
What are the pros and cons of each when sorting data?
What are the pros and cons of each when filtering data?
As you consider each of these questions or come up with your own questions, compile them into a simple table to compare and contrast spreadsheets and databases. Use a spreadsheet to create your table. Start with a column titled Question. Then, add each question to a row in the table, modifying it as needed to allow you to compare and contrast spreadsheets and databases in a table. Then, add two more columns titled Spreadsheets and Databases.
As you’ve been learning, BigQuery is a database you can use to access, explore, and analyze data from many sources. Now, you’ll begin using BigQuery, which will help you gain SQL knowledge by typing out commands and troubleshooting errors. This reading will guide you through the process of setting up your very own BigQuery account.
Note: Working with BigQuery is not a requirement of this program. Additional resources for other SQL database platforms are also provided at the end of this reading if you choose to use them instead.
BigQuery offers a variety of account tiers to cater to various user needs and has two free-of-charge entry points, a sandbox account and a free-of-charge trial account. These options allow you to explore the program before selecting the best choice to suit your needs. A sandbox account allows you to practice writing queries and to explore public datasets free of charge, but it has quotas and limits, as well as some additional restrictions. If you prefer to use BigQuery with the standard limits, you can set up a free-of-charge trial account instead. The free-of-charge trial is a trial period prior to paying for a subscription. In this instance, there is no automatic charge, but you will be asked for payment information when you create the account.
This reading provides instructions for setting up either account type. An effective first step is to begin with a sandbox account and switch to a free-of-charge trial account when needed to run the SQL presented in upcoming courses.
The sandbox account is available at no cost, and anyone with a Google account can use it. However, it does have some limitations. For instance, you are limited to a maximum of 12 projects at a time. This means that, to create a 13th project, you'll need to delete one of your existing 12 projects. Additionally, the sandbox account doesn't support all operations you’ll do in this program. For example, there are limits on the amount of data you can process and you can’t insert new records into a database or update the values of existing records. However, a sandbox account is perfect for most program activities, including all of the activities in this course. Additionally, you can convert your sandbox account into a free-of-charge trial account at any time.
Set up your sandbox account
To set up a sandbox account:
Visit the BigQuery sandbox documentation page.
Log in to your preferred Google account by selecting the profile icon in the BigQuery menu bar.
Select the Go to BigQuery button on the documentation page.
You'll be prompted to select your country and read the terms of service agreement.
This will bring you to the SQL Workspace, where you'll be conducting upcoming activities. By default, BigQuery creates a project for you.
After you set up your account, the name of the project will be in the banner in your BigQuery console.
If you wish to explore more of BigQuery's capabilities with fewer limitations, consider the Google Cloud Free Trial. It provides you with $300 in credit for Google Cloud usage during the first 90 days. If you're primarily using BigQuery for SQL queries, you're unlikely to come close to this spending limit. After you've used up the $300 credit or after 90 days, your free trial will expire, and you will only be able to use this account if you pay to do so. Google won't automatically charge your payment method when the trial ends. However, you'll need to set up a payment option with Google Cloud. This means that you’ll need to enter your financial information. Rest assured, it won't charge you unless you consciously opt to upgrade to a paid account. If you're uncomfortable providing payment information, don't worry; you can use the BigQuery sandbox account instead.
Set up your free-of-charge trial
Go to the BigQuery page.
Select Try BigQuery free.
Log in using your Google email, or create a Google account free of charge if you don't have one.
Select your country, a description of your organization or needs, and the checkbox to accept the terms of service. Then select CONTINUE.
Enter your billing information and select START MY FREE TRIAL.
After you set up your account, your first project, titled My First Project, will be in the banner.
With either a sandbox or free-of-charge trial account, you have the flexibility to upgrade to a paid account at any time. If you upgrade, all your existing projects will be retained and transferred to your new account. If you started with a free-of-charge trial, but choose not to upgrade when it ends, you can switch to a sandbox account. However, note that projects from your trial won't transfer to your sandbox. Essentially, creating a sandbox is like starting from scratch.
It’s easiest to follow along with the course activities if you use BigQuery, but you may use other SQL platforms, if you prefer. If you decide to practice SQL queries on other database platforms, here are some resources to get started:
BigQuery offers multiple account options. Keep the following in mind when you choose an account type:
Account tiers: BigQuery provides various account tiers to cater to a wide range of user requirements. Whether you're starting with a sandbox account or exploring a paid account with the free-of-charge trial option, BigQuery offers flexibility to choose the option that aligns best with your needs and budget.
Sandbox limitations: While a sandbox account is a great starting point, it comes with some limitations, such as a cap on the number of projects and restrictions on data manipulation operations like inserting or updating records, which you will encounter later in this program. Be aware of these limitations if you choose to work through this course using a sandbox account.
Easy setup and upgrades: Getting started with any BigQuery account type is quick and easy. And if your needs evolve, you have the flexibility to modify your account status at any time. Additionally, projects can be retained even when transitioning between account types.
Choose the right BigQuery account type to match your specific needs and adapt as your requirements change!
BigQuery is a data warehouse on the Google Cloud Platform used to query and filter large datasets, aggregate results, and perform complex operations. Throughout this program, you’re going to use BigQuery to practice your SQL skills and collect, prepare, and analyze data. At this point, you have set up your own account. Now, explore some of the important elements of the SQL workspace. This will prepare you for the upcoming activities in which you will use BigQuery. Note that BigQuery updates its interface frequently, so your console might be slightly different from what is described in this reading. That’s okay; use your troubleshooting skills to find what you need!
When you log in to BigQuery using the landing page, you will automatically open your project space. This is a high-level overview of your project, including the project information and the current resources being used. From here, you can check your recent activity.
Navigate to your project’s BigQuery Studio by selecting BigQuery from the navigation menu and BigQuery Studio from the dropdown menu.
Once you have navigated to BigQuery from the project space, most of the major components of the BigQuery console will be present: the Navigation pane, the Explorer pane, and the SQL Workspace.
On the console page, find the Navigation pane. This is how you navigate from the project space to the BigQuery tool. This menu also contains a list of other Google Cloud Project (GCP) data tools. During this program, you will focus on BigQuery, but it’s useful to understand that the GCP has a collection of connected tools data professionals use every day.
The Explorer pane lists your current projects and any starred projects you have added to your console. It’s also where you’ll find the + ADD button, which you can use to add datasets.
This button opens the Add dialog that allows you to open or import a variety of datasets.
Starring bigquery-public-data will enable you to search for and add public datasets by scrolling in the Explorer pane or by searching for them in the Explorer search bar.
For example, you might want to select a different public dataset. If you select the second dataset, "austin_311," it will expand to list the table stored in it, “311_service_requests.”
When you select a table, its information is displayed in the SQL Workspace. Select the 311_service_requests table to examine several tabs that describe it, including:
Schema, which displays the column names in the dataset
Details, which contains additional metadata, such as the creation date of the dataset
Preview, which shows the first rows from the dataset
Additionally, you can select the Query button from the menu bar in the SQL Workspace to query this table.
The final menu pane in your console is the SQL Workspace. This is where you will actually write and execute queries in BigQuery.
The SQL Workspace also gives you access to your personal and project history, which stores a record of the queries you’ve run. This can be useful if you want to return to a query to run it again or use part of it in another query.
In addition to offering access to public datasets, BigQuery also gives you the ability to upload your own data directly into your workspace. Access this feature by opening the + ADD menu again or by clicking the three vertical dots next to your project’s name in the Explorer pane. This will give you the option to create your own dataset and upload your own tables. You will have the opportunity to upload your own data in an upcoming activity to practice using this feature!
BigQuery's SQL workspace allows you to search for public datasets, run SQL queries, and even upload your own data for analysis. Whether you're working with public datasets, running SQL queries, or uploading your own data, BigQuery’s SQL workspace offers a range of features to support all kinds of data analysis tasks. Throughout this program, you will be using BigQuery to practice your SQL skills, so being familiar with the major components of your BigQuery console will help you navigate it effectively in the future!
This reading provides you with the steps the instructor performs in the following video, BigQuery in action. The video focuses on how to create a query to view a small section of data from a large dataset.
Keep this guide open as you watch the video. It can serve as a helpful reference if you need additional context or clarification while following the video steps. This is not a graded activity, but you can complete these steps to practice the skills demonstrated in the video.
To follow along with the examples in this video, log in to your BigQuery account and follow the instructions in the Explorer pane section of the previous reading, Get Started with BigQuery, to star bigquery-public-data.
A database is a collection of data stored in a computer system. Query languages such as SQL enable communication between databases and data analysts. You discovered earlier that a relational database is made up of several tables that may be joined together to create relationships. Primary and foreign keys serve as representations of these relationships. To extract data from these tables, data analysts use queries. To learn more about that, explore BigQuery in action:
Log in to BigQuery and go to your console. You should find the Welcome to your SQL Workspace! landing page open. Select COMPOSE A NEW QUERY in the BigQuery console. Make sure that no tabs are open so that the entire workspace is displayed, including the Explorer pane.
Enter sunroof in the search bar. In the search results, expand sunroof_solar and then select the solar_potential_by_postal_code table.
Observe the Schema tab of the Explorer pane to explore the table fields.
Select the Preview tab to view the regions, states, yearly sunlight, and more.
In order to view the entire dataset, you will need to write a query.
The first step is finding out the complete, correct path to the table you want to work with. Select the ellipsis (three vertical dots) next to the solar_potential_by_postal_code table, then select Query. A new tab will populate on your screen. Select the tab. The path to the table should be written inside two backticks.
Select the full path by highlighting the text including the backticks and copy it. (Note: You can also get the full path to the project, database, and table directly by clicking the ellipses next to the table's name in the Explorer panel on the left and selecting Copy ID.)
Now, click on the plus sign to create a new query. Notice that BigQuery doesn’t automatically generate a SELECT statement in this window. Enter SELECT and add a space after it.
Put an asterisk * after SELECT to indicate you want to return the entire dataset. The asterisk lets the database know to include all columns. Without this shortcut, you would have to manually enter every column name!
Next, press the Enter/Return key and enter FROM on the second line. FROM indicates where the data is coming from. After FROM, add another space.
Paste in the path to the table that you copied earlier. It will read `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`
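Put together, the query in your editor should now read:
SELECT *
FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`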
Execute the query by selecting the RUN button.
Important!
Many of the public databases on BigQuery are living records and, as such, are periodically updated with new data. Throughout this course (and others in this certificate program), if your results differ from those you encounter in videos or screenshots, there's a good chance it is due to a data refresh. You can verify when a table has been refreshed by selecting it from the Explorer panel and clicking Details. You'll find the date the table was created, when it was last modified, as well as other useful information.
If your project doesn’t require every field, you can use SQL to return only a particular piece, or pieces, of the data. To do this, specify a certain column name in a WHERE clause to filter the results.
For example, you might only need data from Pennsylvania. You’d begin your query the same way you just did in the previous examples: Click on the plus sign, enter SELECT, add a space, an asterisk (*), and then press Enter/Return.
Enter FROM and then paste `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`. Press Enter/Return.
This time, add WHERE on the same line as the FROM statement. Add a space and enter state_name, leaving a space before and after it. state_name is a column name in the table.
Because you only want data from Pennsylvania, add = and 'Pennsylvania' on the same line as state_name. In SQL, single quotes represent the beginning and ending of a string.
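At this point, the full query should look like this:
SELECT *
FROM `bigquery-public-data.sunroof_solar.solar_potential_by_postal_code`
WHERE state_name = 'Pennsylvania'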
Execute the query with the RUN button.
Review the data on solar potential for Pennsylvania. Scroll through the query results.
Keep in mind that SQL queries can be written in a lot of different ways and still return the same results. You might discover other ways to write these queries!
You have recently been introduced to BigQuery, a data warehouse on Google Cloud that data analysts use to query and filter large datasets, aggregate results, and perform complex operations. In this activity, you will explore the BigQuery interface, add public data to your console, and write some simple SQL queries using SELECT, FROM, and WHERE.
By the time you complete this activity, you will be more familiar with writing queries in the BigQuery interface. This will enable you to practice SQL, which is important for working with databases in your career as a data analyst.
Follow the instructions to complete each step of the activity. Then answer the questions at the end of the activity before going to the next course item.
For this activity, you will need a BigQuery account. If you haven’t made one already, follow the instructions from the Using BigQuery reading.
Once you have your account, start exploring!
1. Log in to BigQuery.
2. Select the Go to console button on the BigQuery homepage. This will open a new tab with your console.
3. Take a moment to explore your console. The Explorer pane includes a search bar for finding resources, a list of pinned projects, and the + ADD button for adding data. The Editor welcome page is where you can open a query editor, try sample data, add local data files, connect to Google Cloud Storage, or add other external connections. You can also find your job history, query history, and saved queries here.
Step 3: Access public data in BigQuery
In order to start writing queries, you will need some data to work with. Once you’re familiar with the BigQuery interface, you can access a public dataset directly from your console.
1. Select the search bar in the Explorer pane.
2. Enter “london bicycle” in the search box and press Enter; this will return the london_bicycles database from the Greater London Authority. Select the database for more details. If you cannot find it, make sure you’re searching in all projects; the london_bicycles database is in the bigquery-public-data project.
3. Select the arrow to the left of the london_bicycles database name. This expands the dataset to reveal two table names: cycle_hire and cycle_stations. Select the cycle_hire table name within the Explorer pane.
This will pull the cycle_hire schema into the console. Take a moment to explore the field names and the associated information.
4. Now, select the PREVIEW tab to find a sample of the data that you’ll be working with.
Once you have finished previewing the data, write a query!
So far, you’ve learned three basic parts of a query: SELECT, FROM, and WHERE. As a refresher:
SELECT is the section of a query that indicates what data you want SQL to return to you.
FROM is the section of a query that indicates which table the desired data comes from. You must provide a full path to the table. The path includes the project name, database name, and table name, each separated by a period.
WHERE is the section of a query that indicates any filters you’d like to apply to your table.
Now, construct a simple command using the basic parts of a query you have already learned! For example, you can select a specific column from the cycle_hire table, such as the end_station_name column.
1. Select the blue + button, or QUERY > In new tab, to start a new query.
2. Start your query with a SELECT clause, and indicate which column you want to select from the table; in this case, you’ll input end_station_name.
3. After you have indicated which column you are selecting, write your FROM clause. Specify the table you are querying from by inputting the following location: `bigquery-public-data.london_bicycles.cycle_hire`;
The completed query should appear like this:
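SELECT end_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire`;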
4. Run your completed query by selecting the blue RUN button.
This query may take a few seconds to execute. Once it has finished, you will find the list of station names you requested under the Query Results console pane.
After running the first basic query, try answering a specific question about the data. For example, how many bike trips lasted for 20 minutes or longer?
1. Select the blue + button, or QUERY > In new tab, to start a new query. Start with your SELECT statement again. This time, include the two columns duration and start_station_name in the query. The data in these columns will tell you how long each trip lasted and where it started. Be sure to separate each column name with a comma.
2. Next, add your FROM statement. You will be using the same table as the previous query: FROM `bigquery-public-data.london_bicycles.cycle_hire`;. Note: The backticks around the table in this line of code are optional.
3. Finally, add a WHERE statement to specify that you want to filter for only bike rides 20 minutes or longer. If you check the preview of this data, you might notice that the duration is recorded in seconds, so you’ll specify 1200 seconds in your query. Write that as WHERE duration >= 1200;
Your completed query should be written like this:
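SELECT duration, start_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE duration >= 1200;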
This query may take a few seconds to execute. Once it has finished, you will find a list of rides from this table that fit your criteria. There are millions of rows with bike trips that are 20 minutes or longer!
If you’re comfortable using queries to answer questions, try creating and running queries to complete the tasks below:
What is the name of the station whose start_station_id is 111?
Return all the rental_ids, station IDs, and station names that bike_id 1710 started from.
What is the bike_model of bike_id 58782?
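If you get stuck, here are sketches of one possible approach to each task. There are other valid ways to write these queries, and the column names are assumed from the task descriptions and the cycle_hire schema you previewed earlier.
-- Task 1: station name for start_station_id 111
SELECT start_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE start_station_id = 111;
-- Task 2: rentals that bike_id 1710 started from
SELECT rental_id, start_station_id, start_station_name
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE bike_id = 1710;
-- Task 3: bike model of bike_id 58782
SELECT bike_model
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE bike_id = 58782;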
Use the solutions doc to check your work: Intro to BigQuery Solutions
Save this reading for future reference. Feel free to download a .pdf version of this reading below:
DAC3-In-depth-guide_-SQL-best-practices.pdf
These best practices include guidelines for entering SQL queries, developing documentation, and examples that demonstrate these practices. This is a great resource to have handy when you are using SQL yourself; you can just go straight to the relevant section to review these practices. Think of it like a SQL field guide!
With SQL, capitalization usually doesn’t matter. You could enter SELECT or select or SeLeCT. They all work! But if you use capitalization as part of a consistent style, your queries will look more professional.
To enter SQL queries like a pro, it is always a good idea to use all caps for clause starters (e.g. SELECT, FROM, WHERE, etc.). Functions should also be in all caps (e.g. SUM()). Column names should be all lowercase (refer to the section on snake_case later in this guide). Table names should be in CamelCase (refer to the section on CamelCase later in this guide). This helps keep your queries consistent and easier to read while not impacting the data that will be pulled when you run them. The only time that capitalization does matter is when it is inside quotes (more on quotes below).
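For instance, a query that follows these conventions might look like the sketch below; the table and column names here are hypothetical.
SELECT
  customer_state,
  SUM(order_total) AS total_sales
FROM
  CustomerOrders
WHERE
  order_year = 2023
GROUP BY
  customer_state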
Vendors of SQL databases may use slightly different variations of SQL. These variations are called SQL dialects. Some SQL dialects are case sensitive for string comparisons; BigQuery is one of them, and Vertica is another. But most, like MySQL, PostgreSQL, and SQL Server, aren’t case sensitive. This means that if you search for country_code = 'us' in one of those dialects, it will return all entries that have 'us', 'uS', 'Us', and 'US'. This isn’t the case with BigQuery. BigQuery is case sensitive, so that same search would only return entries where the country_code is exactly 'us'. If the country_code is 'US', BigQuery wouldn’t return those entries as part of your result.
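As a quick sketch of the difference (using a hypothetical customer_data table), the comparison below matches only rows where the value is exactly 'US' in BigQuery, while a case-insensitive dialect such as MySQL would also match 'us', 'Us', and 'uS'.
SELECT *
FROM customer_data
WHERE country_code = 'US'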
For the most part, it also doesn’t matter if you use single quotes ' ' or double quotes " " when referring to strings. For example, SELECT is a clause starter. If you put SELECT in quotes like 'SELECT' or "SELECT", then SQL will treat it as a text string. Your query will return an error because your query needs a SELECT clause.
But there are two situations where it does matter what kind of quotes you use:
When you want strings to be identifiable in any SQL dialect
When your string contains an apostrophe or quotation marks
Within each SQL dialect there are rules for what is accepted and what isn’t. But a general rule across almost all SQL dialects is to use single quotes for strings. This helps get rid of a lot of confusion. So if we want to reference the country US in a WHERE clause (e.g. country_code = 'US'), then use single quotes around the string 'US'.
The second situation is when your string has quotes inside it. Suppose you have a table called FavoriteFoods with a favorite_food column and another column that identifies each friend.
SQL understands text strings as either starting with a single quote ' or double quote ". Since this string starts with double quotes, SQL will expect another double quote to signal the end of the string. This keeps the apostrophe safe, so it will return "Shepherd's pie" and not 'Shepherd'.
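Here is a sketch of the kind of query the paragraph describes, assuming the FavoriteFoods table also has a friend column:
SELECT friend
FROM FavoriteFoods
WHERE favorite_food = "Shepherd's pie"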
As you get more comfortable with SQL, you will be able to read and understand queries at a glance. But it never hurts to have comments in the query to remind yourself of what you are trying to do. And if you share your query, it also helps others understand it.
For example:
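Below is a sketch using a hypothetical purchases table with a tickets column; note the comment above the query and the mix of unnamed and named (aliased) columns.
-- Calculate ticket totals and purchase counts
SELECT
  SUM(tickets),                          -- unnamed, appears as f0 in the results
  COUNT(tickets),                        -- unnamed, appears as f1 in the results
  SUM(tickets) AS total_tickets,         -- named column
  COUNT(tickets) AS number_of_purchases  -- named column
FROM
  purchases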
The first two columns are named f0 and f1 because they weren’t named in the above query. SQL defaults to f0, f1, f2, f3, and so on. We named the last two columns total_tickets and number_of_purchases so these column names show up in the query results. This is why it is always good to give your columns useful names, especially when using functions. After running your query, you want to be able to quickly understand your results, like the last two columns we described in the example.
On top of that, you might notice how the column names have an underscore between the words. Names should never have spaces in them. If total_tickets had a space and looked like total tickets then SQL would throw a syntax error because it wouldn't know what to do with the second word (tickets). So, spaces are bad in SQL names. Never use spaces.
The best practice is to use snake_case. This means that 'total tickets', which has a space between the two words, should be entered as total_tickets with an underscore instead of a space.
You can also use CamelCase capitalization when naming your table. CamelCase capitalization means that you capitalize the start of each word, like a two-humped (Bactrian) camel. So the table TicketsByOccasion uses CamelCase capitalization. Please note that capitalizing the first word in CamelCase is optional; camelCase is also used. Some people differentiate between the two styles by calling the capitalized version PascalCase and reserving camelCase for when the first word isn't capitalized, like a one-humped (Dromedary) camel; for example, ticketsByOccasion.
At the end of the day, CamelCase is a style choice. There are other ways you can name your tables, including:
All lower or upper case, like ticketsbyoccasion or TICKETSBYOCCASION
With snake_case, like tickets_by_occasion
Keep in mind, the option with all lowercase or uppercase letters can make it difficult to read your table name, so it isn’t recommended for professional use.
The second option, snake_case, is technically okay. With words separated by underscores, your table name is easy to read, but it can get very long because you are adding the underscores. It also takes more time to enter. If you use this table a lot, it can become a chore.
In summary, it is up to you to use snake_case or CamelCase when creating table names. Just make sure your table name is easy to read and consistent. Also be sure to find out if your company has a preferred way of naming their tables. If they do, always go with their naming convention for consistency.
As a general rule, you want to keep the length of each line in a query <= 100 characters. This makes your queries easy to read. For example, check out this query with a line with >100 characters:
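The sketch below uses a hypothetical customers table. The first version crams the whole query onto one long line; the second breaks it into short, indented lines.
-- Hard to scan: the entire query sits on a single line that runs well past 100 characters
SELECT customer_name, customer_email, customer_city FROM customers WHERE signup_date >= '2023-01-01' AND customer_city = 'Atlanta' AND is_active = TRUE
-- Easier to read: one clause (or condition) per line, indented consistently
SELECT
  customer_name,
  customer_email,
  customer_city
FROM
  customers
WHERE
  signup_date >= '2023-01-01'
  AND customer_city = 'Atlanta'
  AND is_active = TRUE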
Now it is much easier to understand what you are trying to do in the SELECT clause. Sure, both queries will run without a problem because indentation doesn’t matter in SQL. But proper indentation is still important to keep lines short. And it will be valued by anyone reading your query, including yourself!
If you make comments that take up multiple lines, you can use -- for each line. Or, if you have more than two lines of comments, it might be cleaner and easier to use /* to start the comment and */ to close the comment. For example, you can use the -- method like below:
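Here is a sketch using a hypothetical customers table:
-- This query lists active customers
-- so the marketing team can follow up with them
SELECT customer_name
FROM customers
WHERE is_active = TRUE
/* This block comment covers the same information
   as the two -- lines above, using a single opening
   and closing marker instead */
SELECT customer_name
FROM customers
WHERE is_active = TRUE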
In SQL, it doesn’t matter which method you use. SQL ignores comments regardless of what you use: #, --, or /* and */. So it is up to you and your personal preference. The /* and */ method for multi-line comments usually looks cleaner and helps separate the comments from the query. But there isn’t one right or wrong method.
When you join a company, you can expect each company to use their own SQL platform and SQL dialect. The SQL platform they use (e.g. BigQuery, MySQL, or SQL Server) is where you will enter and run your SQL queries. But keep in mind that not all SQL platforms provide native script editors to enter SQL code. SQL text editors give you an interface where you can enter your SQL queries in an easier and color-coded way. In fact, all of the code we have been working with so far was entered with an SQL text editor!
If your SQL platform doesn’t have color coding, you might want to think about using a text editor like Sublime Text or Atom. These editors highlight clause starters, functions, and strings in different colors, which makes your queries easier to scan.
Recently, you’ve been thinking about identifying good data sources that would be useful for analysis. You also spent some time in a previous activity exploring a public dataset in BigQuery and writing some basic SQL queries. In addition to using public data in BigQuery, your future data career will involve importing data from other sources. In this activity, you will create a custom table and dataset that you’ll load into a new table and query.
By the time you complete this activity, you will be able to load your own data into BigQuery for analysis. This will enable you to import your own data sources into BigQuery, which is a skill that will enable you to more effectively analyze data from different sources.
Follow the instructions to complete each step of the activity. Then answer the questions at the end of the activity before going to the next course item.
To get started, download the baby names data zip file. This file contains about 7 MB of data about popular baby names from the U.S. Social Security Administration website.
Select the link to the baby names data zip file to download it.
Link to baby names data: names.zip
Unzip the file you downloaded onto your computer to access it on BigQuery. Once you have unzipped the file, find a .pdf file titled NationalReadMe that contains more information about the dataset. This dataset tracks the popularity of baby names for each year; you can find text files labeled by the year they contain. Open yob2014.txt to preview the data. You will notice that it contains comma-separated values in three columns. Remember where you saved this folder so you can reference it later.
Before uploading your .txt file and creating a table to query, you will need to create a dataset to upload your data into and store your tables.
1. From the BigQuery console, go to the Explorer pane in your workspace and select the three dots next to your project to open a menu. From here, select Create dataset. Note that unless you have already specified your own project name, a unique name is assigned to your project by BigQuery, typically in the format of two words and a number, separated by hyphens (e.g. loyal-glass-371423).
Question 1
Fill in the blank: When using SQL, the _____ clause can be used to filter a dataset of customers to only include people who have made a purchase in the past month.
1 point
WHERE
SELECT
FILTER
FROM
Question 2
Which cases are most often used for column names in a database table and represent a SQL best practice? Select all that apply.
1 point
Sentence case
Snake case
Lowercase
Camel case
Question 3
A database table is named WebTrafficAnalytics. What type of case is this?
1 point
Snake case
Lowercase
Camel case
Sentence case
Question 4
What can be removed from the following query without preventing it from running or changing the results?
SELECT *
FROM `Uni_dataset.new_table`
WHERE ID = 'Lawrence'
1 point
SELECT
Backticks (`)
The asterisk (*)
WHERE
Administrative metadata: Metadata that indicates the technical source of a digital asset
CSV (comma-separated values) file: A delimited text file that uses a comma to separate values
Data governance: A process for ensuring the formal management of a company’s data assets
Descriptive metadata: Metadata that describes a piece of data and can be used to identify it at a later point in time
Foreign key: A field within a database table that is a primary key in another table (Refer to primary key)
FROM: The section of a query that indicates where the selected data comes from
Geolocation: The geographical location of a person or device by means of digital information
Metadata: Data about data
Metadata repository: A database created to store metadata
Naming conventions: Consistent guidelines that describe the content, creation date, and version of a file in its name
Normalized database: A database in which only related data is stored in each table
Notebook: An interactive, editable programming environment for creating data reports and showcasing data skills
Primary key: An identifier in a database that references a column in which each value is unique (Refer to foreign key)
Redundancy: When the same piece of data is stored in two or more places
Schema: A way of describing how something, such as data, is organized
SELECT: The section of a query that indicates the subset of a dataset
Structural metadata: Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection
WHERE: The section of a query that specifies criteria that the requested data must meet
World Health Organization: An organization whose primary role is to direct and coordinate international health within the United Nations system
Every data analyst’s goal is to conduct efficient data analysis. One way to increase the efficiency of your analyses is to streamline processes that help save time and energy in the long run. Meaningful, logical, and consistent file names help data analysts organize their data and automate their analysis process. When you use consistent guidelines to describe the content, date, or version of a file and its name, you’re using file naming conventions.
In this reading, you’ll learn more about best practices for file naming conventions and file organization.
File-naming conventions help you organize, access, process, and analyze data because they act as quick reference points to identify what’s in a file. One important practice is to decide on file naming conventions—as a team or company—early in a project. This will prevent you from spending time updating file names later, which can be a time-consuming process. In addition, you should align your project’s file names with your team’s or company’s existing file-naming conventions. You don’t want to spend time learning a new file-naming convention each time you look up a file in a new project!
It's also critical to ensure that file names are meaningful, consistent, and easy-to-read. File names should include:
The project’s name
The file creation date
Revision version
Consistent style and order
Further, file-naming conventions should act as quick reference points to identify what is in the file. Because of this, they should be short and to the point.
In the following sections, you’ll explore each part of a sales report file name that follows an established naming convention, SalesReport_20231125_v02. This example will help you understand the key parts of a strong file name and why they’re important.
Giving a file a meaningful name to describe its contents makes searching for it straightforward. It also makes it easy to understand the type of data the file contains.
In the example, the file name includes the text SalesReport, a succinct description of what the file contains: a sales report.
Knowing when a file was created can help you understand if it is relevant to your current analysis. For example, you might want to analyze only data from 2023.
In the example, the date is written as 20231125. This reads as the sales report from November 25, 2023, following the year, month, and day (YYYYMMDD) format of the international date standard. Keep in mind that different countries follow different date conventions, so make sure you know the date standard your company follows.
Including a revision version helps ensure you’re working with the correct file. You wouldn’t want to make edits to an old version of a file without realizing it! When you include revision numbers in a file name, lead with a zero. This way, if your team reaches more than nine rounds of revisions, double digits are already built into your convention.
In the example, the version is described as v02. The v is short for the version of the file, and the number following the v indicates which round of revisions the file is currently in.
Make sure the information you include in a file name follows a consistent order. For example, you wouldn’t want version three of the sales report in the example to be titled 20231125_v03_SalesReport. It would be difficult to find and compare multiple documents.
When you use spaces and special characters in a file name, software may not be able to recognize them, which causes problems and errors in some applications. An alternative is to use hyphens, underscores, and capital letters. The example includes underscores between each piece of information, but your team could also choose to separate the year, month, and day within the date: SalesReport_2023_11_25_v02.
To ensure all team members use the agreed-upon file naming conventions, create a text file as a sample that includes all of the naming conventions on a project. This can benefit new team members to help them quickly get up to speed or a current team member who just needs a refresher on the file naming conventions.
To keep your files organized, create folders and subfolders—in a logical hierarchy—to ensure related files are stored together and can be found easily later. A hierarchy is a way of organizing files and folders. Broader-topic folders are located at the top of the hierarchy, and more specific subfolders and files are contained within those folders. Each folder can contain other folders and files. This allows you to group related files together and makes it easier to find the files you need. In addition, it’s a best practice to store completed files separately from in-progress files so the files you need are easy to find. Archive older files in a separate folder or in an external storage location.
Use consistent, meaningful file-naming conventions throughout your project to save you and your team time by making data easy to find and use. File-naming conventions should be agreed upon by all team members before starting a project and should describe the project by including its name, the date, and the revision version. Document this information in a location that team members can access.
Question 1
Fill in the blank: Naming _____ are consistent guidelines used to describe the content, date, or version of a file.
conventions
attributes
references
descriptors
1 point
Question 2
What are the key goals of foldering in data analytics? Select all that apply.
Keep project-related files together
Organize files into subfolders
Transfer files from one place to another
Assign metadata about the folders
1 point
Question 3
What is the process of structuring folders broadly at the top, then breaking down those folders into more specific topics?
Assigning naming conventions
Producing a backup
Developing metadata
Creating a hierarchy
1 point
Question 4
Which of the following examples would be the most effective file name?
May30-2019_AirportAdCampaignResults_Terminals3-5_InclCustSurveyResponses_Ideas
CampaignData_03
Data_519
AirportCampaign_2013_10_09_V01
1 point
Data security means protecting data from unauthorized access or corruption by putting safety measures in place. Usually the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data. Data analysts have to find a way to balance data security with their actual analysis needs. This can be tricky: we want to keep our data safe and secure, but we also want to use it as soon as possible so that we can make meaningful and timely observations.
In order to do this, companies need to find ways to balance their data security measures with their data access needs.
Luckily, there are a few security measures that can help companies do just that. The two we will talk about here are encryption and tokenization.
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated data referred to as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
Encryption and tokenization are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology.
As a junior data analyst, you probably won’t be responsible for building out these systems. A lot of companies have entire teams dedicated to data security or hire third party companies that specialize in data security to create these systems. But it is important to know that all companies have a responsibility to keep their data secure, and to understand some of the potential systems your future employer might use.
However, one thing you absolutely can do to help strike the right balance is to use version control best practices. Version control enables all collaborators within a file to track changes over time. You can understand who made what changes to a file, when they were made, and why.
Here's a simple example: Perhaps you're working on a project with a team of other people. You are all collaborating within the same set of files, but each person is responsible for a different part of the project. Without version control, it would be very difficult to keep track of who made what changes to the files and when. This would lead to confusion and, even worse, people accidentally overwriting each other's work! Version control is essential for data analytics professionals because it allows users to effectively collaborate with others and experiment with new ideas without fear of losing their work.
Now that you have learned about the importance of data security, you can pause for a moment and think about what you are learning. In this self-reflection, you will consider your thoughts about data privacy, collaboration, and version control, then respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to apply your knowledge of data privacy to your experience with Kaggle. The data privacy skills you’re developing will apply to many of the tools you’ll use throughout your career as a data analytics professional. This is an example of a transferable skill: your understanding of data privacy in Kaggle can be applied to other tools in the future.
As you answer questions—and come up with questions of your own—you will consider concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Follow the steps to learn how Kaggle helps you manage your data. Then answer the question at the end of the activity before going to the next course item.
On Kaggle, you can upload your own datasets and keep them private. This means that they are visible and accessible by only you. You also have the option to add collaborators to your dataset, whom you can add as viewers or editors. Viewers are able to see your private dataset and editors are able to make changes to your private dataset.
You can share the link to your private dataset so anyone with the link is able to view it. If you don’t want this feature, you can disable it in the Settings tab of your dataset.
Note: If you have a private dataset on Kaggle and you choose to make it public, you will not be able to make the dataset private again. The only option you would have is to delete the dataset from Kaggle completely.
Any notebooks that you create on Kaggle are private by default. Like in datasets, you can add collaborators as viewers or editors. You can also make a notebook public, which will share it with the entire Kaggle community.
If you add collaborators to your Kaggle notebook, they can make changes to it. You want to make sure you communicate and coordinate with your collaborators because the last person who saves the notebook will overwrite all of the previous work. If you’d like more fine-grained control of changes to your code, a system like GitHub provides more version control.
As for version control, Kaggle has its own style of letting you keep records of your progress. You can read all of the details in this post, but think back to when you’ve done some work in a Kaggle notebook and clicked on the Save Version button.
When you clicked this button then clicked Save, you did it without changing anything. But you also have the option to add a short descriptive note about what changes you’ve made.
This can be helpful when you’ve made changes to your notebook but want to go back to an earlier version. To do this, go to Edit mode and click on the number next to the Save Version text at the top of your notebook.
This will open a navigation bar on the right side of the screen and list out all of the versions of your notebook. When you click on different versions of your notebook, the left side of the screen will populate with the code and text from that version.
Then, once the version has run, a screen will open showing that version of your notebook.
From this screen you can also open the version in Viewer mode, pin a version as the default, or even change the version name. Pinning a version as the default can be helpful when you have a working version of your notebook available to the Kaggle community, but want to make changes and updates that might not work the first time you implement them. This allows you to safely make changes behind the scenes while sharing with the Kaggle community the most recent working version of your notebook.
Question 1
Fill in the blank: Data security involves adopting _____ in order to protect data from unauthorized access or corruption.
1 point
safety measures
metadata strategy
data validation
foldering procedures
Question 2
What data-security measure uses a unique algorithm to alter data and make it inaccessible without the algorithm?
1 point
Password-protection
AI authentication
Encryption
Hidden tabs
Question 3
When using tokenization as a safety measure, what is replaced as a randomly generated token?
1 point
The data elements to be protected
The user’s location
The line of code from a query
The user’s search history
Question 4
What data-security practice enables all collaborators within a file to track changes, such as who made what edits to the file, when they were made, and why?
1 point
Version control
Password-protection
Foldering
Authentication
Access control: Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
Data security: Protecting data from unauthorized access or corruption by adopting safety measures
Inbox: Electronic storage where emails received by an individual are held
Signing up with LinkedIn is easy. Just follow these simple steps:
Browse to linkedin.com
Click Join now or Join with resume.
If you clicked Join now:
Enter your email address and a password and click Agree & Join (or click Join with Google to link to a Google account).
Enter your first and last name and click Continue.
Enter your country/region, your postal code, and your location within the area (this helps LinkedIn find job opportunities near you).
Enter your most recent job title, or select I’m a student.
If you entered your most recent job title, select your employment type and enter the name of your most recent company.
If you selected self-employed or freelance, LinkedIn will ask for your industry.
Click confirm your email address. You will receive an email from LinkedIn.
To confirm your email address, click Agree & Confirm in your email.
LinkedIn will then ask if you are looking for a job. Click the answer that applies. If you select Yes, LinkedIn will help you start looking for job opportunities.
If you clicked Join with resume:
Click Upload your resume and select the file to upload.
Follow any of the steps under Join Now that are relevant.
The Join with resume option saves you some time because it auto-fills most of the information from your resume. And just like that, your initial profile is now ready!
It is a good idea to take your time filling out every section of your profile. This helps recruiters find your profile and helps people you connect with get to know you better. Start with your photo. Here are some tips to help you choose a great picture for your new profile:
Choose an image that looks like you: You want to make sure that your profile is the best representation of you and that includes your photo. You want a potential connection or potential employer to be able to recognize you from your profile picture if you were to meet.
Use your industry as an example: If you are having trouble deciding what is appropriate for your profile image, look at other profiles in the same industry or from companies you are interested in to get a better sense of what you should be doing.
Choose a high-resolution image: The better the resolution, the better impression it makes, so make sure the image you choose isn’t blurry. The ideal image size for a LinkedIn profile picture is 400 x 400 pixels. Use a photo where your face takes up at least 60% of the space in the frame.
Remember to smile: Your profile picture is a snapshot of who you are as a person so it is okay to be serious in your photo. But smiling helps put potential connections and potential employers at ease.
Connections are a great way to keep up to date with your previous coworkers, colleagues, classmates, or even companies you want to work with. The world is a big place with a lot of people. So here are some tips to help get you started.
Connect to people you know personally.
Add a personal touch to your invitation message. Instead of just letting them know you would like to connect, let them know why.
Make sure your profile picture is current so people can recognize you.
Add value. Provide them with a resource, a website link, or even some content they might find interesting in your invitation to connect.
LinkedIn is a great place to find great people and great ideas. From technology to marketing, and everything in between, there are all kinds of influencers and thought leaders active on LinkedIn. If you have ever wanted to know the thoughts of some of the most influential and respected minds in a certain field, LinkedIn is a great place to start. Following your favorite people takes only a few minutes. You can search for people or companies individually, or you can use these lists as starting points.
LinkedIn Top Voices 2020: Data Science & AI
On LinkedIn, letting recruiters and potential employers know that you are in the market for a new job is simple. Just follow these steps:
Click the Me icon at the top of your LinkedIn homepage.
Click View profile.
Click the blue Open to button to open a drop-down menu, and under Intro, select Finding a new job.
Provide the requested job titles and locations within the pop-up window.
When you are done updating, click on the Save button which will add the tags to your public profile.
Make sure to select the appropriate filters for the new positions you might be looking for and update your profile to better fit the role that you are applying for.
Add to your profile to keep it complete, current, and interesting. For example, remember to add the Google Data Analytics Certificate to your profile after you complete the program!
A connection is someone you know and trust on a personal or professional basis. Your connections are who make up your network. And when it comes to your network, it is important to remember quality over quantity. So don’t focus on how many connections you have. Instead, make sure that everyone you connect with adds value to your network, and vice versa.
Adding connections on LinkedIn is easy. You invite people to join your network, and they accept your invitation. When you send an invitation, you can attach a personal note. Personal notes are highly recommended.
A great way to increase the number of your connections is to invite classmates, friends, teachers, or even members of a club or organization you are in. LinkedIn also gives suggestions for connections based on your profile information. Here's an example (template) that you can use to connect with a former co-worker:
Cold requests on LinkedIn are invitations to connect with people you don’t know personally or professionally. When you start to build your network, it is best to connect with people you already know. But cold requests might be the only way to connect with people who work at companies you are interested in. You can learn a lot about a company’s culture and job openings from current employees. As a best practice, send cold requests rarely and only when there is no other way to connect.
Recommendations on LinkedIn are a great way to have others vouch for you. Ask people to comment on your past performance, how you handled a challenging project, or your strengths as a data analyst. You can choose to accept, reject, show, or hide recommendations in your profile.
Here are some tips for asking for a recommendation:
Reach out to a variety of people for a 360-degree view: supervisors, co-workers, direct reports, partners, and clients
Personalize the recommendation request with a custom message
Suggest strengths and capabilities they can highlight as part of your request
Be willing to write a recommendation in return
Read the recommendation carefully before you accept it into your profile
Sometimes the hardest part of getting a recommendation is creating the right request message. Here's an example (template) that you can use to ask for a recommendation:
Ask a few connections to recommend you and highlight why you should be hired. Recommendations help prospective employers get a better idea of who you are and the quality of your work.
When you write thoughtful posts and respond to others genuinely, people in and even outside your network will be open and ready to help you during your job search.
Now that you have been introduced to the Kaggle platform, you can pause for a moment and apply what you are learning. In this self-reflection, you will consider your thoughts about your online presence and respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to connect your knowledge of Kaggle to your goals for your online presence. As you answer questions—and come up with questions of your own—you will consider concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Follow the instructions to complete each step of the activity. Then answer the question at the end of the activity before going to the next course item.
As you get more familiar with the career options available to data analysts, you’ll find that it’s important to have an online presence. By engaging with the data community online, you can ask questions, learn new skills, and demonstrate your achievements to potential employers.
You’ve already covered several ways that you can build your online presence, from LinkedIn to GitHub to Medium. To develop a stronger connection to the data community and interact with fellow data professionals and enthusiasts, you can also build an online presence on Kaggle.
In addition to datasets, Kaggle has micro-courses, competitions, forums, and a large community of users. With the Kaggle Progression System, you track your progress and growth within the platform and demonstrate your data skills and savvy to employers and colleagues.
To start on Kaggle, you should set up a profile. Your Kaggle profile page is a collection of all of your work and accomplishments on Kaggle. Check out an example profile page to explore how a profile can be used to share information: Heads or Tails
As you browse the Kaggle profile for Heads or Tails, you can also observe the work they did to reach high ranks in each category. They also include links. For example, they include a link from their profile to their personal blog and videos. This allows them to share their Kaggle profile with prospective employers in order to highlight their skills and progress.
Building up your Kaggle profile, or any other form of social media presence, takes time. Being patient with your progress and engaging consistently will draw more attention to your work.
Before moving into the next section, take a few minutes to look at different Kaggler profiles, and think about what they’re doing that inspires you. If you encounter a profile of someone whose work you’d like to follow, click the Follow User button on their profile.
To make it easier to get started on Kaggle, begin by reviewing how the Kaggle community operates. You can do this by reading through their Community Guidelines and checking out their Getting Started on Kaggle YouTube playlist. Then, spend time exploring the Kaggle Winner’s blog posts and checking out the Learn section. When you are ready to reflect, copy the URL of a blog post that resonates with you, and be ready to share it in the upcoming discussion activity.
In this reading, you will be introduced to online and in-person opportunities to connect with other data analysts. This is part of how you develop professional relationships, which is very important when you are just starting out in your career.
If you spend a few hours on social media every day you might be totally comfortable connecting with other data analysts online. But, where should you look if you don’t know any data analysts?
Even if you aren’t on social media and just created your LinkedIn profile yesterday, you can still use your online presence to find and network with other data analysts.
Knowing where to look is key. Here are some suggestions on where to start online:
Subscriptions to newsletters like Data Elixir. Not only will this give you a treasure trove of useful information on a regular basis, but you will also learn the names of data science experts who you can follow, or possibly even connect with if you have good reason to.
Hackathons (competitions) like those sponsored by Kaggle, one of the largest data science and machine learning communities in the world. Participating in a hackathon might not be for everyone. But after joining a community, you typically have access to forums where you can chat and connect with other data analysts.
Meetups, or online meetings that are usually local to your geography. Enter a search for ‘data science meetups near me’ to see what results you get. There is usually a posted schedule for upcoming meetings so you can attend virtually to meet other data analysts. Find out more information about meetups happening around the world.
Platforms like LinkedIn and Twitter. Use a search on either platform to find data science or data analysis hashtags to follow. You can also post your own questions or articles to generate responses and build connections that way. At the time of this writing, the LinkedIn #dataanalyst hashtag had 11,842 followers, the #dataanalytics hashtag had 98,412 followers, and the #datascience hashtag had 746,945 followers. Many of the same hashtags work on Twitter and even on Instagram.
Webinars may showcase a panel of speakers and are usually recorded for convenient access and playback. You can see who is on a webinar panel and follow them too. Plus, a lot of webinars are free. One interesting pick is the Tableau on Tableau webinar series. Find out how Tableau has used Tableau in its internal departments.
In-person gatherings are super valuable in a digitized world. They are a great way to meet people. A lot of online relationships start from in-person gatherings and are carried on after people return home. Many organizations that sponsor annual gatherings also offer virtual meetings and resources during the rest of the year.
Here are a few suggestions to find in-person gatherings in your area:
Conferences usually present innovative ideas and topics. The cost of conferences varies, and some are pricey. But lots of conferences offer discounts to students, and some conferences, like Women in Analytics, aim to increase representation of under-represented groups in the field. Leading research and advisory companies such as Gartner also sponsor conferences for data and analytics. The KDNuggets list of meetings and online events for AI, analytics, big data, data science, and machine learning is useful.
Associations or societies gather members to promote a field like data science. Many memberships are free. The KDNuggets list of societies and groups for analytics, data mining, data science, and knowledge discovery is useful.
User communities and summits offer events for users of data analysis tools; this is a chance to learn from the best. Have you seen the Tableau community?
Non-profit organizations promote the ethical use of data science and may offer events for the professional advancement of their members. The Data Science Association is one example.
Your connections will help you increase your knowledge and skills. Making and keeping connections is also important to those already working in the field of data analytics. So look for online communities that promote data analysis tools or advance data science. And if available where you live, look for meetups to connect with more people face-to-face. Take advantage of both routes for the best of both worlds! It is easier to have a conversation and exchange information in-person, but the key advantage of online connections is that they aren’t limited to where you live. Online communities might even connect you to an international crowd.