The total length of the videos in this section is approximately 25 minutes. You will also spend time answering short questions and practicing SQL while completing this section.
This module was created by Kily Wong '16 and updated by Fridah Ntika '25
Note that the youtube videos embedded here were not made by the QAI.
SQL (pronounced S-Q-L or See-Quel) is a structured query language that is used to access and manage data that is stored within relational database management systems (RDBMS). It is now used in both industry and academic settings. With the introduction of new open-source SQL database solutions, the use of SQL has grown in popularity. While SQL is frequently used in conjunction with SQL Server, there has been increasing demand for the superior statistical computing power of R. Thus, while Microsoft is moving to introduce R into the SQL server, CRAN has also developed a way for people to use SQL within R.
Question 1: What is data?
An image
An information file
Someone's age
Someone's name
All of the above
All of the above
Question 2: Where might we find database management systems?
Con Edison
Amazon
Wellesley College
All of the above
All of the above. A database management system is a collection of programs which allow users to access databases, manipulate data, and analyze data in meaningful ways. If there is ever a need for a systematic collection of information about people or objects, there is likely to be a DBMS.
Question 3: What type of DBMS defines database relationships in the form of tables and predefined data types?
Hierarchical DBMS
Relational DBMS
Network DBMS
Object-oriented relational DBMS
Relational DBMS. There are various types of database management systems. Four major types of DBMS include: hierarchical, relational, network, and object-oriented relational. The hierarchical DMBS employs the parent child relationship and is rarely used (tree and nodes). Network DMBS supports many relationships and results in very complex models. Relational DMBS defines database relationships in forms of tables. This is the most frequently used type of DBMS. Object-oriented relational DMBS supports the storage of new data.
Question 4: What is SQL (structured query language) capable of?
It is used to insert entries in a database
It is used to search within a database
It is used to make updates to entries in a database
It is used to delete database records
All of the above
All of the above. SQL allows users to insert, search, update, and delete database records. SQL can help in optimizing and maintaining databases as well.
Before diving in and writing SQL statements in R, it's useful to understand the syntax of how the SQL language works. In order to see how SQL statements function in general, try playing around in this sandbox.
Suggested Tutorials (These exercises are part of what you are assigned to work on, if you choose SQL this week. Please write down the commands you tried before checking them against the solution doc that's below.):
I. Try writing your own SQL statements using the following:
SELECT - select all entries (within the Customers table)
count & where - count the number of customers who are from Mexico (within the Customers table)
and/or - select all the customers who live in the UK and are from London (within the Customers table)
group by - count the number of orders that have the same order ID and group them by the specific order ID (within the OrderDetails table)
avg - compute the average price by product name (within the Products table)
II. Manipulating a data set/database
insert - Enter a new customer into the Customers table whose Customer ID is 0, customer's name is Kily Wong, contact name is Casey Pattanayak, Address is 21 Wellesley College Road, postal code is 02481, Country is US
delete - delete the entry that you just created (where the customer's name is Kily Wong)
update - update an entry within the Suppliers table -- modify the entry Exotic Liquid with the following changes: the new phone number is 212-886-4593, and the new contact name is Lee Mitchell
order by - order the entries within the Customers table by Country
III. Venturing into data mining
Create your own relational database by creating a series of tables and then extracting various information from each of them
The document below shows solutions to these suggested exercises.
Below is a screenshot of the Customers table you encountered in the exercises above. Use it to answer the following two questions
Question 5: Which line of code correctly selects customers from Mexico?
SELECT from customers where Country='Mexico'
SELECT CustomerName from customers where Country='Mexico'
SELECT CustomerName where Country='Mexico' from customers
SELECT from customers
SELECT customers where Country='Mexico'
SELECT CustomerName from customers where Country='Mexico'
Question 6: Which command correctly counts the number of customers from the UK?
select count(CustomerName) as NumberofCustomers where Country='UK' from customers
count(CustomerName) as NumberofCustomers from Customers where Country="UK"
count(CustomerName) from Customers where Country="UK"
select count(CustomerName) as NumberofCustomers from Customers where Country="UK"
count(CustomerName)
select count(CustomerName) as NumberofCustomers from Customers where Country="UK"
Question 7: Which lines of code correctly inserts a new value into the customers' table?
Insert into Customers ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');
Insert Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');
Insert into Customers (CustomerID, CustomerName, ContactName, Address, City, PostalCode, Country) Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');
Insert Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US') into Customers (CustomerID , CustomerName, ContactName, Address, City, PostalCode, Country)
None of the above
Insert into Customers (CustomerID, CustomerName, ContactName, Address, City, PostalCode, Country) Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');
Question 8: Select the solution which correctly identifies the function(s) and object(s) in this statement: Select ProductName Avg(Price) as "PriceAverage" from Products
Function(s): Select, Avg; Object(s): Price
Function(s): Avg; Object(s): Price
Function(s): Select, Avg; Object(s): ProductName, Price
Function(s): Select; Object(s): Avg(Price)
Function(s): Select; Object(s): ProductName
Function(s): Avg; Object(s): Price
When working with SQL in RStudio, there are several interconnected packages, terminologies, and components that one has to understand before diving into it.
A database server is a system whose responsibility is to store data and respond to SQL requests. Large datasets are not stored in programs like R due to memory and storage capacity. The data lives in the database, and the database server is responsible for organizing it, searching it, and returning results. SQL is the language used to communicate with the database server, telling it which data to retrieve, filter, group, or summarize. When you type SQL commands in R, R sends your request to the database server, which in turn acts on the query and sends back the result(s) as R dataframes.
The DBI (Database Interface) package serves as the communication layer between R and database servers, sending the SQL queries and receiving the results. DBI works with database-specific drivers to connect to the correct database system, i.e., RPostgres for PostgreSQL databases, RSQLite for SQLite databases, and RMySQL for MySQL databases. These drivers act as translators between DBI and the specific database system you're using.
Databases, which we were introduced to in the first video of this module, come in different types. PostgreSQL is a powerful and popular server-based database mainly used for large applications. It supports complex queries and is appreciated for its strong data integrity rules. MySQL is another popular and fast server-based database widely used to power web applications. SQLite, however, is a serverless database that stores data as a single file locally on your computer, making it specifically ideal for smaller projects and learning. While PostgreSQL and MySQL can support many users at any time, SQLite can only support a limited number of users.
Once you are set on the database system you want to use, the next thing to decide is the approach to work with SQL in RStudio.
You can write SQL queries directly in RStudio using DBI's dbGetQuery() function. With this, you can use SQL syntax such as SELECT, HAVING, FROM, GROUPBY, and so on.
The dplyr package provides an R-friendly way to execute SQL queries using functions like select(), filter() instead of WHERE, mean() instead of AVG, summarise(), group_by(), and so on. When you use these functions, the dbplyr package steps in and translates dplyr code to SQL behind the scenes. This is because the dplyr package was originally just designed for dataframes.
Using the sqldf package. This package allows you to work with data already loaded in R's memory, without needing a database connection. As the user, we only need to write our SQL queries referencing the names of our dataframes as though they were database tables.
Remember to always disconnect from your database using dbDisconnect() when finished to free up resources.
It's now your turn to practice using SQL in R. So, here are some tasks to complete:
Try several of the SQL commands in the readings to view the database in various ways.
Run some of the SQL queries shown as examples in the Baumer reading. Edit them and rerun to make sure that the results change in the way you expected.
Think of a question you’d like to answer using the airlines' data, the way we did with the ATUS data. Attempt to answer it using the SQL commands introduced in the reading.
Also, here are some additional resources:
Readings
Data Science, the SQL (https://st47s.com/SDS261/Notes/)
Modern Data Science with R (mdsr-book.github.io/mdsr3e/15-sqlI.html#sec-dplyr-sql )
Books
1. Learning SQL (2nd Edition) by Alan Beaulieu (Review)
2. SQL in 10 Minutes, Sams Teach Yourself (4th Edition) by Ben Forta (Review)
Interview Practice Platform
Online Learning Platforms
1. W3schools is a great database that houses a plethora of information on a full host of SQL queries that one might want to explore. While the presentation of commands and descriptions lends itself more to self-guided learning, it is best used as a more general help index or support resource.
2. Khan Academy provides structured, fully-narrated, and rather comprehensive lessons that span three tracks. Beginning with an introduction to what SQL is and why someone might want to use it, all the way to further self-guided learning, the interactive track allows users to jump around and choose to watch videos or simply test out lines of code in sandboxes with prescribed task goals.
3. Codeschool presents a more introductory series in addition to this particular track. As the sequel to the original sequence, this track is more advanced and includes a collection of videos, questions, and practice examples that users can code through. Although the instruction is written with an entertaining premise, this platform does expect users to complete the lessons sequentially and does not allow for as much self-guided learning.
4. Codecademy - Developed in partnership with Periscope Data, Codecademy provides a self-guided course in SQL, which includes lessons, projects, and quizzes. Instructions and coding tasks are straightforward. The only disadvantages are that there is some difficulty in learning the code since the debugging is not the most intuitive, and one cannot skip around the lessons. The total estimated course time is three hours.
Since some people have reported that they were asked about SQL during technical interviews, this section of the SQL tutorial is meant to help students review some concepts that might come up:
I. SQL Joins - allows you to return data from two relational tables
Sample questions & solutions
1) What is the difference between JOIN and OUTER JOIN?
Inner join is the most common and simplest to understand. This query will return all of the records that are in common across two given tables (in a Venn diagram, think of the part between the two intersecting circles: A∩B). Meanwhile, the full outer join or outer join will return all records from both tables (in a Venn diagram, think of taking information from both circles: AUB).
2) Given the 2 tables below, how many rows and columns would the following statement output? Go back to the sandbox and try the following tutorials:
a) Create a new table using the existing Customers and Orders tables to display the following three columns of data: OrderID, CustomerName, OrderDate.
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID ORDER BY Customers.CustomerName;
b) Create a new table using the existing Customers and Orders table to display the following five columns of data: Customer Name, Contact Name, Country, OrderID, OrderDate. (This function is not supported in the sandbox)
SELECT Customers.CustomerName, Customers.Country, Orders.OrderID, Orders.OrderDate, Orders.OrderDate FROM Customers FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID ORDER BY Customers.CustomerName;