SQL

The total length of the videos in this section is approximately 15 minutes.You will also spend time answering short questions and practicing SQL while completing this section.

This module was created by Kily Wong '16.

Note that the youtube videos embedded here were not made by the QAI.

What is SQL?

SQL (pronounced S-Q-L or See-Quel) is a structured query language that is used to access and manage data that is stored within relational database management systems (RDBMS). It is now used in both industry and academic settings. With the introduction of new open-source SQL database solutions, the use of SQL has grown in popularity. While SQL is frequently used in conjunction with SQL Server, there has been increasing demand for the superior statistical computing power of R. Thus, while Microsoft is moving to introduce R into the SQL server, CRAN has also developed a way for people to use SQL within R.

Overview video

Question 1: What is data?

  • An image

  • An information file

  • Someone's age

  • Someone's name

  • All of the above

Show answer

All of the above

Question 2: Where might we find database management systems?

  • Facebook

  • Con Edison

  • Amazon

  • Wellesley College

  • All of the above

Show answer

All of the above. A database management system is a collection of programs which allow users to access databases, manipulate data, and analyze data in meaningful ways. If there is ever a need for a systematic collection of information about people or objects, there is likely to be a DBMS.

Question 3: What type of DBMS defines database relationships in the form of tables and predefined data types?

  • Hierarchical DBMS

  • Relational DBMS

  • Network DBMS

  • Object-oriented relational DBMS

Show answer

Relational DBMS. There are various types of database management systems. Four major types of DBMS include: hierarchical, relational, network, and object-oriented relational. The hierarchical DMBS employs the parent child relationship and is rarely used (tree and nodes). Network DMBS supports many relationships and results in very complex models. Relational DMBS defines database relationships in forms of tables. This is the most frequently used type of DBMS. Object-oriented relational DMBS supports the storage of new data.

Question 4: What is SQL (structured query language) capable of?

  • It is used to insert entries in a database

  • It is used to search within a database

  • It is used to make updates to entries in a database

  • It is used to delete database records

  • All of the above

Show answer

All of the above. SQL allows users to insert, search, update, and delete database records. SQL can help in optimizing and maintaining databases as well.

How to create tables

How to query tables

Aggregating data

Next: Practice a little

Before diving in and writing SQL statements in R, it's useful to understand the syntax of how the SQL language works. In order to see how SQL statements function in general, try playing around in this sandbox.

Suggested Tutorials (These exercises are part of what you are assigned to work on, if you choose SQL this week. Please write down the commands your tried before checking them against the solution doc that's below.):

I. Try writing your own SQL statements using the following:

  • select - select all entries (within the Customers table)

  • count & where - count the number of customers who are from Mexico (within the Customers table)

  • and/or - select all the customers who live in the UK and are from London (within the Customers table)

  • group by - count the number of orders that have the same order ID and group them by the specific order ID (within the OrderDetails table)

  • avg - compute the average price by product name (within the Products table)

II. Manipulating a data set/database

  • insert - Enter a new customer into the Customers table whose Customer ID is 0, customer's name is Kily Wong, contact name is Casey Pattanayak, Address is 21 Wellesley College Road, postal code is 02481, Country is US

  • delete - delete the entry that you just created (where the customer's name is Kily Wong)

  • update - update an entry within the Suppliers table -- modify the entry Exotic Liquid with the following changes: the new phone number is 212-886-4593 and the new contact name is Lee Mitchell

  • order by - order the entries within the Customers table by Country

III. Venturing out to data mining

  • create your own relational database by creating a series of tables and then extracting various information from each of them


The document below shows solutions to these suggested exercises.

Questions 5 and 6 involve the table called customers from the sandbox.

Here is a small sample of the database.

Question 5: Which line of code correctly selects customers from Mexico?

  • select from customers where Country="Mexico"

  • select CustomerName from customers where Country="Mexico"

  • select CustomerName where Country="Mexico" from customers

  • select from customers

  • select customers where Country="Mexico"

Show answer

select CustomerName from customers where Country="Mexico"

Question 6: Which command correctly counts the number of customers from the UK?

  • select count(CustomerName) as NumberofCustomers where Country="UK" from customers

  • count(CustomerName) as NumberofCustomers from Customers where Country="UK"

  • count(CustomerName) from Customers where Country="UK"

  • select count(CustomerName) as NumberofCustomers from Customers where Country="UK"

  • count(CustomerName)

Show answer

select count(CustomerName) as NumberofCustomers from Customers where Country="UK"

Question 7: Which lines of code correctly inserts a new value into the customers' table?

  • Insert into Customers ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');

  • Insert Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');

  • Insert into Customers (CustomerID, CustomerName, ContactName, Address, City, PostalCode, Country) Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');

  • Insert Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US') into Customers (CustomerID , CustomerName, ContactName, Address, City, PostalCode, Country)

  • None of the above

Show answer

Insert into Customers (CustomerID, CustomerName, ContactName, Address, City, PostalCode, Country) Values ('12', 'Wendy Wellesley', 'Tom Brady', '21 Wellesley College Road', 'Wellesley', '02481', 'US');

Question 8: Select the solution which correctly identifies the function(s) and object(s) in this statement: Select ProductName Avg(Price) as "PriceAverage" from Products

  • Function(s): Select, Avg; Object(s): Price

  • Function(s): Avg; Object(s): Price

  • Function(s): Select, Avg; Object(s): ProductName, Price

  • Function(s): Select; Object(s): Avg(Price)

  • Function(s): Select; Object(s): ProductName

Show answer

Function(s): Avg; Object(s): Price

If you can, install an R package to run SQL

Try the following:

1. Open a new R script within the statistical package R or RStudio

2. Type and execute the following lines of code:

install.packages("sqldf")

library(sqldf)

If you get an error message, you can try downloading the add-on that you are prompted to add on (this worked for Prof. P!), or update R, or google the error and see if someone else has fixed it. Unfortunately, solutions for all error types related to the package sqldf() in R are not currently available. This package is known to be a little buggy.

Continue onward if you have successfully installed the sqldf package.

3. Read the brief description regarding the slight variation regarding how to type the SQL code within the R function and begin exploring!

If you are already familiar with SQL, you will notice that the syntax of SQLDF is slightly different from the standard SQL. However, while it is not hard to adjust to, see R's manual for line-by-line guidance. In general, you'll notice that the SQL statements are simply nested within the R function.

Comparison of code:

In R: sqldf('select * from iris where "Sepal.Width" > 3')

In SQL: SELECT * FROM iris WHERE Sepal.Width > 3


Practice exercises:

Part I. Try writing your own SQL statements using the following: (This is part of what you are asked to submit on your assignment - okay to omit if you couldn't open the package)

  • select & where - select all entries from the iris data set where the sepal width is greater than 3

  • count & group by - count all the entries within each species type (within iris dataset)

  • and - select entries where the sepal width is greater than 3 and the petal width is greater than 2 (within iris dataset)

  • or - select entries where the species is either setosa or virginica (within iris dataset)

  • avg - compute the average sepal.width according to species (within iris dataset)

Part II. Manipulating a data set/database

  • insert - does not work with data sets included in R package

  • delete - does not work with data sets included in R package

  • update - does not work with data sets included in R package

  • order by - order all entries by sepal width from smallest to greatest (within iris dataset)

Part III. Venturing out to data mining

  • create your own relational database by creating a series of tables and then extracting various information from each of them

Remember, the tutorials only give you a very limited sampling of the commands and functions that SQL is capable of. Be sure to explore the many options and see which features fit your needs best!


The R solutions are in the file below, so that you can check your own work.

That's all! Please remember to submit your lecture survey.


Also, here are some additional resources:

Books

1. Learning SQL (2nd Edition) by Alan Beaulieu (Review)

2. SQL in 10 Minutes, Sams Teach Yourself (4th Edition) by Ben Forta (Review)

Online Learning Platforms

1. W3school is a great database which houses a plethora of information on a full host of SQL queries that one might want to explore. While the presentation of commands and descriptions lend itself more to self-guided learning, it is best used as a more general help index or support resource.

2. Khan Academy provides structured fully-narrated and rather comprehensive lessons that spans three tracks. Beginning with an introduction to what is SQL and why might someone want to use it all the way to further self-guided learning, the interactive track allows users to jump around and choose to watch videos or simply test out lines of code in sandboxes with prescribed task goals.

3. Codeschool presents a more introductory series in addition to this particular track. As the sequel to the original sequence, this track is more advanced and includes a collection of videos, questions, and practice examples that users can code through. Although the instruction is written with an entertaining premise, this platform does expect users to complete the lessons sequentially and does not allow for as much self-guided learning.

4. Codeacademy - Developed in partnership with Periscope Data, codeacademy provides a self-guided course in SQL which includes lessons, projects, and quizzes. Instructions and coding tasks are straightforward. The only disadvantages are that there is some difficulty in learning the code since the debugging is not the most intuitive and one cannot skip around the lessons. The total estimated course time is three hours.


Technical Job Interviews

Since some people have reported that they were asked about SQL during technical interviews, this section of the SQL tutorial is meant to help students review some concepts that might come up:

I. SQL Joins - allows you to return data from two relational tables

Sample questions & solutions

1) What is the difference between JOIN and OUTER JOIN?

Inner join is the most common and simplest to understand. This query will return all of the records that are in common across two given tables (in a Venn diagram, think of the part between the two intersecting circles: A∩B). Meanwhile, the full outer join or outer join will return all records from both tables (in a Venn diagram, think of taking information from both circles: AUB).

2) Given the below 2 tables, how many rows and columns would the following statement output? Go back to the sandbox and try the following tutorials:

a) Create a new table using the existing Customers and Orders tables to display the following three columns of data: OrderID, CustomerName, OrderDate.

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate FROM Orders INNER JOIN Customers ON Orders.CustomerID=Customers.CustomerID ORDER BY Customers.CustomerName;

b) Create a new table using the existing Customers and Orders table to display the following five columns of data: Customer Name, Contact Name, Country, OrderID, OrderDate. (This function is not supported in the sandbox)

SELECT Customers.CustomerName, Customers.Country, Orders.OrderID, Orders.OrderDate, Orders.OrderDate FROM Customers FULL OUTER JOIN Orders ON Customers.CustomerID=Orders.CustomerID ORDER BY Customers.CustomerName;