Sql novice survey

August 2018

My learnings of SQL

Why should I learn SQL?

Firstly, as a data scientist you should have wide knowledge of many software languages. SQL (Structured Query Language) is used to handle data held in a 'relational database management system' (RDBMS). For various industries, data is stored in 'data warehouses' is the form of 'databases'. A database is collection of tables. SQL can be used to access and manipulate that data.

Can I learn SQL?

You can learn SQL as there are many versions (Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite) and some are open source. For example, my laptop has sqlite3 installed when I installed miniconda. It is worth installing http://sqlitebrowser.org/ to help visualize the database.

SQL Lesson

We will use the software carpentry lesson on SQL http://swcarpentry.github.io/sql-novice-survey/ which provides a database contains information on William Dyer, Frank Pabodie and Valentina Roerich's expedition to the South Pole.

sqlitebrowser

Open the app. Click 'Open Database'. It shows there are 4 Tables (Person, Site, Survey and Visited). Once you've navigated to a table click the 'Browse Data' tab. This looks like:

The full database looks like:

SQL_sfc

sqlite3

Open the database

$ sqlite3 survey.db
sqlite> .tables -- Shows the tables in the database

sqlite> .help -- For a list of commands

-- Change display settings
sqlite> .mode column -- Left aligned columns. Can replace column with other options
sqlite> .header on -- Display column headers

-- Query two columns in one of the tables
sqlite> SELECT family, personal FROM Person; -- ; means run. It's case insensitive.
Dyer        William   
Pabodie     Frank     
Lake        Anderson  
Roerich     Valentina 
Danforth    Frank

-- Select all columns using *
sqlite> SELECT * FROM Person;
id          personal    family    
----------  ----------  ----------
dyer        William     Dyer      
pb          Frank       Pabodie   
lake        Anderson    Lake      
roe         Valentina   Roerich   
danforth    Frank       Danforth

-- Use DISTINCT to return unique values
sqlite> SELECT DISTINCT quant FROM Survey;
quant     
----------
rad       
sal       
temp
-- Can use to look for unique pairs (or more) of values
sqlite> --SELECT DISTINCT taken, quant FROM Survey;

-- Databases may not be stored in any particular order. You can use ORDER BY and select a column to order by the column alphabetically or numeric
sqlite> SELECT * FROM Person ORDER BY id;
id          personal    family    
----------  ----------  ----------
danforth    Frank       Danforth  
dyer        William     Dyer      
lake        Anderson    Lake      
pb          Frank       Pabodie   
roe         Valentina   Roerich 
-- Use DESC to order by descending
sqlite> SELECT * FROM Person ORDER BY id DESC;
id          personal    family    
----------  ----------  ----------
roe         Valentina   Roerich   
pb          Frank       Pabodie   
lake        Anderson    Lake      
dyer        William     Dyer      
danforth    Frank       Danforth 

-- You can order columns separately if they share the same index
-- ORDER BY taken then for similar taken values ORDER BY person in DESC
sqlite> SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;

-- Filter the data using a WHERE command
-- Find data at a particular site
sqlite> SELECT * FROM Visited WHERE site='DR-1';
id          site        dated     
----------  ----------  ----------
619         DR-1        1927-02-08
622         DR-1        1927-02-10
844         DR-1        1932-03-22

-- Build WHERE commands using AND and the < operator
sqlite> SELECT * FROM Visited WHERE site='DR-1' AND dated<'1930-01-01';
id          site        dated     
----------  ----------  ----------
619         DR-1        1927-02-08
622         DR-1        1927-02-10
sqlite> SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);
-- and the or operator
sqlite> SELECT * FROM Survey WHERE person='lake' OR person='roe';

-- You can use the operators to test for values outside a range e.g. salinity should be between 0 and 1
sqlite> SELECT * FROM Survey WHERE quant='sal' AND ((reading > 1.0) OR (reading < 0.0));

-- A note on dates: ISO-8601 “YYYY-MM-DD HH:MM:SS.SSSS, days since November 24, 4714 BC or seconds since January 1, 1970

-- Instead of using multiple WHERE commands you can use IN and specify a set
sqlite> SELECT * FROM Survey WHERE person IN ('lake', 'roe');
-- Use the AND with OR but being careful with the paranthesis
sqlite> SELECT * FROM Survey WHERE quant='sal' AND (person='lake' OR person='roe');
-- use DISTINCT with WHERE
sqlite> SELECT DISTINCT person, quant FROM Survey WHERE person='lake' OR person='roe';

-- Use LIKE and % as a wildcard to find strings
sqlite> SELECT * FROM Visited WHERE site LIKE 'DR%';
id          site        dated     
----------  ----------  ----------
619         DR-1        1927-02-08
622         DR-1        1927-02-10
734         DR-3        1930-01-07
735         DR-3        1930-01-12
751         DR-3        1930-02-26
752         DR-3                  
844         DR-1        1932-03-22

-- Note: On testing: test on a subset of data (put it into a database) or create a random database and test on that.

-- Modify values in the queury using
sqlite> SELECT 1.05 * reading FROM Survey WHERE quant='rad';

-- To convert Fahrenheit to Celsius and round to two decimal places
sqlite> SELECT taken, round(5*(reading-32)/9, 2) FROM Survey WHERE quant='temp';
taken       round(5*(reading-32)/9, 2)
----------  --------------------------
734         -29.72
-- Rename the header using as
sqlite> SELECT taken, round(5*(reading-32)/9, 2) as Celsius FROM Survey WHERE quant='temp';
taken       Celsius   
----------  ----------
734         -29.72

-- Concat strings using ||
sqlite> SELECT personal || ' ' || family as full_name FROM Person;
full_name   
------------
William Dyer

-- UNION combines the result of two queries (as appending rows)
sqlite> SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';

-- use instr(STRING, STRINGTOCHECKFOR) and substr(STRING, STARTINDEX, [LENGTH]) to find string in values. Note: index starts at 1
sqlite> SELECT site FROM Visited;
site      
----------
DR-1          
DR-3           
MSK-4     
sqlite> SELECT instr(site, '-') FROM Visited;
instr(site, '-')
----------------
3               
3                             
4     
sqlite> SELECT DISTINCT substr(site, 1, instr(site, '-') - 1) AS MajorSite FROM Visited;     
MajorSite 
----------
DR        
MSK

-- Databases use 'null' (not zero, False or an empty string) for missing values.
-- To check for null
sqlite> SELECT * FROM Visited WHERE dated IS NULL;
sqlite> SELECT * FROM Visited WHERE dated IS NOT NULL;
-- Can use null is WHERE
sqlite> SELECT * FROM Survey WHERE quant='sal' AND (person!='lake' OR person IS NULL);

-- Use min, max, avg, count [=len], sum as aggregation functions. Automatically ignores nulls.
sqlite> SELECT min(dated) FROM Visited;

-- may return a 'null' if you used a bad WHERE clause
sqlite> SELECT person, max(reading), sum(reading) FROM Survey WHERE quant='missing';
person      max(reading)  sum(reading)
----------  ------------  ------------
                                      
-- Write multiple lines as
sqlite> SELECT person, count(reading), round(avg(reading), 2)
        FROM Survey
        WHERE quant='rad';

-- To do the operation over multiple values (i.e. you might do a for loop over it) you can use GROUP BY
sqlite> SELECT person, count(reading), round(avg(reading), 2)
        FROM Survey
        WHERE quant='rad'
        GROUP BY person;
person      count(reading)  round(avg(reading), 2)
----------  --------------  ----------------------
dyer        2               8.81                  
lake        2               1.82                  
pb          3               6.66                  
roe         1               11.25 
-- You can GROUP BY over other columns as well
sqlite> SELECT   person, quant, count(reading), round(avg(reading), 2)
        FROM     Survey
        WHERE    person IS NOT NULL
        GROUP BY person, quant
        ORDER BY person, quant;

-- Write a simple database a do an average of it:
sqlite> SELECT AVG(a) FROM (
    SELECT 1 AS a
    UNION ALL SELECT NULL
    UNION ALL SELECT 5);

-- use group_concat to concat rows as a column
sqlite> SELECT group_concat(personal) FROM Person;
group_concat(personal)                
--------------------------------------
William,Frank,Anderson,Valentina,Frank

-- Use JOIN to join tables. But copies a lot multiple times (cross product).
sqlite> SELECT * FROM Site JOIN Visited;
-- Use smarter joining to keep certain pairs and you can drop the column which is duplicated
sqlite> SELECT Site.lat, Site.long, Visited.dated
        FROM   Site JOIN Visited
        ON     Site.name=Visited.site;
-- Join more than two tables on use ON to filter out combinations that don't make sense.

-- Databases contain primary keys and foreign keys e.g. Person.id is primary key (first column) and Survey.person is foreign key.

-- SQL has a 'rowid' to print the index
sqlite> SELECT rowid, * FROM Person;
rowid       id          personal    family    
----------  ----------  ----------  ----------
1           dyer        William     Dyer

-- See https://sql-joins.leopard.in.ua/ for visualizations of JOINs

-- Every value in a database should be atomic (it cannot be broken down further e.g. 'Ray' and 'Bell' instead of 'Ray Bell'.

-- Every record should have a unique primary key.

-- You can use EXTRACT (https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm) and STRFTIME (https://www.sqlite.org/lang_datefunc.html) to manipulate DATE values.

-- Create a table using CREATE TABLE and drop it using DROP TABLE.
sqlite> CREATE TABLE Site(name text, lat real, long real);
sqlite> DROP TABLE Site;

-- Note: data types. All are familiar except for BLOB (binary large object e.g. image).
-- 0 and 1 are Boolean.

-- It is best to provide more info when creating a table
sqlite> 
CREATE TABLE Survey(
    taken   integer not null, -- where reading taken
    person  text,             -- may not know who took it
    quant   real not null,    -- the quantity measured
    reading real not null,    -- the actual reading
    primary key(taken, quant),
    foreign key(taken) references Visited(id),
    foreign key(person) references Person(id)
);

-- To add data to the table use INSERT, UPDATE and DELETE.
-- To insert a row to Site
sqlite> INSERT INTO Site VALUES('DR-1', -49.85, -128.57);

-- Insert values into one table from another
sqlite> CREATE TABLE JustLatLong(lat text, long text);
sqlite> INSERT INTO JustLatLong SELECT lat, long FROM Site;

-- Update lat and long associated with a name
sqlite> UPDATE Site SET lat=-47.87, long=-122.40 WHERE name='MSK-4';
-- Update null with unknown
sqlite> UPDATE Survey SET person="unknown" WHERE person IS NULL;

-- Delete an id from Person
sqlite> DELETE FROM Person WHERE id = 'danforth';

-- Referential integrity (don't delete a value if it is a foreign key to another table).
-- Cascading delete if the idea to delete a value and the corresponding foreign keys.

-- LAST BIT ON SAVING RESULTS TO .CSV

sqlite> .exit

SQL and Python

Python has a module sqlite3 to allow access to a database. There is also other libraries here https://awesome-python.com/#database and a blog post here https://pynative.com/python-mysql-tutorial/. This has the advantage of saving data in a database as a python object e.g. list which you can do analysis and visualization with.

import sqlite3

connection = sqlite3.connect("survey.db")
cursor = connection.cursor() # Keep track where we are in the database
cursor.execute("SELECT Site.lat, Site.long FROM Site;") # The SQL command
results = cursor.fetchall() # Fetch all rows as a list. Each entry is a row.
# Loop over each tuple in the list
for r in results:
    print(r)
    # (-49.85, -128.57)
    # (-47.15, -126.72)
    # (-48.87, -123.4)
cursor.close()
connection.close()

You can create a function from queries using some python syntax. e.g. take a user's ID and return their name

import sqlite3

def get_name(database_file, person_id):
    query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query)
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

print("Full name for dyer:", get_name('survey.db', 'dyer'))
# Full name for dyer: William Dyer

However, users can do a SQL injection attack by adding something like 'DROP' into their query. You should instead use a '?' and input it in a list:

import sqlite3

def get_name(database_file, person_id):
    query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" # ? here. Expects one item

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query, [person_id]) # Note the person_id here
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

print("Full name for dyer:", get_name('survey.db', 'dyer'))
# Full name for dyer: William Dyer

You can insert values into a database but you have to add a connection.commit()

import sqlite3

def add_name(database_file, new_person):
    query = "INSERT INTO Person VALUES (?, ?, ?);"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query, list(new_person))
    cursor.close()
    connection.commit() # Note the commit here
    connection.close()


def get_name(database_file, person_id):
    query = "SELECT personal || ' ' || family FROM Person WHERE id=?;"

    connection = sqlite3.connect(database_file)
    cursor = connection.cursor()
    cursor.execute(query, [person_id])
    results = cursor.fetchall()
    cursor.close()
    connection.close()

    return results[0][0]

# Insert a new name
add_name('survey.db', ('barrett', 'Mary', 'Barrett'))
# Check it exists
print("Full name for barrett:", get_name('survey.db', 'barrett'))

You can create a database from scratch in python. For example, this will put 100,000 random values into a database

import sqlite3 
# import random number generator 
from numpy.random import uniform 

random_numbers = uniform(low=10.0, high=25.0, size=100000) 

connection = sqlite3.connect("original.db") 
cursor = connection.cursor() cursor.execute("CREATE TABLE Pressure (reading float not null)") 
query = "INSERT INTO Pressure values (?);" 

for number in random_numbers: 
    cursor.execute(query, [number]) cursor.close() 
# save changes to file for next exercise 
connection.commit() 
connection.close()