August 2018
Firstly, as a data scientist you should have wide knowledge of many software languages. SQL (Structured Query Language) is used to handle data held in a 'relational database management system' (RDBMS). For various industries, data is stored in 'data warehouses' is the form of 'databases'. A database is collection of tables. SQL can be used to access and manipulate that data.
You can learn SQL as there are many versions (Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite) and some are open source. For example, my laptop has sqlite3 installed when I installed miniconda. It is worth installing http://sqlitebrowser.org/ to help visualize the database.
We will use the software carpentry lesson on SQL http://swcarpentry.github.io/sql-novice-survey/ which provides a database contains information on William Dyer, Frank Pabodie and Valentina Roerich's expedition to the South Pole.
Open the app. Click 'Open Database'. It shows there are 4 Tables (Person, Site, Survey and Visited). Once you've navigated to a table click the 'Browse Data' tab. This looks like:
The full database looks like:
Open the database
$ sqlite3 survey.dbsqlite> .tables -- Shows the tables in the databasesqlite> .help -- For a list of commands-- Change display settingssqlite> .mode column -- Left aligned columns. Can replace column with other optionssqlite> .header on -- Display column headers-- Query two columns in one of the tablessqlite> SELECT family, personal FROM Person; -- ; means run. It's case insensitive.Dyer William Pabodie Frank Lake Anderson Roerich Valentina Danforth Frank-- Select all columns using *sqlite> SELECT * FROM Person;id personal family ---------- ---------- ----------dyer William Dyer pb Frank Pabodie lake Anderson Lake roe Valentina Roerich danforth Frank Danforth-- Use DISTINCT to return unique valuessqlite> SELECT DISTINCT quant FROM Survey;quant ----------rad sal temp-- Can use to look for unique pairs (or more) of valuessqlite> --SELECT DISTINCT taken, quant FROM Survey;-- Databases may not be stored in any particular order. You can use ORDER BY and select a column to order by the column alphabetically or numericsqlite> SELECT * FROM Person ORDER BY id;id personal family ---------- ---------- ----------danforth Frank Danforth dyer William Dyer lake Anderson Lake pb Frank Pabodie roe Valentina Roerich -- Use DESC to order by descendingsqlite> SELECT * FROM Person ORDER BY id DESC;id personal family ---------- ---------- ----------roe Valentina Roerich pb Frank Pabodie lake Anderson Lake dyer William Dyer danforth Frank Danforth -- You can order columns separately if they share the same index-- ORDER BY taken then for similar taken values ORDER BY person in DESCsqlite> SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;-- Filter the data using a WHERE command-- Find data at a particular sitesqlite> SELECT * FROM Visited WHERE site='DR-1';id site dated ---------- ---------- ----------619 DR-1 1927-02-08622 DR-1 1927-02-10844 DR-1 1932-03-22-- Build WHERE commands using AND and the < operatorsqlite> SELECT * FROM Visited WHERE site='DR-1' AND dated<'1930-01-01';id site dated ---------- ---------- ----------619 DR-1 1927-02-08622 DR-1 1927-02-10sqlite> SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);-- and the or operatorsqlite> SELECT * FROM Survey WHERE person='lake' OR person='roe';-- You can use the operators to test for values outside a range e.g. salinity should be between 0 and 1sqlite> SELECT * FROM Survey WHERE quant='sal' AND ((reading > 1.0) OR (reading < 0.0));-- A note on dates: ISO-8601 “YYYY-MM-DD HH:MM:SS.SSSS, days since November 24, 4714 BC or seconds since January 1, 1970-- Instead of using multiple WHERE commands you can use IN and specify a setsqlite> SELECT * FROM Survey WHERE person IN ('lake', 'roe');-- Use the AND with OR but being careful with the paranthesissqlite> SELECT * FROM Survey WHERE quant='sal' AND (person='lake' OR person='roe');-- use DISTINCT with WHEREsqlite> SELECT DISTINCT person, quant FROM Survey WHERE person='lake' OR person='roe';-- Use LIKE and % as a wildcard to find stringssqlite> SELECT * FROM Visited WHERE site LIKE 'DR%';id site dated ---------- ---------- ----------619 DR-1 1927-02-08622 DR-1 1927-02-10734 DR-3 1930-01-07735 DR-3 1930-01-12751 DR-3 1930-02-26752 DR-3 844 DR-1 1932-03-22-- Note: On testing: test on a subset of data (put it into a database) or create a random database and test on that.-- Modify values in the queury usingsqlite> SELECT 1.05 * reading FROM Survey WHERE quant='rad';-- To convert Fahrenheit to Celsius and round to two decimal placessqlite> SELECT taken, round(5*(reading-32)/9, 2) FROM Survey WHERE quant='temp';taken round(5*(reading-32)/9, 2)---------- --------------------------734 -29.72-- Rename the header using assqlite> SELECT taken, round(5*(reading-32)/9, 2) as Celsius FROM Survey WHERE quant='temp';taken Celsius ---------- ----------734 -29.72-- Concat strings using ||sqlite> SELECT personal || ' ' || family as full_name FROM Person;full_name ------------William Dyer-- UNION combines the result of two queries (as appending rows)sqlite> SELECT * FROM Person WHERE id='dyer' UNION SELECT * FROM Person WHERE id='roe';-- use instr(STRING, STRINGTOCHECKFOR) and substr(STRING, STARTINDEX, [LENGTH]) to find string in values. Note: index starts at 1sqlite> SELECT site FROM Visited;site ----------DR-1 DR-3 MSK-4 sqlite> SELECT instr(site, '-') FROM Visited;instr(site, '-')----------------3 3 4 sqlite> SELECT DISTINCT substr(site, 1, instr(site, '-') - 1) AS MajorSite FROM Visited; MajorSite ----------DR MSK-- Databases use 'null' (not zero, False or an empty string) for missing values.-- To check for nullsqlite> SELECT * FROM Visited WHERE dated IS NULL;sqlite> SELECT * FROM Visited WHERE dated IS NOT NULL;-- Can use null is WHEREsqlite> SELECT * FROM Survey WHERE quant='sal' AND (person!='lake' OR person IS NULL);-- Use min, max, avg, count [=len], sum as aggregation functions. Automatically ignores nulls.sqlite> SELECT min(dated) FROM Visited;-- may return a 'null' if you used a bad WHERE clausesqlite> SELECT person, max(reading), sum(reading) FROM Survey WHERE quant='missing';person max(reading) sum(reading)---------- ------------ ------------ -- Write multiple lines assqlite> SELECT person, count(reading), round(avg(reading), 2) FROM Survey WHERE quant='rad';-- To do the operation over multiple values (i.e. you might do a for loop over it) you can use GROUP BYsqlite> SELECT person, count(reading), round(avg(reading), 2) FROM Survey WHERE quant='rad' GROUP BY person;person count(reading) round(avg(reading), 2)---------- -------------- ----------------------dyer 2 8.81 lake 2 1.82 pb 3 6.66 roe 1 11.25 -- You can GROUP BY over other columns as wellsqlite> SELECT person, quant, count(reading), round(avg(reading), 2) FROM Survey WHERE person IS NOT NULL GROUP BY person, quant ORDER BY person, quant;-- Write a simple database a do an average of it:sqlite> SELECT AVG(a) FROM ( SELECT 1 AS a UNION ALL SELECT NULL UNION ALL SELECT 5);-- use group_concat to concat rows as a columnsqlite> SELECT group_concat(personal) FROM Person;group_concat(personal) --------------------------------------William,Frank,Anderson,Valentina,Frank-- Use JOIN to join tables. But copies a lot multiple times (cross product).sqlite> SELECT * FROM Site JOIN Visited;-- Use smarter joining to keep certain pairs and you can drop the column which is duplicatedsqlite> SELECT Site.lat, Site.long, Visited.dated FROM Site JOIN Visited ON Site.name=Visited.site;-- Join more than two tables on use ON to filter out combinations that don't make sense.-- Databases contain primary keys and foreign keys e.g. Person.id is primary key (first column) and Survey.person is foreign key.-- SQL has a 'rowid' to print the indexsqlite> SELECT rowid, * FROM Person;rowid id personal family ---------- ---------- ---------- ----------1 dyer William Dyer-- See https://sql-joins.leopard.in.ua/ for visualizations of JOINs-- Every value in a database should be atomic (it cannot be broken down further e.g. 'Ray' and 'Bell' instead of 'Ray Bell'.-- Every record should have a unique primary key.-- You can use EXTRACT (https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm) and STRFTIME (https://www.sqlite.org/lang_datefunc.html) to manipulate DATE values.-- Create a table using CREATE TABLE and drop it using DROP TABLE.sqlite> CREATE TABLE Site(name text, lat real, long real);sqlite> DROP TABLE Site;-- Note: data types. All are familiar except for BLOB (binary large object e.g. image).-- 0 and 1 are Boolean.-- It is best to provide more info when creating a tablesqlite> CREATE TABLE Survey( taken integer not null, -- where reading taken person text, -- may not know who took it quant real not null, -- the quantity measured reading real not null, -- the actual reading primary key(taken, quant), foreign key(taken) references Visited(id), foreign key(person) references Person(id));-- To add data to the table use INSERT, UPDATE and DELETE.-- To insert a row to Sitesqlite> INSERT INTO Site VALUES('DR-1', -49.85, -128.57);-- Insert values into one table from anothersqlite> CREATE TABLE JustLatLong(lat text, long text);sqlite> INSERT INTO JustLatLong SELECT lat, long FROM Site;-- Update lat and long associated with a namesqlite> UPDATE Site SET lat=-47.87, long=-122.40 WHERE name='MSK-4';-- Update null with unknownsqlite> UPDATE Survey SET person="unknown" WHERE person IS NULL;-- Delete an id from Personsqlite> DELETE FROM Person WHERE id = 'danforth';-- Referential integrity (don't delete a value if it is a foreign key to another table).-- Cascading delete if the idea to delete a value and the corresponding foreign keys.-- LAST BIT ON SAVING RESULTS TO .CSVsqlite> .exitPython has a module sqlite3 to allow access to a database. There is also other libraries here https://awesome-python.com/#database and a blog post here https://pynative.com/python-mysql-tutorial/. This has the advantage of saving data in a database as a python object e.g. list which you can do analysis and visualization with.
import sqlite3connection = sqlite3.connect("survey.db")cursor = connection.cursor() # Keep track where we are in the databasecursor.execute("SELECT Site.lat, Site.long FROM Site;") # The SQL commandresults = cursor.fetchall() # Fetch all rows as a list. Each entry is a row.# Loop over each tuple in the listfor r in results: print(r) # (-49.85, -128.57) # (-47.15, -126.72) # (-48.87, -123.4)cursor.close()connection.close()You can create a function from queries using some python syntax. e.g. take a user's ID and return their name
import sqlite3def get_name(database_file, person_id): query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';" connection = sqlite3.connect(database_file) cursor = connection.cursor() cursor.execute(query) results = cursor.fetchall() cursor.close() connection.close() return results[0][0]print("Full name for dyer:", get_name('survey.db', 'dyer'))# Full name for dyer: William DyerHowever, users can do a SQL injection attack by adding something like 'DROP' into their query. You should instead use a '?' and input it in a list:
import sqlite3def get_name(database_file, person_id): query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" # ? here. Expects one item connection = sqlite3.connect(database_file) cursor = connection.cursor() cursor.execute(query, [person_id]) # Note the person_id here results = cursor.fetchall() cursor.close() connection.close() return results[0][0]print("Full name for dyer:", get_name('survey.db', 'dyer'))# Full name for dyer: William DyerYou can insert values into a database but you have to add a connection.commit()
import sqlite3def add_name(database_file, new_person): query = "INSERT INTO Person VALUES (?, ?, ?);" connection = sqlite3.connect(database_file) cursor = connection.cursor() cursor.execute(query, list(new_person)) cursor.close() connection.commit() # Note the commit here connection.close()def get_name(database_file, person_id): query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" connection = sqlite3.connect(database_file) cursor = connection.cursor() cursor.execute(query, [person_id]) results = cursor.fetchall() cursor.close() connection.close() return results[0][0]# Insert a new nameadd_name('survey.db', ('barrett', 'Mary', 'Barrett'))# Check it existsprint("Full name for barrett:", get_name('survey.db', 'barrett'))You can create a database from scratch in python. For example, this will put 100,000 random values into a database
import sqlite3 # import random number generator from numpy.random import uniform random_numbers = uniform(low=10.0, high=25.0, size=100000) connection = sqlite3.connect("original.db") cursor = connection.cursor() cursor.execute("CREATE TABLE Pressure (reading float not null)") query = "INSERT INTO Pressure values (?);" for number in random_numbers: cursor.execute(query, [number]) cursor.close() # save changes to file for next exercise connection.commit() connection.close()