April 2020
Please configure your cluster to use Databricks Runtime version 6.2 which includes:
%sqlSELECT * FROM People10M%sqlDESCRIBE People10Mwhich women were born after 1990?
%sqlSELECT firstName, middleName, lastName, birthDateFROM People10MWHERE year(birthDate) > 1990 AND gender = 'F'Spark provides a number of built-in functions, many of which can be used directly from SQL. These functions can be used in the WHERE expressions to filter data and in SELECT expressions to create derived columns.
The following SQL statement finds women born after 1990; it uses the year function, and it creates a birthYear column on the fly.
%sqlSELECT firstName, middleName, lastName, year(birthDate) as birthYear, salary FROM People10MWHERE year(birthDate) > 1990 AND gender = 'F'How many women were named Mary in each year?
%sqlSELECT year(birthDate) as birthYear, count(*) AS totalFROM People10MWHERE firstName = 'Mary' AND gender = 'F'GROUP BY birthYearORDER BY birthYearCompare popularity of two names from 1990
%sqlSELECT year(birthDate) as birthYear, firstName, count(*) AS totalFROM People10MWHERE (firstName = 'Dorothy' or firstName = 'Donna') AND gender = 'F' AND year(birthDate) > 1990GROUP BY birthYear, firstNameORDER BY birthYear, firstNameTemporary views assign a name to a query that will be reused as if they were tables themselves.
%sqlCREATE OR REPLACE TEMPORARY VIEW TheDonnas AS SELECT * FROM People10M WHERE firstName = 'Donna'
%sqlSELECT * FROM TheDonnasCreate more complex query from People10M table
%sqlCREATE OR REPLACE TEMPORARY VIEW WomenBornAfter1990 AS SELECT firstName, middleName, lastName, year(birthDate) AS birthYear, salary FROM People10M WHERE year(birthDate) > 1990 AND gender = 'F'
%sqlSELECT birthYear, count(*) FROM WomenBornAfter1990 WHERE firstName = 'Mary' GROUP BY birthYear ORDER BY birthYear Create a temporary view called Top10FemaleFirstNames that contains the 10 most common female first names in the People10M table. The view must have two columns:
firstName - the first nametotal - the total number of rows with that first name%sqlCREATE OR REPLACE TEMPORARY VIEW Top10FemaleFirstNames AS SELECT firstName, COUNT(firstName) AS total FROM People10M WHERE gender = 'F' GROUP BY firstName ORDER BY total DESC, firstName LIMIT 10
%sqlSELECT * FROM Top10FemaleFirstNames https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
Get average salary
%sqlSELECT avg(salary) AS averageSalary FROM People10MConvert that value to an integer using the SQL round() function:
%sqlSELECT round(avg(salary)) AS averageSalary FROM People10Madd in max and min
%sqlSELECT max(salary) AS max, min(salary) AS min, round(avg(salary)) AS average FROM People10MHow many of the first names appear in Social Security data files?
%sqlSELECT * FROM SSANamesHow many distinct names there are in each of our tables
%sqlSELECT count(DISTINCT firstName) FROM People10M
%sqlSELECT count(DISTINCT firstName)FROM SSANamesIntroduce two more temporary views, each one consisting of distinct names, the join will be easier to read/write
%sqlCREATE OR REPLACE TEMPORARY VIEW SSADistinctNames AS SELECT DISTINCT firstName AS ssaFirstName FROM SSANames;CREATE OR REPLACE TEMPORARY VIEW PeopleDistinctNames AS SELECT DISTINCT firstName FROM People10MJoin the two tables together to get the answer
%sqlSELECT firstName FROM PeopleDistinctNames INNER JOIN SSADistinctNames ON firstName = ssaFirstNameHow many are there?
%sqlSELECT count(*) FROM PeopleDistinctNames INNER JOIN SSADistinctNames ON firstName = ssaFirstName%sqlSELECT count(firstName) FROM PeopleDistinctNamesWHERE firstName IN ( SELECT ssaFirstName FROM SSADistinctNames)Some of the salaries in the People10M table are negative. Convert all the negative salaries to positive ones, and then sort the top 20 people by their salary
Create a temporary view called PeopleWithFixedSalaries, where all the negative salaries have been converted to positive numbers.
%sqlDROP TABLE IF EXISTS PeopleWithFixedSalaries;CREATE OR REPLACE TEMPORARY VIEW PeopleWithFixedSalaries AS SELECT firstName, middleName, lastName, gender, birthDate, ssn, abs(salary) AS salary FROM People10MCreate another view called PeopleWithFixedSalariesSorted where:
salary in ascending order%sqlDROP TABLE IF EXISTS PeopleWithFixedSalariesSorted;CREATE OR REPLACE TEMPORARY VIEW PeopleWithFixedSalariesSorted AS SELECT * FROM PeopleWithFixedSalaries ORDER BY salary LIMIT 20As a refinement, assume that all salaries under $20,000 represent bad rows and filter them out.
Additionally, categorize each person's salary into $10K groups.
Create a temporary view called PeopleWithFixedSalaries20K where:
PeopleWithFixedSalariessalary10k, that should be the salary in groups of 10,000. For example:%sqlDROP TABLE IF EXISTS PeopleWithFixedSalaries20K;CREATE OR REPLACE TEMPORARY VIEW PeopleWithFixedSalaries20K AS SELECT *, round(salary / 10000) AS salary10k FROM PeopleWithFixedSalaries WHERE salary >= 20000Using the People10M table, count the number of females named Caren who were born before March 1980.
Starting with the table People10M, create a temporary view called Carens where:
totalgender)firstName)birthDate)%sqlDROP TABLE IF EXISTS Carens;CREATE TEMPORARY VIEW Carens AS SELECT count(*) AS total FROM People10M WHERE birthDate < '1980-03-01' AND firstName = 'Caren' AND gender = 'F'Use the SSANames table to find the most popular first name for girls in 1885, 1915, 1945, 1975, and 2005.
Create a temporary view called HistoricNames where:
HistoricNames is created using a single SQL query.firstNameyeartotal%sqlCREATE OR REPLACE TEMPORARY VIEW HistoricNames ASSELECT firstName, year, totalFROM SSANamesNATURAL INNER JOIN ( SELECT year, gender, max(total) AS total FROM SSANames GROUP BY year, gender) AS max_namesWHERE gender='F' AND year IN (1885, 1915, 1945, 1975, 2005)ORDER BY yearThe Databricks File System (DBFS) is the built-in, Azure-blob-backed, alternative to the Hadoop Distributed File System (HDFS).
Creating a table from an existing file in DBFS allows you to access the file as if it were a Spark table. It does not copy any data.
Create a table from the ip-geocode.parquet file
Create a table from an existing DBFS file with a simple SQL CREATE TABLE statement
%sqlCREATE DATABASE IF NOT EXISTS databricks;USE databricks;CREATE TABLE IF NOT EXISTS IPGeocode USING parquet OPTIONS ( path "dbfs:/mnt/training/ip-geocode.parquet" )It is often better to use a "personal" database
spark.sql(f"USE {databaseName}")One common format is CSV (comma-separated-values) for which you can specify:
Look at the first couple of lines of a file
%fs head /mnt/training/bikeSharing/data-001/day.csv --maxBytes=492Spark can create a table from that CSV file, as well.
As you can see above:
%sqlCREATE TABLE IF NOT EXISTS BikeSharingDay USING csv OPTIONS ( path "/mnt/training/bikeSharing/data-001/day.csv", inferSchema "true", header "true" )
%sqlSELECT * FROM BikeSharingDay
%sqlSELECT * FROM BikeSharingDayDownload a file
Select Data from the sidebar, and click the junk database
Select the + icon to create a new table
Click the Create Table with UI button
In the drop-down dialog, select a cluster
Click the Preview Table button
Another dialog will drop down. Choose the junk database
Select the First row is header checkbox
Click the Create Table button
Drop the table to ensure other users don't have a name conflict when uploading their tables
%sqlDROP TABLE IF EXISTS state_income