SQLite: Handling Large Structured Data

Overview

Storing your data in SQLite, format allows you to get benefits of database, and at the same time simple and storage of data in a file on a disk.

From developer's website: "SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through at least the year 2050."

Some use-cases

Benefits:

Major benefits of SQLite comparing to MySQL (PostgreSQL, etc)

Limits

Command line (CLI) example

Create environment

mkdir projects/sqlite-testcd projects/sqlite-testconda create -p ./cenvconda activate ./cenvconda install -y sqlite

Then follow this SQLite example.

sqlite3 db_file.sqlitecreate table tbl1(one varchar(10), two smallint);insert into tbl1 values('hello!',10);insert into tbl1 values('goodbye', 20);select * from tbl1;

Now Close session (Ctrl-D).

Reopen session to check if changes are saved

sqlite3 db_file.sqliteselect * from tbl1;

R example

Install

Here we use conda, as a great way to keep everything isolated and reproducible.

Note: conda will install pre-compiled packages. Which is good (faster) and bad (not fully optimized for a specific hardware)

Alternative: install packages to a local directory or use renv as described here

mkdir /scratch/$USER/projects/myTempProjectcd  /scratch/$USER/projects/myTempProject
module load anaconda3/2020.07
conda create -p ./cenv -c conda-forge r=4.1conda activate ./cenvconda install -c r r-rsqliteconda install -c r r-tidyverseconda install -c conda-forge r-remotesconda install -c r r-featherconda install -c r r-nycflights13

Note:

window functions (row_number in particular) require newer version of rsqlite

(https://github.com/r-dbi/RSQLite/issues/268)

Rremotes::install_github("r-dbi/RSQLite")## update ALL

Save list of installed packages for reproducibility

## conda list --export > requirements.txt

Use

Many examples can be found here:

library(tidyverse)library(DBI)# Create RSQLite database file with name "allData"con <- dbConnect(RSQLite::SQLite(), "allData")

Copy data frame to database (dplyr)

copy_to(con, nycflights13::flights, "fl", temporary=FALSE)

Or copy data to database using DBI

dbCreateTable(con, "fl", nycflights13::flights, temporary = FALSE)dbAppendTable(con, "fl", nycflights13::flights)

Connect to a specific table

dbListTables(con)df_con <- tbl(con, "fl")## check number of rowsdf_con %>% count()

Subset

df_temp <- df_con %>% filter( row_number() %in%  c(1, 3) ) %>% collect

Save as feather

feather::write_feather(df_temp, paste0("file_", ind, ".feather"))

Alternative: read csv file to SQLite directly

If you already have a large csv file on disk, and you don't want to read it to RAM, you can read it to SQLite file directly

conda install -c conda-forge r-sqldf Rlibrary(sqldf)## create data filesqldf("attach allData as new")## read file directly from csv to sqliteread.csv.sql(file = "test.tab", sql = "create table states_data as select * from file", dbname = "allData")

UI for SQLite - SQLiteStudio

Once you have SQLite file, you can easily transfer it to your own laptop and explore it using SQLiteStudio, if you like to use UI instead of terminal