SQLite: Handling Large Structured Data
Overview
Storing your data in SQLite, format allows you to get benefits of database, and at the same time simple and storage of data in a file on a disk.
From developer's website: "SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through at least the year 2050."
Some use-cases
You think you need MySQL, PostreSQL, etc for your ML project. Usually you don't
You have to deal with hundreds of GB of table-stuctured data (or larger) and your script (for whatever reason) can't be made parallel.
You would request a lot of RAM and work with data slowly..... This would be a waste of RAM.
It is better in this case to request smaller amount of RAM and read data (efficiently) from disk - for example using SQLite
Benefits:
You are not limited by RAM any longer
comparing to other file formats SQLite is very good in selecting certain lines (especially if you use indexing)
you can use familiar dplyr syntax or execute SQL queries directly
dplyr is an interface for working with data in a database, not for modifying remote tables.
DBI package allows to both read and modify tables
SQLite is actually faster for common data analysis tasks than other popular databases.
You can have multiple threads accessing an SQLite database simultaneously (for read operations. Writing is more tricky)
Merging/Joining datasets on disk
Major benefits of SQLite comparing to MySQL (PostgreSQL, etc)
You control your own data (sqlite file). You don't depend on any service like MySQL
You can copy a file to your own laptop and work with it
Again, SQLite is faster!
Limits
SQLite has some limitations in terms of concurrency, which usually don't apply for typical ML/AI jobs.
Read more here.
Command line (CLI) example
Create environment
mkdir projects/sqlite-testcd projects/sqlite-testconda create -p ./cenvconda activate ./cenvconda install -y sqliteThen follow this SQLite example.
sqlite3 db_file.sqlitecreate table tbl1(one varchar(10), two smallint);insert into tbl1 values('hello!',10);insert into tbl1 values('goodbye', 20);select * from tbl1;Now Close session (Ctrl-D).
Reopen session to check if changes are saved
sqlite3 db_file.sqliteselect * from tbl1;R example
Install
Here we use conda, as a great way to keep everything isolated and reproducible.
Note: conda will install pre-compiled packages. Which is good (faster) and bad (not fully optimized for a specific hardware)
Alternative: install packages to a local directory or use renv as described here
mkdir /scratch/$USER/projects/myTempProjectcd /scratch/$USER/projects/myTempProjectmodule load anaconda3/2020.07
conda create -p ./cenv -c conda-forge r=4.1conda activate ./cenvconda install -c r r-rsqliteconda install -c r r-tidyverseconda install -c conda-forge r-remotesconda install -c r r-featherconda install -c r r-nycflights13
Note:
window functions (row_number in particular) require newer version of rsqlite
(https://github.com/r-dbi/RSQLite/issues/268)
Rremotes::install_github("r-dbi/RSQLite")## update ALLSave list of installed packages for reproducibility
## conda list --export > requirements.txtUse
Many examples can be found here:
dplyr syntax https://db.rstudio.com/dplyr/
SQL syntax https://db.rstudio.com/databases/sqlite/
Copy data frame to database (dplyr)
copy_to(con, nycflights13::flights, "fl", temporary=FALSE)Or copy data to database using DBI
dbCreateTable(con, "fl", nycflights13::flights, temporary = FALSE)dbAppendTable(con, "fl", nycflights13::flights)Connect to a specific table
dbListTables(con)df_con <- tbl(con, "fl")## check number of rowsdf_con %>% count()Subset
df_temp <- df_con %>% filter( row_number() %in% c(1, 3) ) %>% collectSave as feather
feather::write_feather(df_temp, paste0("file_", ind, ".feather"))Alternative: read csv file to SQLite directly
If you already have a large csv file on disk, and you don't want to read it to RAM, you can read it to SQLite file directly
UI for SQLite - SQLiteStudio
Once you have SQLite file, you can easily transfer it to your own laptop and explore it using SQLiteStudio, if you like to use UI instead of terminal