Large Number of Small Files

Motivation

Many datasets contain a large number of small files (for example, ImageNet contains about 14 million images, roughly 150 GB in total). How do you deal with such data? How do you store it? How do you use it in computations? Long-term storage is not an issue - an archive like tar.gz handles this well. However, when you want to use the data in computations, performance may depend heavily on how you handle the data on disk.

Here are some ideas you can try; evaluate the performance for your own project.

Squash file system to be used with Singularity

Please read here

Use jpg/png files on disk

One option is to store the image files (png, jpg, etc.) on disk and read them from disk directly.

An issue with this approach is that a Linux file system can hold only a limited number of files (inodes).

# One can log in to the Greene cluster and run the following command
$ df -ih /scratch
Filesystem                               Inodes IUsed IFree IUse% Mounted on
10.0.0.40@o2ib:10.0.0.41@o2ib:/scratch1   1.6G  209M  1.4G   14% /scratch

This shows that the total number of files '/scratch' can currently hold is about 1.6 G (roughly 1.6 billion inodes). That looks like a large number, but translate it into datasets the size of ImageNet (14 million files each): 100 such datasets amount to 100 × 14 × 10^6 = 1.4 × 10^9 files, which would almost completely fill the available slots. This is a problem!

And even if you can ignore this on your own PC, on HPC there is a per-user limit on the number of files in /scratch to prevent exactly this problem.

This is the reason why you simply cannot extract all those files into '/scratch'.

SLURM_TMPDIR

Another option is to start a SLURM job and extract everything into $SLURM_TMPDIR. This works, but requires un-tarring the archive every time you run a SLURM job.
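A minimal sketch of this pattern in Python, assuming the dataset is packed into a single archive (the path my_dataset.tar.gz below is hypothetical); the same code works for $SLURM_RAM_TMPDIR by swapping the environment variable:

import os
import tarfile

# Node-local temporary directory that SLURM provides for this job
tmpdir = os.environ["SLURM_TMPDIR"]

# Un-tar once at the start of the job, then read individual files
# from fast local storage instead of /scratch
with tarfile.open("/scratch/my_dataset.tar.gz") as tar:  # hypothetical archive path
    tar.extractall(path=tmpdir)

data_dir = os.path.join(tmpdir, "my_dataset")  # assumes the archive contains a top-level folder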

SLURM_RAM_TMPDIR

You can also use a custom RAM-mapped disk by requesting #SLURM_RAM_TMPDIR when submitting the job. In this case, when the job starts you first un-tar your files into $SLURM_RAM_TMPDIR (the same extraction pattern shown above) and then read from there. This effectively requires about 2 × (size of the data) of RAM just to hold the data.

Binary files (pickle, etc)

Store the data in a binary file (say, a pickle in Python) which you load fully into RAM when your SLURM job starts.

This option may require a lot of RAM, so you may wait a long time for the scheduler to find resources for your job. It also would not work on a regular PC without that much RAM, so your scripts are not transferable.
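A minimal sketch of this approach in Python, assuming the whole dataset fits into RAM (the file name data.pkl is hypothetical):

import pickle

# Done once: pack the whole dataset into a single binary file
dataset = {"labels": [0, 1, 1], "values": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}
with open("data.pkl", "wb") as f:
    pickle.dump(dataset, f)

# At the start of the SLURM job: one read loads everything back into RAM
with open("data.pkl", "rb") as f:
    dataset = pickle.load(f)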

Container files, one-file databases

Special container files allow you either to load the data fully and quickly, or to access chosen elements without loading the whole dataset into RAM.

SQLite

If you have structured data, a good option is SQLite. Please refer to this page for more information.
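A minimal sketch with Python's built-in sqlite3 module (the table layout and file names are hypothetical):

import sqlite3

# A single file on disk holds the whole database; no server process is needed
con = sqlite3.connect("dataset.db")
con.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER PRIMARY KEY, label INTEGER, image BLOB)")

# Store one sample: the raw bytes of an image file plus its label
with open("img_0.jpg", "rb") as f:
    con.execute("INSERT INTO samples (label, image) VALUES (?, ?)", (3, f.read()))
con.commit()

# Fetch a single sample by id without reading the rest of the database
row = con.execute("SELECT label, image FROM samples WHERE id = ?", (1,)).fetchone()
con.close()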

HDF5

One can think of an HDF5 file as a "container file" (a database of a sort) which holds a lot of objects inside.

HDF5 files do not have a file size limitation and can hold a huge number of objects inside, providing fast read/write access to those objects.

It is easy to subset the data and load into RAM only those data objects that you need.
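A minimal sketch with the h5py package (assuming it is installed; file and dataset names are hypothetical):

import h5py
import numpy as np

# Write: one HDF5 file holds many named datasets ("objects")
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("images", data=np.zeros((1000, 64, 64, 3), dtype="uint8"))
    f.create_dataset("labels", data=np.arange(1000))

# Read: slicing loads only the requested subset into RAM
with h5py.File("dataset.h5", "r") as f:
    images = f["images"][0:32]   # only 32 images are read from disk
    labels = f["labels"][0:32]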

This format is used extensively in academia and industry across the world - look here.

More info:

HDF5 supports reading and writing in parallel, so several nodes can read from the same file.

More info: https://support.hdfgroup.org/HDF5/PHDF5/

LMDB

LMDB (Lightning Memory-Mapped Database) is a light-weight, high-speed embedded database for key-value data.

Essentially, this is a large file sitting on the disk that contains a lot of smaller objects inside.

This is a memory-mapped database, meaning the file can be larger than RAM. The OS is responsible for managing the pages (for example, caching frequently used pages).

In practical terms: say you have 10 GB of RAM and an LMDB file of 100 GB. When you connect to this file, the OS may decide to load 5 GB into RAM, while the remaining 95 GB is attached as virtual memory. PRINCE does not have a limit on virtual memory. Of course, if your RAM is larger than the LMDB file, the database will perform best, because the OS has enough resources to keep everything needed directly in RAM.

Note: when you write key-value pairs to LMDB, they have to be byte-encoded. For example, in Python you can use 'st.encode()' for a string, 'ar.tobytes()' for a np.array, or 'pickle.dumps()' in general.
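A minimal sketch with the lmdb Python package (assuming it is installed; the map_size and file name are arbitrary choices for illustration):

import pickle
import lmdb
import numpy as np

# map_size is the maximum size the database is allowed to grow to (here 1 GB)
env = lmdb.open("dataset.lmdb", map_size=1 << 30)

# Write: keys and values must both be bytes
arr = np.arange(10, dtype="float32")
with env.begin(write=True) as txn:
    txn.put("sample-0".encode(), arr.tobytes())        # raw numpy bytes
    txn.put("meta".encode(), pickle.dumps({"n": 10}))  # arbitrary Python object

# Read: only the requested values are pulled into memory
with env.begin() as txn:
    arr_back = np.frombuffer(txn.get("sample-0".encode()), dtype="float32")
    meta = pickle.loads(txn.get("meta".encode()))
env.close()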

A problem with a large number of entries: LMDB uses a B-tree, which has O(log n) complexity for search.

Thus, when the number of elements in LMDB becomes really big, looking up a specific element slows down considerably.

More info:

LMDB supports reading from the same file by many readers and many parallel threads.

Formats inside HDF5/LMDB: binary, numpy, other..

One can store data in different ways inside LMDB or HDF5. For example, we can store the binary representation of a jpeg, or we can store a Python numpy array. In the first case the file can be read from any language; in the second, only from Python. We can also store objects from other languages - for example, a tibble in R. A sketch of both options follows.
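A minimal sketch of the two options using LMDB (file names are hypothetical; the same idea applies to HDF5): raw jpeg bytes can be decoded from any language, while a pickled numpy array can only be unpickled from Python.

import pickle
import lmdb
import numpy as np

env = lmdb.open("images.lmdb", map_size=1 << 30)
with env.begin(write=True) as txn:
    # Option 1: store the raw jpeg bytes - readable from any language with a jpeg decoder
    with open("img_0.jpg", "rb") as f:
        txn.put(b"img_0_jpeg", f.read())

    # Option 2: store a pickled numpy array - readable only from Python
    decoded = np.zeros((64, 64, 3), dtype="uint8")  # stand-in for the decoded image
    txn.put(b"img_0_array", pickle.dumps(decoded))
env.close()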

Other formats

There are other formats, such as Bcolz and Zarr. Some examples can be found here.

Example Code