Large data

FAIR for large datasets

If your project has produced or collected large amounts of data, or you suspect it will, there are some extra elements to consider in order to make your data FAIR. 

We use 'large data' here to distinguish it from Big Data, which is data that requires new and novel approaches to processing and storage (e.g. NoSQL). While Big Data would also need extra consideration to be made FAIR, some of which is addressed below, we focus here on 'large data', meaning any dataset that is greater in size than normal (e.g. terabytes rather than gigabytes).

For general information on making your data FAIR, see the following:

Factors to consider include:

Too big to share?

If your dataset exceeds the available storage in your repository of choice, or if you feel that its size may limit its usefulness or reusability, you might consider:

Where only a sample or subset of the data is deposited, you should include a data availability statement indicating whether, where and how people can request access to the full dataset. More information on such statements can be found on the Library Research Data Management pages. You should also consider the practicalities of transferring the data if and when a data access request is received.
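One such practicality is verifying that files arrive intact after transfer. As a minimal sketch (in Python, with a hypothetical dataset directory and manifest name), you could generate a checksum manifest to share alongside the data, so a recipient can confirm that their copy matches yours:

import hashlib
from pathlib import Path

def sha256_checksum(path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so that very large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical dataset directory; write one checksum per file to a manifest
# that can be re-checked after the data has been transferred.
dataset_dir = Path("my_large_dataset")
with open("checksums.sha256", "w") as manifest:
    for file_path in sorted(dataset_dir.rglob("*")):
        if file_path.is_file():
            manifest.write(f"{sha256_checksum(file_path)}  {file_path}\n")

The manifest itself is small, so it can be deposited with the sample or subset of the data even when the full dataset is held elsewhere.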

The University of Sheffield's institutional repository, ORDA, initially allows up to 25GB of storage per user, but this can be increased on request up to 100GB via the deposit form, or to 100GB+ by contacting rdm@sheffield.ac.uk. There may also be subject-specific repositories in your discipline that can accommodate large datasets - see the Subject-specific repositories page for guidance on how to search.

Depositing large datasets in ORDA

As a general rule, if your dataset is too large to upload reasonably quickly, it is also too large to download conveniently in one go. Consider uploading the data as multiple zip files that can be downloaded individually, and make the relationship between them clear in both your README file and the 'description' field of the deposit. Try to avoid zip files within zip files, as this can be confusing.
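As an illustration of one way to do this, the sketch below (in Python) groups the files in a hypothetical dataset directory into numbered zip archives of roughly a chosen size; the directory name and the 4GB threshold are illustrative only and should be adjusted to suit your data and repository:

import zipfile
from pathlib import Path

# Hypothetical inputs: a directory of data files and a rough per-archive limit.
dataset_dir = Path("my_large_dataset")
max_bytes_per_zip = 4 * 1024**3  # ~4 GB of input data per zip

# Group files so that each archive stays under the limit (based on
# uncompressed size, which is a reasonable rough guide).
archives, current_files, current_size = [], [], 0
for file_path in sorted(dataset_dir.rglob("*")):
    if not file_path.is_file():
        continue
    size = file_path.stat().st_size
    if current_files and current_size + size > max_bytes_per_zip:
        archives.append(current_files)
        current_files, current_size = [], 0
    current_files.append(file_path)
    current_size += size
if current_files:
    archives.append(current_files)

# Write part_01.zip, part_02.zip, ... and report the contents of each,
# so the relationships can be described in the README and deposit description.
for i, files in enumerate(archives, start=1):
    with zipfile.ZipFile(f"part_{i:02d}.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for file_path in files:
            zf.write(file_path, arcname=file_path.relative_to(dataset_dir))
    print(f"part_{i:02d}.zip: {len(files)} files")

Listing which files ended up in which archive, as above, gives you the information needed for the README and the 'description' field.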

When depositing your dataset, you may find it useful to use the 'preview item' option (if available) to see how your files will appear when the record is published. If the result does not look user-friendly, you may wish to restructure your deposit.

Processing and anonymisation

Where large data has been processed to allow it to be shared, for example through automated anonymisation, it may not be possible to be completely sure that all personal or identifying data has been removed, particularly if any free-text fields were involved. Spot checks can give a good indication, but it is always advisable to be cautious in such situations. Again, sharing only the fully checked parts of your data (rather than the entirety) is a good approach, although it may not be possible to allow others to request access to the full dataset if you cannot be sure of its personal or sensitive status.
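As a minimal sketch of what such a spot check might look like, the Python snippet below scans a free-text column in a CSV file for a few simple patterns. The file name, column name and patterns are all hypothetical and far from exhaustive: a clean scan is not proof that the data is anonymous, only a quick way to flag obvious problems.

import csv
import re

# Illustrative patterns only: identifying information takes many more forms
# (names, addresses, IDs), so a clean scan does not guarantee anonymity.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?44|0)\s?\d{3,4}[\s-]?\d{3}[\s-]?\d{3,4}\b"),
}

def scan_free_text(csv_path: str, column: str) -> None:
    """Flag rows whose free-text column matches any of the patterns above."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            text = row.get(column, "") or ""
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    print(f"Row {row_number}: possible {label} in '{column}'")

# Hypothetical file and column names.
scan_free_text("survey_responses.csv", "comments")

Any rows flagged in this way would then need manual review, and the absence of flags should not be treated as confirmation that the data is safe to share.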

Associated costs and planning

Where there are costs associated with storing or depositing your (large) dataset, these should wherever possible be anticipated in advance and costed into your funding application. You can discuss your plans with the Library's Research Data Management team (rdm@sheffield.ac.uk) for guidance, and can seek advice from your faculty Research Hub.

Long-term storage for large datasets from Research IT

If you wish to store large datasets long-term on the University X: Drive, you should contact the Research IT team, who will work with you to establish a long-term solution. For example, the platforms team can create a shared long-term storage area for researchers where appropriate, provided that the data has an 'owner', i.e. an academic contact who is responsible for it.

More helpful information can also be found on the following pages: