Data Sets
Kaggle Data Sets
Kaggle is a website for data science competitions that hosts thousands of data sets and code examples. It is a great resource for those starting out with data science. Examples of data sets that may be of interest to students in EEPS-DATA 1340:
Additional data sets are available at Kaggle Data Sets.
UCI Machine Learning Repository
Collection of data sets [link] used in published machine learning papers. Some examples of interest:
Brown University Library
Open Data Resources: includes data in the categories "Earth and Geoscience" and "Environment."
Academic Data Sets
CloudCast: A large-scale dataset and baseline for forecasting clouds
WildfireDB: A Spatio-Temporal Dataset Combining Wildfire Occurrence with Relevant Covariates
FOWD: Free Ocean Wave Dataset for Data Mining and Machine Learning
Harvard Dataverse: a repository for research data
Other Data Sets or Repositories
ClimateNet Datasets (NERSC)
Google Dataset Search: search engine for data
Google Earth Engine Data Sets: a planetary-scale platform for Earth Science data and analysis
Registry of Open Data on AWS: open geospatial data
National Centers for Environmental Information - Data Access: climate and weather data (NOAA)
Paleoclimatology, Marine/Ocean, Model data, etc.
xBD Dataset: annotated high-resolution satellite imagery for building damage assessment
iWildCam 2020 Competition: camera trap images
DeepWeeds: a multi-class weed species image classification data set
xView: aerial imagery
AI for Good Foundation (data available for some projects)
TensorFlow Datasets: ready-to-use datasets for machine learning in Python (e.g. BigEarthNet, EuroSAT)
Earth System Science Data: data publishing journal
Data.gov: US government's Open Data
COVID + Atmosphere data sets (Caltech)
Gridded Climate Data Sets (Physical Sciences Laboratory, NOAA)
Added for Spring 2022
Weather4Cast: Multi-sensor Weather Forecast Competition
HumAID: human-annotated disaster incidents data from Twitter [paper] [dataset]
CrisisBench: crisis-related social media datasets for humanitarian information processing [paper] [dataset]
WILDS datasets: distribution shift in the wild
CropHarvest: open source remote sensing dataset for agriculture
Radiant MLHub: open library for Earth observation machine learning
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models
EaDAR Lab vertical land motion datasets
SECOORA Data Portal: Centralized access to Southeast U.S. coastal and ocean data
EuroCrops: A Pan-European Dataset for Time Series Crop Type Classification [arXiv]
ACM Energy Systems and Informatics: Resources - Datasets
OceanOPS: Global Ocean Observing Systems Coordination framework
SSL4EO-S12: Multi-modal, multi-temporal dataset for unsupervised/self-supervised pre-training in Earth observation
Data Underground: Subsurface/geophysical datasets
TALLO: a global tree allometry and crown architecture database
Caltech Fish Counting Dataset
Added in 2023 (for EEPS-DATA 1720 or 1340)
ReaLSAT, A global dataset of reservoir and lake surface area variations
Dynamic World, Near real-time global 10m land use land cover mapping
ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections
NADBenchmarks: a compilation of Benchmark Datasets for ML Tasks related to Natural Disasters
NICFI Satellite Data Program: access to Planet's high-resolution, analysis-ready mosaics of the world’s tropics.
CaFFE: a benchmark dataset and methodology for automatic glacier calving front extraction from SAR imagery.
WorldStrat: open high-resolution satellite imagery with application to super-resolution
ENS-10: a dataset for post-processing ensemble weather forecasts
Change Event dataset: for discovery from spatio-temporal remote sensing imagery
PDEBench: an extensive benchmark for scientific machine learning
xView3-SAR: Detecting dark fishing activity using SAR imagery
Caravan: a global community dataset for large-sample hydrology
DoriaNet: a visual dataset from Hurricane Dorian for post-disaster building damage assessment
CSU Synthetic Attribution Benchmark Dataset
FathomNet Challenge: Out-of-sample detection in the deep ocean (Kaggle competition)
Whales from Space Dataset: an annotated satellite image dataset of whales for training ML models
DeepFish: a realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis
MLAAPDE: A Machine Learning Dataset for Determining Global Earthquake Source Parameters
AI for Understanding Earthquakes Kaggle Challenge
airPy: Generating AI-ready data sets for air quality studies
This list of data sets has been made available as a resource to students in EEPS 1960D. The course instructional team did not create and does not maintain any of these data sets, so we cannot guarantee their accuracy or availability. It is the responsibility of the student to verify the source and vet the quality of any data set they choose to use in their project.
If you are familiar with other interesting data sets related to Earth, Environmental or Planetary Sciences, please email the course staff and we'll add them to the list!