Datasets
Every project needs a rich dataset for learning and for implementing data-driven solutions. Below are some examples of open-source datasets that could be leveraged for the course project.
Strategies & Resources for Finding Research Datasets
There are many resources for finding high-quality open-source datasets for your course project. One strategy is to search for recent research papers (e.g., from the last 3 years) on your topic of interest published in Nature venues such as Nature Biomedical Engineering, Nature Medicine, and npj Digital Health. Given the data-sharing policy of such journals, most papers are required to share a link to their dataset under the "Data Availability" section of the paper. This strategy can help you find recent high-quality datasets for interesting and impactful research.
Other example resources are:
Note: When you find a dataset that seems to fit your research, it is your responsibility as the researcher to evaluate its quality immediately. Do not assume that the dataset is high quality; that assumption is often incorrect. Some important criteria for evaluating datasets include:
Volume: The dataset cannot be too small because this will limit the types of questions you can ask and answer.
Sparsity: The dataset cannot be too sparse because this will make it challenging to build robust learning models.
Age: The dataset cannot be too old (e.g., over 5 or 10 years old) because that means it has existed for a while and other researchers have likely exhausted its use/benefit (i.e., it may be difficult for you to find new questions to ask and answer with such a dataset).
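As a rough first pass, the three criteria above can be turned into a quick summary of a candidate dataset. The sketch below is only illustrative: the function name `dataset_report`, the sample records, and the set of values treated as "missing" are assumptions made for this example, and what counts as too small, too sparse, or too old remains a judgment call for your project.

```python
from datetime import date


def dataset_report(rows, release_year, today=None,
                   missing=(None, "", "NA", "NaN")):
    """Summarize volume, sparsity, and age for a list of dict records.

    `missing` lists the values counted as absent; this is an assumption
    and should be adapted to how the dataset encodes missing entries.
    """
    today = today or date.today()
    n_rows = len(rows)
    # Use the union of keys so that absent fields count as missing cells.
    columns = sorted({key for row in rows for key in row})
    n_cells = n_rows * len(columns)
    n_missing = sum(
        1 for row in rows for col in columns if row.get(col) in missing
    )
    return {
        "n_rows": n_rows,                       # volume
        "n_columns": len(columns),
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,  # sparsity
        "age_years": today.year - release_year,  # age
    }


# Hypothetical records, made up purely to exercise the function.
rows = [
    {"patient_id": "p1", "age": "63", "diagnosis": "AF"},
    {"patient_id": "p2", "age": "", "diagnosis": "normal"},
    {"patient_id": "p3", "age": "71", "diagnosis": "NA"},
    {"patient_id": "p4", "age": "58"},
]
report = dataset_report(rows, release_year=2017, today=date(2025, 1, 1))
print(report)
```

Records read with Python's `csv.DictReader` already have this list-of-dictionaries shape, so the same function can be applied to a downloaded CSV file after collecting the reader's rows into a list.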
Example Datasets
Below are some example datasets that are likely high quality:
Cancer - TCGA:
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
NIH Chest x-ray dataset:
Deep Lesion - NIH:
AF Classification from a Short Single Lead ECG Recording:
The 2017 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify, from a single short ECG lead recording (between 30 s and 60 s in length), whether the recording shows normal sinus rhythm, atrial fibrillation (AF), an alternative rhythm, or is too noisy to be classified.