Curate high-quality, well-documented datasets that will be shared openly and used by researchers around the world for downstream machine learning and modeling applications
Edit and improve the Rosetta Commons Data Curation Standards Guide and Decision Tree based on experience during the Data Bazaar
*For more details on timing and meals, see Schedule
Hotel check-in and group dinner
Guided, Hands-On Dataset Curation
Participants will be split into teams to work on easy starter datasets (see Day-1 Dataset Teaming & Assignments)
Teams will follow the Data Curation Guide and the Molecular Dataset Curation Guide on Rosetta Commons' HuggingFace
Best Practices for ML-Ready Dataset Curation + Custom Dataset Development
Gina El Nesr will give a lecture on dataset curation for downstream machine learning applications
Teams will re-form and apply the workflow from Tuesday to participants' datasets or to more challenging datasets from the Data Bazaar list
Dataset Finalization, Presentation, & Standards Synthesis
Teams will finalize curation, complete uploads to HuggingFace, and add collaborators
Teams will give mini-presentations (slide deck here) on:
Dataset purpose and intended downstream ML application
Key curation decisions and tradeoffs
Dataset split strategy and evaluation considerations
Known limitations, open questions, or future extensions
Group discussion to improve the data standards guide and recap lessons learned
Enjoy San Juan!
For questions, please email Hope Woods (hope.woods@omsf.io) or Ashley Vater (awvater@ucdavis.edu).