Data Bazaar

Rosetta Commons' Hugging Face

Event Goals

Curate high-quality, well-documented datasets that will be shared openly and used by researchers around the world for downstream machine learning and modeling applications
Edit and improve the Rosetta Commons Data Curation Standards Guide and Decision Tree based on experience during the Data Bazaar

Schedule

*For more details on timing and meals, see Schedule

Monday (3/2):

Hotel check-in and group dinner

Tuesday (3/3):

Guided, Hands-On Dataset Curation

Participants will be split into teams to work on easy starter datasets: Day-1 Dataset Teaming & Assignments
Teams will follow the Data Curation Guide & Molecular Dataset Curation Guide on Rosetta Commons' Hugging Face

Wednesday (3/4):

Best Practices for ML-Ready Dataset Curation + Custom Dataset Development

Gina El Nesr will give a lecture on dataset curation for downstream machine learning applications
Teams will re-form and apply Tuesday's workflow to participants' own datasets or to more challenging datasets from the Data Bazaar list.

Thursday (3/5):

Dataset Finalization, Presentation, & Standards Synthesis

Teams will finalize curation, complete uploads to Hugging Face, and add collaborators
Teams will give mini-presentations (slide deck here) on:
- Dataset purpose and intended downstream ML application
- Key curation decisions and tradeoffs
- Dataset split strategy and evaluation considerations
- Known limitations, open questions, or future extensions
Feedback Form
Group discussion to improve the Data Curation Standards Guide and recap lessons learned

Friday (3/6) - Sunday (3/8):

Enjoy San Juan!

Questions?

For questions, please email Hope Woods (hope.woods@omsf.io) or Ashley Vater (awvater@ucdavis.edu).
