We release VedantaNY-10M, a curated dataset of transcripts covering over 750 hours of public discourses on the Indian philosophy of Advaita Vedanta. Sourced from 612 YouTube lectures by Swami Sarvapriyananda of the Vedanta Society of New York (VSNY), the dataset contains ~10 million tokens. These lectures offer a comprehensive exposition of Advaita Vedanta, making the dataset a valuable resource for philosophy and linguistics research.
The transcripts in the dataset can be regenerated using the transcription code we provide.
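As a rough illustration of what such a transcription step could look like, the sketch below uses the `openai-whisper` Python package with the large-v2 checkpoint, the model used for this dataset. It is not the released transcription code: the directory layout, the `.mp3` extension, and the helper names `segments_to_text` and `transcribe_directory` are assumptions made for the example, and audio is assumed to have been downloaded already.

```python
from pathlib import Path


def segments_to_text(segments):
    """Join Whisper-style segments (dicts with a "text" key) into one string."""
    pieces = (seg["text"].strip() for seg in segments)
    return " ".join(p for p in pieces if p)


def transcribe_directory(audio_dir, out_dir, model_name="large-v2"):
    """Transcribe every .mp3 in audio_dir and write one .txt file per lecture.

    Hypothetical helper: the file layout and extension are assumptions.
    """
    # Imported lazily; needs `pip install openai-whisper` and ffmpeg on PATH.
    import whisper

    model = whisper.load_model(model_name)  # load the checkpoint once, reuse it
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for audio_path in sorted(Path(audio_dir).glob("*.mp3")):
        result = model.transcribe(str(audio_path))
        text = segments_to_text(result["segments"])
        (out_dir / f"{audio_path.stem}.txt").write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Example paths are placeholders, not paths from the released code.
    transcribe_directory("audio", "transcripts")
```

Loading the model once and reusing it across files avoids paying the checkpoint load cost per lecture, which matters for a corpus of several hundred recordings.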
The dataset consists primarily of English content (~97%), with romanized Sanskrit transliterations making up the remainder (~3%). The transcripts are generated automatically by OpenAI's open-source Whisper large-v2 model. Frequently occurring Sanskrit terms in the corpus are shown below.