The Multimodal Academic English Learner Corpus (MAELC) is a collection of oral and written discourse produced by undergraduate students in Hidalgo, Mexico. The corpus focuses on English as a Foreign Language (EFL) learners enrolled in Research Seminar courses within the BA in English Language Teaching. The project began in December 2024 and is conducted with the institutional support of the Instituto de Ciencias Sociales y Humanidades at the Universidad Autónoma del Estado de Hidalgo (UAEH).
Objective
The primary goal of MAELC is to provide an empirical foundation for research on academic writing and oral production. It specifically targets high-proficiency levels (CEFR B2–C1) within the UAEH higher education context, facilitating a deeper understanding of how advanced learners navigate academic requirements.
Methodology
The corpus includes data from university students in their seventh semester of the BA in English Language Teaching. As part of their academic requirements, participants engage in a formal research presentation process consisting of two primary components:
Oral Component: Students deliver a research presentation to a panel of evaluators. Each presentation is limited to seven minutes and a maximum of nine slides, and it is video-recorded.
Written Component: Participants submit a comprehensive final research paper. To maintain structural consistency within the corpus, all papers follow a standardized academic format: Introduction, Literature Review, Methodology, Findings, and Conclusions. Students also submit the slide file they used to present their research.
Data:
Number of participants:
The MAELC currently contains data collected from students enrolled in the Research Seminar II course during the seventh semester of the BA in English Language Teaching at UAEH across four academic semesters.
● July–December 2024: 16 participants
● January–June 2025: 20 participants
● July–December 2025: 29 participants
● January–June 2026: 18 participants
In total, the corpus includes data from 83 participants. The collection integrates multiple forms of academic production generated as part of the students’ Research Seminar II coursework.
Number of written and spoken files:
The corpus currently contains academic materials produced by 83 students enrolled in the Research Seminar II course. Each participant contributed three primary types of documents:
● 83 written research papers
● 83 sets of presentation slides
● 83 transcriptions of oral presentations
In total, the corpus is composed of 249 files representing written and spoken discourse.
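The totals reported above follow from the per-semester participant counts and the three documents contributed by each participant. As a quick consistency check, the arithmetic can be sketched in Python; the variable names are illustrative only and are not part of the corpus.

```python
# Participant counts per semester, as reported in this document.
participants_per_semester = {
    "July–December 2024": 16,
    "January–June 2025": 20,
    "July–December 2025": 29,
    "January–June 2026": 18,
}

# Each participant contributes a paper, a slide set, and a transcript.
DOCUMENTS_PER_PARTICIPANT = 3

total_participants = sum(participants_per_semester.values())
total_files = total_participants * DOCUMENTS_PER_PARTICIPANT

# Number of files expected in each semester folder.
files_per_semester = {
    semester: count * DOCUMENTS_PER_PARTICIPANT
    for semester, count in participants_per_semester.items()
}

print(total_participants)  # 83
print(total_files)         # 249
```

The per-semester figures in `files_per_semester` can be compared against the actual contents of each folder to detect missing documents.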
The written component consists of:
83 final research papers developed throughout the semester as part of the students’ research projects. To ensure consistency across the corpus, all papers follow a standardized five-chapter structure:
1. Introduction
2. Literature Review
3. Methodology
4. Findings
5. Conclusions
83 sets of presentation slides created by students to summarize and present the main aspects of their research papers during their final oral presentation.
The spoken component consists of:
83 transcriptions derived from video-recorded oral presentations. During these presentations, students shared the main aspects of their research projects. These presentations had a maximum duration of seven minutes.
How to interpret file labels:
The corpus is organized into folders according to academic semester. Each folder contains the three corresponding documents per participant produced during that semester: the research paper, the presentation slides, and the transcript of the oral presentation.
The semester folders are labeled as follows:
● Seminario II-JD2024
● Seminario II-EJ2025
● Seminario II-JD2025
● Seminario II-EJ2026
In each folder, files follow a labeling system designed to facilitate organization, retrieval, and analysis of the corpus data. File names follow this structure:
[Semester]-[Year]-[Participant Number]_[Document Type]
For example:
● JD-2024-001_PAPER
● JD-2024-001_TRANSCRIPT
● JD-2024-001_PRESENTATION
The first two sections of the label (e.g., JD-2024) identify the academic semester to which the participant belongs. The abbreviations correspond to the semester periods:
● JD = July–December
● EJ = January–June
The third section (e.g., 001) is the identification number assigned to the participant.
The final section specifies the type of document included in the corpus:
● PAPER = final written research paper
● TRANSCRIPT = transcription of the oral presentation
● PRESENTATION = presentation slides used during the oral presentation
This labeling system keeps file names consistent across the corpus, preserves participant anonymity, and simplifies the organization of the files.
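For users who process the corpus programmatically, the labeling scheme can be parsed with a short script. The following is a minimal Python sketch, not an official MAELC tool: the names `parse_label`, `FileLabel`, and `LABEL_PATTERN` are hypothetical, and the pattern assumes that participant numbers always have three digits (as in the example 001) and that any file extension can be ignored.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Semester codes documented in the labeling scheme.
SEMESTER_PERIODS = {"JD": "July–December", "EJ": "January–June"}

# Assumption: participant numbers are always three digits, as in "001".
LABEL_PATTERN = re.compile(
    r"(?P<semester>JD|EJ)-(?P<year>\d{4})-(?P<participant>\d{3})"
    r"_(?P<doc_type>PAPER|TRANSCRIPT|PRESENTATION)"
)


class FileLabel(NamedTuple):
    semester: str      # "JD" or "EJ"
    year: int          # e.g. 2024
    participant: str   # kept as a string to preserve leading zeros
    doc_type: str      # "PAPER", "TRANSCRIPT", or "PRESENTATION"

    @property
    def folder(self) -> str:
        """Name of the semester folder this file belongs to."""
        return f"Seminario II-{self.semester}{self.year}"


def parse_label(filename: str) -> FileLabel:
    """Split a MAELC file label into its components."""
    stem = Path(filename).stem  # drops a trailing extension, if any
    match = LABEL_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"Not a MAELC file label: {filename!r}")
    return FileLabel(
        semester=match["semester"],
        year=int(match["year"]),
        participant=match["participant"],
        doc_type=match["doc_type"],
    )


label = parse_label("JD-2024-001_TRANSCRIPT")
print(label.folder)                        # Seminario II-JD2024
print(SEMESTER_PERIODS[label.semester])    # July–December
```

Keeping the participant number as a string preserves the leading zeros, so the three files that belong to one participant can be grouped by the combination of semester, year, and participant number.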
How to cite: Carretero, A., Flores, A. & Hidalgo, H. (2026). Multimodal Academic English Learner Corpus.