In today’s AI-driven environments, intelligent automation is only as good as the data that fuels it. I specialize in transforming unstructured content—such as SCORM packages, PDFs, image-based files, and text documents—into structured, machine-readable formats. Whether preparing files for ingestion into machine learning pipelines or converting curriculum into JSON for use in AI tools, I bring order and clarity to complex, messy content.
To enhance the intelligence and adaptability of our content, I built a knowledge graph that maps our curriculum to key standards and themes. This structured representation of learning goals enabled us to implement a retrieval-augmented generation (RAG) model that personalizes AI-driven assistance for educators. My work bridges instructional design with machine learning, creating smarter content systems that understand not only what is being taught, but why. This initiative demonstrates my ability to architect data for learning and lays the groundwork for more adaptive, responsive educational experiences powered by AI.
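The retrieval step described above can be sketched with a tiny in-memory graph. All lesson and standard identifiers below are invented for illustration; the real graph and its storage layer are assumptions, not the production system.

```python
from collections import defaultdict

# Edges: lesson -> standards/themes it covers (hypothetical identifiers).
CURRICULUM_GRAPH = {
    "lesson-3.1-fractions": ["CCSS.MATH.3.NF.A.1", "theme:number-sense"],
    "lesson-3.2-equivalence": ["CCSS.MATH.3.NF.A.3", "theme:number-sense"],
    "lesson-4.1-ecosystems": ["NGSS.4-LS1-1", "theme:systems"],
}

def build_reverse_index(graph):
    """Invert the graph so each standard/theme points back to its lessons."""
    index = defaultdict(list)
    for lesson, tags in graph.items():
        for tag in tags:
            index[tag].append(lesson)
    return index

def retrieve_lessons(standard, index):
    """Look up the lessons mapped to a standard -- the retrieval step a RAG
    pipeline runs before handing context to the LLM."""
    return sorted(index.get(standard, []))

index = build_reverse_index(CURRICULUM_GRAPH)
print(retrieve_lessons("theme:number-sense", index))
# → ['lesson-3.1-fractions', 'lesson-3.2-equivalence']
```

In practice the retrieved lessons (and their text) would be injected into the prompt, which is what lets the assistant answer in terms of specific standards rather than generic content.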
Conversion of Complex Content into Structured Data:
I routinely break down SCORM packages, image-based PDFs, text-heavy documents, and proprietary formats into structured outputs such as JSON, CSV, and Markdown.
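Once an activity has been extracted into a plain record, emitting the three target formats is mechanical. This is a minimal sketch with stdlib tools only; the `ACTIVITY` record and its fields are hypothetical stand-ins for real extracted content.

```python
import csv
import io
import json

# Hypothetical extracted activity record.
ACTIVITY = {"id": "act-1", "prompt": "Which fraction equals 1/2?", "type": "mcq"}

def to_json(record):
    """Serialize one record as pretty-printed JSON."""
    return json.dumps(record, indent=2)

def to_csv(records):
    """Serialize a list of like-shaped records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_markdown(record):
    """Render one record as a Markdown bullet list."""
    return "\n".join(f"- **{key}**: {value}" for key, value in record.items())
```

Keeping the intermediate record format-agnostic like this means each new output target is one small serializer, not a new parser.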
Normalization of Data for AI:
I remove inconsistencies, standardize formatting, and ensure schema alignment to prepare data for training and inference stages in AI workflows.
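A normalization pass like the one described can be sketched as a single function that coerces raw records toward a target schema. The schema and field names here are assumptions for illustration.

```python
def normalize_record(raw, schema):
    """Align a raw record to a target schema: drop unknown fields,
    fill missing ones with defaults, and collapse stray whitespace."""
    normalized = {}
    for field, default in schema.items():
        value = raw.get(field, default)
        if isinstance(value, str):
            value = " ".join(value.split())  # trim and collapse whitespace
        normalized[field] = value
    return normalized
```

Running every record through one function like this, before training or inference, is what guarantees the schema alignment downstream stages depend on.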
Ingestion of Content into AI Systems:
From curriculum and assessment documents to interactive activities, I extract key metadata, content, and structure for ingestion into Retrieval-Augmented Generation (RAG), voice synthesis, and tagging systems.
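The ingestion shape this implies is a list of chunk records, each carrying text plus metadata. The exact chunking and metadata fields below are illustrative assumptions, not the production schema.

```python
def to_chunks(doc_id, paragraphs, metadata):
    """Turn a document into chunk records ready for embedding/ingestion:
    each chunk keeps a stable id, its text, and shared document metadata."""
    return [
        {"id": f"{doc_id}-{i}", "text": p, "metadata": metadata}
        for i, p in enumerate(paragraphs)
        if p.strip()  # skip empty paragraphs but keep original indices
    ]
```

Stable chunk ids matter here: they let a RAG system cite, re-index, or re-tag a specific passage without reprocessing the whole document.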
Automation with AWS Lambda:
I integrate these processes into scalable AWS Lambda functions, enabling real-time or batch automation. These workflows often include document parsing, API calls to LLMs, and saving processed output to S3 or DynamoDB.
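A Lambda function of that shape might look like the sketch below. The bucket name, event fields, and placeholder processing logic are assumptions; `boto3` is imported lazily so the pure transformation stays testable without AWS credentials.

```python
import json

def process_document(text):
    """Placeholder parsing step: split raw text into trimmed blocks."""
    return {"blocks": [b.strip() for b in text.split("\n\n") if b.strip()]}

def handler(event, context):
    """AWS Lambda entry point: parse the incoming payload, process it,
    and persist the structured result to S3."""
    import boto3  # deferred import: local tests don't need AWS

    result = process_document(event["body"])
    boto3.client("s3").put_object(
        Bucket="processed-content",              # hypothetical bucket name
        Key=f"output/{event['doc_id']}.json",    # hypothetical key scheme
        Body=json.dumps(result).encode("utf-8"),
    )
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

The same handler works for both real-time (API Gateway) and batch (S3 event or EventBridge) triggers, which is what makes the pattern scale.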
Optimization for Downstream Use Cases:
By structuring content early, I enable more accurate machine learning outputs, better search and retrieval, more consistent tagging, and seamless text-to-speech generation.
SCORM to JSON Conversion
Problem: SCORM activity files were difficult to analyze and tag for AI.
Solution: Wrote a parser that extracts questions, prompts, and interaction data into JSON format.
Impact: Enabled automated evidence tagging and curriculum mapping via LLMs.
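The core of such a parser can be sketched against SCORM's `imsmanifest.xml`, which lists the package's activities as namespaced `<item>` elements. The sample manifest below is a simplified stand-in, not a file from the actual project.

```python
import xml.etree.ElementTree as ET

SAMPLE_MANIFEST = """\
<manifest xmlns="http://www.imsglobal.org/xsd/imscp_v1p1">
  <organizations>
    <organization>
      <item identifier="act-1"><title>Fractions Warm-Up</title></item>
      <item identifier="act-2"><title>Equivalence Quiz</title></item>
    </organization>
  </organizations>
</manifest>"""

def local(tag):
    """Strip the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def extract_items(xml_text):
    """Return {identifier: title} for every <item> in a SCORM manifest."""
    root = ET.fromstring(xml_text)
    items = {}
    for el in root.iter():
        if local(el.tag) == "item":
            title = next((c.text for c in el if local(c.tag) == "title"), None)
            items[el.get("identifier")] = title
    return items

print(extract_items(SAMPLE_MANIFEST))
# → {'act-1': 'Fractions Warm-Up', 'act-2': 'Equivalence Quiz'}
```

The resulting dict serializes directly to JSON, which is what makes the activities taggable and mappable by an LLM.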
Image-based PDF → Structured Text
Problem: Curriculum files were only available as scanned PDFs.
Solution: Used OCR + Lambda to convert to text, then parsed and segmented content into machine-readable blocks.
Impact: Made previously unusable content available for AI alignment, search, and text-to-speech workflows.
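The OCR-then-segment flow can be sketched in two small functions. The Tesseract call requires `pytesseract` and Pillow, so it is isolated behind deferred imports; the blank-line segmentation rule is a simplifying assumption.

```python
def segment_blocks(raw_text):
    """Split OCR output into machine-readable blocks on blank lines,
    collapsing the stray whitespace OCR tends to introduce."""
    return [" ".join(b.split()) for b in raw_text.split("\n\n") if b.strip()]

def ocr_page(image_path):
    """OCR one scanned page with Tesseract, then segment the text.

    Imports are deferred so the segmentation logic above is testable
    without the OCR dependencies installed.
    """
    import pytesseract
    from PIL import Image

    return segment_blocks(pytesseract.image_to_string(Image.open(image_path)))
```

In the Lambda version, `ocr_page` runs per page image pulled from S3, and the blocks feed the alignment, search, and text-to-speech stages.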
CSV & JSON Outputs for Voice Automation

Problem: Manual audio generation using scripts was time-consuming.
Solution: Built Python pipelines that extract lesson scripts and integrate with the ElevenLabs API.
Impact: Reduced production time by 80% and standardized quality across languages.
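A pipeline like that splits into an offline extraction step and an API call. The CSV column names below are assumptions; the endpoint shape follows the ElevenLabs text-to-speech API, with `voice_id` and `api_key` supplied by the caller.

```python
import csv
import io

def extract_scripts(csv_text):
    """Pull (lesson_id, script) pairs out of a CSV export,
    skipping rows with an empty script. Column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["lesson_id"], row["script"]) for row in reader if row["script"].strip()]

def synthesize(script, voice_id, api_key):
    """Send one script to the ElevenLabs text-to-speech endpoint
    and return the audio bytes."""
    import requests  # deferred so extraction stays testable offline

    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": api_key},
        json={"text": script},
    )
    resp.raise_for_status()
    return resp.content
```

Because extraction and synthesis are decoupled, the same script table can drive multiple voices or languages without re-parsing the source lessons, which is where the consistency gain comes from.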
Parsing & Conversion: Python, Pandas, pdfplumber, Tesseract OCR, BeautifulSoup
Cloud Automation: AWS Lambda, S3, Step Functions, EventBridge
AI Integration: Claude, OpenAI, ElevenLabs, Bedrock
Output Formats: JSON, CSV, Markdown, XML