This project is a comprehensive data engineering solution designed to collect, process, and manage speech-to-text training data at scale. The platform leverages Apache Kafka, Apache Airflow, and Apache Spark to create a robust data pipeline that captures audio recordings of a text corpus, processes them in real time, and prepares them for training speech-to-text machine learning models. The system was implemented specifically for Amharic-language news data, creating a valuable dataset for language model training.
Build a scalable speech-to-text data collection platform for AI speech recognition systems
Process and transform raw audio recordings into clean, high-quality training data
Create a reliable data source for training speech-to-text machine learning models
Support multilingual capabilities with initial focus on Amharic language
Implement a real-time processing pipeline to handle audio data efficiently
Initial Setup & Architecture Design
Configured Kafka cluster and defined topics based on text categories
Set up Spark environment for distributed processing
Designed the overall system architecture
Backend Development
Created API endpoints for audio recording submission
Implemented Kafka producers and consumers
Configured data partitioning for optimal performance
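The producer side described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names (`text_id`, `category`, `audio`), topic naming, and partition count are assumptions, and the producer is passed in so the sketch works with any kafka-python-style client.

```python
import json
import random


def build_record(text_id, category, audio_b64):
    """Serialize one recording submission as a JSON-encoded Kafka message.

    The schema (text_id, category, base64 audio payload) is illustrative;
    the write-up does not specify the actual message format.
    """
    return json.dumps(
        {"text_id": text_id, "category": category, "audio": audio_b64}
    ).encode("utf-8")


def choose_partition(num_partitions):
    """Random partition assignment, matching the balanced-distribution
    strategy described for the Kafka configuration."""
    return random.randrange(num_partitions)


def publish_recording(producer, topic, record, num_partitions):
    """Send one serialized record. `producer` is any object exposing a
    kafka-python-style send(topic, value=..., partition=...) method."""
    producer.send(
        topic, value=record, partition=choose_partition(num_partitions)
    )
```

With kafka-python, `producer` would be a `KafkaProducer(bootstrap_servers=..., acks="all")`; injecting it keeps the record-building logic testable without a running broker.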
Data Processing Pipeline
Developed PySpark scripts to read from Kafka and perform transformations
Implemented audio processing modules for noise reduction
Created Airflow DAGs to orchestrate the entire pipeline
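Stripped of the Spark-specific wiring (a streaming read from Kafka and a map/filter over message values), the per-record transformation the PySpark job applies can be sketched as a plain function. The field names mirror the hypothetical producer schema and are assumptions, not the project's actual code:

```python
import base64
import json


def transform_message(raw_value):
    """Parse and validate one Kafka message value from the recording API.

    Returns a cleaned dict with decoded audio bytes, or None for
    malformed records so the stream can filter them out downstream.
    Field names (text_id, category, audio) are illustrative.
    """
    try:
        record = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not record.get("text_id") or not record.get("audio"):
        return None
    try:
        audio_bytes = base64.b64decode(record["audio"], validate=True)
    except (ValueError, TypeError):
        return None
    return {
        "text_id": record["text_id"],
        "category": record.get("category", "unknown"),
        "audio": audio_bytes,
    }
```

In the actual job, a function like this would run inside the Spark stream (e.g. applied per micro-batch), with valid records forwarded to the audio-cleaning stage.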
Frontend Implementation
Designed and built ReactJS interface for text display and audio recording
Connected frontend to backend API services
Implemented user feedback mechanisms
Testing & Optimization
Tested different audio processing techniques and parameters
Optimized Kafka configuration for high throughput
Verified end-to-end data flow
Kafka Configuration: Configured for reliable message delivery, with random partition assignment to balance data across partitions
Audio Processing: Implemented stationary and non-stationary noise removal using the noisereduce library
Data Sources: Utilized the Amharic News Text Classification Dataset for the initial corpus
Cloud Integration: Deployed on AWS for scalable infrastructure
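The spectral noise removal itself is handled by the noisereduce library, but the companion step of dropping leading and trailing silence can be illustrated with a simple RMS energy gate. This is a deliberately minimal stand-in for the pipeline's librosa/noisereduce-based cleaning; the frame size and threshold values are arbitrary examples:

```python
import math


def frame_rms(samples, start, size):
    """Root-mean-square energy of one frame of float samples."""
    frame = samples[start:start + size]
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples, frame_size=512, threshold=0.01):
    """Drop leading and trailing frames whose RMS energy falls below
    `threshold`. Returns the sub-list spanning the first through last
    'loud' frame; returns [] if no frame exceeds the threshold."""
    if not samples:
        return samples
    n_frames = math.ceil(len(samples) / frame_size)
    loud = [
        i for i in range(n_frames)
        if frame_rms(samples, i * frame_size, frame_size) >= threshold
    ]
    if not loud:
        return []
    return samples[loud[0] * frame_size:(loud[-1] + 1) * frame_size]
```

In practice the cleaned signal would then be passed through `noisereduce.reduce_noise` for stationary or non-stationary noise removal, as described above.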
Web Interface for Recording: User-friendly ReactJS frontend for recording audio of provided text
Real-Time Data Processing: Seamless handling of audio streams through Kafka
Scalable Infrastructure: System design that can dynamically scale with growing data volumes
Noise Reduction: Sophisticated audio preprocessing to remove noise and silence from recordings
Distributed Storage: Integration with S3 bucket for reliable data storage
Workflow Automation: Scheduled tasks and dependencies management through Airflow
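The S3 storage step can be sketched as follows. The key layout (a `recordings/` prefix with a content hash for idempotent re-uploads) is an assumption for illustration, not the project's actual scheme, and the client is injected so the naming logic is testable without AWS credentials:

```python
import hashlib


def s3_key_for(text_id, audio_bytes, prefix="recordings"):
    """Build a content-addressed object key so re-uploading the same
    clip is idempotent. The prefix and layout are illustrative."""
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{prefix}/{text_id}/{digest}.wav"


def upload_recording(s3_client, bucket, text_id, audio_bytes):
    """Store one processed recording. `s3_client` is a boto3-style
    client exposing put_object(Bucket=..., Key=..., Body=...)."""
    key = s3_key_for(text_id, audio_bytes)
    s3_client.put_object(Bucket=bucket, Key=key, Body=audio_bytes)
    return key
```

With boto3 this would be `upload_recording(boto3.client("s3"), bucket, ...)`, typically invoked from the Airflow task that lands cleaned audio in the bucket.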
Event-Driven Architecture: Implemented a publish-subscribe model using Kafka to handle real-time audio data streams
Distributed Processing: Utilized Spark for parallel data processing and transformation capabilities
Workflow Automation: Orchestrated complex data pipeline tasks using Airflow DAGs
Audio Preprocessing: Applied noise reduction and audio cleaning techniques using Python libraries
Full-Stack Development: Created both frontend and backend components to enable user recording and data management
Successfully created a scalable platform for text-to-speech data collection
Established a reliable pipeline for processing audio recordings into clean training data
Built foundation for speech-to-text machine learning model development
Created valuable dataset for Amharic language processing
Kafka-Spark Integration: Resolved version compatibility issues between PySpark and Kafka connectors
Audio Quality: Implemented sophisticated noise reduction techniques to improve recording quality
Distributed Processing: Optimized data partitioning for efficient parallel processing
AWS Environment Setup: Tackled remote infrastructure configuration challenges
Apache Kafka: For real-time data streaming and event-driven architecture
Apache Spark: For distributed data processing and transformation
Apache Airflow: For workflow orchestration and automation
ReactJS: For frontend development
Python: For backend services and data processing
Libraries: noisereduce, librosa for audio processing
AWS: For cloud infrastructure and S3 storage
Docker: For containerization and deployment
10 Academy & Tenacious Intelligence Corp (Co-Founder)
10 Academy Managing Director, co-CEO