This project is a comprehensive data engineering solution designed to collect, process, and manage speech-to-text training data at scale. The platform leverages Apache Kafka, Apache Airflow, and Apache Spark to create a robust data pipeline that captures audio recordings of a text corpus, processes them in real time, and prepares them for training speech-to-text machine learning models. The system was implemented specifically for Amharic-language news data, creating a valuable dataset for language model training.
Build a scalable speech-to-text data collection platform for AI speech recognition systems
Process and transform raw audio recordings into clean, high-quality training data
Create a reliable data source for training speech-to-text machine learning models
Support multilingual capabilities with initial focus on Amharic language
Implement a real-time processing pipeline to handle audio data efficiently
Initial Setup & Architecture Design
Configured Kafka cluster and defined topics based on text categories
Set up Spark environment for distributed processing
Designed the overall system architecture
Backend Development
Created API endpoints for audio recording submission
Implemented Kafka producers and consumers
Configured data partitioning for optimal performance
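The producer side described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names (`text_id`, `category`, `audio`), topic naming, and partition count are assumptions, and the producer is passed in so the sketch works with any kafka-python-style client.

```python
import json
import random


def build_record(text_id, category, audio_b64):
    """Serialize one recording submission as a JSON-encoded Kafka message.

    The schema (text_id, category, base64 audio payload) is illustrative;
    the write-up does not specify the actual message format.
    """
    return json.dumps(
        {"text_id": text_id, "category": category, "audio": audio_b64}
    ).encode("utf-8")


def choose_partition(num_partitions):
    """Random partition assignment, matching the balanced-distribution
    strategy described for the Kafka configuration."""
    return random.randrange(num_partitions)


def publish_recording(producer, topic, record, num_partitions):
    """Send one serialized record. `producer` is any object exposing a
    kafka-python-style send(topic, value=..., partition=...) method."""
    producer.send(
        topic, value=record, partition=choose_partition(num_partitions)
    )
```

With kafka-python, `producer` would be a `KafkaProducer(bootstrap_servers=..., acks="all")`; injecting it keeps the record-building logic testable without a running broker.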
Data Processing Pipeline
Developed PySpark scripts to read from Kafka and perform transformations
Implemented audio processing modules for noise reduction
Created Airflow DAGs to orchestrate the entire pipeline
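Stripped of the Spark-specific wiring (a streaming read from Kafka and a map/filter over message values), the per-record transformation the PySpark job applies can be sketched as a plain function. The field names mirror the hypothetical producer schema and are assumptions, not the project's actual code:

```python
import base64
import json


def transform_message(raw_value):
    """Parse and validate one Kafka message value from the recording API.

    Returns a cleaned dict with decoded audio bytes, or None for
    malformed records so the stream can filter them out downstream.
    Field names (text_id, category, audio) are illustrative.
    """
    try:
        record = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not record.get("text_id") or not record.get("audio"):
        return None
    try:
        audio_bytes = base64.b64decode(record["audio"], validate=True)
    except (ValueError, TypeError):
        return None
    return {
        "text_id": record["text_id"],
        "category": record.get("category", "unknown"),
        "audio": audio_bytes,
    }
```

In the actual job, a function like this would run inside the Spark stream (e.g. applied per micro-batch), with valid records forwarded to the audio-cleaning stage.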
Frontend Implementation
Designed and built ReactJS interface for text display and audio recording
Connected frontend to backend API services
Implemented user feedback mechanisms
Testing & Optimization
Tested different audio processing techniques and parameters
Optimized Kafka configuration for high throughput
Verified end-to-end data flow
Kafka Configuration: Configured for reliable message delivery, with random partition assignment to balance data across partitions
Audio Processing: Implemented stationary and non-stationary noise removal using the noisereduce library
Data Sources: Utilized the Amharic News Text Classification Dataset for the initial corpus
Cloud Integration: Deployed on AWS for scalable infrastructure
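The spectral noise removal itself is handled by the noisereduce library, but the companion step of dropping leading and trailing silence can be illustrated with a simple RMS energy gate. This is a deliberately minimal stand-in for the pipeline's librosa/noisereduce-based cleaning; the frame size and threshold values are arbitrary examples:

```python
import math


def frame_rms(samples, start, size):
    """Root-mean-square energy of one frame of float samples."""
    frame = samples[start:start + size]
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples, frame_size=512, threshold=0.01):
    """Drop leading and trailing frames whose RMS energy falls below
    `threshold`. Returns the sub-list spanning the first through last
    'loud' frame; returns [] if no frame exceeds the threshold."""
    if not samples:
        return samples
    n_frames = math.ceil(len(samples) / frame_size)
    loud = [
        i for i in range(n_frames)
        if frame_rms(samples, i * frame_size, frame_size) >= threshold
    ]
    if not loud:
        return []
    return samples[loud[0] * frame_size:(loud[-1] + 1) * frame_size]
```

In practice the cleaned signal would then be passed through `noisereduce.reduce_noise` for stationary or non-stationary noise removal, as described above.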
Web Interface for Recording: User-friendly ReactJS frontend for recording audio of provided text
Real-Time Data Processing: Seamless handling of audio streams through Kafka
Scalable Infrastructure: System design that can dynamically scale with growing data volumes
Noise Reduction: Sophisticated audio preprocessing to remove noise and silence from recordings
Distributed Storage: Integration with S3 bucket for reliable data storage
Workflow Automation: Scheduled tasks and dependencies management through Airflow
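The S3 storage step can be sketched as follows. The key layout (a `recordings/` prefix with a content hash for idempotent re-uploads) is an assumption for illustration, not the project's actual scheme, and the client is injected so the naming logic is testable without AWS credentials:

```python
import hashlib


def s3_key_for(text_id, audio_bytes, prefix="recordings"):
    """Build a content-addressed object key so re-uploading the same
    clip is idempotent. The prefix and layout are illustrative."""
    digest = hashlib.sha256(audio_bytes).hexdigest()[:16]
    return f"{prefix}/{text_id}/{digest}.wav"


def upload_recording(s3_client, bucket, text_id, audio_bytes):
    """Store one processed recording. `s3_client` is a boto3-style
    client exposing put_object(Bucket=..., Key=..., Body=...)."""
    key = s3_key_for(text_id, audio_bytes)
    s3_client.put_object(Bucket=bucket, Key=key, Body=audio_bytes)
    return key
```

With boto3 this would be `upload_recording(boto3.client("s3"), bucket, ...)`, typically invoked from the Airflow task that lands cleaned audio in the bucket.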
Event-Driven Architecture: Implemented a publish-subscribe model using Kafka to handle real-time audio data streams
Distributed Processing: Utilized Spark for parallel data processing and transformation capabilities
Workflow Automation: Orchestrated complex data pipeline tasks using Airflow DAGs
Audio Preprocessing: Applied noise reduction and audio cleaning techniques using Python libraries
Full-Stack Development: Created both frontend and backend components to enable user recording and data management
Successfully created a scalable platform for text-to-speech data collection
Established a reliable pipeline for processing audio recordings into clean training data
Built foundation for speech-to-text machine learning model development
Created valuable dataset for Amharic language processing
Kafka-Spark Integration: Resolved version compatibility issues between PySpark and Kafka connectors
Audio Quality: Implemented sophisticated noise reduction techniques to improve recording quality
Distributed Processing: Optimized data partitioning for efficient parallel processing
AWS Environment Setup: Tackled remote infrastructure configuration challenges
Apache Kafka: For real-time data streaming and event-driven architecture
Apache Spark: For distributed data processing and transformation
Apache Airflow: For workflow orchestration and automation
ReactJS: For frontend development
Python: For backend services and data processing
Libraries: noisereduce, librosa for audio processing
AWS: For cloud infrastructure and S3 storage
Docker: For containerization and deployment
10 Academy & Tenacious Intelligence Corp (Co-Founder)
10 Academy Managing Director, co-CEO