This portfolio section reflects my interdisciplinary expertise developed at Université Paris Dauphine-PSL and through personal projects, bridging Data Engineering, Machine Learning Engineering, and Cloud Engineering to solve real-world challenges at scale.
Distributed Systems & Data Engineering:
Hadoop Ecosystem: HDFS, MapReduce, YARN
Design and implementation of high-performance, fault-tolerant data pipelines
Scalable Analytics with Apache Spark:
Core abstractions: RDDs and DataFrames
Programming in Scala and Python
Deployment and optimization using Databricks for in‑memory iterative processing
Machine Learning Engineering:
Algorithms: K‑means clustering, gradient descent, perceptron (binary and multiclass)
Implementation of scalable machine learning models on Big Data platforms
Cloud Engineering & Cluster Management:
Google Cloud Platform: Cluster creation, configuration, and resource management
Integrating cloud infrastructure with distributed processing frameworks
Perceptron Implementation:
Developed a perceptron from scratch in Scala, implementing the step function, net input computation, and iterative weight updates with a bias adjustment (sketched after this list).
Demonstrated binary classification on a synthetic dataset and extended the approach by processing a real-world cardio training dataset via Spark DataFrames.
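A minimal sketch of the pieces named above (step activation, net input, and bias-adjusted weight updates), assuming dense `Array[Double]` features; the object and method names are illustrative, not the original code:

```scala
object Perceptron {
  // Step activation: fire (1) when the net input clears the threshold, else 0.
  def step(z: Double): Int = if (z >= 0.0) 1 else 0

  // Net input: dot product of weights and features, plus the bias term.
  def netInput(w: Array[Double], bias: Double, x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + bias

  // Iterative training: nudge weights and bias toward each misclassified point.
  def fit(data: Seq[(Array[Double], Int)], eta: Double, epochs: Int): (Array[Double], Double) = {
    var w = Array.fill(data.head._1.length)(0.0)
    var bias = 0.0
    for (_ <- 1 to epochs; (x, y) <- data) {
      val update = eta * (y - step(netInput(w, bias, x)))
      w = w.zip(x).map { case (wi, xi) => wi + update * xi }
      bias += update
    }
    (w, bias)
  }
}
```

The multiclass extension mentioned earlier can be built from the same primitives, for example by training one such binary unit per class.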
Gradient Descent for Linear Regression:
Implemented both batch and stochastic gradient descent on synthetic and normalized datasets (see the sketch after this list).
Built a suite of vector operations in Scala (including dot product, vector subtraction, and scalar multiplication) to support efficient gradient computations.
Validated and measured model convergence through custom accuracy functions.
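A minimal sketch of those vector helpers plus a batch gradient descent loop for linear least squares; the function names are illustrative, and the stochastic variant simply applies the per-example update inside the loop instead of averaging:

```scala
// Vector helpers supporting the gradient computations.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def subtract(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x - y }

def scale(a: Array[Double], s: Double): Array[Double] = a.map(_ * s)

// Batch gradient descent for MSE: theta <- theta - eta * (1/n) * X^T (X theta - y).
def gradientDescent(xs: Array[Array[Double]], ys: Array[Double],
                    eta: Double, iters: Int): Array[Double] = {
  var theta = Array.fill(xs.head.length)(0.0)
  for (_ <- 1 to iters) {
    val perExample = xs.zip(ys).map { case (x, y) => scale(x, dot(theta, x) - y) }
    val grad = perExample.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
      .map(_ / xs.length)
    theta = subtract(theta, scale(grad, eta))
  }
  theta
}
```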
Iterative Connected Components:
Created a Spark RDD–based algorithm that emulates iterative MapReduce cycles to compute connected components (sketched after this list).
Developed custom mapper and reducer functions that propagate lexicographic minimum labels across nodes, iterating until convergence.
Extended the project to demonstrate dynamic graph operations such as counting arcs, identifying sinks, and detecting triangles, showing proficiency in grouping, union operations, and RDD transformations.
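A condensed sketch of the propagation loop, assuming an existing SparkContext `sc` and numeric node ids (so the lexicographic minimum becomes a numeric minimum); the sample edges are illustrative:

```scala
val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (4L, 5L)))
// Start every node off labelled with its own id.
var labels = edges.flatMap { case (u, v) => Seq(u, v) }.distinct().map(n => (n, n))

var changed = true
while (changed) {
  // Mapper: each node offers its current label to every neighbour.
  val offers = edges.flatMap { case (u, v) => Seq((u, v), (v, u)) }
    .join(labels)                                   // (node, (neighbour, label))
    .map { case (_, (neighbour, label)) => (neighbour, label) }
  // Reducer: each node keeps the smallest label it has seen so far.
  val updated = labels.union(offers).reduceByKey((a, b) => math.min(a, b))
  // Converged once no label changed this round (caching omitted for brevity).
  changed = updated.join(labels).filter { case (_, (n, o)) => n != o }.count() > 0
  labels = updated
}
```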
Additional RDD Tests:
Conducted multiple lab exercises on fundamental RDD operations: word count, partitioning, flatMap, union, and complex joins (see the word-count sketch below).
Experimented with grouping and aggregating key-value pairs to optimize shuffles and understand the efficiency trade-offs in distributed processing.
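For reference, the canonical word-count exercise in the RDD API, again assuming an existing SparkContext `sc` and an illustrative input path:

```scala
val counts = sc.textFile("data/sample.txt")   // illustrative path
  .flatMap(_.split("\\s+"))                   // lines to words
  .map(word => (word, 1))                     // key-value pairs
  .reduceByKey(_ + _)                         // shuffle and sum per word
counts.take(10).foreach(println)
```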
Clustering Algorithm Implementation:
Designed an initial implementation of k‑means clustering using Spark RDDs, emphasizing the initialization, assignment, and update steps (a sketch follows this list).
Analyzed the performance of the basic algorithm and introduced vector operations to support distance computation and centroid updates.
Worked with the Iris dataset by reading data via RDDs, inferring the number of clusters from distinct labels, and evaluating clustering efficiency.
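A condensed sketch of one assignment-and-update round, assuming an existing SparkContext `sc`; the sample points and the k = 2 initialization are illustrative, not the Iris configuration:

```scala
// Euclidean distance supporting the assignment step.
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

val points = sc.parallelize(Seq(
  Array(5.1, 3.5), Array(4.9, 3.0), Array(6.7, 3.1), Array(6.3, 2.5)))
// Initialization: sample k starting centroids from the data.
var centroids = points.takeSample(withReplacement = false, 2)

// Assignment: pair each point with the index of its nearest centroid.
val assigned = points.map { p =>
  val idx = centroids.indices.minBy(i => distance(p, centroids(i)))
  (idx, (p, 1))
}
// Update: average the points of each cluster to obtain new centroids.
centroids = assigned
  .reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
  }
  .mapValues { case (sum, count) => sum.map(_ / count) }
  .collect().sortBy(_._1).map(_._2)
```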
Flat Data and Nested Data Processing:
Utilized Spark DataFrame operations to perform filtering, grouping, and aggregation on flat CSV data (e.g., Books, Users, and Ratings).
Enhanced skills in Spark SQL by registering DataFrames as temporary views and executing complex queries.
Managed and flattened nested JSON structures (using functions like explode, from_unixtime, and month) to extract and pivot information from social network data.
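A hedged sketch of both sides, assuming an existing SparkSession `spark`; the file paths and column names (ISBN, tags, created_utc) are assumptions for illustration, not the actual schemas:

```scala
import org.apache.spark.sql.functions.{col, explode, from_unixtime, month}

// Flat CSV side: aggregate with the DataFrame API, then the same via Spark SQL.
val ratings = spark.read.option("header", "true").csv("data/Ratings.csv")
ratings.groupBy("ISBN").count().show(5)
ratings.createOrReplaceTempView("ratings")
spark.sql("SELECT ISBN, COUNT(*) AS n FROM ratings GROUP BY ISBN ORDER BY n DESC").show(5)

// Nested JSON side: flatten an array column, derive a month from a Unix timestamp.
spark.read.json("data/posts.json")
  .withColumn("tag", explode(col("tags")))
  .withColumn("month", month(from_unixtime(col("created_utc"))))
  .groupBy("month").count().show()
```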
Google Cloud Integration:
Designed and administered Google Cloud clusters to run Spark applications, focusing on resource configuration and deployment in a cloud environment (an illustrative configuration fragment follows this list).
Demonstrated practical experience in integrating cloud-based processing with scalable analytics via Databricks and Spark’s distributed capabilities.
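As one illustrative fragment, resource settings can be declared on the SparkSession builder when submitting a job to such a cluster; the values below are placeholders, not the configuration actually used:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CloudAnalyticsJob")                       // illustrative name
  .config("spark.executor.memory", "4g")              // sized to the cluster's nodes
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")  // let the cluster scale executors
  .getOrCreate()
```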
Developed and implemented a MapReduce-based algorithm in PySpark to determine connected components in extremely large graphs, following the CCF (Connected Component Finder) approach of Kardes et al. (referenced below). The project involved iterative processing using RDDs and efficient job design on Databricks.
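The coursework implementation was in PySpark; for consistency with the other sketches in this section, here is a minimal Scala rendering of the CCF-Iterate step from Kardes et al., with an illustrative function name and the paper's new-pair counter mapped onto a Spark accumulator:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// One CCF-Iterate pass: from (node, node) pairs, attach each node to the
// smallest id seen in its neighbourhood and count the newly created pairs.
// The driver repeats this (with a fresh counter) until no new pairs appear.
def ccfIterate(pairs: RDD[(Long, Long)], newPairs: LongAccumulator): RDD[(Long, Long)] = {
  pairs
    .flatMap { case (u, v) => Seq((u, v), (v, u)) }   // map phase
    .groupByKey()
    .flatMap { case (key, values) =>                  // reduce phase
      val min = values.min
      if (min < key) {
        // Emit (key -> min) plus (v -> min) for every other neighbour; the
        // latter are new information, so each one bumps the counter. (Counter
        // updates inside transformations are approximate under task retries.)
        Iterator.single((key, min)) ++
          values.iterator.filter(_ != min).map { v => newPairs.add(1); (v, min) }
      } else Iterator.empty
    }
    .distinct()                                       // CCF-Dedup
}
```

Once the counter stays at zero, the surviving (node, minId) pairs assign every node to its component.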
Reference: Hakan Kardes, Siddharth Agrawal, Xin Wang, and Ang Sun. "CCF: Fast and Scalable Connected Component Computation in MapReduce." Data Research, inome Inc., Bellevue, WA, USA.