This portfolio section reflects my interdisciplinary expertise developed at Université Paris Dauphine-PSL and through personal projects, bridging Data Engineering, Machine Learning Engineering, and Cloud Engineering to solve real-world challenges at scale.
Distributed Systems & Data Engineering:
Hadoop Ecosystem: HDFS, MapReduce, YARN
Design and implementation of high-performance, fault-tolerant data pipelines
Scalable Analytics with Apache Spark:
Core abstractions: RDDs and DataFrames
Programming in Scala and Python
Deployment and optimization using Databricks for in‑memory iterative processing
Machine Learning Engineering:
Algorithms: K‑means clustering, gradient descent, perceptron (binary and multiclass)
Implementation of scalable machine learning models on Big Data platforms
Cloud Engineering & Cluster Management:
Google Cloud Platform: Cluster creation, configuration, and resource management
Integrating cloud infrastructure with distributed processing frameworks
Perceptron Implementation:
Developed a perceptron from scratch in Scala, implementing the step function, net input computation, and iterative weight updates with a bias adjustment (sketched after this list).
Demonstrated binary classification on a synthetic dataset and extended the approach by processing a real-world cardio training dataset via Spark DataFrames.
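A minimal sketch of the pieces named above (step activation, net input, and bias-adjusted weight updates), assuming dense `Array[Double]` features; the object and method names are illustrative, not the original code:

```scala
object Perceptron {
  // Step activation: fire (1) when the net input clears the threshold, else 0.
  def step(z: Double): Int = if (z >= 0.0) 1 else 0

  // Net input: dot product of weights and features, plus the bias term.
  def netInput(w: Array[Double], bias: Double, x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + bias

  // Iterative training: nudge weights and bias toward each misclassified point.
  def fit(data: Seq[(Array[Double], Int)], eta: Double, epochs: Int): (Array[Double], Double) = {
    var w = Array.fill(data.head._1.length)(0.0)
    var bias = 0.0
    for (_ <- 1 to epochs; (x, y) <- data) {
      val update = eta * (y - step(netInput(w, bias, x)))
      w = w.zip(x).map { case (wi, xi) => wi + update * xi }
      bias += update
    }
    (w, bias)
  }
}
```

The multiclass extension mentioned earlier can be built from the same primitives, for example by training one such binary unit per class.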
Gradient Descent for Linear Regression:
Implemented both batch and stochastic gradient descent on synthetic and normalized datasets (see the sketch after this list).
Built a suite of vector operations in Scala (including dot product, vector subtraction, and scalar multiplication) to support efficient gradient computations.
Validated and measured model convergence through custom accuracy functions.
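A minimal sketch of those vector helpers plus a batch gradient descent loop for linear least squares; the function names are illustrative, and the stochastic variant simply applies the per-example update inside the loop instead of averaging:

```scala
// Vector helpers supporting the gradient computations.
def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def subtract(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (x, y) => x - y }

def scale(a: Array[Double], s: Double): Array[Double] = a.map(_ * s)

// Batch gradient descent for MSE: theta <- theta - eta * (1/n) * X^T (X theta - y).
def gradientDescent(xs: Array[Array[Double]], ys: Array[Double],
                    eta: Double, iters: Int): Array[Double] = {
  var theta = Array.fill(xs.head.length)(0.0)
  for (_ <- 1 to iters) {
    val perExample = xs.zip(ys).map { case (x, y) => scale(x, dot(theta, x) - y) }
    val grad = perExample.reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
      .map(_ / xs.length)
    theta = subtract(theta, scale(grad, eta))
  }
  theta
}
```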
Iterative Connected Components:
Created a Spark RDD–based algorithm that emulates iterative MapReduce cycles to compute connected components (sketched after this list).
Developed custom mapper and reducer functions that propagate lexicographic minimum labels across nodes, iterating until convergence.
Extended the project to demonstrate dynamic graph operations such as counting arcs, identifying sinks, and detecting triangles, showing proficiency in grouping, union operations, and RDD transformations.
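A condensed sketch of the propagation loop, assuming an existing SparkContext `sc` and numeric node ids (so the lexicographic minimum becomes a numeric minimum); the sample edges are illustrative:

```scala
val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (4L, 5L)))
// Start every node off labelled with its own id.
var labels = edges.flatMap { case (u, v) => Seq(u, v) }.distinct().map(n => (n, n))

var changed = true
while (changed) {
  // Mapper: each node offers its current label to every neighbour.
  val offers = edges.flatMap { case (u, v) => Seq((u, v), (v, u)) }
    .join(labels)                                   // (node, (neighbour, label))
    .map { case (_, (neighbour, label)) => (neighbour, label) }
  // Reducer: each node keeps the smallest label it has seen so far.
  val updated = labels.union(offers).reduceByKey((a, b) => math.min(a, b))
  // Converged once no label changed this round (caching omitted for brevity).
  changed = updated.join(labels).filter { case (_, (n, o)) => n != o }.count() > 0
  labels = updated
}
```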
Additional RDD Tests:
Conducted multiple lab exercises on fundamental RDD operations: word count, partitioning, flatMap, union, and complex joins (see the word-count sketch below).
Experimented with grouping and aggregating key-value pairs to optimize shuffles and understand the efficiency trade-offs in distributed processing.
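For reference, the canonical word-count exercise in the RDD API, again assuming an existing SparkContext `sc` and an illustrative input path:

```scala
val counts = sc.textFile("data/sample.txt")   // illustrative path
  .flatMap(_.split("\\s+"))                   // lines to words
  .map(word => (word, 1))                     // key-value pairs
  .reduceByKey(_ + _)                         // shuffle and sum per word
counts.take(10).foreach(println)
```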
Clustering Algorithm Implementation:
Designed an initial implementation of k‑means clustering using Spark RDDs, emphasizing the initialization, assignment, and update steps (a sketch follows this list).
Analyzed the performance of the basic algorithm and introduced vector operations to support distance computation and centroid updates.
Worked with the Iris dataset by reading data via RDDs, inferring the number of clusters from distinct labels, and evaluating clustering efficiency.
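A condensed sketch of one assignment-and-update round, assuming an existing SparkContext `sc`; the sample points and the k = 2 initialization are illustrative, not the Iris configuration:

```scala
// Euclidean distance supporting the assignment step.
def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

val points = sc.parallelize(Seq(
  Array(5.1, 3.5), Array(4.9, 3.0), Array(6.7, 3.1), Array(6.3, 2.5)))
// Initialization: sample k starting centroids from the data.
var centroids = points.takeSample(withReplacement = false, 2)

// Assignment: pair each point with the index of its nearest centroid.
val assigned = points.map { p =>
  val idx = centroids.indices.minBy(i => distance(p, centroids(i)))
  (idx, (p, 1))
}
// Update: average the points of each cluster to obtain new centroids.
centroids = assigned
  .reduceByKey { case ((s1, c1), (s2, c2)) =>
    (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
  }
  .mapValues { case (sum, count) => sum.map(_ / count) }
  .collect().sortBy(_._1).map(_._2)
```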
Flat Data and Nested Data Processing:
Utilized Spark DataFrame operations to perform filtering, grouping, and aggregation on flat CSV data (e.g., Books, Users, and Ratings).
Enhanced skills in Spark SQL by registering DataFrames as temporary views and executing complex queries.
Managed and flattened nested JSON structures (using functions like explode, from_unixtime, and month) to extract and pivot information from social network data.
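A hedged sketch of both sides, assuming an existing SparkSession `spark`; the file paths and column names (ISBN, tags, created_utc) are assumptions for illustration, not the actual schemas:

```scala
import org.apache.spark.sql.functions.{col, explode, from_unixtime, month}

// Flat CSV side: aggregate with the DataFrame API, then the same via Spark SQL.
val ratings = spark.read.option("header", "true").csv("data/Ratings.csv")
ratings.groupBy("ISBN").count().show(5)
ratings.createOrReplaceTempView("ratings")
spark.sql("SELECT ISBN, COUNT(*) AS n FROM ratings GROUP BY ISBN ORDER BY n DESC").show(5)

// Nested JSON side: flatten an array column, derive a month from a Unix timestamp.
spark.read.json("data/posts.json")
  .withColumn("tag", explode(col("tags")))
  .withColumn("month", month(from_unixtime(col("created_utc"))))
  .groupBy("month").count().show()
```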
Google Cloud Integration:
Designed and administered Google Cloud clusters to run Spark applications, focusing on resource configuration and deployment in a cloud environment (an illustrative configuration fragment follows this list).
Demonstrated practical experience in integrating cloud-based processing with scalable analytics via Databricks and Spark’s distributed capabilities.
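As one illustrative fragment, resource settings can be declared on the SparkSession builder when submitting a job to such a cluster; the values below are placeholders, not the configuration actually used:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CloudAnalyticsJob")                       // illustrative name
  .config("spark.executor.memory", "4g")              // sized to the cluster's nodes
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")  // let the cluster scale executors
  .getOrCreate()
```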
Developed and implemented a MapReduce-based algorithm in PySpark to determine connected components in extremely large graphs, following the CCF (Connected Component Finder) approach of Kardes et al. (referenced below). The project involved iterative processing using RDDs and efficient job design on Databricks.
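The coursework implementation was in PySpark; for consistency with the other sketches in this section, here is a minimal Scala rendering of the CCF-Iterate step from Kardes et al., with an illustrative function name and the paper's new-pair counter mapped onto a Spark accumulator:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// One CCF-Iterate pass: from (node, node) pairs, attach each node to the
// smallest id seen in its neighbourhood and count the newly created pairs.
// The driver repeats this (with a fresh counter) until no new pairs appear.
def ccfIterate(pairs: RDD[(Long, Long)], newPairs: LongAccumulator): RDD[(Long, Long)] = {
  pairs
    .flatMap { case (u, v) => Seq((u, v), (v, u)) }   // map phase
    .groupByKey()
    .flatMap { case (key, values) =>                  // reduce phase
      val min = values.min
      if (min < key) {
        // Emit (key -> min) plus (v -> min) for every other neighbour; the
        // latter are new information, so each one bumps the counter. (Counter
        // updates inside transformations are approximate under task retries.)
        Iterator.single((key, min)) ++
          values.iterator.filter(_ != min).map { v => newPairs.add(1); (v, min) }
      } else Iterator.empty
    }
    .distinct()                                       // CCF-Dedup
}
```

Once the counter stays at zero, the surviving (node, minId) pairs assign every node to its component.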
Reference: Hakan Kardes, Siddharth Agrawal, Xin Wang, and Ang Sun. "CCF: Fast and Scalable Connected Component Computation in MapReduce." Data Research, inome Inc., Bellevue, WA, USA.