As a recent graduate, I am pursuing opportunities in several cities. Please let me know if you have any questions about my educational or professional background, experience, or interests.
2021-2023
Arizona State University
Master's, Computer Science (Big Data Systems)
A summa cum laude graduate of ASU with interests in Cloud Computing, Distributed Database Systems, Database Management Implementation, Data Mining, Data Visualization, and Semantic Web Mining
–
2014-2018
Ahmedabad University
Information and Communications Technology
2022-Present
Amazon
Data Engineer Intern
• Created a cloud data ingestion and validation CLI tool that automated data comparison across two high-volume pipelines and sped up ticket resolution for Amazon Ads.
• Applied concepts from the Cloud Computing course to design efficient architectures using AWS S3, Lambda, EMR clusters, and CloudWatch.
• Developed mapping and transformation rules to reconcile datasets with different schemas using EMR clusters, Scala, and Spark (a brief sketch follows this list).
• Worked across a range of AWS tools and adhered to Amazon standards for package creation and Python code deployment.
• Streamlined operations for 3 teams, covering 75% of customer issue resolutions, by integrating multiple tools and interfaces into a unified CLI, cutting resolution latency from 3 days to 1 day (a 66% time savings).
• Added configurable pre-processing rules to filter, aggregate, and group individual datasets, improving performance by 66%.
• Collaborated closely with cross-functional teams to gather requirements, design data models, and ensure seamless integration and delivery of data solutions.
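For illustration, here is a minimal sketch of the schema-reconciliation idea described above, written in PySpark for brevity (the work itself used Scala on EMR). The bucket paths, column names, and mapping rules are hypothetical placeholders, not actual Amazon data:

```python
# Minimal sketch: map two pipeline outputs with different schemas onto a
# common canonical schema so their rows can be compared. All names here
# (buckets, columns, mappings) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-mapping-sketch").getOrCreate()

# Hypothetical mapping rules: source column -> canonical column.
MAPPING_A = {"campaign_id": "campaign_id", "impressions": "impressions"}
MAPPING_B = {"campaignId": "campaign_id", "impr_count": "impressions"}

def to_canonical(df, mapping):
    """Rename and select source columns into the shared canonical schema."""
    return df.select([F.col(src).alias(dst) for src, dst in mapping.items()])

df_a = to_canonical(spark.read.parquet("s3://bucket-a/ads/"), MAPPING_A)
df_b = to_canonical(spark.read.parquet("s3://bucket-b/ads/"), MAPPING_B)

# Rows present in one pipeline but not the other indicate a discrepancy.
only_in_a = df_a.exceptAll(df_b)
only_in_b = df_b.exceptAll(df_a)
print(only_in_a.count(), only_in_b.count())
```

Routing both sources through an explicit source-to-canonical mapping keeps the comparison logic independent of either pipeline's naming conventions.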
–
2018-2021
IQM
Data Engineer
• Implemented and deployed four scalable, enterprise-grade Extract, Transform, Load (ETL) pipelines on AWS and Snowflake that efficiently handle large volumes of complex data, with CI/CD through Jenkins and Git.
• Developed automated watchers and triggers using AWS S3, Lambda, EC2, CloudWatch, and shell scripts to streamline data ingestion, ensuring a reliable flow of data from multiple sources into the ETL pipelines (see the Lambda sketch after this list).
• Automated the deployment and management of data pipelines with Terraform, using its infrastructure-as-code capabilities to define and provision the pipelines' resources and configurations.
• Designed and implemented a fault-tolerant architecture in AWS using Python, Lambda, S3, EC2, and CloudWatch that raised the data ingestion success rate to 98%, with resilience and recovery mechanisms to handle failures and keep data processing continuous.
• Architected and optimized a dynamic system using Python, AWS Lambda, EC2 instances, and S3 that increased resource utilization to 70-80% and cut per-second resource costs by 75%.
• Achieved a 100x improvement in core query performance in an AWS RDS PostgreSQL database by fine-tuning and optimizing queries, reducing response latency from minutes to sub-second and enabling real-time summarization and on-demand report generation for clients.
• Developed self-check mechanisms, self-healing auto-recovery, and real-time monitoring dashboards with AWS CloudWatch, Grafana, and AWS SNS to ensure the reliability, availability, and performance of the ETL processes.
• Queried terabytes of S3 data with Athena and performed exploratory and descriptive analysis of data features to guide dimensionality reduction, feature extraction, and missing-value imputation, improving the efficiency of data science models by 20%.
• Identified and resolved pipeline errors using AWS CloudWatch, QuickSight, and Athena for effective debugging.
• Developed and fine-tuned a Spark pipeline in Scala that efficiently used 80% of available resources on EC2 and EMR, converting SQL queries into Spark transformations with Spark RDDs, Scala, and Python.
• Developed Spark applications in PySpark and Spark SQL for feature extraction, processing, and aggregation across multiple data sources, joining the results into a consolidated big data table for data science solutions.
• Developed Spark jobs on Databricks for data cleansing, validation, standardization, and transformation, saving outputs in different file formats (ORC/Parquet/CSV/JSON) as each use case required.
• Containerized ETL tasks in Docker, created Amazon Machine Images (AMIs) containing the containerized tasks, and deployed them to Amazon EC2 instances.
• Integrated Airflow with other data engineering and analytics tools, including AWS services (S3, Redshift, EMR) and databases (MySQL, PostgreSQL), to orchestrate end-to-end data workflows.
• Led the migration of terabytes of data to Snowflake, a cloud-based data warehousing platform.
• Contributed to an optimized pipeline that streamlined data summarization for specific business requirements.
• Established data retention policies for pipeline data and implemented archival procedures using AWS Glacier, reducing monthly S3 costs by 60%.
• Improved operational efficiency by implementing proactive alerts and notifications through monitoring systems, enabling prompt action and issue resolution.
• Benchmarked AWS services including EMR, ECS, Fargate, Glue, and EC2 to identify the most resource-efficient and cost-effective options; based on the results, migrated a data pipeline from EMR to EC2 for a specific business use case, improving resource utilization and cost efficiency.
• Developed a robust schema-matching algorithm in Python with pandas to reconcile schema discrepancies across multiple data sources, specifically for high-dimensional data with over 100 fields (a toy example follows this list).
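To make the watcher-and-trigger pattern above concrete (see the data ingestion bullet), here is a minimal, hypothetical sketch of an S3-triggered Lambda handler that submits an ingestion step to EMR; the cluster ID, bucket, and script paths are placeholders, not the production setup:

```python
# Minimal sketch of an S3-triggered Lambda "watcher": when a new object
# lands, submit a downstream ingestion step. All names are hypothetical.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # Standard S3 event notification payload delivered to Lambda.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Submit a hypothetical Spark ingestion step for the new file.
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
            Steps=[{
                "Name": f"ingest {key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://pipeline-code/ingest.py",
                             f"s3://{bucket}/{key}"],
                },
            }],
        )
```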
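And a toy sketch of the schema-matching idea from the bullet above, matching columns across two sources by normalized name similarity; the normalization rules and similarity cutoff are illustrative assumptions, not the production algorithm:

```python
# Toy sketch: match columns from two high-dimensional sources by normalized
# name similarity. Threshold and normalization rules are illustrative only.
import difflib
import pandas as pd

def normalize(name: str) -> str:
    """Lowercase and strip separators so 'campaignId' ~ 'campaign_id'."""
    return name.lower().replace("_", "").replace("-", "")

def match_schemas(left: pd.DataFrame, right: pd.DataFrame, cutoff=0.85):
    """Return {left_column: right_column} for the best fuzzy name matches."""
    right_by_norm = {normalize(c): c for c in right.columns}
    matches = {}
    for col in left.columns:
        hits = difflib.get_close_matches(
            normalize(col), right_by_norm.keys(), n=1, cutoff=cutoff)
        if hits:
            matches[col] = right_by_norm[hits[0]]
    return matches

# Example: two sources naming the same fields differently.
a = pd.DataFrame(columns=["campaign_id", "impr_count", "spend_usd"])
b = pd.DataFrame(columns=["campaignId", "imprCount", "spendUSD"])
print(match_schemas(a, b))  # {'campaign_id': 'campaignId', ...}
```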
–
2017-2018
Elegant Microweb
Fullstack Engineering Intern
• Developed a web application to automate the client's Enterprise Resource Planning (ERP) system, covering employee data, salary management, accrual schedules, and shift codes.
• Engineered a client database in MySQL, ensuring efficient storage and retrieval of data for the ERP system.
• Designed and developed APIs using Java MVC (Model-View-Controller) architecture, enabling seamless integration between different components of the ERP system.
• Used the Angular framework to create user-friendly interfaces and interactive front-end components matching the client's software requirements and specifications, and deployed the application to an Apache Tomcat server.
• Integrated various modules and functionalities within the ERP system, including employee management, salary calculation, accrual scheduling, and shift code management.
• Conducted thorough testing and debugging of the web application using Postman to ensure its reliability and adherence to the client's functional requirements and business rules.
• Collaborated closely with the client's stakeholders to gather software requirements, understand their business processes, and translate them into technical specifications for the development of the ERP system.
Please keep my resume on file in case I may be a good fit for your organization, either now or in the future. Feel free to contact me with any questions.