Modern applications that rely on deep learning demand enormous computing power. Deep learning and neural network models excel with large training datasets, but larger datasets increase training time because of the number of learnable parameters. To address this, we parallelized the training of a two-layer neural network for handwritten digit recognition using data- and model-parallelization techniques, implemented in C with the OpenMP and Pthreads APIs. This yielded speedups of approximately 4.84 for data parallelization and 4 for model parallelization. However, we encountered the challenge of reduced model accuracy as the number of threads increased during data parallelization.
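The core of the data-parallel approach is to split the training samples across threads and combine each thread's partial gradients. A minimal sketch in C with OpenMP, assuming a placeholder per-sample gradient (`x[i] * w[p]`) rather than real backpropagation through the two-layer network; the function name and shapes are illustrative, not taken from the project code:

```c
#include <stddef.h>

/* Data-parallel gradient accumulation: the sample loop is split across
 * threads, and the reduction clause merges each thread's private partial
 * sums into grad. Compiles and runs serially if OpenMP is unavailable
 * (the pragma is then ignored). */
static void accumulate_gradients(const double *x, const double *w,
                                 double *grad,
                                 int n_samples, int n_params)
{
    for (int p = 0; p < n_params; p++)
        grad[p] = 0.0;

    /* Array-section reduction (OpenMP 4.5+): each thread gets a private
     * copy of grad[0..n_params-1], summed together at the end. */
    #pragma omp parallel for reduction(+ : grad[:n_params])
    for (int i = 0; i < n_samples; i++)
        for (int p = 0; p < n_params; p++)
            grad[p] += x[i] * w[p];   /* placeholder per-sample gradient */
}
```

Because floating-point addition is not associative, the order in which partial sums are combined varies with the thread count, which is one plausible contributor to the accuracy differences we observed at higher thread counts.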
PNN_Project_Report.pdf
The first image illustrates the structure of the data-parallel algorithm. The second image shows the algorithm's accuracy across different thread counts, and the final image displays the speedup achieved by parallelizing the serial code at each thread count.
My role in the project was to parallelize the serial code and collaborate with my teammate to organize and report the results. I also contributed equally to writing the project documentation.
FRAUD DETECTION USING CLASSICAL MACHINE LEARNING ALGORITHMS
In this project on fraud detection using classical machine learning models, the Random Forest model outperformed the Bayes classifier on both the original dataset and the SMOTE-oversampled dataset. SMOTE oversampling improved overall model performance by addressing class imbalance and reducing overfitting. Furthermore, the feature-importance analysis provided valuable insight into the predictive power of individual words. The Random Forest model achieved an accuracy of 99.97% on the training data and 97.02% on the test data.
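SMOTE generates each synthetic minority-class sample by interpolating between a real sample and one of its k nearest neighbors: x_new = x + lambda * (x_neighbor - x), with lambda drawn from [0, 1]. A minimal sketch of that interpolation step, assuming dense feature vectors; in practice the project used an off-the-shelf SMOTE implementation rather than hand-written code:

```c
#include <stddef.h>

/* One SMOTE interpolation step: place a synthetic sample on the line
 * segment between a minority-class sample x and one of its nearest
 * neighbors. lambda = 0 reproduces x, lambda = 1 reproduces the
 * neighbor; random lambda values spread synthetic samples between
 * the two points. */
static void smote_interpolate(const double *x, const double *neighbor,
                              double lambda, double *out, int n_features)
{
    for (int j = 0; j < n_features; j++)
        out[j] = x[j] + lambda * (neighbor[j] - x[j]);
}
```

Repeating this step with randomly chosen neighbors and lambda values until the minority (spam) class reaches the desired size is what balances the class distribution before training.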
Fraud-Detection.pdf
The first two images depict the dataset distribution before SMOTE oversampling, highlighting the need to increase the number of samples in the spam class. The last two images show the AUC (area under the curve) values of the two models and the feature-importance scores obtained from the Random Forest model, which proved to be the better model for this classification task.
This was my own project, completed under the guidance of my teacher, who also reviewed the document. I was responsible for preprocessing the data, applying the models, and summarizing the experimental results.