Research Work

Mining Imbalanced Big Data with Julia (2019)

Julia Conference 2019

In the era of big data, classifying imbalanced real-world data is a challenging problem in supervised learning. The standard data sampling methods, under-sampling and over-sampling, have several limitations when applied to big data. Under-sampling removes instances from the majority class, while over-sampling generates artificial minority class instances; both aim to balance the data. However, under-sampling may discard informative instances, and over-sampling can cause overfitting. In this research work, we present a new cluster-based under-sampling approach combined with ensemble learning (e.g. a RandomForest classifier) for classifying imbalanced data, implemented in Julia. We collected real telecom fraud data on illegal money transactions, which is highly imbalanced: only 8,213 of the 6,362,620 instances (about 0.13%) belong to the minority class. The proposed method first splits the data into majority class and minority class instances. It then groups the majority class instances into several clusters and draws a set of instances from each cluster to create several balanced sub-datasets. Finally, a classifier is trained on each balanced dataset, and majority voting over these classifiers is used to classify new, unseen instances. On a separate test dataset, the proposed method achieved 97% accuracy.
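The pipeline described above (split by class, cluster the majority class, build balanced subsets, train one classifier per subset, vote) can be illustrated with a minimal sketch. This is not the thesis implementation: it is written in Python rather than Julia, uses only the standard library, and runs on synthetic 2-D points. Several choices are assumptions made for illustration only: k-means as the clustering step (the abstract does not name the clustering algorithm), a nearest-centroid classifier standing in for RandomForest, reuse of all minority instances in every subset, and all function names (`kmeans`, `balanced_subsets`, `fit_nearest_centroid`, `vote`).

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering step); returns non-empty clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[idx].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def balanced_subsets(majority, minority, k, n_subsets, seed=0):
    """Draw an equal share from each majority cluster and pair it with
    all minority instances, giving several roughly balanced datasets."""
    rng = random.Random(seed)
    clusters = kmeans(majority, k, seed=seed)
    per_cluster = max(1, len(minority) // len(clusters))
    subsets = []
    for _ in range(n_subsets):
        sample = []
        for c in clusters:
            sample += rng.sample(c, min(per_cluster, len(c)))
        subsets.append([(x, 0) for x in sample] + [(x, 1) for x in minority])
    return subsets

def fit_nearest_centroid(rows):
    """Stand-in base learner (NOT RandomForest): one centroid per label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict_one(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: dist2(x, model[y]))

def vote(models, x):
    """Majority vote over the per-subset classifiers."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic, illustrative data: 200 majority points in two blobs,
# 10 minority points in a third blob. Label 0 = majority, 1 = minority.
data_rng = random.Random(42)

def blob(cx, cy, n):
    return [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
            for _ in range(n)]

majority = blob(0, 0, 100) + blob(1, 4, 100)
minority = blob(8, 8, 10)

# An odd number of subsets avoids tied votes in the binary case.
subsets = balanced_subsets(majority, minority, k=2, n_subsets=5)
models = [fit_nearest_centroid(s) for s in subsets]

print(vote(models, (8.0, 8.0)), vote(models, (0.0, 0.0)))
```

Sampling from every cluster, rather than from the majority class as a whole, is what keeps each balanced subset representative of the different regions of the majority class; plain random under-sampling could miss small clusters entirely.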

Link: https://github.com/atikul-islam-sajib/Undergraduate-Thesis-/blob/main/Fraud_Classification_Imbalanced_Big_Data.pdf

CRISPRforecast: An effective method to predict high on-target sgRNA activity in CRISPR/Cas9 system (2020)

Structure:

The original dataset is stored in the "data" folder.
The model and feature generation code is stored in the "codes" folder.
All the necessary diagrams are stored in the "graphs and diagrams" folder.

Employing CRISPRforecast Models:

Install the necessary packages from "requirements.txt".
To generate the features, use the dataset provided in the "data" folder:
- Put Feature_Generation.ipynb and Original_Dataset.csv in the same folder.
- Execute Feature_Generation.ipynb; it outputs a new file named "Dataset.csv".
- Dataset.csv holds all the generated features used by the CRISPRforecast models.
Put the following files in the same folder:
- CRISPR_Feature_Extraction_KPCA_Linear.ipynb
- CRISPR_Feature_Extraction_KPCA_Poly.ipynb
- CRISPR_Feature_Selection_RFC.ipynb
- CRISPR_No_Feature_Reduction.ipynb
- Dataset.csv
"Creating_Independent_Random_Test_Set.ipynb" is used only to create the independent test set.
As the final step, execute every .ipynb file except "Creating_Independent_Random_Test_Set.ipynb" to get the results of each CRISPRforecast model.
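The steps above can also be run without opening each notebook by hand. The sketch below is not part of the repository. It assumes that Jupyter with nbconvert is installed, that Feature_Generation.ipynb, Original_Dataset.csv and the model notebooks have all been placed in one working folder, and that the notebooks read and write files relative to that folder. The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess

# Notebook names as listed in the steps above.
FEATURE_NOTEBOOK = "Feature_Generation.ipynb"
MODEL_NOTEBOOKS = [
    "CRISPR_Feature_Extraction_KPCA_Linear.ipynb",
    "CRISPR_Feature_Extraction_KPCA_Poly.ipynb",
    "CRISPR_Feature_Selection_RFC.ipynb",
    "CRISPR_No_Feature_Reduction.ipynb",
]
# Only used to create the independent test set, so it is never executed here.
SKIPPED_NOTEBOOK = "Creating_Independent_Random_Test_Set.ipynb"

def build_commands():
    """Return one nbconvert command per notebook, in execution order:
    feature generation first (it writes Dataset.csv), then each model."""
    notebooks = [FEATURE_NOTEBOOK] + MODEL_NOTEBOOKS
    return [
        ["jupyter", "nbconvert", "--to", "notebook",
         "--execute", "--inplace", name]
        for name in notebooks
        if name != SKIPPED_NOTEBOOK
    ]

def run_all(folder="."):
    """Execute the notebooks one after another inside `folder`.
    Stops at the first failure because check=True raises on a non-zero exit."""
    for cmd in build_commands():
        subprocess.run(cmd, check=True, cwd=folder)

commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all("path/to/working/folder")` would then execute the notebooks in place; the listing itself only prints the commands, so nothing runs until that call is made.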

Link: https://github.com/zahid6454/CRISPRforecast