This lab will use Google Colaboratory, a tool that runs Python code in the browser and integrates with Google Drive.
After downloading the dataset and uploading it to your Google Drive, mount the Drive in the Colab environment.
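Mounting is a one-time environment setup step inside the notebook; the mount point below is Colab's default:

```python
# Colab-only setup: prompts for Drive authorization, then exposes
# your Drive's contents under /content/drive
from google.colab import drive

drive.mount("/content/drive")
```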
Read the completeSpamAssassin.csv file into a DataFrame using the pandas library. Remove the index column and replace the numeric classification values with their string names. In this case, 'spam' refers to spam emails and 'ham' refers to non-spam emails.
Remove empty and null emails. In this case, the email body at index 328 is null, so it is removed. Then convert all letters to lowercase, remove URLs, and strip punctuation. Split the email bodies into lists of words.
Remove stopwords, which are commonly used words such as "the", "and", and "are". Also remove numbers and single letters.
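The cleaning steps above can be sketched as one function; the stopword set here is a small illustrative subset (in practice you might use NLTK's full list):

```python
import re

# Illustrative subset only; a real stopword list is much longer
STOPWORDS = {"the", "and", "are", "a", "an", "is", "of", "to", "in"}

def clean_email(body):
    body = body.lower()                            # lowercase everything
    body = re.sub(r"http\S+|www\.\S+", " ", body)  # strip URLs
    body = re.sub(r"[^a-z\s]", " ", body)          # strip punctuation and digits
    words = body.split()                           # tokenize into a word list
    # Drop stopwords and single letters
    return [w for w in words if w not in STOPWORDS and len(w) > 1]
```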
Remove the least common words. In this case, the least common words are those that appear ten or fewer times in the entire dataset.
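One way to drop the rare words, assuming the email bodies have already been tokenized into lists:

```python
from collections import Counter

def remove_rare_words(tokenized_emails, min_count=11):
    # Count every word across the whole dataset, then keep only words that
    # appear min_count times or more (11 keeps words seen more than ten times)
    counts = Counter(w for email in tokenized_emails for w in email)
    return [[w for w in email if counts[w] >= min_count]
            for email in tokenized_emails]
```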
Randomize the rows of the dataset. Using value_counts(), you will find that about 28% of the emails are spam and 72% are ham.
Split the data into training and testing sets. Then, create a vocabulary of the unique words found in the training set.
Combine the vocabulary with the training-set DataFrame. With this method, the column under each unique word holds a value of 1 in every row whose email body contains that word.
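A sketch of building the per-word columns; occurrence counts are used here (a word appearing once yields exactly the 1 described above), since the word counts also feed the formulas that follow:

```python
import pandas as pd

def build_word_matrix(train_df, vocabulary):
    # One new column per vocabulary word, holding how many times that
    # word occurs in each tokenized email body
    counts = {w: [email.count(w) for email in train_df["Body"]]
              for w in vocabulary}
    return pd.concat([train_df.reset_index(drop=True), pd.DataFrame(counts)],
                     axis=1)
```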
In the application of the classical Bayesian model, we are implementing the following formulas:
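These are the standard Laplace-smoothed Naive Bayes equations (a reconstruction consistent with the term definitions below):

```latex
P(\mathrm{Spam} \mid w_1, \ldots, w_n) \propto
    P(\mathrm{Spam}) \prod_{i=1}^{n} P(w_i \mid \mathrm{Spam})

P(\mathrm{Ham} \mid w_1, \ldots, w_n) \propto
    P(\mathrm{Ham}) \prod_{i=1}^{n} P(w_i \mid \mathrm{Ham})

P(w_i \mid \mathrm{Spam}) =
    \frac{N_{w_i \mid \mathrm{Spam}} + \alpha}
         {N_{\mathrm{Spam}} + \alpha \cdot N_{\mathrm{Vocabulary}}}
\qquad
P(w_i \mid \mathrm{Ham}) =
    \frac{N_{w_i \mid \mathrm{Ham}} + \alpha}
         {N_{\mathrm{Ham}} + \alpha \cdot N_{\mathrm{Vocabulary}}}
```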
Clarification of some terms:
Nwi|Spam: number of times a word wi appears in spam emails
Nwi|Ham: number of times a word wi appears in ham emails
NSpam: total number of words across all spam emails
NHam: total number of words across all ham emails
NVocabulary: number of words in the vocabulary
α (alpha): smoothing parameter that keeps calculated probabilities from ever being 0
Apply the formulas and calculate probabilities in respective ham and spam categories for each word.
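Applying the smoothed-probability formula to every vocabulary word might look like this, assuming tokenized spam and ham emails as input:

```python
from collections import Counter

def word_probabilities(spam_emails, ham_emails, vocabulary, alpha=1):
    # Total word counts per class and vocabulary size
    n_spam = sum(len(e) for e in spam_emails)
    n_ham = sum(len(e) for e in ham_emails)
    n_vocab = len(vocabulary)
    # Per-word occurrence counts in each class
    spam_counts = Counter(w for e in spam_emails for w in e)
    ham_counts = Counter(w for e in ham_emails for w in e)
    # Laplace-smoothed conditional probabilities
    p_word_spam = {w: (spam_counts[w] + alpha) / (n_spam + alpha * n_vocab)
                   for w in vocabulary}
    p_word_ham = {w: (ham_counts[w] + alpha) / (n_ham + alpha * n_vocab)
                  for w in vocabulary}
    return p_word_spam, p_word_ham
```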
Create functions to calculate the probabilities of spam and ham given an email.
Time for testing. Then, calculate accuracy.
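A sketch of the classification and accuracy steps; log-probabilities are used here to avoid floating-point underflow on long emails:

```python
import math

def classify(email_words, p_spam, p_ham, p_word_spam, p_word_ham):
    # Start from the class priors, then multiply in (add, in log space)
    # the per-word conditional probabilities for words in the vocabulary
    log_spam, log_ham = math.log(p_spam), math.log(p_ham)
    for w in email_words:
        if w in p_word_spam:
            log_spam += math.log(p_word_spam[w])
            log_ham += math.log(p_word_ham[w])
    return "spam" if log_spam > log_ham else "ham"

def accuracy(predictions, labels):
    # Fraction of test emails whose predicted label matches the true one
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```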
The quantum implementation of this model will use the following three-qubit circuit, with one qubit per node of the Bayesian network:
Clarification of some terms:
θA: collective probability of a given email's spam words
θB: probability that an email is spam
|00> : percentage of a given email's ham words in all ham emails
|01> : percentage of a given email's ham words in all spam emails
|10> : percentage of a given email's spam words in all ham emails
|11> : percentage of a given email's spam words in all spam emails
Each of these percentages will be converted into a rotation angle for the RY and CCRY gates, which rotate a qubit's state on the Bloch sphere; this is why they are given the theta (θ) symbol.
Build functions to calculate |00>, |01>, |10>, and |11> for each email.
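One possible helper for these four values; the exact split of an email into "spam words" versus "ham words", and the use of total corpus word occurrences as the denominator, are assumptions about the intended calculation:

```python
from collections import Counter

def group_share(words, corpus_counts, corpus_total):
    # Hypothetical helper: the share of all word occurrences in a corpus
    # (spam or ham) taken up by the given group of words. Calling it with
    # an email's ham words against the ham corpus yields the |00> value,
    # and so on for the other three basis states.
    return sum(corpus_counts[w] for w in set(words)) / corpus_total
```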
You will be using the PennyLane framework to build the quantum circuit. "Wires" refer to qubits and "shots" represent the number of times the circuit will be run for each email classification.
Create a function to convert each probability into an angle for the rotation (RY and CCRY) gates.
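Since RY(θ)|0⟩ = cos(θ/2)|0⟩ + sin(θ/2)|1⟩, a probability p of measuring |1⟩ corresponds to the angle θ = 2·arcsin(√p):

```python
import math

def prob_to_angle(p):
    # Invert p = sin^2(theta / 2) to recover the RY rotation angle
    return 2 * math.asin(math.sqrt(p))
```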
Build the circuit, including its rotation gates. Since PennyLane does not have a built-in controlled-controlled Y-rotation (CCRY) gate, it must be constructed manually.
Build the prediction function for testing and then run on the test dataset. The prediction will be calculated based on the results of the third qubit (the child node of the network).
If the shots where the third qubit measures 1 outnumber those where it measures 0, the prediction will be spam. Conversely, if the 0 outcomes dominate, the prediction will be ham.
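The decision rule can be sketched as follows, assuming the shot results arrive as a mapping from measured bitstrings to counts, with the third qubit as the last bit:

```python
def predict_from_counts(counts):
    # counts: mapping from measured bitstrings (e.g. "101") to shot totals;
    # the last bit is the third qubit, the child node of the network
    spam_score = sum(n for bits, n in counts.items() if bits[-1] == "1")
    ham_score = sum(n for bits, n in counts.items() if bits[-1] == "0")
    return "spam" if spam_score > ham_score else "ham"
```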
Calculate accuracy.
Borujeni, S. E., Nannapaneni, S., Nguyen, N. H., Behrman, E. C., & Steck, J. E. (2021, April 12). Quantum circuit representation of Bayesian networks. arXiv. Retrieved April 2023, from https://arxiv.org/abs/2004.14803
Spam filter in Python: Naive Bayes from scratch. KDnuggets. (n.d.). Retrieved April 2023, from https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html