Methods and Baseline Results

Before applying any machine learning algorithm, the data need to be preprocessed, especially at this scale. In our case, the features fed to our models are the comments and reviews, which are unstructured text. As in most text classification work, we began with feature selection and data cleaning, evaluating each step on our chosen machine learning algorithms to decide whether it was beneficial. For example, we tested whether to use the TF-IDF vectorizer or the count vectorizer, and whether digits and special characters should be removed or kept, since they may encode an emoji or a feeling that later helps the sentiment analysis and boosts accuracy.

For both the YouTube and Amazon datasets, the first preprocessing step was to find empty comments. When the comments of a video are queried, the results cover all the comments posted on that video, but some of them had since been deleted by their authors and therefore had no content. We considered using generative models to impute these missing values, but ultimately decided to remove any data point with an empty comment field. We faced a similar problem when retrieving comments from the video URLs we collected, which were obtained from the search results page: some of those results were playlist pages, which carry no comments of their own, so we eliminated such URLs to keep the data consistent. The next step was to tokenize the text; following common best practices, we found TF-IDF to be the best fit for our dataset.
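As a rough illustration of this cleaning step, the filtering could look like the sketch below; the column names ("comment", "page_type") and the CSV layout are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical layout: one row per comment, with the raw text in a "comment"
# column and the source page type in a "page_type" column (assumed names).
df = pd.read_csv("youtube_comments.csv")

# Drop rows whose comment was deleted by its author and came back empty.
df["comment"] = df["comment"].fillna("").str.strip()
df = df[df["comment"] != ""]

# Drop search results that point to playlist pages, which carry no comments.
df = df[df["page_type"] != "playlist"]

df = df.reset_index(drop=True)
```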

Sklearn provides many useful parameters for tuning the algorithms and components it offers. Starting with the TF-IDF vectorizer, we considered whether our comments should be analyzed using unigrams, bigrams, or trigrams. The metrics we use to measure performance showed that a combination of unigrams and bigrams works best, so we adopted this setting for all of the algorithms we explain later in this section.

Another way to select features is to use SVD, specifically the Truncated SVD, which decomposes a matrix such as the sparse matrix output by the count vectorizer; an eigenvalue decomposition is used instead when applying PCA. In either case, we can map and transform our matrix into a new domain, called the semantic or latent domain, which reduces the dimensionality (the number of components) while focusing on the most important concepts present in the comments. Using this approach, as shown in the latent semantic analysis Jupyter notebook, we extract the three components of each topic we decomposed, and as expected the three components corresponded to the anti, pro, and neutral videos.
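A minimal sketch of these two feature pipelines follows, assuming the cleaned comments from the previous step; the variable names and the choice of ten top terms per component are illustrative, not taken from the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Cleaned comments from the preprocessing step above (assumed variable).
comments = df["comment"].tolist()

# TF-IDF over unigrams and bigrams, the setting that performed best for us.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(comments)

# Latent semantic analysis: project the sparse TF-IDF matrix onto 3 components.
svd = TruncatedSVD(n_components=3, random_state=42)
X_lsa = svd.fit_transform(X)

# Inspect the highest-weighted terms of each latent component.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[-10:][::-1]
    print(f"Component {i}:", [terms[j] for j in top])
```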

Figure: Example of metrics and algorithms used on Topic 1 to select the best settings.

Figure: Example of metrics and algorithms used on Topic 2 to select the best settings.

After normalizing the text into a sparse matrix, the next step was to feed it as input to the machine learning algorithms. One issue we encountered is that the output of Truncated SVD or the hashing vectorizer contained infinite or negative values, which broke the training of Naïve Bayes and Decision Trees in our setup, since Multinomial Naïve Bayes in particular requires a feature matrix without negative values. The approach we settled on was therefore the TF-IDF vectorizer with a combination of unigrams and bigrams, which gave the best performance we could reach, as shown in the next section. Choosing the machine learning algorithm and tweaking its parameters to find the best combination of settings is the core of the machine learning craft. Common practice uses Multinomial Naïve Bayes for text classification, i.e., natural language processing. We went further and experimented with three algorithms, trying different parameter settings for each while also reprocessing the data to improve accuracy. The two figures above give just two examples of the accuracy and F1-score results we obtained; the complete evaluation, with accuracy, precision, recall, and F1-score, is presented in the table below. All the algorithms we used are traditional machine learning algorithms, without going deeper into deep learning methods, and the highest accuracy we obtained was 82.74\%, which is fairly high given that these comments and reviews are written by diverse users with different backgrounds and opinions, and that some of them may even be malicious comments generated by advanced bots.
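The following sketch shows how one of these classifiers can be trained and scored on accuracy, precision, recall, and F1; the label array `y`, the 80/20 split, and the random seed are placeholders rather than the project's exact configuration.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# y holds the class label of each comment (assumed to exist alongside `comments`).
X_train, X_test, y_train, y_test = train_test_split(
    comments, y, test_size=0.2, random_state=42, stratify=y
)

# TF-IDF (unigrams + bigrams) feeding Multinomial Naive Bayes; TF-IDF output is
# non-negative, which is what MNB requires, unlike the TruncatedSVD projection.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
model.fit(X_train, y_train)

# Per-class precision, recall, F1, plus overall accuracy.
print(classification_report(y_test, model.predict(X_test)))
```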

Table: Testing accuracy, precision, recall, and F1-score of each classifier on each dataset (MNB, SGD, and LR on YouTube; MNB and LSVC on Amazon).

The algorithms used on the YouTube dataset are Multinomial Naïve Bayes, Stochastic Gradient Descent, and Logistic Regression. For the Amazon dataset, we used MNB and LSVC. The reason we did not use Support Vector Machines on the YouTube dataset is simple: the number of YouTube comments is enormous, training would take a very long time, and we would not be able to search for the best parameters using cross-validation. We therefore preferred the Stochastic Gradient Descent classifier, which the Sklearn cheat sheet suggests for training sets larger than 100k samples. Stochastic gradient descent is heavily used in many ML projects and underlies the training of neural networks, with its idea of mini-batch gradient updates that avoid loading all the data into memory and running out of it. This algorithm proved effective on our dataset, but it was not the one that achieved the highest accuracy.
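A minimal sketch of this choice is shown below: a linear model trained with stochastic gradient descent (hinge loss, i.e., a linear SVM objective) plugged into the same TF-IDF pipeline, with a small cross-validated search over the regularization strength. The grid values, fold count, and scoring metric are assumptions for illustration, reusing `X_train` and `y_train` from the earlier sketch.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Linear classifier trained with SGD; scales to the large YouTube comment set.
sgd_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42)),
])

# Small, illustrative grid over the regularization strength alpha.
grid = GridSearchCV(
    sgd_pipeline,
    {"clf__alpha": [1e-5, 1e-4, 1e-3]},
    cv=3,
    scoring="f1_macro",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```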