The tool is a side project that I worked on in my free time and on weekends. The idea stems from an observation I made at L'Oréal: social media evaluation is one of our most tedious tasks. Manually copying and classifying every comment grows extremely frustrating when there are hundreds or even thousands of them, and when a campaign hits, employees have to spend most of their working day on it.
The project aligns well with my philosophy of ML design: employing machine learning systems to cut unnecessary working hours and harvest the potential of unstructured data. Hopefully, it is also the first foundation of a larger market monitoring system, a personal project goal of mine.
The tool is not perfect. Due to the nature of the problem (few-shot learning), the classifier is exposed to a limited amount of data. However, the tool changes the end user's task from classification to validation, which significantly reduces effort. A loop-back system is also implemented to improve the model over time as the tool is used, and the automated comment download is a tremendous help on its own.
The tool is built entirely in Python, from data crawling to classification, combining several libraries into a full pipeline.
There is no public Facebook comment dataset out there, and the labels are designed specifically for social media post evaluation. This boils the project down to a few-shot problem that depends on the amount of data I can crawl and manually classify.
The labels for this data are the following:
Brand Love (L)
Brand Hate (H)
Question - interest in the product (Q)
Discussion - reply or tag only (D)
Page Getter
Facebook is relatively private toward anonymous users, so a plain HTTP request is not usable. The Facebook API was also considered, but it has a fixed quota and costs extra once that limit is exceeded.
My solution is an HTML getter built with Selenium. It mimics a normal user and therefore has access to any information a normal user can see. Selenium also lets me add my personal account cookies to the browser, saving login time and skipping validation.
Rather than the normal Facebook URL, this project uses the m.facebook version, a lighter version of Facebook built for phone browsers. The special thing about this version is its simple structure and easy-to-understand HTML layout.
Since we need the full HTML page of the post, Selenium's interaction capabilities are utilized (see the sketch after this list):
click to extend every comment
click to reveal the replies on a post
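Below is a minimal sketch of the page getter. It assumes cookies were pickled during a previous manual login, and the "View more" link text is a placeholder; the real selectors come from inspecting m.facebook's layout.

```python
# Minimal page-getter sketch. Assumptions: cookies were saved to
# "fb_cookies.pkl" after a manual login, and "View more" stands in
# for the actual m.facebook expand-link text.
import pickle
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://m.facebook.com")  # open the domain first so cookies can be set

# Load saved session cookies to skip login and account validation
with open("fb_cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        driver.add_cookie(cookie)

post_url = "https://m.facebook.com/..."  # URL of the post to crawl
driver.get(post_url)

# Keep clicking expand links until every comment and reply is visible
while True:
    links = driver.find_elements(By.PARTIAL_LINK_TEXT, "View more")
    if not links:
        break
    links[0].click()
    time.sleep(1)  # give the new comments time to load

html = driver.page_source  # full HTML handed off to the comment getter
```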
Comment Getter
The comment getter is relatively simple since the page getter has done most of the hard work. Its only job is to extract comments and replies with BeautifulSoup, locating them by class name.
Luckily, the m.facebook structure is fixed, and the class names are the same across post pages.
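A minimal sketch of this step; the class name here is a placeholder for the fixed one found by inspecting the m.facebook pages.

```python
# Minimal comment-getter sketch. "comment-body" is a placeholder for the
# fixed class name observed in m.facebook's HTML.
from bs4 import BeautifulSoup

def get_comments(html):
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for node in soup.find_all("div", class_="comment-body"):
        text = node.get_text(strip=True)
        if text:
            comments.append(text)
    return comments
```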
I am personally most excited about this part. Many solutions were considered, from a plain probabilistic model to a sequential architecture. One prominent possibility was to train a bidirectional GRU on the publicly available Amazon review dataset, then use it to predict the mined comments. However, this approach has major drawbacks: the distribution of Facebook post comments differs significantly from that of Amazon reviews, and the Amazon labels are customer-given ratings, which are not optimal for social media evaluation.
Because of the lack of labeled data, the project is essentially a few-shot learning problem. The final solution, and what I believe to be the best approach, is transfer learning using BERT, an SVC, and a loop-back system.
Transformer
BERT (Bidirectional Encoder Representations from Transformers), developed by Google AI, is a state-of-the-art deep learning model that performs remarkably well on a wide variety of NLP tasks. In this case, BERT is used as a feature extractor that converts text into vectors.
This feature extractor is extremely powerful and able to capture the context of a given corpus, which in turn significantly reduces the number of samples needed. Transfer learning, the method of reusing a more advanced model to assist with a similar problem, corrects our lack of training examples perfectly.
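As a sketch, the extraction step could look like the following, assuming the HuggingFace transformers library and the base uncased BERT checkpoint; pooling the [CLS] token is one common choice, not necessarily the exact one used here.

```python
# Minimal BERT feature-extraction sketch, using HuggingFace transformers.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(comments):
    """Convert a batch of comment strings into 768-dim BERT vectors."""
    inputs = tokenizer(comments, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token embedding as a fixed-size sentence representation
    return outputs.last_hidden_state[:, 0, :]  # shape: (batch, 768)
```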
Moreover, an up-sampling technique is employed to compensate for the imbalanced labels. This may cause overfitting at the beginning, but the problem should fade as the dataset grows.
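A minimal sketch of the up-sampling step, using sklearn's resample; the DataFrame layout and column name are assumptions.

```python
# Minimal up-sampling sketch: duplicate minority-label rows until every
# label matches the majority count. "label" is an assumed column name.
import pandas as pd
from sklearn.utils import resample

def upsample(df, label_col="label"):
    counts = df[label_col].value_counts()
    majority = counts.max()
    parts = []
    for label, n in counts.items():
        subset = df[df[label_col] == label]
        if n < majority:
            subset = resample(subset, replace=True, n_samples=majority,
                              random_state=42)
        parts.append(subset)
    return pd.concat(parts).sample(frac=1, random_state=42)  # shuffle
```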
Classifier
SVC (Support Vector Classifier) is famous for its ability to handle high-dimensional data. Since the BERT transformer converts each text into an array of 768 numbers and training examples are limited, SVC is perfect for the job.
As the training set grows, SVC can slow down notably; by then, we can substitute another model.
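Tying the two pieces together might look like this sketch, assuming scikit-learn, the embed() helper above, and small lists of manually labeled comments.

```python
# Minimal classifier sketch. train_comments/train_labels are the small
# manually labeled set; embed() is the BERT helper sketched above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train = embed(train_comments).numpy()   # shape: (n_samples, 768)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, train_labels)            # labels: L, H, Q, D

X_new = embed(new_comments).numpy()
predictions = clf.predict(X_new)          # handed to the user to validate
```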
Loop-back
The loop-back system is an idea of mine to improve the model's performance as end users run the tool. As mentioned before, the model is not perfect; the lack of training data and the up-sampling technique can cause significant overfitting.
However, the tool transforms the end user's problem from classification to validation. As employees validate the model's results, that data can be collected for further training, creating a natural feedback system that refines the tool's performance over time.
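A minimal sketch of what collecting the validated results might look like; the file path and column layout are illustrative.

```python
# Minimal loop-back sketch. Validated (comment, label) pairs are appended
# to a CSV so the classifier can be refit later; the path is illustrative.
import csv

def save_validated(comments, labels, path="training_data.csv"):
    """Append employee-validated (comment, label) pairs for retraining."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(zip(comments, labels))

# After an employee confirms or corrects the predictions:
# save_validated(new_comments, corrected_labels)
```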