Based on a hand-labelled toxicity data set containing 1000 comments crawled from YouTube videos about the 2014 Ferguson unrest. In addition to toxicity, the data set contains labels for multiple subclassifications of toxicity, which form a hierarchical structure; each comment can have several of these labels assigned.
Data Source : https://www.kaggle.com/datasets/reihanenamdari/youtube-toxicity-data
Analysis Summary
Train/Test Data Split : 80:20
Non-toxic (false) class: evaluated 634 of the 647 samples (98% recall), with 431 of those 634 identified correctly (68% accuracy).
Toxic (true) class: evaluated 95 of the 353 samples (27% recall), with 86 of those 95 identified correctly (91% accuracy).
647 + 353 = 1000 samples in total.
*Metrics reported on the 200-sample test set.
Observation : approx. 25% of samples are false negatives (toxic comments classified as non-toxic); needs further refinement.
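The class-wise figures above can be re-derived from the bracketed counts (a quick sanity check; the variable names are illustrative):

```python
# Re-deriving the class-wise figures from the bracketed counts above.
false_total, false_eval, false_correct = 647, 634, 431   # non-toxic class
true_total, true_eval, true_correct = 353, 95, 86        # toxic class

false_recall = false_eval / false_total    # 634/647 ≈ 0.98
false_acc = false_correct / false_eval     # 431/634 ≈ 0.68
true_recall = true_eval / true_total       # 95/353  ≈ 0.27
true_acc = true_correct / true_eval        # 86/95   ≈ 0.91

print(f"{false_recall:.2f} {false_acc:.2f} {true_recall:.2f} {true_acc:.2f}")
# → 0.98 0.68 0.27 0.91
```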
Resources :
from sklearn.feature_extraction.text import TfidfVectorizer
NLTK :: Natural Language Toolkit : https://www.nltk.org
import re
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from flask import Flask, request, jsonify
import joblib
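The resources above can be assembled into a minimal end-to-end training sketch. The toy texts and labels below are invented stand-ins for the real Kaggle CSV, and the cleaning function is one plausible preprocessing choice, not necessarily the notebook's exact steps:

```python
import re

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Fall back to a tiny manual list if the NLTK corpus cannot be downloaded.
try:
    nltk.download('stopwords', quiet=True)
    stop_words = set(stopwords.words('english'))
except LookupError:
    stop_words = {'the', 'a', 'and', 'is', 'are', 'you', 'this'}

def clean_text(text):
    # Lowercase, keep letters only, drop stop words.
    text = re.sub(r'[^a-zA-Z\s]', ' ', text.lower())
    return ' '.join(w for w in text.split() if w not in stop_words)

# Invented toy data standing in for the 1000-comment data set.
texts = ["you are awful and stupid", "great video thanks",
         "I hate these idiots", "nice analysis very helpful",
         "shut up moron", "interesting point well made",
         "total garbage channel", "love this content"] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

cleaned = [clean_text(t) for t in texts]
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.2, random_state=42)  # 80:20 split

tfidf_vectorizer = TfidfVectorizer()
X_train_vec = tfidf_vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

preds = model.predict(tfidf_vectorizer.transform(X_test))
print(classification_report(y_test, preds))
```

The real notebook would load the Kaggle CSV in place of the toy lists; everything downstream of the cleaned texts stays the same.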
Deployment
# Save the model and vectorizer
joblib.dump(model, "model.pkl")
joblib.dump(tfidf_vectorizer, "vectorizer.pkl")
# Load the model in the API
model = joblib.load("model.pkl")
vectorizer = joblib.load("vectorizer.pkl")
API Testing via Flask Framework
app = Flask(__name__)

@app.route('/')
def home():
    return "Hello, Flask in Jupyter!"

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    review = data['review']
    print(review)
    review_vector = vectorizer.transform([review]).toarray()
    prediction = model.predict(review_vector)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True, use_reloader=False)
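The /predict route can also be exercised in-process with Flask's test client, without starting the server or loading the real pickles. The stub model and vectorizer below are placeholders just for illustration (the stub always predicts 1):

```python
import numpy as np
from flask import Flask, jsonify, request
from scipy.sparse import csr_matrix

app = Flask(__name__)

class StubModel:
    """Stand-in for the pickled LogisticRegression; always predicts 1."""
    def predict(self, X):
        return np.ones(X.shape[0], dtype=int)

class StubVectorizer:
    """Stand-in for the pickled TfidfVectorizer."""
    def transform(self, texts):
        return csr_matrix(np.zeros((len(texts), 3)))

model, vectorizer = StubModel(), StubVectorizer()

@app.route('/predict', methods=['POST'])
def predict():
    review = request.json['review']
    review_vector = vectorizer.transform([review]).toarray()
    prediction = model.predict(review_vector)
    return jsonify({'prediction': int(prediction[0])})

# Post to the route in-process instead of calling app.run().
with app.test_client() as client:
    resp = client.post('/predict', json={'review': 'this video is terrible'})
    print(resp.get_json())  # → {'prediction': 1}
```

Against the running server, the same request would be a POST to http://127.0.0.1:5000/predict (Flask's default port) with a JSON body containing the 'review' key.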