Hate Speech Project

The beauty of the internet comes with its share of ugliness. In recent months, Twitter's dark side has been in the spotlight. Increasingly, it's the platform's bad eggs that seem to define it, rather than its usefulness for connecting people across many different interests. Much of this unsavory content comes in the form of offensive language and hate speech. Identifying tweets that fit these criteria is a huge problem in terms of scale, yes, but most crucially in terms of subjectivity: what counts as offensive varies appreciably from person to person. Thus, obtaining robust data is the first major obstacle in attacking such a problem with data science.

Luckily, CrowdFlower's free data repository contains such a dataset. Using input from three people, each tweet is labelled as either innocuous, containing offensive language but not hate speech, or containing hate speech. That is the straightforward component of the data, but there are several other columns - Confidence, for example - whose usefulness is undercut by the lack of a code book outlining exactly how these features were collected. That being said, the transparent portion of the data was enough for me to forge ahead with my analysis.
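To make the three-way labelling concrete, here's a rough sketch of loading the data and encoding the labels as integers. The file name, column name, and label strings are assumptions for illustration, not the dataset's exact schema:

```python
import pandas as pd

# File and column names below are placeholders, not the exact schema.
df = pd.read_csv("twitter_hate_speech.csv")

# Map the three human-assigned categories to integers for modeling.
label_map = {
    "not offensive": 0,
    "offensive language": 1,
    "hate speech": 2,
}
df["label"] = df["tweet_category"].map(label_map)

print(df["label"].value_counts())
```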

After exploratory analysis, pre-processing involved encoding the features, cleaning the tweets to remove unwanted material - stopwords, usernames, etc. - and then vectorizing them. The final vectorization step used scikit-learn's CountVectorizer with ngrams of lengths one and two, after the TF-IDF vectorizer proved less effective.
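A sketch of what that cleaning-and-vectorizing pipeline might look like. The exact cleaning rules and the tweet column name are assumptions, and NLTK's stopword list needs a one-time `nltk.download('stopwords')`:

```python
import re

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    """Strip usernames, URLs, and stopwords - a representative
    sketch, not necessarily the exact rules used in the project."""
    text = re.sub(r"@\w+", "", text)      # drop @usernames
    text = re.sub(r"http\S+", "", text)   # drop URLs
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = df["tweet_text"].apply(clean_tweet)  # column name assumed

# Raw counts over unigrams and bigrams (TF-IDF underperformed here).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(cleaned)
y = df["label"]
```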

For the modeling phase I chose to pit four classification algorithms against each other: Logistic Regression, Support Vector Classifier, Multinomial Naive Bayes, and Random Forests. The Random Forest and Support Vector Classifier models performed best in the first phase of testing under the train-test-split paradigm. Progressing to ten-fold cross-validation revealed a huge gap between the two, with the Random Forest the clear winner at an average of about 76% accuracy. (The model achieved the same accuracy on the test set.)
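A minimal sketch of that bake-off, assuming the `X` and `y` from the vectorization step above and default hyperparameters (tuning came later):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# The four contenders, all with default hyperparameters to match
# the pre-tuning stage described above.
models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(),
}

# Ten-fold cross-validation on the vectorized tweets.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```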

Hyperparameter tuning was put on hold until I partitioned the code in my Jupyter notebook into Python files. However, I still went ahead and built an app to classify the content of tweets. You can find the app here and the notebook with my code here.
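The post doesn't describe the app's stack, but as a sketch of how such an app could be wired up: pickle the fitted vectorizer and model after training, then serve predictions behind a small Flask endpoint. File names are placeholders, and `clean_tweet` is the helper from the earlier sketch:

```python
import pickle

from flask import Flask, request, jsonify

app = Flask(__name__)

# Assumes the fitted vectorizer and model were pickled after training;
# these file names are placeholders, not from the original project.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

LABELS = {0: "not offensive", 1: "offensive language", 2: "hate speech"}

@app.route("/classify", methods=["POST"])
def classify():
    # Clean and vectorize the incoming tweet, then predict its label.
    tweet = request.json["tweet"]
    features = vectorizer.transform([clean_tweet(tweet)])
    return jsonify({"label": LABELS[int(model.predict(features)[0])]})

if __name__ == "__main__":
    app.run()
```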

Written on May 15, 2016