These are sorted roughly in reverse chronological order, with more recent projects first.
Prostate Cancer: [Blog]
Description: Analysis of a small data set of 12,600 gene markers to find patterns that indicate the presence or absence of a tumor.
Final Result: A Logistic Regression model with a ROC-AUC score of 98% on the training set and 88% on the test set. Feature selection identified 67 genes with non-zero feature importances.
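For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it that way in plain Python. It is not the project's code, and the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = tumor, 0 = no tumor) and model scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"ROC-AUC: {auc:.3f}")
```

The pairwise loop is quadratic, which is fine for a small data set like this one; a rank-based formulation would be the usual choice for larger data.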
Description: Identifying hate speech is an important task on the internet. I used scikit-learn and the nltk package to build a hate speech classifier using Twitter data from CrowdFlower.
Final Result: The final model used a Random Forest classifier and achieved 76% accuracy on unseen data, 26 percentage points above the baseline accuracy of 50%. I productionized the model as an app which allows a user to submit text to be classified.
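The gap between a model's accuracy and its baseline is easiest to read in percentage points. The sketch below assumes the baseline is the majority-class rate, which the write-up does not state explicitly. The labels and predictions are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most common class (assumed baseline)."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Invented labels and predictions (1 = hate speech, 0 = not), for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
gain_points = (model_acc - baseline_acc) * 100  # difference in percentage points

print(f"model {model_acc:.0%}, baseline {baseline_acc:.0%}, gain {gain_points:.0f} points")
```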
Description: Kickstarter is the foremost platform for crowd-sourced projects on the internet. I scraped data from the site and used R’s caret library to predict whether or not a project will be funded.
Final Result: The final model was an ensemble of a Logistic Regression classifier built on numerical features and a Random Forest model built from text features. It achieved close to 83% accuracy on unseen data compared to a baseline of 60%.
An Introduction To Principal Component Analysis: [Code]
Description: A talk given at the New York Data Science Study Group’s August meeting. In the presentation I went through the algorithm’s mathematical foundations, and then moved through three increasingly complex examples of its use. The repository also contains a link to a binder for interactive exploration of the material.
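As a rough illustration of the idea behind PCA (not material from the talk itself), the sketch below centers the data, builds the sample covariance matrix, and finds the first principal component by power iteration. The function name and the toy data are assumptions made for this example.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (unit direction, explained variance ratio) of the first component.

    Uses power iteration on the sample covariance matrix. Assumes the data
    has non-zero variance and that the all-ones starting vector is not
    orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])

    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the variance along v (v has unit length).
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Toy data lying exactly on the line y = 2x, so one component explains everything.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, ratio = first_principal_component(points)
print(direction, ratio)
```

In practice a library routine based on the singular value decomposition is the more common choice; power iteration is used here only because it keeps the example dependency-free.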
Description: The Fund For Peace is a nonprofit which applies a data-driven approach to understanding the world’s problems. One of their initiatives is the Fragile State Index, a yearly ranking of the stability of countries around the world. With over a decade’s worth of their data in hand, let’s see what it has to say.
Description: Expected Goals is the hot metric in soccer analytics. Here I compare actual results with those predicted by Expected Goals in Germany’s top soccer league, the Bundesliga.
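For context, a team's Expected Goals total is usually computed by summing the estimated scoring probability of each of its shots, and the comparison with reality is the difference between goals scored and that sum. The sketch below shows the arithmetic with hypothetical teams and shot probabilities; it is not the analysis from the project.

```python
def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities into an Expected Goals total."""
    for p in shot_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"shot probability out of range: {p}")
    return sum(shot_probabilities)


# Hypothetical teams, shot probabilities, and goal counts, for illustration only.
matches = [
    {"team": "Team A", "shots": [0.4, 0.1, 0.05, 0.3], "goals": 2},
    {"team": "Team B", "shots": [0.7, 0.2], "goals": 0},
]

for match in matches:
    match["xg"] = expected_goals(match["shots"])
    # Positive means the team scored more than its chances suggested.
    match["diff"] = match["goals"] - match["xg"]
    print(f'{match["team"]}: goals={match["goals"]}, xG={match["xg"]:.2f}, diff={match["diff"]:+.2f}')
```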
Description: A (more or less) progressive, minimalist React web app with an interactive display of New York City’s train stations and available transfers at said stations.
Description: An interactive app made in R which shows the Open Data Science Conference’s (ODSC) meetups around the world. Data is scraped from Meetup.com using rvest and displayed as a map via the leaflet library. The flexdashboard library provides the layout.