Wednesday, 1 November, 2017 UTC


Summary

This is a running blog written during my attempt to build a Trump-Obama tweet classifier in under an hour, providing a quick guide to text classification using a Naive Bayesian approach without ‘recoding the wheel’.
Note: This is less a tutorial on Machine Learning / classifier theory, and more targeted at showing how well-established classification techniques, along with simple libraries, can be used to build classifiers quickly for real-world data.
Data
First we need labelled historical data, which machine learning approaches such as Bayesian classifiers rely on for training.
So we need the past tweets of both presidents. Luckily, Twitter gives us access to the last 3,200 tweets of a user, but it takes a bit of scripting to automate the process.
Let’s start with Tweepy which is a simple Python interface to the Twitter API that should speed up the scripting side.
  • Note: I hit issues with pip install, so I cloned and built the package manually on OS X.
Now we need credentials so let’s go to Twitter, sign in and use the Twitter Application Console to create a new app and get credentials.
  • If using a placeholder for your app’s URL fails, point it at your public GitHub page; that’s what I’ve done.
Now, getting the tweets is a little involved, as it takes multiple API calls to fetch the list of tweet IDs and then the tweet content. To save time I found a script and adapted it for Donald Trump and Obama respectively.
After running this twice we have two JSON files containing the last 3,200 tweets of each president. However, the contents are just objects listed back to back as “{…}{…}”, with no comma delimiters and no surrounding square brackets. This is invalid JSON and needs to be fixed.
A quick regex turns the files into usable JSON arrays: replace “}{” with “},{” and add two square brackets around the whole list.
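As a sketch, the same repair can be done with a couple of lines of Node instead of a text editor. The fixConcatenatedJson helper here is illustrative (not from the original scripts) and assumes “}{” never occurs inside a tweet’s own text, which holds for these dumps but not for JSON in general:

```javascript
// Repair concatenated JSON objects ("{…}{…}") into a parseable array.
// Naive sketch: assumes "}{" never appears inside a string value.
function fixConcatenatedJson(raw) {
  var withCommas = raw.replace(/\}\s*\{/g, '},{'); // "}{" -> "},{"
  return JSON.parse('[' + withCommas + ']');       // wrap in brackets
}

// Example: two concatenated objects become a two-element array.
var raw = '{"text":"first tweet"}{"text":"second tweet"}';
var tweets = fixConcatenatedJson(raw);
console.log(tweets.length);  // 2
console.log(tweets[1].text); // second tweet
```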
Building the Classifier
Next, we build a Naive Bayesian classifier for our two categories, Trump and Obama.
The maths behind the classifier isn’t too complex if you’re interested.
The main decision to make is which feature set to use (the attributes of each data element used in classification, e.g. length or words) and how to implement it. Both problems are solved by the bayes NPM package, which provides a simple interface for building Naive Bayesian models from textual data.
The bayes package uses term frequency as its single, relatively simple, classification feature. Text input is tokenized (split into individual words with punctuation removed) and a frequency table is then constructed mapping each token to the number of times it is used within the document (tweet).
  • There are perhaps some improvements that could be made to the tokenisation, such as stop-word removal and stemming, but let’s see how this performs.
(Check out the implementation; it’s ~300 lines of very readable JavaScript.)
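On that note, here is a sketch of what stop-word removal could look like. The stop-word list and tokenize function are purely illustrative, not part of the original post; the bayes package can accept such a function through its tokenizer option:

```javascript
// A hypothetical custom tokenizer: lowercase, strip punctuation,
// and drop a small (illustrative) stop-word list.
// Could be plugged in with bayes({ tokenizer: tokenize }).
var STOP_WORDS = new Set(['a', 'an', 'the', 'is', 'of', 'to', 'and']);

function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ') // strip punctuation
    .split(/\s+/)                 // split on whitespace
    .filter(function (token) {
      return token.length > 0 && !STOP_WORDS.has(token);
    });
}

console.log(tokenize("Let's build a wall!")); // [ 'let', 's', 'build', 'wall' ]
```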
We can open up a fresh NPM project, require the bayes package and jump into importing the JSON files… so far so good. (Don’t forget to npm init and npm install bayes.)
var bayes = require('bayes');
var classifier = bayes();
var trumpTweets = require('./tweetFormatted.json');
var obamaTweets = require('./tweetFormatted2.json');
Now we train the model by iterating over the presidents and then their tweets, using the tweet’s text attribute to get its content. The classifier is trained with a simple call to the ‘learn’ function for each tweet.
const data = [{name: 'obama', tweets: obamaTweets}, {name: 'trump', tweets: trumpTweets}];

for (var president of data) {
  console.log(`training model with historical ${president.name} data.`)
  for (var tweet of president.tweets) {
    classifier.learn(tweet.text, president.name);
  }
}
Great, let’s try it out…
console.log(classifier.categorize('Lets build a wall!')); // Trump
console.log(classifier.categorize('I will bear hillary')); // Trump
console.log(classifier.categorize('Climate change is important.')); // Obama
console.log(classifier.categorize('Obamacare has helped americans.')); // Obama
OK! But that’s not exactly scientific. Let’s move to separating training and test data.
Model Validation
Splitting our historical data into training and test sets is a core principle of machine learning. Training data is the data we train our model on; test data is the data we use to evaluate the model. We could take an arbitrary sample, but it is more interesting to exclude each tweet individually from the training data, build a new model, and then test with that single tweet. This rotation gives us an average accuracy while taking advantage of as much training data as possible. In the world of ML statistics this method is called ‘leave-one-out cross-validation’: the extreme case of k-fold cross-validation where k equals the number of data points.
We can achieve this exhaustive cross validation with a bit of loop logic and some counters.
A basic working implementation counting false positives, true positives, false negatives and true negatives is as follows:
var bayes = require('bayes');
var trumpTweets = require('./tweetFormatted.json');
var obamaTweets = require('./tweetFormatted2.json');

const data = [{name: 'trump', tweets: trumpTweets}, {name: 'obama', tweets: obamaTweets}];

var totalDataCount = trumpTweets.length + obamaTweets.length;
var tp = 0;
var tn = 0;
var fp = 0;
var fn = 0;

var t0 = new Date().getTime();

// Iterate through every historic data element index
for (var testIndex=0; testIndex<totalDataCount; testIndex++){
  console.log(testIndex);
  // instantiate a new model
  var classifier = bayes();
  var testData = [];
  var counter = 0;
  for (var president of data) {
    for (var tweet of president.tweets) {
      // If this is the held-out index, omit the tweet from training.
      if (counter++ === testIndex) {
        testData.push({president: president.name, tweet: tweet});
      } else {
        // Train on all other data elements.
        classifier.learn(tweet.text, president.name);
      }
    }
  }
  // Use test data.
  for (var test of testData) {
    if (classifier.categorize(test.tweet.text) === test.president) {
      // Treat Obama as the positive class.
      if (test.president === 'obama') {
        tp++;
      } else {
        tn++;
      }
    } else {
      if (test.president === 'obama') {
        fn++; // an Obama tweet classified as Trump is a false negative
      } else {
        fp++; // a Trump tweet classified as Obama is a false positive
    }
  }
}
var t1 = new Date().getTime();

console.log('total tests: ', (tp + tn + fp + fn));
console.log(`TP = ${tp}`);
console.log(`TN = ${tn}`);
console.log(`FP = ${fp}`);
console.log(`FN = ${fn}`);
console.log('Took ' + (t1 - t0) + ' milliseconds.')
Now we wait for around 40 minutes (model validation execution not included in challenge time) for each of the 6,400 models to be trained and evaluated.
It’s finished with an accuracy of 98%!
We can analyse the results as a ‘confusion matrix’, which tabulates all possible outcomes of classification success or failure: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
This is useful because accuracy alone is not a great measure for classifiers.
                Predicted Obama    Predicted Trump
Actual Obama    3195 (TP)          82 (FN)
Actual Trump    27 (FP)            3123 (TN)
From this we can calculate the accuracy of our model:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (3195 + 3123) / (3195 + 3123 + 27 + 82)
         ≈ 0.98 (98%)
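Since accuracy alone can mislead, we can also derive precision and recall from the same counts, again taking Obama as the positive class. A quick sketch:

```javascript
// Precision and recall from the confusion-matrix counts above,
// with Obama as the positive class.
var tp = 3195, tn = 3123, fp = 27, fn = 82;

var accuracy  = (tp + tn) / (tp + tn + fp + fn);
var precision = tp / (tp + fp); // of tweets labelled Obama, how many really were
var recall    = tp / (tp + fn); // of real Obama tweets, how many were caught

console.log(accuracy.toFixed(3));  // 0.983
console.log(precision.toFixed(3)); // 0.992
console.log(recall.toFixed(3));    // 0.975
```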
Conclusion
This was obviously a very quick exercise in text classification using a Naive Bayesian classifier. We have not gone deeply into the subject, discussed Bayesian probability, or compared it with other methods such as Support Vector Machines (SVM), k-Nearest Neighbours (KNN) or neural networks. These areas are interesting, applicable and, thanks to libraries, accessible without deep theoretical knowledge. I hope this quick tutorial helps you see real-world machine learning applications and learn by doing!
Our key steps were:
  • Find and clean the data
  • Choose an approach (Bayesian Probability, SVM, KNN or Neural Networks…)
  • Find a library rather than ‘recode the wheel’
  • Validate the model
  • Share your results!
Note: I challenged myself to do this in one hour, and the resulting accuracy of the model is surprising. Due to time constraints I have not checked the data thoroughly for duplicates; if any exist, they would inflate the accuracy seen here.
Let me know your results in the comments! Takeaways:
  • You can use machine learning techniques without going deep into maths and theory.
  • There are some great libraries that simplify applying machine learning.
  • You have access to more labelled historic data than you think; be creative.

Further Reading

  • https://en.wikipedia.org/wiki/Feature_(machine_learning)
  • https://en.wikipedia.org/wiki/Naive_Bayes_classifier
  • https://en.wikipedia.org/wiki/Additive_smoothing
  • https://en.wikipedia.org/wiki/Support_vector_machine
  • https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
  • https://en.wikipedia.org/wiki/Machine_learning
  • https://en.wikipedia.org/wiki/Artificial_neural_network
The post Building a Trump/Obama Tweet Classifier with 98% accuracy in 1 hour! appeared first on Theodo.