Blasters and Bayesian Stats

I’m alive! For my first post of the new year I’ll apply some Bayesian Stats to Star Wars. This was originally done in a Jupyter notebook, which I’ll just post below.

Who Is The Best Shot In Star Wars?

I happened upon this thread on reddit the other day. The poster made an infographic ranking the best shooter with a blaster from the Star Wars movies. S/he doesn’t go into the precise methodology, but it seems to center only around the main characters. I only needed a glance to feel a deep disturbance in the Force of Statistics. It was as if thousands of statisticians cried out with one voice.

Though this analysis may seem fine from a naive perspective, a closer look reveals that uneven sample sizes skew the rankings. Finn may have 100% accuracy, but with only four shots taken, this cannot be read as a pure reflection of his aiming ability. (Small samples are more likely to display extremes.) To fall back on the oft-used example, which do you trust more: the product on Amazon with an average rating of 3.5 from 1,000 reviews, or a similar one with a rating of 5 from 10 reviews? The two scenarios are the same from a statistical perspective.
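To put a rough number on the small-sample point, here is a minimal sketch. It assumes a hypothetical shooter whose true accuracy is 0.6 (a value picked purely for illustration, not taken from the infographic) and asks how often such a shooter would post a perfect record by luck alone.

```python
# Hypothetical true accuracy, chosen only for illustration.
TRUE_ACCURACY = 0.6


def prob_perfect_record(true_accuracy, shots):
    """Probability that every one of `shots` independent shots hits."""
    return true_accuracy ** shots


# Finn's sample size (4 shots) versus Han's (45 shots).
p_four = prob_perfect_record(TRUE_ACCURACY, 4)
p_forty_five = prob_perfect_record(TRUE_ACCURACY, 45)

print(f"Perfect over 4 shots:  {p_four:.4f}")
print(f"Perfect over 45 shots: {p_forty_five:.2e}")
```

Under that assumption a flawless four-shot record turns up roughly 13% of the time, while a flawless 45-shot record essentially never does.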

Robustness can be added to this analysis with Bayesian Statistics. We know from Bayes' Theorem that:

$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$
Where $P(B \mid A)$ is the likelihood, $P(A)$ is the prior, $P(B)$ is the marginal likelihood, and $P(A \mid B)$ is the posterior. We can omit the marginal likelihood in the discussion since it is, in essence, a normalization factor.

The best prior would be based on the overall accuracy of all the blaster bolts fired over the course of the seven movies. I don’t have that data and I am not inclined to collect it myself. In the absence of this ideal prior, I’ll use one with an expected value of 0.5, i.e. a blaster bolt being fired is equally likely to hit or miss. I decided to use a Beta(50, 50) distribution. Since this choice is arbitrary, the new ratings will necessarily shift with a different choice of Beta(n, n), where n is greater than 1. (For a Beta(n, n) distribution, the variance decreases as n increases.)
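The parenthetical about variance can be checked with the closed-form variance of a Beta distribution, $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$. This is only a small sketch; the helper name and the values of n other than 50 are my own picks for illustration.

```python
def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta) / (total ** 2 * (total + 1))


# Symmetric priors Beta(n, n): the mean stays at 0.5 while the spread shrinks.
for n in (2, 10, 50, 200):
    variance = beta_variance(n, n)
    print(f"Beta({n}, {n}): variance = {variance:.5f}, std dev = {variance ** 0.5:.4f}")
```

For Beta(50, 50) the standard deviation comes out at about 0.05, so this prior puts most of its weight on accuracies between roughly 0.4 and 0.6.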

The likelihood is drawn from a binomial distribution. Given this prior and likelihood, the concept of conjugate priors tells us that the posterior follows a Beta distribution with parameters which will be discussed below.

Here’s the original table of rankings.

import pandas as pd
from collections import OrderedDict

star_wars_shooting = OrderedDict()
star_wars_shooting['name'] = ['Finn', 'Padme', 'Leia', 'Chewbacca', 'Luke', 'Rey', 'Han']
star_wars_shooting['shots_fired'] = [4, 13, 12, 19, 8, 13, 45] 
star_wars_shooting['confirmed_hits'] = [4, 11, 8, 12, 5, 8, 26]
star_wars_shooting['confirmed_misses'] = [0, 2, 4, 7, 3, 5, 19]
star_wars_shooting['raw_accuracy'] = [1, .84, .66, .63, .62, .61, .57]
star_wars_shooting = pd.DataFrame(star_wars_shooting, index = [i for i in range(1, 8)])
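
|   | name | shots_fired | confirmed_hits | confirmed_misses | raw_accuracy |
|---|------|-------------|----------------|------------------|--------------|
| 1 | Finn | 4 | 4 | 0 | 1.00 |
| 2 | Padme | 13 | 11 | 2 | 0.84 |
| 3 | Leia | 12 | 8 | 4 | 0.66 |
| 4 | Chewbacca | 19 | 12 | 7 | 0.63 |
| 5 | Luke | 8 | 5 | 3 | 0.62 |
| 6 | Rey | 13 | 8 | 5 | 0.61 |
| 7 | Han | 45 | 26 | 19 | 0.57 |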

With a Beta($\alpha$, $\beta$) prior and a Binomial likelihood, a posterior of Beta($\alpha$ + hits, $\beta$ + misses) follows. The expected accuracy with the particular prior chosen is:

$\frac{\alpha}{\alpha + \beta} = \frac{50}{50 + 50} = \frac{1}{2}$

The expected accuracy for the posterior is simply:

$\frac{\alpha + h}{\alpha + h + \beta + m}$

Where h is the number of hits and m is the number of misses.
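As a sanity check on the conjugate update, the posterior mean can also be approximated by brute force: multiply the prior by the likelihood on a grid of candidate accuracies and normalise. The sketch below uses only the standard library; the function names and the grid size are assumptions of mine, not part of the original notebook.

```python
def grid_posterior_mean(alpha, beta, hits, misses, steps=10_000):
    """Posterior mean from prior x likelihood evaluated on a grid of accuracies."""
    # Midpoints of `steps` equal-width bins on (0, 1).
    grid = [(i + 0.5) / steps for i in range(steps)]
    # Unnormalised posterior: Beta prior kernel times binomial likelihood kernel.
    weights = [
        p ** (alpha - 1) * (1 - p) ** (beta - 1) * p ** hits * (1 - p) ** misses
        for p in grid
    ]
    # Dividing by the total plays the role of the marginal likelihood.
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


def conjugate_posterior_mean(alpha, beta, hits, misses):
    """Closed-form mean of Beta(alpha + hits, beta + misses)."""
    return (alpha + hits) / (alpha + beta + hits + misses)


# Han's record from the table: 26 hits, 19 misses, with the Beta(50, 50) prior.
numeric = grid_posterior_mean(50, 50, 26, 19)
closed_form = conjugate_posterior_mean(50, 50, 26, 19)
print(f"grid: {numeric:.6f}  closed form: {closed_form:.6f}")
```

The two numbers should agree to several decimal places, which is the practical payoff of conjugacy: no integration is needed, just addition.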

From this point a couple of lines of code produce the new rankings, sorted by expected_accuracy from the posterior distribution.

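# Posterior mean under the Beta(50, 50) prior: (alpha + hits) / (alpha + beta + shots)
star_wars_shooting['expected_accuracy'] = (50 + star_wars_shooting['confirmed_hits'])/(100 + star_wars_shooting['shots_fired'])
# Rank by posterior mean; raw_accuracy is left out of the display.
star_wars_shooting.sort_values(by='expected_accuracy', ascending=False).drop(['raw_accuracy'], axis=1)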
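
|   | name | shots_fired | confirmed_hits | confirmed_misses | expected_accuracy |
|---|------|-------------|----------------|------------------|-------------------|
| 2 | Padme | 13 | 11 | 2 | 0.539823 |
| 7 | Han | 45 | 26 | 19 | 0.524138 |
| 4 | Chewbacca | 19 | 12 | 7 | 0.521008 |
| 1 | Finn | 4 | 4 | 0 | 0.519231 |
| 3 | Leia | 12 | 8 | 4 | 0.517857 |
| 6 | Rey | 13 | 8 | 5 | 0.513274 |
| 5 | Luke | 8 | 5 | 3 | 0.509259 |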

The biggest changes are for Finn and Han: Finn drops from first to fourth, while Han climbs from last to second. From this revised standpoint, Padme takes the crown as the best shot among the characters profiled.
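Since the Beta(50, 50) prior was an arbitrary pick, it is worth seeing how much the order depends on it. The sketch below recomputes the ranking for a few symmetric Beta(n, n) priors in plain Python; the `records` dictionary and `ranking` helper are names I made up for this example, and the values of n other than 50 are arbitrary.

```python
# (hits, shots fired) for each character, copied from the table in the post.
records = {
    'Finn': (4, 4), 'Padme': (11, 13), 'Leia': (8, 12), 'Chewbacca': (12, 19),
    'Luke': (5, 8), 'Rey': (8, 13), 'Han': (26, 45),
}


def ranking(n):
    """Names ordered by posterior mean under a Beta(n, n) prior, best first."""
    posterior_mean = {
        name: (n + hits) / (2 * n + shots)
        for name, (hits, shots) in records.items()
    }
    return sorted(posterior_mean, key=posterior_mean.get, reverse=True)


# n = 50 is the prior used in the post; the other values are arbitrary.
for n in (1, 10, 50, 500):
    print(f"n = {n:>3}: {', '.join(ranking(n))}")
```

With n = 1 the prior carries almost no weight and Finn returns to the top; with n = 50 the order matches the table above. So the ranking says as much about the prior as it does about the shooters.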

If you want more Star Wars and Bayesian Statistics, check out this post.

Written on January 28, 2016