Fun and Profit with Markov Chain Monte Carlo

The NBA season has started and I'm ready to make some money now that the Supreme Court has legalized sports betting! So I searched around for sports betting models and came across a post on rugby prediction. Here it is replicated with NBA basketball data. The model is very simple and surely won't win me any money but it's a fun introduction to PyMC3, hierarchical (multilevel) models and MCMC.

Scores are based on the team's offensive (attack) and defensive strengths, as well as a home court advantage for the home team. Scores are assumed Poisson distributed. Data is for the 2017-18 regular season I found on Kaggle.

Without further ado, here are the Wednesday October 31 predictions. Like how Madden NFL runs a single simulation for their Super Bowl prediction, this is base on a single game simulation.

DET vs BKN    BKN -12.5    100-113
IND vs NY     IND -20.5    106-86
DEN vs CHI    CHI -4.5     112-117
UTA vs MIN    UTA -8.5     101-93
NO vs GS      GS -13.5     104-118
DAL vs LAL    LAL -1.5     107-109
SA vs PHO     SA -20.5     111-91
In [1]:
import numpy as np
import pandas as pd
import pymc3 as pm, theano.tensor as tt
import matplotlib.pyplot as plt
import seaborn as sns
from theano import shared

%matplotlib inline
In [2]:
df = pd.read_csv('nba_games_2017-18.csv')

teams = df.home_team.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index

df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_away'}).drop('team', 1)

df.head()
Out[2]:
home_team away_team home_score away_score i_home i_away
0 CLE BOS 102 99 0 6
1 GS HOU 121 122 1 27
2 IND BKN 140 131 2 18
3 DET CHA 102 90 3 21
4 UTA DEN 106 96 4 23
In [3]:
observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values

home_team = df.i_home.values
away_team = df.i_away.values

home_team_shared = shared(home_team)
away_team_shared = shared(away_team)

num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)
In [4]:
with pm.Model() as model:
    # global model parameters
    home = pm.Flat('home')
    sd_att = pm.HalfStudentT('sd_att', nu=1, sd=0.5)
    sd_def = pm.HalfStudentT('sd_def', nu=1, sd=0.5)
    intercept = pm.Flat('intercept')

    # team-specific model parameters
    atts_star = pm.Normal("atts_star", mu=0, sd=sd_att, shape=num_teams)
    defs_star = pm.Normal("defs_star", mu=0, sd=sd_def, shape=num_teams)

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))

    home_theta = tt.exp(intercept + home + atts[home_team_shared] +
                        defs[away_team_shared])
    away_theta = tt.exp(intercept + atts[away_team_shared] +
                        defs[home_team_shared])

    # likelihood of observed data
    home_points = pm.Poisson(
        'home_points', mu=home_theta, observed=observed_home_goals)
    away_points = pm.Poisson(
        'away_points', mu=away_theta, observed=observed_away_goals)
In [5]:
with model:
    trace = pm.sample(2000, tune=1000)
    pm.traceplot(trace)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [defs_star, atts_star, intercept, sd_def, sd_att, home]
Sampling 4 chains: 100%|██████████| 12000/12000 [00:21<00:00, 566.80draws/s]