A Bayesian Model of NCAA Match Point Margins

Predicting NCAA Division One Tournament Outcomes

Nathan J. Doogan

EMail: doogan.1@osu.edu

Twitter: @ndoogan

Web: http://doogan.us/

Update: This post was updated on March 3rd, 2018 to fix an error in the model identified by okay zed.

Introduction

I don't watch sports, but I like data and I like probabilistic modeling. I found some data on NCAA college basketball match outcomes and wanted to try my hand at a predictive model for future games. More specifically, I wanted to win a March Madness pool, and I wanted a machine to tell me how. Perhaps primarily, though, I wanted some practice using a Bayesian approach to solve a practical problem.

Since I rarely watch sports, I don't know much about them. So the model needed to be predictive with minimal domain expertise from me. I do know some things. For example, the team with the most points wins, and the larger the outcome point differential (the “margin” hereafter), the more disparate the true underlying skill levels of the teams (probabilistically speaking).

I'm certainly not the first person to want a machine to help me win a March Madness pool. Adit Deshpande tried last year. According to his blog post, he landed on a gradient boosted regression tree as his best model. He used cross-validation to estimate an accuracy of \(76.4\%\) for guessing the correct winner (seems pretty good!). His bracket was apparently not much of a success, however. Sorry Adit.

A group at Georgia Tech has been successful at beating a bunch of other team rating systems, which can be used for prediction. This site has some comparisons. The second chart on the webpage makes it look like their model is correct over \(95\%\) of the time! But I think the chart is misleading. You can examine the data yourself in a provided spreadsheet. It looks like their model is correct somewhere in the range of \(60\) - \(75\%\) of the time. This does not necessarily make it worse than Adit's, because these guys are predicting tournament games, which might be a different kind of animal compared with regular season games, whereas I believe Adit cross-validated by splitting regular season games into training and test sets.

This post walks you through a set of steps I took to reach my conclusion. You can expect to read about some of my thoughts about or approaches to Bayesian priors, model specification, inference, cross-validation, Stan, and R. I don't claim to be an expert in, like, any of this, and I welcome all feedback (EMail: doogan.1@osu.edu, Twitter: @ndoogan).

Once I'm happy with the quality of my model (or run out of time), and once the first round teams are selected for the tournament, I will build my 2018 NCAA Division One Men's Tournament bracket. I'll update this post to describe how I did.

The Model

The score margin \(y_{ijt}\) is the final score of team \(i\) minus the final score of team \(j\) for a given match \(t\). I model the margin as the difference between two skill/advantage combos: the skill of team \(i\) (notated as \(\theta_i\)) plus team \(i\)'s home court advantage (the value of \(\alpha_i\), if team \(i\) is playing at home), minus the corresponding sum for team \(j\) (its skill \(\theta_j\) plus its home court advantage \(\alpha_j\), if team \(j\) is playing at home). And finally, this won't be a perfect model, so I add some random Gaussian noise, \(\epsilon_{ijt}\). If \(y_{ijt}\) (the margin) is positive, team \(i\) won game \(t\), and if it is negative, team \(j\) won.

\[ y_{ijt} = (\theta_i + \alpha_i h_{it}) - (\theta_j + \alpha_j h_{jt}) + \epsilon_{ijt} \]

The most important component of the model is the vector \(\theta\), the 351 elements of which are estimates of the true latent skill of the 351 NCAA Division One basketball teams.

The home court advantage terms, \(\alpha_i h_{it}\) and \(\alpha_j h_{jt}\), include a known value \(h_{it}\) or \(h_{jt}\), which is \(1\) if the respective team plays at home for match \(t\), and \(0\) otherwise. Thus, the advantage term for the away team falls out, or both advantage terms fall out if neither team played at home. The advantage term also includes a team-specific factor \(\alpha\), that team's home-court advantage in points. This, because others have (convincingly enough) shown that home-team advantage is real and varies by team.
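To make the bookkeeping concrete, here's a toy calculation with made-up numbers (mine, not estimates from any model):

# Toy example: team i (skill 10) hosts team j (skill 5), and
# team i's home-court advantage is worth 4 points
theta_i <- 10; alpha_i <- 4; h_i <- 1  # team i plays at home
theta_j <-  5; alpha_j <- 3; h_j <- 0  # team j is away; its advantage falls out
mu <- (theta_i + alpha_i * h_i) - (theta_j + alpha_j * h_j)
mu  # expected margin: (10 + 4) - 5 = 9 points in team i's favor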

The advantage terms will not play a role when predicting NCAA Division One tournament outcomes because the teams are unlikely to (will not?) play on their home court. However, they are important during estimation to ensure that a true home court advantage is not “absorbed” into the latent team skill \(\theta\). In other words, these terms separate home-court advantage from skill so that, for tournament predictions, the home-team advantage can be appropriately discarded when necessary.

Priors

My choices and some limited justification

A proper Bayesian model includes a distribution on all unknown parameters to characterize my prior beliefs about the location of those parameters. Just below, I write out all of the model priors (including so-called “hierarchical priors”), and then discuss them.

\[ \theta \sim N(0, \tau_\theta) \]
\[ \alpha \sim N(\eta, \tau_\alpha) \]
\[ \epsilon \sim N(0, \sigma) \]
\[ \eta \sim N(4,1) \]
\[ \tau_\theta \sim \text{half-Cauchy}(0,1) \]
\[ \tau_\alpha \sim \text{half-Cauchy}(0,.25) \]
\[ \sigma \sim \text{half-Cauchy}(0,1) \]

I start with \(\tau_\theta\) and \(\sigma\). They are the standard deviation of the population of team skills and the point margin error, respectively. Both are on the scale of game points. I characterize my prior beliefs about the magnitude of these standard deviation parameters with a half-Cauchy (sounds like “Koshi”) distribution with a location of \(0\) and scale of \(1\). Here's a visualization of this distribution.
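The curve can be reproduced with a couple of lines of base R (a sketch; on \([0,\infty)\) the half-Cauchy density is just twice the Cauchy density):

# density of a half-Cauchy(0,1): a Cauchy(0,1) truncated to
# [0, Inf), which doubles the density on the positive half-line
curve(2 * dcauchy(x, location=0, scale=1), from=0, to=15,
  xlab="value", ylab="density", main="half-Cauchy(0,1)")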

[Figure: density curve of the half-Cauchy(0, 1) distribution]

The next line of code answers the question, “Below which point does \(95\%\) of the probability mass of this distribution live?”

require(truncdist)
qtrunc(0.95, spec="cauchy", a=0, b=Inf, location=0, scale=1)
## [1] 12.7062

This means that I believe with \(95\%\) probability the true value of these parameters is less than \(12.7\) (95th percentile) with increasing weight given to values further below \(12.7\), and zero weight for any value below \(0\).

This is reasonably consistent with my actual prior beliefs about how large these standard deviation parameters could be. Net of random noise, I would not expect to see consistent margins much bigger than \(25\) points for a given matchup, and probably they would be much smaller. Likewise, the standard deviation of the random match-level “noise” ought to be similar. A pretty surprising point swing (from expectation) would not be much bigger than \(25\). So a prior belief that includes the possibility of a \(25\) point swing from expectation, but with more weight on smaller swings, seems reasonable.

Likewise, I also believe, a priori, that \(\eta\), the home-court advantage, is about \(4\) points, give or take a couple of points. I believe this because of prior work that has been done (link). Even if this prior work were not available, I'd still believe a priori that \(\eta\) is probably pretty small, since a home court advantage does not seem to dominate the outcome of NCAA matches in a way that would require it to be very large.

The cited work also suggests that teams have varying degrees of home court advantage, and that the variation ranges from less than a point up to \(7\) points (roughly a standard deviation of \(2\)). That strikes me as reasonable, but of course it could be an under- or over-estimate. I define my prior for \(\tau_\alpha\) to again be half-Cauchy, but I choose a smaller scale (\(0.25\)) than that for \(\tau_\theta\) and \(\sigma\). A half-Cauchy with scale \(0.25\) is inclusive of \(2\), but also puts weight on values that are a little higher and a little lower.

qtrunc(c(0.95,0.99), spec="cauchy", a=0, b=Inf, location=0, scale=.25)
## [1]  3.176551 15.914185

The output of the code above again answers the question, “Below which point does \(95\%\) of the probability mass of this distribution live?” The answer is about \(3.2\), which means for the most part I don't think the standard deviation is higher than that.

The output also answers the question “Below which point does \(99\%\) of the probability mass of this distribution live?” The answer is a whopping \(15.9\), indicative of the “fat-tailed” nature of the Cauchy distribution. This means that while I put a lot of belief in a fairly small value for this standard deviation, I give some allowance for a surprising estimate.
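To see just how fat that tail is, compare against a half-normal with the same scale (my own contrast, again using truncdist):

# 99th percentiles: half-normal(0,.25) vs half-Cauchy(0,.25)
qtrunc(0.99, spec="norm", a=0, b=Inf, mean=0, sd=.25)           # ~0.64
qtrunc(0.99, spec="cauchy", a=0, b=Inf, location=0, scale=.25)  # ~15.9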

Sidebar on priors and hierarchical priors—estimate-constraining rubber bands

The effect of priors is very interesting and useful. If, for example, the data were to suggest that the standard deviation of \(\alpha\) (which I refer to as \(\tau_\alpha\)) was far above \(3.2\), my prior belief specification would increasingly penalize values that are increasingly large. It acts like a rubber band tugging on the estimate of \(\tau_\alpha\) to softly constrain the estimate towards a range I specified was believable. The estimate can definitely go beyond that range because technically the half-Cauchy distribution puts non-zero weight on values up to infinity. However, the data will have to provide strong evidence that it should. Or said differently, the data will have to provide a lot of force to stretch the rubber band that far.

This rubber band action is a fascinating characteristic of prior distributions because they protect against wild parameter estimates in cases where the data provide weak and perhaps unusual information about the location of a parameter (the effect is sometimes referred to as regularization). For example, if an experimental treatment group contained only a few participants in total who all, for some reason, showed an unrealistically large effect compared to control, a reasonable prior would rein in the large effect estimate to a balance point between the prior rubber band and the metaphorical weight (the data) hanging from it.

One of the most interesting characteristics of Bayesian models like mine is the hierarchical structure of parameters and priors. The heart of the model I present is \(\theta\), the vector of team-specific skill values, which are estimated from the game outcome data. Each element of the vector belongs to a team, and is given the prior \(N(0,\tau_\theta)\). I call this prior “hierarchical” because it depends on an unknown parameter, \(\tau_\theta\), which itself requires the specification of a prior distribution. So I have a prior on \(\tau_\theta\), then \(\tau_\theta\) is embedded in the prior for \(\theta\), then \(\theta\) is embedded in the model for \(y_{ijt}\).

To repeat, the width of the \(N(0,\tau_\theta)\) prior on \(\theta\) depends on \(\tau_\theta\), which is estimated from the data. This is remarkable because it means that elements of \(\theta\) (estimates of latent team skill) that are not well informed by the data can use information from other elements of \(\theta\) that are well informed. This happens because the prior \(N(0,\tau_\theta)\) dictates a soft constraint—a rubber band effect—on where elements of \(\theta\) are most likely to be located. So if the data weakly suggest some \(\theta_i\) is \(10\), but \(\tau_\theta\) was estimated from data to be \(2\), (and thus, most team specific values of \(\theta\) are between about \(-4\) and \(4\)), the \(N(0,\tau_\theta=2)\) prior will rubber band that \(10\) down to a more reasonable value. If the data informing the location of \(\theta_i\) are extremely weak, the rubber band will snap \(\theta_i\) almost all the way back to the center of the distribution of \(\theta\), zero in this case.
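To put rough numbers on that last example, here's a back-of-the-envelope normal-normal shrinkage calculation (a sketch; the standard error of \(6\) for the weak data is an assumed value):

# weak data say theta_i = 10 (assume standard error 6);
# the prior is N(0, tau_theta = 2)
prior_mean <- 0;  prior_sd <- 2
data_mean  <- 10; data_se  <- 6
# precision-weighted average of the prior mean and the data estimate
w <- (1/data_se^2) / (1/data_se^2 + 1/prior_sd^2)
w * data_mean + (1 - w) * prior_mean  # = 1; most of the way back toward 0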

This feature of hierarchical Bayesian models is sometimes referred to as partial pooling. I won't expand on that term here, though it totally makes sense, and I encourage you to read about it elsewhere (e.g., in one of Jim Savage's blog posts) and compare with what I've said. With that digression done, I move on to the data and modeling sections.

Data

For this project, I used 2017/2018 men's college basketball data from Massey Ratings. On that website, you can work your way here to download a nicely formatted CSV file of only division one match outcomes for the NCAA men's league. I select “Matlab games”, click “go” and it eventually reveals a text file formatted well for reading into R. I do the same for the “Matlab teams” radio button to get a CSV of team names mapped to the ID numbers that are used in the outcome CSV.

I save these files in a convenient location and name them 2018NCAA_D1_Scores.csv and 2018NCAA_D1_Teams.csv, which will soon appear in code. The files used in this post were downloaded on February 15, 2018.

In the code below, I assume you've placed the data files into an appropriate file directory and set the current working directory within R. First I load the files and give the resulting data frames appropriate column names. Then I do some brief house-keeping with the name strings, recode the home-team indicators, and take a quick look at the raw data.

# setwd("/path/to/files/")

d <- read.csv("2018NCAA_D1_Scores.csv", header=F) 
names(d) <- c('gid', 'date', 'teami', 'homei', 'scorei', 'teamj', 'homej', 'scorej')

teams <- read.csv("2018NCAA_D1_Teams.csv", header=F, stringsAsFactors=F)
names(teams) <- c('tid','name')

## Some string house-keeping for the team names
# substitute "&amp;" with "&"
teams$name <- gsub("&amp;","&",teams$name)
# remove unwanted space padding
teams$name <- gsub(" ","",teams$name)

# the home-team indicators are coded such that
# if teami is playing at teamj's home, homei is
# -1. I'd rather homei simply indicate if the 
# game is at teami's home or not (0/1)
# And the same for homej
d$homei <- (d$homei == 1) * 1
d$homej <- (d$homej == 1) * 1

# Take a look at the data
head(d)
##      gid     date teami homei scorei teamj homej scorej
## 1 736990 20171023    86     1     94   112     0     78
## 2 736994 20171027   241     0     73   200     1     70
## 3 736995 20171028   234     1     94   124     0     72
## 4 736996 20171029    80     0     76   228     1     70
## 5 736997 20171030   138     1     92   179     0     67
## 6 736999 20171101   344     1     69   210     0     38
head(teams)
##   tid        name
## 1   1 Abilene_Chr
## 2   2   Air_Force
## 3   3       Akron
## 4   4     Alabama
## 5   5 Alabama_A&M
## 6   6  Alabama_St

The model I'm fitting is of the margin, which is not present in this data. However, it can be calculated as a simple difference between the scores of teams \(i\) and \(j\) for each row.

d$margin <- with(d, scorei - scorej)
head(d)
##      gid     date teami homei scorei teamj homej scorej margin
## 1 736990 20171023    86     1     94   112     0     78     16
## 2 736994 20171027   241     0     73   200     1     70      3
## 3 736995 20171028   234     1     94   124     0     72     22
## 4 736996 20171029    80     0     76   228     1     70      6
## 5 736997 20171030   138     1     92   179     0     67     25
## 6 736999 20171101   344     1     69   210     0     38     31

I can take a look at the distribution of scores.
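The figure comes from a histogram of the two score columns pooled together (a sketch of what the plotting chunk presumably looked like):

# histogram of all recorded final scores, pooling teami and teamj
hist(c(d$scorei, d$scorej), breaks=50,
  xlab="final score", main="Distribution of scores")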

[Figure: histogram of final scores]

Oops. What are all those zeros? It turns out the data included scheduled matches that have not yet been played. So as one more matter of house-keeping, I can remove those rows, which I identify by zero scores for both teams.

d <- d[!(d$scorei == 0 & d$scorej == 0),]
nrow(d)
## [1] 4401

This leaves us with 4401 match outcomes to work with.

One potentially important note about the data is that they are organized such that teami is always the winning team for a match, and teamj the loser. I don't think my model will attempt to take advantage of that fact (which would be bad for out-of-sample prediction!), but do keep that in mind if you choose to play with this data. I give it attention in the Cross-Validation section below.
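A quick sanity check of that claim (my own addition):

# because the winner is always recorded as teami,
# every margin should be positive
all(d$margin > 0)  # should be TRUE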

Building the Model with Stan

Now I'm ready to build a model, which I do with the wonderful open source probabilistic programming language and HMC posterior sampler called Stan. This software is very well written and documented, and the fact that it is open source literally gives me warm fuzzy feelings in my stomach. Many thanks to the Stan team and their funders for this truly amazing software package.

I'm not going to teach you how to build a model, because like I said, Stan is really well documented. However, I will show you the code for my model. I use R for virtually all of my data work. Therefore, I also use the rstan interface to Stan. Below I define an object model within R that is one long string containing the Stan model code.

model <- "
  data{
    int N;
    vector[N] y;
    int team_i[N];
    int team_j[N];
    int h_i[N];
    int h_j[N];
    int N_g;
  }
  parameters{
    vector[N_g] alpha_raw;
    vector[N_g] theta_raw;
    real eta;
    real<lower=0> tau_theta;
    real<lower=0> tau_alpha;
    real<lower=0> sigma;
  }
  transformed parameters{
    vector[N_g] alpha;
    vector[N_g] theta;
    alpha = eta + alpha_raw*tau_alpha;
    theta = theta_raw*tau_theta;
  }
  model{
    // vector for conditional mean storage
    vector[N] mu;

    // priors
    tau_theta ~ cauchy(0,1)T[0,];
    tau_alpha ~ cauchy(0,.25)T[0,];
    sigma ~ cauchy(0,1)T[0,];
    eta ~ normal(4,1);
    theta_raw ~ normal(0,1);
    alpha_raw ~ normal(0,1);

    // define mu for the Gaussian
    for( t in 1:N ) {
      mu[t] = (theta[team_i[t]] + alpha[team_i[t]]*h_i[t]) - 
              (theta[team_j[t]] + alpha[team_j[t]]*h_j[t]);
    }

    // the likelihood
    y ~ normal(mu,sigma);
  }
"

If you look closely at the Stan model code, you'll find some very familiar structure, as well as some unexpected things. In particular, you should be able to link the code in the for loop within the model block to the first regression-like line of the model as defined in the The Model section. The unexpected things might include the raw parameters and the transformed parameters code block. These make up a subtle reparameterization, known as the non-centered parameterization, that results in an equivalent model but that Stan's sampler handles more efficiently in hierarchical settings like this one.
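For the curious, the equivalence behind the reparameterization is:

\[ \theta_{\text{raw}} \sim N(0,1), \quad \theta = \tau_\theta \, \theta_{\text{raw}} \quad \Longleftrightarrow \quad \theta \sim N(0, \tau_\theta) \]

and likewise \(\alpha = \eta + \tau_\alpha \, \alpha_{\text{raw}}\) with \(\alpha_{\text{raw}} \sim N(0,1)\) implies \(\alpha \sim N(\eta, \tau_\alpha)\), which recovers the priors written out earlier.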

With the model code defined, I need to package the data in a way that rstan will expect (as an R list). This process includes some subtle details depending on how the known variables are defined in the data block of the model code. I refer you to the documentation for the details. In this case, it is pretty simple.

sd <- list(N=       nrow(d),
           y=       d$margin,
           h_i=     d$homei,
           h_j=     d$homej,
           team_i=  d$teami,
           team_j=  d$teamj,
           N_g=     nrow(teams))

Check out the connection between what's in the data block of the model code and what is defined in the list called sd above.

Believe it or not, I'm ready to make this thing go and estimate all those parameters of our model. Below I call Stan using the model code model and data sd.

require(rstan)
# fit the model
fit <- stan(model_code=model, iter=800, warmup=500, data=sd, refresh=100, chains=4, cores=4)

Now that we've drawn 1200 samples from the posterior distribution of the model parameters (what I referred to as “estimation” above), we can examine some summary statistics of them. I do not print out summary statistics for \(\alpha\) or \(\theta\) because they are both composed of 351 elements. Instead I'll visualize some of them later.

print(fit, c('eta','tau_theta','tau_alpha','sigma'))
## Inference for Stan model: 3c77c0af3f8b270dd43416701e15c0cb.
## 4 chains, each with iter=800; warmup=500; thin=1; 
## post-warmup draws per chain=300, total post-warmup draws=1200.
## 
##            mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## eta        3.78    0.01 0.19  3.41  3.66  3.78  3.92  4.15  1200 1.00
## tau_theta  8.61    0.03 0.38  7.88  8.35  8.60  8.87  9.35   169 1.02
## tau_alpha  0.34    0.03 0.34  0.01  0.09  0.22  0.50  1.28   182 1.01
## sigma     10.70    0.00 0.12 10.48 10.62 10.70 10.77 10.95  1200 1.00
## 
## Samples were drawn using NUTS(diag_e) at Tue Feb 27 16:34:30 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Starting at the top, \(\eta\) (eta) is the mean parameter for the hierarchical prior for \(\alpha\), which means it refers to the population average home-court advantage in points. The mean column in the output summarizes the central tendency of the samples for this parameter, and I'll casually refer to this as the “estimate”—about \(3.8\) give or take two standard deviations, which is about \(0.4\) points. Because I like to win, I'd definitely prefer to play on my home-court rather than my opponent's.

I now move out-of-order to \(\tau_\alpha\) (tau_alpha), the standard deviation parameter for the hierarchical prior for \(\alpha\). The mean summary of this parameter is about \(0.3\), meaning the home-court advantage varies across teams by only a little over half a point on either side of the mean (\(\eta=3.8\)). This is a lot less variable than Jimmy Boyd's estimates found here. It's probably safe to simplify the model such that the home-court advantage is fixed for all teams.

The estimate for \(\tau_\theta\) (tau_theta) is the standard deviation parameter for the hierarchical prior for \(\theta\). The estimate is about \(8.6\), meaning a team's \(\theta\) (my estimate of the team's true skill) could vary above or below \(0\) by about \(17\) units (\(2\) standard deviations). But this isn't particularly meaningful to me. \(\tau_\theta\) simply scales the distribution of \(\theta\) so it maps appropriately to score margins, net of other model features.

Finally, \(\sigma\) (sigma) is the standard deviation parameter for the residual of the margin that the systematic components of the model could not explain. Ideally this will be small. I suppose \(10.7\) seems high, but I'll use other methods to determine how good the model is.

Now I want to visualize \(\theta\) so I can compare the team rankings of my model to other rankings.

theta <- extract(fit)$theta
colnames(theta) <- teams$name

summary_theta <- apply(theta, 2, function(x) c(mean=mean(x), quantile(x,c(.025,.975,.25,.75))))
top25 <- summary_theta[,order(summary_theta[1,], decreasing=T)[1:25]]

par(mai=c(1,2,.75,.75))
plot(NA, xlim=c(min(top25),max(top25)), ylim=c(1,25), ylab='',
  xlab='skill', yaxt='n')
title(main=expression("team-specific "*theta*" with 50% & 95% credible intervals"), cex.main=.85)
for(i in seq(ncol(top25))) {
  lines(c(-100,100), rep(26-i,2), lty=3, lwd=.5, col=rgb(.5,.5,.5))
  lines(top25[2:3,i],rep(26-i,2), lwd=1)
  lines(top25[4:5,i],rep(26-i,2), lwd=3)
  points(top25[1,i],26-i, cex=1, pch=20, col=rgb(.6,0,0))
  axis(2, at=26-i, label=paste(colnames(top25)[i],i),las=2,cex.axis=.8)
}

[Figure: team-specific \(\theta\) for the top 25 teams, with 50% and 95% credible intervals]

I have Villanova on top, and this is after their unexpected losses to St. John's and Providence, neither of which appears in my top 25. It's also notable that, while my top 25 includes a lot of the same teams found in a rank order I pulled up on Google, many of my specific rankings differ. My rankings are perhaps more similar to those of the LRMC from Georgia Tech. I'm going to be optimistic and believe for now that the differences are due to my model capturing something important that others do not. The tournament will decide.

The model is Bayesian, so once fitted, we should be able to ask it the probability that some team \(a\) will beat another \(b\). Incidentally, this is subtly different than asking the probability that team \(a\) is better (has a higher skill) than team \(b\). Below I extract the parameter samples we need and define a function that should allow us to make such probabilistic predictions.

theta <- extract(fit)$theta
# assign names to the columns of the matrix of theta samples
# assuming the teams data frame is ordered (increasing) by tid
colnames(theta) <- teams$name

alpha <- extract(fit)$alpha
colnames(alpha) <- teams$name

sigma <- extract(fit)$sigma

# now define a function to calculate the probability
# it adds in match level noise (epsilon ~ N[0,sigma])
# so we repeat the calc repl times and average
compare <- function(i,j,th,a,sig,homei=0,homej=0,repl=10) {
  reps <- replicate(repl, mean((th[,i] - th[,j] + a[,i]*homei - a[,j]*homej + 
                                rnorm(nrow(th),0,sig)) > 0))
  setNames(c(mean(reps),sd(reps)/sqrt(repl)), c(paste(i,'>',j),'mcmcSE'))
}
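As a side note on the distinction mentioned above (between beating a team and being more skilled than it), the probability that one team is more skilled can be read directly off the \(\theta\) samples; a sketch using the objects just defined:

# P(team a's skill exceeds team b's): compares theta draws only,
# with no epsilon noise, so it differs from a win probability
mean(theta[,"Ohio_St"] > theta[,"Villanova"])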

My alma mater is Ohio State University, which my model ranks \(17^{th}\). I wonder how they'd fare against number one ranked (by my model) Villanova on neutral ground (i.e., neither team gets a home-court advantage).

compare('Ohio_St','Villanova', theta, alpha, sigma, homei=0, repl=100)
## Ohio_St > Villanova              mcmcSE 
##         0.252600000         0.001366038

A \(25\%\) chance. What if Ohio State had a home-court advantage?

compare('Ohio_St','Villanova', theta, alpha, sigma, homei=1, repl=100)
## Ohio_St > Villanova              mcmcSE 
##         0.371525000         0.001275473

At home the chances improve substantially to \(37\%\). A rational person who thinks my model is reasonable probably wouldn't put her money on Ohio State, though.

Cross-Validation

I can use the compare() function that I defined above to generate predictions for any match-up among the \(351\) teams, even those that have not been observed. I temporarily set a decision rule for prediction: bet on the team to which my model gives the higher probability of victory in a given match, even if that team has only a \(50.1\%\) chance of winning (here's a nice post explaining why this may not be ideal in a March Madness bracket pool). With that rule, I can check the expected accuracy of my model by using cross-validation. The variant I use can be called Monte Carlo cross-validation, and it consists of repeatedly splitting the data set into an 80% training set and a 20% testing set, fitting the model on the training set, and then predicting the outcomes of the matches in the testing set. The accuracy will vary from run to run due to randomness in the train/test splitting, so the procedure needs to be repeated many times. The resulting accuracy estimates are averaged to obtain a reasonable estimate of how good my model is at predicting the outcomes of new games.

First I need a function that can take in a held-out testing set of data and use compare() for every match record it contains.

cv <- function(test, a, adv, sig) {
  # for reference:
  # columns 3 & 6 of 'test' are team IDs for teami & teamj
  # columns 4/7 indicate whether teami/teamj played at home
  # column 9 is the margin
  res <- apply(test, 1, function(x) {
    igtj <- compare(x[3],x[6],a,adv,sig,homei=x[4],homej=x[7],repl=20)[1]
    if(sign(x[9]) == 1) return(igtj > .5)
    if(sign(x[9]) == -1) return(igtj < .5)
  })
  c("% correct"=mean(res))
}

The code below does all of the work of repeatedly splitting the data into training and testing sets, fitting the model on the training set, testing the accuracy on the testing set, and saving the results for later use.

# I store the accuracy estimates in a vector
# called MCCV (Monte Carlo Cross Validation)
MCCV <- NULL

# for loop, 50 iterations
for(i in seq(50)) {

  # the data were organized such that the winner
  # is always recorded as teami. While I don't 
  # think my model structure can be biased from
  # that, I nevertheless shuffle these up for
  # each iteration
  # 1. randomly split the data in half
  sss <- sample(nrow(d),.5*nrow(d))
  d1 <- d[sss,]
  d2 <- d[-sss,]
  # 2. flip the team labels in one half
  names(d2) <- c("gid", "date", "teamj", "homej", 
    "scorej", "teami", "homei", "scorei", "margin")
  # 3. reorder the columns so the labels once again
  # align with those of d1
  d2 <- d2[,c("gid", "date", "teami", "homei",
    "scorei", "teamj", "homej", "scorej", "margin")]
  # 4. flip the margin for the matches we reversed
  d2$margin <- d2$margin * -1
  # 5. recombine the data
  d <- rbind(d1,d2)

  # hold out 20% of data points as a test set
  testss <- sample(nrow(d),.2*nrow(d))
  test  <- d[testss,]
  train <- d[-testss,]

  # build the Stan data object
  sd <- list(N=nrow(train),
             y=train$margin,
             h_i=train$homei,
             h_j=train$homej,
             team_i=train$teami,
             team_j=train$teamj,
             N_g=nrow(teams))

  # fit the model
  fit <- stan(model_code=model, iter=700, warmup=400, 
    data=sd, refresh=100, chains=4, cores=4,
    control=list(adapt_delta=.85))

  # extract the samples we need for testing
  a <- extract(fit)$theta
  colnames(a) <- teams$name
  adv <- extract(fit)$alpha
  colnames(adv) <- teams$name
  sig <- extract(fit)$sigma

  # calculate and capture the accuracy estimate
  MCCV[i] <- cv(test,a,adv,sig=sig)

  # print the current accuracy estimate
  # as feedback while I wait
  cat("\nIteration ",i,": ",MCCV[i], sep='')
}

Below I summarize the cross-validation results.

# mean accuracy
mean(MCCV)
## [1] 0.7513864
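# note: sd() below still finds base R's standard-deviation function;
# the list named `sd` built earlier for Stan doesn't mask it, because
# R searches specifically for a *function* when evaluating sd(MCCV)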
# standard deviation of accuracy
sd(MCCV)
## [1] 0.01260453
# standard error of accuracy
sd(MCCV)/sqrt(length(MCCV))
## [1] 0.00178255

Oh boy… well, I'm almost on par with Adit Deshpande's \(76.4\%\). I expect to get about \(75\%\) of my predictions correct. But hey, this really isn't bad at all considering my data consist solely of outcome score margins and an indication of whose court each game was played on. My moderate accuracy doesn't mean I'm guaranteed to fare poorly in a pool, but it also doesn't give me a ton of confidence. That's OK, because this has been a lot of fun so far!

The next step is to build out my bracket (the official bracket is not out yet). I want to plug this post once more because I think it contains some important considerations for building a bracket that go beyond simply always choosing the team with the highest probability of a win in a particular match.

While I am waiting for the official bracket to come out, I will be thinking about how I can make this model better. I have a couple of ideas, and they won't appear in this post. But if you're interested, or have your own ideas you want to share, let me know.

As I said earlier in the post, feel free to give any comments or feedback (EMail: doogan.1@osu.edu, Twitter: @ndoogan). I have thick skin, so even harsh criticism will be taken if useful.