Quite a few people have posted articles or BGG threads with the results of their analysis of OCTGN data. There have been at least a half-dozen since the last data dump. In that span, I’ve been slowly working on R code to process and analyze the data, since once I have code written, it will work for all future OCTGN data dumps as well as the current one.
To this point, I have:
- Basic cleanup: removing games with 0 deck size, etc.
- Group the games into periods (e.g. week or month, not OCTGN version yet).
- Compute Glicko ratings for each player through each period, saving rating history data along with the final ratings after the most recent period.
- Filter out players with high Glicko rating deviations (i.e. players whose ratings I'm not confident are accurate). Currently this removes the top 25% of players by deviation, leaving ~3,400 players in the most recent dataset.
- Filter the games by any subset of players – currently I take only the top 25% of rated players, but this is trivial to change. This leaves ~800 players in the current dataset.
- Group the data by ID and compute per-period stats: games played, win percentage, and percentage of wins by flatline.
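The cleanup/grouping/summarising steps above boil down to a short dplyr pipeline. This is just a sketch on a toy data frame — the column names (`identity`, `deck_size`, `game_start`, `won`, `flatline_win`) are stand-ins, since the real OCTGN dump uses its own schema:

```r
library(dplyr)

# Toy stand-in for the OCTGN dump; real column names will differ.
games <- data.frame(
  identity     = c("Kate", "Kate", "Kate", "NBN", "NBN", "HB"),
  deck_size    = c(45, 45, 45, 49, 49, 0),
  game_start   = as.Date(c("2013-01-02", "2013-01-09", "2013-01-20",
                           "2013-01-05", "2013-01-12", "2013-01-15")),
  won          = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  flatline_win = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

per_period <- games %>%
  filter(deck_size > 0) %>%                        # basic cleanup
  mutate(period = format(game_start, "%Y-%m")) %>% # monthly buckets here
  group_by(identity, period) %>%
  summarise(games_played = n(),
            win_pct      = mean(won),
            flatline_pct = mean(flatline_win[won])) # share of *wins* by flatline
```

Swapping the `format()` string gets you weekly buckets instead of monthly, which is the kind of one-line change that makes re-running the analysis on new dumps cheap.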
What I’m working on now is validation. I ran the analysis monthly vs. weekly and compared the median and mean games played per player to see whether the period length was affecting the Glicko calculations — per the algorithm’s creator, Glicko works best when most players play 5-10 games per period. Weekly came closer to this than monthly, but there is still a large spread in games played, even after filtering out players who played fewer than 5 games (something like 5,000 players!).
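That sanity check is easy to express as a function of the bucketing scheme. Again a sketch on toy data with made-up column names, but it shows the shape of the comparison:

```r
library(dplyr)

# Toy game log; hypothetical columns, real dump fields will differ.
log <- data.frame(
  player = rep(c("alice", "bob"), each = 6),
  date   = rep(seq(as.Date("2013-01-01"), by = "5 days", length.out = 6), 2)
)

# Median/mean games per player per period, for a given bucketing format.
check_period <- function(df, period_fmt) {
  df %>%
    mutate(period = format(date, period_fmt)) %>%
    group_by(player, period) %>%
    summarise(n = n()) %>%
    ungroup() %>%
    summarise(median_games = median(n), mean_games = mean(n))
}

check_period(log, "%Y-%U")  # weekly buckets
check_period(log, "%Y-%m")  # monthly buckets
```

Whichever bucketing puts the median closest to the 5-10 games/period sweet spot is the better fit for Glicko.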
Next, I’m planning to use the number of games played per ID per period to put confidence intervals on my win-rate and flatline calculations.
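For the win-rate case, treating each period’s record as binomial means base R already provides an exact interval via `binom.test`. The counts below are made up, purely to show the shape of the calculation:

```r
# Exact (Clopper-Pearson) 95% confidence interval on a true win rate,
# given wins and games played in a period. Numbers are illustrative.
wins  <- 37
games <- 60
ci <- binom.test(wins, games)$conf.int
ci  # interval around the observed 37/60 ≈ 0.617 win rate
```

The interval narrows as games played goes up, which is exactly why the games-per-ID-per-period counts matter: IDs with few recorded games will have win-rate estimates too wide to say much.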
Some of what I’ve done to this point overlaps with work posted here and on BGG, but since I’m also using this to learn a new R package (dplyr), I’m not too worried about that. I’m also planning to release the code on GitHub once I have something that’s worthy of a 1.0, and as a longer-term item I’m tinkering with building an interactive web app so that people can slice up the data in some of the ways I’ve described above in real time. The nice thing about all of this is that any new code builds on old code, so even if the web app itself takes a while, all of the work I do between now and then supports it.
As a side note, I also have generalized code for computing the cumulative univariate and multivariate hypergeometric distributions, thanks to some help from jimmypoopins, meaning that I can do some fairly robust probability calculations for deck construction (evidence in the TWIY* thread). There’s nothing stopping me from linking this to my OCTGN analysis code, but there isn’t much to be done with it unless future OCTGN dumps include more detailed deck information; currently the only deck data we have is the number of agenda cards in the Corp player’s deck and the total deck sizes.
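For the univariate case, base R’s `phyper` already covers it (the multivariate version is where custom code comes in). For example, the chance of seeing at least one copy of a 3-of in a 5-card opening hand from a 45-card deck:

```r
# P(at least one success) = 1 - P(zero successes) under the
# hypergeometric distribution: 3 copies, 42 other cards, draw 5.
p_at_least_one <- 1 - phyper(0, m = 3, n = 42, k = 5)
p_at_least_one  # roughly a 30% chance
```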
So what else would people like to see from the data? I have everything in the original OCTGN data to work with — deck size, agenda card count, game duration, turns, etc. Some of those things have already been analyzed by others, but would be fairly easy for me to incorporate into my code. Is there anything that hasn’t been done yet that you would like to see? Anything missing from previous analyses, or extensions of them?