Baseball: A Data Treasure Trove
There is no shortage of free baseball data available on the internet.
Some may hate baseball color commentators for ridiculous insights along the lines of: “The Tampa Bay Rays are playing .750 ball when Chris Archer pitches on a Tuesday in the rain after a full moon”.
I love it, because it means that somewhere that data is being stored and curated, no matter how nonsensically it may be used.
Sites like FanGraphs and Baseball Reference host just about every individual and team metric you could ever dream up. From oldies like AVG and ERA to new kids on the block like SIERA, xFIP, and wOBA, it’s all there and freely available.
The Statcast era brought terms like “Exit Velocity”, “Launch Angle”, and “Spin Rate” into the public eye. Brooks Baseball provides the tools for visualizing and quantifying that data.
Even Major League Baseball itself provides all of the data used in the popular At Bat app for free online.
Of course, there are tons of paid services available… but why pay when we don’t have to?
Historical vs. “Day Of” Data
With all that data, the real trick is narrowing down to what we actually need and retrieving it freely.
For Charlie Hustle, the goal is to use what we know about two teams prior to a game’s start to predict who we expect to win and to determine what the payout would be for a bet placed on each team.
We need to retrieve this data not only for the games on a given day on which we’re making picks, but also historically. Once the season starts in April we’ll pull daily data in real time, but until then we’ll rely on data from the past to test our predictions.
That means that for our historical data we need to be careful not to pull any information we wouldn’t actually have had before the game started.
This can be trickier than it sounds.
As an example, imagine we’re analyzing the May 22, 2017 game between the Houston Astros and the Detroit Tigers. Brad Peacock, normally a reliever for Houston, is making his first start of the season for the injury-riddled Astros rotation. To this point Peacock has pitched around 1.2 innings in each appearance and struck out 2-ish batters. On May 22, he’ll pitch 4.1 and strike out 8.
On May 22, 2017 there was absolutely no way for us to know that Peacock would strike out 8 in his first start. However, immediately after the game that information was freely available online.
Houston was throwing a reliever in his first start against Michael Fulmer, the Tigers’ ace. Without knowing how Peacock would perform, the Tigers had a definitive edge at the Starting Pitcher position.
If you made a prediction for May 22, 2017 based on what is known today about May 22, 2017…
At that point we may as well be asking: “On May 22, 2017 the Detroit Tigers visited the Houston Astros and lost the game on a score of 0-1. Who do you think won the game?”.
The danger here is that it’s very easy to accidentally pull too much information into our historical predictions, which results in a “prediction” model that looks way better than it could ever actually be.
When analyzing games historically, we’ll use statistical information from the day before and what I’ll call “Program Information” (Team Matchup, Starting Pitchers, Lineups, Game Time, Location, etc.) from the day of.
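One way to enforce that rule in code is to snapshot each team's stats before folding in that game's result. This is only a minimal sketch: the tuple layout, the function name `pregame_records`, and the two fictional teams and their scores are all made up for illustration.

```python
from collections import defaultdict

def pregame_records(games):
    """Yield each game with both teams' win-loss records before first pitch.

    `games` is an iterable of (date, away, home, away_runs, home_runs)
    tuples sorted by date. This layout is hypothetical.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    for date, away, home, away_runs, home_runs in games:
        # Snapshot first: this is all we could have known before the game.
        yield {
            "date": date,
            "away": away,
            "home": home,
            "away_record": (wins[away], losses[away]),
            "home_record": (wins[home], losses[home]),
        }
        # Only now fold the result in, so it affects later games only.
        winner, loser = (away, home) if away_runs > home_runs else (home, away)
        wins[winner] += 1
        losses[loser] += 1

# Made-up results for two fictional teams.
sample_games = [
    ("2017-05-20", "AAA", "BBB", 3, 5),
    ("2017-05-21", "AAA", "BBB", 4, 2),
    ("2017-05-22", "AAA", "BBB", 0, 1),
]
snapshots = list(pregame_records(sample_games))
```

The ordering inside the loop is the whole point: the snapshot is taken before the score is read into the running totals, so a game's own result can never leak into its own "prediction" inputs.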
What Data Do We Need? Where Do We Get It? How?
I narrowed it down to a few key pieces of information that need to be gathered for every team in every game:
- Team Record to date
- Total Runs Scored/Allowed to date
- Starting Pitcher
From these data points we can derive more complex data like:
- Team Winning Percentage
- Expected Team Winning Percentage (based on the Pythagorean Theorem of Baseball)
Both are great, widely accepted metrics for comparing how a team looks on paper vs. how they’re actually performing.
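For reference, the Pythagorean expectation in its classic form is runs scored squared divided by the sum of runs scored squared and runs allowed squared. Here is a small sketch of both metrics. The function names are mine, the exponent of 2 is the classic version (some variants use a slightly smaller exponent), and the fallback values for a team with no games yet are an assumption.

```python
def winning_percentage(wins, losses):
    """Actual winning percentage to date."""
    games = wins + losses
    # Assumption: report 0.0 for a team that has not played yet.
    return wins / games if games else 0.0

def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    total = rs + ra
    # Assumption: call it a coin flip when no runs have been recorded.
    return rs / total if total else 0.5
```

A team that has scored 200 runs and allowed 100 comes out at 0.8, so if its actual winning percentage is well below that, it is underperforming what its run totals suggest.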
On top of all this we need the moneyline for each team, and historical moneyline data turned out to be the trickiest to find…
The technique I used for gathering all of this data fo’ free is called “web scraping”.
When you type a URL into your browser and hit “Enter” a whole bunch of magic happens in the background (explaining that magic is a common question asked in Software Development interviews, sup Microsoft Internet Explorer team), but once the page loads on your computer all it is is a wall of text.
The browser interprets the wall of text and renders it all pretty so you can scan through and find the information you’re looking for.
For example, let’s say you wanted to know what the moneyline was for that May 22, 2017 game between the Tigers and Astros.
You might start by Googling “Tigers Astros May 22, 2017” and after clicking on a couple links (MLB.com doesn’t have moneylines, obviously. This Rule 21 posting condemning sports betting can be found in every single Major League clubhouse.) you’ll find what you’re looking for on this ESPN page way down at the bottom. HOU -138 (We’ll go more in depth on what this number means in a later piece).
Great! That’s all we need, right? WRONG!!!!
First, we’ll get into what web scraping is and why this sucks from that perspective.
Remember how you navigated to that ESPN page and scanned the page for moneyline data? That won’t work with a web scraper, at least not efficiently.
A web scraper works in much the same way as a person when finding data but needs more specific direction. It opens a specific URL and looks for specific elements on a page. This isn’t the place for an in-depth exploration of HTML structure or web scraper functionality, so for the purposes of this post we’ll just say that telling the web scraper to find something called “odds-details” at the URL http://www.espn.com/mlb/game?gameId=370522118 is the same as the process we went through when trying to find this data manually above.
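To give a rough feel for "find something called odds-details", here is a sketch using only Python's standard library. It runs against a made-up HTML snippet rather than the real ESPN page. The markup, the assumption that "odds-details" is a CSS class, and the `ClassTextExtractor` name are all hypothetical. A real scraper would first download the page (for example with `urllib.request.urlopen`) and feed that text in instead.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside the target element
        self.done = False   # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

# Made-up markup standing in for the downloaded page.
sample_html = """
<div class="game-info">
  <div class="odds-details"><ul><li>Line: HOU -138</li></ul></div>
  <div class="weather">Indoors</div>
</div>
"""
parser = ClassTextExtractor("odds-details")
parser.feed(sample_html)
odds_text = " ".join(parser.chunks)
```

Everything outside the targeted element is ignored, which is exactly the "more specific direction" a scraper needs compared to a person skimming the page.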
The key here is that the web scraper needs to know exactly where it’s going to retrieve this information. No Google search for “Tigers Astros May 22, 2017” here. Instead, we rely on parameters in the URL to get us where we need to be.
The base of the page URL is: http://www.espn.com/mlb/
To access a specific game another piece is appended to the end: game?gameId=370522118
This basically just says “give me the page for the game represented by gameID 370522118”.
Which isn’t super useful :/
Ideally, I’d like to be able to use a date and teams to retrieve this information. Not a spooky gameID.
*Sidenote: for anyone technical who’s like “bruh there’s a pattern for gameID there and it’s probably something like [awayTeamID][month][day][homeTeamID], ur a idiot you can definitely use this”… shut up and get back in your locker. There’s other reasons this page won’t work.*
Additionally, this page only provides the moneyline data for the winning team. For our purposes, we need the pre-game lines for both teams.
Fortunately, after plunging through sports betting forums and weighing several paid odds providers (turns out sports betting services want to make money on the info they provide), I happened upon covers.com.
Covers provides historical data for all MLB games in an easy-to-read format! Sweet!
For instance, I can access all of the moneyline data and more for Houston’s 2017 season here.
And the URL is super simple to parse: https://www.covers.com/pageLoader/pageLoader.aspx?page=/data/mlb/teams/pastresults/2017/team2981.html
I figured out which numeric value maps to each team, and can retrieve a whole season’s worth of data from a single web page. The Covers page will give me: Game Dates, Opponents, Scores, Results, Lines, O/Us, and Starting Pitchers for each game in a team’s season. From there, I can calculate other values like Winning Percentage and Expected Winning Percentage.
*Starting Pitchers data was also tricky to find. Most historical game data contains Winning and Losing pitcher information. I was really stoked that Covers included this.
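Since only the season and the team number change in that URL, building it for any team can be sketched like this. The template is inferred from the single URL above, the Houston ID is read off that same URL, and the other teams' IDs are deliberately left out because they aren't listed in this post. The names `COVERS_TEMPLATE`, `COVERS_TEAM_IDS`, and `covers_url` are mine.

```python
# Pattern inferred from the one example URL in this post (an assumption).
COVERS_TEMPLATE = (
    "https://www.covers.com/pageLoader/pageLoader.aspx"
    "?page=/data/mlb/teams/pastresults/{season}/team{team_id}.html"
)

# Only Houston's ID appears in the post; the rest would be filled in by hand.
COVERS_TEAM_IDS = {"HOU": 2981}

def covers_url(team, season):
    """Return the past-results URL for one team and one season."""
    return COVERS_TEMPLATE.format(season=season, team_id=COVERS_TEAM_IDS[team])

houston_2017 = covers_url("HOU", 2017)
```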
For each season, I’ll cycle through every team’s Covers page and build a database containing all of this information. I’ll save the database to a CSV (Comma Separated Values) file so that I can access the information later without needing to scrape the page again. This makes accessing the data faster and also ensures I’ll have the data saved and available in case the web page changes.
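Saving and reloading scraped rows could look roughly like the sketch below, using Python's built-in `csv` module. The column names are my own guess at a layout, and the single row holds only the details of the May 22 game mentioned in this post (the over/under is left blank because it isn't given here). An in-memory buffer stands in for a real file; on disk you would use something like `open(path, "w", newline="")` instead.

```python
import csv
import io

# Hypothetical column layout for one team's season file.
FIELDS = ["date", "opponent", "score", "result", "line", "over_under", "starter"]

def write_season_csv(rows, handle):
    """Write scraped rows (a list of dicts) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def read_season_csv(handle):
    """Read rows back as a list of dicts. Every value comes back as a string."""
    return list(csv.DictReader(handle))

rows = [{
    "date": "2017-05-22",
    "opponent": "DET",
    "score": "1-0",
    "result": "W",
    "line": "-138",
    "over_under": "",  # not given in this post, so left blank
    "starter": "Brad Peacock",
}]

buffer = io.StringIO()  # stands in for a file on disk
write_season_csv(rows, buffer)
buffer.seek(0)
loaded = read_season_csv(buffer)
```

One thing to keep in mind with this approach is that CSV has no types, so numeric columns like the line need to be converted back with `int()` after loading.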
Next, I’ll use MLB At Bat data to combine all of these team CSV files into a game-by-game representation of the entire 2017 season. Since we’ll be running simulations on this data, we’ll want to have it stored in such a way that we can step through each game just as we would during a live season.
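Every game shows up twice in the team files, once from each side, so the merge step has to pair those rows up and sort them by date. The sketch below shows one way that pairing could work. It is not the At Bat based approach described above, and the row fields (`team`, `opponent`, `is_home`, `runs_for`) are hypothetical. It also ignores doubleheaders, which would need a game number in the key.

```python
def merge_team_rows(team_rows):
    """Combine per-team rows into one record per game, in date order.

    Each row is a dict with hypothetical keys: date (ISO string), team,
    opponent, is_home, runs_for.
    """
    games = {}
    for row in team_rows:
        if row["is_home"]:
            home, away, side = row["team"], row["opponent"], "home"
        else:
            home, away, side = row["opponent"], row["team"], "away"
        # Both teams' rows for the same game map to the same key.
        # Caveat: two games between the same teams on one date would collide.
        key = (row["date"], away, home)
        game = games.setdefault(
            key, {"date": row["date"], "away": away, "home": home}
        )
        game[side + "_runs"] = row["runs_for"]
    # ISO dates sort chronologically as plain strings.
    return sorted(games.values(), key=lambda g: (g["date"], g["home"]))

# The May 22 game as it would appear in each team's file.
team_rows = [
    {"date": "2017-05-22", "team": "HOU", "opponent": "DET", "is_home": True, "runs_for": 1},
    {"date": "2017-05-22", "team": "DET", "opponent": "HOU", "is_home": False, "runs_for": 0},
]
season = merge_team_rows(team_rows)
```

With the season in that shape, a simulation can just loop over the list from the first game to the last, the same order we would see the games in during a live season.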
I also stumbled upon a giant CSV containing all of FiveThirtyEight’s MLB projections. Since it was free, available, and required no work by me, I stole a few values from it (Team Rating, Win Probability, and Pitcher Quality) to add to my database. Below is 538’s projection for that May 22nd game; note the advantage given to Tigers pitching 🙂
And there we have it! A game by game representation of any MLB season containing:
- 538 Starting Pitcher Ratings
- 538 Win Probabilities
- 538 Team Ratings
- Winning Percentages
- Expected Winning Percentages
- Winning Team
- Game Score
- Total Runs Scored for Team
- Total Runs Allowed for Team
- Average Runs Scored for Team
- Average Runs Allowed for Team
The cool thing is that this data is really easily manipulated to include/exclude whatever is necessary to improve our predictions.
Next time, we’ll use this data to simulate entire seasons and test the effectiveness of basic methods of picking games.
Please provide feedback if things are too technical (or not technical enough). Or if there are pieces you found interesting that you’d like me to write more about, I’m happy to dive into any area in more detail.