Stats for Baseball Fans: Pitching Edition

A data scientist shows that ERA is the most important stat to look at as a casual fan

Courtney Perigo
Towards Data Science

Photo by Chris Moore on Unsplash

It’s baseball time again. Where I’m writing from, Chicago, the snow has started to melt and the Cubs are giving us hope in Spring Training in Arizona.

Of course it’s time to dust off the stats records and take a look at baseball and statistics with a data scientist.

As I described in my earlier blog on offensive statistics, Moneyball statisticians look at a different set of metrics to assess the skill of a player. Pitching is no different. Metrics like “Runs Allowed per 9 Innings” and “Adjusted Wins Above Replacement” allow those analysts to understand the pitcher’s true impact by controlling for a team’s defensive capability or the game state when the pitcher enters the game.

Unfortunately, those metrics are not available on national TV broadcasts and you won’t be able to calculate them while enjoying the game live. So what should the average fan focus on when assessing MLB pitchers!? In this post, I explain the single metric to focus on — with a little help from statistics.

Why did I write this?

What’s a baseball article doing on a data scientist’s blog? It’s because I have merged something I grew up loving (baseball) with something I built my career on (data science).

I grew up as a baseball card collector. My Dad spent his hard-earned cash getting me a pack or two after work every now and then to share with a son who played Little League and took a liking to the hobby.

The thing about baseball cards is they are filled with numbers. Metrics that broke down every little detail about how a player plays the game. I’d compare and contrast my favorite players and some of the most obvious stats would jump out. Frank Thomas, a huge baseball player from Georgia, would crush a ton of home runs and his cards would show that. Nolan Ryan, a Texas hurler, would have a ton of strikeouts listed on his cards.

Even someone who looks at these stats as a hobby has trouble knowing what to focus on to understand the best players in the game.

Today’s blog is a follow-up to my previous blog looking at the best batters in the game, and I’ll make it simple for the most casual of fans out there to answer this question:

What is the one statistic you need to focus on to understand the best pitcher on your team? That stat is the player’s ERA, but let’s go on a journey to understand why that is, using statistics.

TLDR: The player with the lowest ERA statistic on your favorite team is VERY LIKELY the best pitcher they have. Remember to stand up and cheer when you hear their entry music.

Pitching stats are information overload.

Photo by Jason Weingardt on Unsplash

Pitching stats ARE insane. Major League Baseball uses high powered cameras to collect velocity and spin of pitches thrown at all major league baseball stadiums. The same system also measures where each player is on the field at all times.

As an analyst and data scientist, this data fascinates me. For a fan looking to just enjoy the game, it looks cool on TV, but there’s no way to crunch all that data. Fortunately for the casual fan, there happens to be a single metric you can rely on to understand the best pitcher on any team.

Of course, to understand who’s best you could rely on the manager’s decisions. Who they put at the top of the rotation, the pitcher called on in closing situations, or who the starter is for a crucial match up with a rival team. That works as well, but we’re analysts and we like to know why that decision was made. Maybe even understand who on the bench should’ve been called on. So how do we cut through the noise and assess the player who simply prevents runs from being scored against my team?

First, we have to fix a few issues with statistics and baseball.

To analyze the data effectively, we have to deal with the human element of baseball.

There’s a problem with using pitchers’ individual performances to understand how valuable pitching stats are.

Photo by Nicole Green on Unsplash

Players get sick. They take days off. The manager pulls them out of a game because they’re having a bad game. Some have a little too much fun the night before the game.⁴

For our pitcher analysis, we need to once again control for human variability. Instead of analyzing pitchers, we will analyze teams’ pitching capability to identify the metric to focus on.

We focus on regular season team performance because, on average, teams during a regular MLB season are comparable. Teams generally play the same number of games, have the same opportunity to score runs and the same opportunity to have runs scored against them. As a result, team-level data is approximately normally distributed, and normally distributed data is a statistician’s best friend.

Here’s a visual of player performance versus the same metric at the team level. One shows a wide variety of individual player performance, with a large number of players scoring very few runs and some amazing players scoring over 2,000! The other shows that team performance tends to be normally distributed around the average of ~700 runs per team in a season.

Image by author

Pitching = Not Letting the Other Team Score Runs.

Dictionary.com defines baseball in this way:

(Baseball is) a game of ball between two nine-player teams played usually for nine innings on a field that has as a focal point a diamond-shaped infield with a home plate and three other bases, 90 feet (27 meters) apart, forming a circuit that must be completed by a base runner in order to score, the central offensive action entailing hitting of a pitched ball with a wooden or metal bat and running of the bases, the winner being the team scoring the most runs

The best pitcher will be the one that does the best job of preventing runs from being scored against their team. — Me

We focus on runs scored because that is the single most important objective of baseball: to hit runners in and score runs. The team with the most runs wins the game. The pitcher’s job is to prevent this from happening. In this analysis of pitchers, our objective function will be runs scored against the pitcher’s team (i.e., runs against). Our goal is to find the pitching statistic that is most correlated with runs against. Our hypothesis is that a pitcher’s ability to prevent hits without walking players will signal a strong pitcher.

The best pitcher on the team likely has the lowest ERA.

The Chadwick Baseball Database⁵ includes the raw statistics we’ll use for this analysis. It does not include the advanced metrics that are considered crucial for evaluating pitching talent⁶, and it doesn’t include the calculated ratios that are displayed on baseball scoreboards. Since a casual fan has access to the scoreboard, the basic pitching ratios are the target of this analysis. Potential metrics include earned run average (ERA), walks plus hits per inning pitched (WHIP), hits per 9 innings (H/9) and strikeout percentage (K%), as well as the common counting stats a pitcher records or gives up (K, BB, H, 2B, 3B, HR, etc.).

Since the Chadwick Baseball Database doesn’t include calculated metrics, I calculated them from the raw statistics.

The calculated pitching stats and their formulas are shown below:

WHIP = (walks + hits) / innings pitched

K/BB Ratio = strikeouts / walks

ERA = earned runs / innings pitched * 9
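For readers who want to check these ratios themselves, the three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the article: the function names and the sample stat line are hypothetical, and innings pitched is assumed to be a decimal number of innings (a source that stores outs recorded would need dividing by 3 first).

```python
def whip(walks, hits, innings_pitched):
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings_pitched


def k_bb_ratio(strikeouts, walks):
    """Strikeouts per walk (undefined when walks is zero)."""
    return strikeouts / walks


def era(earned_runs, innings_pitched):
    """Earned runs allowed, scaled to nine innings."""
    return earned_runs / innings_pitched * 9


# Hypothetical season line, made up purely for illustration.
line = {"ip": 200.0, "h": 160, "bb": 40, "k": 220, "er": 60}

print("WHIP:", round(whip(line["bb"], line["h"], line["ip"]), 3))
print("K/BB:", round(k_bb_ratio(line["k"], line["bb"]), 2))
print("ERA: ", round(era(line["er"], line["ip"]), 2))
```

The same functions work for a team season as for a single pitcher, since all three are simple ratios of counting stats.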

For this analysis, I further normalized the team data by removing some outlier seasons. Below are full details of my team outlier removal:

  • Removed teams before 1970: Several key metrics weren’t tracked prior to the 1970 season (including sacrifice flies, hit by pitch and others). We also know that rule changes make 1970 a good break point to normalize the pitching data.²
  • Removed team seasons where the number of games played was below 158: I wanted to remove seasons that were cut short by strikes and other schedule oddities.
  • Removed teams that do not play in the National or American Leagues: Our focus is Major League Baseball, which includes the African American leagues in the final assessment of pitchers.
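The three filters above can be sketched as a single predicate over team-season records. The field names (`year`, `games`, `league`) and the sample rows are assumptions made for illustration, not the actual column names of the Chadwick files.

```python
def keep_team_season(row, min_year=1970, min_games=158, leagues=("NL", "AL")):
    """Return True if a team season passes all three outlier filters."""
    return (
        row["year"] >= min_year          # drop seasons before 1970
        and row["games"] >= min_games    # drop shortened seasons
        and row["league"] in leagues     # keep National and American League teams
    )


# Hypothetical team seasons, invented to exercise each filter.
team_seasons = [
    {"team": "A", "year": 1965, "games": 162, "league": "NL"},  # too early
    {"team": "B", "year": 1981, "games": 107, "league": "AL"},  # too few games
    {"team": "C", "year": 1975, "games": 160, "league": "XX"},  # other league
    {"team": "D", "year": 2019, "games": 162, "league": "AL"},  # kept
]

clean = [row for row in team_seasons if keep_team_season(row)]
print([row["team"] for row in clean])
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the results are to, say, a different minimum number of games.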

With clean data, we run a simple correlation analysis to correlate the common pitching metrics with runs scored against the pitcher’s team. The analysis shows that ERA is the most important metric, with a correlation of 0.982. This makes a lot of sense, since the earned run average shows how many earned runs were scored, on average, against a pitcher or team per nine innings. Our conclusion: The lower the ERA, the better the pitcher.
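For the curious, a correlation check like this one can be sketched with a hand-rolled Pearson coefficient. The text does not say which correlation measure was used, so Pearson is an assumption here, and the numbers below are toy values that will not reproduce the 0.982 figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy team seasons (made up): team ERA, team WHIP, and runs against.
team_era = [3.2, 3.8, 4.1, 4.5, 5.0]
team_whip = [1.30, 1.18, 1.35, 1.40, 1.38]
runs_against = [580, 650, 700, 740, 820]

# Rank candidate metrics by the strength of their correlation with runs against.
candidates = {"ERA": team_era, "WHIP": team_whip}
ranked = sorted(
    candidates,
    key=lambda name: abs(pearson(candidates[name], runs_against)),
    reverse=True,
)
print(ranked)
```

Ranking by absolute value matters because some metrics (strikeouts, for example) would be expected to correlate negatively with runs against.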

Image by author
Image by author

Best Pitchers Ever by ERA with *adjustment*

Now that we know that we should pay most attention to ERA, it’s a natural progression to see who the best players of all time are according to this statistic. If we have a lot of Hall of Fame talent in our analysis, we are directionally on to something.

Unfortunately, this isn’t as simple as calculating the ERA for every player.

If I look at the lowest ERAs by player throughout baseball history, I notice a very interesting trend…

Image by author

Modern era pitchers have higher ERAs than pitchers from 1871–1968! What? Why does that happen?

The answer: The rules (or lack of them) gave pitchers an advantage in early baseball history.²

The players from the “dead ball” era (1871–1920) and even the “golden era” (1921–1968) were known for their use of pitching techniques that are now illegal.

Photo by Jose Francisco Morales on Unsplash

The dead ball pitchers were known to use the spitball, a now-illegal pitch that moved unpredictably and hurt batter productivity.

Golden era pitchers had the advantage of larger strike zones. After the 1968 season, the strike zone shrank and even tighter restrictions were placed on pitching techniques.²

To pick the best pitchers of any era, we’ll need to adjust their ERA to account for these rules changes. Fortunately, our statistics “chops” will help do that.

The Linear Model

To identify how much we should adjust the dead ball era and golden era pitchers, we need a model that can account for the effect of the era in which a pitcher pitched. So, we return to our team data and use a linear model to identify the proper adjustment for any pitcher who pitched primarily during each of those eras.

My method was to add a dummy variable. If a pitcher made more than 50% of their appearances during the dead ball era, they were assigned a “1” in the “ap_era_deadball” feature. If a pitcher made more than 50% of their appearances during the golden era, they were assigned a “1” in the “ap_era_golden” feature. All other pitchers were assigned a 0 as they were modern era pitchers.
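The dummy-variable rule described above can be sketched as follows. The era boundaries (1871–1920 and 1921–1968) and the feature names come from the text; the function name, the input format (a dict of season year to appearances) and the sample pitchers are hypothetical.

```python
DEAD_BALL = (1871, 1920)
GOLDEN = (1921, 1968)


def era_dummies(appearances_by_year):
    """Assign era dummies from a {season_year: appearances} mapping."""
    total = sum(appearances_by_year.values())

    def share(span):
        first, last = span
        in_span = sum(
            n for year, n in appearances_by_year.items() if first <= year <= last
        )
        return in_span / total

    return {
        "ap_era_deadball": 1 if share(DEAD_BALL) > 0.5 else 0,
        "ap_era_golden": 1 if share(GOLDEN) > 0.5 else 0,
    }


# Hypothetical careers, invented for illustration.
straddles_1920 = {1918: 10, 1919: 30, 1921: 20}          # 40 of 60 in dead ball era
straddles_1968 = {1966: 30, 1967: 30, 1969: 35, 1970: 35}  # 60 of 130 in golden era

print(era_dummies(straddles_1920))
print(era_dummies(straddles_1968))
```

A pitcher with both dummies at 0 is treated as a modern era pitcher, which makes the modern era the baseline that the other two are measured against.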

We use the linear regression model to understand how much we should adjust each pitcher that pitched during the dead ball and golden eras of baseball.

The output of my model is below:

Image by author

You’ll see that this model is okay. Each of our variables is statistically significant, but the model only accounts for 19.8% of the variance. The coefficients are the most important output, however. They are statistically significant and give us a way to adjust earlier pitchers so modern pitchers have a fair chance at being the best pitcher in history.
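To make the role of the coefficients concrete, here is a minimal sketch of a regression on era dummies. The full model specification is not listed in the text, so this assumes the simplest version: ERA explained by an intercept plus the two era dummies and nothing else. Under that assumption the least-squares coefficients reduce to differences in group means, so no matrix algebra is needed. The data rows are toy values.

```python
from statistics import mean


def fit_era_dummies(rows):
    """Fit era = b0 + b1*ap_era_deadball + b2*ap_era_golden by least squares.

    With only an intercept and mutually exclusive dummies, the solution is
    the baseline (modern) mean plus each era's difference from that mean.
    """
    groups = {"modern": [], "ap_era_deadball": [], "ap_era_golden": []}
    for row in rows:
        if row["ap_era_deadball"]:
            groups["ap_era_deadball"].append(row["era"])
        elif row["ap_era_golden"]:
            groups["ap_era_golden"].append(row["era"])
        else:
            groups["modern"].append(row["era"])

    intercept = mean(groups["modern"])
    coefs = {
        "intercept": intercept,
        "ap_era_deadball": mean(groups["ap_era_deadball"]) - intercept,
        "ap_era_golden": mean(groups["ap_era_golden"]) - intercept,
    }

    # R^2: share of the variance in ERA explained by the fitted values.
    overall = mean([row["era"] for row in rows])
    ss_tot = sum((row["era"] - overall) ** 2 for row in rows)
    ss_res = sum(
        (
            row["era"]
            - coefs["intercept"]
            - coefs["ap_era_deadball"] * row["ap_era_deadball"]
            - coefs["ap_era_golden"] * row["ap_era_golden"]
        )
        ** 2
        for row in rows
    )
    return coefs, 1 - ss_res / ss_tot


# Toy data (made up): ERA values with their era dummies.
rows = [
    {"era": 4.0, "ap_era_deadball": 0, "ap_era_golden": 0},
    {"era": 4.2, "ap_era_deadball": 0, "ap_era_golden": 0},
    {"era": 3.8, "ap_era_deadball": 0, "ap_era_golden": 0},
    {"era": 3.0, "ap_era_deadball": 1, "ap_era_golden": 0},
    {"era": 3.4, "ap_era_deadball": 1, "ap_era_golden": 0},
    {"era": 3.8, "ap_era_deadball": 0, "ap_era_golden": 1},
    {"era": 3.9, "ap_era_deadball": 0, "ap_era_golden": 1},
]

coefs, r_squared = fit_era_dummies(rows)
print(coefs, round(r_squared, 3))
```

In this toy fit the era coefficients come out negative, because the older eras have lower ERAs than the modern baseline. Leveling the playing field then means adding the size of each coefficient back to the older pitchers' ERAs. A low R² is not a problem for this purpose: the goal is a reliable estimate of the era gap, not a prediction of any one ERA.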

My adjustments:

“Dead ball” era pitchers will have 0.83529 added to their career ERA.

“Golden” era pitchers will have 0.15738 added to their career ERA.

“Modern” era pitchers will have no adjustment to their career ERA.
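Applying these adjustments is a simple lookup and addition. In the sketch below, the three adjustment values are the ones listed above; the function name, the era labels and the sample pitchers are hypothetical.

```python
# Adjustment values taken from the list above.
ERA_ADJUSTMENT = {"deadball": 0.83529, "golden": 0.15738, "modern": 0.0}


def adjusted_era(career_era, era_label):
    """Add the era adjustment to a pitcher's career ERA."""
    return career_era + ERA_ADJUSTMENT[era_label]


# Hypothetical pitchers, invented for illustration.
pitchers = [
    {"name": "Dead Ball Dan", "career_era": 2.10, "era_label": "deadball"},
    {"name": "Golden Gus", "career_era": 2.90, "era_label": "golden"},
    {"name": "Modern Mike", "career_era": 3.00, "era_label": "modern"},
]

# Rank from best (lowest adjusted ERA) to worst.
ranking = sorted(
    pitchers, key=lambda p: adjusted_era(p["career_era"], p["era_label"])
)
for p in ranking:
    print(p["name"], round(adjusted_era(p["career_era"], p["era_label"]), 3))
```

Note how the adjustment can reorder pitchers: the hypothetical golden era pitcher starts with a lower raw ERA than the modern one but ranks behind him once the adjustment is applied.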

Without further ado, below is the list of the top 15 all-time greatest starters according to my analysis.

Image by author

My childhood hero, Nolan Ryan, makes the cut according to ERA. There are no active starters in the top 15, and Grover Cleveland “Pete” Alexander is the king of the starters.

Here is the list of the top 15 relievers according to my analysis.

Image by author

The pitcher with the fastest fastball ever recorded³, Aroldis Chapman, is the top reliever. He’s also an active player with the New York Yankees, and a former Chicago Cub, ready to show us his stuff in 2021.

And finally the top 15 closers according to my analysis.

Image by author

A controversial pick, Craig Kimbrel, sits at the top of the closers list. Most people would fight me for writing that, but Mariano Rivera is a close second, so give me a little credit. And my personal vote for best mustache in the game, owned by Rollie Fingers, sits at #11 on the top 15 closers list.

In Conclusion

ERA is a tried-and-true metric that, when push comes to shove, tells you which pitchers are best at preventing runs from being scored.

My recommendation is to watch out for pitchers with the lowest ERAs on your team. Those pitchers are your best chance at stopping opposing batters from scoring runs.

Image by author

[1] “baseball.” Dictionary.com 2021. https://www.dictionary.com/browse/baseball (23 March 2021)

[2] E Baccellieri, The DH, the Spitter and … the Three-Batter Minimum? A Brief History of Major Rule Changes (2019), Sports Illustrated Magazine (23 March 2021)

[3] Fastest baseball pitch (male) (2010), Guinness World Records Limited (23 March 2021)

[4] M Feinsand, CC Sabathia can’t remember Baltimore bender before playoffs, says drinking didn’t affect his pitching for Yankees: ‘I was functioning as an alcoholic’ (2015), New York Daily News (24 March 2021)

[5] https://github.com/chadwickbureau/baseballdatabank (2020), Chadwick Bureau of Baseball Statistics (19 March 2021)

[6] Major League Baseball’s Statistics Glossary (2021), Major League Baseball (19 March 2021)


#Analytics, #Data, #MachineLearning and marketing #Research Pro | #datadriven SVP of Data Strategy @cramerkrasselt www.courtneyperigo.com