Tuesday, April 24, 2012

High Heat Stats: DOA, Continued


On Sunday, birtelcom at High Heat Stats wrote an article proposing a better way to measure the skill that RBI attempts to capture, and called the new metric "DOA" (Driven-in Over Average). I wrote a program to generate DOA for every season dating back to 1956 and posted the data.

Since then, readers have made numerous suggestions for improving the metric. This post contains five different data sets incorporating those suggestions to help determine the best course.

The following changes were made to the original process and used in the generation of all five of the following data sets:

  • Originally, 8 base runner states were used to determine RBI expectation. These are now split into situations with 2 outs and situations with fewer than 2 outs, creating 16 states that better approximate RBI expectation. The initial plan was to introduce this distinction only with a man on third, but after looking at the data, I found a significant difference in expectation with a man on second as well - due to the increased frequency of scoring on a single with 2 outs - so I split all situations by outs to arrive at 16 total states.
  • All plate appearances ending in an intentional walk or hit by pitch were removed from the data set. This means that they were not used to determine the baseline RBI expectation nor were they used in the calculation of each player's DOA.
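As a rough illustration of the two changes above, here is a minimal Python sketch of how the 16 states and the baseline could be computed. It is not the program used to generate the data: the record layout (dicts with "event", "on_first", "on_second", "on_third", "outs" and "rbi" keys) and the event codes "IBB" and "HBP" are assumptions made for the example.

```python
from collections import defaultdict

# Assumed event codes; these plate appearances are dropped entirely.
EXCLUDED_EVENTS = {"IBB", "HBP"}

def state_key(on_first, on_second, on_third, outs):
    """Map a plate appearance to one of 16 states:
    8 base runner states x (2 outs vs. fewer than 2 outs)."""
    base_state = (1 if on_first else 0) | (2 if on_second else 0) | (4 if on_third else 0)
    return base_state, outs == 2

def rbi_expectation(plate_appearances):
    """Average RBI per plate appearance in each state, skipping excluded events."""
    totals = defaultdict(lambda: [0, 0])  # state -> [rbi_sum, pa_count]
    for pa in plate_appearances:
        if pa["event"] in EXCLUDED_EVENTS:
            continue
        key = state_key(pa["on_first"], pa["on_second"], pa["on_third"], pa["outs"])
        totals[key][0] += pa["rbi"]
        totals[key][1] += 1
    return {key: rbi_sum / count for key, (rbi_sum, count) in totals.items()}
```

Calling rbi_expectation on one season's plate appearances would give that season's baseline, with one entry for each state that actually occurred.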

With some disagreement on whether unintentional walks should be included in the data set, I have provided both cases below. In those instances marked "UIBB: No", unintentional walks were not used to determine the baseline RBI expectation nor were they used to calculate each player's DOA.

There was also some disagreement on whether to include the RBI a player receives from their own run when hitting a HR. The argument against would be that the RBI would have occurred regardless of situation and thus is unimportant in determining how well a batter hits with a chance to drive in runs. The argument for would be that if we're measuring the ability to drive in runs, why would we remove the very best way of driving them in? Regardless of your position, both cases are listed below. In those marked "HR RBI: No", the single RBI from the batter is removed both in determining the baseline RBI expectation and in calculating each player's DOA. In the case of a grand slam, the batter still receives credit for 3 RBI, just not the 1 he gets for driving in himself.
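The "HR RBI: No" adjustment amounts to subtracting one RBI on every home run. A minimal Python sketch, where the function name and the count_batter_run flag are hypothetical and only there for illustration:

```python
def adjusted_rbi(rbi, is_home_run, count_batter_run=True):
    """RBI credited for one plate appearance.

    With count_batter_run=False (the "HR RBI: No" case), the run the
    batter scores himself on a home run is not counted.
    """
    if is_home_run and not count_batter_run:
        return rbi - 1
    return rbi
```

Under this rule a grand slam counts as 3 RBI and a solo home run as 0, while every non-home-run plate appearance is unchanged.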

In addition to the above changes, I also listed one further data set that has a restriction on the generation of the baseline RBI expectation. In calculating the expectation, only those batters with at least 200 plate appearances in the season will be used. This will remove NL pitchers from the equation and eliminate the slight skew they introduce. I wasn't sure if this was worth investigating too much so I've only provided one example to get an idea of the difference involved.
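One way to picture the 200 PA restriction is as a filter applied only to the sample that feeds the baseline. The Python sketch below assumes each plate appearance is a dict with a "batter" key and that the list covers a single season; whether excluded events such as intentional walks count toward the 200 is not specified here, so the sketch simply counts every record it is given.

```python
from collections import Counter

def qualified_batters(season_pas, min_pa=200):
    """Batters with at least min_pa plate appearances in the season."""
    counts = Counter(pa["batter"] for pa in season_pas)
    return {batter for batter, n in counts.items() if n >= min_pa}

def baseline_sample(season_pas, min_pa=200):
    """Plate appearances used to build the baseline RBI expectation.

    Only the baseline is restricted; in this sketch each player's own
    totals would still be computed from all of his plate appearances.
    """
    keep = qualified_batters(season_pas, min_pa)
    return [pa for pa in season_pas if pa["batter"] in keep]
```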

To aid comparison, you can check out this spreadsheet, which lists the RBI expectation for each of the 16 states for the 5 data sets listed below. This was generated using every plate appearance from 1956 to 2011.

A few further notes on data:
  • I've added a "DOA %" column which is calculated simply as [Actual RBI/Expected RBI] to give a rate version of DOA where 1 is average.
  • The second row of each spreadsheet gives the RBI expectation for each of the 16 states. In the yearly spreadsheets this is calculated from that year only. In the "Cumulative Situations" spreadsheets, this is calculated as the average across all plate appearances from 1956-2011.
  • In the "Cumulative Situations" spreadsheets, even though the multi-year average is shown, each player's DOA was calculated by treating every season independently and using the yearly RBI expectations. The multi-year average is presented as a guide but was not used in any calculations.
  • Due to the nature of the metric, play-by-play data is necessary to calculate DOA. I set 1956 as my starting year, but Retrosheet is missing some play-by-play data, so some plate appearances are missing from our data. For this reason, you will notice that players from the 1950s and 1960s have RBI numbers that differ slightly from their official totals; poor Hank Aaron lost 62 RBI to missing play-by-play data but can comfort himself with the knowledge that he is still the all-time DOA leader. To give an idea of the amount of play-by-play data that we're working with: in 1956 we have 95.7% of all PAs, in 1957 we have 97.2%, and in only one year after that do we have less than 99% (98.4% in 1968). Starting in 1974, we have 100% of play-by-play data for every season.
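To make the notes above concrete, here is a minimal Python sketch of a career calculation that treats each season independently. It assumes DOA is actual RBI minus expected RBI (as the name "Driven-in Over Average" suggests) and that career "DOA %" is total actual RBI divided by total expected RBI; the function names and the record layout (dicts with "state" and "rbi" keys) are hypothetical.

```python
def season_totals(player_pas, expectation):
    """Actual and expected RBI for one player in one season.

    expectation maps each state to that season's RBI expectation.
    """
    actual = sum(pa["rbi"] for pa in player_pas)
    expected = sum(expectation[pa["state"]] for pa in player_pas)
    return actual, expected

def career_doa(seasons):
    """seasons: iterable of (player_pas, expectation_for_that_year) pairs.

    Each season uses its own yearly expectation; no multi-year average
    enters the calculation.
    """
    actual = expected = 0.0
    for player_pas, expectation in seasons:
        season_actual, season_expected = season_totals(player_pas, expectation)
        actual += season_actual
        expected += season_expected
    doa = actual - expected
    doa_pct = actual / expected if expected else None  # 1.0 means average
    return doa, doa_pct
```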

There you go... I think I'm done talking. Enjoy the numbers!

UIBB: Yes  |  HR RBI: Yes
Yearly Data: 1956-1959  |  1960-1969  |  1970-1979  |  1980-1989  |  1990-1999  |  2000-2011
Player Careers: Cumulative Situations  |  By Age


UIBB: Yes  |  HR RBI: Yes  |  Restrict Baseline to Batters with at least 200 PA
Yearly Data: 1956-1959  |  1960-1969  |  1970-1979  |  1980-1989  |  1990-1999  |  2000-2011
Player Careers: Cumulative Situations  |  By Age

--------------------------------------------------

UIBB: No  |  HR RBI: No
Yearly Data: 1956-1959  |  1960-1969  |  1970-1979  |  1980-1989  |  1990-1999  |  2000-2011
Player Careers: Cumulative Situations  |  By Age

--------------------------------------------------

UIBB: No  |  HR RBI: Yes
Yearly Data: 1956-1959  |  1960-1969  |  1970-1979  |  1980-1989  |  1990-1999  |  2000-2011
Player Careers: Cumulative Situations  |  By Age

--------------------------------------------------

UIBB: Yes  |  HR RBI: No
Yearly Data: 1956-1959  |  1960-1969  |  1970-1979  |  1980-1989  |  1990-1999  |  2000-2011
Player Careers: Cumulative Situations  |  By Age
