Sunday, February 26, 2006
Strikeout proficiency regressions ...
(1) BB / (K+BB)
(2) K / BFP
(3) (K-1.15*BB)/IP
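For anyone who wants to reproduce the three measures, here is a minimal Python sketch. The function name, argument names and the season line are made up for illustration; only the three formulas come from the list above.

```python
def strikeout_metrics(k, bb, bfp, ip):
    """Return the three candidate measures for one pitcher-season."""
    return {
        "bb_share": bb / (k + bb),                   # (1) BB / (K+BB)
        "k_per_bfp": k / bfp,                        # (2) K / BFP
        "k_minus_bb_per_ip": (k - 1.15 * bb) / ip,   # (3) (K-1.15*BB)/IP
    }

# Made-up season line, for illustration only
m = strikeout_metrics(k=200, bb=50, bfp=800, ip=200.0)
```

Running this for every pitcher in two consecutive seasons gives the paired values needed for a year-to-year correlation.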
Here are the results (all significant to p<0.01):
Equations 2 and 3 give similar year-to-year correlations, indicating that they have similar predictive power. K/BFP is a purer measure of strikeout power, while (K-1.15*BB)/IP includes an element of control. The correlation for BB/(K+BB) is substantially lower. Off the top of my head I can't say why this is. My guess is that this equation is quite sensitive to changes in both BB and K, and because the denominator is sensitive too, the variance of the ratio increases. Any other thoughts are most welcome.
Thursday, February 23, 2006
Giving up line drives IS largely luck ...
OK, so what I did was to work out the line drive percentage for all pitchers who gave up more than 40 BIP in both 2004 & 2005. I then allocated a score from 1 to 6 based on where they ranked in line drive percentage. I did this for both 2004 and 2005. If a pitcher had a low line drive percentage he got a 1; if not, his score would be closer to 6. Each group is the same size, so you can envisage a 6 by 6 matrix representing the distribution, where pitchers who gave up few line drives in 2004 AND 2005 would be in the top left and those who were particularly bad would be in the bottom right. If it's not clear, hopefully the diagram below will help:
At this point you are probably wondering why I am bothering to run this categorisation. Why don't I just run a regression? Remember, what I am trying to detect here is the presence of an elite group of pitchers, which is why I am segmenting. Technically you could say I should be comparing this group with the REST of the population and not just the pitchers in the bottom right corner. If we find a difference between the extremes then let's come back to this.
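The 1-to-6 ranking and the 6 by 6 matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the twelve pitchers and their line drive percentages are invented, and ties or group sizes that don't divide evenly by six are not handled with any care.

```python
def sextile_ranks(ld_pct):
    """Map {pitcher: line drive %} to {pitcher: rank 1..6}.

    Rank 1 holds the lowest line drive percentages, rank 6 the highest.
    Groups are equal-sized when the number of pitchers divides by six.
    """
    ordered = sorted(ld_pct, key=ld_pct.get)
    n = len(ordered)
    return {name: (i * 6) // n + 1 for i, name in enumerate(ordered)}

def rank_matrix(ranks_a, ranks_b):
    """Count pitchers in each (first-year rank, second-year rank) cell."""
    grid = [[0] * 6 for _ in range(6)]
    # Only pitchers present in both seasons are counted
    for name in ranks_a.keys() & ranks_b.keys():
        grid[ranks_a[name] - 1][ranks_b[name] - 1] += 1
    return grid

# Invented data: twelve pitchers whose ordering flips between seasons
ld_2004 = {f"p{i}": 0.15 + 0.01 * i for i in range(12)}
ld_2005 = {f"p{i}": 0.26 - 0.01 * i for i in range(12)}
ranks_2004 = sextile_ranks(ld_2004)
ranks_2005 = sextile_ranks(ld_2005)
grid = rank_matrix(ranks_2004, ranks_2005)
```

Row 0, column 0 of the grid is the top left corner (few line drives in both years) and row 5, column 5 is the bottom right.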
So my hypothesis is that you may get an elite group of pitchers who don't give up many line drives and that they reside in the top left corner of the diagram. Pitchers in this corner include Mariano Rivera, Johan Santana, Tim Wakefield, Billy Wagner, AJ Burnett and Jose Contreras, to name a few. Not a bad list. But in the other corner there were also some A-list names: Jason Isringhausen, Mark Prior and (ouch) Brad Lidge - hmm, my hypothesis looks doomed!
Anyway, to test this I ran an independent-samples t-test on the data using FIP (Fielding Independent Pitching - developed by Tangotiger) as the test variable. FIP is a good measure of how effective a pitcher is with defence controlled. The two groups were, group 1: pitchers with a rank of either 1 or 2 in both 2004 & 2005, and group 2: pitchers with a rank of either 5 or 6 in both 2004 and 2005. Everyone else was excluded. Running the analysis, it turned out that the test wasn't significant. In other words there was NO detectable difference between the two test groups in their FIP scores, so the data do not support my hypothesis.
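As a rough illustration of the comparison above, here is a small Python sketch of a two-sample t statistic. It computes Welch's version, which does not assume equal variances and may differ from the variant used for the result above. The function name and the FIP values are invented, and turning the statistic into a p-value still needs a t-distribution table or a stats package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Invented FIP values for the low and high line drive groups
low_ld_fip = [3.1, 3.5, 4.0, 4.4]
high_ld_fip = [3.3, 3.6, 4.1, 4.6]
t_stat, dof = welch_t(low_ld_fip, high_ld_fip)
```

A t statistic close to zero relative to its degrees of freedom, as with these invented numbers, is what a non-significant result looks like.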
Not a surprise, I suppose, given the low year-to-year correlations in line drive percentage and the observations above. But I was still curious whether people like Santana and Rivera, who distinguished themselves by having a low line drive % in both 2004 and 2005, did occupy an elite group of pitching. I segmented the existing groups into 4:
- Group 1: elite pitchers, ranked 1 in both 2004 and 2005
- Group 2: semi-elite, ranked 1 and 2 or 2 and 2
- Group 3: poor pitchers, ranked 5 and 6 or 5 and 5
- Group 4: worst, ranked 6 in both years
I then ran an ANOVA to compare the different samples. No surprise, the test failed - the overall mean was a better fit to the data than the ANOVA model. Why am I boring you with all this? Well, the one interesting thing I found was that there was a significant difference between the worst (group 4) and the rest. FIP for group 4 was almost a whole point higher. Now this is probably because the sample size was small (only 10 in group 3 vs 30 in other groups). But if this is confirmed with a larger dataset it opens the possibility that there are some pitchers who simply shouldn't be in the major leagues because they give up too many line drives - which as we know are expensive. I'd also like to look at line drive percentages for a couple of the elite pitchers like Santana to see if their line drive percentage regressed towards the mean. Given the findings above I would expect this to happen.
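For readers who want to see what the ANOVA is doing, here is a minimal Python sketch of the one-way F statistic. The function name and the three toy groups are assumptions for illustration, not the FIP data used above, and a p-value would still need an F-distribution table or a stats package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    # Variation of the group means around the overall mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Toy numbers, for illustration only
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

When the group means sit close to the overall mean, the F statistic is small, which is the situation where the overall mean fits about as well as the group model.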
Wednesday, February 22, 2006
Are pop-up pitchers flyball pitchers?

There is a clear relationship but the Rsq is only 0.18 (significant at the 0.01 level). This means that only 18% of the variance in flyballs is explained by this model (ie, pop-ups). (In case you were wondering, including 2004 data gives a similar correlation.)
Actually the correlation could be a little stronger than it first appears. Because both variables are a percentage of balls in play, if, say, flyballs increase then there is less "room" for pop-ups to increase. This is why groundballs correlate inversely with flyballs - if you don't have one you have the other (ignoring line drives).
Nothing surprising so far. Another way to look at this is to ask the contrarian question: do groundball pitchers give up fewer pop-ups than flyball pitchers? Given that flyballs and groundballs make up ~70% of batted balls we can simply categorise pitchers according to whether they give up a lot of groundballs or not. Then we can look for a difference in pop-ups in these two populations. Clear? Let's have a look at how it works in practice.
To categorise pitchers into those that give up groundballs and those that don't I'm simply going to cut the sample of pitchers in half. Those above the mean will go in the "groundball" group; the rest go in "other". Then we can run an independent t-test on the two populations to see if there is a significant difference in pop-ups. And, again no surprise, there is a difference, and calculating an "Rsq" for this gives 0.63, which as expected is much more pronounced. We could further control for line drives but since we know that (for pitchers) they are largely a random event I have ignored them.
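The post doesn't say how the "Rsq" was derived from the t-test. One standard conversion, shown here only as an assumption about what such a figure could mean, turns a two-sample t statistic into a share of variance explained by group membership:

```latex
r^2 = \frac{t^2}{t^2 + \mathit{df}}, \qquad \mathit{df} = n_1 + n_2 - 2
```

Here n_1 and n_2 are the sizes of the "groundball" and "other" groups. A large t relative to the degrees of freedom pushes r^2 towards 1.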
So, what does all this mean? We know from linear weights and the run expectancy matrix that a pop-out is worth almost the same as a strikeout. This poses a wider question. We know from my last post that inducing groundballs is very effective for the fielding team because most of them (75%) are turned into outs and those that are not are predominantly singles. But our analysis here tells us that pop-ups, which let's not forget are as valuable as strikeouts, are the domain of flyballers. I haven't run the analysis but it would be interesting to see if groundball or flyball pitchers have a higher propensity to strike batters out. Then we could use batted ball run value data to build the profile(s) of what an elite pitcher looks like.
Monday, February 20, 2006
Hardball Times Annual 2006 & Batted Balls
I am not going to review the annual in detail except to say that all of the articles are of the highest quality and are extremely well written. What I want to do is spend some time discussing my thoughts on what I consider to be the most interesting part of the book, namely analysis of batted ball types. The boys at THT ordered up a special cut of batted ball data for the last three years from Baseball Info Solutions and carried out all sorts of clever whiz-bang analysis on it. If you are a regular reader of THT, or indeed other blogs like Sabernomics, then this won't be new. But it is only with the advent of THT annual 2006 that I have focused on the potential of batted ball data.
Of particular interest were the year-to-year correlations. Using this technique we can determine the extent to which a pitcher (or batter) has control over various events. For example, if we correlate groundballs per BIP (Ball in Play) for the entire population of pitchers in 2004 vs 2005 we get the chart below (thanks to Yahoo Stats Group for the play-by-play data for all charts in here):

No surprise. Pitchers who gave up a lot of groundballs in 2004 did so in 2005. The Rsq is 0.5, which says that 50% of the variation in 2005 groundball rates is explained by 2004 rates - which is reasonably high. That is why we refer to pitchers as groundball pitchers - Tim Hudson comes to mind. Now, where it gets more interesting is if we look at the same chart for line drives. Here it is:
As you can see the Rsq is very small: 0.01. In other words whether or not a pitcher gives up a line drive is largely luck. I bet you didn't know that (unless you read THT). Doing the same for batters shows a slightly larger Rsq (~0.1) between one year and the next for line drives. Hitters do show a small degree of skill in hitting line drives. THT Annual 2006 does this (and a lot more) for a range of different batter / pitcher events and I encourage you to have a look.
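The year-to-year Rsq figures above can be computed along these lines. This is a minimal Python sketch, not the spreadsheet behind the charts; the function name and the four pitchers' rates are invented for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Invented groundball-per-BIP rates keyed by pitcher
gb_2004 = {"a": 0.40, "b": 0.45, "c": 0.50, "d": 0.55}
gb_2005 = {"a": 0.40, "b": 0.50, "c": 0.45, "d": 0.55}

# Pair up only the pitchers who appear in both seasons
both = sorted(gb_2004.keys() & gb_2005.keys())
rsq = r_squared([gb_2004[p] for p in both], [gb_2005[p] for p in both])
```

A value near 0.5 would match the groundball chart, while a value near 0.01 would match the line drive chart.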
All very interesting but so what, you may ask? To really understand what is going on let's look at another article from, you've guessed it, THT Annual 2006 (no, I am not a contributor). This particular piece works out run value per batted ball above or below an average baseline. Here are some selected examples:
- Line drive: 0.356
- Outfield flyball: 0.035
- Groundball: -0.101
- Strikeout: -0.287
What this is saying is that if a batter slams a line drive then it contributes runs for the offense - the highest value event. This is because line drives only result in an out 25% of the time. And, remember, we said earlier that line drives are largely luck! Pretty amazing. Now take groundballs. Hitting a groundball is bad news. That is because it results in an out 75% of the time, and if it doesn't the chances are that you will only get to first base.
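To make the run values concrete, here is a small Python sketch that adds them up for a set of events. The four values are the ones listed above; the function name, the dictionary keys and the event counts are made up for illustration, and events without a listed value are simply left out.

```python
# Run value per event relative to average, as listed above
RUN_VALUES = {
    "line_drive": 0.356,
    "outfield_flyball": 0.035,
    "groundball": -0.101,
    "strikeout": -0.287,
}

def runs_vs_average(event_counts):
    """Total run value of the listed events; positive favours the offense."""
    return sum(RUN_VALUES[e] * event_counts.get(e, 0) for e in RUN_VALUES)

# Made-up counts: 10 line drives and 20 groundballs
total = runs_vs_average({"line_drive": 10, "groundball": 20})
```

With these made-up counts the ten line drives outweigh the twenty groundballs, leaving the offense about 1.5 runs above average.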
Two things jump out at me that I want to look at further. Firstly, I want to dig into line drives a little deeper. If we can find some pitchers who consistently prevent line drives more than others then they should be more valuable. I guarantee you that if I was on the mound a lot more than 20% of my pitches would go for line drives. Secondly, I want to use this to develop a measuring system for pitchers / batters. Now I know this has been done (check out J.C. Bradbury's PrOPS metric - http://www.sabernomics.com - all his work is excellent if you have time to peruse), but I am curious if we can find a new part of the player population that has been significantly under-valued. Also I'll continue to explore the batted ball data and post anything else that I discover.
Wednesday, February 15, 2006
Coming up on my blog ....
First of all I have just got my copy of The Hardball Times Annual. I love this publication, so in the next week I'll probably pick the most interesting article and give my comments.
Following that I want to do a piece on my team - the Atlanta Braves. One interesting episode surrounding the Braves at the moment is the rumoured pending sale. I want to look at how much the Braves are worth; or at least how much I would pay for them if I had the money - which in case you're wondering I don't. I hope to publish something in the next couple of weeks so keep your eyes peeled.
Following that I'll give a quick run through of how I think all the divisions will shape up as well as my predictions on who will win the batting title, home run race and Cy Young award. Yeah, I know everyone does this but I want to look at it from a contrarian position, particularly on division races - I'll focus exclusively on who I think will end up in last place! (Ok - I'll also give my thoughts on the winner). I'll be wrong - but I will bet $50 on each prediction and I'll let you know how I do at the end of the season.
I haven't really decided what else to write about - I guess it will be topics that interest me but I can't say what those are yet. As I publish my first pieces and get comments, that will probably define the direction of future work.
Let me know if there is anything you think I should consider ...
Saturday, February 11, 2006
Welcome
On this blog I'll regularly be posting my views on all aspects of baseball. I love to use data so most of my posts will be based on analysing whatever data I can get my grubby hands on. The tools of my trade are Excel, Access and SPSS, and I'll get data from either Retrosheet or Yahoo Stats Software Group.
Obviously I want this blog to be read as widely as possible. I'll spend the first month or two developing some interesting articles and then I'll try to promote the site towards the start of the season. My goal is to write in an accessible but informative way. I want to focus on the output, and not the methodology of getting there - but I do recognise that this is important to the wider sabermetric community. Actually, I'll probably end up doing a baseball stats primer (perhaps in the 2006 off-season) so people can learn my approach and comment on how I do things.
In the next few days I'll be posting a list of some of the more detailed topics I'll be investigating over the coming months.
In the meantime if there are any issues you want me to comment on then please get in touch at baseballstrategy@googlemail.com