29 April 2010

A simple model for baseball

From the April Notices of the AMS, John D'Angelo writes Baseball and Markov Chains: Power Hitting and Power Series. Consider the following simple model of baseball: each batter gets a hit (always a single) with probability p and makes an out otherwise, and three singles score a run. That is, the third player to get a hit in a given inning, and every player after that, scores a run. This can be interpreted as saying that, say, all runners score from second on a single or all runners go from first to third on a single -- but not both! -- or that every third hit is actually a double. (And I do mean exactly every third hit, not some random one-third of hits, so this is a bit unnatural.) Then the expected number of runs per half-inning is p^3(3p^2 - 10p + 10)/(1-p). In real baseball the average number of runs per half-inning is around one half, which corresponds to p = 0.361.

D'Angelo gives this as an exercise, but I independently came up with this model a while ago and can't resist sharing the solution. Let q = 1-p. The probability of getting k hits in an inning is p^k q^3 -- that's the probability of getting those hits in a certain order -- times the number of ways in which k hits and 3 outs can be arranged. Since the last batter of an inning must get out, the number of possible arrangements is the number of ways to pick 2 batters out of the first k+2 to get out, which is (k+2)(k+1)/2.

The probability of getting k runs, if k is at least 1, is just the probability of getting k+2 hits, which is p^(k+2) q^3 (k+4)(k+3)/2. Call this f(k); then

f(1) + 2f(2) + 3f(3) + ... = p^3(3p^2 - 10p + 10)/(1-p)

by some annoying algebra. I'm pretty sure I came up with this exact model while procrastinating from some real work a couple years ago; it's probably been independently reinvented many times.
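
For completeness, here's one way to organize that algebra (my reconstruction, not D'Angelo's write-up; it just expands k(k+3)(k+4) and uses the standard power sums):

```latex
% Expand k(k+3)(k+4) = k^3 + 7k^2 + 12k and use the standard sums
%   \sum_{k\ge1} k   p^k = p/(1-p)^2,
%   \sum_{k\ge1} k^2 p^k = p(1+p)/(1-p)^3,
%   \sum_{k\ge1} k^3 p^k = p(1+4p+p^2)/(1-p)^4.
\begin{align*}
\sum_{k\ge 1} k\,f(k)
  &= \frac{p^2 q^3}{2} \sum_{k\ge 1} \bigl(k^3 + 7k^2 + 12k\bigr) p^k \\
  &= \frac{p^3}{2}\left[ \frac{1+4p+p^2}{1-p} + 7(1+p) + 12(1-p) \right]
     && \text{(using $q^3 = (1-p)^3$)} \\
  &= \frac{p^3}{2}\cdot\frac{(1+4p+p^2) + (19-5p)(1-p)}{1-p}
   = \frac{p^3(3p^2 - 10p + 10)}{1-p}.
\end{align*}
```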

With p = 0.361, the probabilities of scoring 0, 1, 2, 3, 4, 5 runs in an inning are .748, .123, .066, .034, .016, .008 (rounded to three decimal places). (Probabilities of larger numbers of runs can also be calculated; together they have probability around .006.)
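
These numbers are easy to check numerically. Here's a quick sketch in Python (the infinite sums are truncated at K hits, which is harmless at p = 0.361):

```python
# Per-inning run distribution in the singles-only model, truncated at K hits.
p = 0.361
q = 1 - p
K = 400

def prob_hits(k):
    """P(exactly k hits in a half-inning) = p^k q^3 (k+2)(k+1)/2."""
    return p ** k * q ** 3 * (k + 2) * (k + 1) / 2

# Scoring k >= 1 runs means getting k+2 hits; 0 runs means 0, 1, or 2 hits.
runs = [prob_hits(0) + prob_hits(1) + prob_hits(2)]
runs += [prob_hits(k + 2) for k in range(1, K)]

mean = sum(k * pk for k, pk in enumerate(runs))
closed_form = p ** 3 * (3 * p ** 2 - 10 * p + 10) / (1 - p)
print(round(mean, 3), round(closed_form, 3))  # both about 0.499
print([round(x, 3) for x in runs[:6]])  # [0.748, 0.123, 0.066, 0.034, 0.016, 0.008]
```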

Assuming that each half-inning is independent, the probability G(k) of a team scoring k runs in a game is as follows:

  k      0     1     2     3     4     5
  G(k)  .073  .108  .129  .133  .124  .108
  k      6     7     8     9    10    11
  G(k)  .088  .069  .052  .038  .026  .018
  k     12    13    14    15    16    17
  G(k)  .012  .008  .005  .003  .002  .001

The probability of scoring 18 runs or more is about 0.0006. (This seems a bit low to me -- it works out to about three times a season in the major leagues -- but after all this is a very crude model!) One interesting thing here is that the distribution of the number of runs per game, which is a sum of nine skewed distributions, is still skewed: the mode is 3 and the median is 4, even though I chose p so that the mean would be 4.5. And the actual distribution is similarly skewed.
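
The table is just the nine-fold convolution of the inning distribution with itself; here's a check (same model as the sketch above, repeated so it runs on its own):

```python
# Reproduce G(k) by convolving the per-inning run distribution with
# itself nine times (innings assumed independent).
p, q, K = 0.361, 0.639, 60

def prob_hits(k):
    return p ** k * q ** 3 * (k + 2) * (k + 1) / 2

runs = [prob_hits(0) + prob_hits(1) + prob_hits(2)]
runs += [prob_hits(k + 2) for k in range(1, K)]

game = [1.0]
for _ in range(9):  # one inning at a time
    conv = [0.0] * (len(game) + len(runs) - 1)
    for i, gi in enumerate(game):
        for j, rj in enumerate(runs):
            conv[i + j] += gi * rj
    game = conv

print([round(game[k], 3) for k in range(18)])  # matches the table above
print(round(sum(game[18:]), 4))                # about 0.0006
```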

Of course a more sophisticated model of baseball is as a Markov chain. There are twenty-five states in this chain -- zero, one or two outs combined with eight possible ways to have runners on base, and three outs. We assume that each hitter hits randomly according to his actual statistics, and the runners move in the "appropriate" way. Of course determining what's appropriate here would be a bit tricky. How do runners move? A runner is probably more likely to take an extra base when a power hitter is hitting, but the sample size for any individual is fairly small. But one could probably predict from some measure of the hitter's power (say, the number of doubles and home runs, combined appropriately) the chances of a runner taking an extra base on a single. Something similar is necessary for sacrifice flies (which have to be deep enough to score the runner), grounding into double plays, etc. I'm not sure if the Markov models that are out there, such as that by Sagarin, do this. Sagarin computes the (offensive) value of a player by determining how many runs per game a team composed of only that player would score.
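
To make the state space concrete, here's a minimal Monte Carlo sketch of such a chain. The outcome probabilities and the "every runner advances exactly as many bases as the batter" rule are crude placeholders of my own, not fitted to any actual hitter; a real model would put per-batter probabilities here and handle baserunning, sacrifice flies, and double plays with more care:

```python
import random

# Sketch of the 25-state chain: a state is (outs, bases), with bases a
# list of three booleans for 1st/2nd/3rd; three outs is absorbing.
# These outcome probabilities are illustrative guesses only.
P_OUT, P_WALK = 0.68, 0.075
P_SINGLE, P_DOUBLE, P_TRIPLE, P_HR = 0.16, 0.05, 0.005, 0.03
assert abs(P_OUT + P_WALK + P_SINGLE + P_DOUBLE + P_TRIPLE + P_HR - 1) < 1e-12

def half_inning(rng):
    outs, bases, runs = 0, [False, False, False], 0
    while outs < 3:
        r = rng.random()
        if r < P_OUT:
            outs += 1
        elif r < P_OUT + P_WALK:
            if bases[0]:                  # walk: advance only forced runners
                if bases[1]:
                    if bases[2]:
                        runs += 1
                    bases[2] = True
                bases[1] = True
            bases[0] = True
        else:
            x = r - P_OUT - P_WALK        # some kind of hit
            n = (1 if x < P_SINGLE else
                 2 if x < P_SINGLE + P_DOUBLE else
                 3 if x < P_SINGLE + P_DOUBLE + P_TRIPLE else 4)
            for _ in range(n):            # push every runner n bases (crude)
                runs += bases[2]
                bases = [False] + bases[:2]
            if n < 4:
                bases[n - 1] = True       # batter stops at base n
            else:
                runs += 1                 # home run: batter scores too
    return runs

rng = random.Random(2010)
samples = [half_inning(rng) for _ in range(200_000)]
print(sum(samples) / len(samples))  # runs per half-inning under these guesses
```

A Sagarin-style offensive rating would then amount to running something like this with one player's probabilities in every lineup slot and counting runs per game.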

28 April 2010

My thesis!

For the morbidly curious, here's my recently completed PhD thesis, Profiles of large combinatorial structures. (PDF, 1.1 MB, 262 pages -- but double-spaced with wide margins.) This is why I haven't been posting!

Abstract: We derive limit laws for random combinatorial structures using singularity analysis of generating functions. We begin with a study of the Boltzmann samplers of Flajolet and collaborators, a method for generating large discrete structures at random which is useful both for providing intuition and conjectures and as a possible proof technique. We then apply generating functions and Boltzmann samplers to three main classes of objects: permutations with weighted cycles, involutions, and integer partitions. Random permutations in which each cycle carries a multiplicative weight σ have probability (1-γ)^σ that a random element lies in a cycle of length longer than γn; this limit law also holds for cycles carrying multiplicative weights depending on their length and averaging σ. Such permutations have an asymptotically normal number of cycles, with mean and variance ~ σ log n. For permutations with weights σ_k = 1/k or σ_k = k, other limit laws are found; the former have finitely many cycles in expectation, the latter around √n. Compositions of two uniformly chosen involutions of [n], which can be modeled as modified 2-regular graphs, typically have about √n cycles, characteristically of length √n. The number of factorizations of a random permutation into two involutions appears to be asymptotically lognormally distributed, which we prove for a closely related probabilistic model. We also consider connections to pattern avoidance, in particular to the distribution of the number of inversions in involutions. Last, we consider integer partitions. Various results on the shape of random partitions are simple to prove in the Boltzmann model. We give a (conjecturally tight) asymptotic bound on the number of partitions p_M(n) in which all part multiplicities lie in some fixed set M, and explore when that asymptotic form satisfies log p_M(n) ~ π√(Cn) for rational C. Finally, we give probabilistic interpretations of various pairs of partition identities and study the Boltzmann model of a family of random objects interpolating between partitions and overpartitions.

20 April 2010

Thesis margins

What's the point of having two thousand readers if I can't ask a question like this once in a while?

I'm working on the final version of my dissertation -- the one I'll submit to the graduate school next week. The dissertation manual states that no text may appear in the margin area.

LaTeX, on the other hand, keeps wanting to let some pieces of inline mathematics stick out into the margins. (Presumably this is because TeX judges that "better" than the alternative of having very long inter-word spaces.)

Two questions:
- is there some way to check that nothing's sticking out into the margin? (I thought this was what "overfull \hbox" meant, but the line numbers in those warnings aren't the ones where I have this problem.) There are some things that are just barely sticking out into the margin, and with thousands of lines in total I don't trust my eye.
- once I find all the places where text protrudes into the margin, is there some way around this other than inserting \newline every time the problem occurs? That creates its own problems.

I surely can't be the only person who's had this problem, but Google is failing me.
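
For reference, a few standard TeX knobs bear on both questions (a sketch; the specific values are things to tune, not guaranteed fixes, and I can't promise they'll satisfy the graduate school's checker):

```latex
% Finding overruns: the draft class option marks every overfull line
% with a black rule in the margin.
\documentclass[draft]{report}

% By default TeX only reports an overfull \hbox when it exceeds
% \hfuzz = 0.1pt, which hides lines that barely stick out.
\hfuzz=0pt

% Avoiding overruns: let TeX stretch lines more before giving up, and
% let microtype's font expansion create extra room on tight lines.
\emergencystretch=1em
\usepackage{microtype}
```

The \hfuzz default in particular would explain text that's barely in the margin with no matching warning.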

Ash clouds and probability

From the Daily Mail: New ash cloud could delay re-opening of London airports. We have this gem: "Critics said the agency used a scientific model based on 'probability' rather than fact to forecast the spread of the ash cloud." See the Telegraph as well.

What else are they supposed to do? The agency here -- the Met Office, which is the national weather service of the UK -- doesn't know what the ash cloud is going to do. If they waited to see what the cloud does, the planes would already be in the air. It would be too late.

14 April 2010

Mathematical relationships search

There's a mathematical relationships search. It will tell you, for example, that academically, Max Noether is the first cousin of Emmy Noether. (Both of their advisors were students of Jacobi.) But Michael Artin and Emil Artin aren't even related.

It's less amusing, of course, when you search for people that aren't related in the standard way. But Paul Erdos is my great-great-great-great-uncle. (You can't search for me yet in the Mathematics Genealogy Project, which is where the data comes from; the link goes to the relationship between Erdos and another student of my advisor.)

This blog needs a new title

The word "probability" does not appear in the Bible, or so we learn from Conservapedia's List of missing words in the Bible.

I can only conclude that Einstein was right, and God does not play dice.