
The Kelly Criterion from Shannon Information Theory

March 27, 2026

In 1956, John Kelly Jr. at Bell Labs published a remarkable paper connecting two apparently unrelated problems: the reliable transmission of information over noisy channels, and the optimal sizing of repeated bets. His central result was that a gambler with access to a noisy private wire to a racetrack can grow capital at an exponential rate equal to the Shannon mutual information between the source and the received signal. The optimal strategy, now called the Kelly criterion, prescribes exact fractions of wealth to wager on each outcome.

This post derives Kelly's result from first principles. We build the information-theoretic foundations (§1), formulate the gambler's optimization problem (§2), solve it via Karush-Kuhn-Tucker conditions (§3), prove the connection to mutual information (§4), specialize to binary prediction markets (§5), and discuss application to algorithmic trading (§6).

Conventions. All logarithms are base 2 unless otherwise noted, so that information is measured in bits. We write conditional probabilities as $p(r \mid s)$, $q(s \mid r)$, and use the geometric language of the probability simplex throughout. Kelly's 1956 paper uses a slash in place of the conditioning bar (e.g., $p(r/s)$ for $p(r \mid s)$); an appendix restates the key results in his original notation for readers following the paper directly.

1. Shannon's Communication Theory

Shannon's 1948 theory models communication as a probabilistic process. A source emits symbols from a fixed, finite set of possibilities: letters in a language, bits in a wire protocol, horses in a race. The restriction to finitely many symbols is not a limitation for our purposes. Every prediction market resolves to one of finitely many outcomes (typically just YES or NO), and every horse race has finitely many horses. The mathematical structure we need begins with this finite set and the space of probability distributions over it.

1.1. The probability simplex and entropy

Let $\mathcal{S} = \{1, 2, \ldots, S\}$ be a finite set of symbols (the alphabet). The space of all probability distributions on $\mathcal{S}$ is the probability simplex

$$\Delta_{S-1} \;=\; \bigl\{p \in \mathbb{R}^S : p(s) \geq 0 \text{ for all } s,\;\; \textstyle\sum_{s=1}^{S} p(s) = 1\bigr\},$$

an $(S{-}1)$-dimensional manifold with boundary, embedded in $\mathbb{R}^S$. Each point $p \in \Delta_{S-1}$ assigns probability $p(s)$ to symbol $s$.

Definition (Shannon entropy).

The entropy of a distribution $p \in \Delta_{S-1}$ is

$$H(p) \;=\; -\sum_{s=1}^{S} p(s) \log p(s),$$

with the convention $0 \log 0 = 0$ (justified by continuity, since $\lim_{x \to 0^+} x \log x = 0$).

Entropy quantifies uncertainty: $H(p) = 0$ when $p$ is a point mass (the outcome is deterministic), and $H(p)$ is maximized when $p$ is uniform. We prove this bound precisely as a corollary of the next result. For the binary case $S = 2$, the entropy is a function of a single parameter $p \in [0, 1]$, peaking at $H = 1$ bit when $p = 0.5$.
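
These properties are easy to check numerically. A minimal sketch in plain Python (the `entropy` helper is ours, not from any library):

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log 0 = 0."""
    return -sum(x * log2(x) for x in p if x > 0)

h_point = entropy([1.0, 0.0])   # point mass: zero uncertainty
h_fair = entropy([0.5, 0.5])    # fair coin: the binary maximum, 1 bit
h_bias = entropy([0.9, 0.1])    # biased coin: strictly between 0 and 1 bit
```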

1.2. Kullback-Leibler divergence

Definition (Kullback-Leibler divergence).

For distributions $p, q \in \Delta_{S-1}$ with $p$ absolutely continuous with respect to $q$ (i.e., $q(s) = 0 \Rightarrow p(s) = 0$), the KL divergence from $q$ to $p$ is

$$D_{\mathrm{KL}}(p \,\Vert\, q) \;=\; \sum_{s=1}^{S} p(s) \log \frac{p(s)}{q(s)}.$$

Geometrically, $D_{\mathrm{KL}}$ is a Bregman divergence on $\Delta_{S-1}$ generated by the negative entropy functional $\Phi(p) = \sum_s p(s) \log p(s)$. It is not symmetric and does not satisfy the triangle inequality, so it is not a metric. It does, however, define a canonical notion of directed distance on the simplex. Its Hessian at a point $p$ recovers the Fisher information metric $g_{ij}(p) = \delta_{ij}/p(i)$, the unique (up to scale) Riemannian metric on the interior of $\Delta_{S-1}$ invariant under sufficient statistics [4].

Theorem (Gibbs' inequality).

For any distributions $p, q \in \Delta_{S-1}$,

$$D_{\mathrm{KL}}(p \,\Vert\, q) \;\geq\; 0,$$

with equality if and only if $p = q$.

Proof. Since $-\log$ is strictly convex, Jensen's inequality gives

$$D_{\mathrm{KL}}(p \,\Vert\, q) \;=\; -\sum_s p(s) \log \frac{q(s)}{p(s)} \;\geq\; -\log\!\left(\sum_s p(s) \cdot \frac{q(s)}{p(s)}\right) = -\log\!\left(\sum_s q(s)\right) = 0.$$

Equality in Jensen's inequality holds if and only if $q(s)/p(s)$ is constant $p$-almost surely. Since both $p$ and $q$ sum to 1, the constant must be 1, forcing $p = q$. ∎

Corollary (Entropy bounds).

For any $p \in \Delta_{S-1}$,

$$0 \;\leq\; H(p) \;\leq\; \log S.$$

The lower bound is achieved if and only if $p$ is a point mass. The upper bound is achieved if and only if $p$ is the uniform distribution $u = (1/S, \ldots, 1/S)$.

Proof. For the lower bound: each term $-p(s)\log p(s) \geq 0$ since $p(s) \in [0,1]$, with equality for all $s$ if and only if $p$ is a point mass.

For the upper bound: let $u = (1/S, \ldots, 1/S)$. By Gibbs' inequality,

$$D_{\mathrm{KL}}(p \,\Vert\, u) \;=\; \sum_s p(s)\log\bigl(p(s) \cdot S\bigr) \;=\; -H(p) + \log S \;\geq\; 0,$$

so $H(p) \leq \log S$, with equality if and only if $p = u$. ∎
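
Both results are easy to sanity-check numerically. The sketch below (plain Python; helper names are ours) verifies Gibbs' inequality, the entropy bounds, and the identity $D_{\mathrm{KL}}(p \,\Vert\, u) = \log S - H(p)$ on random distributions:

```python
from math import log2
import random

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def kl(p, q):
    """D(p || q) in bits; assumes q(s) = 0 implies p(s) = 0."""
    return sum(ps * log2(ps / qs) for ps, qs in zip(p, q) if ps > 0)

def random_dist(S, rng):
    w = [rng.random() for _ in range(S)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
S = 4
u = [1 / S] * S
for _ in range(1000):
    p, q = random_dist(S, rng), random_dist(S, rng)
    assert kl(p, q) > -1e-12                       # Gibbs' inequality
    assert -1e-12 < entropy(p) <= log2(S) + 1e-12  # entropy bounds
    # The corollary's identity: D(p || u) = log S - H(p).
    assert abs(kl(p, u) - (log2(S) - entropy(p))) < 1e-9
```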

1.3. Communication channels

Definition (Discrete memoryless channel).

A discrete memoryless channel consists of an input alphabet $\mathcal{S} = \{1, \ldots, S\}$, an output alphabet $\mathcal{R} = \{1, \ldots, R\}$, and a stochastic map $T \colon \mathcal{S} \to \Delta_{R-1}$ defined by the transition probabilities

$$p(r \mid s) \;=\; P(\text{receive } r \mid \text{transmit } s).$$

Given a source distribution $p \in \Delta_{S-1}$, the joint distribution, output marginal, and posterior are:

$$p(s,r) = p(s)\,p(r \mid s), \qquad q(r) = \sum_s p(s,r), \qquad q(s \mid r) = \frac{p(s,r)}{q(r)}.$$

In geometric language, the channel $T$ induces a pushforward $q = T_* p$ on the output simplex $\Delta_{R-1}$, and the Bayesian posterior defines a map $\pi \colon \mathcal{R} \to \Delta_{S-1}$ by $\pi(r) = q(\cdot \mid r)$. The posterior map sends each received symbol $r$ to a point on the source simplex, encoding what the receiver has learned about $s$ after observing $r$.
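
These maps translate directly into code. A minimal sketch (plain Python; the function name is ours) computing the output marginal and posterior map for a binary symmetric channel with an assumed crossover probability of 0.1:

```python
def channel_posteriors(prior, transition):
    """Output marginal q(r) and posterior map pi(r) = q(. | r).
    prior: p(s); transition: rows p(r | s) indexed by s."""
    S, R = len(prior), len(transition[0])
    joint = [[prior[s] * transition[s][r] for r in range(R)] for s in range(S)]
    q_r = [sum(joint[s][r] for s in range(S)) for r in range(R)]
    post = [[joint[s][r] / q_r[r] for s in range(S)] for r in range(R)]
    return q_r, post

# Binary symmetric channel, crossover 0.1, prior (0.6, 0.4).
q_r, post = channel_posteriors([0.6, 0.4], [[0.9, 0.1], [0.1, 0.9]])
# Receiving r = 0 pulls the posterior on s = 0 up from the prior 0.6 toward 1.
```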

1.4. Conditional entropy and mutual information

Definition (Conditional entropy).

The conditional entropy of $S$ given $R$ is

$$H(S \mid R) \;=\; -\sum_{s,r} p(s,r) \log q(s \mid r) \;=\; \sum_r q(r)\, H\!\bigl(\pi(r)\bigr),$$

where $H(\pi(r))$ is the entropy of the posterior distribution $q(\cdot \mid r)$.

Definition (Mutual information).

The mutual information between $S$ and $R$ is

$$I(S;R) \;=\; \sum_{s,r} p(s,r) \log \frac{p(s,r)}{p(s)\,q(r)}.$$

Proposition (Entropy decomposition of mutual information).

$$I(S;R) \;=\; H(S) - H(S \mid R).$$

Proof. Starting from the definition of mutual information and using $q(s \mid r) = p(s,r)/q(r)$:

$$I(S;R) = \sum_{s,r} p(s,r) \log \frac{q(s \mid r)}{p(s)} = \sum_{s,r} p(s,r) \log q(s \mid r) - \sum_{s,r} p(s,r) \log p(s).$$

The first sum is $-H(S \mid R)$ by the definition of conditional entropy. For the second sum, marginalize over $r$:

$$\sum_{s,r} p(s,r) \log p(s) = \sum_s p(s) \log p(s) = -H(S).$$

Combining: $I(S;R) = -H(S \mid R) + H(S)$. ∎

Proposition (Divergence decomposition of mutual information).

$$I(S;R) \;=\; \sum_r q(r)\, D_{\mathrm{KL}}\!\bigl(q(\cdot \mid r) \,\Vert\, p(\cdot)\bigr) \;=\; \mathbb{E}_{R \sim q}\!\bigl[D_{\mathrm{KL}}\!\bigl(\pi(R) \,\Vert\, p\bigr)\bigr].$$

Proof. From the proof of the entropy decomposition:

$$I(S;R) = \sum_{s,r} p(s,r) \log \frac{q(s \mid r)}{p(s)} = \sum_r q(r) \sum_s q(s \mid r) \log \frac{q(s \mid r)}{p(s)} = \sum_r q(r)\, D_{\mathrm{KL}}\!\bigl(q(\cdot \mid r) \,\Vert\, p(\cdot)\bigr),$$

where we used $p(s,r) = q(r)\,q(s \mid r)$ to factor the outer sum. ∎

Corollary (Non-negativity of mutual information).

$I(S;R) \geq 0$, with equality if and only if $S$ and $R$ are independent.

Proof. By the divergence decomposition, $I(S;R)$ is a convex combination of KL divergences. Each term satisfies $D_{\mathrm{KL}}(q(\cdot \mid r) \,\Vert\, p(\cdot)) \geq 0$ by Gibbs' inequality, so $I(S;R) \geq 0$. Equality holds if and only if $D_{\mathrm{KL}}(q(\cdot \mid r) \,\Vert\, p(\cdot)) = 0$ for all $r$ with $q(r) > 0$, which by Gibbs' inequality requires $q(s \mid r) = p(s)$ for all $s, r$. This is precisely the condition that $S$ and $R$ are independent. ∎
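
The two decompositions can be confirmed numerically. A minimal sketch (plain Python; helper names are ours) that computes $I(S;R)$ from the joint definition and checks it against $H(S) - H(S \mid R)$:

```python
from math import log2

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def mutual_information(prior, transition):
    """I(S;R) in bits, directly from the joint definition."""
    S, R = len(prior), len(transition[0])
    q_r = [sum(prior[s] * transition[s][r] for s in range(S)) for r in range(R)]
    mi = 0.0
    for s in range(S):
        for r in range(R):
            joint = prior[s] * transition[s][r]
            if joint > 0:
                mi += joint * log2(joint / (prior[s] * q_r[r]))
    return mi

def cond_entropy(prior, transition):
    """H(S | R) = sum_r q(r) H(pi(r))."""
    S, R = len(prior), len(transition[0])
    q_r = [sum(prior[s] * transition[s][r] for s in range(S)) for r in range(R)]
    h = 0.0
    for r in range(R):
        post = [prior[s] * transition[s][r] / q_r[r] for s in range(S)]
        h += q_r[r] * entropy(post)
    return h

prior = [0.6, 0.4]
bsc = [[0.9, 0.1], [0.1, 0.9]]   # binary symmetric channel, crossover 0.1
I = mutual_information(prior, bsc)
# The entropy decomposition I = H(S) - H(S|R) holds numerically:
assert abs(I - (entropy(prior) - cond_entropy(prior, bsc))) < 1e-12
```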

Remark (Geometric interpretation).

On the simplex $\Delta_{S-1}$, the posterior map $\pi \colon \mathcal{R} \to \Delta_{S-1}$ traces out a cloud of points $\{\pi(r) : r \in \mathcal{R}\}$ around the prior $p$. The mutual information $I(S;R)$ is the average KL divergence from $p$ to these posterior points, weighted by the output marginal. A noiseless channel sends $\pi(r)$ to a vertex of the simplex (a point mass), maximizing the divergence at $H(S)$. A completely noisy channel satisfies $\pi(r) = p$ for all $r$, collapsing the divergence to zero. The mutual information thus measures how far, on average, the posterior moves from the prior on the simplex.

The visualization below makes this concrete for a binary channel. The prior $p(s{=}1) = 0.6$ (gold circle) sits on the simplex $\Delta_1 = [0,1]$. As the channel noise $\varepsilon$ decreases, the two posteriors (blue dots, one per received symbol) spread further from the prior, and the mutual information grows.

2. The Gambler's Problem

Consider a sequence of independent horse races, each with $S$ horses. In each race, horse $s$ wins with probability $p(s)$. The track pays odds of $\alpha_s$ to 1 on horse $s$: a unit bet returns $\alpha_s$ if $s$ wins and zero otherwise.

Before each race, the gambler receives a symbol $r$ through a noisy channel with transition probabilities $p(r \mid s)$. After observing $r$, the gambler allocates a fraction $a(s \mid r) \geq 0$ of current capital to each horse $s$, retaining a holdback

$$b_r \;=\; 1 - \sum_{s=1}^{S} a(s \mid r) \;\geq\; 0.$$

If horse $s$ wins, the gambler's wealth is multiplied by $b_r + \alpha_s\, a(s \mid r)$.

Definition (Exponential growth rate).

The exponential growth rate of the gambler's capital is the expected log-return per race:

$$G \;=\; \sum_{s,r} p(s,r) \log\!\bigl[b_r + \alpha_s\, a(s \mid r)\bigr].$$

By the strong law of large numbers applied to the i.i.d. sequence of log-returns, $(1/N)\log_2(W_N/W_0) \to G$ almost surely as $N \to \infty$. After $N$ races the gambler's wealth satisfies $W_N \approx W_0 \cdot 2^{NG}$, so maximizing $G$ maximizes the asymptotic exponential growth rate. The Kelly criterion is the strategy $\{a^*(s \mid r),\, b^*_r\}$ that maximizes $G$ subject to the budget constraint for each $r$.
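
The almost-sure convergence is easy to observe in simulation. A minimal sketch (plain Python; names and the seed are ours) of repeated even-money bets at the Kelly fraction, checking that the empirical per-bet log growth approaches $G$:

```python
import random
from math import log2

def simulate_growth(p, f, n, rng):
    """Empirical per-bet log2 growth: fraction f bet on YES at even odds
    (alpha = 2), where YES wins with probability p."""
    log_w = 0.0
    for _ in range(n):
        if rng.random() < p:
            log_w += log2(1 + f)   # win: wealth times (1 - f) + 2f = 1 + f
        else:
            log_w += log2(1 - f)   # loss: the stake is gone
    return log_w / n

p, f = 0.65, 0.30                  # f = (p - q)/(1 - q) with q = 0.5
G = p * log2(1 + f) + (1 - p) * log2(1 - f)
empirical = simulate_growth(p, f, 200_000, random.Random(42))
```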

For a binary bet ($p = 0.65$, market price $q = 0.50$), the growth rate $G(f)$ as a function of bet fraction is strictly concave, guaranteeing a unique optimum. The gold dot sweeps across all possible bet sizes. Betting too little wastes edge; betting too much inverts the growth rate, leading to ruin. The shaded region marks the zone where $G(f) > 0$.

3. Optimal Allocation

3.1. Decomposition per received symbol

Proposition (Per-symbol decomposition).

The growth rate decomposes as $G = \sum_r q(r)\, G_r$, where

$$G_r \;=\; \sum_s q(s \mid r) \log\!\bigl[b_r + \alpha_s\, a(s \mid r)\bigr].$$

Since $q(r) > 0$ and the variables $\{a(s \mid r),\, b_r\}$ for distinct values of $r$ are disjoint, maximizing $G$ is equivalent to independently maximizing each $G_r$.

Proof. Write $p(s,r) = q(r)\,q(s \mid r)$ and factor the outer sum over $r$. The budget constraint couples only variables sharing the same $r$, so the per-$r$ subproblems are independent. Since each $q(r) > 0$, the global maximum of $G$ is achieved by maximizing each $G_r$ separately. ∎

3.2. The Kelly theorem

By the per-symbol decomposition, fix a received symbol $r$ and write $p(s) = q(s \mid r)$, $a(s) = a(s \mid r)$, $b = b_r$ for brevity. The subproblem is:

$$\text{maximize}\quad G \;=\; \sum_{s=1}^S p(s) \log\!\bigl[b + \alpha_s\, a(s)\bigr] \quad\text{subject to}\quad b + \sum_s a(s) = 1,\;\; a(s) \geq 0,\;\; b \geq 0.$$

Substituting $b = 1 - \sum_t a(t)$ and differentiating with respect to $a(s)$, the chain rule gives

$$\frac{\partial}{\partial a(s)}\bigl[b + \alpha_{s'} a(s')\bigr] \;=\; \alpha_s\,\delta_{s,s'} - 1,$$

where $\delta_{s,s'}$ is the Kronecker delta (the $-1$ arises because increasing $a(s)$ decreases $b$). Therefore:

$$\frac{\partial G}{\partial a(s)} \;=\; (\log e)\!\left[\frac{p(s)\,\alpha_s}{b + \alpha_s\,a(s)} \;-\; \sum_{s'=1}^{S} \frac{p(s')}{b + \alpha_{s'}\,a(s')}\right],$$

where the constant $k = \log e = \log_2 e \approx 1.4427$ arises from differentiating $\log_2$.

Theorem (Kelly optimal allocation).

Let $p(s) > 0$ be the probability that outcome $s$ occurs and $\alpha_s > 0$ the odds. Define the active set $\lambda = \{s : a^*(s) > 0\}$ and its complement $\lambda' = \{s : a^*(s) = 0\}$. Write

$$p \;=\; \sum_{s \in \lambda} p(s), \qquad \sigma \;=\; \sum_{s \in \lambda} \frac{1}{\alpha_s}.$$

The unique growth-rate-maximizing allocation is:

$$b^* \;=\; \frac{1 - p}{1 - \sigma}, \qquad a^*(s) \;=\; p(s) - \frac{b^*}{\alpha_s} \quad \text{for } s \in \lambda, \qquad a^*(s) = 0 \quad \text{for } s \in \lambda'.$$

An outcome $s$ belongs to the active set if and only if its expected return exceeds the holdback: $p(s)\,\alpha_s > b^*$.

Proof. The growth rate $G$ is strictly concave in the $a(s)$ (it is a positively weighted sum of logarithms of affine functions), so the Karush-Kuhn-Tucker conditions are necessary and sufficient. Since $k = \log e > 0$, dividing the partial derivatives by $k$ yields:

$$\text{For } s \in \lambda\!: \quad \frac{p(s)\,\alpha_s}{b + \alpha_s\,a(s)} \;=\; C, \qquad\qquad \text{For } s \in \lambda'\!: \quad \frac{p(s)\,\alpha_s}{b} \;\leq\; C,$$

where $C = \sum_{s'} p(s')/(b + \alpha_{s'}\,a(s'))$ is independent of $s$.

Step 1 (Express allocations in terms of $C$). From the active-set condition:

$$b + \alpha_s\,a(s) \;=\; \frac{p(s)\,\alpha_s}{C} \quad \text{for each } s \in \lambda,$$

giving $a(s) = p(s)/C - b/\alpha_s$.

Step 2 (Determine $C$). Decompose the sum defining $C$ over the active and inactive sets:

$$C \;=\; \sum_{s \in \lambda} \frac{p(s)}{p(s)\alpha_s / C} \;+\; \sum_{s \in \lambda'} \frac{p(s)}{b} \;=\; C\sigma \;+\; \frac{1-p}{b},$$

so $C(1 - \sigma) = (1-p)/b$, equivalently $b = (1-p)/(C(1-\sigma))$.

Step 3 (Apply the budget constraint). From $b + \sum_{s \in \lambda} a(s) = 1$:

$$b + \sum_{s \in \lambda}\!\left(\frac{p(s)}{C} - \frac{b}{\alpha_s}\right) = b + \frac{p}{C} - b\sigma = b(1-\sigma) + \frac{p}{C} = 1.$$

Substituting $b(1-\sigma) = (1-p)/C$:

$$\frac{1-p}{C} + \frac{p}{C} \;=\; \frac{1}{C} \;=\; 1,$$

so $C = 1$.

Step 4 (Final formulas). Substituting $C = 1$: $b^* = (1-p)/(1-\sigma)$ and $a^*(s) = p(s) - b^*/\alpha_s$ for $s \in \lambda$. The membership condition $a^*(s) > 0$ is equivalent to $p(s) > b^*/\alpha_s$, i.e., $p(s)\,\alpha_s > b^*$. ∎

Remark (Side information).

With side information, the per-symbol decomposition reduces the problem to one instance of the Kelly optimal allocation per received symbol $r$, with $q(s \mid r)$ in place of $p(s)$. The optimal strategy is $b^*_r = (1 - p_r)/(1 - \sigma_r)$ and $a^*(s \mid r) = q(s \mid r) - b^*_r/\alpha_s$ for $s \in \lambda_r$, where $p_r = \sum_{s \in \lambda_r} q(s \mid r)$ and $\sigma_r = \sum_{s \in \lambda_r} 1/\alpha_s$.

Remark (Determining the active set).

The active set $\lambda$ and the holdback $b^*$ are mutually dependent: $\lambda = \{s : p(s)\alpha_s > b^*\}$ while $b^*$ depends on $\lambda$ through $p$ and $\sigma$. In practice, sort outcomes by $p(s)\alpha_s$ in decreasing order and add them to $\lambda$ one at a time, recomputing $b^*$ at each step, until the membership condition fails.
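
A sketch of that procedure (plain Python; the function name is ours, ties in $p(s)\alpha_s$ are broken arbitrarily, and it assumes a genuine edge exists with $\sigma < 1$ on the active set):

```python
def kelly_allocation(p, alpha):
    """Kelly-optimal allocation for one race.
    p: win probabilities; alpha: odds (alpha_s to 1). Returns (b, a)."""
    order = sorted(range(len(p)), key=lambda s: p[s] * alpha[s], reverse=True)
    active = []
    b = 1.0                                  # no bets: holdback is all capital
    P = sigma = 0.0
    for s in order:
        # Tentatively add s, then recompute the holdback b*.
        P_new = P + p[s]
        sigma_new = sigma + 1 / alpha[s]
        if sigma_new >= 1:                   # b* would be undefined or negative
            break
        b_new = (1 - P_new) / (1 - sigma_new)
        if p[s] * alpha[s] > b_new:          # membership condition holds
            active.append(s)
            P, sigma, b = P_new, sigma_new, b_new
        else:
            break
    a = [0.0] * len(p)
    for s in active:
        a[s] = p[s] - b / alpha[s]
    return b, a
```

On the binary example from earlier ($p = 0.65$ at even odds), this recovers $b^* = 0.7$ and a bet of $0.3$ on the favorite.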

4. Growth Rate as Mutual Information

Definition (Fair odds).

The odds $(\alpha_1, \ldots, \alpha_S)$ are fair when $\sum_{s=1}^S 1/\alpha_s = 1$, so that the reciprocal odds $(1/\alpha_1, \ldots, 1/\alpha_S)$ form a probability distribution on outcomes. In particular, $\alpha_s = 1/p(s)$ gives true fair odds reflecting the source distribution.

Theorem (Kelly's fundamental result).

Under true fair odds $\alpha_s = 1/p(s)$, the maximum growth rate with side information equals the mutual information:

$$G^* \;=\; I(S;R).$$

Proof. With $\alpha_s = 1/p(s)$, the full sum $\sum_s 1/\alpha_s = \sum_s p(s) = 1$. For each received symbol $r$, consider the candidate strategy $b_r = 0$, $a(s \mid r) = q(s \mid r)$ for all $s$.

Feasibility. The budget constraint is satisfied: $b_r + \sum_s a(s \mid r) = 0 + \sum_s q(s \mid r) = 1$. Each $a(s \mid r) = q(s \mid r) \geq 0$.

Optimality. Evaluate the partial derivative of $G_r$ at this candidate:

$$\frac{\partial G_r}{\partial a(s)}\bigg|_{a = q(\cdot \mid r)} = (\log e)\!\left[\frac{q(s \mid r)\, \alpha_s}{\alpha_s\, q(s \mid r)} - \sum_{s'} \frac{q(s' \mid r)}{\alpha_{s'}\, q(s' \mid r)}\right] = (\log e)\!\left[1 - \sum_{s'} p(s')\right] = 0.$$

Since $G_r$ is concave and all partial derivatives vanish simultaneously, $a(s \mid r) = q(s \mid r)$ is the global maximizer.

Growth rate. The optimal per-symbol growth rate is:

$$G^*_r = \sum_s q(s \mid r) \log\!\bigl[\alpha_s\, q(s \mid r)\bigr] = \sum_s q(s \mid r) \log \frac{q(s \mid r)}{p(s)} = D_{\mathrm{KL}}\!\bigl(q(\cdot \mid r) \,\Vert\, p(\cdot)\bigr).$$

Averaging over received symbols:

$$G^* = \sum_r q(r)\, G^*_r = \sum_r q(r)\, D_{\mathrm{KL}}\!\bigl(q(\cdot \mid r) \,\Vert\, p(\cdot)\bigr) = I(S;R),$$

where the last equality is the divergence decomposition of mutual information. ∎

Geometrically, the optimal Kelly strategy maps each received symbol $r$ to the corresponding posterior $\pi(r)$ on the simplex $\Delta_{S-1}$, and the growth rate measures the expected KL divergence traversed from the prior $p$ to the posterior $\pi(r)$. The gambler's capital grows precisely because the private signal moves the posterior away from the prior, and this displacement, averaged over the output distribution, is the mutual information.

Corollary (No information, no growth).

Without side information (i.e., when $S$ and $R$ are independent), the maximum growth rate under true fair odds is $G^* = I(S;R) = 0$. No betting strategy can achieve positive long-run growth against a fair market without an informational edge.

Proof. When $S$ and $R$ are independent, $q(s \mid r) = p(s)$ for all $r$, so $I(S;R) = 0$ by the non-negativity corollary. ∎
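
The fundamental result checks out numerically. A minimal sketch (plain Python; variable names are ours): under true fair odds, betting the posterior with zero holdback yields a growth rate equal to $I(S;R)$ for a binary symmetric channel:

```python
from math import log2

prior = [0.6, 0.4]
transition = [[0.9, 0.1], [0.1, 0.9]]   # binary symmetric channel
alpha = [1 / p for p in prior]          # true fair odds

# Output marginal and posterior map.
S, R = 2, 2
q_r = [sum(prior[s] * transition[s][r] for s in range(S)) for r in range(R)]
post = [[prior[s] * transition[s][r] / q_r[r] for s in range(S)] for r in range(R)]

# Optimal growth: bet the posterior with zero holdback, per the theorem.
G_star = sum(q_r[r] * sum(post[r][s] * log2(alpha[s] * post[r][s])
                          for s in range(S))
             for r in range(R))

# Mutual information from the joint definition.
I = sum(prior[s] * transition[s][r] * log2(transition[s][r] / q_r[r])
        for s in range(S) for r in range(R))
```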

5. Binary Prediction Markets

We specialize to two outcomes: YES ($s = 1$) and NO ($s = 2$). The gambler believes $P(\text{YES}) = p$ while the market prices a YES contract at $q$, giving odds $\alpha_{\text{YES}} = 1/q$ and $\alpha_{\text{NO}} = 1/(1-q)$. These are fair odds since $1/\alpha_{\text{YES}} + 1/\alpha_{\text{NO}} = q + (1-q) = 1$.

Corollary (Binary Kelly criterion for prediction markets).

For a binary prediction market with true probability $p$ and market price $q$, with $p > q$ (positive edge on YES), the Kelly-optimal fraction of capital to bet on YES is

$$f^* \;=\; \frac{p - q}{1 - q},$$

and the resulting growth rate is

$$G^* \;=\; p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} \;=\; D_{\mathrm{KL}}\!\bigl((p,\,1{-}p) \,\Vert\, (q,\,1{-}q)\bigr).$$

Proof. Apply the Kelly optimal allocation with active set $\lambda = \{\text{YES}\}$, so $\sigma = 1/\alpha_{\text{YES}} = q$ and $p_\lambda = p$. Then

$$b^* = \frac{1-p}{1-q}, \qquad f^* = a^*(\text{YES}) = p - \frac{b^*}{\alpha_{\text{YES}}} = p - \frac{q(1-p)}{1-q} = \frac{p(1-q) - q(1-p)}{1-q} = \frac{p-q}{1-q}.$$

For the growth rate, the wealth multiplier when YES wins is $b^* + \alpha_{\text{YES}} f^* = (1-p)/(1-q) + (p-q)/(q(1-q))$. Combining over a common denominator:

$$b^* + \alpha_{\text{YES}} f^* = \frac{q(1-p) + (p-q)}{q(1-q)} = \frac{p(1-q)}{q(1-q)} = \frac{p}{q}.$$

When NO wins, the multiplier is $b^* + 0 = (1-p)/(1-q)$. Therefore:

$$G^* = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q} = D_{\mathrm{KL}}\!\bigl((p,1{-}p) \,\Vert\, (q,1{-}q)\bigr).$$

This is the KL divergence from the market distribution $(q, 1{-}q)$ to the gambler's true distribution $(p, 1{-}p)$. ∎

Remark (Edge on NO).

When $p < q$, the edge is on NO. By symmetry, the optimal fraction to bet on NO is $(q - p)/q$, and the growth rate is again $D_{\mathrm{KL}}((p, 1{-}p) \,\Vert\, (q, 1{-}q))$. In either direction, the Kelly growth rate for a binary prediction market is the KL divergence between the gambler's beliefs and the market's implied probabilities, measuring the information-theoretic "distance" between truth and market on the one-dimensional simplex $\Delta_1 = [0,1]$.
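
Both directions fold into one small helper (plain Python; the sign convention, positive for YES and negative for NO, is ours):

```python
from math import log2

def binary_kelly(p, q):
    """Kelly fraction and growth rate (bits/bet) for a binary market
    with true probability p and market price q. Positive fraction means
    bet YES; negative means bet fraction |f| on NO."""
    if p > q:
        f = (p - q) / (1 - q)
    else:
        f = -(q - p) / q
    g = p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))
    return f, g

f, g = binary_kelly(0.65, 0.50)   # the worked example: f* = 0.3
```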

The visualization below shows $G(f)$ for even-money odds ($q = 0.5$) as the true probability $p$ sweeps from 0.52 to 0.80. The gold dot marks the optimal fraction $f^* = (p - q)/(1 - q)$, and the faded traces show snapshots at intermediate values of $p$.

The next figure illustrates the binary Kelly growth rate as a KL divergence on the simplex $\Delta_1 = [0, 1]$. The top panel shows the market price $q$ (hollow) and the gambler's true probability $p$ (gold) on the simplex. The bottom panel plots $D_{\mathrm{KL}}((p, 1{-}p) \,\Vert\, (q, 1{-}q))$ as a function of $p$.

Finally, we simulate the long-run behavior. Four wealth trajectories are drawn for each of three strategies: full Kelly (gold), half Kelly (blue), and double Kelly (coral). Full Kelly maximizes asymptotic growth but with high variance. Half Kelly sacrifices some growth for smoother paths. Double Kelly overbets, producing frequent ruin despite the positive edge. Dashed lines show the theoretical expected growth rate $g \cdot N$.

6. Application to Algorithmic Trading

The Kelly framework extends naturally from prediction markets to algorithmic trading systems, though several practical modifications are essential.

6.1. From discrete bets to continuous sizing

In practice, a trading system faces a stream of opportunities rather than a single repeated bet. Each opportunity $i$ has an estimated edge $\hat{p}_i$ and market-implied probability $q_i$. The Kelly fraction for opportunity $i$ is $f^*_i = (\hat{p}_i - q_i)/(1 - q_i)$, and the capital allocated is $f^*_i \cdot W$ where $W$ is current wealth.

Two complications arise immediately. First, the true probability $p$ is never known exactly. The gambler works with an estimate $\hat{p}$ that carries both systematic bias and estimation variance. Second, multiple positions may be open simultaneously, introducing correlation structure that the single-bet Kelly formula ignores.

6.2. Fractional Kelly and estimation error

When the estimate $\hat{p}$ is noisy, full Kelly is too aggressive. To see why, consider the expected growth rate under misspecification. If the true probability is $p$ but the gambler bets as though it were $\hat{p}$, the wealth multipliers become $\hat{p}/q$ on a win and $(1-\hat{p})/(1-q)$ on a loss, so the realized growth rate is

$$G(\hat{p}; p) = p \log\frac{\hat{p}}{q} + (1-p)\log\frac{1 - \hat{p}}{1-q},$$

which falls short of $G^*$ by exactly $D_{\mathrm{KL}}\bigl((p, 1{-}p) \,\Vert\, (\hat{p}, 1{-}\hat{p})\bigr)$, and is therefore strictly less than $G^*$ whenever $\hat{p} \neq p$. The loss from overestimating edge is asymmetrically worse than the loss from underestimating it, because the logarithm penalizes losses more than it rewards gains. This asymmetry motivates fractional Kelly: bet a fraction $\kappa \in (0, 1)$ of the full Kelly amount. Common choices are $\kappa = 0.5$ (half Kelly) or $\kappa = 0.25$ (quarter Kelly).

The fractional Kelly strategy trades expected growth for reduced variance. With fraction $\kappa$, the quadratic approximation around the optimum gives $G(\kappa) \approx (2\kappa - \kappa^2)\, G^*$, achieving $75\%$ of maximum growth at half Kelly while halving the volatility of log-returns [5].
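
A sketch comparing exact and approximate growth across $\kappa$ (plain Python; names are ours). At $\kappa = 0.5$ the exact rate sits near $75\%$ of $G^*$, and at $\kappa = 2$ it goes negative, matching the double-Kelly ruin in the simulation above:

```python
from math import log2

def growth(p, q, f):
    """Exact growth rate betting fraction f on YES at price q, true prob p."""
    return p * log2(1 - f + f / q) + (1 - p) * log2(1 - f)

p, q = 0.60, 0.50
f_full = (p - q) / (1 - q)
G_star = growth(p, q, f_full)

results = {}
for kappa in (0.25, 0.5, 1.0, 2.0):
    exact = growth(p, q, kappa * f_full)
    approx = (2 * kappa - kappa ** 2) * G_star   # quadratic approximation
    results[kappa] = (exact, approx)
```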

6.3. Portfolio Kelly and correlation

When multiple positions are open simultaneously, the correct generalization replaces the scalar Kelly fraction with a vector problem. Given $n$ simultaneous opportunities with mean excess return vector $\mu$ and covariance matrix $\Sigma$, the growth-rate-maximizing portfolio weight vector is

$$f^* = \Sigma^{-1} \mu,$$

the familiar Markowitz mean-variance portfolio, which turns out to be the continuous-time limit of the Kelly criterion [6]. For a trading system running multiple concurrent positions on Polymarket or similar venues, this means accounting for correlation between outcomes. Correlated positions effectively concentrate risk, requiring smaller individual allocations than the independent-bet formula suggests.
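
For two assets the matrix inverse is closed-form, which makes the correlation effect easy to see. A minimal sketch (plain Python; the function name is ours; this is the unconstrained continuous-time approximation, ignoring the $\sum_i f_i \leq 1$ budget):

```python
def portfolio_kelly2(mu, Sigma):
    """f* = Sigma^{-1} mu for two assets, via the closed-form 2x2 inverse.
    mu: mean excess returns; Sigma: 2x2 covariance matrix."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    return [(d * mu[0] - b * mu[1]) / det,
            (a * mu[1] - c * mu[0]) / det]

# Uncorrelated case reduces to independent scalar Kelly: f_i = mu_i / sigma_i^2.
f_ind = portfolio_kelly2([0.02, 0.02], [[0.04, 0.0], [0.0, 0.04]])
# Positive correlation (rho = 0.6) concentrates risk: both weights shrink.
f_corr = portfolio_kelly2([0.02, 0.02], [[0.04, 0.024], [0.024, 0.04]])
```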

6.4. Practical guidelines

For an algorithmic system implementing Kelly sizing:

  1. Estimate conservatively. Calibrate $\hat{p}$ on out-of-sample data using proper scoring rules (Brier score, log score). Prefer models that are well-calibrated over those that maximize accuracy.

  2. Use fractional Kelly. Start with $\kappa = 0.25$ and increase only as confidence in the probability estimates grows. The growth rate curve is flat near the optimum (see Figure 1), so undersizing costs far less than oversizing.

  3. Respect the budget constraint. The total allocation across all active bets must satisfy $\sum_i f_i \leq 1$. When many opportunities appear simultaneously, the constraint forces prioritization by edge quality $p_i \alpha_i$, exactly as the active-set remark in §3 describes.

  4. Rebalance on new information. Each new observation updates the posterior $q(s \mid r)$, changing the Kelly fraction. The capital path is not set-and-forget but adapts continuously as the information channel evolves.

  5. Account for fees and slippage. Transaction costs reduce the effective odds $\alpha_s$. If the market charges a fee $\phi$, replace $\alpha_s$ with $\alpha_s(1 - \phi)$ in all formulas. Small edges can be entirely consumed by fees, leaving no profitable bet.

7. Conclusion

Kelly's 1956 result reveals a deep structural identity: the maximum rate at which a gambler can grow capital through repeated wagers is exactly the mutual information $I(S;R)$ between the source of uncertainty and the signal available to the bettor. This connection is not a loose analogy. It is a theorem, derived here from the KKT conditions of a constrained concave optimization and the divergence decomposition of mutual information.

The proof passes through three stages. First, the growth rate $G$ decomposes into independent subproblems, one per received symbol. Second, solving each subproblem via the active-set conditions yields the closed-form allocation $a^*(s) = q(s \mid r) - b^*/\alpha_s$. Third, under fair odds, the optimal growth rate for each received symbol reduces to a KL divergence from the prior to the posterior, and averaging recovers the mutual information.

For binary prediction markets, the result simplifies further: the Kelly fraction is $f^* = (p - q)/(1 - q)$ and the growth rate is the KL divergence $D_{\mathrm{KL}}((p, 1{-}p) \,\Vert\, (q, 1{-}q))$. No betting system can beat this rate, and no positive rate is achievable without an informational edge. These are not heuristics but consequences of the concavity of the logarithm and the geometry of the probability simplex.

In practice, uncertainty in the probability estimates, correlated positions, and transaction costs all require modifications. Fractional Kelly provides a principled way to trade growth for robustness, and the portfolio generalization handles simultaneous bets. The core message endures: information is capital, and the rate of capital growth is bounded by the rate of information.

Appendix: Kelly's Original Slash Notation


Kelly's paper writes conditional probabilities with a slash rather than a conditioning bar: $p(r/s)$ in place of $p(r \mid s)$, and $q(s/r)$ in place of $q(s \mid r)$. The full notation dictionary is:

This post | Kelly (1956) | Meaning
$p(s)$ | $p(s)$ | Source probability
$p(r \mid s)$ | $p(r/s)$ | Channel transition
$q(r)$ | $q(r)$ | Output marginal
$q(s \mid r)$ | $q(s/r)$ | Posterior
$p(s, r)$ | $p(s, r)$ | Joint probability
$a(s \mid r)$ | $a(s/r)$ | Bet fraction
$H(S \mid R)$ | $H(S/R)$ | Conditional entropy

Growth rate (Kelly's Eq. 6). In slash notation:

$$G \;=\; \sum_{s,r} p(s,r) \log\!\bigl[b_r + \alpha_s\, a(s/r)\bigr].$$

Budget constraint. For each received symbol $r$:

$$b_r + \sum_{s=1}^{S} a(s/r) = 1, \qquad a(s/r) \geq 0, \qquad b_r \geq 0.$$

Optimal allocation (Kelly's Eq. 10). Writing $p(s) = q(s/r)$ for the posterior, the KKT conditions yield:

$$b^* = \frac{1 - p}{1 - \sigma}, \qquad a^*(s) = p(s) - \frac{b^*}{\alpha_s} \quad (s \in \lambda).$$

Proof sketch in slash notation. The partial derivatives of $G$ (Kelly's Eq. 7) are:

$$\frac{\partial G}{\partial a(s)} = k\!\left[\frac{p(s)\,\alpha_s}{b + \alpha_s\,a(s)} - \sum_{s'} \frac{p(s')}{b + \alpha_{s'}\,a(s')}\right],$$

where $k = \log e$. Setting the derivatives to zero on the active set (and nonpositive off it) gives:

$$\frac{p(s)\,\alpha_s}{b + \alpha_s\,a(s)} = C \quad (s \in \lambda), \qquad \frac{p(s)\,\alpha_s}{b} \leq C \quad (s \in \lambda').$$

From the active-set condition, $b + \alpha_s a(s) = p(s)\alpha_s / C$, giving $a(s) = p(s)/C - b/\alpha_s$. Summing the definition of $C$ over $\lambda$ and $\lambda'$ separately yields $C(1 - \sigma) = (1-p)/b$. The budget constraint $b + \sum_{s \in \lambda} a(s) = 1$ then gives $1/C = 1$, so $C = 1$.

Kelly's fundamental result (Kelly's Eq. 11). Under fair odds $\alpha_s = 1/p(s)$, the optimal strategy is $b_r = 0$, $a(s/r) = q(s/r)$, and the growth rate is:

$$G^* = \sum_r q(r) \sum_s q(s/r) \log \frac{q(s/r)}{p(s)} = \sum_r q(r)\,D_{\mathrm{KL}}\!\bigl(q(\cdot/r) \,\Vert\, p(\cdot)\bigr) = I(S;R).$$

The mutual information decomposes as $I(S;R) = H(S) - H(S/R)$, where $H(S/R) = -\sum_{s,r} p(s,r) \log q(s/r)$ is the conditional entropy in slash notation.

References

  1. Kelly, J. L. Jr. (1956). A New Interpretation of Information Rate. Bell System Technical Journal, 35(4), 917–926.
  2. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.
  3. Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley.
  4. Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry. AMS.
  5. MacLean, L. C., Thorp, E. O., and Ziemba, W. T. (2011). The Kelly Capital Growth Investment Criterion. World Scientific.
  6. Thorp, E. O. (2006). The Kelly Criterion in Blackjack, Sports Betting, and the Stock Market. In Handbook of Asset and Liability Management, Vol. 1.
