Shannon information (I_S)

What is Shannon Information?

Definition (in bits, log base 2):

\[ I_S(x^n; P) = -\log_2 P(x^n) = \sum_{i=1}^{n} -\log_2 P(x_i \mid x^{i-1}) \]
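
A minimal sketch of the chain-rule computation, assuming an i.i.d. Bernoulli(p) model (the function name and example values are illustrative, not part of the original text):

```python
import math

def shannon_information_iid(xs, p):
    """I_S(x^n; P) = -log2 P(x^n) for an i.i.d. Bernoulli(p) model,
    computed as the chain-rule sum of per-symbol surprisals."""
    total = 0.0
    for x in xs:
        prob = p if x == 1 else 1.0 - p
        total += -math.log2(prob)  # here -log2 P(x_i | x^{i-1}) = -log2 P(x_i)
    return total

# Example: a length-8 binary string under Bernoulli(0.25)
print(shannon_information_iid([0, 0, 1, 0, 0, 0, 1, 0], p=0.25))  # ~6.49 bits
```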

Average behavior (i.i.d. case, with H the per-symbol entropy):

\[ \mathbb{E}\!\left[I_S(X^n)\right] = n\,H \]

Consequence: number of typical samples and their probability:

\[ \text{count} \;\approx\; 2^{\,\mathbb{E}[I_S(X^n)]} \]
\[ \Pr(\text{typical } x^n) \;\approx\; 2^{-\mathbb{E}[I_S(X^n)]} \]
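
A quick Monte Carlo check of the i.i.d. identities above, with an assumed Bernoulli(0.25) source (parameters are arbitrary):

```python
import math, random

p, n, trials = 0.25, 1000, 2000
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # per-symbol entropy

def info(xs):
    return sum(-math.log2(p if x == 1 else 1 - p) for x in xs)

random.seed(0)
avg = sum(info([1 if random.random() < p else 0 for _ in range(n)])
          for _ in range(trials)) / trials

print(f"E[I_S]/n ~ {avg / n:.4f}   vs   H = {H:.4f}")        # both ~0.8113
print(f"typical count ~ 2^{avg:.0f}, typical prob ~ 2^(-{avg:.0f})")
```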

Short history

Shannon (1948) identified the total codelength with I_S: under a given model P, the best lossless code for an observed sample of length n needs about I_S(x^n) bits.

Physical meaning

Statistical physics analogy: under a uniform distribution, I_S equals the base-2 logarithm of the number of microstates, i.e., a measure of how rare the observed configuration is.

Landauer’s principle: erasing one bit of information has a minimum heat cost, where k is Boltzmann’s constant and T the temperature of the environment:

\[ k\,T\,\ln 2 \]
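
For scale, a short numeric evaluation of the bound at an assumed room temperature of 300 K:

```python
import math

k_B = 1.380649e-23             # Boltzmann constant, J/K (exact SI value)
T = 300.0                      # assumed room temperature, K

landauer = k_B * T * math.log(2)   # minimum heat dissipated per erased bit
print(f"{landauer:.3e} J per bit") # ~2.87e-21 J
```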

Mutual information as extractable work: the mutual information between systems, expressed via I_S, upper-bounds the work extractable from their correlations.

Mathematical properties (briefly)

Additivity under independence:

\[ I_S(x^n, y^m) \;=\; I_S(x^n) + I_S(y^m) \]
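
A small numeric check, assuming two independent i.i.d. Bernoulli models (all values are arbitrary illustrations):

```python
import math
from functools import reduce

def prob_iid(xs, p):
    """P(x^n) under an i.i.d. Bernoulli(p) model."""
    return reduce(lambda acc, x: acc * (p if x == 1 else 1 - p), xs, 1.0)

x, y, p, q = [1, 0, 0, 1], [0, 0, 1], 0.3, 0.6

lhs = -math.log2(prob_iid(x, p) * prob_iid(y, q))             # I_S of the pair under independence
rhs = -math.log2(prob_iid(x, p)) - math.log2(prob_iid(y, q))  # sum of the two I_S values
print(lhs, rhs)  # agree up to floating-point rounding
```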

Lower and upper extremes (deterministic and uniform cases):

\[ P(x^n)=1 \;\Rightarrow\; I_S=0 \]
\[ \text{uniform} \;\Rightarrow\; I_S = \log_{2}\!\big(|\mathcal{X}|^{\,n}\big) \]
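
A one-line worked instance of the uniform case, assuming alphabet size 4 and n = 10:

```python
import math

# Uniform model on an alphabet of size 4, sequence length 10:
# every x^n has P(x^n) = 4**-10, so I_S = log2(4**10) = 20 bits.
print(-math.log2(4.0 ** -10))  # 20.0
```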

Information rate (stationary–ergodic processes):

\[ r := \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\!\big[I_S(X^n)\big] \]
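
For a concrete stationary ergodic example, the rate of a two-state Markov chain can be computed from its stationary distribution; the transition matrix below is an assumed toy example:

```python
import math

# r = sum_i pi_i * H(P[i]) for a 2-state Markov chain with transition matrix P.
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a 2-state chain: pi_0 = P[1][0] / (P[0][1] + P[1][0]).
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

def H_row(row):
    return -sum(p * math.log2(p) for p in row if p > 0)

r = sum(pi[i] * H_row(P[i]) for i in range(2))
print(f"entropy rate r ~ {r:.4f} bits/symbol")  # ~0.5694
```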

Coding interpretation: there exists a prefix code achieving the following bound:

\[ L(x^n) \;\le\; I_S(x^n) + 1 \]
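
A sketch of why such a code exists: the Shannon code lengths ⌈-log2 P(x)⌉ satisfy the Kraft inequality, so a prefix code with those lengths can be built (the distribution below is an assumed example):

```python
import math

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}
kraft = sum(2.0 ** -l for l in lengths.values())

print(lengths)                  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(f"Kraft sum = {kraft}")   # <= 1, so a prefix code with these lengths exists
for x, p in P.items():
    assert lengths[x] <= -math.log2(p) + 1   # L(x) <= I_S(x) + 1
```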

Relation to Kolmogorov complexity K

Upper bound (whenever the model P is computable):

\[ K(x^n) \;\le\; I_S(x^n; P) + c_{P} \]

Proof sketch: arithmetic coding provides a prefix code of length at most I_S(x^n; P) plus a constant; a fixed universal Turing machine can decode it given a description of P.
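
In the same spirit (though not the arithmetic coder of the sketch), any fixed, computable compressor gives a concrete upper bound on K(x^n) up to a machine-dependent constant; zlib is used here only as an illustration:

```python
import zlib

x = b"ab" * 500                              # a highly regular 1000-byte string
raw_bits = 8 * len(x)
bound_bits = 8 * len(zlib.compress(x, 9))    # codelength of one computable compressor

print(f"raw: {raw_bits} bits, compressed upper bound on K: ~{bound_bits} bits")
```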

Typical accuracy (stationary–ergodic sources, e.g., ergodic Markov chains):

\[ K(x^n) \;=\; I_S(x^n; \mu) + O(1) \]

Per-symbol limits equal the same rate r (almost surely):

\[ \lim_{n\to\infty}\frac{1}{n}K(x^n) \;=\; r \]
\[ \lim_{n\to\infty}\frac{1}{n}I_S(x^n) \;=\; r \]
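
A simulation sketch for the two-state chain used earlier (transition matrix assumed as before): the per-symbol information under the true model μ approaches r ≈ 0.5694.

```python
import math, random

P = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]                     # stationary distribution of this chain
random.seed(1)

def sample_chain(n, start=0):
    xs, s = [start], start
    for _ in range(n - 1):
        s = 0 if random.random() < P[s][0] else 1
        xs.append(s)
    return xs

def info_markov(xs):
    # I_S(x^n; mu) = -log2 mu(x^n), first symbol scored with the stationary law
    total = -math.log2(pi[xs[0]])
    for prev, cur in zip(xs, xs[1:]):
        total += -math.log2(P[prev][cur])
    return total

n = 100_000
xs = sample_chain(n)
print(f"(1/n) I_S(x^n; mu) = {info_markov(xs) / n:.4f}   (r ~ 0.5694)")
```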

Limitations and remarks

Model dependence: I_S depends on the chosen model P; a mismatched model inflates the codelength, overestimating the information in the data.

Unknown length: the sample length must also be encoded; approximate overhead:

\[ \log_{2} n \quad \text{bits} \]

Per-symbol overhead (negligible for large n):

\[ \frac{\log_{2} n}{n} \]

Continuous variables: the density-based analogue below is unit-dependent (it shifts under rescaling of x), so it does not give an absolute information content.

\[ -\log_{2} f(x^{n}) \]
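
A short illustration of the unit dependence, assuming a Gaussian density and a change of units from metres to centimetres (both choices are arbitrary):

```python
import math

def neg_log2_normal(x, sigma):
    f = math.exp(-x**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(f)

# Same physical point, two unit systems: y = 100 * x rescales the density by 1/100.
print(neg_log2_normal(1.0, sigma=1.0))       # x in metres
print(neg_log2_normal(100.0, sigma=100.0))   # the same point in centimetres
# The two values differ by exactly log2(100) ~ 6.64 bits.
```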

Semantics-blind: I_S measures statistical structure, not meaning.

Ergodic Markov chains: due to the typical equality above, I_S is especially reliable for approximating K.

Mismatched model and excess codelength

If the true source is P but encoding uses the model Q (i.i.d. case shown; for stationary ergodic processes the per-symbol excess converges to the corresponding divergence rate):

\[ \mathbb{E}_{P}\!\left[-\log_{2} Q(X^n)\right] \;=\; \mathbb{E}_{P}\!\left[I_S(X^n; P)\right] \;+\; n\,D(P\Vert Q) \]
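
A per-symbol numeric check of this identity for assumed Bernoulli sources P = Bernoulli(0.3) and Q = Bernoulli(0.5):

```python
import math

p, q = 0.3, 0.5   # true source P and mismatched coding model Q

H_P   = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))               # entropy of P
D_PQ  = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))  # D(P || Q)
cross = -(p * math.log2(q) + (1 - p) * math.log2(1 - q))               # E_P[-log2 Q(X)]

print(f"{cross:.6f}  vs  {H_P + D_PQ:.6f}")   # equal: excess codelength = D(P||Q) per symbol
```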

Quick takeaways

I_S measures rarity and the minimum achievable codelength relative to a model:

\[ I_S(x^n) \;=\; -\log_{2} P(x^n) \]

Its average is n times the entropy H (i.i.d. case):

\[ \mathbb{E}\!\left[I_S(X^n)\right] = n\,H \]

Computable sources (upper bound) and ergodic sources (typical equality):

\[ K(x^n) \;\le\; I_S(x^n; P) + c_{P} \]
\[ I_S(x^n; \mu) \;=\; K(x^n) + O(1) \]