What is Maximum Information?
IMAX is an upper-bound information measure that assigns the largest theoretically possible information to a dataset of length n drawn from an alphabet with k distinct symbols. It answers: “How many bits are needed at most to specify any dataset over this alphabet under the most information-rich (uniform, independent) scenario?”
Short history
The concept follows classical coding and combinatorial counting: with k equiprobable symbols, the per-symbol cost approaches log2(k) bits. Enumerating all sequences of a given (or bounded) length via counting arguments yields tight upper bounds. IMAX packages these bounds into a practical metric.
Physical meaning
IMAX quantifies the hypothetical “information capacity” of the dataset: the number of fair-coin flips required, in the most demanding case, to pinpoint the dataset among all possibilities defined by its length n and alphabet size k.
Mathematical properties
Setup. Let n be the dataset length and k the number of unique symbols. We use the symbol IIMAX in formulas.
Uniform i.i.d. upper bound (fixed length). Under independent, equiprobable symbols the maximum is:
Geometric-series upper bound (length up to n). Indexing any sequence whose length is at most n over k symbols gives:
Offset to fixed-length bound. The difference to the fixed-length form is bounded (for k ≥ 2):
Unary alphabet edge case. If k = 1, the number of distinct datasets of length at most n is n+1 (choose the length), hence:
Monotonicity. IMAX is non-decreasing in both n and k:
Concatenation (piecewise uniform). For two segments with lengths n1, n2 and alphabet sizes k1, k2:
Relation to Kolmogorov complexity
Because IMAX enumerates all admissible datasets under simple side conditions (length, alphabet), it upper-bounds plain descriptional complexity up to machine and side-information terms:
Here Lside(x) accounts for conveying n, k, the symbol–index mapping, and decoding conventions; cU is a machine-dependent constant.
Limitations and remarks
1) IMAX is intentionally optimistic: it ignores actual symbol frequencies and dependencies, so it can overestimate true compressibility.
2) The geometric-series form differs from the fixed-length form by at most one bit for k ≥ 2, but conceptually counts all lengths up to n.
3) For very large values, using the simpler n · log2(k) improves numerical stability; a practical threshold (e.g., 500 bits) can switch to this path.
4) Pragmatic edge conventions like returning 0 for empty input and 1 for singletons keep the scale intuitive and avoid degeneracies.
Algorithm (pseudocode)
The pseudocode mirrors the provided implementation of IMAX for collections and byte arrays.
function IMAX_value(values):
if values == null: return 0
n = length(values)
if n == 0: return 0
if n == 1: return 1 // pragmatic base unit
k = count_unique(values)
// fast path: fixed-length i.i.d. upper bound
v = n * log2(k)
if v > 500:
return v
// exact geometric-series bound (length up to n)
if k == 1:
return log2(n + 1)
else:
return log2( (k^(n + 1) - 1) / (k - 1) )