Maximum Information (IMAX)

What is Maximum Information?

IMAX is an upper-bound information measure that assigns the largest theoretically possible information to a dataset of length n drawn from an alphabet with k distinct symbols. It answers: “How many bits are needed at most to specify any dataset over this alphabet under the most information-rich (uniform, independent) scenario?”

Short history

The concept follows classical coding and combinatorial counting: with k equiprobable symbols, the per-symbol cost approaches log2(k) bits. Enumerating all sequences of a given (or bounded) length via counting arguments yields tight upper bounds. IMAX packages these bounds into a practical metric.

Physical meaning

IMAX quantifies the hypothetical “information capacity” of the dataset: the number of fair-coin flips required, in the most demanding case, to pinpoint the dataset among all possibilities defined by its length n and alphabet size k.

Mathematical properties

Setup. Let n be the dataset length and k the number of unique symbols. We write \( I_{\mathrm{IMAX}} \) for the measure in formulas.

\[ n = \lvert x \rvert,\quad k = \lvert \mathrm{alphabet}(x) \rvert. \]

Uniform i.i.d. upper bound (fixed length). Under independent, equiprobable symbols the maximum is:

\[ I_{\mathrm{IMAX}}^{\mathrm{iid}}(n,k) = n \,\log_{2} k. \]
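A minimal sketch of this bound (the function name is illustrative, not from the source):

```python
import math

def imax_iid(n: int, k: int) -> float:
    """Fixed-length i.i.d. upper bound: n * log2(k) bits."""
    return n * math.log2(k)

# 100 symbols from a 4-letter alphabet (e.g., DNA) need at most
# 100 * log2(4) = 200 bits.
print(imax_iid(100, 4))  # 200.0
```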

Geometric-series upper bound (length up to n). Indexing any sequence whose length is at most n over k symbols gives:

\[ I_{\mathrm{IMAX}}^{\mathrm{up\ to}}(n,k) = \log_{2}\!\left(\frac{k^{\,n+1}-1}{k-1}\right). \]
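The count inside the logarithm is the geometric series \( 1 + k + k^{2} + \dots + k^{n} \), which includes the empty sequence. A sketch using exact integer counting to avoid premature overflow (names are illustrative):

```python
import math

def imax_up_to(n: int, k: int) -> float:
    """Bound for indexing any sequence of length at most n over k symbols."""
    if k == 1:
        return math.log2(n + 1)  # unary edge case: only the length varies
    count = (k ** (n + 1) - 1) // (k - 1)  # 1 + k + k^2 + ... + k^n, exact
    return math.log2(count)

# Length up to 3 over a binary alphabet: 1 + 2 + 4 + 8 = 15 sequences.
print(imax_up_to(3, 2))  # log2(15) ≈ 3.9069
```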

Offset from the fixed-length bound. For k ≥ 2, the difference between the two forms is bounded:

\[ 0 \le I_{\mathrm{IMAX}}^{\mathrm{up\ to}}(n,k) - n\log_{2}k \le \log_{2}\!\left(\frac{k}{k-1}\right) \le 1. \]

Unary alphabet edge case. If k = 1, the number of distinct datasets of length at most n is n+1 (choose the length), hence:

\[ I_{\mathrm{IMAX}}(n,1) = \log_{2}(n+1). \]

Monotonicity. IMAX is non-decreasing in both n and k:

\[ n_{1} \le n_{2} \ \Rightarrow\ I_{\mathrm{IMAX}}(n_{1},k) \le I_{\mathrm{IMAX}}(n_{2},k). \]
\[ k_{1} \le k_{2} \ \Rightarrow\ I_{\mathrm{IMAX}}(n,k_{1}) \le I_{\mathrm{IMAX}}(n,k_{2}). \]
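Both monotonicity claims can be spot-checked numerically with the up-to-n form (names are illustrative):

```python
import math

def imax_up_to(n: int, k: int) -> float:
    """Up-to-n geometric-series bound, with the k = 1 edge case."""
    if k == 1:
        return math.log2(n + 1)
    return math.log2((k ** (n + 1) - 1) // (k - 1))

# Non-decreasing in n with k fixed, and in k with n fixed.
vals_n = [imax_up_to(n, 3) for n in range(1, 20)]
assert all(a <= b for a, b in zip(vals_n, vals_n[1:]))
vals_k = [imax_up_to(10, k) for k in range(1, 20)]
assert all(a <= b for a, b in zip(vals_k, vals_k[1:]))
```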

Concatenation (piecewise uniform). For two segments with lengths n1, n2 and alphabet sizes k1, k2:

\[ I_{\mathrm{IMAX}}^{\mathrm{iid}} = n_{1}\log_{2}k_{1} + n_{2}\log_{2}k_{2}. \]
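This extends to any number of segments as \( \sum_i n_i \log_2 k_i \). A minimal sketch (function name is illustrative):

```python
import math

def imax_concat(segments) -> float:
    """Sum of per-segment fixed-length bounds: sum of n_i * log2(k_i)."""
    return sum(n * math.log2(k) for n, k in segments)

# 50 binary symbols followed by 20 hex symbols:
print(imax_concat([(50, 2), (20, 16)]))  # 50*1 + 20*4 = 130.0
```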

Relation to Kolmogorov complexity

Because IMAX enumerates all admissible datasets under simple side conditions (length, alphabet), it upper-bounds plain descriptional complexity up to machine and side-information terms:

\[ K_{U}(x) \le I_{\mathrm{IMAX}}^{\mathrm{up\ to}}(n,k) + L_{\mathrm{side}}(x) + c_{U}. \]

Here \( L_{\mathrm{side}}(x) \) accounts for conveying n, k, the symbol-to-index mapping, and decoding conventions; \( c_{U} \) is a machine-dependent constant.

Limitations and remarks

1) IMAX is intentionally optimistic: it ignores actual symbol frequencies and dependencies, so it can overestimate true compressibility.

2) The geometric-series form differs from the fixed-length form by at most one bit for k ≥ 2, but conceptually counts all lengths up to n.

3) For very large values, using the simpler n · log2(k) improves numerical stability; a practical threshold (e.g., 500 bits) can switch to this path.

4) Pragmatic edge conventions like returning 0 for empty input and 1 for singletons keep the scale intuitive and avoid degeneracies.

Algorithm (pseudocode)

The pseudocode mirrors the provided implementation of IMAX for collections and byte arrays.

function IMAX_value(values):
    if values == null: return 0
    n = length(values)
    if n == 0: return 0
    if n == 1: return 1   // pragmatic base unit

    k = count_unique(values)

    // fast path: fixed-length i.i.d. upper bound
    v = n * log2(k)
    if v > 500:
        return v

    // exact geometric-series bound (length up to n)
    if k == 1:
        return log2(n + 1)
    else:
        return log2( (k^(n + 1) - 1) / (k - 1) )
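A runnable rendering of the pseudocode above, as a sketch in Python (the function name and the use of `set` for uniqueness counting are illustrative choices, not prescribed by the source):

```python
import math

def imax_value(values) -> float:
    """IMAX of a sequence, mirroring the pseudocode above."""
    if values is None:
        return 0.0
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0  # pragmatic base unit

    k = len(set(values))  # number of unique symbols

    # Fast path: fixed-length i.i.d. upper bound, numerically stable
    # for large results (threshold per remark 3 above).
    v = n * math.log2(k)
    if v > 500:
        return v

    # Exact geometric-series bound (all lengths up to n).
    if k == 1:
        return math.log2(n + 1)
    return math.log2((k ** (n + 1) - 1) // (k - 1))
```

Note that the k = 1 case is always reached through the slow path, since the fast-path value n · log2(1) = 0 never exceeds the threshold.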