RLE Information (IRLE)

What is IRLE?

IRLE (RLE Information) quantifies the information content of data by the code length produced by Run-Length Encoding. Instead of modeling full symbol statistics, IRLE focuses on runs (maximal sequences of identical values). Practically, IRLE measures how efficiently the data can be serialized into (value, count) descriptions. In this documentation, the measure is denoted by I with the upright acronym subscript: I_IRLE.

Given an input sequence x, let RLE(x) be its run-length–encoded form (interleaving values and counts). In our implementation, the unnormalized IRLE equals the encoded length (number of output tokens) times the code length per token inferred from the number of distinct tokens appearing in the encoded stream.

\[ I_{\mathrm{IRLE}}(x) \;=\; \bigl| \mathrm{RLE}(x) \bigr| \cdot \log_{2}\!\bigl( \, \lvert \Sigma_{\mathrm{RLE}}(x) \rvert \, \bigr) \]
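As a concrete illustration, the defining formula can be sketched in Python. The helper names here are illustrative, not the library's API, and the sketch computes only the raw value (normalization and the edge-case conventions described later are omitted):

```python
from math import log2

def rle_encode(x):
    """Run-length encode x as an alternating [value, count, value, count, ...] token list."""
    tokens = []
    i = 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1                      # extend the current run
        tokens += [x[i], j - i]         # emit (value, count)
        i = j
    return tokens

def irle(x):
    """Raw I_IRLE: encoded length times log2 of the distinct-token count."""
    tokens = rle_encode(x)
    if not tokens:
        return 0.0
    return len(tokens) * log2(len(set(tokens)))
```

For example, `[0, 0, 0, 1, 1]` encodes to four tokens `[0, 3, 1, 2]` over four distinct token values, giving 4 · log2(4) = 8 bits.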

Short history

RLE is one of the earliest and simplest compression schemes used in practice, popular in bitmap graphics, faxes, and simple archival formats. Its power lies in exploiting long constant runs, making it a natural baseline for structure detection wherever data tends to “clump.”

Physical meaning

IRLE measures the degree of aggregation in the data: the more the signal forms long, homogeneous stretches, the fewer runs are needed, the smaller the encoded sequence, and hence the smaller I_IRLE. Conversely, rapid alternation between symbols increases the number of runs and raises I_IRLE. Thus IRLE captures a coarse notion of temporal/spatial smoothness or piecewise-constancy.

Mathematical properties

Runs and encoded length. Let n = |x| and m(x) be the number of runs in x. With a canonical (value,count) serialization, the encoded length is proportional to the number of runs (up to the concrete representation of counts). In particular, |RLE(x)| grows with m(x) and shrinks as runs merge.

\[ m(x) \;=\; 1 \;+\; \sum_{i=1}^{n-1} \mathbf{1}\!\bigl[x_{i} \neq x_{i+1}\bigr] \]
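The run count translates directly into code; a small hypothetical helper:

```python
def run_count(x):
    """m(x): 1 plus the number of adjacent unequal pairs (0 for empty input)."""
    if not x:
        return 0
    return 1 + sum(a != b for a, b in zip(x, x[1:]))
```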

Token alphabet in the encoded stream. Denote by k_RLE(x) = |Σ_RLE(x)| the number of distinct tokens present in the RLE output (this includes value tokens and whatever count tokens occur under the chosen integer coding). Our implementation uses this k_RLE(x) to set the per-token code length via log2 k_RLE(x).

Normalization used in this library. The raw IIRLE(x) is mapped to a problem-dependent target range between MinInfo and MaxInfo values for the same input size/alphabet, using empirically estimated RLE baselines as the source range.

\[ \widetilde{I}_{\mathrm{IRLE}}(x) \;=\; \mathrm{scaleToRange}\!\bigl( I_{\mathrm{IRLE}}(x),\; I_{\mathrm{IRLE}}^{\min}(x),\; I_{\mathrm{IRLE}}^{\max}(x);\; I_{\min}(x),\; I_{\max}(x) \bigr) \]

Here the source range endpoints are computed as:

\[ I_{\mathrm{IRLE}}^{\min}(x) \;=\; I_{\mathrm{IRLE}}(0^{n}), \qquad I_{\mathrm{IRLE}}^{\max}(x) \;\approx\; I_{\mathrm{IRLE}}(U_{n}) \]

with 0^n the all-zero sequence of length n and U_n a (pseudo)random sequence of the same length (one realization in the current implementation).
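A sketch of how the empirical source-range endpoints and the scaling step might look. The length, alphabet size, seed, and the affine form of scaleToRange are all assumptions for illustration, not the library's exact policy:

```python
import random
from math import log2

def irle_raw(x):
    """Raw I_IRLE of a sequence via its (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

n, k = 64, 4                                   # assumed length and atomic alphabet size
irle_min = irle_raw([0] * n)                   # all-zero floor: one run, two tokens
rng = random.Random(0)                         # one random draw, as in the implementation
irle_max = irle_raw([rng.randrange(k) for _ in range(n)])

def scale_to_range(v, src_lo, src_hi, dst_lo, dst_hi):
    """Assumed affine form of scaleToRange."""
    t = (v - src_lo) / (src_hi - src_lo)
    return dst_lo + t * (dst_hi - dst_lo)
```

Note that the all-zero floor is exactly 2 · log2(2) = 2 bits (two tokens, two distinct token values), regardless of n, as long as n differs from the zero value.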

The target range endpoints (shared across information measures in the library) are:

\[ I_{\min}(x) \;=\; \log_{2}(n \,+\, 1) \]
\[ I_{\max}(n,k) \;=\; \begin{cases} \log_{2}(n \,+\, 1), & k \,=\, 1 \\[4pt] \log_{2}\!\Bigl( \dfrac{k^{\,n+1} \,-\, 1}{k \,-\, 1} \Bigr), & k \,\ge\, 2 \end{cases} \]

Here k is the number of distinct atomic symbols in the original data. These formulas match the provided MinInfo and MaxInfo implementations.
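The target endpoints can be sketched as follows (illustrative function names; exact integer arithmetic is used so the geometric sum does not overflow a float before the logarithm is taken):

```python
from math import log2

def min_info(n):
    """I_min = log2(n + 1)."""
    return log2(n + 1)

def max_info(n, k):
    """I_max = log2((k^(n+1) - 1) / (k - 1)), collapsing to I_min when k = 1."""
    if k == 1:
        return log2(n + 1)
    # (k^(n+1) - 1) is exactly divisible by (k - 1), so integer division is exact
    return log2((k ** (n + 1) - 1) // (k - 1))
```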

Complexity. A single left-to-right pass suffices to form runs and compute I_IRLE, so the time complexity is linear and the working memory is constant (besides the emitted encoding).

\[ T(n) \;=\; \mathcal{O}(n), \qquad S(n) \;=\; \mathcal{O}(1) \]

Concatenation behavior. IRLE is not additive: concatenating x and y may merge the boundary run if the last symbol of x equals the first symbol of y, typically making I_IRLE(xy) smaller than I_IRLE(x) + I_IRLE(y) by a small constant-scale term tied to one run and its count encoding.
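A toy demonstration of the boundary-run merge on the raw value. The example is chosen so that the merged count coincides with an existing token; in general the comparison also depends on how concatenation changes the token alphabet size k_RLE:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

x = [1, 1, 2]        # ends in 2
y = [2, 1, 1]        # starts with 2: the boundary runs merge
merged = irle_raw(x + y)               # 3 runs -> 6 tokens over token alphabet {1, 2}
separate = irle_raw(x) + irle_raw(y)   # 2 + 2 runs -> 8 tokens
```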

Invariance to relabeling. Relabeling atomic values via a bijection leaves run lengths intact; if the RLE token alphabet size k_RLE(x) is unaffected, IRLE is unchanged.
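A quick check of the invariance. The caveat in the statement matters: because value tokens and count tokens share one alphabet, a relabeling that creates or removes collisions with count tokens changes k_RLE and hence the value, so the example below deliberately avoids such collisions:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

x = [5, 5, 6, 6, 6, 5]
relabeled = [{5: 7, 6: 9}[v] for v in x]   # bijection 5 -> 7, 6 -> 9
# Values are chosen to avoid collisions with count tokens, so k_RLE is preserved.
```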

Relation to Kolmogorov complexity

RLE is a valid, effective coding scheme; hence, relative to a fixed universal machine, the Kolmogorov complexity is upper-bounded by the IRLE code length plus an additive constant for the decoder and any side information required to reconstruct the exact RLE format (e.g., count representation, delimiters).

\[ K_{U}(x) \;\le\; I_{\mathrm{IRLE}}(x) \;+\; c_{U} \;+\; L_{\mathrm{side}}(x) \]

Thus IRLE serves as a computable upper-bound surrogate for K. When x is not run-compressible, I_IRLE(x) can substantially exceed K(x); when x has long runs, IRLE may approach K(x) up to a modest overhead.

Limitations and remarks

Pattern selectivity. IRLE reacts only to run structure. Alternating or quasi-periodic data with short runs (e.g., 010101…) will look complex to IRLE even if other compressors exploit regularity.
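This blindness to non-run regularity is easy to see on the raw value: a perfectly regular alternating sequence scores far higher than a blocky sequence over the same symbols, because every run has length 1:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

alternating = [0, 1] * 16            # 0101...: 32 runs of length 1 -> 64 tokens
blocky      = [0] * 16 + [1] * 16    # same symbols, only two runs -> 4 tokens
```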

Count representation. The per-run count coding (fixed-width byte, variable-length integer, etc.) affects |RLE(x)| and k_RLE(x). Our implementation sidesteps format specifics by deriving the per-token length from the number of distinct tokens observed in the actual encoded stream, via log2 k_RLE(x).

Normalization baselines. The source range endpoints I_IRLE^min and I_IRLE^max are computed on an all-zero and a random sequence of the same length, respectively. The upper endpoint uses one random draw, so it is an empirical ceiling, not an expectation.

Edge cases. Empty input yields 0 by convention; single-element inputs return 1 to avoid degeneracy, matching the provided code. When k = 1, MaxInfo collapses to the MinInfo formula.

Pseudocode

function IRLE_value(sequence x):
    n := length(x)
    if n == 0: return 0
    if n == 1: return 1

    # 1) RLE encode as alternating [value, count] tokens
    rle := empty list
    i := 1
    while i ≤ n:
        v := x[i]
        c := 1
        while (i + c ≤ n) and (x[i + c] == v):
            c := c + 1
        append rle with v
        append rle with c
        i := i + c

    # 2) Distinct tokens in encoded stream
    Sigma_rle := set(rle)
    k := size(Sigma_rle)
    L := length(rle)

    # 3) Unnormalized IRLE
    irle := L * log2(k)

    # 4) Empirical source-range endpoints
    zeros := sequence of n zeros
    irle_min := IRLE_on_encoded_stream( RLE(zeros) )  # same formula as above
    random_seq := random_sequence_of_length(n)        # same alphabet policy as implementation
    irle_max := IRLE_on_encoded_stream( RLE(random_seq) )

    # 5) Target-range endpoints (global MinInfo/MaxInfo)
    k_atomic := number_of_distinct_atomic_values_in(x)
    Imin := log2(n + 1)
    if k_atomic == 1:
        Imax := log2(n + 1)
    else:
        Imax := log2( (k_atomic^(n + 1) - 1) / (k_atomic - 1) )

    # 6) Normalize into [Imin, Imax] via affine scaling
    return scaleToRange(irle, irle_min, irle_max, Imin, Imax)
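The pseudocode above can be turned into a runnable Python sketch. Several details are assumptions rather than the library's exact behavior: the random sequence is drawn over the atomic alphabet of x with a fixed seed, and scaleToRange is taken to be a plain affine map:

```python
import random
from math import log2

def _tokens(x):
    """Step 1: alternating (value, count) RLE tokens."""
    out, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        out += [x[i], j - i]
        i = j
    return out

def _raw(tokens):
    """Steps 2-3: encoded length times per-token code length."""
    return len(tokens) * log2(len(set(tokens)))

def irle_value(x, rng=None):
    n = len(x)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    rng = rng or random.Random(0)              # seeded for reproducibility (assumption)
    irle = _raw(_tokens(x))
    # Step 4: empirical source range (random-draw alphabet policy is an assumption)
    alphabet = sorted(set(x))
    irle_min = _raw(_tokens([0] * n))
    irle_max = _raw(_tokens([rng.choice(alphabet) for _ in range(n)]))
    # Step 5: target range from the shared MinInfo / MaxInfo formulas
    k = len(alphabet)
    i_min = log2(n + 1)
    i_max = i_min if k == 1 else log2((k ** (n + 1) - 1) // (k - 1))
    # Step 6: affine scaling (assumed form of scaleToRange)
    if irle_max == irle_min:                   # degenerate source range
        return i_min
    t = (irle - irle_min) / (irle_max - irle_min)
    return i_min + t * (i_max - i_min)
```

Note the result is not clamped: inputs with more run structure than the random baseline can map below I_min, and maximally alternating inputs can map above I_max.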

Additional formulas and useful bounds

Lower bound via runs. Regardless of how counts are coded, the number of encoded tokens is at least the number of runs m(x), and often proportional to it.

\[ \bigl| \mathrm{RLE}(x) \bigr| \;\ge\; m(x) \]

Canonical pair serialization. If the encoder always emits (value, count) pairs, then (ignoring container overhead) |RLE(x)| = 2 m(x). The per-token code length is, as before, log2 k_RLE(x).

\[ I_{\mathrm{IRLE}}(x) \;=\; 2\, m(x)\, \log_{2}\!\bigl( \lvert \Sigma_{\mathrm{RLE}}(x) \rvert \bigr) \quad \text{(idealized pair output)} \]
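A short check of the pair-serialization identity with illustrative helpers:

```python
from math import log2

def rle_tokens(x):
    """Alternating (value, count) tokens: exactly 2 tokens per run."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return tokens

def run_count(x):
    """m(x): 1 plus the number of adjacent unequal pairs (0 for empty input)."""
    return 1 + sum(a != b for a, b in zip(x, x[1:])) if x else 0

x = [4, 4, 5, 5, 5, 4]                 # 3 runs
tokens = rle_tokens(x)                 # 6 tokens = 2 * m(x)
irle = 2 * run_count(x) * log2(len(set(tokens)))
```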

Concatenation effect (qualitative). For sequences x and y, if the last symbol of x equals the first of y, the boundary runs merge, typically making I_IRLE(xy) slightly below I_IRLE(x) + I_IRLE(y).