RLE Information (IRLE)

What is IRLE?

IRLE (RLE Information) quantifies the information content of data by the code length produced by Run-Length Encoding. Instead of modeling full symbol statistics, IRLE focuses on runs (maximal sequences of identical values). Practically, IRLE measures how efficiently the data can be serialized into (value, count) descriptions. In this documentation, the measure is denoted by I with the upright acronym subscript: I_IRLE.

Given an input sequence x, let RLE(x) be its run-length–encoded form (interleaving values and counts). In our implementation, the unnormalized IRLE equals the encoded length (number of output tokens) times the code length per token inferred from the number of distinct tokens appearing in the encoded stream.

\[ I_{\mathrm{IRLE}}(x) \;=\; \bigl| \mathrm{RLE}(x) \bigr| \cdot \log_{2}\!\bigl( \, \lvert \Sigma_{\mathrm{RLE}}(x) \rvert \, \bigr) \]
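As a concrete illustration, the defining formula can be sketched in Python. The helper names here are illustrative, not the library's API, and the sketch computes only the raw value (normalization and the edge-case conventions described later are omitted):

```python
from math import log2

def rle_encode(x):
    """Run-length encode x as an alternating [value, count, value, count, ...] token list."""
    tokens = []
    i = 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1                      # extend the current run
        tokens += [x[i], j - i]         # emit (value, count)
        i = j
    return tokens

def irle(x):
    """Raw I_IRLE: encoded length times log2 of the distinct-token count."""
    tokens = rle_encode(x)
    if not tokens:
        return 0.0
    return len(tokens) * log2(len(set(tokens)))
```

For example, `[0, 0, 0, 1, 1]` encodes to four tokens `[0, 3, 1, 2]` over four distinct token values, giving 4 · log2(4) = 8 bits.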

Short history

RLE is one of the earliest and simplest compression schemes used in practice, popular in bitmap graphics, faxes, and simple archival formats. Its power lies in exploiting long constant runs, making it a natural baseline for structure detection wherever data tends to “clump.”

Physical meaning

IRLE measures the degree of aggregation in the data: the more the signal forms long, homogeneous stretches, the fewer runs are needed, the smaller the encoded sequence, and hence the smaller I_IRLE. Conversely, rapid alternation between symbols increases the number of runs and raises I_IRLE. Thus IRLE captures a coarse notion of temporal/spatial smoothness or piecewise-constancy.

Mathematical properties

Runs and encoded length. Let n = |x| and m(x) be the number of runs in x. With a canonical (value,count) serialization, the encoded length is proportional to the number of runs (up to the concrete representation of counts). In particular, |RLE(x)| grows with m(x) and shrinks as runs merge.

\[ m(x) \;=\; 1 \;+\; \sum_{i=1}^{n-1} \mathbf{1}\!\bigl[x_{i} \neq x_{i+1}\bigr] \]
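The run count translates directly into code; a small hypothetical helper:

```python
def run_count(x):
    """m(x): 1 plus the number of adjacent unequal pairs (0 for empty input)."""
    if not x:
        return 0
    return 1 + sum(a != b for a, b in zip(x, x[1:]))
```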

Token alphabet in the encoded stream. Denote by k_RLE(x) = |Σ_RLE(x)| the number of distinct tokens present in the RLE output (this includes value tokens and whatever count tokens occur under the chosen integer coding). Our implementation uses this k_RLE(x) to set the per-token code length via log2 k_RLE(x).

Normalization used in this library. The raw IIRLE(x) is mapped to a problem-dependent target range between MinInfo and MaxInfo values for the same input size/alphabet, using empirically estimated RLE baselines as the source range.

\[ \widetilde{I}_{\mathrm{IRLE}}(x) \;=\; \mathrm{scaleToRange}\!\bigl( I_{\mathrm{IRLE}}(x),\; I_{\mathrm{IRLE}}^{\min}(x),\; I_{\mathrm{IRLE}}^{\max}(x);\; I_{\min}(x),\; I_{\max}(x) \bigr) \]

Here the source range endpoints are computed as:

\[ I_{\mathrm{IRLE}}^{\min}(x) \;=\; I_{\mathrm{IRLE}}(0^{n}), \qquad I_{\mathrm{IRLE}}^{\max}(x) \;\approx\; I_{\mathrm{IRLE}}(U_{n}) \]

with 0^n the all-zero sequence of length n and U_n a (pseudo)random sequence of the same length (one realization in the current implementation).
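A sketch of how the empirical source-range endpoints and the scaling step might look. The length, alphabet size, seed, and the affine form of scaleToRange are all assumptions for illustration, not the library's exact policy:

```python
import random
from math import log2

def irle_raw(x):
    """Raw I_IRLE of a sequence via its (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

n, k = 64, 4                                   # assumed length and atomic alphabet size
irle_min = irle_raw([0] * n)                   # all-zero floor: one run, two tokens
rng = random.Random(0)                         # one random draw, as in the implementation
irle_max = irle_raw([rng.randrange(k) for _ in range(n)])

def scale_to_range(v, src_lo, src_hi, dst_lo, dst_hi):
    """Assumed affine form of scaleToRange."""
    t = (v - src_lo) / (src_hi - src_lo)
    return dst_lo + t * (dst_hi - dst_lo)
```

Note that the all-zero floor is exactly 2 · log2(2) = 2 bits (two tokens, two distinct token values), regardless of n, as long as n differs from the zero value.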

The target range endpoints (shared across information measures in the library) are:

\[ I_{\min}(x) \;=\; \log_{2}(n \,+\, 1) \]
\[ I_{\max}(n,k) \;=\; \begin{cases} \log_{2}(n \,+\, 1), & k \,=\, 1 \\[4pt] \log_{2}\!\Bigl( \dfrac{k^{\,n+1} \,-\, 1}{k \,-\, 1} \Bigr), & k \,\ge\, 2 \end{cases} \]

Here k is the number of distinct atomic symbols in the original data. These formulas match the provided MinInfo and MaxInfo implementations.
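The target endpoints can be sketched as follows (illustrative function names; exact integer arithmetic is used so the geometric sum does not overflow a float before the logarithm is taken):

```python
from math import log2

def min_info(n):
    """I_min = log2(n + 1)."""
    return log2(n + 1)

def max_info(n, k):
    """I_max = log2((k^(n+1) - 1) / (k - 1)), collapsing to I_min when k = 1."""
    if k == 1:
        return log2(n + 1)
    # (k^(n+1) - 1) is exactly divisible by (k - 1), so integer division is exact
    return log2((k ** (n + 1) - 1) // (k - 1))
```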

Complexity. A single left-to-right pass suffices to form runs and compute I_IRLE, so the time complexity is linear and the working memory is constant (besides the emitted encoding).

\[ T(n) \;=\; \mathcal{O}(n), \qquad S(n) \;=\; \mathcal{O}(1) \]

Concatenation behavior. IRLE is not additive: concatenating x and y may merge the boundary run if the last symbol of x equals the first symbol of y, typically making I_IRLE(xy) smaller than I_IRLE(x) + I_IRLE(y) by a small constant-scale term tied to one run and its count encoding.
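A toy demonstration of the boundary-run merge on the raw value. The example is chosen so that the merged count coincides with an existing token; in general the comparison also depends on how concatenation changes the token alphabet size k_RLE:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

x = [1, 1, 2]        # ends in 2
y = [2, 1, 1]        # starts with 2: the boundary runs merge
merged = irle_raw(x + y)               # 3 runs -> 6 tokens over token alphabet {1, 2}
separate = irle_raw(x) + irle_raw(y)   # 2 + 2 runs -> 8 tokens
```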

Invariance to relabeling. Relabeling atomic values via a bijection leaves run lengths intact; if the RLE token alphabet size k_RLE(x) is unaffected, IRLE is unchanged.
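A quick check of the invariance. The caveat in the statement matters: because value tokens and count tokens share one alphabet, a relabeling that creates or removes collisions with count tokens changes k_RLE and hence the value, so the example below deliberately avoids such collisions:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

x = [5, 5, 6, 6, 6, 5]
relabeled = [{5: 7, 6: 9}[v] for v in x]   # bijection 5 -> 7, 6 -> 9
# Values are chosen to avoid collisions with count tokens, so k_RLE is preserved.
```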

Relation to Kolmogorov complexity

RLE is a valid, effective coding scheme; hence, relative to a fixed universal machine, the Kolmogorov complexity is upper-bounded by the IRLE code length plus an additive constant for the decoder and any side information required to reconstruct the exact RLE format (e.g., count representation, delimiters).

\[ K_{U}(x) \;\le\; I_{\mathrm{IRLE}}(x) \;+\; c_{U} \;+\; L_{\mathrm{side}}(x) \]

Thus IRLE serves as a computable upper-bound surrogate for K. When x is not run-compressible, I_IRLE(x) can substantially exceed K(x); when x has long runs, IRLE may approach K(x) up to a modest overhead.

Limitations and remarks

Pattern selectivity. IRLE reacts only to run structure. Alternating or quasi-periodic data with short runs (e.g., 010101…) will look complex to IRLE even if other compressors exploit regularity.
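This blindness to non-run regularity is easy to see on the raw value: a perfectly regular alternating sequence scores far higher than a blocky sequence over the same symbols, because every run has length 1:

```python
from math import log2

def irle_raw(x):
    """Raw I_IRLE via the (value, count) token stream."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return len(tokens) * log2(len(set(tokens)))

alternating = [0, 1] * 16            # 0101...: 32 runs of length 1 -> 64 tokens
blocky      = [0] * 16 + [1] * 16    # same symbols, only two runs -> 4 tokens
```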

Count representation. The per-run count coding (fixed-width byte, variable-length integer, etc.) affects |RLE(x)| and k_RLE(x). Our implementation sidesteps format specifics by deriving the per-token length from the number of distinct tokens observed in the actual encoded stream, via log2 k_RLE(x).

Normalization baselines. The source range endpoints I_IRLE^min and I_IRLE^max are computed on an all-zero and a random sequence of the same length, respectively. The upper endpoint uses one random draw, so it is an empirical ceiling, not an expectation.

Edge cases. Empty input yields 0 by convention; single-element inputs return 1 to avoid degeneracy, matching the provided code. When k = 1, MaxInfo collapses to the MinInfo formula.

Pseudocode

function IRLE_value(sequence x):
    n := length(x)
    if n == 0: return 0
    if n == 1: return 1

    # 1) RLE encode as alternating [value, count] tokens
    rle := empty list
    i := 1
    while i ≤ n:
        v := x[i]
        c := 1
        while (i + c ≤ n) and (x[i + c] == v):
            c := c + 1
        append rle with v
        append rle with c
        i := i + c

    # 2) Distinct tokens in encoded stream
    Sigma_rle := set(rle)
    k := size(Sigma_rle)
    L := length(rle)

    # 3) Unnormalized IRLE
    irle := L * log2(k)

    # 4) Empirical source-range endpoints
    zeros := sequence of n zeros
    irle_min := IRLE_on_encoded_stream( RLE(zeros) )  # same formula as above
    random_seq := random_sequence_of_length(n)        # same alphabet policy as implementation
    irle_max := IRLE_on_encoded_stream( RLE(random_seq) )

    # 5) Target-range endpoints (global MinInfo/MaxInfo)
    k_atomic := number_of_distinct_atomic_values_in(x)
    Imin := log2(n + 1)
    if k_atomic == 1:
        Imax := log2(n + 1)
    else:
        Imax := log2( (k_atomic^(n + 1) - 1) / (k_atomic - 1) )

    # 6) Normalize into [Imin, Imax] via affine scaling
    return scaleToRange(irle, irle_min, irle_max, Imin, Imax)
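The pseudocode above can be turned into a runnable Python sketch. Several details are assumptions rather than the library's exact behavior: the random sequence is drawn over the atomic alphabet of x with a fixed seed, and scaleToRange is taken to be a plain affine map:

```python
import random
from math import log2

def _tokens(x):
    """Step 1: alternating (value, count) RLE tokens."""
    out, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        out += [x[i], j - i]
        i = j
    return out

def _raw(tokens):
    """Steps 2-3: encoded length times per-token code length."""
    return len(tokens) * log2(len(set(tokens)))

def irle_value(x, rng=None):
    n = len(x)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    rng = rng or random.Random(0)              # seeded for reproducibility (assumption)
    irle = _raw(_tokens(x))
    # Step 4: empirical source range (random-draw alphabet policy is an assumption)
    alphabet = sorted(set(x))
    irle_min = _raw(_tokens([0] * n))
    irle_max = _raw(_tokens([rng.choice(alphabet) for _ in range(n)]))
    # Step 5: target range from the shared MinInfo / MaxInfo formulas
    k = len(alphabet)
    i_min = log2(n + 1)
    i_max = i_min if k == 1 else log2((k ** (n + 1) - 1) // (k - 1))
    # Step 6: affine scaling (assumed form of scaleToRange)
    if irle_max == irle_min:                   # degenerate source range
        return i_min
    t = (irle - irle_min) / (irle_max - irle_min)
    return i_min + t * (i_max - i_min)
```

Note the result is not clamped: inputs with more run structure than the random baseline can map below I_min, and maximally alternating inputs can map above I_max.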

Additional formulas and useful bounds

Lower bound via runs. Regardless of how counts are coded, the number of encoded tokens is at least the number of runs m(x), and often proportional to it.

\[ \bigl| \mathrm{RLE}(x) \bigr| \;\ge\; m(x) \]

Canonical pair serialization. If the encoder always emits (value, count) pairs, then (ignoring container overhead) |RLE(x)| = 2 m(x). The per-token code length is, as before, log2 k_RLE(x).

\[ I_{\mathrm{IRLE}}(x) \;=\; 2\, m(x)\, \log_{2}\!\bigl( \lvert \Sigma_{\mathrm{RLE}}(x) \rvert \bigr) \quad \text{(idealized pair output)} \]
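A short check of the pair-serialization identity with illustrative helpers:

```python
from math import log2

def rle_tokens(x):
    """Alternating (value, count) tokens: exactly 2 tokens per run."""
    tokens, i = [], 0
    while i < len(x):
        j = i
        while j < len(x) and x[j] == x[i]:
            j += 1
        tokens += [x[i], j - i]
        i = j
    return tokens

def run_count(x):
    """m(x): 1 plus the number of adjacent unequal pairs (0 for empty input)."""
    return 1 + sum(a != b for a, b in zip(x, x[1:])) if x else 0

x = [4, 4, 5, 5, 5, 4]                 # 3 runs
tokens = rle_tokens(x)                 # 6 tokens = 2 * m(x)
irle = 2 * run_count(x) * log2(len(set(tokens)))
```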

Concatenation effect (qualitative). For sequences x and y, if the last symbol of x equals the first of y, the boundary runs merge, typically making I_IRLE(xy) slightly below I_IRLE(x) + I_IRLE(y).