SCM Information (ISCM)

What is SCM Information?

SCM Information (ISCM) is a composition-aware information measure that searches over possible fixed-size decompositions of data and selects the composition that yields the smallest two-part Shannon-based description length. Practically, ISCM tries chunk sizes, scores each by a model-plus-data code length built from Shannon Information (IS), normalizes these scores to be comparable across chunk sizes, and returns the minimum.

Short history

ISCM is a practical, MDL-inspired extension of classical Shannon-based coding. It integrates a two-part description (a simple model of recurring chunks plus the sequence encoded with that model) and searches over chunk sizes to capture periodicity, motifs, and compositional structure with a computable score.

Physical meaning

ISCM quantifies how efficiently a dataset can be expressed by repeating building blocks of a fixed size. If the data is well explained by a short recipe (few distinct chunks) and a simple arrangement, ISCM becomes small; if the data behaves randomly, with no reusable chunks, ISCM remains large. ISCM thus reflects how efficiently the data's structural organization can be described.

Mathematical definition

Let the data length be denoted by N and the number of distinct atomic symbols by K. For a chunk size r, let m be the number of non-overlapping r-length chunks (the implementation uses integer division) and let the set of distinct r-length chunks be Ur. Define a two-part score as the sum of (i) the model cost over distinct chunks and (ii) the data cost of the chunk sequence itself, both measured by Shannon Information (IS). A normalization aligns scores across different r.

\[ m \;=\; \left\lfloor \frac{N}{r} \right\rfloor \]
\[ I_{\mathrm{model}}(r,x)\;=\;\sum_{u\in\mathcal{U}_r} I_{\mathrm{S}}(u) \quad\text{and}\quad I_{\mathrm{data}}(r,x)\;=\;I_{\mathrm{S}}(\text{sequence of }m\text{ chunks}) \]
\[ I_{\max}(N,K,r)\;=\; m\Big( r\,\log_2\!\big(\min(r,K)\big)\;+\;\log_2\!\big(\min\!\big(m, K^{\,r}\big)\big) \Big) \]
\[ I_{\max}^{(1)}(N,K)\;=\;I_{\max}(N,K,1) \]
\[ I_{\mathrm{SCM}}(x)\;=\;\min_{\,1\le r\le \left\lfloor N/2\right\rfloor}\; \frac{I_{\mathrm{model}}(r,x)+I_{\mathrm{data}}(r,x)}{I_{\max}(N,K,r)}\;\cdot\; I_{\max}^{(1)}(N,K) \]

In this definition, Imax(N,K,r) serves as a per-r scale, and Imax(1)(N,K) sets a common target scale so scores for different r become comparable. The implementation takes the minimum of these normalized scores over r.
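
For a small worked example (writing IS for the total self-information of a sequence under its empirical symbol frequencies), take x = ababab, so N = 6 and K = 2. At r = 2 there are m = 3 chunks and a single distinct chunk ab: the model cost is IS(ab) = 2 bits and the sequence of three identical chunks costs nothing.

\[ I_{\max}(6,2,2)\;=\;3\Big(2\log_2 2+\log_2\min(3,4)\Big)\;\approx\;10.75, \qquad I_{\max}^{(1)}(6,2)\;=\;6\log_2\min(6,2)\;=\;6 \]
\[ \frac{2+0}{10.75}\cdot 6\;\approx\;1.12\;<\;6\;=\;I_{\mathrm{S}}(x)\ \text{(the }r{=}1\text{ score)} \]

so the minimum selects r = 2.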

Mathematical properties

Non-negativity and bounds. ISCM is non-negative and bounded above by the r=1 scale: the r=1 term of the minimum equals IS(x) on the common scale, and IS(x) never exceeds N log2 K, which is exactly Imax(1)(N,K) since K ≤ N.

\[ 0\;\le\; I_{\mathrm{SCM}}(x)\;\le\; I_{\max}^{(1)}(N,K) \]

Dominance by the r=1 score. Because the minimum includes r=1, ISCM never exceeds the plain Shannon-based score at r=1 (on the common scale).

\[ I_{\mathrm{SCM}}(x)\;\le\; I_{\mathrm{S}}(\text{atoms-only representation}) \]

Relabeling invariance. ISCM is invariant under permutations of atomic symbols because IS depends only on empirical frequencies, not on labels.
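
This can be spot-checked numerically (a hypothetical test using the ISCM function from the Python sketch below, not part of the reference implementation):

x = list("aababba")
y = ['b' if s == 'a' else 'a' for s in x]     # swap the two atomic labels
assert abs(ISCM(x) - ISCM(y)) < 1e-9          # same score after relabeling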

Sensitivity to periodicity and motifs. For perfectly r-periodic data with one repeating r-length chunk, the sequence part is cost-free and ISCM is controlled by the model cost of a single chunk.

\[ x=\underbrace{c\,c\,\cdots\,c}_{m\ \text{times}},\ \ |c|=r \quad\Longrightarrow\quad I_{\mathrm{SCM}}(x)\;\le\; I_{\mathrm{S}}(c) \]

Random-like behavior. For i.i.d. data with near-uniform frequencies, reusable chunks do not reduce description length; the minimizing r typically equals 1 and ISCM approaches the atoms-only score on the same scale.

Concatenation is not generally additive. ISCM is a minimum over r and can violate additivity and subadditivity across concatenation because the optimal r for a concatenation may differ from those of its parts.

Replication bound. Repeating a string t times cannot increase ISCM by more than a factor t (achieved when repetitions do not create new compressible structure).

\[ I_{\mathrm{SCM}}(x^{t})\;\le\; t\cdot I_{\mathrm{SCM}}(x) \]

Relation to Kolmogorov complexity

ISCM is a computable, two-part, Shannon-based code length. As with general compression-based bounds, ISCM upper-bounds prefix Kolmogorov complexity up to a machine-dependent constant and the side information needed to reconstruct the chosen composition and coding details.

\[ K_{U}(x)\;\le\; I_{\mathrm{SCM}}(x)\;+\;c_{U}\;+\;L_{\mathrm{side}}(x) \]

Here cU is a universal-machine constant and Lside(x) includes the chunk size, the set of distinct chunks, and any auxiliary coding headers necessary to reproduce the same construction.

Limitations and implementation remarks

Chunking boundary. The reference implementation uses m = floor(N / r), discarding any tail shorter than r. This may bias results for small N or when the tail carries structure. Alternatives include padding or allowing a final shorter chunk.

\[ m \;=\; \left\lfloor \frac{N}{r} \right\rfloor \]
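
A minimal sketch of the chunking step (keep_tail is a hypothetical flag illustrating the shorter-final-chunk alternative; the reference behavior corresponds to keep_tail=False):

def break_apart(values, r, keep_tail=False):
    # Reference behavior: m = floor(N / r) non-overlapping chunks; the tail is dropped.
    m = len(values) // r
    parts = [tuple(values[i * r:(i + 1) * r]) for i in range(m)]
    if keep_tail and len(values) % r:
        parts.append(tuple(values[m * r:]))   # alternative: keep a final shorter chunk
    return parts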

Normalizer. The per-r upper bound Imax(N,K,r) equalizes scales across r by accounting for the capacity of the r-chunk alphabet and the chunk sequence. The global absolute scale Imax(1)(N,K) makes different r directly comparable.

External normalization. The provided InfoNormalizer optionally rescales ISCM into an instance-specific range determined by a homogeneous baseline (“MinInfo”) and a randomized baseline (“MaxInfo”). Conceptually this is an affine map from the raw domain to a task-specific codomain:

\[ \mathrm{scale}(v)\;=\; \min\!\big(\mathrm{newMax},\ \mathrm{newMin} \;+\; \tfrac{v-\mathrm{origMin}}{\mathrm{origMax}-\mathrm{origMin}}\cdot(\mathrm{newMax}-\mathrm{newMin})\big) \]
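
A minimal sketch of this map (argument names follow the formula; whether the lower end is also clamped is implementation-specific):

def scale(v, orig_min, orig_max, new_min, new_max):
    # Affine map from [orig_min, orig_max] onto [new_min, new_max],
    # clamped at new_max as in the formula above.
    t = (v - orig_min) / (orig_max - orig_min)
    return min(new_max, new_min + t * (new_max - new_min))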

Sample size. For reliability, a minimum estimation size (e.g., 500) is recommended because small samples yield high-variance empirical frequencies.

Reference sketch (Python)

The following runnable Python sketch mirrors the reference implementation, including the per-r normalization and global scaling. The IS helper spells out one assumption: the Shannon Information of a sequence is taken to be its total self-information under the empirical symbol frequencies.

from collections import Counter
from math import log2


def IS(seq):
    # Shannon Information of a sequence: total bits under the empirical
    # symbol frequencies, IS(x) = -sum_s n_s * log2(n_s / n).
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum(c * log2(c / n) for c in counts.values())


def I_max(N, K, r):
    # Per-r upper bound: m chunks, chunk content drawn from an alphabet of
    # effective size min(r, K), the chunk sequence from at most min(m, K^r)
    # distinct chunks.
    m = N // r
    term1 = r * log2(min(r, K))            # capacity of chunk content
    term2 = log2(min(m, K ** r))           # capacity of the chunk sequence
    return m * (term1 + term2)


def ISCM(values):
    if len(values) <= 1:
        return 0.0

    N = len(values)
    K = len(set(values))                   # number of distinct atomic symbols

    if K == 1:
        return 0.0

    absolute_max = I_max(N, K, 1)          # common target scale (r = 1)
    best = float("inf")

    for r in range(1, N // 2 + 1):
        # Non-overlapping r-length chunks; a tail shorter than r is discarded.
        m = N // r
        parts = [tuple(values[i * r:(i + 1) * r]) for i in range(m)]

        # Two-part cost: model (distinct chunk contents) plus data (chunk sequence).
        model_cost = sum(IS(u) for u in set(parts))
        data_cost = IS(parts)

        # Normalize onto the common r = 1 scale via the per-r upper bound.
        max_r = I_max(N, K, r)
        score = 0.0 if max_r == 0 else (model_cost + data_cost) / max_r * absolute_max

        best = min(best, score)

    return best
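
As a quick sanity check (a hypothetical demo; exact values depend on the IS convention above), periodic data should score far below i.i.d. data of the same length and alphabet:

import random

periodic = list("ab" * 250)                        # perfectly 2-periodic, N = 500
random.seed(0)
noisy = [random.choice("ab") for _ in range(500)]  # i.i.d. over the same alphabet

print(ISCM(periodic))   # small: one reusable 2-chunk and a free chunk sequence
print(ISCM(noisy))      # large: typically near the atoms-only (r = 1) score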