Author: Charles Dana · Monce SAS · Co-authored with Claude
I(X;Y) = Σ P(x,y) · log₂[ P(x,y) / (P(x) · P(y)) ], the Shannon signature that decides which features matter.
SAT-based explainable multiclass classifier. Zero dependencies. Pure Python.
Snake's constructor now grows its own feature space from the data you give it
— and keeps it fully explainable. It auto-recognizes columns with expansion
potential, derives many candidate numeric columns from them, and lets aggressive
Shannon mutual information keep only the informative few. Dense text becomes
native TF-IDF columns; continuous numerics gain Gaussian-position KPIs. Every
derived column is an ordinary "N" column with a human-readable name like
damaging_mutations::tfidf(TP53) that surfaces verbatim in get_audit().
Proven on the honest protocol. On the seed-42 20/80 drug-sensitivity split (6 drugs, 64 layers, TF-IDF fit on train only, test predicted once), native zero-dependency domain extension lifts avg R² from 0.144 (raw) to 0.200 — beating the published TF-IDF result (0.171) natively, with no sklearn or pandas in the loop. The win comes from flooding wide then cutting hard: keep the top-10 derived columns by MI and you beat the paper's fixed-20 (0.217 vs 0.198 on the sklearn oracle). Less is more — the signal is concentrated.
Per-drug, at a leaner budget. The lift is not a pooled artifact — it holds drug
by drug. On the 80/20 split at just 10 layers (workers=10, IQR-mean regression,
abstaining rows backstopped by the train mean), expand="auto" beats expand=False
on all six drugs, lifting average R² from 0.118 → 0.218 (+0.100):
| Drug | expand=False R² | expand="auto" R² | Δ | Kept derived columns |
|---|---|---|---|---|
| dabrafenib | 0.2245 | 0.2441 | +0.0195 | damaging_mutations, amplified_genes |
| elephantin | −0.0371 | 0.0348 | +0.0719 | damaging_mutations, protein_changes, amplified_genes |
| nutlin3a | −0.0003 | 0.2220 | +0.2223 | damaging_mutations, amplified_genes |
| olaparib | 0.2867 | 0.3514 | +0.0647 | damaging_mutations, amplified_genes |
| plx4720 | −0.0063 | 0.1997 | +0.2061 | damaging_mutations, amplified_genes |
| trametinib | 0.2390 | 0.2565 | +0.0175 | damaging_mutations, amplified_genes |
| Average | 0.1178 | 0.2181 | +0.1003 | TOKENSET on every drug |
The leaner the raw model, the more expansion earns: at 10 layers three drugs
(elephantin, nutlin3a, plx4720) scored a negative R² without it, meaning worse than
predicting the mean. TOKENSET lifted all three above zero and turned two of them
(nutlin3a, plx4720) into genuinely useful models. Only text families fired: this
dataset has no continuous numeric features besides the target, so GAUSSIAN stayed
dormant, exactly as the MI gate should decide. Reproduce with
_drug_expand_test.py.
The invariant that makes it safe: raw source columns are ALWAYS kept. The MI
gate prunes only derived columns, so expansion can only ever add signal on top
of your original features — never erase information. Improvement is on-average, not
guaranteed per-dataset (derived columns still affect bucket routing), but the floor
is your raw model. Turn it off with expand=False for byte-exact v5.4.8 behavior.
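The invariant can be sketched in a few lines. This is an illustration only: the function name `mi_gate` and the MI scores are hypothetical, and it assumes the gate receives one precomputed MI score per derived column.

```python
def mi_gate(raw_columns, derived_mi, keep=10):
    """Keep every raw column; keep only the top-`keep` derived columns by MI.

    raw_columns: original column names (never pruned).
    derived_mi:  dict mapping derived column name -> mutual information score.
    """
    # Rank derived columns by MI, highest first; ties broken by name for determinism.
    ranked = sorted(derived_mi, key=lambda name: (-derived_mi[name], name))
    return list(raw_columns) + ranked[:keep]


raw = ["damaging_mutations", "amplified_genes"]
derived = {
    "damaging_mutations::tfidf(TP53)": 0.21,  # made-up scores for illustration
    "damaging_mutations::tfidf(KRAS)": 0.05,
    "amplified_genes::tfidf(MYC)": 0.12,
}
kept = mi_gate(raw, derived, keep=2)
print(kept)
```

Whatever the budget, the raw columns come back first and untouched; only the derived tail shrinks.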
from algorithmeai import Snake
# Domain extension is automatic — Snake detects dense text + continuous numerics
model = Snake(df, target_index="ln_ic50", n_layers=64, expand="auto") # default
# Derived columns trace to named recipes, queryable like any column
[h for h in model.header if "::" in h]
# ['damaging_mutations::tfidf(TP53)', 'damaging_mutations::tfidf(KRAS)',
# 'age::gauss_z', 'age::gauss_density', 'age::gauss_cdf', ...]
model.get_audit(X) # the derived names appear verbatim in the reasoning
model.to_json("m.json") # 10 top-level keys unchanged, so a v5.4.8 loader still reads it

For each continuous numeric column, Snake fits μ, σ on train and derives three
position KPIs, all pure math with zero dependencies:
- col::gauss_z: the standardized deviation (x − μ) / σ
- col::gauss_density: exp(−½z²), a centrality signal (1.0 at the mean, decaying into the tails) that a single raw threshold cannot express
- col::gauss_cdf: Φ(z) via math.erf, the value's percentile rank in the fitted normal
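The three KPIs reduce to a handful of standard-library calls. The sketch below is not Snake's code: the function name is hypothetical, and the use of the population standard deviation for σ is an assumption.

```python
import math
import statistics


def gaussian_kpis(train_values, x):
    """Return (z, density, cdf) for x under a normal fitted on train_values."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)  # assumption: population std
    z = (x - mu) / sigma
    density = math.exp(-0.5 * z * z)                  # 1.0 at the mean, decays in the tails
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z) via math.erf
    return z, density, cdf


ages = [30.0, 40.0, 50.0, 60.0, 70.0]
z, density, cdf = gaussian_kpis(ages, 50.0)  # 50.0 is exactly the train mean
print(z, density, cdf)
```

At the mean, z is 0, the density signal peaks at 1.0, and the CDF sits at the 50th percentile; values on either side of the mean share a density but differ in z and CDF, which is why the three columns carry different information.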
Two families weren't enough. Point Snake at a number-theory regression —
predict Ω(n), the count of prime factors of n with multiplicity, from cheap
surface features (digit stats, small-modulus residues n mod 2, 3, 5, …) — and
v5.5.0 domain extension hurt: expansion delta went negative. The signal is
real (Erdős–Kac) but it lives almost entirely in divisibility: n mod 2 == 0
already guarantees a factor of 2. A residue column's meaning is membership,
not magnitude — so routing it through GAUSSIAN's position KPIs (z / density / cdf) was pure noise, and it crowded the raw signal.
v5.5.1 adds the CATEGORICAL family and the routing rule that makes it fire:
- Cardinality is the router. A column (numeric or text) with 2 ≤ n_unique ≤ 100 is categorical, not continuous. Snake one-hots each recurrent value into a col==v 0/1 column: an exact-match bit it reads with a single literal (no stochastic threshold to miss). Examples: mod2==0, last_digit==0.
- Flood-filter parity with TF-IDF. A value must recur (count ≥ 2, the same min_df floor TOKENSET uses) to become a candidate, because one-hotting singletons would memorize row IDs. Flood every recurrent value, then Shannon MI keeps the global top-10 across all sources (booleans need no binning: MI is an O(n) two-bucket count, and it ranks the divisibility bits in exact order).
- GAUSSIAN now self-gates on normality. It only fires on a high-cardinality numeric column whose empirical CDF agrees with the fitted normal to ≥ 0.95 (a KS-style score; N(0,1) → 0.985, U(0,1) → 0.940). Non-normal columns get no position KPIs, which is the other half of "stop adding trash."
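The routing rules above can be sketched as follows. Everything here is illustrative: the function names are hypothetical, and the normality score is assumed to be 1 minus the Kolmogorov-Smirnov distance between the empirical CDF and the fitted normal. The README only calls it "KS-style", so the real definition may differ.

```python
import statistics

MAX_CARD = 100  # mirrors EXPAND_MAX_CARD
MIN_FIT = 0.95  # mirrors EXPAND_GAUSS_MIN_FIT


def normal_fit_score(values):
    """Assumed score: 1 - KS distance between the ECDF and the fitted normal."""
    xs = sorted(values)
    n = len(xs)
    fitted = statistics.NormalDist(statistics.fmean(xs), statistics.pstdev(xs))
    worst = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        worst = max(worst, abs(f - i / n), abs(f - (i + 1) / n))
    return 1.0 - worst


def route(values):
    """Return 'CATEGORICAL', 'GAUSSIAN', or None for one column's training values."""
    n_unique = len(set(values))
    if 2 <= n_unique <= MAX_CARD:
        return "CATEGORICAL"
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)
    if n_unique > MAX_CARD and numeric and normal_fit_score(values) >= MIN_FIT:
        return "GAUSSIAN"
    return None


def recurrent_values(values, min_count=2):
    """Values seen at least min_count times: the one-hot candidates."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sorted(v for v, c in counts.items() if c >= min_count)


residues = [n % 2 for n in range(1000)]                 # membership column
uniform = [(i + 0.5) / 1000 for i in range(1000)]       # flat, high-cardinality
bell = [statistics.NormalDist().inv_cdf(p) for p in uniform]  # normal quantile grid
print(route(residues), route(uniform), route(bell))
```

Under this assumed score, an evenly spaced uniform grid lands just under the 0.95 gate and gets no family at all, while a residue column is routed to CATEGORICAL before normality is ever checked.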
The result on the prime-factor benchmark (benchmark_categorical.py,
4000 train / 1000 test, 15 layers, seed 42, identical features to RF/GB):
| model | R² | MAE |
|---|---|---|
| baseline (predict mean) | −0.003 | — |
| Snake expand=off | 0.345 | 1.113 |
| Snake expand="auto" (v5.5.1) | 0.388 | 1.074 |
| RandomForest | 0.488 | 1.015 |
| GradientBoosting | 0.530 | 0.967 |
Expansion delta flips −0.006 → +0.043: CATEGORICAL turns the family that
hurt into one that helps, on the exact problem that exposed the gap. Snake
still trails gradient boosting here — smooth additive numeric regression is GB's
home turf, and Snake's lookalike vote can't compose mod2 ∧ mod4 ∧ … the way
boosting stages do — but the lesson generalizes past math: expansion helps when
the family matches the column's structure (membership) rather than its statistics
(position). The same rule lifts the drug-sensitivity benchmark too (sex,
age_category, subtype now one-hot instead of being ignored).
Domain extension didn't bolt anything onto the engine — it rides the four pillars Snake already had: the parallel batch architecture scores the wider feature space across every core, pandas flows in and out duck-typed, the audit names each derived column, and Shannon MI is the filter that keeps the flood honest.
And it ships under Charles's three priorities, in order — compatibility first, accountability second, explainability third. v5.4.8 models load intact, perfect-fit on train is preserved across the version boundary, and 336 tests pass (+35 new) with zero regressions.
Knobs (all overridable on the constructor):
- expand="auto" | False | {col: family}
- EXPAND_TOP_K=100 (flood width)
- EXPAND_MI_KEEP=10 (aggressive cut, now global per family: ≤10 CATEGORICAL + ≤10 TOKENSET, MI competes across all sources)
- EXPAND_MAX_CARD=100 (CATEGORICAL cardinality trigger)
- EXPAND_GAUSS_MIN_FIT=0.95 (GAUSSIAN normality gate)
- EXPAND_CAT_MIN_ROWS=30 (CATEGORICAL population floor)
- EXPAND_FAMILIES (ablate families)
- EXPAND_MIN_DISTINCT=8

Charts above are generated from real measured numbers by _speedwork/make_charts.py.
Snake speaks pandas end to end — train from a DataFrame, predict on a
DataFrame, score a single Series — without you ever converting a thing.
import pandas as pd
from algorithmeai import Snake
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
snake = Snake(df_train, target_index="target") # fit
df_test["target"] = snake.get_prediction(df_test) # predict the whole test set, in row order
df_test.to_csv("submission.csv", index=False) # ship it

That's the entire train → predict → submit cycle. No .values, no .to_dict(),
no batching loop, no glue: read a CSV, fit, assign the column back. The
get_prediction(df_test) call fans every row across your CPU cores under the hood
and hands back a list in row order, so the assignment just lines up.
Zero-dependency, by design. Snake never imports pandas. It duck-types: a DataFrame is "anything with a callable .to_dict() and a .columns attribute", a Series is ".to_dict() but no .columns". So pandas flows through only because you brought it: install Snake without pandas and nothing changes.
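A minimal sketch of that duck-typing rule, using stand-in classes so that it runs without pandas. The function `classify_input` and the `Fake*` classes are hypothetical names for illustration, not part of Snake.

```python
def classify_input(obj):
    """Classify an input by shape alone, without importing pandas."""
    has_to_dict = callable(getattr(obj, "to_dict", None))
    if has_to_dict and hasattr(obj, "columns"):
        return "dataframe"  # callable .to_dict() and a .columns attribute
    if has_to_dict:
        return "series"     # .to_dict() but no .columns
    if isinstance(obj, dict):
        return "dict"
    if isinstance(obj, list):
        return "list"
    return "unknown"


class FakeFrame:
    """Stand-in with the two attributes the rule looks for."""
    columns = ["a", "b"]

    def to_dict(self, orient="dict"):
        return {"a": {0: 1}, "b": {0: 2}}


class FakeSeries:
    """Stand-in with .to_dict() only."""

    def to_dict(self):
        return {"a": 1, "b": 2}


print(classify_input(FakeFrame()), classify_input(FakeSeries()))
```

Because the check looks only at attributes, any object with the same shape is accepted, which is what keeps pandas out of the import list.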
The constructor already accepts a DataFrame; target_index is just the label
column's name (or position). Everything else — type detection, dedup, profiles —
works exactly as with list[dict].
import pandas as pd
from algorithmeai import Snake
df = pd.read_csv("orders.csv") # columns: family, denomination, factory, thickness, qty
model = Snake(df, target_index="family", n_layers=12, oppose_profile="industrial")

Every prediction method takes the feature DataFrame and returns a plain list in row order, so assignment back onto the frame is positional and clean. Under the hood it's the v5.4.8 parallel batch path: the rows fan out across your CPU cores.
X = df.drop(columns=["family"])
df["prediction"] = model.get_prediction(X) # ["LAMINATED", "MONOLITHIC", ...]
df["confidence"] = [max(p.values()) for p in model.get_probability(X)]
df["forecast"] = model.get_regression(X) # continuous targets (candles)
# Then it's just pandas — group, filter, route on Snake's own output:
df.groupby("factory")["confidence"].mean()
low_conf = df[df["confidence"] < 0.80] # send these to a fallback

This is what makes Snake different from a black-box classifier on a DataFrame:
every prediction comes with its reasoning and its evidence, and both are
just columns. get_audit(df) returns one human-readable trace per row; the
labeled lookalikes are the actual training rows that voted for each answer.
df["prediction"] = model.get_prediction(X)
df["audit"] = model.get_audit(X) # one reasoning trace per row (str)
df["lookalikes"] = model.get_lookalikes_labeled(X) # [[idx, class, condition, "c"|"n"], ...] per row
# Why was this row called LAMINATED? Read it.
print(df.loc[0, "audit"])
# How many training rows backed it, and were they core or noise?
las = df.loc[0, "lookalikes"]
core = [la for la in las if la[3] == "c"]
print(f"{len(las)} lookalikes, {len(core)} core")

get_augmented(df) returns a list of dicts that each carry the original row
(your metadata) plus Prediction, Probability, Audit, and Lookalikes.
Wrap it in pd.DataFrame(...) and you get a fully enriched frame in one line —
your id and feature columns ride along untouched, next to everything the model
knows about each row.
enriched = pd.DataFrame(model.get_augmented(X))
# columns: <your original columns...> + Prediction + Probability + Audit + Lookalikes
enriched[["id", "Prediction"]] # metadata + answer
enriched["n_lookalikes"] = enriched["Lookalikes"].str.len() # evidence count, per row
enriched.query("Prediction == 'LAMINATED'")["Audit"].iloc[0] # full reasoning for a class

Predict, explain, and keep your keys, all in a single pass over the data. Filter on confidence, drill into any row's audit, count its supporting lookalikes, route the uncertain ones to a fallback. The submission isn't just labels; it's a queryable record of why.
A single row off a DataFrame is a pd.Series — pass it directly. It behaves
exactly like the equivalent dict, all the way through the full audit trail.
row = X.iloc[0] # a pd.Series
model.get_prediction(row) # "LAMINATED"
model.get_probability(row) # {"LAMINATED": 1.0, "MONOLITHIC": 0.0}
print(model.get_audit(row)) # human-readable reasoning, no conversion
model.get_lookalikes_labeled(row) # the training rows that voted, with core/noise origin

| You pass | You get back | Notes |
|---|---|---|
| dict | single result | unchanged, the classic path |
| pd.Series (one row) | single result | treated as a dict |
| list[dict] | list of results, input order | parallel across cores |
| pd.DataFrame | list of results, row order | parallel across cores; assign with df["col"] = ... |
One guarantee worth stating plainly: the result is identical across all three
batch call styles. model.get_prediction(df), model.get_prediction(df.to_dict("records")),
and [model.get_prediction(r) for r in rows] return the same list — DataFrame
support is pure ergonomics, never a different code path. (Covered by the test suite.)
Real DataFrames have holes. Snake takes a real-world frame — NaNs, mixed
columns, missing fields — and just works, with no .fillna(), no
.dropna(), no pre-cleaning from you.
df = pd.read_csv("orders.csv") # fa=0.31, fb=NaN, thickness=8.0, factory=NaN, ...
model = Snake(df, target_index="family")
model.get_prediction(df.drop(columns=["family"])) # scores every row, NaN and all

Three rules, applied identically at training and inference:
- Types come from the DataFrame, not from sniffing strings. Snake reads df.dtypes off a copy of your frame, so a numeric column stays numeric even with NaN in it. (The subtle trap this kills: str(float("nan")) == "nan", whose letters used to flip a whole numeric column to text, silently halving its signal. A 15%-NaN column could cost 20+ points of held-out accuracy with no error raised. Gone.)
- NaN is filled by design (0.0 for numeric fields, "" for text fields) everywhere, so a missing value is a well-defined input, never a crash and never a poisoned literal.
- Missing columns are treated as fully absent and filled. If a query frame lacks a feature the model trained on, that feature is taken as all-missing and filled with its default, so the row still scores against every clause instead of silently mismatching. model.get_prediction(df_without_some_column) works.
This holds across every method that takes a DataFrame, Series, or batch —
get_prediction, get_probability, get_lookalikes(_labeled), get_candle,
get_regression, get_audit, get_augmented, and the get_batch_* fast paths.
Zero-dependency throughout: Snake reads .dtypes and detects NaN by duck-typing
(v != v), never by importing pandas or numpy.
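The fill rules and the `v != v` trick fit in a short sketch. The `schema` layout ("N" for numeric, "T" for text) and the function names are hypothetical stand-ins, and treating `None` the same as NaN is an assumption made here for illustration.

```python
def is_nan(v):
    """True only for NaN: it is the one value that is not equal to itself."""
    return v != v


def fill_row(row, schema):
    """Return a complete row: NaN, None, and absent fields get the type default.

    schema maps feature name -> "N" (numeric) or "T" (text).
    """
    defaults = {"N": 0.0, "T": ""}
    filled = {}
    for name, kind in schema.items():
        v = row.get(name)  # absent columns come back as None
        if v is None or is_nan(v):
            v = defaults[kind]
        filled[name] = v
    return filled


schema = {"fa": "N", "fb": "N", "factory": "T"}
row = {"fa": 0.31, "fb": float("nan")}  # "factory" is missing entirely
clean = fill_row(row, schema)
print(clean)
```

Running the same function at training and at inference is what makes a missing value a well-defined input instead of a special case.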
Also in v5.4.8: parallel batch + stripped serving. Every prediction method is polymorphic over list[dict] too, fanning across cores exactly as it does for a DataFrame. to_json(stripped=True) drops the training population so many workers fit in RAM and stay CPU-bound, the basis for a multi-core serving pool. → Batch inference · Stripped models

Snake already explains every prediction; v5.4.7 explains the whole dataset and tells you, honestly, when there's nothing to explain. A permutation-free χ² significance test on debiased mutual information flags is_noise=true on random data and strong on real signal, reports calibration vs the true base rate, and runs zero-dependency offline (the plain-language narration is one optional cloud call). → Full section
Snake is a SAT-based lookalike voting classifier. For each prediction, it finds training samples that "look alike" via Boolean clause matching, then votes by their labels. The result: a fully explainable classifier where every prediction comes with a human-readable audit trail.
Input X → Match SAT clauses → Find lookalikes → Vote → Prediction + Audit
Predicted outcome: setosa (93.3%)
Because: "petal_length" <= 2.45 AND "petal_width" <= 0.8
Matched 15 lookalikes, all class setosa.
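The match-then-vote pipeline can be illustrated with a toy. This is a sketch of the idea only, not Snake's engine: the clause representation (a list of `(feature, op, threshold)` literals paired with the training row indices it covers) is hypothetical, and it leaves out layers, buckets, and routing.

```python
from collections import Counter

OPS = {"<=": lambda a, b: a <= b, ">": lambda a, b: a > b}


def matches(clause, x):
    """A clause is an AND of (feature, op, threshold) literals."""
    return all(OPS[op](x[feat], thr) for feat, op, thr in clause)


def vote(clauses, train, x, target="species"):
    """Collect the training rows behind every clause x satisfies, then vote."""
    lookalikes = []
    for clause, member_ids in clauses:
        if matches(clause, x):
            lookalikes.extend(member_ids)  # votes are merged without dedup
    counts = Counter(train[i][target] for i in lookalikes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


train = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "setosa", "petal_length": 1.3, "petal_width": 0.3},
    {"species": "versicolor", "petal_length": 4.5, "petal_width": 1.5},
]
clauses = [
    ([("petal_length", "<=", 2.45), ("petal_width", "<=", 0.8)], [0, 1]),
    ([("petal_width", ">", 0.8)], [2]),
]
probs = vote(clauses, train, {"petal_length": 1.5, "petal_width": 0.25})
print(probs)
```

The matched clause is itself the explanation: its literals are the "Because" line, and its member rows are the lookalikes that cast the votes.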
v5.4.8 makes every prediction method polymorphic — pass a list of dicts (or a pandas DataFrame, or a single pd.Series) and Snake divides the batch across your CPU cores (processes, not threads), returning results in input order, exactly matching the sequential call (5.4× on 10 cores, 20k rows). Train from a DataFrame, predict on a DataFrame, df['pred'] = model.get_prediction(X) — all duck-typed, so the library still imports nothing. Adds stripped serialization (to_json(stripped=True)) — drop the training population to serve the hot path from a lean, RAM-light artifact, the basis for a multi-core worker pool.
v5.4.7 adds get_synthetic — a synthetic audit at dataset scale. Snake scans a batch of points locally (0 tokens, sub-ms each) and reports deterministic diagnostics: prediction spread, calibration vs the true base rate, debiased feature mutual-information, and a label-free noise detector that tells you honestly when your data looks like random luck. An optional one-call cloud narration (via the monceai SDK) explains the finding as a tiny science experiment — hypothesis / experiment / result. The local path stays zero-dependency; the cloud voice is imported lazily.
v5.4.6 adds distribution candles and regression — Snake now does both classification and regression from the same model. Every prediction yields a candle (high/q3/median/q1/low/mean/iqr_mean/std/n) summarising the lookalike distribution; the IQR-trimmed mean is the regression estimate. On a synthetic linear regression (n=800 train), +7.4pp R² over get_prediction and 2.3× faster in batch.
v5.4.0 added Shannon MI-weighted feature selection and lookahead literal selection — the first principled feature importance mechanism in Snake. On 500-feature datasets with 20 informative features, +12.6pp AUROC over v5.2.1. Zero regressions on classical datasets.
pip install git+https://github.com/Monce-AI/algorithmeai-snake.git

Dev install:

git clone https://github.com/Monce-AI/algorithmeai-snake.git
cd algorithmeai-snake
pip install -e .

Python 3.9+. Zero dependencies: uses only the standard library.
from algorithmeai import Snake
# Train from a list of dicts (production pattern)
data = [
{"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
{"species": "setosa", "petal_length": 1.3, "petal_width": 0.3},
{"species": "versicolor", "petal_length": 4.5, "petal_width": 1.5},
{"species": "versicolor", "petal_length": 4.1, "petal_width": 1.3},
{"species": "virginica", "petal_length": 5.2, "petal_width": 2.0},
{"species": "virginica", "petal_length": 5.0, "petal_width": 1.9},
# ... more rows
]
model = Snake(data, n_layers=5, bucket=250)
# Predict
X = {"petal_length": 4.3, "petal_width": 1.4}
print(model.get_prediction(X)) # "versicolor"
print(model.get_probability(X)) # {"setosa": 0.0, "versicolor": 0.87, "virginica": 0.13}
print(model.get_audit(X)) # Full reasoning trace
# Save & reload
model.to_json("model.json")
model = Snake("model.json") # Auto-detected by .json extension
print(model.get_prediction(X)) # Same result

Snake votes by lookalikes. When the target is a class, votes give a probability dict. When the target is a float, votes give a distribution, and a distribution is exactly what a trader reads off a chart: a high, a low, a body, a wick.
For each prediction, Snake assembles the lookalike y values into a Candle:
▲ high ── max(y over lookalikes)
│
│
╔╧╗ q3 ── 75th percentile
║ ║
║─║ median ── 50th percentile
║ ║
╚╤╝ q1 ── 25th percentile
│
│
▼ low ── min(y over lookalikes)
The candle is a pure-distribution object — there is no order on a set of lookalikes, so OHLC is replaced by a five-number summary plus mean, IQR-trimmed mean, std, and n.
from algorithmeai import Snake
# Train on a continuous target — y is a float, no binning required.
data = [{"x1": ..., "x2": ..., "y": 12.34}, ...]
model = Snake(data, target_index="y", n_layers=20, bucket=80, noise=0.5, workers=4)
# Single-point candle
c = model.get_candle({"x1": 4.3, "x2": 1.4})
c.high, c.q3, c.median, c.q1, c.low # five-number summary
c.mean, c.iqr_mean, c.std, c.n # point estimates + dispersion
c.to_dict() # JSON-friendly
# Regression — the IQR-trimmed mean of the lookalike set
y_hat = model.get_regression({"x1": 4.3, "x2": 1.4})
# Batch (uses the Cython batch lookalike fast-path when available)
candles = model.get_batch_candles(X_test)
y_hats = model.get_batch_regression(X_test)

get_prediction is mode-like: it picks the most-frequent lookalike value, which is brittle on continuous targets. iqr_mean averages the middle 50% of the lookalike distribution, discarding the wicks. It's robust like the median and smooth like the mean.
| Method | What it does | R² | Time (200 preds) |
|---|---|---|---|
| get_prediction (classification path) | most-frequent lookalike y | 0.8734 | 248 ms |
| get_regression / iqr_mean | average of lookalikes within [Q1, Q3] | 0.9477 | 106 ms |
Synthetic regression: y = 3·x₁ − 2·x₂ + 1.5·x₃ + 𝒩(0, 1), n=800 train / 200 test, Snake(n_layers=20, bucket=80, noise=0.5). Batch is faster because all four candle/regression methods share a single Cython lookalike fetch.
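To make the IQR-trimmed mean concrete, here is a small standard-library sketch. It is not Snake's implementation: `CandleSketch` and `make_candle` are hypothetical names, and the quartile interpolation (`method="inclusive"`) and population standard deviation are assumptions.

```python
import statistics
from dataclasses import dataclass


@dataclass
class CandleSketch:
    high: float
    q3: float
    median: float
    q1: float
    low: float
    mean: float
    iqr_mean: float
    std: float
    n: int


def make_candle(ys):
    """Summarise lookalike y values (needs at least two values)."""
    ys = sorted(ys)
    q1, median, q3 = statistics.quantiles(ys, n=4, method="inclusive")
    # Keep the body, drop the wicks; fall back to the median if the body is empty.
    body = [y for y in ys if q1 <= y <= q3] or [median]
    return CandleSketch(
        high=ys[-1], q3=q3, median=median, q1=q1, low=ys[0],
        mean=statistics.fmean(ys),
        iqr_mean=statistics.fmean(body),
        std=statistics.pstdev(ys),
        n=len(ys),
    )


c = make_candle([1.0, 2.0, 3.0, 4.0, 100.0])  # one wild lookalike
print(c.mean, c.iqr_mean)
```

With a single outlier in the lookalike set, the plain mean is dragged to 22.0 while the IQR-trimmed mean stays at 3.0, which is the robustness the table above measures.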
Wide wicks → "model is unsure, downside/upside tail real." Tight body around the median → "high consensus." The candle gives the trader (or risk function) a confidence interval for free, on every prediction, without bootstrapping or sampling.
c = model.get_candle(X)
range_width = c.high - c.low # uncertainty proxy
body_width = c.q3 - c.q1 # consensus width
skew = (c.q3 - c.median) - (c.median - c.q1) # >0 = upside skew

| Method | Returns | Description |
|---|---|---|
| get_candle(X) | Candle | Distribution of lookalike y values |
| get_batch_candles(Xs) | list[Candle] | Batched, shares the Cython lookalike path |
| get_regression(X) | float | IQR-trimmed mean of the candle |
| get_batch_regression(Xs) | list[float] | Batched regression |
Candle is a dataclass exposing high, q3, median, q1, low, mean, iqr_mean, std, n plus to_dict().
The candle path applies to any continuous target — financial returns, lab measurements, sensor readings, latencies, prices. The model is unchanged; only the post-processing of lookalikes differs.
The Shannon signature. On high-dimensional data, oppose() used to pick features uniformly at random. With 20 informative features out of 500, that's a 4% hit rate on signal. The rest is noise.
v5.4.0 precomputes mutual information MI(feature; target) for each feature — the exact quantity decision trees maximize at each split (Quinlan 1986). oppose() now picks features proportionally to MI weight:
# Before (v5.2.1): uniform random
index = choice(candidates)
# After (v5.4.0): MI-weighted
index = choices(candidates, weights=MI_weights)[0]

The math:

MI(X; Y) = Σ P(x,y) · log₂[ P(x,y) / (P(x) · P(y)) ]
Computed once in O(n×m), passed to all workers. Numeric features are quantile-binned into 20 bins. Text features use raw values (capped at 200 unique).
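A minimal sketch of the computation for one feature, assuming discrete (or already binned) values. The function names are hypothetical, and the rank-based binning shown here is a simplification that can split tied values across adjacent bins.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # P(x,y) * log2( P(x,y) / (P(x) * P(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def quantile_bin(values, bins=20):
    """Map each numeric value to a rank-based bin index in [0, bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank * bins // len(values)
    return out


target = [0, 0, 1, 1]
informative = mutual_information([0, 0, 1, 1], target)  # feature copies the target
useless = mutual_information([0, 1, 0, 1], target)      # feature independent of it
print(informative, useless)
```

A feature that determines a balanced binary target carries exactly 1 bit; an independent one carries 0. Those scores are what become the sampling weights.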
Impact on feature selection (Hard Madelon, 500 features):
| Informative features | Uniform P(signal) | MI-weighted P(signal) | Lift |
|---|---|---|---|
| 5 / 500 | 1.0% | 3.1% | 3.1× |
| 20 / 500 | 4.0% | 10.8% | 2.7× |
| 50 / 500 | 10.0% | 25.0% | 2.5× |
MI weights decide which features to try. Lookahead decides which literal to keep.
Instead of taking the first literal oppose() generates, construct_clause now generates K=5 candidates and picks the one that covers the most Ts (training positives). Higher coverage = tighter clause = fewer literals needed = better generalization.
# Before: one shot
lit = oppose(choice(Ts), F)
# After: best of K
candidates = [oppose(choice(Ts), F) for _ in range(K)]
lit = max(candidates, key=lambda l: coverage(l, Ts))

On classical data, this produces fewer, tighter clauses:
| Dataset | v5.2.1 clauses | v5.4.0 clauses | Reduction |
|---|---|---|---|
| Breast Cancer | 872 | 783 | -10.2% |
| Digits | 10,195 | 9,061 | -11.1% |
| Wine | 396 | 343 | -13.4% |
# Control lookahead: K=1 disables it, K=10 for maximum quality
model = Snake(data, lookahead=5) # default
model = Snake(data, lookahead=1) # v5.2.1 behavior
model = Snake(data, lookahead=10) # slower, tighter clauses

Tested on Hard Madelon (500 features, weak Gaussian signal in n informative features, noise in the rest) and classical sklearn datasets. Same seeds, same splits.
High-dimensional (the MI sweet spot):
| Dataset (500 features) | v5.2.1 AUROC | v5.4.0 AUROC | Delta |
|---|---|---|---|
| 5 informative, signal=1.2 | 0.526 | 0.727 | +20.1pp |
| 20 informative, signal=0.8 | 0.587 | 0.712 | +12.6pp |
| 50 informative, signal=0.8 | 0.711 | 0.850 | +13.9pp |
| 10 informative, signal=0.8 | 0.562 | 0.634 | +7.2pp |
Classical (no regression):
| Dataset | Features | v5.2.1 AUROC | v5.4.0 AUROC | Delta |
|---|---|---|---|---|
| Breast Cancer | 30 | 0.996 | 0.998 | +0.2pp |
| Iris | 4 | 0.993 | 0.998 | +0.5pp |
| Wine | 13 | 0.999 | 1.000 | +0.1pp |
| Digits | 64 | 0.997 | 0.996 | -0.1pp |
Run python benchmark_mi.py to reproduce.
Snake accepts five input formats. The first key/column is the target by default.
| Format | Example | Notes |
|---|---|---|
| List of dicts | Snake([{"label": "A", ...}]) | Production pattern. First key = target |
| CSV file | Snake("data.csv", target_index=3) | Pandas-formatted CSV |
| DataFrame | Snake(df, target_index="species") | Duck-typed, no pandas dependency |
| List of tuples | Snake([("cat", 4, "small"), ...]) | First element = target, auto-headers |
| List of scalars | Snake(["apple", "banana", ...]) | Self-classing, dedupes to unique |
List of dicts (recommended):
model = Snake([
{"survived": 1, "class": 3, "sex": "male", "age": 22},
{"survived": 0, "class": 1, "sex": "female", "age": 38},
])

CSV file:

model = Snake("titanic.csv", target_index=0)

DataFrame:

model = Snake(df, target_index="survived")

DataFrames work for inference too: model.get_prediction(df) predicts every
row in parallel and returns a list in row order. See DataFrames & pandas.

List of tuples:

model = Snake([("cat", 4, "small"), ("dog", 40, "large"), ("cat", 5, "small")])

List of scalars (self-classing, useful for synonym deduplication):

model = Snake(["44.2 LowE", "44.2 bronze", "Float 4mm clair"])

Complex targets (dict/list values as targets):
data = [
{"label": {"color": "red", "size": "big"}, "feature": "round"},
{"label": {"color": "blue", "size": "small"}, "feature": "square"},
]
model = Snake(data, n_layers=5)
pred = model.get_prediction({"feature": "round"}) # returns {"color": "red", "size": "big"}

Snake(Knowledge, target_index=0, excluded_features_index=(), n_layers=5, bucket=250, noise=0.25, vocal=False, saved=False, progress_file=None, workers=1, oppose_profile="auto", lookahead=5)

| Parameter | Type | Default | Description |
|---|---|---|---|
| Knowledge | str / list / DataFrame | required | CSV path, JSON model path, list of dicts/tuples/scalars, or DataFrame |
| target_index | int / str | 0 | Target column index or name |
| excluded_features_index | tuple/list | () | Column indices to exclude from training |
| n_layers | int | 5 | Number of SAT layers to build (more = more accurate, slower) |
| bucket | int | 250 | Max samples per bucket before splitting |
| noise | float | 0.25 | Cross-bucket noise ratio for regularization |
| vocal | bool | False | Print training progress |
| saved | bool | False | Auto-save model after training (CSV flow only) |
| progress_file | str/None | None | File path for JSON training progress updates |
| workers | int | 1 | Parallel workers for layer construction (>1 uses multiprocessing) |
| oppose_profile | str | "auto" | Literal generation strategy: auto, balanced, linguistic, industrial, cryptographic, scientific, categorical |
| lookahead | int | 5 | Literal candidates per oppose call. 1 = v5.2.1 behavior. Higher = tighter clauses, slower |
| Method | Returns | Description |
|---|---|---|
| get_prediction(X) | value | Most probable class |
| get_probability(X) | dict | {class: probability} for all classes |
| get_lookalikes(X) | list | [[index, class, condition], ...] matched training samples |
| get_lookalikes_labeled(X) | list | [[index, class, condition, origin], ...] with "c" (core) or "n" (noise) |
| get_augmented(X) | dict | Input enriched with Lookalikes, Probability, Prediction, Audit |
| get_audit(X) | str | Full human-readable reasoning trace |
| get_candle(X) | Candle | Distribution of lookalike y values: high/q3/median/q1/low/mean/iqr_mean/std/n |
| get_batch_candles(Xs) | list[Candle] | Batched candles, shares Cython lookalike fast-path |
| get_regression(X) | float | IQR-trimmed mean of the candle (continuous targets) |
| get_batch_regression(Xs) | list[float] | Batched regression |
| get_synthetic(Xs, interpret=True) | dict | Dataset-scale synthetic audit (v5.4.7): deterministic summary + optional cloud narration |
Every X above can also be a list of dicts — Snake then parallelizes across
your CPU cores and returns a list of results in input order. See
Batch inference.
X = {"petal_length": 4.3, "petal_width": 1.4}
model.get_prediction(X) # "versicolor"
model.get_probability(X) # {"setosa": 0.0, "versicolor": 0.87, "virginica": 0.13}
model.get_lookalikes(X) # [[42, "versicolor", [0, 5]], [87, "versicolor", [3]]]
model.get_augmented(X) # {**X, "Lookalikes": ..., "Probability": ..., "Prediction": ..., "Audit": ...}

Every prediction method is polymorphic: give it a single dict and you get a single result; give it a list of dicts and Snake fans the work across a proportionate number of CPU processes and returns a list in the same order. Same method, same name: you don't learn a new API to go fast.
# Single datapoint — one result (unchanged)
model.get_prediction({"petal_length": 4.3, "petal_width": 1.4}) # "versicolor"
# A list — Snake divides it across your cores, returns a list in input order
batch = [{"petal_length": 4.3, "petal_width": 1.4}, ...] # 20,000 rows
model.get_prediction(batch) # ["versicolor", "setosa", ...] (len == 20,000)
model.get_probability(batch) # [ {...}, {...}, ... ]
model.get_lookalikes(batch) # [ [...], [...], ... ]
model.get_regression(batch) # [ 3.14, 2.71, ... ]

The result is exact, not approximate. Each datapoint is scored independently
by the very same single-dict code path, so the batch result is element-for-element
identical to a plain [model.get_prediction(x) for x in batch] — just faster.
Snake's layers are independent and votes are merged without dedup, so splitting a
batch across processes can't change an answer.
It uses processes, not threads (pure-Python inference is GIL-bound, so threads
wouldn't help). Snake sizes the pool to min(cpu_count, len(batch)) — one balanced
chunk per worker — so it scales with the hardware you actually have. Measured 5.4×
on a 10-core machine for a 20k-row batch; the speedup tracks core count.
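The pool-sizing rule and the order guarantee can be sketched without starting any processes. The helper names below are hypothetical, and the sequential loop stands in for the real process pool, which is an assumption about how chunks are dispatched.

```python
import os


def balanced_chunks(batch, n_workers=None):
    """Split batch into min(n_workers, len(batch)) contiguous, near-equal chunks."""
    if not batch:
        return []
    n = min(n_workers or os.cpu_count() or 1, len(batch))
    size, extra = divmod(len(batch), n)
    chunks, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)  # first `extra` chunks get one more row
        chunks.append(batch[start:end])
        start = end
    return chunks


def score_in_order(batch, score_one, n_workers=None):
    """Score chunk by chunk and concatenate; contiguous chunks preserve input order."""
    results = []
    for chunk in balanced_chunks(batch, n_workers):
        results.extend(score_one(x) for x in chunk)  # a real pool would run chunks in parallel
    return results


rows = list(range(10))
chunks = balanced_chunks(rows, n_workers=4)
print(chunks)
```

Because every row is scored independently and chunks are contiguous, concatenating the per-chunk results reproduces the sequential list exactly, whatever the worker count.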
Best practices:
- Hand Snake the whole list at once. Don't loop in Python calling the single-dict method — that's the slow path you're trying to avoid. One call with the full list lets Snake schedule the cores.
- Batch has a small fixed cost (forking workers). Below model._parallel_threshold (default 64) Snake runs inline automatically, so tiny batches don't pay for a pool.
- Cap the pool with model._max_workers = N if you're sharing the box with other work (e.g. a web server's own workers). Defaults to all cores.
The same polymorphism extends to pandas — pass a DataFrame or a pd.Series
and Snake handles it natively. See DataFrames & pandas
for the full workflow.
Audit output (Routing AND + Lookalike AND):
### BEGIN AUDIT ###
Prediction: versicolor
Layers: 5, Lookalikes: 47
LOOKALIKE SUMMARY
================================================
versicolor 87.2% (41/47) █████████████████░░░
e.g. sample with petal_length 4.5
virginica 10.6% (5/47) ██░░░░░░░░░░░░░░░░░░
e.g. sample with petal_length 5.0
PROBABILITY
================================================
P(versicolor) = 87.2% █████████████████░░░
P(virginica) = 10.6% ██░░░░░░░░░░░░░░░░░░
================================================
LAYER 0
================================================
Routing AND (bucket 1/2, 78 members):
"petal_width" > 0.8
Lookalike AND (12 matches):
Lookalike #42 [versicolor]: 4.5
AND: "petal_length" <= 5.0 AND "petal_width" <= 1.7
...
>> PREDICTION: versicolor
### END AUDIT ###
get_audit(X) explains one prediction. get_synthetic(Xs) explains a whole dataset: it scans a batch, aggregates deterministic diagnostics locally (0 tokens, sub-ms per point), and — optionally — makes a single cloud call to narrate the finding in plain language.
The design has two layers, and the split is deliberate:
- Local core (interpret=False): 100% standard library, zero network, zero tokens. This is what the public free tier always gets.
- Cloud voice (interpret=True, default): one call to the monceai SDK Json, imported lazily so Snake stays zero-dependency. monceai depends on algorithmeai, never the reverse, so there is no circular dependency.
model = Snake(train, target_index="sensitive", n_layers=20, bucket=200, oppose_profile="industrial")
# Pure local — deterministic, offline, free. Pass labeled rows to also get held-out metrics.
summary = model.get_synthetic(test, interpret=False)["summary"]
{
"n_points": 95, "n_classes": 2, "task": "binary", "n_layers": 20,
"prediction_distribution_pct": {"0": 57.9, "1": 42.1},
"true_base_rate_pct": {"0": 50.4, "1": 49.6},
"calibration_gap_pts": 7.5, # predicted vs true majority rate
"mean_confidence": 0.881,
"low_confidence_rate_pct": 11.6,
"coverage_rate_pct": 100.0, # share of points a bucket actually matched
"top_features_mi": [["subtype", 0.61], ["hotspot_mutations", 0.60], ...],
"signal_strength": "strong", # none / weak / moderate / strong
"is_noise": false, # honest: true when data looks like chance
"n_labeled": 95,
"holdout_accuracy_pct": 82.1, # only when X carries the target column
"holdout_auroc": 0.856 # no retrain — uses the labels you passed
}
# With the cloud voice: one Json call narrates it as a science experiment.
out = model.get_synthetic(test)
print(out["hypothese"]) # "We bet the genetic clues could sort cell lines into responders vs not."
print(out["experience"]) # "Snake ran a 20-layer logic check over all 95 points, spending zero AI tokens."
print(out["resultat"]) # "The bet held: 82.1% correct (AUROC 0.856), strongest clue = subtype —
# though it's a touch over-optimistic, so trust accuracy over confidence."
If monceai is not installed, interpret=True raises a clear error pointing at the install and the interpret=False escape hatch, never a bare ImportError.
The headline property: Snake refuses to manufacture signal from noise. Two safeguards back the is_noise flag:
- Effect size: mutual information per feature, Miller-Madow bias-corrected. On finite data, an unrelated feature still shows MI > 0 because empty (feature, class) cells never appear; we subtract the analytic null (bins−1)(classes−1)/(2n·ln2).
- Significance: under independence, 2n·MI·ln2 ∼ χ² with df = (bins−1)(classes−1). is_noise fires only when no feature clears ≈3σ above its null. This is a permutation-free significance test, not a tuned threshold.
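The two safeguards can be written out with the standard library alone. The sketch below is an illustration under stated assumptions, not Snake's code: `mi_bits` and `mi_verdict` are hypothetical names, the features are taken as already binned, and "≈3σ above its null" is read as the χ² mean plus three standard deviations (mean df, variance 2·df), which is one plausible interpretation.

```python
import math
from collections import Counter

def mi_bits(xs, ys):
    """Plug-in mutual information in bits between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mi_verdict(xs, ys, n_sigma=3.0):
    """Bias-corrected MI plus a chi-square style significance check (sketch)."""
    n = len(xs)
    df = (len(set(xs)) - 1) * (len(set(ys)) - 1)   # (bins-1)(classes-1)
    raw = mi_bits(xs, ys)
    null_bias = df / (2 * n * math.log(2))          # analytic null, in bits
    g_stat = 2 * n * raw * math.log(2)              # ~ chi-square(df) under independence
    cutoff = df + n_sigma * math.sqrt(2 * df)       # assumed reading of "3 sigma above null"
    return {
        "mi_corrected": max(0.0, raw - null_bias),
        "g": g_stat,
        "significant": df > 0 and g_stat > cutoff,
    }

dependent = mi_verdict([0, 1] * 50, [0, 1] * 50)
independent = mi_verdict([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25)
```

A dataset would be flagged as noise in this reading only when no feature returns `significant=True`.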
| Data | signal_strength | is_noise | held-out AUROC |
|---|---|---|---|
| Random labels (coin-flip) | none | true | ≈ 0.50 |
| Real signal σ(1.8a − 1.3b) | strong | false | 0.84 |
Same pipeline, opposite verdicts — exactly what you want from an explainable classifier: silent when there's nothing, confident when there is.
- Xs: a list of dicts (a lone dict is treated as a batch of one). Include the target column to receive held-out accuracy / AUROC; omit it for an unlabeled scan (everything except those two fields still works).
- Returns {"summary": {...}}, plus "hypothese" / "experience" / "resultat" when interpret=True.
Each lookalike carries an origin label: core (c) = routed to the bucket by condition, noise (n) = randomly injected from the full population for regularization. This splits Snake's probability into independent signals.
model = Snake(data, target_index="label", n_layers=77, bucket=150, noise=0.40, workers=10)
# Labeled lookalikes — each entry: [global_idx, target_value, condition, origin]
lookalikes = model.get_lookalikes_labeled(X)
for idx, target, cond, origin in lookalikes:
    print(f"#{idx} [{target}] ({origin})")  # e.g. "#42 [1] (c)", "#8 [0] (n)"
# Weighted probability: trust core more than noise
def weighted_prob(lookalikes, target_class, w_c=2, w_n=1):
    total = sum(w_c if la[3] == "c" else w_n for la in lookalikes)
    hits = sum((w_c if la[3] == "c" else w_n) for la in lookalikes if str(la[1]) == str(target_class))
    return hits / total if total > 0 else 0.5
# Split signals
core_only = [la for la in lookalikes if la[3] == "c"]
noise_only = [la for la in lookalikes if la[3] == "n"]

Key finding: The optimal weight ratio (w_c, w_n) depends on n_layers. At low layer counts, core dominates and noise is a distraction. At high layer counts, noise becomes a genuine complementary signal because full-population diversity compounds across stochastic layers.
| Config | Core AUROC | Noise AUROC | Divergence winner | Best weighting |
|---|---|---|---|---|
| 7 layers, noise=0.25 | 0.895 | 0.768 | Core 81% | Pure core |
| 77 layers, noise=0.40 | 0.891 | 0.877 | Noise 59% | Noise-heavy |
Backwards compatible — old models without origins default to "c" for all lookalikes.
Snake supports 7 oppose profiles — each a tuned literal generation strategy for a data archetype. The oppose() function is untouched; profiles are substitute functions that control which literal types get generated and at what probability.
# Auto (default) — Snake scans your data and picks the best profile
model = Snake(data, oppose_profile="auto")
# Explicit — you know your data
model = Snake(data, oppose_profile="cryptographic")
model = Snake(data, oppose_profile="linguistic")

| Profile | Best for | Key literal types | Speed |
|---|---|---|---|
| auto | Any data (scans population, picks one) | Depends on detection | n/a |
| balanced | Unknown data, mixed types | Equal weight across all 24 types | Medium |
| linguistic | NLP, free text, author attribution | LEV (edit distance), JAC (bigram similarity), PFX/SFX | Slower |
| industrial | Product codes, SKUs, short labels | T (substring), TN/TLN (structural), splits | Fast |
| cryptographic | Hashes, IDs, encoded data | ENT (entropy), HEX (hex ratio), CFC (char freq), REP (repeat) | Medium |
| scientific | Measurements, sensors, lab data | NZ (z-score), NL (log-scale), NMG (magnitude) | Fast |
| categorical | Surveys, tags, enums | TWS/TPS/TSS (splits) + T (substring) at 60% combined | Fast |
30 literal types (was 7): the original 7 (T, TN, TLN, TWS, TPS, TSS, N) plus 23 new types across distance, positional, charclass, crypto, and scientific families. Each literal is still [index, value, negat, tag] — same format, same apply_literal.
Auto-detection scans text features for avg length, length variance, digit ratio, uppercase ratio, special char ratio, and delimiter density. Pure numeric data → scientific. Long varied text → linguistic. Short codes with digits → industrial. High special chars → cryptographic. Many delimiters → categorical. Mixed or unclear → balanced.
Backwards compatible — old models without oppose_profile default to the original oppose(). New literal types return False for unknown tags (graceful degradation).
Same data, same split (80/20, seed=42), zero preprocessing. Snake uses 15 layers, bucket=250, workers=10. RF/GB use 100 estimators (sklearn defaults). No feature engineering on either side.
Pure numeric (float matrices):
| Dataset | Features | Classes | Random Forest | GradBoost | Snake (best profile) | Profile |
|---|---|---|---|---|---|---|
| Iris | 4 | 3 | 100.0% | 100.0% | 100.0% | all tie |
| Wine | 13 | 3 | 100.0% | 94.4% | 100.0% | original |
| Breast Cancer | 30 | 2 | AUROC 0.997 | AUROC 0.995 | AUROC 0.999 | original |
| Digits | 64 | 10 | 97.2% | 96.9% | 96.4% | original |
Mixed text + numeric (the Snake sweet spot):
| Dataset | Features | Classes | Snake AUROC | Snake Acc | Best profile | vs original |
|---|---|---|---|---|---|---|
| Classic Titanic (w/ Names) | 8 | 2 | 0.924 | 87.2% | cryptographic | +3.8pp |
| Spaceship Titanic | 12 | 2 | 0.840 | 78.6% | balanced | +3.7pp |
Snake beats RF and GB on Breast Cancer AUROC (0.999 vs 0.997 vs 0.995). Ties on Iris/Wine. Within 0.8pp on Digits. On mixed text+numeric data, profiles add up to +3.8pp AUROC over the original oppose — no preprocessing required.
# Save (full model — keeps everything, the default)
model.to_json("model.json")
# Load (auto-detected by .json extension)
model = Snake("model.json")
Backwards compatible: v0.1 flat JSON files (with clauses + lookalikes at top level) are automatically wrapped into the bucketed format on load.
A saved model includes the full training population — typically the large majority of the file. Inference doesn't need it: prediction, probability, lookalikes, candle, and regression run entirely off the SAT layers. Save a stripped model to drop the population and keep only what serving needs:
# One-time: produce a lean artifact for production
model.to_json("model.stripped.json", stripped=True)
# In the server: load the stripped model — far less RAM per instance
served = Snake("model.stripped.json")
served.get_prediction(X) # works
served.get_probability(batch) # works, and parallelizes
Why it matters for a worker pool. To use every core you run several Snake workers, each holding the model. A stripped model is small, so you can fit many copies in RAM and stay CPU-bound (every core busy doing useful clause work) instead of RAM-bound. This is the same split the production Snake API uses to serve at scale.
The trade-off is explicit and safe. get_audit and get_augmented render
real training examples ("e.g. 44.2 LowE"), so they need the population. Call
them on a stripped model and you get a clear error telling you to load the full
model — never a silent wrong answer:
served.get_audit(X)
# RuntimeError: get_audit() needs the training population, but this model was
# loaded stripped (no population). ... Re-save with to_json(stripped=False) ...
Rule of thumb: keep one full model (for audits / RAG / debugging) and serve from the stripped one (for the high-throughput hot path).
# Score each layer on validation data, keep the top N
model.make_validation(val_data, pruning_coef=0.5)
# pruning_coef=0.5 keeps the best 50% of layers
# Save the pruned model
model.to_json("model_pruned.json")
val_data is a list of dicts (same format as training data, must include the target field).
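The selection step behind pruning_coef can be illustrated on its own. This is a sketch under assumptions, not the body of make_validation: `keep_top_layers` is a hypothetical helper, the per-layer score is taken as any higher-is-better validation metric, and rounding up with a floor of one layer is an assumed tie-breaking choice.

```python
import math

def keep_top_layers(layer_scores, pruning_coef=0.5):
    """Return indices of the best-scoring layers, in original order (sketch)."""
    n_keep = max(1, math.ceil(len(layer_scores) * pruning_coef))  # assumed rounding
    ranked = sorted(range(len(layer_scores)),
                    key=lambda i: layer_scores[i], reverse=True)
    return sorted(ranked[:n_keep])

kept = keep_top_layers([0.6, 0.9, 0.5, 0.8], pruning_coef=0.5)
```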
Input X
│
▼
┌─────────────────────────────────────────┐
│ MI-Weighted Feature Selection │
│ │
│ _precompute_feature_mi() │
│ MI(feature; target) for each column │
│ oppose() picks features ∝ MI weight │
└─────────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Lookahead Literal Selection │
│ │
│ Generate K=5 candidate literals │
│ Pick the one covering the most Ts │
│ → Tighter clauses, fewer per formula │
└─────────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Bucket Chain (IF/ELIF/ELSE) │
│ │
│ IF condition_0(X): → Bucket 0 │
│ ELIF condition_1(X): → Bucket 1 │
│ ELSE: → Bucket N │
│ │
│ Each condition = AND of SAT literals │
│ Each bucket ≤ 250 samples │
└─────────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Local SAT (per bucket) │
│ │
│ For each target class: │
│ Build minimal clauses separating │
│ positive from negative samples │
│ │
│ 30 literal types (7 families): │
│ T/TN/TLN — substring/structural │
│ TWS/TPS/TSS — split counts │
│ N — numeric threshold │
│ LEV/JAC — edit distance/bigrams │
│ PFX/SFX — prefix/suffix length │
│ ENT/HEX/REP/CFC — crypto features │
│ NZ/NL/NMG — z-score/log/magnitude │
└─────────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Lookalike Voting │
│ │
│ Find training samples matching X │
│ via SAT clause satisfaction │
│ │
│ Vote by target labels → probability │
│ Return max probability class │
└─────────────────────────────────────────┘
Repeated across n_layers independent layers. Final prediction aggregates all lookalikes across all layers.
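The bucket chain and the final vote in the diagram can be mimicked with plain Python. This is a simplified sketch, not Snake's implementation: the function names are hypothetical, only the numeric N literal is handled, and literals address fields by name in a dict rather than by column index as the [index, value, negat, tag] format does.

```python
from collections import Counter

def apply_numeric_literal(row, literal):
    """literal = [field, threshold, negat, tag]; tests field > threshold."""
    field, threshold, negat, tag = literal
    if tag != "N":
        return False  # unknown tags evaluate to False (graceful degradation)
    result = row[field] > threshold
    return (not result) if negat else result  # negat flips > into <=

def route(row, chain):
    """Walk an IF/ELIF/ELSE chain; a condition of None plays the ELSE branch."""
    for condition, bucket in chain:
        if condition is None or all(apply_numeric_literal(row, lit) for lit in condition):
            return bucket
    return None

def vote(lookalike_labels):
    """Turn lookalike labels into class probabilities and pick the largest."""
    counts = Counter(lookalike_labels)
    total = sum(counts.values())
    probs = {label: c / total for label, c in counts.items()}
    return max(probs, key=probs.get), probs

chain = [
    ([["petal_width", 0.8, True, "N"]], "bucket 0"),  # petal_width <= 0.8
    (None, "bucket 1"),                               # ELSE
]
bucket = route({"petal_width": 1.4}, chain)
prediction, probs = vote(["versicolor", "versicolor", "versicolor", "virginica"])
```

In the real model the vote pools lookalikes from every layer; the sketch shows a single pooled list.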
# Train
snake train data.csv --layers 5 --bucket 250 --noise 0.25 -o model.json --vocal
# Predict
snake predict model.json -q '{"petal_length": 4.3, "petal_width": 1.4}'
snake predict model.json -q '{"petal_length": 4.3}' --audit
# Model info
snake info model.json

Complexity:
- Training: O(n_layers * n_buckets * m * bucket_size²), dominated by SAT clause construction
- Inference: O(n_layers * n_clauses_in_matched_bucket), fast for small buckets
- MI precompute: O(n * m) once, negligible vs training
- Lookahead: O(K * |Ts|) per literal, offset by fewer literals per clause
- Memory: Entire population stored in memory; model JSON includes full training data
Scaling guidance:
| Dataset Size | Recommendation |
|---|---|
| < 500 samples | bucket=250 works fine, single bucket per layer |
| 500–5,000 | bucket=250 creates 2–20 buckets per layer, good performance |
| 5,000+ | Training gets slow — consider n_layers=3-5, bucket=500 |
| 10,000+ | Segment by a certain field, train per-segment models |
model = Snake("model.json")

def predict_with_fallback(X, min_confidence=0.51):
    prob = model.get_probability(X)
    confidence = max(prob.values())
    if confidence >= min_confidence:
        return model.get_prediction(X)  # Snake is confident
    return fuzzy_fallback(X)  # Fall back to another matcher (your own function)

model = Snake("model.json")
results = []
for row in batch:
    features = {k: v for k, v in row.items() if k != "target"}
    results.append({
        "prediction": model.get_prediction(features),
        "probability": model.get_probability(features),
    })

audit = model.get_audit(X)
# Multi-line string with:
# - Lookalike summary with examples per class
# - Probability distribution
# - Per-layer: Routing AND + Lookalike AND explanations
# - Final prediction
# Feed to an LLM for explanation generation.
Snake's boolean test language is fully specified in oppose_types.snake. Every literal is:
MEASURE(field) > threshold (negat flips to <=)
The file defines:
- §1 Measures — 20 primitive functions (identity, len, entropy, levenshtein, zscore, ...)
- §2 Literals — 30 boolean test types with oppose rules, eval rules, and format templates
- §3 Profiles — weight vectors over literal types
New literal types are added as cartridges: define the measure, the oppose rule (how to compute threshold from T and F), and the eval rule (how to test a new field). The .snake format is human-readable and serves as the single source of truth for Snake's discriminator library.
- Target type must match at prediction time. If training targets were int (e.g., 0, 1), predictions return int. If str (e.g., "cat"), predictions return str. Mixing types silently fails to match.
- Feature types are fixed at training time. A feature detected as Numeric (N) does numeric comparison. Passing a string at prediction time for that feature causes a TypeError.
- CSV must be pandas-formatted. The CSV parser handles quoted fields with commas but expects pandas.DataFrame.to_csv() format. Non-pandas CSVs may parse incorrectly.
- No incremental training. To add data, rebuild from scratch. make_validation only prunes layers.
- Model JSON contains the full training population. A model trained on 5,000 rows produces a large JSON file. Be mindful of disk/memory.
- excluded_features_index only works with CSV flow. For list[dict] and DataFrame, filter your data before passing it in.
- Binary True/False targets become int 0/1. Compare predictions against 0/1 (int), not "True"/"False" (str).
Snake has minimal error handling by design:
| Situation | What Happens | How to Avoid |
|---|---|---|
| Empty list | ValueError | Check before constructing |
| Empty DataFrame | ValueError | Check len(df) > 0 |
| Wrong file extension | Crashes | Use .csv or .json extensions |
| File not found | FileNotFoundError | Check path exists |
| Malformed JSON | json.JSONDecodeError | Validate JSON before loading |
| target_index not found | ValueError / IndexError | Check column name/index exists |
| Only 1 unique target | Trains but trivially predicts that class | Need at least 2 classes |
| Prediction with {} | Uniform probability | Pass at least some features |
| Unknown keys in prediction | Ignored silently | Use same key names as training |
No exceptions during prediction. get_prediction, get_probability, get_lookalikes, get_audit all handle edge cases gracefully (worst case: uniform probability, empty lookalikes).
Snake training is non-deterministic. The oppose() function uses random.choice extensively. Two runs on the same data produce different models with different clause sets. This is by design — the power comes from stochastic clause generation + deterministic selection.
There is no random.seed() call in the Snake code. If you need reproducibility:
import random
random.seed(42)
model = Snake(data, n_layers=5)
Test assertions are probabilistic (e.g., "at least 50% of training data has >50% confidence") rather than exact value checks.
Snake uses Python's logging module. Each instance gets its own logger (snake.<id>).
- Buffer handler: always attached at DEBUG level, captures everything to self.log. This is how to_json() persists the training log.
- Console handler: attached only when vocal is set:
  - vocal=True: INFO level (training progress)
  - vocal=2: DEBUG level (per-target SAT progress)
  - vocal=False: no console output
The banner only prints when vocal=True.
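The two-handler layout can be reproduced with the standard logging module. The sketch below is an assumption about how such a setup might look, not Snake's source: `BufferHandler` and `make_logger` are hypothetical names, and the buffer is a plain list of formatted lines.

```python
import logging
import sys

class BufferHandler(logging.Handler):
    """Always-on handler that keeps every formatted record in memory."""
    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))

def make_logger(instance_id, vocal=False):
    """Build a per-instance logger: buffer always, console only when vocal."""
    logger = logging.getLogger(f"snake.{instance_id}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False       # keep records out of the root logger
    logger.handlers.clear()        # avoid duplicate handlers on re-creation
    buffer = BufferHandler()
    logger.addHandler(buffer)
    if vocal:
        console = logging.StreamHandler(sys.stdout)
        # vocal=2 -> DEBUG, vocal=True -> INFO (True == 2 is False in Python)
        console.setLevel(logging.DEBUG if vocal == 2 else logging.INFO)
        logger.addHandler(console)
    return logger, buffer
```

The buffer's lines are what a serializer could later write out as the training log, independent of whether anything was printed.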
Snake includes optional Cython-accelerated hot paths for apply_literal, apply_clause, and traverse_chain. When compiled, these provide ~3× speedups for both training and inference. The lookahead coverage check in v5.4.0 automatically benefits from apply_literal_fast.
# Install with Cython support
pip install -e ".[fast]"
python setup.py build_ext --inplace
Without Cython, Snake runs in pure Python with identical behavior. The Cython extension is auto-detected at import time.
pytest # all 346 tests
pytest tests/test_expansion.py # v5.5.0 domain extension — 35 tests, every invariant
pytest tests/test_snake.py # input modes, save/load, augmented, vocal, dedup, parallel training
pytest tests/test_buckets.py # bucket chain, noise, routing, audit, dedup
pytest tests/test_core_algorithm.py # oppose, construct_clause, construct_sat
pytest tests/test_validation.py # make_validation / pruning
pytest tests/test_edge_cases.py # errors, type detection, extreme params
pytest tests/test_cli.py # CLI train/predict/info via subprocess
pytest tests/test_logging.py # logging buffer, JSON persistence, banner
pytest tests/test_audit.py # Routing AND, Lookalike AND, audit end-to-end
pytest tests/test_stress.py # stress tests, batch equivalence
pytest tests/test_ultimate_stress.py # extended stress tests
pytest tests/test_oppose_profiles.py # oppose profiles, new literal types, auto-detection, JSON roundtrip
pytest tests/test_feature_mi.py # MI precompute, weighted sampling, lookahead, JSON roundtrip
pytest tests/test_candle.py # distribution candles, regression, IQR-trimmed mean
346 tests across 14 files. Tests use tests/fixtures/sample.csv (15 rows, 3 classes) with small n_layers (1–3) and bucket (3–5) for speed.
- CATEGORICAL expansion family: the third domain-extension family. Any column (numeric or text) with 2 ≤ n_unique ≤ EXPAND_MAX_CARD (100) is categorical, not continuous: it one-hots each recurrent value (count ≥ EXPAND_DF_MIN, the same min_df floor TOKENSET uses) into a col==v 0/1 derived column, an exact-match bit Snake reads with one literal. Names render verbatim in the audit (mod2==0, status==refurb). Motivated by a number-theory regression (predict Ω(n), prime-factor count) where the signal is divisibility (membership), not magnitude; v5.5.0 routed residues through GAUSSIAN and hurt (expansion delta negative). With CATEGORICAL the delta flips −0.006 → +0.043 R² on benchmark_categorical.py.
- GAUSSIAN normality gate: GAUSSIAN now only fires on a high-card numeric whose empirical CDF agrees with the fitted normal to ≥ EXPAND_GAUSS_MIN_FIT (0.95, a KS-style score). Non-normal columns (uniform, skewed) get no position KPIs instead of injecting noise. Calibrated: N(0,1) → 0.985, U(0,1) → 0.940.
- Global per-family MI cut: _apply_mi_gate now pools all candidates of a family across every source and keeps the global top EXPAND_MI_KEEP, rather than top-K per source. A single rich source competes head-to-head; noise sources can't claim free slots. Fixes a multi-source dilution regression. Each family gets its own budget (≤10 CATEGORICAL, ≤10 TOKENSET).
- MI binning fix: _precompute_feature_mi now bins low-cardinality numerics (incl. one-hot 0/1 columns) by exact value when they recur (n / n_unique ≥ 2), not quantile boundaries. The old quantile cut collapsed a 2-value column into one bin and reported MI=0, silently starving every low-card numeric, not just expansion columns. Lifted raw-Snake prime-factor R² 0.305 → 0.345 on its own.
- The bar held, seamlessly. v5.4.8 models load intact (expansions=[], perfect-fit preserved, verified against genuine v5.4.8-trained JSON); v5.5.1 to_json() round-trips to an identical object (0 prediction mismatches, expansions identical); 10 top-level JSON keys unchanged; expand=False byte-exact v5.4.8. 346 tests, +10 CATEGORICAL, zero regressions.
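The CATEGORICAL rule above (cardinality window, recurrence floor, col==v naming) fits in a short function. This is an illustrative sketch, not the library's expansion code: `one_hot_categorical` is a hypothetical name, and EXPAND_DF_MIN = 2 is an assumed value inferred from the min_df=2 floor quoted for TOKENSET.

```python
from collections import Counter

EXPAND_MAX_CARD = 100
EXPAND_DF_MIN = 2  # assumed value, mirroring the min_df=2 floor

def one_hot_categorical(rows, col, min_count=EXPAND_DF_MIN, max_card=EXPAND_MAX_CARD):
    """Derive 0/1 columns named 'col==value' for each recurrent value (sketch)."""
    counts = Counter(row[col] for row in rows)
    if not (2 <= len(counts) <= max_card):
        return {}  # constant or high-cardinality column: not categorical
    derived = {}
    for value, count in counts.items():
        if count >= min_count:  # skip one-off values
            derived[f"{col}=={value}"] = [1 if row[col] == value else 0 for row in rows]
    return derived

rows = [{"mod2": n % 2} for n in range(6)]
derived = one_hot_categorical(rows, "mod2")
```

Each derived column is an ordinary numeric column, so a single threshold literal is enough to read the membership bit.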
- Automatic domain extension: the constructor auto-detects columns with expansion potential and grows derived numeric columns from them, filtered by aggressive Shannon MI. Two families ship: TOKENSET (native TF-IDF on dense text; matches sklearn.TfidfVectorizer(token_pattern=r'\S+', max_features=K, min_df=2) cell-for-cell on shared vocab) and GAUSSIAN (per continuous numeric column: ::gauss_z z-score, ::gauss_density = exp(−½z²) centrality, ::gauss_cdf = Φ(z) via math.erf). All pure-Python, zero dependencies.
- Flood-then-cut, proven on the honest split: flood candidates wide (EXPAND_TOP_K=100/source), keep the top-EXPAND_MI_KEEP=10 by MI. On the seed-42 20/80 drug split (6 drugs, 64 layers, fit on train, test predicted once): raw R² 0.144 → native expansion 0.200, beating the published TF-IDF result (0.171) natively. Less-is-more: top-10 (0.217) > fixed-20 (0.198) > top-40 (0.163) on the sklearn oracle.
- Replicated at a leaner budget (80/20, 10 layers). Independent re-run, workers=10, IQR-mean regression with train-mean backstop on abstaining rows: expand="auto" beats expand=False on all 6 drugs, average R² 0.118 → 0.218 (+0.100). Two drugs (nutlin3a, plx4720) flip from negative R² to ~0.20; the leaner the raw model, the more expansion earns. Only TOKENSET fired (no continuous features besides the target). Reproduce with _drug_expand_test.py.
- The invariant: raw columns always kept. The MI gate prunes only derived columns, so expansion can only add signal, never erase the original features. Improvement is on-average (derived cols still affect bucket routing); the floor is your raw model.
- Compatibility first, by design. The 10 top-level to_json keys are unchanged; expand/expansions ride inside config. v5.4.8 models load intact (expansions=[]), expand=False is byte-exact v5.4.8, and perfect-fit on train is preserved across the version boundary (300/300 on reload). Derived names surface verbatim in get_audit().
- +35 tests in tests/test_expansion.py across 11 sections (compatibility, accountability, explainability, the raw-always-kept invariant, persistence round-trips, both families, inference contract, perfect-fit, helpers, determinism). 336 total, zero regressions. Cython-transparent: expansion runs in _normalize_features before the accel fast-paths.
- 12 README SVG charts generated from real measured numbers by _speedwork/make_charts.py.
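The three GAUSSIAN position KPIs named in this release are simple closed forms. The sketch below computes them for one column; `gaussian_kpis` is a hypothetical helper, and fitting with the population standard deviation is an assumption (the source does not say which estimator is used).

```python
import math
import statistics

def gaussian_kpis(values, col="x"):
    """Per-value z-score, density exp(-z^2/2) and normal CDF (sketch)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumed estimator
    if sigma == 0:
        return []  # constant column: no position information
    out = []
    for v in values:
        z = (v - mu) / sigma
        out.append({
            f"{col}::gauss_z": z,
            f"{col}::gauss_density": math.exp(-0.5 * z * z),           # 1.0 at the mean
            f"{col}::gauss_cdf": 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))),  # Phi(z)
        })
    return out

kpis = gaussian_kpis([1.0, 2.0, 3.0], col="dose")
```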
- Graceful NaN handling for pandas: a real DataFrame (NaNs, mixed columns, missing fields) now trains and predicts with no pre-cleaning. Three rules, applied identically at train and inference: (1) types read off df.dtypes on a copy of the frame, so a numeric column stays numeric even with NaN in it, killing the silent str(float("nan")) == "nan" trap that flipped numeric columns to text and could cost 20+ points of held-out accuracy with no error raised; (2) NaN filled by design: 0.0 for numeric, "" for text; (3) missing columns treated as full-NA and filled, so a row always scores against every clause. Covered on every DataFrame/Series/batch method (get_prediction / get_probability / get_lookalikes(_labeled) / get_candle / get_regression / get_audit / get_augmented / get_batch_*). Zero-dependency: NaN detected by v != v, types by duck-typed .dtypes.kind. +6 regression tests (301 total).
- Parallel batch inference: every prediction method is polymorphic. Pass a list[dict], a pd.DataFrame, or a single pd.Series and Snake divides the batch across CPU processes (GIL-bound pure Python), returning results in input order, exactly matching the sequential call. Measured ~5× on a 10-core box for a 20k-row batch. Inline below _parallel_threshold (default 64).
- Stripped serialization: to_json(stripped=True) drops the training population so workers stay RAM-light and CPU-bound; loads serve the hot path (prediction/probability/lookalikes/candle/regression) only. The basis for a multi-core serving pool.
- Seamless pandas I/O: train from a DataFrame, predict on a DataFrame, score a Series, and read get_augmented(df) straight into pd.DataFrame(...), all duck-typed; the library still imports nothing.
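The NaN rules in this release (detect with v != v, fill 0.0 or "", treat missing columns as NA) need no imports at all. The following is a minimal sketch of that idea, not Snake's normalizer: `is_nan` and `fill_row` are hypothetical names, and treating None like NaN is an added assumption.

```python
def is_nan(v):
    """NaN is the only float value that is not equal to itself."""
    return v != v

def fill_row(row, column_kinds):
    """Fill NaN/missing values per column kind: 'N' -> 0.0, 'T' -> '' (sketch)."""
    filled = {}
    for col, kind in column_kinds.items():
        v = row.get(col)  # a missing column behaves like a full-NA cell
        if v is None or is_nan(v):
            v = 0.0 if kind == "N" else ""
        filled[col] = v
    return filled

kinds = {"age": "N", "name": "T", "city": "T"}
clean = fill_row({"age": float("nan"), "name": "x"}, kinds)
```

Applying the same function at train and inference time is what keeps a row scoreable against every clause.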
- Distribution candles: get_candle(X) and get_batch_candles(Xs) return a Candle dataclass summarising the lookalike y values (high, q3, median, q1, low, mean, iqr_mean, std, n) plus to_dict(). Pure-distribution object: no temporal order, so OHLC is replaced with quartiles.
- Regression: get_regression(X) and get_batch_regression(Xs) return floats, the IQR-trimmed mean of the lookalike distribution. On a synthetic linear regression (y = 3·x₁ − 2·x₂ + 1.5·x₃ + 𝒩(0,1), n=800 train), R² = 0.9477 vs get_prediction R² = 0.8734 (+7.4pp) at 2.3× the throughput (batch shares the Cython lookalike fast path).
- No new params: works on any Snake model trained with continuous targets. The model is unchanged; only post-processing of the lookalike set is new.
- 10 new tests in tests/test_candle.py (266 tests total)
- New exports: Candle, compute_candle from algorithmeai
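An IQR-trimmed mean is easy to sketch with the statistics module. The code below is an illustration, not the exported Candle or compute_candle: `CandleSketch` and `compute_candle_sketch` are hypothetical stand-ins with the same field names, and the "inclusive" quartile method and the fallback to the median are assumptions.

```python
import statistics
from dataclasses import dataclass, asdict

@dataclass
class CandleSketch:
    high: float
    q3: float
    median: float
    q1: float
    low: float
    mean: float
    iqr_mean: float
    std: float
    n: int

def compute_candle_sketch(ys):
    """Summarise a list of lookalike target values as quartiles (sketch)."""
    ys = sorted(float(y) for y in ys)
    if not ys:
        raise ValueError("need at least one value")
    if len(ys) == 1:
        q1 = median = q3 = ys[0]
    else:
        q1, median, q3 = statistics.quantiles(ys, n=4, method="inclusive")
    # IQR-trimmed mean: average only the values inside [q1, q3]
    inside = [y for y in ys if q1 <= y <= q3] or [median]
    return CandleSketch(high=ys[-1], q3=q3, median=median, q1=q1, low=ys[0],
                        mean=statistics.fmean(ys),
                        iqr_mean=statistics.fmean(inside),
                        std=statistics.pstdev(ys), n=len(ys))

candle = compute_candle_sketch([1, 2, 3, 4, 100])
```

With one outlier at 100 the plain mean is pulled to 22 while the trimmed mean stays at 3, which is why a trimmed estimate can be the more robust regression output.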
- Enforced datatypes parameter: optional enforced_datatypes constructor argument lets callers pin the target/feature types instead of relying on auto-detection. Useful when training data is too small to disambiguate T (text) from N (numeric) and the caller knows the intent.
- New literal type: SIM, a subsequence match ratio with N-style midpoint thresholding. Combines T's substring generation with N's continuous logic. _text_score(field, ref) returns the fraction of ref's characters found in field in order. O(len(field)), single pass, 8× substring cost. Snake's SAT construction picks the ref string and threshold; the literal stores [ref, threshold] and evaluates score >= threshold
- New oppose profile: HEF, a SIM-dominant profile for part number matching. Short alphanumeric strings (5-30 chars) where customers send variant refs for the same catalog article. Weights: SIM:35 TEQ:15 PFX:13 SFX:13 JAC:10 T:8. No LEV; SIM handles fuzzy matching at 30× less cost
- LEV exact DP raised to 256 chars (was 32). Bag-of-chars fallback only kicks in above 256 chars. All practical strings now use exact Wagner-Fischer DP
- Tested on HEF production data: Pernat (9 articles) "S700376REC" → ARAA03-700376-A2 at 90% confidence (was 20% with industrial profile). The SIM literal catches that "700376" is a subsequence of "S700376REC"
- 256 tests passing, zero regressions
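A single-pass subsequence score like the one SIM describes can be sketched as a greedy scan. This is an assumed reading of "fraction of ref's characters found in field in order", not the body of _text_score: `text_score` and `sim_literal` are hypothetical names, and the greedy left-to-right matching is one possible implementation.

```python
def text_score(field, ref):
    """Fraction of ref's characters matched, in order, by one scan of field."""
    if not ref:
        return 0.0
    matched = 0
    for ch in field:
        if matched < len(ref) and ch == ref[matched]:
            matched += 1  # advance to the next character of ref
    return matched / len(ref)

def sim_literal(field, ref, threshold):
    """Evaluate a [ref, threshold] literal as score >= threshold (sketch)."""
    return text_score(field, ref) >= threshold

score = text_score("S700376REC", "700376")
```

The scan touches each character of field once, which is where the O(len(field)) cost comes from.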
- MI-weighted feature selection: _precompute_feature_mi() computes Shannon mutual information MI(feature; target) for every feature in O(n×m). All 7 oppose profiles + the original oppose() now pick features proportionally to MI weight instead of uniformly. On 500-feature datasets: P(oppose picks informative feature) jumps from 4% to 10.8% (2.7× lift)
- Lookahead literal selection: _oppose_lookahead() generates K=5 candidate literals per position in construct_clause, keeps the one covering the most Ts. Produces 10-13% fewer clauses on classical datasets; tighter clauses generalize better
- +20pp AUROC on hard high-dimensional data: 5 informative features out of 500 at signal=1.2: v5.2.1 AUROC 0.526 → v5.4.0 AUROC 0.727. 50/500 at signal=0.8: 0.711 → 0.850
- Zero classical regressions: Breast Cancer +0.2pp, Iris +0.5pp, Wine +0.1pp (perfect 1.000), Digits -0.1pp (noise)
- New parameter lookahead=5: Controls literal candidates per oppose call. 1 = v5.2.1 behavior. Stored in JSON config, backwards compatible (defaults to 5 for old models)
- Worker pipeline: MI weights and lookahead passed to parallel workers via _init_worker
- 256 tests across 12 files (20 new MI/lookahead tests)
- Benchmark script: benchmark_mi.py, an A/B comparison of v5.2.1 vs v5.4.0 on Hard Madelon + classical datasets
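Both ideas in this release, sampling features in proportion to MI and keeping the best of K candidate literals, are small enough to sketch. The helpers below are hypothetical (`pick_feature`, `lookahead_pick`); candidates are modelled as plain predicates and "covering the most Ts" is read as the number of samples in Ts for which the predicate is true.

```python
import random

def pick_feature(mi_weights, rng=random):
    """Draw one feature with probability proportional to its MI weight."""
    features = list(mi_weights)
    weights = [max(mi_weights[f], 0.0) for f in features]
    if sum(weights) == 0:
        return rng.choice(features)  # no signal anywhere: fall back to uniform
    return rng.choices(features, weights=weights, k=1)[0]

def lookahead_pick(generate_candidate, ts, k=5):
    """Generate k candidate predicates, keep the one true on the most Ts."""
    best, best_cover = None, -1
    for _ in range(k):
        candidate = generate_candidate()
        cover = sum(1 for t in ts if candidate(t))
        if cover > best_cover:
            best, best_cover = candidate, cover
    return best, best_cover

rng = random.Random(0)
picked = pick_feature({"noise_col": 0.0, "signal_col": 1.0}, rng=rng)
```

Setting k=1 reduces `lookahead_pick` to taking the first candidate, which matches the note that lookahead=1 restores the earlier behavior.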
- O(n) string distance: Levenshtein uses exact DP for strings ≤32 chars, O(n) char-frequency distance for longer strings. No truncation: signal preserved, compute bounded
- Dataset-specific profiles: 8 .snake profile files in profiles/ with empirical annotations (PIMA, Breast Cancer, Titanic, Spaceship, Wine Quality, Mushroom, Digits, Adult Income)
- Snake vs RF/GB benchmark: Snake beats Random Forest on Breast Cancer AUROC (0.999 vs 0.997). Ties on Iris/Wine. Within 0.8pp on Digits. Same data, zero preprocessing
- Classic Titanic with Names: cryptographic profile achieves 0.924 AUROC / 87.2% accuracy (+3.8pp over original)
- Meta classifier removed from codebase
- Linguistic profile deprecated: never won where expected, blocks training on long text. Pure NLP is out of Snake's scope
- Cython bool-safe: str(field) casts in _accel.pyx handle bool/None values without TypeError
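The hybrid string distance (exact DP for short strings, a linear-time fallback for long ones) might look like the sketch below. The Wagner-Fischer part is the standard algorithm; the fallback is an assumption, since the source only says "char-frequency distance", and here it is taken as the larger of the two character-count surpluses, which is a lower bound on the edit distance. All function names are hypothetical.

```python
from collections import Counter

def levenshtein_exact(a, b):
    """Wagner-Fischer dynamic programme, two rows of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_frequency_distance(a, b):
    """O(n) bag-of-chars bound: assumed form of the fallback."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))

def string_distance(a, b, exact_limit=32):
    """Exact DP when both strings fit the limit, linear fallback otherwise."""
    if len(a) <= exact_limit and len(b) <= exact_limit:
        return levenshtein_exact(a, b)
    return char_frequency_distance(a, b)

d_short = string_distance("kitten", "sitting")
d_long = string_distance("a" * 40, "a" * 38 + "bb")
```

The `exact_limit` parameter makes the threshold explicit, so the same sketch covers a limit of 32 or a larger one.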
- 7 oppose profiles: auto, balanced, linguistic, industrial, cryptographic, scientific, categorical. Each profile is a tuned literal generation strategy: weighted random draws across 6 text families + 5 numeric families. oppose() itself is untouched
- 23 new literal types (30 total): distance (LEV, JAC), positional (PFX, SFX), charclass (TUC, TDC, TSC), crypto (ENT, HEX, REP, CFC), numeric extended (ND, NZ, NL, NMG), exact match (TEQ), affix (TSW, TEW), zero test (NZR), range (NRG), vowel ratio (TVR)
- FA/TA single-char matching: _gen_text_substring now includes character-level discrimination (chars unique to T or F), matching original oppose()'s most powerful pattern
- Oppose type formalism: oppose_types.snake, a complete specification of all 30 literal types + 7 profiles in a human-readable DSL. Defines measures, oppose rules, eval rules, and format templates
- Auto-detection: Scans population text features (avg length, variance, digit ratio, special ratio, delimiter density). Pure numeric → scientific, long varied text → linguistic, short codes → industrial
- Spaceship Titanic: industrial profile achieves 78.0% optimal accuracy (vs original 77.2%), 0.8038 AUROC. Balanced achieves best AUROC (0.8093). Breast Cancer: scientific hits 98.2% / 0.9987 AUROC (vs original 96.5%)
- Meta classifier removed: was experimental, unused in production
- Cython support: All 30 literal types in _accel.pyx with C-level helpers. Bool-safe str(field) casts
- 236 tests across 11 files (62 new profile tests, 20 Meta tests removed)
- Lookalike origin labeling: Every lookalike now carries "c" (core) or "n" (noise) origin. New get_lookalikes_labeled(X) method returns [index, class, condition, origin] per match. Enables weighted probability with (w_c, w_n) tuning
- Full-population noise: Noise sourced from the entire population minus core (was: remaining minus core). Deep-chain buckets now access global diversity
- Origins in JSON: Each bucket stores "origins" parallel to "members". Backwards compatible: old models default to all-core
- Regime discovery: Core vs noise signal quality depends on n_layers. Low layers = trust core. High layers = blend both
- Perfected audit system: Two clean AND statements per layer: Routing AND (explains bucket routing) and Lookalike AND (per-sample clause negation). Replaces the stub from v4.3.x
- Parallel training: workers=N enables multiprocessing for layer construction. Each worker gets a unique RNG seed
- Progress tracking: progress_file parameter writes JSON progress updates during training
- Cython batch acceleration: batch_get_lookalikes_fast amortizes routing by grouping queries per bucket per layer
- Logging migration: Replaced print() + string accumulation with Python logging. Per-instance logger, buffer handler always captures, StreamHandler to stdout only when vocal
- Extensive test suite: 92 tests across 7 files. New: core algorithm, validation, edge cases, CLI, logging
- Cython training acceleration: 4 new functions in _accel.pyx for construct_clause, build_condition, _construct_local_sat
- Bug fix (Binary True/False targets): floatconversion("True") returned 0.0, collapsing all True/False targets to 0. Fixed in both flows
- Spaceship Titanic benchmark: 78.4% test accuracy (Kaggle top ~80–81%)
Proprietary. Source code is available for viewing and reference only.
See LICENSE for details. For licensing inquiries: [email protected]