Sebastian Grebe
May 10, 2026 · Research

DeepVector — How Much Retrieval Signal Lives in Mamba's Latent State?

I spent two weeks asking whether Codestral-Mamba-7B's hidden states are good enough for code retrieval without training. Mean-pooling is a disaster, MaxSim recovers a 35× lift, and four different frozen architectures all fail to beat it. The thing that finally worked wasn't more architecture — it was composing the latents with tree-sitter and an LLM.


I had a hunch worth testing. Selective state-space models like Mamba scale linearly in sequence length and carry a recurrent state that, in principle, summarizes arbitrarily long context without re-attention. That is exactly the shape of a long-context retriever. So: does an off-the-shelf, code-pretrained Mamba-2 have enough retrieval signal in its hidden states to compete with a dense encoder that was actually trained for retrieval — and if so, what matching operation extracts it?

Two weeks, eight phases, one H100, and a workshop-quality writeup later, I have an answer that is more interesting than I expected. Mean-pooling is a disaster. ColBERT-style MaxSim over per-token Mamba latents pulls a 35× absolute-hit lift out of the same encoder. Four different attempts at “more frozen architecture beyond MaxSim” all fail. The thing that finally cleared the MaxSim ceiling wasn’t more architecture — it was composing the latents with tree-sitter and an LLM.

Full paper is on GitHub. This post is the short version of what we found and where I’d push next.

The setup

  • Encoder: mistralai/Mamba-Codestral-7B-v0.1. Code-pretrained, Mamba-2, the only HuggingFace-loadable Mamba-2 checkpoint of usable scale (Mamba-3 weights aren’t public yet).
  • Benchmark: SWE-Bench Lite. 323 GitHub issues across 18 Python repos; the gold target is the single source file the issue’s patch modified.
  • Baseline: Voyage code-3, a retrieval-trained commercial dense encoder.
  • Indexing protocol: per-repo, single-commit; FAISS exact IndexFlatIP over L2-normalized chunk embeddings (a sketch follows this list).
  • Hardware: one Lambda Cloud H100 PCIe under bf16 with the mamba_ssm CUDA kernels (pure-PyTorch Mamba-2 OOMs on 1500-token chunks).
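For concreteness, here is a minimal sketch of that indexing protocol: exact inner-product FAISS search over L2-normalized chunk vectors, so inner product equals cosine. Function names and shapes are illustrative, not the repo’s actual API.

```python
import faiss
import numpy as np

def build_repo_index(chunk_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """chunk_embeddings: (num_chunks, dim) array of pooled chunk vectors for one repo."""
    vectors = np.ascontiguousarray(chunk_embeddings, dtype=np.float32)
    faiss.normalize_L2(vectors)                  # after L2 norm, inner product == cosine
    index = faiss.IndexFlatIP(vectors.shape[1])  # exact (brute-force) inner-product search
    index.add(vectors)
    return index

def search(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 100):
    q = np.ascontiguousarray(query_vec.reshape(1, -1), dtype=np.float32)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)             # top-k chunk ids by cosine similarity
    return scores[0], ids[0]
```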

Two metrics: raw (averaged over all 323) and indexable (averaged over the 285 instances where the gold file is in our index at the chosen commit).

Phase 1: mean-pooling is a disaster

             Voyage code-3 (indexable)   Codestral pooled (indexable)
Recall@10    0.9509                      0.3439
MRR          0.7373                      0.1794

Codestral’s mean-pooled vectors are not collapsed — pairwise cosine means stay between 0.67 and 0.83 across all 18 repos, with non-trivial spread (std 0.086–0.208). The vectors point in different directions. The retrieval still fails.

Inspecting per-query top-10 lists exposes the actual mechanism: a small set of central files dominates pooled top-10 across queries that bear no obvious topical relationship to those files. In Django, django/http/request.py shows up in 47% of queries’ top-10. In sympy, sympy/solvers/ode.py shows up in 60%. I call this the “popular file attractor”: when the encoder produces a 4096-d query vector via mean-pooling, files whose own mean vectors are high-magnitude or sit near the corpus centroid cosine-correlate broadly with whatever the query is. The signal that should distinguish them is at the per-token level. The centroid hides it.
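The counting step behind those percentages is simple; a sketch with hypothetical names:

```python
from collections import Counter

def top10_appearance_rate(per_query_top10: list[list[str]]) -> dict[str, float]:
    """Fraction of queries in which each file appears in the pooled top-10."""
    counts = Counter(path for top10 in per_query_top10 for path in top10)
    n_queries = len(per_query_top10)
    return {path: count / n_queries for path, count in counts.most_common()}
```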

Phase 2: MaxSim recovers a 35× lift

If pooling destroys the signal, keep all the tokens. ColBERT’s MaxSim:

MaxSim(Q, F) = Σ_{q ∈ Q} max_{f ∈ F} cos(q, f)
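As a minimal sketch (assumed tensor shapes: Q is (num_query_tokens, d), F is (num_file_tokens, d) of per-token latents):

```python
import torch

def maxsim(query_tokens: torch.Tensor, file_tokens: torch.Tensor) -> torch.Tensor:
    q = torch.nn.functional.normalize(query_tokens, dim=-1)  # unit norm: dot product == cosine
    f = torch.nn.functional.normalize(file_tokens, dim=-1)
    sims = q @ f.T                        # (|Q|, |F|) matrix of token-token cosines
    return sims.max(dim=1).values.sum()   # best-matching file token per query token, summed
```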

On a discriminating subset of 80 instances where Voyage retrieves the gold in top-10 and Codestral mean-pooling does not (the cleanest “mean-pooling specifically failed here” probe), MaxSim over per-token Codestral latents recovers 35 of those 80 cases. Mean-pooling in the same harness recovers 1.

             Pooled (this harness)   MaxSim
Recall@10    0.0125                  0.4375
MRR          0.0250                  0.2001

That is a 35× lift in absolute hits, with no architectural change to the encoder and no training. Latency cost: 0.03 s, ~1%. The signal is in the per-token states; pooling was the mistake. MaxSim is the natural floor for “what frozen Codestral latents can do for retrieval.”

Then I spent four phases trying to beat that floor with more frozen architecture. None of them worked.

Phases 3–5, 8: four ways frozen architecture fails

Multi-head random orthogonal projections (head counts up to H = 32, plus an explicitly normalized late-interaction variant) all underperform single-head MaxSim by 0.0125–0.025 R@10. Verdict: CEILING. A random projection redistributes signal uniformly across heads; summing per-head MaxSim recovers the original similarity up to a constant. A learned projection might separate retrieval-relevant dimensions from syntactic ones. A random one, by construction, cannot.
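For reference, this is the shape of the variant I tested, as a sketch; the exact head layout and normalization in the repo may differ. Project the latents with a fixed random orthogonal matrix, split into H heads, MaxSim per head, sum.

```python
import torch

D, H = 4096, 32
PROJ = torch.linalg.qr(torch.randn(D, D))[0]   # fixed random orthogonal projection, never trained

def multihead_maxsim(query_tokens: torch.Tensor, file_tokens: torch.Tensor) -> torch.Tensor:
    q = (query_tokens @ PROJ).view(-1, H, D // H)
    f = (file_tokens @ PROJ).view(-1, H, D // H)
    score = query_tokens.new_zeros(())
    for h in range(H):
        qh = torch.nn.functional.normalize(q[:, h], dim=-1)
        fh = torch.nn.functional.normalize(f[:, h], dim=-1)
        score = score + (qh @ fh.T).max(dim=1).values.sum()   # per-head MaxSim, summed over heads
    return score
```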

Off-the-shelf cross-encoder rerankers stacked on the MaxSim top-100 shortlist. MS-MARCO-MiniLM (33M, English Q-A web text) gives R@10 = 0.30. BGE-v2-m3 (568M, multilingual general-domain) gives R@10 = 0.4125. Both worse than the MaxSim filter alone. Verdict: FILTER_LIMITED. These rerankers learned what relevance looks like from web text. Code-retrieval relevance — the declaration site of FILE_UPLOAD_PERMISSIONS, not files that mention it — is out of distribution. (I do not generalize to code-specific rerankers; that’s future work §7.2.)
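The reranking step itself is the standard sentence-transformers pattern; a sketch, where the exact MS-MARCO checkpoint name is my assumption rather than something pinned in the repo:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)

def rerank(issue_text: str, shortlist: list[tuple[str, str]], top_k: int = 10) -> list[str]:
    """shortlist: (file_path, file_text) pairs from the MaxSim top-100 filter."""
    scores = reranker.predict([(issue_text, text) for _, text in shortlist])
    order = sorted(range(len(shortlist)), key=lambda i: scores[i], reverse=True)
    return [shortlist[i][0] for i in order[:top_k]]
```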

Multi-granularity composites combining file-, chunk-, sliding-window-, and token-level scores via per-query min-max normalization. Best composite (mg_max) hits R@10 = 0.2875 vs MaxSim’s 0.4375. Verdict: HURTS, Δ = −0.15. Min-max normalization compresses MaxSim’s discriminative variance into [0, 1] alongside near-uniform coarse-granularity scores; the per-file max selection then frequently picks a noisy coarse signal over the reliable fine one. It’s the same regression-to-the-weakest failure mode you’d expect from averaging confidence across signals of unequal quality.
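The failure mode is easy to see in a sketch of the composite (names are mine, not the repo’s): min-max maps every signal onto [0, 1] regardless of how discriminative it is, so a near-flat coarse signal competes on equal footing with the reliable token-level one.

```python
import numpy as np

def minmax(scores: np.ndarray) -> np.ndarray:
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def mg_max_composite(per_granularity_scores: dict[str, np.ndarray]) -> np.ndarray:
    """per_granularity_scores: granularity name -> score vector over candidate files."""
    normalized = np.stack([minmax(v) for v in per_granularity_scores.values()])
    return normalized.max(axis=0)   # per-file max across granularities ("mg_max")
```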

Nested Mamba encoding. This was the one I had the highest hopes for: extract top-level functions/classes via tree-sitter byte ranges, run each function as its own forward pass to get a per-function vector (last-position last-layer state, Mamba’s recurrent summary), then run the sequence of function vectors through Codestral via inputs_embeds=... to get a file vector. Hierarchical encoding, all using the same frozen 7B model. On 58 strict django candidates, L0 MaxSim hits R@10 = 0.4310. The best nested composite hits 0.1897. Verdict: HURTS, Δ = −0.2414. This is the most extreme negative result of the four. The mechanism: Codestral was trained to consume token embeddings drawn from a discrete vocabulary, not full-precision summary vectors. inputs_embeds over function-vector sequences puts the model in a regime its weights have never seen. L2 file similarity scores at R@10 = 0.0172 — barely above random — are consistent with the model producing essentially incoherent representations on this input distribution.
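Schematically, the nested encoder looks like the sketch below, assuming the standard HuggingFace causal-LM interface; tree-sitter extraction of the function byte ranges is elided, and the exact pooling in the repo may differ.

```python
import torch

@torch.no_grad()
def encode_function(model, tokenizer, source: str) -> torch.Tensor:
    ids = tokenizer(source, return_tensors="pt").input_ids.to(model.device)
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]        # last layer, last position: the recurrent summary

@torch.no_grad()
def encode_file_nested(model, tokenizer, function_sources: list[str]) -> torch.Tensor:
    func_vecs = torch.stack([encode_function(model, tokenizer, s) for s in function_sources])
    out = model(inputs_embeds=func_vecs.unsqueeze(0), output_hidden_states=True)
    return out.hidden_states[-1][0, -1]        # file-level vector from the function-vector sequence
```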

Four independent axes, four negative or zero deltas. This isn’t “MaxSim is the optimal frozen matcher”; it’s “frozen architectural elaboration on the latents alone does not extract additional signal at the scale and configuration I tested.” Whatever’s left — and there is plenty left, given Voyage’s 0.95 — is gated by something else.

Phases 6–7: composition is what cleared the ceiling

The first frozen technique to beat MaxSim came from changing what was being composed, not how.

Phase 6 — tree-sitter symbolic routing. Build an inverted index per repo: function names, class names, top-level methods, ALL_CAPS module-level constants, dotted import paths. For each query, regex-extract identifier candidates from the issue text (CamelCase, snake_case, ALL_CAPS, dotted paths, backticked code), look them up in the index, get a candidate file pool, expand 1-hop via the import graph, fall back to the full corpus if the pool is empty. Run MaxSim restricted to the pool.
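The query-side extraction is just a handful of regexes; a sketch with illustrative patterns (the repo’s exact patterns may differ):

```python
import re

IDENTIFIER_PATTERNS = [
    r"`([^`]+)`",                                # backticked code spans
    r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b",    # CamelCase class names
    r"\b[a-z_][a-z0-9]*_[a-z0-9_]+\b",           # snake_case functions/methods
    r"\b[A-Z][A-Z0-9_]{2,}\b",                   # ALL_CAPS module-level constants
    r"\b\w+(?:\.\w+)+\b",                        # dotted import paths
]

def extract_identifiers(issue_text: str) -> set[str]:
    candidates: set[str] = set()
    for pattern in IDENTIFIER_PATTERNS:
        candidates.update(re.findall(pattern, issue_text))
    return candidates
```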

             full MaxSim   tree-index routed   Δ
Recall@10    0.4375        0.4750              +0.0375
MRR          0.2001        0.2512              +0.0511

Verdict: MODEST_LIFT. First frozen technique in seven phases to clear the MaxSim ceiling. Mechanism is straightforward: when the issue contains the gold file’s identifier literally, the index routes precisely. 31% of queries fall back to the full corpus because regex captured natural-language artifacts, not real symbols.

Phase 7 — augment regex IDs with LLM-generated ones. Same routing pipeline, but prepend the regex IDs with identifiers Codestral generates given the issue text. The prompt asks for specific, distinctive class/method/constant names “likely involved in the bug,” with an explicit avoid-list for generic terms. Greedy decode, max 200 new tokens, parse line-by-line.
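A sketch of that expansion call, assuming the standard generate interface; the prompt here is a paraphrase of the one described, not the verbatim prompt from the repo.

```python
import torch

PROMPT = (
    "GitHub issue:\n{issue}\n\n"
    "List specific, distinctive class, method, or constant names likely involved in this bug, "
    "one per line. Avoid generic terms (error, test, main, fix).\n"
)

@torch.no_grad()
def llm_expand_identifiers(model, tokenizer, issue_text: str) -> list[str]:
    inputs = tokenizer(PROMPT.format(issue=issue_text), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)   # greedy decode
    completion = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return [line.strip() for line in completion.splitlines() if line.strip()]
```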

             Phase 6 (regex)   Phase 7 (regex + LLM)   Δ over MaxSim
Recall@1     0.1500            0.1750                  +0.0750
Recall@10    0.4750            0.5000                  +0.0625
MRR          0.2512            0.2698                  +0.0697

The R@1 lift (0.10 → 0.175, +75% relative) matters more than the R@10 number: LLM expansion specifically improves the model’s ability to surface the gold file at rank 1.

The mechanism is transparent. Phase 6 had 25 of 80 queries fall back to full-corpus search because no regex identifier matched the index. Phase 7 has 10. The LLM converted 15 of 80 queries (18.75%) from “no regex anchor” into “routed.” Codestral’s generative recall of plausible Django/sympy/sklearn class names bridges queries that describe behavior without naming the implementing identifier; that is the lift. Median LLM generation latency: 5.96 s on the H100.

What I think this means

Two paths beyond the frozen-MaxSim ceiling. The training-based path: a learned matching head over Codestral’s frozen representation, optimized with a retrieval objective. The frozen-multi-head CEILING result argues this is the load-bearing direction for getting from R@10 = 0.50 to R@10 ≥ 0.85. Estimated cost: $10K–$50K of GPU-time. I did not run it.

The composition-based path: combine the latents with deterministic symbolic structure (tree-sitter) and generative knowledge (Codestral as identifier-recall oracle). This produced the actual lift in seven phases of trying. The path forward is more sophisticated graph structure (call graphs, type-flow graphs, test-coverage graphs) and better LLM-augmentation prompting / fine-tuning specifically for code-retrieval identifier expansion. Cheaper to iterate on, composable with the training path.

The right answer is likely both: a learned matching head consuming both the per-token latents and the tree-index/LLM-derived symbolic features as a unified scoring function. That’s the cleanest cross-cutting next experiment.

There is one bigger reframing I came away from this with. The pooled-vs-MaxSim gap is larger than every architectural sophistication delta I tested combined. For this size class of Mamba-2 encoder on this benchmark, the matching operation matters more than the architectural sophistication of the matching head. Most of the field’s effort goes into the latter. At least on frozen encoders, the former is where the lift lives.

What’s in the box

  • Eight experimental phases, all results in data/results/*.json, fully reproducible from a single H100.
  • A workshop-quality paper (~9,000 words) covering setup, results, sixteen explicitly-listed limitations, seven future-work experiments with cost estimates.
  • src/encoder.py, src/maxsim.py, src/frozen_methods.py, src/multigranular.py, src/tree_index.py, src/llm_expansion.py, src/nested_encoding.py — all the matching operations and routing/expansion code, ~2K lines total.
  • scripts/cloud_run.sh — hardened H100 runner with --dry-run, hard-fail on missing CUDA kernels, no silent fallback to OOM-prone pure-PyTorch Mamba-2.
  • Strict-sanity reproduction harness: every phase reproduces the cached pooled baseline within 50% top-10 overlap, or the run aborts with exit code 2.
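The sanity gate in that last bullet amounts to a few lines; a sketch, assuming the overlap is averaged across queries:

```python
import sys

def strict_sanity_check(cached_top10: dict[str, list[str]], current_top10: dict[str, list[str]]) -> None:
    overlaps = [
        len(set(cached_top10[q]) & set(current_top10.get(q, []))) / 10
        for q in cached_top10
    ]
    if sum(overlaps) / len(overlaps) < 0.5:
        sys.exit(2)   # baseline did not reproduce: abort the phase
```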

Who this is for

If you’re building long-context retrieval and you’ve been wondering whether SSMs change the picture: this is a concrete, replicable answer for one well-specified configuration. The short version is that an off-the-shelf, code-pretrained Mamba-2 has real per-token retrieval signal, and that signal is recoverable without training via the right matcher — but closing the gap to retrieval-trained dense encoders almost certainly requires either training a head or composing with non-latent signals.

If you’re working on code retrieval specifically: the tree-sitter + LLM identifier expansion pattern is cheap, frozen, and gives measurable lift on a hard subset. It is an obvious component of any production code-retrieval pipeline I’d build today, and the $0 of GPU time it adds at index time is hard to beat.

If you’re deciding whether to fine-tune a Mamba-based retriever from scratch versus train a small matching head over a frozen code-pretrained Mamba: this work is the empirical floor of the latter path. The frozen ceiling is R@10 ≈ 0.50 on a hard subset. A trained head has roughly 0.45 R@10 of headroom to play for.

Repo on GitHub. Issues and pull requests welcome — particularly on future-work experiments 7.1 (trained matching head) and 7.7 (trained nested Mamba), which are the two next experiments I’d run if I had a free $20K of GPU-time and wasn’t already three paragraphs into the next side project.