State Runtime — Can a Hybrid SSM Be the Application?

I had been chewing on a thought experiment. If a state-space model carries a recurrent state that summarizes everything it has seen, why not let that be the application state? No useState. No REST router. No Redux. The user clicks a button, the click flows into the model as a token, the model’s hidden state updates, and the model emits the new UI as JSON.

It is a stupid idea in the way that good demos are stupid. So I built it. Two days, one Mac, one RunPod RTX PRO 6000, and a long argument with mamba-ssm’s build system later, I have a working “Hybrid State Runtime” — and a clearer view of where the architecture wins, where it loses, and what the bottleneck actually is.

Full source lives in the Lab — sign in to unlock the GitHub repo. This post is the short version of what worked, what didn’t, and where I’d push next.

The architecture

Three processes, one fact store, one WebSocket.

                  UI manifest / patch
   ┌────────────┐ ◀──────────────── ┌──────────────┐
   │ Canvas     │                   │ Engine       │
   │ React/Vite │ ────────────────▶ │ FastAPI      │
   │ no state   │  user click/cmd   │ + Jamba2-3B  │
   └────────────┘                   └──────┬───────┘
                                           │ SQL UPDATE
                                           ▼
                                    ┌──────────────┐
                                    │ Postgres     │
                                    │ (alerts)     │
                                    └──────────────┘

The canvas is a dumb terminal. It owns no application state. It listens to a WebSocket. When it gets a JSON manifest, it walks the component array and renders each entry from a registry of pre-defined Tailwind components: AlertTable, MetricCard, LineChart, SettingsForm, ToastNotification. When the user clicks a button or types in the Omnibox, it ships the raw event up the same socket. There is no routing. There is no fetch.

The engine runs Jamba2-3B (a hybrid SSM/Transformer from AI21) under PyTorch with xgrammar for constrained decoding. It maintains one event log per WebSocket session. When an event arrives, the engine appends it to the log and asks the model for the next JSON object. The output is constrained to one of:

UIStateManifest — paint the screen
DatabaseAction — emit SQL the engine will execute against Postgres
UIPatch — RFC 6902 patch against the previously-emitted manifest
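The three output kinds above can be read as a tagged union. Here is a minimal sketch, assuming every object carries a `kind` discriminator. Only `"kind":"sql"` appears in this post; the `"manifest"` and `"patch"` tags and the `components` / `ops` field names are hypothetical stand-ins, not the repo's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class UIStateManifest:
    components: list[dict[str, Any]]  # full description of the screen


@dataclass
class DatabaseAction:
    sql: str  # SQL text the engine will vet and then execute


@dataclass
class UIPatch:
    ops: list[dict[str, Any]]  # RFC 6902 operations against the last manifest


ModelOutput = Union[UIStateManifest, DatabaseAction, UIPatch]


def parse_model_output(raw: str) -> ModelOutput:
    """Dispatch on the `kind` tag. Tags other than "sql" are assumed names."""
    obj = json.loads(raw)
    kind = obj.get("kind")
    if kind == "manifest":
        return UIStateManifest(components=obj["components"])
    if kind == "sql":
        return DatabaseAction(sql=obj["sql"])
    if kind == "patch":
        return UIPatch(ops=obj["ops"])
    raise ValueError(f"unknown output kind: {kind!r}")
```

Because the grammar constrains decoding to one of these shapes, the parse step should never hit the `ValueError` branch in practice; it is there so a schema drift fails loudly instead of silently.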

Postgres is the only durable fact store. The model is never allowed to talk to it directly. The engine intercepts whatever SQL the model emits, runs it through SQLAlchemy behind a whitelist guard, and feeds the result back into the model’s prompt stream before asking for the next manifest.
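To make the guard idea concrete, here is a minimal sketch. It is not the repo's guard: the single permitted statement shape, the `open`/`resolved` status values, and the function name are assumptions, and an in-memory `sqlite3` database stands in for Postgres and SQLAlchemy so the example runs on its own.

```python
import re
import sqlite3

# Hypothetical allowlist: exactly one statement shape is permitted.
ALLOWED_UPDATE = re.compile(
    r"^\s*UPDATE\s+alerts\s+SET\s+status\s*=\s*'(?P<status>open|resolved)'"
    r"\s+WHERE\s+id\s*=\s*(?P<id>\d+)\s*;?\s*$",
    re.IGNORECASE,
)


def run_guarded(conn: sqlite3.Connection, sql: str) -> int:
    """Execute model-emitted SQL only if it matches the allowlist.

    Returns the number of rows changed. The statement is rebuilt with bound
    parameters, so the model's text is never executed verbatim.
    """
    match = ALLOWED_UPDATE.match(sql)
    if match is None:
        raise PermissionError(f"SQL rejected by guard: {sql!r}")
    cur = conn.execute(
        "UPDATE alerts SET status = ? WHERE id = ?",
        (match["status"].lower(), int(match["id"])),
    )
    conn.commit()
    return cur.rowcount
```

Re-issuing the statement from parsed fields, rather than passing the model's string through, means a hallucinated `; DROP TABLE` suffix is rejected by the pattern before it can reach the database.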

The whole spine is ~600 lines of Python and ~400 lines of JSX.

What worked

End-to-end, the demo runs:

Browser opens, the engine generates a UIStateManifest describing an alert table with three rows from Postgres.
User clicks Resolve on row 2. The browser sends {"user_action":"resolve_alert","alert_id":2}.
Engine appends to the event log, asks the model for a DatabaseAction. Model emits {"kind":"sql","sql":"UPDATE alerts SET status='resolved' WHERE id=2"}.
Engine intercepts. SQLAlchemy runs the UPDATE against Postgres. SQL_RESULT is appended to the event log.
Engine emits a UIPatch (replace /components/0/rows/1/status = "resolved") over the socket.
Canvas applies the patch in-place. One row’s status flips from open to resolved. No re-render of the rest of the screen.
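The last step of that round-trip, applying a `replace` op in place, is small enough to sketch. The real canvas is JSX; this Python version only illustrates the RFC 6902 / RFC 6901 mechanics, and the manifest field names other than `components`, `rows` and `status` (which appear in the patch path) are made up for the example.

```python
from typing import Any


def apply_replace(doc: Any, path: str, value: Any) -> None:
    """Apply a single RFC 6902 `replace` op in place.

    Handles only the subset needed here: dict keys and list indices,
    with RFC 6901 escapes (~1 -> "/", then ~0 -> "~").
    """
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    if not tokens:
        raise ValueError("replacing the whole document is not supported in this sketch")
    target = doc
    for token in tokens[:-1]:
        target = target[int(token)] if isinstance(target, list) else target[token]
    last = tokens[-1]
    if isinstance(target, list):
        target[int(last)] = value  # IndexError if the element does not exist
    else:
        if last not in target:
            raise KeyError(last)  # `replace` requires the target to exist
        target[last] = value


manifest = {
    "components": [
        {
            "type": "AlertTable",  # hypothetical field name
            "rows": [
                {"id": 1, "status": "open"},
                {"id": 2, "status": "open"},
                {"id": 3, "status": "open"},
            ],
        }
    ]
}
# Row with id 2 sits at index 1, so the path uses 1, not 2.
apply_replace(manifest, "/components/0/rows/1/status", "resolved")
```

Only the addressed leaf changes; every other object in the manifest keeps its identity, which is what lets the canvas avoid repainting the rest of the screen.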

The user can also type in the Omnibox: “show me database metrics”. The engine forces the schema target to UIStateManifest, and the model swaps the entire screen: it drops the AlertTable and returns a manifest with three MetricCards and two LineCharts. No frontend code knows what “metrics” means. The model decided.

That part of the pitch works. The model genuinely is the application logic.

What was hard

Two things almost killed the demo: model intelligence and decode speed.

Speed

I started on a Mac. Apple Silicon has no CUDA, and mamba-ssm ships only CUDA kernels, so Transformers fell back to its slow pure-PyTorch SSM scan. First turn: 13.4 seconds for a 90-token JSON manifest, or ~150 ms/token. Acceptable for a demo where you can wait for coffee, useless for “click and watch it happen”.

Moving to a CUDA box helped less than I expected. RTX PRO 6000, Mamba kernels installed, Jamba2-3B in bfloat16. First turn: still 7.5 seconds, or ~83 ms/token, vs ~13 ms/token in raw model.generate(...) without constraints. The model itself was fast. The constrained path cost about six times as much per token, and the difference was all overhead.

The journey, with timings on the same RTX PRO 6000, same model, same prompt:

| What changed | Click round-trip |
| --- | --- |
| outlines + verbose prompt + indented JSON | 14.5 s |
| Trim system prompt 5× | 13.7 s |
| Cache outlines `Generator` per schema | 13.0 s |
| Swap outlines → xgrammar | 4.5 s |
| Stop on grammar termination (`matcher.is_terminated()`) | 2.1 s |
| Compact JSON (`any_whitespace=False`) | 1.13 s |
| Skip second model call (derive UIPatch from SQL deterministically) | ~0.5 s |
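The per-step gains are easier to compare as ratios than as absolute seconds. A small sketch that computes them from the numbers in the table (the last row is reported as approximately 0.5 s, so its ratios are approximate too):

```python
# (change, click round-trip in seconds), copied from the table above.
STEPS = [
    ("outlines + verbose prompt + indented JSON", 14.5),
    ("trim system prompt", 13.7),
    ("cache outlines Generator per schema", 13.0),
    ("swap outlines -> xgrammar", 4.5),
    ("stop on grammar termination", 2.1),
    ("compact JSON", 1.13),
    ("skip second model call", 0.5),  # reported as ~0.5 s
]


def speedups(steps):
    """Yield (label, seconds, gain over previous row, cumulative gain)."""
    baseline = prev = steps[0][1]
    for label, secs in steps:
        yield label, secs, prev / secs, baseline / secs
        prev = secs


for label, secs, step_gain, total_gain in speedups(STEPS):
    print(f"{label:<45} {secs:>6.2f} s  x{step_gain:4.2f} step  x{total_gain:5.1f} total")
```

Read this way, the constrainer swap alone is worth roughly 2.9×, and the whole sequence takes the click round-trip down by roughly 29×.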

Two findings I want to underline.

Outlines was the biggest single tax. Constrained decoding via outlines adds a per-token CPU↔GPU mask sync that, at our 3B model size, was contributing ~60 ms/token of pure overhead. Swapping to xgrammar (the constrainer used by SGLang and vLLM) cut that to ~6 ms/token. I had to write a tiny custom LogitsProcessor because the bundled xgrammar.contrib.hf wrapper has a bug — it passes a tensor where accept_token expects a Python int and segfaults under transformers.generate.
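The shape of a constrained decode loop is worth seeing once, because it shows both the per-token mask step and the stop-on-termination check from the table. This is a dependency-free toy, not xgrammar: `ToyMatcher` is a made-up stand-in that accepts one of a fixed set of token sequences, and only the method names `accept_token` and `is_terminated` are borrowed from the post. The `int(...)` cast marks the spot where a tensor, rather than a Python int, would be handed to the matcher.

```python
import math
from typing import Callable, Sequence


class ToyMatcher:
    """Stand-in for a grammar matcher: accepts exactly one of `sequences`."""

    def __init__(self, sequences: Sequence[Sequence[int]]):
        self._candidates = [tuple(s) for s in sequences]
        self._pos = 0

    def allowed_token_ids(self) -> set[int]:
        return {s[self._pos] for s in self._candidates if len(s) > self._pos}

    def accept_token(self, token_id: int) -> None:
        if not isinstance(token_id, int):
            raise TypeError("token_id must be a plain int")
        if token_id not in self.allowed_token_ids():
            raise ValueError(f"token {token_id} violates the grammar")
        self._candidates = [
            s for s in self._candidates
            if len(s) > self._pos and s[self._pos] == token_id
        ]
        self._pos += 1

    def is_terminated(self) -> bool:
        # Done as soon as any surviving candidate is fully consumed.
        return any(len(s) == self._pos for s in self._candidates)


def mask_logits(logits: list[float], allowed: set[int]) -> list[float]:
    """Set every disallowed token's logit to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def constrained_greedy(
    next_logits: Callable[[list[int]], list[float]],
    matcher: ToyMatcher,
    max_tokens: int = 64,
) -> list[int]:
    """Greedy decode that stops the moment the grammar is satisfied."""
    out: list[int] = []
    while not matcher.is_terminated() and len(out) < max_tokens:
        masked = mask_logits(next_logits(out), matcher.allowed_token_ids())
        token_id = int(max(range(len(masked)), key=masked.__getitem__))
        matcher.accept_token(token_id)
        out.append(token_id)
    return out
```

Without the `is_terminated()` check the loop would keep sampling until `max_tokens`, which is the kind of wasted decode the "stop on grammar termination" row removed.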

Don’t ask the model to do mechanical work. After the SQL ran, I was making a second model call to produce the UIPatch. But the SQL UPDATE alerts SET status='X' WHERE id=N already determines exactly what the patch should be: find the row in the canonical manifest where id=N, then set /components/0/rows/<index>/status to X. The engine does that in ~1 ms in Python. Skipping the second model call saves ~600 ms/click. The model still decides which row to mutate (it emitted the SQL); the engine just stops asking it to also write the bookkeeping reflection.
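The deterministic derivation described above fits in a few lines. A minimal sketch, assuming the AlertTable is component 0 (as in the patch path quoted in this post) and that the only SQL shape handled is the single-row status UPDATE; the function name and the regex are illustrative, not the repo's code.

```python
import re
from typing import Any, Optional

UPDATE_STATUS = re.compile(
    r"^\s*UPDATE\s+alerts\s+SET\s+status\s*=\s*'(?P<status>[^']*)'"
    r"\s+WHERE\s+id\s*=\s*(?P<id>\d+)\s*;?\s*$",
    re.IGNORECASE,
)


def derive_patch(sql: str, manifest: dict[str, Any]) -> Optional[list[dict[str, Any]]]:
    """Turn a single-row status UPDATE into RFC 6902 ops.

    Returns None when the SQL has a different shape or the row is not on
    screen, so the caller can fall back to another strategy.
    """
    match = UPDATE_STATUS.match(sql)
    if match is None:
        return None
    alert_id = int(match["id"])
    rows = manifest["components"][0]["rows"]
    for index, row in enumerate(rows):
        if row["id"] == alert_id:  # natural key -> list position
            return [{
                "op": "replace",
                "path": f"/components/0/rows/{index}/status",
                "value": match["status"],
            }]
    return None
```

The id-to-index lookup is the step the 3B model kept getting wrong; here it is a plain loop over rows the engine already holds.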

Even at 0.5 s, we are well above the original “50 ms” goal. The remaining floor comes from two factors: (a) Jamba2-3B’s raw decode at ~13 ms/token, and (b) the engine re-prefilling the entire ~2000-character event-log prompt every turn. The whole pitch of an SSM is that you don’t have to do that, because the recurrent state already encodes everything prior. To get under 50 ms we need to bypass model.generate() and manage Jamba’s cache_params ourselves so the SSM state carries forward across turns. That’s the next experiment.
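The cost difference between re-prefilling and carrying state can be shown with a toy recurrence. This is a scalar linear SSM, h_t = a·h_{t-1} + b·x_t, with made-up coefficients; it is not Jamba and says nothing about its cache API. It only demonstrates that both strategies land on the same state while doing different amounts of work.

```python
from dataclasses import dataclass


@dataclass
class ToySSM:
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t (not Jamba)."""

    a: float = 0.5
    b: float = 1.0
    steps: int = 0  # number of recurrence updates performed

    def scan(self, tokens: list[float], h: float = 0.0) -> float:
        for x in tokens:
            h = self.a * h + self.b * x
            self.steps += 1
        return h


def reprefill(model: ToySSM, turns: list[list[float]]) -> float:
    """Re-run the full event log from a zero state on every turn."""
    log: list[float] = []
    h = 0.0
    for turn in turns:
        log.extend(turn)
        h = model.scan(log)
    return h


def carry_state(model: ToySSM, turns: list[list[float]]) -> float:
    """Feed only the new tokens, starting from the previous turn's state."""
    h = 0.0
    for turn in turns:
        h = model.scan(turn, h)
    return h


turns = [[1.0, 2.0], [3.0], [4.0, 5.0]]
full, carried = ToySSM(), ToySSM()
assert reprefill(full, turns) == carry_state(carried, turns)
print(full.steps, carried.steps)  # 10 updates vs 5
```

Re-prefill work grows with the square of the session length, while carried state grows linearly, which is why the gap widens the longer a session runs.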

Intelligence

Jamba2-3B is small. Asking it to emit a structurally complex JSON manifest under a discriminated-union schema with a free choice of DatabaseAction vs UIPatch did not work reliably:

Branch ambiguity. On a click, the model would sometimes emit a direct UIPatch instead of running SQL first. The patch was usually wrong (it never saw the new DB state), and the actual DB never got updated. Fix: stop letting the model pick the schema on click turns. The engine forces DatabaseAction for clicks, then forces UIPatch for the post-SQL turn. The model still chooses what SQL to run; it doesn’t choose whether to run any.
Index/id confusion. Asked to emit a JSON Pointer path like /components/0/rows/<i>/status, the model would put the alert’s id in the <i> slot, a classic 0-indexed-position-vs-natural-key mistake. I tried a system-prompt warning. I tried injecting ROWS_INDEX: rows.0=id1, rows.1=id2, rows.2=id3 into the event log every turn. The 3B kept getting it wrong. The fix that actually worked was deterministic patch derivation server-side (see above). The same change that removed the second model call also stopped the model from doing index arithmetic it shouldn’t have to do.
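The branch-ambiguity fix amounts to the engine, not the model, choosing which grammar applies to the next call. A minimal sketch of that routing: the `user_action` key comes from the click payload shown earlier, while the `sql_result` key and the fallback rule are assumptions made for illustration.

```python
from enum import Enum


class Schema(str, Enum):
    MANIFEST = "UIStateManifest"
    DB_ACTION = "DatabaseAction"
    PATCH = "UIPatch"


def schema_for_event(event: dict) -> Schema:
    """Pick the grammar the engine forces for the next model call."""
    if "user_action" in event:   # button click, e.g. resolve_alert
        return Schema.DB_ACTION
    if "sql_result" in event:    # hypothetical key: engine just ran the SQL
        return Schema.PATCH
    return Schema.MANIFEST       # session start or an Omnibox command
```

Once the patch is derived deterministically from the SQL, the `PATCH` branch no longer needs a model call at all; it is kept here only to mirror the two-call flow described in the first fix.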

These aren’t fundamental limits. A bigger model (Jamba-1.5-Large, or any 70B-class with constrained decoding) would handle both. But running a 70B locally for a single-user demo is silly — and at 70B we’re nowhere near 50 ms anyway. The honest takeaway is that “hybrid-SSM-as-application” requires either a bigger model (cost) or more deterministic seams between model decisions and mechanical bookkeeping. The latter is the cheaper path and probably the right pattern for any production system that uses an LLM as control flow.

What I learned

A few things I genuinely changed my mind on:

Constrained decoding overhead is the single biggest knob. I expected the model to be the bottleneck. The model was fine. With outlines, constrained generation ran about 6× slower per token than the raw model. Anyone benchmarking LLMs against latency targets without naming their constrainer is benchmarking the constrainer.
Mamba’s O(1) state advantage is invisible if you re-prefill. Called the default way, Transformers’ generate() builds a fresh KV/SSM cache on every call and drops it at the end. The whole architectural advantage of an SSM, recurrent state that carries arbitrarily long history, is lost the moment you treat each turn as a fresh prompt. Realizing the SSM pitch in production requires owning the cache-management layer. That’s where I’d go next if I had two more days.
Don’t ask a 3B model to do bookkeeping. Use the model to decide what and why. Use code to compute where and how. Two of the three problems on this project were variations of “we asked the model to do something the engine could compute deterministically from facts the engine already had”. Removing those rounds cuts both latency and error rate.
The architecture is genuinely cool when it works. Watching the screen rebuild itself in response to “show me database metrics” with no frontend logic involved is a different kind of demo. It is also, predictably, the wrong tool for almost every actual product. But as a forcing function for thinking about what the model is allowed to decide vs what the engine should compute, it has been a useful exercise.

What’s next

If I keep going on this, in priority order:

Mamba state reuse across turns. Real test of the SSM pitch. Targeting sub-50 ms click-to-paint.
Streaming patches. Apply ops to the canvas as tokens arrive instead of waiting for the full JSON to parse.
Speculative decoding. A 0.5B draft model proposing tokens that the 3B verifies. Potentially a cheap ~2× speedup, depending on how often the draft’s tokens are accepted.
Bigger model in STUB_MODE for the schema-rich screens. Use a 3B for hot path, a 70B-class only for the “redesign the whole screen on a USER_COMMAND” turn that happens twice an hour.
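For the speculative-decoding item, the greedy variant is simple enough to sketch with toy next-token functions in place of real models. Everything here is illustrative: `target` and `draft` are plain callables, and the verification loop calls `target` once per position, where a real implementation would score all proposed positions in one batched forward pass (and usually take a bonus token when every proposal is accepted, which this sketch omits).

```python
from typing import Callable

NextToken = Callable[[list[int]], int]  # greedy next-token function


def speculative_greedy(
    target: NextToken,
    draft: NextToken,
    prompt: list[int],
    n_new: int,
    k: int = 4,
) -> list[int]:
    """Greedy speculative decoding.

    The returned tokens are identical to what `target` alone would
    produce greedily; the draft only changes how many target passes
    are needed to get there.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # Draft proposes up to k tokens autoregressively.
        proposal: list[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # Target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                out.extend(proposal[:i])  # keep the agreed prefix
                out.append(expected)      # target's token replaces the miss
                break
        else:
            out.extend(proposal)          # every proposal was accepted
    return out[len(prompt):]
```

The speedup depends entirely on the acceptance rate: a draft that agrees with the 3B most of the time amortizes one verification pass over several tokens, while a draft that rarely agrees adds cost.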

Grab it from the Lab — sign in to unlock the GitHub repo. It boots locally on Mac (slow) or on a CUDA box via the included Dockerfile and RunPod recipe. If you push the Mamba-state-reuse refactor before I get to it, send me the PR and the bench number.