GDN Hybrid Attention Under a 16 MB Model Budget

A Parameter Golf non-record submission accepted on 2026-04-04.

Published: April 2026 | Reading time: 8 min | Topics: Long-context LLMs, efficient training, hybrid sequence mixers

One-line result

This study tests whether the Gated Delta Net hybrid pattern from OLMo Hybrid helps a very small language model train at long context lengths. In this constrained setting, the answer was conditional but useful: full attention was better at 8k context, while the hybrid crossed over between 8k and 16k and was substantially better at 32k under the same 600-second single-H100 budget.

At 32k context, the hybrid reached a two-seed mean score of 1.4709 val_bpb on the final_int8_zlib_roundtrip evaluation, compared with 1.7059 val_bpb for the full-attention baseline.
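For readers unfamiliar with the metric, val_bpb is validation bits per byte, which relates to the mean cross-entropy loss (in nats per token) through the tokenizer's average compression. The conversion can be sketched as follows; the bytes-per-token figure is back-derived here from the two reported 32k hybrid numbers purely for illustration, not stated anywhere in the submission:

```python
import math

def nats_per_token_to_bpb(val_loss_nats, bytes_per_token):
    """Convert mean cross-entropy in nats/token to bits per byte."""
    return val_loss_nats / math.log(2) / bytes_per_token

# bytes_per_token ~ 2.436 is an assumption back-derived from the reported
# 32k hybrid pair (val_loss 2.4836, val_bpb 1.4709); it is not a figure
# from the submission.
bpb = nats_per_token_to_bpb(2.4836, 2.436)
```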

Research question

Parameter Golf rewards models that do more with less: small parameter budgets, compressed submissions, and careful training choices. That makes it a useful testbed for a practical ML systems question:

When context length grows, should a small model spend its budget on full quadratic attention everywhere, or should most layers use a cheaper recurrent mixer while a few layers preserve full attention?

The motivation came from OLMo Hybrid's use of Gated Delta Net (GDN) layers alongside full-attention layers. The obvious scaling argument says the hybrid should become cheaper at long sequence lengths, because most layers avoid O(L^2) attention. The less obvious question is whether that advantage still matters in a tiny model, after compression, with only 600 seconds of training.
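The scaling argument can be made concrete with a toy token-mixing cost model. Everything below is invented for illustration: the costs count only token-mixing work (no MLPs or embeddings), and c_gdn is a hypothetical per-token constant chosen so that the modeled crossover lands between 8k and 16k, where the measured one appears:

```python
def full_attn_mixing_cost(L, n_layers=9):
    # token-mixing work only: each full-attention layer scales as L^2
    return n_layers * L ** 2

def hybrid_mixing_cost(L, n_attn=2, n_gdn=7, c_gdn=12_000):
    # GDN layers scale linearly in L; c_gdn is a made-up per-token constant
    # standing in for the recurrent kernel and conv/gate overhead
    return n_attn * L ** 2 + n_gdn * c_gdn * L

for L in (8_192, 16_384, 32_768):
    ratio = full_attn_mixing_cost(L) / hybrid_mixing_cost(L)
    print(L, round(ratio, 2))
```

Setting 9L^2 = 2L^2 + 7 * c_gdn * L gives a modeled crossover at L = c_gdn = 12k, which is qualitatively consistent with the hybrid losing at 8k and winning at 16k and 32k.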

The contribution of this submission is not just adding a new mixer layer. It measures where the architectural trade-off changes sign under a fixed wall-clock and compressed-model objective.

Model comparison

The experiment compares two 9-layer models with similar parameter counts:

Model Mixer layout Width Parameters
Full-attention baseline 9 attention layers 512 17.06M
GDN hybrid 7 GDN layers + 2 full-attention layers 448 17.42M

Both models use the same broad training recipe: RoPE, RMSNorm, grouped-query attention where attention is used, Muon optimizer for matrix parameters, Adam-style optimization for scalar/control parameters, tied embeddings, and the same FineWeb validation protocol used by the competition.

The hybrid keeps full causal attention at layers 3 and 7. The remaining layers use a Gated Delta Net mixer implemented with flash-linear-attention's chunk_gated_delta_rule, causal depthwise convolution on Q/K/V, learned per-head gates, and decay parameters. The hybrid width is reduced from 512 to 448 so the comparison stays near the same model-size regime.
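The recurrence those GDN layers compute can be written as a naive per-step reference. This is a simplified single-head sketch of the gated delta rule, the recurrence that flash-linear-attention's chunk_gated_delta_rule evaluates in chunk-parallel form; the depthwise convolutions on Q/K/V and the exact per-head gate parameterization from the actual implementation are omitted:

```python
import numpy as np

def gated_delta_rule_ref(q, k, v, g, beta):
    """Naive single-head reference for the gated delta rule.

    q, k, v: (T, d) arrays; g: (T,) per-step decay in (0, 1];
    beta: (T,) per-step write strength. This sequential loop is for
    clarity only; production kernels compute the same recurrence in
    parallel chunks.
    """
    T, d = q.shape
    S = np.zeros((d, d))   # associative state mapping keys -> values
    out = np.zeros((T, d))
    for t in range(T):
        S = g[t] * S                         # gated decay of the whole state
        pred = S @ k[t]                      # value currently stored at k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])  # delta-rule correction
        out[t] = S @ q[t]                    # read out with the query
    return out
```

With no decay (g = 1), full write strength (beta = 1), and orthonormal keys, the state acts as an exact key-value memory: querying with a stored key returns the stored value.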

Experimental design

Each run used a fixed 600-second wall-clock training budget on a single H100, with the same FineWeb validation protocol at each context length (8k, 16k, and 32k).

The key fairness choice was schedule scaling. The hybrid did not receive a longer training budget. Step times were measured, the available optimizer-step budget was estimated for each model within 600 seconds, and warmup/warmdown schedules were rescaled to preserve similar training-phase ratios across contexts.
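That scaling step can be sketched as follows. The warmup/warmdown fractions below are illustrative placeholders, not the submission's actual values; the point is that the phases are held as fixed fractions of whatever step budget each model earns:

```python
def steps_in_budget(ms_per_step, budget_s=600.0):
    """Optimizer steps that fit in the wall-clock budget (ignores eval overhead)."""
    return int(budget_s * 1000.0 / ms_per_step)

def rescale_schedule(total_steps, warmup_frac=0.02, warmdown_frac=0.2):
    """Keep warmup/warmdown as fixed fractions of the model's step budget.

    warmup_frac and warmdown_frac are hypothetical placeholder values.
    """
    warmup = max(1, round(total_steps * warmup_frac))
    warmdown = round(total_steps * warmdown_frac)
    return warmup, warmdown
```

With the measured 32k step times, steps_in_budget(2202) and steps_in_budget(1361) give 272 and 440 steps, within a step of the counts actually logged once evaluation overhead is accounted for.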

This matters because the experiment is not only about per-step quality. It is about quality reached within a fixed wall-clock budget.

Results

Throughput

Context Baseline ms/step Hybrid ms/step Hybrid speed Baseline steps in 600s Hybrid steps in 600s
8k 594 757 0.79x 1011 789
16k 952 847 1.12x 629 712
32k 2202 1361 1.62x 271 440

At 8k, FlashAttention was fast enough that the GDN layer overhead did not pay for itself. At 16k, the hybrid became faster. At 32k, it completed about 62% more optimizer steps in the same wall-clock budget.
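The 62% figure follows directly from the step counts in the 32k row:

```python
baseline_steps, hybrid_steps = 271, 440  # 32k row of the throughput table
extra_steps = (hybrid_steps - baseline_steps) / baseline_steps
print(f"{extra_steps:.0%} more optimizer steps in the same 600 s budget")
```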

Final validation bpb after compression round trip

Context Baseline val_bpb Hybrid val_bpb Delta, hybrid - baseline Winner
8k 1.3507 1.3810 +0.0303 Baseline
16k 1.4353 1.3999 -0.0354 Hybrid
32k 1.7059 1.4709 -0.2350 Hybrid

The important pattern is the crossover. The hybrid was not universally better. It became better once the context length was long enough for the compute savings to dominate the recurrent-kernel overhead.

At 32k, the result was stable across two seeds:

Seed Hybrid 32k val_bpb Hybrid 32k val_loss
1341 1.4703 2.4825
42 1.4716 2.4847
Mean 1.4709 2.4836

What the learning curves showed

At 32k context, the hybrid was ahead from the first validation checkpoint and kept improving after the baseline ran out of time.

Step Baseline val_bpb Hybrid val_bpb
50 2.5255 2.1231
100 2.1579 1.9097
150 1.9480 1.8040
200 1.8220 1.6803
250 1.7190 1.5973
271 1.7035 ~1.5927
300 - 1.5459
350 - 1.5071
400 - 1.4805
440 - 1.4694

This is the clearest evidence that the hybrid's advantage was not just a noisy final evaluation. It had both better early validation behavior and more optimization steps available inside the fixed budget.

Interpretation

The result suggests a practical rule for this model family:

In a 17M-parameter, 600-second, single-H100 training regime, full attention remains preferable at 8k context, but a GDN hybrid becomes preferable at 16k and above.

The mechanism appears to be a combination of two effects.

First, long-context full attention becomes expensive quickly. Replacing 7 of 9 full-attention layers with GDN layers changes the throughput picture once sequence length is large enough.

Second, the recurrent state may be a reasonable inductive bottleneck for very small models. A 17M-parameter model may not be able to make efficient use of all-to-all 32k attention in every layer. The hybrid forces most layers to compress history while still giving two layers global attention access.

That second point is a hypothesis, not a proven claim. The data supports it enough to justify follow-up experiments, but the current contribution is the measured crossover under a controlled budget.

Next experiments

  1. Ablate the attention layer placement. The current hybrid uses attention at layers 3 and 7. Testing {2, 6}, {4, 8}, and a single-attention-layer variant would show whether the win comes from the GDN majority or from this exact placement.
  2. Sweep the GDN-to-attention ratio. A 6:3 or 8:1 split might improve the 16k/32k trade-off depending on whether the model is bottlenecked by global mixing or throughput.
  3. Tune the 8k regime separately. The hybrid had signs of better per-step loss at 8k but lost on wall-clock. A kernel-aware batch/token schedule might recover some of that.
  4. Run more seeds at 16k. The 32k result was very stable across two seeds. The 16k crossover is smaller and deserves more replication.
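Experiment 1's placement sweep is small enough to enumerate exhaustively. A hypothetical helper (attention_placements is not from the submission) shows the full search space for two attention layers in a 9-layer stack:

```python
from itertools import combinations

def attention_placements(n_layers=9, n_attn=2):
    # enumerate every way to place the full-attention layers in the stack;
    # the current hybrid corresponds to the {3, 7} placement
    return [set(c) for c in combinations(range(n_layers), n_attn)]

layouts = attention_placements()  # 36 candidate layouts for a 9-layer, 2-attn model
```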

Why this is meaningful

This is a small but useful systems-oriented result: it identifies a specific context-length crossover where a modern hybrid sequence mixer becomes worthwhile under a compressed, fixed-budget training objective.

It also demonstrates research-engineering habits that matter in model-development work: measuring step times before comparing architectures, rescaling schedules so wall-clock budgets stay fair, and replicating the headline number across seeds.

The strongest part of the result is that it has a boundary. The hybrid did not simply "win"; it lost at 8k, crossed over around 16k, and became clearly better at 32k. That boundary makes the experiment more useful than a one-row benchmark, because it says something about when the architecture is worth its complexity.

Code and submission

The full implementation and discussion are available in the Parameter Golf submission PR.

The branch includes the hybrid model implementation, baseline comparison, experiment notes, logs, and submission metadata. The PR conversation is the best place to see the experiment in its original competition context.