GDN Hybrid Attention Under a 16 MB Model Budget

A Parameter Golf non-record submission accepted on 2026-04-04.

Published: April 2026 | Reading time: 8 min | Topics: Long-context LLMs, efficient training, hybrid sequence mixers

One-line result

This study tests whether the Gated Delta Net hybrid pattern from OLMo Hybrid helps a very small language model train at long context lengths. In this constrained setting, the answer was conditional but useful: full attention was better at 8k context, while the hybrid crossed over between 8k and 16k and was substantially better at 32k under the same 600-second single-H100 budget.

At 32k context, the hybrid reached a two-seed mean score of 1.4709 val_bpb on the final_int8_zlib_roundtrip evaluation, compared with 1.7059 val_bpb for the full-attention baseline.
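For readers unfamiliar with the metric, val_bpb is validation bits per byte, which relates to the mean cross-entropy loss (in nats per token) through the tokenizer's average compression. The conversion can be sketched as follows; the bytes-per-token figure is back-derived here from the two reported 32k hybrid numbers purely for illustration, not stated anywhere in the submission:

```python
import math

def nats_per_token_to_bpb(val_loss_nats, bytes_per_token):
    """Convert mean cross-entropy in nats/token to bits per byte."""
    return val_loss_nats / math.log(2) / bytes_per_token

# bytes_per_token ~ 2.436 is an assumption back-derived from the reported
# 32k hybrid pair (val_loss 2.4836, val_bpb 1.4709); it is not a figure
# from the submission.
bpb = nats_per_token_to_bpb(2.4836, 2.436)
```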

Research question

Parameter Golf rewards models that do more with less: small parameter budgets, compressed submissions, and careful training choices. That makes it a useful testbed for a practical ML systems question:

When context length grows, should a small model spend its budget on full quadratic attention everywhere, or should most layers use a cheaper recurrent mixer while a few layers preserve full attention?

The motivation came from OLMo Hybrid's use of Gated Delta Net (GDN) layers alongside full-attention layers. The obvious scaling argument says the hybrid should become cheaper at long sequence lengths, because most layers avoid O(L^2) attention. The less obvious question is whether that advantage still matters in a tiny model, after compression, with only 600 seconds of training.
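The scaling argument can be made concrete with a toy token-mixing cost model. Everything below is invented for illustration: the costs count only token-mixing work (no MLPs or embeddings), and c_gdn is a hypothetical per-token constant chosen so that the modeled crossover lands between 8k and 16k, where the measured one appears:

```python
def full_attn_mixing_cost(L, n_layers=9):
    # token-mixing work only: each full-attention layer scales as L^2
    return n_layers * L ** 2

def hybrid_mixing_cost(L, n_attn=2, n_gdn=7, c_gdn=12_000):
    # GDN layers scale linearly in L; c_gdn is a made-up per-token constant
    # standing in for the recurrent kernel and conv/gate overhead
    return n_attn * L ** 2 + n_gdn * c_gdn * L

for L in (8_192, 16_384, 32_768):
    ratio = full_attn_mixing_cost(L) / hybrid_mixing_cost(L)
    print(L, round(ratio, 2))
```

Setting 9L^2 = 2L^2 + 7 * c_gdn * L gives a modeled crossover at L = c_gdn = 12k, which is qualitatively consistent with the hybrid losing at 8k and winning at 16k and 32k.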

The contribution of this submission is not just adding a new mixer layer. It measures where the architectural trade-off changes sign under a fixed wall-clock and compressed-model objective.

Model comparison

The experiment compares two 9-layer models with similar parameter counts:

Model Mixer layout Width Parameters
Full-attention baseline 9 attention layers 512 17.06M
GDN hybrid 7 GDN layers + 2 full-attention layers 448 17.42M

Both models use the same broad training recipe: RoPE, RMSNorm, grouped-query attention where attention is used, Muon optimizer for matrix parameters, Adam-style optimization for scalar/control parameters, tied embeddings, and the same FineWeb validation protocol used by the competition.

The hybrid keeps full causal attention at layers 3 and 7. The remaining layers use a Gated Delta Net mixer implemented with flash-linear-attention's chunk_gated_delta_rule, causal depthwise convolution on Q/K/V, learned per-head gates, and decay parameters. The hybrid width is reduced from 512 to 448 so the comparison stays near the same model-size regime.
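The recurrence those GDN layers compute can be written as a naive per-step reference. This is a simplified single-head sketch of the gated delta rule, the recurrence that flash-linear-attention's chunk_gated_delta_rule evaluates in chunk-parallel form; the depthwise convolutions on Q/K/V and the exact per-head gate parameterization from the actual implementation are omitted:

```python
import numpy as np

def gated_delta_rule_ref(q, k, v, g, beta):
    """Naive single-head reference for the gated delta rule.

    q, k, v: (T, d) arrays; g: (T,) per-step decay in (0, 1];
    beta: (T,) per-step write strength. This sequential loop is for
    clarity only; production kernels compute the same recurrence in
    parallel chunks.
    """
    T, d = q.shape
    S = np.zeros((d, d))   # associative state mapping keys -> values
    out = np.zeros((T, d))
    for t in range(T):
        S = g[t] * S                         # gated decay of the whole state
        pred = S @ k[t]                      # value currently stored at k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])  # delta-rule correction
        out[t] = S @ q[t]                    # read out with the query
    return out
```

With no decay (g = 1), full write strength (beta = 1), and orthonormal keys, the state acts as an exact key-value memory: querying with a stored key returns the stored value.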

Experimental design

Each run used a fixed 600-second wall-clock training budget on a single H100, with the same FineWeb validation protocol at each context length (8k, 16k, and 32k).

The key fairness choice was schedule scaling. The hybrid did not receive a longer training budget. Step times were measured, the available optimizer-step budget was estimated for each model within 600 seconds, and warmup/warmdown schedules were rescaled to preserve similar training-phase ratios across contexts.
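That scaling step can be sketched as follows. The warmup/warmdown fractions below are illustrative placeholders, not the submission's actual values; the point is that the phases are held as fixed fractions of whatever step budget each model earns:

```python
def steps_in_budget(ms_per_step, budget_s=600.0):
    """Optimizer steps that fit in the wall-clock budget (ignores eval overhead)."""
    return int(budget_s * 1000.0 / ms_per_step)

def rescale_schedule(total_steps, warmup_frac=0.02, warmdown_frac=0.2):
    """Keep warmup/warmdown as fixed fractions of the model's step budget.

    warmup_frac and warmdown_frac are hypothetical placeholder values.
    """
    warmup = max(1, round(total_steps * warmup_frac))
    warmdown = round(total_steps * warmdown_frac)
    return warmup, warmdown
```

With the measured 32k step times, steps_in_budget(2202) and steps_in_budget(1361) give 272 and 440 steps, within a step of the counts actually logged once evaluation overhead is accounted for.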

This matters because the experiment is not only about per-step quality. It is about quality reached within a fixed wall-clock budget.

Results

Throughput

Context Baseline ms/step Hybrid ms/step Hybrid speed Baseline steps in 600s Hybrid steps in 600s
8k 594 757 0.79x 1011 789
16k 952 847 1.12x 629 712
32k 2202 1361 1.62x 271 440

At 8k, FlashAttention was fast enough that the GDN layer overhead did not pay for itself. At 16k, the hybrid became faster. At 32k, it completed about 62% more optimizer steps in the same wall-clock budget.
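The 62% figure follows directly from the step counts in the 32k row:

```python
baseline_steps, hybrid_steps = 271, 440  # 32k row of the throughput table
extra_steps = (hybrid_steps - baseline_steps) / baseline_steps
print(f"{extra_steps:.0%} more optimizer steps in the same 600 s budget")
```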

Final validation bpb after compression round trip

Context Baseline val_bpb Hybrid val_bpb Delta, hybrid - baseline Winner
8k 1.3507 1.3810 +0.0303 Baseline
16k 1.4353 1.3999 -0.0354 Hybrid
32k 1.7059 1.4709 -0.2350 Hybrid

The important pattern is the crossover. The hybrid was not universally better. It became better once the context length was long enough for the compute savings to dominate the recurrent-kernel overhead.

At 32k, the result was stable across two seeds:

Seed Hybrid 32k val_bpb Hybrid 32k val_loss
1341 1.4703 2.4825
42 1.4716 2.4847
Mean 1.4709 2.4836

What the learning curves showed

At 32k context, the hybrid was ahead from the first validation checkpoint and kept improving after the baseline ran out of time.

Step Baseline val_bpb Hybrid val_bpb
50 2.5255 2.1231
100 2.1579 1.9097
150 1.9480 1.8040
200 1.8220 1.6803
250 1.7190 1.5973
271 1.7035 ~1.5927
300 - 1.5459
350 - 1.5071
400 - 1.4805
440 - 1.4694

This is the clearest evidence that the hybrid's advantage was not just a noisy final evaluation. It had both better early validation behavior and more optimization steps available inside the fixed budget.

Interpretation

The result suggests a practical rule for this model family:

In a 17M-parameter, 600-second, single-H100 training regime, full attention remains preferable at 8k context, but a GDN hybrid becomes preferable at 16k and above.

The mechanism appears to be a combination of two effects.

First, long-context full attention becomes expensive quickly. Replacing 7 of 9 full-attention layers with GDN layers changes the throughput picture once sequence length is large enough.

Second, the recurrent state may be a reasonable inductive bottleneck for very small models. A 17M-parameter model may not be able to make efficient use of all-to-all 32k attention in every layer. The hybrid forces most layers to compress history while still giving two layers global attention access.

That second point is a hypothesis, not a proven claim. The data supports it enough to justify follow-up experiments, but the current contribution is the measured crossover under a controlled budget.

Next experiments

  1. Ablate the attention layer placement. The current hybrid uses attention at layers 3 and 7. Testing {2, 6}, {4, 8}, and a single-attention-layer variant would show whether the win comes from the GDN majority or from this exact placement.
  2. Sweep the GDN-to-attention ratio. A 6:3 or 8:1 split might improve the 16k/32k trade-off depending on whether the model is bottlenecked by global mixing or throughput.
  3. Tune the 8k regime separately. The hybrid had signs of better per-step loss at 8k but lost on wall-clock. A kernel-aware batch/token schedule might recover some of that.
  4. Run more seeds at 16k. The 32k result was very stable across two seeds. The 16k crossover is smaller and deserves more replication.
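Experiment 1's placement sweep is small enough to enumerate exhaustively. A hypothetical helper (attention_placements is not from the submission) shows the full search space for two attention layers in a 9-layer stack:

```python
from itertools import combinations

def attention_placements(n_layers=9, n_attn=2):
    # enumerate every way to place the full-attention layers in the stack;
    # the current hybrid corresponds to the {3, 7} placement
    return [set(c) for c in combinations(range(n_layers), n_attn)]

layouts = attention_placements()  # 36 candidate layouts for a 9-layer, 2-attn model
```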

Why this is meaningful

This is a small but useful systems-oriented result: it identifies a specific context-length crossover where a modern hybrid sequence mixer becomes worthwhile under a compressed, fixed-budget training objective.

It also demonstrates research-engineering habits that matter in model-development work: measuring step times before comparing architectures, rescaling schedules so wall-clock budgets stay fair, and replicating the headline number across seeds.

The strongest part of the result is that it has a boundary. The hybrid did not simply "win"; it lost at 8k, crossed over around 16k, and became clearly better at 32k. That boundary makes the experiment more useful than a one-row benchmark, because it says something about when the architecture is worth its complexity.

Code and submission

The full implementation and discussion are available in the Parameter Golf submission PR.

The branch includes the hybrid model implementation, baseline comparison, experiment notes, logs, and submission metadata. The PR conversation is the best place to see the experiment in its original competition context.