caistro labs
Research · Apr 7, 2026

ABST: more compute per VRAM

The memory bottleneck

Scaling compute effectively is the core problem in modern ML research. While standard image models and diffusion variants might appear manageable on paper, high-resolution batches, multi-aspect training, and EMA passes rapidly fragment VRAM. When multiple jobs share a single card, their peak activation memory collides and the scheduler starts rejecting jobs.

The standard industry fix is horizontal scaling — buying more compute. However, hardware scaling is cost-prohibitive for most research requirements.

The structural inefficiency is the frozen backbone. On modern models, the frozen weights dominate VRAM, yet only one block is actively computing at any given moment. The rest of the backbone sits idle in memory, consuming capacity without contributing any compute.

ABST is the execution framework that fixes this. The frozen blocks remain on the host CPU; immediately before a block is needed for computation, it is streamed to the GPU. The trainable adapters (a small fraction of total parameters, but ones that need persistent gradient state) remain entirely resident on the GPU. The result: a 3–4× increase in concurrent training capacity on equivalent hardware. We've been running it on our RTX 6000 Pro fleet since early 2026.

The LLM proof point

While ABST was designed to handle the demanding VRAM constraints of concurrent diffusion models, diffusion stacks lack standardization for reproducible benchmarking. Large Language Models provide a more rigorous baseline.

A 72B quantized backbone requires approximately 40 GB of VRAM, so standard QLoRA cannot fit it on a 24 GB consumer GPU under any configuration. With ABST, a Qwen2.5-72B QLoRA training run peaks at 17.1 GB of VRAM.

The cost is per-step time. ABST trades raw speed for access, enabling a 72B QLoRA training run to execute entirely on a single 24 GB card.

Why this is hard

QLoRA in one line:

  LoRA(x) = W_q · x + B · (A · x)

W_q is the 4-bit frozen weight. A and B are the trainable low-rank adapters. Every layer's W_q must be resident on the GPU at the exact moment that layer is computed. For Qwen2.5-72B, the quantized backbone is ~40 GB, which strictly exceeds the capacity of a 24 GB card.
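In code, the equation is a dense matmul plus a low-rank correction. A minimal sketch, with W_q shown as an already-dequantized tensor for clarity (real QLoRA keeps it in 4-bit and dequantizes on the fly):

```python
import torch

def lora_forward(x, w_q, a, b, scale=1.0):
    # LoRA(x) = W_q · x + scale · B · (A · x)
    # w_q: frozen weight, shape (d_out, d_in), dequantized here for clarity
    # a:   trainable down-projection, shape (r, d_in)
    # b:   trainable up-projection,   shape (d_out, r)
    return w_q @ x + scale * (b @ (a @ x))
```

With B initialized to zero (the standard LoRA init), the adapter path contributes nothing at step 0 and `lora_forward` reduces to the frozen matmul.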

The standard approaches either change the math or require additional hardware:

  • Gradient checkpointing trades VRAM for FLOPs by recomputing activations on the backward pass. Gradients stay mathematically identical — you pay roughly a third more compute per step.
  • CPU offload of optimizer state (ZeRO-Offload, FSDP) keeps weights on GPU and pushes the optimizer to the host between steps. The weights themselves remain resident while compute is running.
  • Pipeline and tensor parallelism shard the model across multiple GPUs. Solves the memory problem by adding cards. Ineffective on a single card.
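The "mathematically identical" claim in the first bullet is easy to verify directly with torch.utils.checkpoint: a checkpointed forward recomputes activations on backward, yet yields the same gradients as the plain forward. A small sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block1, block2 = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)

# Checkpointed: activations are recomputed on the backward pass.
x1 = torch.randn(4, 8, requires_grad=True)
h = checkpoint(block1, x1, use_reentrant=False)
checkpoint(block2, h, use_reentrant=False).sum().backward()

# Plain: activations are stored as usual.
x2 = x1.detach().clone().requires_grad_(True)
block1.zero_grad(); block2.zero_grad()
block2(block1(x2)).sum().backward()

assert torch.allclose(x1.grad, x2.grad)  # same gradients, less VRAM
```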

None of these keep almost all of the weights off the GPU during layer computation and still produce the exact same gradient as standard QLoRA.

The execution model

A transformer is a strict sequence of block computations. Each block depends only on the output of the previous one:

  h_0 = embed(x)
  h_1 = block_1(h_0)
  h_2 = block_2(h_1)
  ...
  h_N = block_N(h_{N-1})
  y   = lm_head(h_N)

At step i, only block_i's weights are needed on the GPU. Everything before it has already computed; everything after it hasn't started. That sequentiality is the opening ABST exploits.
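The chain above is a plain loop; nothing in it ever needs more than one block's weights at a time. A stripped-down sketch (names like `embed` and `lm_head` are stand-ins, not a real model's modules):

```python
import torch

def forward_chain(embed, blocks, lm_head, x):
    h = embed(x)              # h_0
    for block in blocks:      # h_i = block_i(h_{i-1}); only block_i is touched
        h = block(h)
    return lm_head(h)         # y
```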

The streaming pattern:

╭───────────────────────────────────────╮
│  CPU — frozen backbone                │
│  b1  b2  b3  b4  b5  b6  ...  bN      │
│  (~95% of weights, always here)       │
╰───────────────────────────────────────╯
                    ▲
                    │  release after backward; fetch b_{i+1}
                    │  copy block_i just before it runs
                    ▼
        ╭───────────────────────────────────────╮
        │  GPU — block_i  +  adapters           │
        │  (adapters resident for every block)  │
        ╰───────────────────────────────────────╯
                                                                                           
Streaming pattern. The backbone lives on CPU; one block at a time is copied to GPU and released. Adapters never leave the GPU.
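In code, the pattern is a few lines wrapped around the block loop. A minimal sketch, with each adapter modeled as a parallel trainable Linear for illustration (in real QLoRA the adapters live inside each block's projection layers); it falls back to a no-op move when no GPU is present:

```python
import torch

def streamed_forward(cpu_blocks, adapters, x, device):
    h = x.to(device)
    for block, adapter in zip(cpu_blocks, adapters):
        block.to(device)            # host-to-device copy, just in time
        h = block(h) + adapter(h)   # frozen path + resident adapter path
        block.to("cpu")             # release VRAM before the next block
    return h
```

The adapters are constructed on the device once and never move; only the frozen blocks shuttle back and forth.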

The VRAM reduction follows directly:

  V_resident = N · V_block + V_adapters + V_activations
  V_ABST     = b · V_block + V_adapters + V_activations

N is the total number of blocks; b is the number held resident at any moment — typically 1 or 2. At 72B scale, this means over 90% of backbone weights are on the CPU at any point during training.
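Plugging illustrative numbers into the two formulas makes the gap concrete. These are assumed values (an 80-block backbone at ~0.5 GB per quantized block, with nominal adapter and activation budgets), not the measured figures reported below:

```python
N, b = 80, 2                    # total blocks vs. blocks resident at once
v_block, v_adapters, v_acts = 0.5, 0.5, 2.0   # GB, assumed for illustration

v_resident = N * v_block + v_adapters + v_acts
v_abst     = b * v_block + v_adapters + v_acts
print(f"{v_resident} GB resident vs {v_abst} GB streamed")  # 42.5 vs 3.5
```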

Bitwise-validated mathematical parity

ABST does not approximate. The forward pass computes identical logits; the backward pass computes identical gradients.

To eliminate the variance of nondeterministic CUDA kernels, strict parity testing was executed at 7B, 32B, and 72B scales using fixed datasets, tight seeding, and pinned deterministic attention backends. Every step's loss and every LoRA parameter compared element-wise:


 Model        LoRA params  Max loss diff  Max weight diff 

 Qwen2.5-7B    40,370,176       0.00e+00         0.00e+00 
 Qwen2.5-32B  134,217,728       0.00e+00         0.00e+00 
 Qwen2.5-72B  210,534,400       0.00e+00         0.00e+00 

The gradients are not approximations; they are strictly bit-identical across two GPU generations (Ampere RTX 3090, Blackwell RTX PRO 6000) and two PyTorch versions (2.8 and 2.10). The streaming pattern operates independently of the underlying kernel stack.
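The shape of such a parity harness is simple to sketch: pin every seed, force deterministic kernels, run a step twice, and compare with torch.equal (exact), not torch.allclose. A toy version of the pattern, not the actual test suite:

```python
import torch

def one_step(seed):
    # Pin seeds and force deterministic kernels, then run one tiny step.
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    model(torch.randn(4, 8)).pow(2).mean().backward()
    opt.step()
    return model.weight.detach().clone()

# Bit-identical, not merely close:
assert torch.equal(one_step(0), one_step(0))
```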

The trade-off: PCIe transfer overhead

Every block transfer requires a host-to-device copy across PCIe. Without overlap, the compute stream stalls between blocks — the GPU sits idle waiting for the next weights to arrive:

┌─── No prefetch — compute stalls waiting on PCIe ──┐
│                                                   │
│            ┆             ┆             ┆          │
│ ─── compute ──────────────────────────────────────│
│ block i    ████░░        ┆             ┆          │
│ block i+1  ┆       ████░░┆             ┆          │
│ block i+2  ┆             ┆ ████░░      ┆          │
│ ─── copy ─────────────────────────────────────────│
│ fetch i+1  ┆   ▓▓▓▓░░    ┆             ┆          │
│ fetch i+2  ┆           ▓▓▓▓░░          ┆          │
│            ┆             ┆             ┆          │
└───────────────────────────────────────────────────┘
Serial: every compute interval has to wait on the next copy.

ABST mitigates this latency through prefetching. While block i computes on the primary CUDA stream, block i+1 transfers on an independent prefetch stream. The same operational overlap applies during the backward pass:

┌─── With prefetch — copy hidden under compute ─────┐
│                                                   │
│            ┆             ┆             ┆          │
│ ─── compute ──────────────────────────────────────│
│ block i    ████░░        ┆             ┆          │
│ block i+1  ┆   ████░░    ┆             ┆          │
│ block i+2  ┆       ████░░┆             ┆          │
│ block i+3  ┆           ████░░          ┆          │
│ ─── copy ─────────────────────────────────────────│
│ fetch i+1  ┆███░░        ┆             ┆          │
│ fetch i+2  ┆    ███░░    ┆             ┆          │
│ fetch i+3  ┆        ███░░┆             ┆          │
│            ┆             ┆             ┆          │
└───────────────────────────────────────────────────┘
Prefetched on a separate CUDA stream: by the time compute needs the next block, the weights have already arrived.
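The prefetch loop can be sketched with a dedicated CUDA stream for copies, degrading to plain serial copies when no GPU is present. Real implementations also pin the host memory so the non_blocking copy is truly asynchronous; that detail is omitted here:

```python
import torch

def prefetched_forward(cpu_blocks, x, device):
    use_streams = device.type == "cuda"
    copy_stream = torch.cuda.Stream() if use_streams else None
    h = x.to(device)
    nxt = cpu_blocks[0].to(device)
    for i in range(len(cpu_blocks)):
        cur, nxt = nxt, None
        if i + 1 < len(cpu_blocks):          # start copying block i+1 ...
            if use_streams:
                with torch.cuda.stream(copy_stream):
                    nxt = cpu_blocks[i + 1].to(device, non_blocking=True)
            else:
                nxt = cpu_blocks[i + 1].to(device)
        h = cur(h)                           # ... while block i computes
        if use_streams:                      # ensure block i+1 has landed
            torch.cuda.current_stream().wait_stream(copy_stream)
        cur.to("cpu")                        # release block i
    return h
```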

Measured on consumer 24 GB hardware (RTX 3090, PCIe Gen4):


 Model        No prefetch    Prefetch       Speedup 

 Qwen2.5-7B     991 ms/step    751 ms/step    1.32× 
 Qwen2.5-32B  3,971 ms/step  2,680 ms/step    1.48× 

The speedup scales with model size, as larger blocks increase arithmetic intensity and allow the prefetch stream to better mask the transfer latency. The same pattern holds at 1.4–1.6× on datacenter cards (Blackwell, PCIe Gen5).

With prefetch active, the transfer overhead is minimized. The net operational cost is roughly 1.3–1.5× the per-step time of a resident QLoRA run.
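That 1.3–1.5× range falls out of a simple overlap model: serially, each block pays compute plus transfer; with prefetch, it pays roughly the larger of the two. With assumed per-block times chosen to mirror the 7B measurement (illustrative, not measured):

```python
compute_ms, transfer_ms, n_blocks = 25.0, 8.0, 28   # assumed values

serial     = n_blocks * (compute_ms + transfer_ms)  # stall on every copy
prefetched = n_blocks * max(compute_ms, transfer_ms)  # copy hidden under compute
print(f"speedup = {serial / prefetched:.2f}x")      # 1.32x under these numbers
```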

Unlocking hardware constraints

The savings compound with model size — the bigger the frozen backbone, the more of it was never computing in the first place:


 Model        Resident QLoRA  ABST QLoRA  24 GB consumer card

 Qwen2.5-7B           8.3 GB      3.5 GB  fits both
 Qwen2.5-32B         25.3 GB      6.3 GB  overflow → fits
 Qwen2.5-72B         81.6 GB     17.1 GB  impossible → fits

On our own infrastructure the gain shows up as concurrency. The same RTX 6000 Pro that used to run one full-resolution Flux training job now runs three or four in parallel, a 3–4× cut in the marginal cost of a training run, consistent with what our scheduler reports at the end of each week.

Ultimately, ABST changes the operational baseline. By moving the primary bottleneck from VRAM capacity to PCIe bandwidth, we eliminate the memory cost of idle weights and fully utilize the compute we already have.