Render Cost Prediction

Reference sheet for computing GPU render cost of 2D scene operations before drawing. All constants and formulas are derived from GPU pipeline structure, not empirical tuning.

Rendering Optimization Strategies — implemented optimizations
Chromium Compositor Research — reference architecture

Core Principle: Fill Rate Dominance

2D GPU rendering is memory-bandwidth bound, not compute bound. The fragment shader for a rect fill is ~1 ALU op; even a Gaussian blur pass is ~10 ALU ops per pixel. Modern GPUs execute trillions of ALU ops/sec, but memory bandwidth is 50-200 GB/s. Each pixel read/write is 4-16 bytes.

Therefore:

frame_cost ≈ total_pixels_touched / memory_bandwidth

This relationship is linear. Double the pixels, double the time. No surprises, no non-linear scaling — as long as you stay within VRAM and don't hit texture cache thrashing (rare in 2D; access is spatially coherent).
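As a rough illustration of the relationship above, the sketch below converts a pixel count into a frame-time estimate. The function name and its parameters are hypothetical; the bytes-per-pixel and bandwidth values a caller passes in are assumptions taken from the ranges quoted earlier, not measurements.

```rust
/// Estimated frame time in milliseconds under the bandwidth-bound model:
/// frame_cost ≈ total_pixels_touched / memory_bandwidth.
///
/// `bytes_per_pixel` and `bandwidth_gb_per_s` are caller-supplied assumptions
/// (the text quotes 4-16 bytes per pixel and 50-200 GB/s).
fn frame_cost_ms(pixels_touched: f64, bytes_per_pixel: f64, bandwidth_gb_per_s: f64) -> f64 {
    // Total traffic generated by the pixels touched.
    let bytes = pixels_touched * bytes_per_pixel;
    // Convert GB/s to bytes per millisecond.
    let bytes_per_ms = bandwidth_gb_per_s * 1.0e9 / 1.0e3;
    bytes / bytes_per_ms
}
```

Because the estimate is a plain ratio, doubling `pixels_touched` doubles the result, which is the linearity the model relies on.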

This means render cost can be pre-computed as a pixel budget: count the pixels the GPU will touch, apply structural multipliers per effect, and compare against a calibrated device budget.

Effect Cost Constants

These are not magic numbers or tuning parameters. They are the structural pass counts of each rendering operation — how many full-area read-write cycles the GPU performs.

Effect	Pixel Multiplier	Derivation
Plain shape (rect, ellipse, polygon)	`1×`	Single fill pass
Additional fill (N fills on one node)	`+1×` per extra fill	Each fill is a separate pass
Additional stroke	`+1×` per stroke	Separate pass
Non-rect clip path	`+1×`	Mask pass + masked content
Rect clip	`+0×`	Hardware scissor — free
Blend mode (non-normal)	`+1×`	Requires offscreen isolation layer
Group opacity (alpha < 1.0 on group)	`+1×`	`save_layer` for isolated compositing
Gaussian blur	`+3×`	Downsample pyramid (~1.33×) + blur + upsample + composite
Drop shadow	`+5×`	Draw shape (1×) + blur pipeline (3×) + composite back (1×)
Inner shadow	`+5×`	Same as drop shadow, inverted mask
Backdrop filter (background blur)	`+3×`	Snapshot dst + blur + composite
Layer blur (on node itself)	`+3×`	Offscreen + blur + composite
Image fill	`+0×` over base	Texture sample replaces color fill — same bandwidth
Multiple shadows	`+5×` per shadow	Each shadow is independent

Blur Radius Independence

Skia (and most GPU frameworks) implement Gaussian blur via a downsample pyramid, not a brute-force kernel convolution:

large sigma → downsample 2× → downsample 2× → ... → blur at reduced size → upsample

Total pixel work = area × (1 + 1/4 + 1/16 + ...) ≈ area × 1.33 (geometric series), plus the blur pass at reduced resolution. The cost is approximately constant regardless of blur radius. The pyramid absorbs the radius.
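The geometric series can be checked numerically. The sketch below sums the area of each pyramid level for a given number of 2× downsamples; `pyramid_pixel_work` and its `levels` parameter are hypothetical names, and how many levels a real implementation picks for a given sigma is not modeled here.

```rust
/// Pixel work of a downsample pyramid over `area` pixels:
/// area * (1 + 1/4 + 1/16 + ...), truncated after `levels` downsamples.
fn pyramid_pixel_work(area: f64, levels: u32) -> f64 {
    let mut total = 0.0;
    let mut level_area = area;
    // Level 0 is the full-resolution source; each later level halves
    // both dimensions, so its area is a quarter of the previous one.
    for _ in 0..=levels {
        total += level_area;
        level_area /= 4.0;
    }
    total
}
```

With `levels = 2` the work is 1.3125 × area, and it never exceeds the series limit of 4/3 × area no matter how many levels are added, which is why the radius drops out of the estimate.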

`save_layer` / `save_layer_alpha` — The Hidden Spike Source

save_layer is the single most expensive primitive in Skia. It allocates an offscreen surface, renders content into it, then composites back.

save_layer_cost = layer_bounds_area × zoom² × 2  (write to offscreen + read back)

Critical: nested layers stack. Every level of nesting pays its own offscreen write and read-back, so the cost grows with nesting depth.

save_layer              ← offscreen A (full group bounds)
  save_layer            ← offscreen B (child bounds)
    save_layer          ← offscreen C (grandchild bounds)
      draw rect
    restore             → composite C into B
  restore               → composite B into A
restore                 → composite A into target

Three nested layers on the same area = area × 6 bandwidth, not area × 2.
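The nesting arithmetic can be written out directly. This is a minimal sketch that applies the `save_layer_cost` formula to each layer in a chain and sums the results; the function names are illustrative, and each layer's bounds area is assumed to be known in advance.

```rust
/// Bandwidth cost, in pixels, of one save_layer:
/// one write into the offscreen surface plus one read back on restore.
fn save_layer_cost(layer_bounds_area: f64, zoom: f64) -> f64 {
    layer_bounds_area * zoom * zoom * 2.0
}

/// Total cost of a chain of nested save_layers, listed outermost first.
/// Each level pays its own write + read-back on its own bounds.
fn nested_save_layer_cost(layer_areas: &[f64], zoom: f64) -> f64 {
    layer_areas
        .iter()
        .map(|&area| save_layer_cost(area, zoom))
        .sum()
}
```

Three layers of 100 pixels each at zoom 1.0 total 600 pixels of traffic, matching the area × 6 figure above.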

Implicit `save_layer` triggers

Skia inserts save_layer implicitly for these conditions. The cost estimator must account for them even when the application code does not call save_layer explicitly:

Trigger	Reason
Non-normal blend mode on a group	Isolated offscreen to blend against dst
Group opacity (alpha < 1.0 with children)	Children must composite together first, then alpha applied once
Blur / backdrop filter	Reads from dst, needs snapshot
Clip + antialiasing on groups	Soft-edge mask requires offscreen
`ColorFilter` on a group	Applied after children composite
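An estimator could encode the trigger table as a simple predicate. The sketch below is one way to do that; `Blend`, `GroupProps`, and every field name are hypothetical stand-ins rather than types from Skia or from this codebase, and it does not model whether several triggers on one group end up sharing a single layer.

```rust
/// Blend modes, reduced to what the check needs (hypothetical stand-in).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Blend {
    Normal,
    Multiply,
}

/// Hypothetical, simplified description of a group node.
struct GroupProps {
    blend: Blend,
    opacity: f32,
    has_children: bool,
    has_blur_or_backdrop_filter: bool,
    has_antialiased_clip: bool,
    has_color_filter: bool,
}

/// Returns the reasons this group would get an implicit save_layer,
/// in the order of the trigger table. An empty result means no trigger applies.
fn implicit_save_layer_triggers(g: &GroupProps) -> Vec<&'static str> {
    let mut reasons = Vec::new();
    if g.blend != Blend::Normal {
        reasons.push("non-normal blend mode");
    }
    if g.opacity < 1.0 && g.has_children {
        reasons.push("group opacity");
    }
    if g.has_blur_or_backdrop_filter {
        reasons.push("blur / backdrop filter");
    }
    if g.has_antialiased_clip {
        reasons.push("clip + antialiasing");
    }
    if g.has_color_filter {
        reasons.push("color filter");
    }
    reasons
}
```

Returning the reasons rather than a bare boolean keeps the estimate explainable when a frame unexpectedly spikes.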

Per-Node Cost Formula

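/// Estimates how many pixels the GPU touches to draw `node`:
/// on-screen area multiplied by the structural pass count.
/// `Node`, `Rect`, `BlendMode`, and `clipped_area` (returning an area as f64)
/// are assumed to be defined elsewhere.
fn estimated_fill_pixels(node: &Node, zoom: f32, viewport: &Rect) -> f64 {
    // Widen before squaring so zoom² is computed in f64.
    let zoom = f64::from(zoom);
    let screen_area = clipped_area(&node.bounds, viewport) * zoom * zoom;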

    // Base draw
    let mut passes: f64 = 1.0;

    // Extra fills/strokes beyond the first
    passes += (node.fill_count.saturating_sub(1)) as f64;
    passes += node.stroke_count as f64;

    // Effects
    for shadow in &node.shadows {
        if shadow.visible {
            passes += 5.0; // shape + blur pipeline + composite
        }
    }
    if node.has_blur() {
        passes += 3.0; // downsample + blur + composite
    }
    if node.has_backdrop_blur() {
        passes += 3.0;
    }

    // Isolation layers (implicit save_layer)
    if node.blend_mode != BlendMode::Normal {
        passes += 1.0; // offscreen + composite
    }
    if node.opacity < 1.0 && node.has_children() {
        passes += 1.0; // group opacity isolation
    }

    // Clip
    if node.has_non_rect_clip() {
        passes += 1.0; // mask pass
    }

    screen_area * passes
}

Cache Hit vs. Miss Cost

A compositor/picture cache hit replaces the full rasterization pipeline with a single texture blit:

State	Effective multiplier	What happens
Cache miss	`passes ×` (from table above)	Full rasterization: path tessellation, fill, effects
Cache hit	`~0.1×`	Single texture-sampled quad draw

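By the multipliers alone, a hit is 10× cheaper for a plain shape (1× vs. ~0.1×), and the gap widens with every effect pass; once the skipped tessellation and effect setup are counted, the difference can reach 100-1000×. Cache state is effectively a binary signal, and the single largest contributor to per-node cost variance.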
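In an estimator, the cache state can simply switch which multiplier is applied. The following sketch assumes the caller already knows the node's pass count and cache state; `CacheState`, `CACHE_HIT_MULTIPLIER`, and `effective_pixels` are hypothetical names, and the 0.1 constant is the approximate value from the table.

```rust
/// Whether the node's rasterized output is already cached.
enum CacheState {
    Hit,
    Miss,
}

/// Approximate cost of a cached node: one texture-sampled quad (~0.1×).
const CACHE_HIT_MULTIPLIER: f64 = 0.1;

/// Pixels charged to a node, given its on-screen area, its structural
/// pass count on a miss, and its cache state.
fn effective_pixels(screen_area: f64, passes: f64, state: CacheState) -> f64 {
    match state {
        CacheState::Hit => screen_area * CACHE_HIT_MULTIPLIER,
        CacheState::Miss => screen_area * passes,
    }
}
```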

Device Fill Rate Reference

The total pixel budget depends on device fill rate — the one value that varies per hardware. Everything else is derived from geometry and scene structure.

Calibration

Render a known workload (e.g., full-screen solid rect) and measure:

pixels_per_ms = (screen_width × screen_height) / render_time_ms
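The calibration formula, and the per-frame budget it leads to, might look like the sketch below. All three function names are hypothetical, `frame_deadline_ms` is an assumed input, and obtaining a trustworthy `render_time_ms` (waiting for the GPU to finish, averaging several runs) is left out.

```rust
/// Device fill rate from one calibration run of a full-screen fill.
fn measure_pixels_per_ms(screen_width: u32, screen_height: u32, render_time_ms: f64) -> f64 {
    (f64::from(screen_width) * f64::from(screen_height)) / render_time_ms
}

/// Pixels the device can be expected to touch within one frame deadline.
fn frame_pixel_budget(pixels_per_ms: f64, frame_deadline_ms: f64) -> f64 {
    pixels_per_ms * frame_deadline_ms
}

/// True when the summed per-node estimate fits inside the frame budget.
fn fits_budget(estimated_fill_pixels: f64, budget: f64) -> bool {
    estimated_fill_pixels <= budget
}
```

A 1920 × 1080 fill measured at 0.5 ms gives 4,147,200 pixels per ms; over a 16 ms deadline that is a budget of 66,355,200 pixels.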

Reference Values (order-of-magnitude)

Platform	Expected pixels_per_ms
Desktop GPU (discrete)	~500M
Desktop GPU (integrated)	~100M
WebGL (WASM, desktop)	~50-100M
WebGL (WASM, mobile)	~10-30M

Chromium Reference

Chromium's cc/ compositor collects similar metrics but uses them differently:

Metric	Chromium Location	Chromium Usage
`TotalOpCount()`	`cc/paint/display_item_list.h`	Solid-color analysis gate
`num_slow_paths_up_to_min_for_MSAA()`	`cc/paint/display_item_list.h`	Page-level GPU raster veto
`has_save_layer_ops()`	`cc/paint/display_item_list.h`	LCD text decision
`has_non_aa_paint()`	`cc/paint/display_item_list.h`	Antialiasing decisions
`BytesUsed()` / `OpBytesUsed()`	`cc/paint/display_item_list.h`	Tracing / debugging
`AreaOfDrawText()`	`cc/paint/display_item_list.h`	Text coverage statistics
Solid color analysis	`cc/tiles/tile_manager.cc`	Skip rasterization for uniform tiles (`kMaxOpsToAnalyze = 5`)

Chromium does not perform per-tile raster cost prediction. Tile scheduling is purely spatial (viewport distance + scroll velocity) with a memory budget constraint. Their architecture tolerates stale tiles (multi-threaded raster catches up across frames). Ours cannot — we render single-threaded with a hard per-frame deadline, requiring predictive budgeting.

Local source: /Users/softmarshmallow/Documents/Github/chromium/cc/

Skia `Picture` Metrics (Available for Free)

Skia's Picture object exposes complexity metrics that are already computed during recording and cost nothing to query:

Method	What it returns	Use
`approximate_op_count()`	Number of draw operations recorded	Secondary complexity signal
`approximate_bytes_used()`	Serialized size of the picture	Memory pressure / complexity proxy

These are stored fields, not computations. They complement the pixel-area model by capturing path complexity variance: a 1000-op picture with complex beziers and a 3-op picture with simple rects can cover the same pixel area yet cost very different amounts to rasterize.

Linearity Bounds

The fill-rate model is linear under these conditions:

Condition	Linear?	Notes
Work above ~10K pixels	Yes	Below this, GPU launch overhead dominates (flat floor)
Spatial texture access (normal 2D)	Yes	Bandwidth-bound, no cache thrashing
Random texture access	Can be super-linear	Rare in 2D rendering
Tile-based GPU (mobile)	Mostly	Large nodes spanning many tiles add per-tile overhead
Thermal throttling	N/A	Between-frame variance, not within-frame
VRAM pressure / swapping	Non-linear	Catastrophic; avoid by staying within budget

For typical 2D canvas rendering (spatial access, nodes > 10K pixels), the linear model holds.
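One way to fold the small-node floor into the linear model is to clamp the pixel count before dividing by the fill rate. This is only a sketch: `LAUNCH_FLOOR_PIXELS` and `cost_ms_with_floor` are hypothetical names, and the constant is the approximate ~10K threshold from the table, not a measured value.

```rust
/// Below roughly this many pixels, launch overhead dominates and the
/// cost stops shrinking (approximate threshold from the table above).
const LAUNCH_FLOOR_PIXELS: f64 = 10_000.0;

/// Linear fill-rate cost in milliseconds with a flat floor for tiny draws.
fn cost_ms_with_floor(fill_pixels: f64, pixels_per_ms: f64) -> f64 {
    fill_pixels.max(LAUNCH_FLOOR_PIXELS) / pixels_per_ms
}
```

Under this clamp a 100-pixel node is charged as if it touched 10,000 pixels, while nodes above the threshold keep the plain linear cost.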
