Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses (SZ-125)
rk@tigase.net opened 5 days ago

Labels: enhancement reliability backend sztabina-integration Priority: High Milestone: MVP (blocker)


Background

During testing, Sztab crashed with a non-obvious exception originating in the Netty/WebFlux codec layer (see: https://tigase.dev/sztab/~issues/119). Root cause: the Sztab WebClient (consumer) was collecting the full Sztabina (producer) HTTP response body into an in-memory buffer before processing it. For large git payloads — blobs, diffs, pack data, or tree listings — this buffer overflowed the WebFlux default codec limit.

Sztabina had no visibility into the overflow; it served the response normally. The failure manifested entirely on the Sztab side as a cryptic DataBufferLimitException or similar Reactor Netty codec error, with no indication in the stack trace that response size was the actual cause. This class of failure is silent until it isn't — it will recur on any sufficiently large repository in production.


Tentative Fix (currently in place)

The WebClient codec configuration was patched with an approximate max-in-memory size ceiling:

WebClient.builder()
    .codecs(configurer -> configurer
        .defaultCodecs()
        .maxInMemorySize(N_BYTES)) // approximate ceiling
    .build();

This is a stopgap. It is brittle because:

  • The ceiling is a guess; any sufficiently large repo or file will breach it again.
  • It buffers the entire payload in JVM heap, creating GC pressure on large operations.
  • It silently works until it doesn't — no graceful degradation, same opaque exception.
  • It is not a fix; it is a raised floor on the same trap.

Desired Solution

Replace the collect-to-body pattern with proper reactive streaming wherever Sztab consumes Sztabina responses.

Before (buffered — current stopgap):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToMono(byte[].class) // full payload materialized in heap
    .block();

After (streaming):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToFlux(DataBuffer.class)
    .map(dataBuffer -> {
        byte[] chunk = new byte[dataBuffer.readableByteCount()];
        dataBuffer.read(chunk);
        DataBufferUtils.release(dataBuffer); // prevent Netty memory leak
        return chunk;
    });

For endpoints that serve content directly to the browser (e.g. blob download), the preferred pattern is to pipe the Flux<DataBuffer> directly into the Spring ServerHttpResponse without ever collecting:

DataBufferUtils.write(flux, serverHttpResponse.getBody())
    .then();

This keeps memory flat regardless of payload size.


Scope — Full Audit Required

Since the affected scope is not fully characterized, all Sztab service classes that call Sztabina via WebClient must be audited. Suspected or confirmed affected interactions:

  • File blob fetch
  • Diff / patch fetch
  • Repo tree listing (lower individual risk but should be verified)
  • Any pack / clone proxying if present
  • Any other Sztabina-sourced response currently collected with bodyToMono(byte[].class) or bodyToMono(String.class)

A grep for bodyToMono and bodyToFlux in the Sztabina client layer is the recommended starting point for the audit.


Acceptance Criteria

  •  All WebClient calls to Sztabina that carry potentially large payloads use bodyToFlux(DataBuffer.class) or equivalent streaming extraction — no bodyToMono(byte[].class) or bodyToMono(String.class) on unbounded responses.
  •  DataBufferUtils.release() is called on every consumed DataBuffer to prevent Netty off-heap memory leaks.
  •  The maxInMemorySize override is removed or reduced back to the WebFlux default once streaming is fully in place.
  •  A large file (>10MB blob) can be fetched and rendered without a DataBufferLimitException or observable heap spike.
  •  Regression-tests with a repo containing at least one large binary blob and a large diff.
  •  No DataBufferLimitException appears in logs under normal repo browsing.

  • rk@tigase.net commented 5 days ago

    Estimate

    2 developer-days (7–12.5h)

    TaskEstimate
    Grep + audit all WebClient call sites in the Sztabina client layer, categorize by risk1–2h
    Streaming refactor for blob fetch (highest risk — pipe to ServerHttpResponse)2–3h
    Streaming refactor for diff / patch fetch1–2h
    Streaming refactor for tree listing (verify or clear)0.5–1h
    Any pack / clone proxying if present1–2h
    Remove maxInMemorySize override, smoke test locally0.5h
    Regression pass (large blob + large diff)1–2h

    Main wildcard: call sites that feed into a non-trivial processing pipeline between receipt and response. Pure pass-through is straightforward; any parsing or transformation in the middle makes the streaming refactor more involved.

  • rk@tigase.net changed title 5 days ago
    Previous Value Current Value
    Replace WebFlux buffer ceiling with streaming for Sztabina→Sztab responses
    Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses
issue 1 of 1
Type
Bug
Priority
Normal
Assignee
Version
1.10.1
Sprints
n/a
Customer
n/a
Issue Votes (0)
Watchers (3)
Reference
SZ-125
Please wait...
Page is in error, reload to recover