Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses (SZ-125)
rk@tigase.net opened 1 month ago

Labels: enhancement reliability backend sztabina-integration Priority: High Milestone: MVP (blocker)


Background

During testing, Sztab crashed with a non-obvious exception originating in the Netty/WebFlux codec layer (see: https://tigase.dev/sztab/~issues/119). Root cause: the Sztab WebClient (consumer) was collecting the full Sztabina (producer) HTTP response body into an in-memory buffer before processing it. For large git payloads — blobs, diffs, pack data, or tree listings — this buffer overflowed the WebFlux default codec limit.

Sztabina had no visibility into the overflow; it served the response normally. The failure manifested entirely on the Sztab side as a cryptic DataBufferLimitException or similar Reactor Netty codec error, with no indication in the stack trace that response size was the actual cause. This class of failure is silent until it isn't — it will recur on any sufficiently large repository in production.


Tentative Fix (currently in place)

The WebClient codec configuration was patched with an approximate max-in-memory size ceiling:

WebClient.builder()
    .codecs(configurer -> configurer
        .defaultCodecs()
        .maxInMemorySize(N_BYTES)) // approximate ceiling
    .build();

This is a stopgap. It is brittle because:

  • The ceiling is a guess; any sufficiently large repo or file will breach it again.
  • It buffers the entire payload in JVM heap, creating GC pressure on large operations.
  • It silently works until it doesn't — no graceful degradation, same opaque exception.
  • It is not a fix; it is a raised floor on the same trap.

Desired Solution

Replace the collect-to-body pattern with proper reactive streaming wherever Sztab consumes Sztabina responses.

Before (buffered — current stopgap):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToMono(byte[].class) // full payload materialized in heap
    .block();

After (streaming):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToFlux(DataBuffer.class)
    .map(dataBuffer -> {
        byte[] chunk = new byte[dataBuffer.readableByteCount()];
        dataBuffer.read(chunk);
        DataBufferUtils.release(dataBuffer); // prevent Netty memory leak
        return chunk;
    });

For endpoints that serve content directly to the browser (e.g. blob download), the preferred pattern is to pipe the Flux<DataBuffer> directly into the Spring ServerHttpResponse without ever collecting:

DataBufferUtils.write(flux, serverHttpResponse.getBody())
    .then();

This keeps memory flat regardless of payload size.


Scope — Full Audit Required

Since the affected scope is not fully characterized, all Sztab service classes that call Sztabina via WebClient must be audited. Suspected or confirmed affected interactions:

  • File blob fetch
  • Diff / patch fetch
  • Repo tree listing (lower individual risk but should be verified)
  • Any pack / clone proxying if present
  • Any other Sztabina-sourced response currently collected with bodyToMono(byte[].class) or bodyToMono(String.class)

A grep for bodyToMono and bodyToFlux in the Sztabina client layer is the recommended starting point for the audit.


Acceptance Criteria

  •  All WebClient calls to Sztabina that carry potentially large payloads use bodyToFlux(DataBuffer.class) or equivalent streaming extraction — no bodyToMono(byte[].class) or bodyToMono(String.class) on unbounded responses.
  •  DataBufferUtils.release() is called on every consumed DataBuffer to prevent Netty off-heap memory leaks.
  •  The maxInMemorySize override is removed or reduced back to the WebFlux default once streaming is fully in place.
  •  A large file (>10MB blob) can be fetched and rendered without a DataBufferLimitException or observable heap spike.
  •  Regression-tests with a repo containing at least one large binary blob and a large diff.
  •  No DataBufferLimitException appears in logs under normal repo browsing.

  • rk@tigase.net commented 1 month ago

    Estimate

    2 developer-days (7–12.5h)

    TaskEstimate
    Grep + audit all WebClient call sites in the Sztabina client layer, categorize by risk1–2h
    Streaming refactor for blob fetch (highest risk — pipe to ServerHttpResponse)2–3h
    Streaming refactor for diff / patch fetch1–2h
    Streaming refactor for tree listing (verify or clear)0.5–1h
    Any pack / clone proxying if present1–2h
    Remove maxInMemorySize override, smoke test locally0.5h
    Regression pass (large blob + large diff)1–2h

    Main wildcard: call sites that feed into a non-trivial processing pipeline between receipt and response. Pure pass-through is straightforward; any parsing or transformation in the middle makes the streaming refactor more involved.

  • rk@tigase.net changed title 1 month ago
    Previous Value Current Value
    Replace WebFlux buffer ceiling with streaming for Sztabina→Sztab responses
    Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses
  • rk@tigase.net referenced from other issue 4 weeks ago
  • rk@tigase.net referenced from other issue 2 weeks ago
  • rk@tigase.net commented 2 weeks ago
    rksuma@Ramakrishnans-MacBook-Pro sztab % git checkout -b feature/sz-125-streaming wolnosc
    Switched to a new branch 'feature/sz-125-streaming'
    rksuma@Ramakrishnans-MacBook-Pro sztab % 
    
  • rk@tigase.net changed state to 'In Progress' 2 weeks ago
    Previous Value Current Value
    Open
    In Progress
  • rk@tigase.net commented 2 weeks ago

    Related fixes: https://tigase.dev/sztab/~issues/120

    Current flow:

    Sztabina computes diff => writes entire response to HTTP body
    Sztab WebClient receives => buffers entire body in JVM heap => processes
    
    

    For a 50MB diff that means 50MB sits in heap before Sztab can do anything with it. Spring WebFlux enforces a 16MB limit on that buffer - anything larger throws DataBufferLimitException.

    With streaming:

    Sztabina computes diff => writes chunks as they're ready
    Sztab WebClient receives chunks => processes each chunk => forwards to browser
    
    

    Memory stays flat: at any moment only one chunk is in flight, not the whole payload.

  • rk@tigase.net commented 2 weeks ago

    With streaming in place between Sztabina and Sztab, Sztabina UI was able to create a PR with a massive diff (50 commutes) and a very large diff as a result:

    Screenshot 2026-04-27 at 6.08.11 PM.pngScreenshot 2026-04-27 at 6.08.11 PM

    Streaming works!

  • rk@tigase.net referenced from other issue 2 weeks ago
  • rk@tigase.net commented 2 weeks ago

    The primary goal is achieved:

    •  DataBufferLimitException eliminated 
      
    •   Pagination implemented and validated against LLVM 
      
    •   Follow-on (SZ-155) created and deferred 
      
    rksuma@Ramakrishnans-MacBook-Pro sztab % git branch -d feature/sz-125-streaming
    Deleted branch feature/sz-125-streaming (was c2aa07b).
    rksuma@Ramakrishnans-MacBook-Pro sztab % git push origin --delete feature/sz-125-streaming
    remote:  
    remote: Create a pull request for 'feature/sz-125-streaming' by visiting:
    remote:     https://tigase.dev/sztab/~pulls/new?target=1325:wolnosc&source=1325:feature/sz-125-streaming
    remote:  
    To https://tigase.dev/sztab.git
     - [deleted]         feature/sz-125-streaming
    rksuma@Ramakrishnans-MacBook-Pro sztab % git fetch --prune
    rksuma@Ramakrishnans-MacBook-Pro sztab % 
    rksuma@Ramakrishnans-MacBook-Pro sztab % git branch               
    git branch -r            
      backup-SZ-Error-Handling-UX-stash
      backup/SZ44-chaos
      bugfix/sz-32-backend-fixes
      experiment/git-compare
      feature/Issue-DSL-Query-Engine
      feature/SZ-48-gitrepo-insane-changes
      feature/first-rev-community-user
    * wolnosc
      origin/HEAD -> origin/wolnosc
      origin/backup/SZ44-chaos
      origin/bugfix/SZ-77-repoType-inference
      origin/bugfix/issue-17
      origin/bugfix/logout-redirect
      origin/bugfix/sz-32-backend-fixes
      origin/bugfix/sz23-improve-domain-model
      origin/bugfix/sz23-improve-domain-model-new
      origin/documentation/sz18
      origin/experiment/git-compare
      origin/feature/SZ-105-Default-Issue-Assignee
      origin/feature/SZ-48-Branch-Management-UI
      origin/feature/SZ-48-gitrepo-insane-changes
      origin/feature/SZ-50-Issue-Management-UI
      origin/feature/SZ-73-git-bot-mitigation
      origin/feature/SZ-94-sztabctl
      origin/feature/commit-discovery-pr-creation
      origin/feature/first-rev-community-user
      origin/feature/jpa-spec-queries
      origin/feature/lambda-refactor-userservice
      origin/feature/repo-browser-jgit
      origin/feature/sz-14-authz-rbac-static
      origin/feature/sz-26-web-ui-scaffolding
      origin/feature/sz-39-admin-user-management
      origin/feature/sz-4-spring-security-authn-authz
      origin/improvement/sz26-sztab-backend-for-web-ui
      origin/master
      origin/task/oci-staging-k3s-staging-cluster
      origin/wolnosc
    rksuma@Ramakrishnans-MacBook-Pro sztab % 
    
    
  • rk@tigase.net changed state to 'Closed' 2 weeks ago
    Previous Value Current Value
    In Progress
    Closed
issue 1 of 1
Type
Bug
Priority
Normal
Assignee
Version
1.10.1
Sprints
n/a
Customer
n/a
Issue Votes (0)
Watchers (2)
Reference
SZ-125
Please wait...
Page is in error, reload to recover