Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses (SZ-125)

rk@tigase.net opened 2 months ago

Labels: enhancement reliability backend sztabina-integration Priority: High Milestone: MVP (blocker)

Background

During testing, Sztab crashed with a non-obvious exception originating in the Netty/WebFlux codec layer (see: https://tigase.dev/sztab/~issues/119). Root cause: the Sztab WebClient (consumer) was collecting the full Sztabina (producer) HTTP response body into an in-memory buffer before processing it. For large git payloads — blobs, diffs, pack data, or tree listings — this buffer overflowed the WebFlux default codec limit.

Sztabina had no visibility into the overflow; it served the response normally. The failure manifested entirely on the Sztab side as a cryptic DataBufferLimitException or similar Reactor Netty codec error, with no indication in the stack trace that response size was the actual cause. This class of failure is silent until it isn't — it will recur on any sufficiently large repository in production.

Tentative Fix (currently in place)

The WebClient codec configuration was patched with an approximate max-in-memory size ceiling:

WebClient.builder()
    .codecs(configurer -> configurer
        .defaultCodecs()
        .maxInMemorySize(N_BYTES)) // approximate ceiling
    .build();

This is a stopgap. It is brittle because:

The ceiling is a guess; any sufficiently large repo or file will breach it again.
It buffers the entire payload in JVM heap, creating GC pressure on large operations.
It silently works until it doesn't — no graceful degradation, same opaque exception.
It is not a fix; it is a raised floor on the same trap.

Desired Solution

Replace the collect-to-body pattern with proper reactive streaming wherever Sztab consumes Sztabina responses.

Before (buffered — current stopgap):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToMono(byte[].class) // full payload materialized in heap
    .block();

After (streaming):

webClient.get()
    .uri(...)
    .retrieve()
    .bodyToFlux(DataBuffer.class)
    .map(dataBuffer -> {
        byte[] chunk = new byte[dataBuffer.readableByteCount()];
        dataBuffer.read(chunk);
        DataBufferUtils.release(dataBuffer); // prevent Netty memory leak
        return chunk;
    });

For endpoints that serve content directly to the browser (e.g. blob download), the preferred pattern is to pipe the Flux<DataBuffer> directly into the Spring ServerHttpResponse without ever collecting:

DataBufferUtils.write(flux, serverHttpResponse.getBody())
    .then();

This keeps memory flat regardless of payload size.

Scope — Full Audit Required

Since the affected scope is not fully characterized, all Sztab service classes that call Sztabina via WebClient must be audited. Suspected or confirmed affected interactions:

File blob fetch
Diff / patch fetch
Repo tree listing (lower individual risk but should be verified)
Any pack / clone proxying if present
Any other Sztabina-sourced response currently collected with bodyToMono(byte[].class) or bodyToMono(String.class)

A grep for bodyToMono and bodyToFlux in the Sztabina client layer is the recommended starting point for the audit.

Acceptance Criteria

All WebClient calls to Sztabina that carry potentially large payloads use bodyToFlux(DataBuffer.class) or equivalent streaming extraction — no bodyToMono(byte[].class) or bodyToMono(String.class) on unbounded responses.
DataBufferUtils.release() is called on every consumed DataBuffer to prevent Netty off-heap memory leaks.
The maxInMemorySize override is removed or reduced back to the WebFlux default once streaming is fully in place.
A large file (>10MB blob) can be fetched and rendered without a DataBufferLimitException or observable heap spike.
Regression-tests with a repo containing at least one large binary blob and a large diff.
No DataBufferLimitException appears in logs under normal repo browsing.

Activities

rk@tigase.net commented 2 months ago

Estimate

2 developer-days (7–12.5h)

Task	Estimate
Grep + audit all `WebClient` call sites in the Sztabina client layer, categorize by risk	1–2h
Streaming refactor for blob fetch (highest risk — pipe to `ServerHttpResponse`)	2–3h
Streaming refactor for diff / patch fetch	1–2h
Streaming refactor for tree listing (verify or clear)	0.5–1h
Any pack / clone proxying if present	1–2h
Remove `maxInMemorySize` override, smoke test locally	0.5h
Regression pass (large blob + large diff)	1–2h

Main wildcard: call sites that feed into a non-trivial processing pipeline between receipt and response. Pure pass-through is straightforward; any parsing or transformation in the middle makes the streaming refactor more involved.

rk@tigase.net changed title 2 months ago

Previous Value	Current Value
Replace WebFlux buffer ceiling with streaming for Sztabina→Sztab responses	Replace WebFlux buffer ceiling with streaming for Sztabina => Sztab responses

rk@tigase.net referenced from other issue 2 months ago

Monitor Sztab JVM vital signs (SZ-143)

Closed
rk@tigase.net referenced from other issue 1 month ago

Task: Validate diff handling against large public repositories (SZ-120)

Closed

rk@tigase.net commented 1 month ago

rksuma@Ramakrishnans-MacBook-Pro sztab % git checkout -b feature/sz-125-streaming wolnosc
Switched to a new branch 'feature/sz-125-streaming'
rksuma@Ramakrishnans-MacBook-Pro sztab %

rk@tigase.net changed state to 'In Progress' 1 month ago
Previous Value Current Value
Open

In Progress
rk@tigase.net commented 1 month ago
Related fixes: https://tigase.dev/sztab/~issues/120

Current flow:

Sztabina computes diff => writes entire response to HTTP body Sztab WebClient receives => buffers entire body in JVM heap => processes

For a 50MB diff that means 50MB sits in heap before Sztab can do anything with it. Spring WebFlux enforces a 16MB limit on that buffer - anything larger throws DataBufferLimitException.

With streaming:

Sztabina computes diff => writes chunks as they're ready Sztab WebClient receives chunks => processes each chunk => forwards to browser

Memory stays flat: at any moment only one chunk is in flight, not the whole payload.
rk@tigase.net commented 1 month ago

With streaming in place between Sztabina and Sztab, Sztabina UI was able to create a PR with a massive diff (50 commutes) and a very large diff as a result:

Screenshot 2026-04-27 at 6.08.11 PM

Streaming works!
rk@tigase.net referenced from other issue 1 month ago

SZ-125 follow-on: Load more / infinite scroll for large PR diffs (SZ-155)

Open

Previous Value	Current Value
Open	In Progress

rk@tigase.net commented 1 month ago

The primary goal is achieved:

```
 DataBufferLimitException eliminated 
```

  Pagination implemented and validated against LLVM

  Follow-on (SZ-155) created and deferred

rksuma@Ramakrishnans-MacBook-Pro sztab % git branch -d feature/sz-125-streaming
Deleted branch feature/sz-125-streaming (was c2aa07b).
rksuma@Ramakrishnans-MacBook-Pro sztab % git push origin --delete feature/sz-125-streaming
remote:  
remote: Create a pull request for 'feature/sz-125-streaming' by visiting:
remote:     https://tigase.dev/sztab/~pulls/new?target=1325:wolnosc&source=1325:feature/sz-125-streaming
remote:  
To https://tigase.dev/sztab.git
 - [deleted]         feature/sz-125-streaming
rksuma@Ramakrishnans-MacBook-Pro sztab % git fetch --prune
rksuma@Ramakrishnans-MacBook-Pro sztab % 
rksuma@Ramakrishnans-MacBook-Pro sztab % git branch               
git branch -r            
  backup-SZ-Error-Handling-UX-stash
  backup/SZ44-chaos
  bugfix/sz-32-backend-fixes
  experiment/git-compare
  feature/Issue-DSL-Query-Engine
  feature/SZ-48-gitrepo-insane-changes
  feature/first-rev-community-user
* wolnosc
  origin/HEAD -> origin/wolnosc
  origin/backup/SZ44-chaos
  origin/bugfix/SZ-77-repoType-inference
  origin/bugfix/issue-17
  origin/bugfix/logout-redirect
  origin/bugfix/sz-32-backend-fixes
  origin/bugfix/sz23-improve-domain-model
  origin/bugfix/sz23-improve-domain-model-new
  origin/documentation/sz18
  origin/experiment/git-compare
  origin/feature/SZ-105-Default-Issue-Assignee
  origin/feature/SZ-48-Branch-Management-UI
  origin/feature/SZ-48-gitrepo-insane-changes
  origin/feature/SZ-50-Issue-Management-UI
  origin/feature/SZ-73-git-bot-mitigation
  origin/feature/SZ-94-sztabctl
  origin/feature/commit-discovery-pr-creation
  origin/feature/first-rev-community-user
  origin/feature/jpa-spec-queries
  origin/feature/lambda-refactor-userservice
  origin/feature/repo-browser-jgit
  origin/feature/sz-14-authz-rbac-static
  origin/feature/sz-26-web-ui-scaffolding
  origin/feature/sz-39-admin-user-management
  origin/feature/sz-4-spring-security-authn-authz
  origin/improvement/sz26-sztab-backend-for-web-ui
  origin/master
  origin/task/oci-staging-k3s-staging-cluster
  origin/wolnosc
rksuma@Ramakrishnans-MacBook-Pro sztab %

rk@tigase.net changed state to 'Closed' 1 month ago
Previous Value Current Value
In Progress

Closed
Login to comment

Previous Value	Current Value
In Progress	Closed

Type	Bug
Priority	Normal
Assignee	rk@tigase.net
Version	1.10.1
Sprints	n/a
Customer	n/a

Issue Votes (0)

Watchers (2)

Reference

SZ-125