-
| Task | Hours |
| Clone Linux kernel into staging Sztabina and create test project | 1h |
| Create PRs between release tags (v6.7..v6.8) | 0.5h |
| Run diff/detail endpoints, capture results | 0.5h |
| Analyze failure point — at what diff size does 16MB break | 1h |
| Document findings, update SZ-79 with results | 0.5h |
| Evaluate streaming approach if buffer fails | 1.5h |
| Total | 5h |
-
Testing Sztab with https://github.com/llvm/llvm-project.git
-
Created the llvm-project project in Sztab.
-
Clone LLVM into Sztab server:
$ cd llvm-project
$ git remote add llvm https://github.com/llvm/llvm-project.git
$ git fetch llvm
$ git push origin refs/remotes/llvm/main:refs/heads/main
-
-
Steps taken:
- Created llvm-project project in Sztab UI, cloned the empty Sztabina repo locally.
- Added GitHub LLVM as a remote and attempted git fetch --depth=1 — timed out on first attempt, succeeded on retry at ~6 MB/s pulling 1.43 GiB.
- Force-pushed the shallow clone to Sztabina — rejected: "shallow update not allowed." Sztabina correctly refuses shallow history.
- Ran git fetch --unshallow llvm to convert to full history — pulled an additional 1.63 GiB (5.8M objects).
- Force-pushed the full repo to Sztabina — rejected: HTTP 400 from Caddy mid-transfer after writing all 2.58 GiB at 72 MB/s.
The Caddyfile has max_size 100MB for git traffic, but the LLVM push was 2.58GB — 25x over the limit. Caddy cut it off with the 400. Tentatively amending Caddy with
request_body { max_size 5GB }
to resume the test.
-
Increased Caddy max payload to 5GB and restarted reverse proxy:
Previous Caddyfile:

handle @git {
    request_body {
        max_size 100MB
    }

rksuma@Ramakrishnans-MacBook-Pro caddy % docker exec release-caddy-1 caddy reload --config /etc/caddy/Caddyfile
rksuma@Ramakrishnans-MacBook-Pro caddy % docker exec release-caddy-1 cat /etc/caddy/Caddyfile | grep -A3 "request_body"
    request_body {
        max_size 5GB
    }
    forward_auth sztab-backend:8181 {
rksuma@Ramakrishnans-MacBook-Pro caddy %

Retesting:
rksuma@Ramakrishnans-MacBook-Pro llvm-project % cd ~/siva/llvm-project
git push origin refs/remotes/llvm/main:refs/heads/main --force
Enumerating objects: 6987304, done.
Counting objects: 100% (6987304/6987304), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1321520/1321520), done.
error: RPC failed; HTTP 400 curl 22 The requested URL returned error: 400
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (6987304/6987304), 2.61 GiB | 73.57 MiB/s, done.
Total 6987304 (delta 5698288), reused 6861140 (delta 5572763), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date
rksuma@Ramakrishnans-MacBook-Pro llvm-project %

Still 400. The request_body limit isn't the only gate — something else is rejecting it. Checking the Caddy access log for the actual error detail...
rksuma@Ramakrishnans-MacBook-Pro llvm-project % docker exec release-caddy-1 tail -20 /data/logs/sztab-access.log
"bytes_read":327283                              ← only 327KB read before rejection
"status":400
"Content-Type":["text/plain; charset=utf-8"]     <== Sztabina's own response
//...
rksuma@Ramakrishnans-MacBook-Pro llvm-project %

The http/400 is coming from Sztabina itself, not Caddy this time. Caddy is now passing the request through (the 5GB limit is working), but Sztabina is rejecting the pack after only 327KB. This is a Sztabina-side limit.
-
Looking at the Sztabina git_http_handler.go:
// ServeHTTP handles Git HTTP smart protocol requests
// Handles: /git/{repo}.git/info/refs, /git/{repo}.git/git-upload-pack, etc.
func (h *GitHTTPHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { ... }

I find that I did not configure a body size limit in the handler: it delegates straight to git-http-backend via CGI. So the 400 is coming from git-http-backend itself, which has its own limit on pack size via the GIT_HTTP_MAX_REQUEST_BUFFER environment variable, defaulting to 10MB. This is the flow:
handler.ServeHTTP(w, r)  // implemented in git_http_handler.go
  => spawns git-http-backend as a CGI process
  => git-http-backend reads GIT_HTTP_MAX_REQUEST_BUFFER from env
  => not set => uses the 10MB default
  => pack is 2.6GB => rejects with 400

Will configure this in env:
"GIT_HTTP_MAX_REQUEST_BUFFER=5368709120", -
Resuming testing:
rksuma@Ramakrishnans-MacBook-Pro llvm-project % sleep 10; cd ~/siva/llvm-project
git push origin refs/remotes/llvm/main:refs/heads/main --force
Enumerating objects: 6987304, done.
Counting objects: 100% (6987304/6987304), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1321520/1321520), done.
error: RPC failed; HTTP 400 curl 22 The requested URL returned error: 400
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (6987304/6987304), 2.57 GiB | 65.90 MiB/s, done.
Total 6987304 (delta 5698415), reused 6861013 (delta 5572763), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date
rksuma@Ramakrishnans-MacBook-Pro llvm-project %

Failed again. The Caddy log shows:
"status":400 "resp_headers":{"Content-Length":["48"],"Content-Type":["text/plain; charset=utf-8"],"Server":["Caddy"]...}I think this http/400 is coming from Sztabina itself: because the response has Content-Type: text/plain; charset=utf-8 — that's Go's http.Error() format. Caddy returns JSON. git-http-backend returns git protocol errors. Only Go's own HTTP layer returns plain text 400s like that.
Hence the bottleneck has shifted right:
(git client) --> (caddy [memory bottleneck fixed]) --> (Sztab: PAT validation) --> (Sztabina [memory bottleneck]) => HTTP 400 request body too large
I think this could be Go's net/http default limit; I will try increasing net/http's limit to 5GB:
// sztabina/handlers/git_http_handler.go
r.Body = http.MaxBytesReader(w, r.Body, 5<<30)
handler.ServeHTTP(w, r)
-
Testing with the change above:
rksuma@Ramakrishnans-MacBook-Pro llvm-project % sleep 10 && cd ~/siva/llvm-project && git push origin refs/remotes/llvm/main:refs/heads/main --force
Enumerating objects: 6987304, done.
Counting objects: 100% (6987304/6987304), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1321520/1321520), done.
error: RPC failed; HTTP 400 curl 22 The requested URL returned error: 400
send-pack: unexpected disconnect while reading sideband packet
Writing objects: 100% (6987304/6987304), 2.57 GiB | 64.81 MiB/s, done.
Total 6987304 (delta 5698402), reused 6861026 (delta 5572763), pack-reused 0 (from 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date
rksuma@Ramakrishnans-MacBook-Pro llvm-project %

Failed again.
-
Checked Sztabina main.go: starting web server with default parameters:
log.Fatal(http.ListenAndServe(":8085", nil))

Go's default http.Server has a ReadTimeout of zero (unlimited) but no explicit MaxBytesReader at the server level.
I am not sure if the failure is due to size (memory) constraint or timeout constraint. Let me try replacing the default server with one that has no read timeout:
server := &http.Server{
    Addr:         ":8085",
    ReadTimeout:  0,
    WriteTimeout: 0,
    IdleTimeout:  0,
}
log.Fatal(server.ListenAndServe())
-
In trying to push the full LLVM repository (~2.6GB) into Sztabina for large-repo validation, Sztab kept failing with http/400. I have found and ironed out bottlenecks in two places in the Sztab git data flow, and am stuck at a third:
- Caddy had a 100MB request body limit — raised to 5GB. (Fixed)
- Sztabina's Go HTTP server had a default body limit — added MaxBytesReader to raise it. (Fixed)
- Go's net/http/cgi package — the CGI package that Sztabina uses to delegate to git-http-backend seems to have a hardcoded 512MB internal read buffer. This is my guess: logs show a consistent cutoff at ~512MB regardless of what limits I set above it.
Trying to find a solution for this. I will have to find a way to sidestep CGI.
-
Tested this fix: rewrote git_http_handler.go to bypass net/http/cgi entirely. git-http-backend is now spawned directly as an exec.Cmd subprocess with r.Body piped straight to stdin — no intermediate buffering, no size ceiling. The entire bottom half of

func (h *GitHTTPHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { ... }

changed. Before it was just:

handler := &cgi.Handler{...}
r.Body = http.MaxBytesReader(w, r.Body, 5<<30)
handler.ServeHTTP(w, r)

Calling handler.ServeHTTP(w, r) delegated the entire request handling to the CGI package — it took care of spawning git-http-backend, setting up the environment, and piping data in both directions. But the CGI package's buffering imposes the limit, causing the third bottleneck. So instead of using CGI as a gateway to git-http-backend, I now spawn git-http-backend directly, pipe the request body (coming from the git client) to its stdin, and stream its stdout back to the client with no intermediate buffering. Updated code:

env := []string{...} // manually constructed CGI env
cmd := exec.Command("/usr/libexec/git-core/git-http-backend")
cmd.Env = env
cmd.Dir = repoPath
cmd.Stdin = r.Body // direct pipe — the key change
cmd.Stderr = os.Stderr
stdout, err := cmd.StdoutPipe()
// error handling...
if err := cmd.Start(); err != nil {
    // error handling...
}
// git-http-backend outputs a CGI-format response...
if err := writeCGIResponse(w, stdout); err != nil {
    // error handling...
}
if err := cmd.Wait(); err != nil {
    // error handling...
}

The key change is

cmd.Stdin = r.Body

that single line is what eliminates the 512MB ceiling. Everything else — cmd.StdoutPipe(), writeCGIResponse(), cmd.Wait() — is the plumbing needed to make that work correctly. The cgi.Handler did all of that invisibly before.
Now I am able to push the full 2.8GB LLVM repo and have it resident in Sztab:
rksuma@Ramakrishnans-MacBook-Pro Jerry-project % sleep 5; sleep 5 && cd ~/siva/llvm-project && git push origin refs/remotes/llvm/main:refs/heads/main --force
Enumerating objects: 6987304, done.
Counting objects: 100% (6987304/6987304), done.
Delta compression using up to 12 threads
Compressing objects: 100% (1321520/1321520), done.
Writing objects: 100% (6987304/6987304), 2.59 GiB | 27.95 MiB/s, done.
Total 6987304 (delta 5698354), reused 6861072 (delta 5572763), pack-reused 0 (from 0)
remote: Resolving deltas: 6% (341903/5698354)
remote: Resolving deltas: 100% (5698354/5698354), done.
remote: Checking connectivity: 6987304, done.
remote: 2026/04/24 18:51:35 notify: forwarded repo=llvm-project ref=refs/heads/main commits=577839
To http://localhost/git/llvm-project.git
 + fc4afa769384...dc34d163d8c9 llvm/main -> main (forced update)
rksuma@Ramakrishnans-MacBook-Pro llvm-project %
rksuma@Ramakrishnans-MacBook-Pro llvm-project % docker exec release-sztabina-1 du -sh /repos/llvm-project.git/
2.8G	/repos/llvm-project.git/
rksuma@Ramakrishnans-MacBook-Pro llvm-project %

After resolving this third bottleneck, we have the pipeline:
(git client) --> (Caddy [Fixed: 5GB]) --> (Sztab: PAT validation) --> (Sztabina [Fixed by bypassing CGI]) --> git-http-backend [direct pipe]
                                                                                                                       |
                                                                                                                       V
                                                                                                             2.8GB LLVM resident

Result:
- 6,987,304 objects pushed
- 5,698,354 deltas resolved
- 577,839 commits
- 2.59 GB transferred
- 2.8G resident in Sztabina
-
Now I have successfully loaded the large repo into Sztab. Next step is hitting the diff endpoint against LLVM to trigger DataBufferLimitException on the Spring side — that's the actual SZ-120 goal.
-
Found a new bottleneck: unnecessary unshallow on internal repos
When Sztab requests a diff or commit comparison,
(Sztabina) <== (Sztab) <=============== REST or UI client
               diff or commit

Sztabina clones the repo into a temp directory via its own HTTP endpoint (http://sztabina:8085/git/...) and then calls EnsureFullHistory() unconditionally. For internal repos this is wrong on two counts:
- The bare repo at /repos already has full history - there is nothing to unshallow.
- The unshallow fetch goes back through git-http-backend over HTTP, which is fragile and slow for large repos.
The fix is to skip EnsureFullHistory() when util.IsInternalRepo(gitURL) is true. For external repos the behavior is correct and should be preserved.
This is also the root cause of the current CompareCommitsByURL http/500 - the unshallow fetch is failing with fatal: expected acknowledgments due to a git protocol v2 framing issue in the new writeCGIResponse handler. Bypassing EnsureFullHistory for internal repos avoids the problem entirely without needing to fix the protocol framing.
-
Now we have removed all the bottlenecks in cloning and computing diffs with a large repo.
We have reached the point where Sztabina successfully computes the large diff and returns it through the Spring WebFlux buffer — and, as expected, the buffer overflows:
Request:  (Sztabina) <== (Sztab) <=============== REST or UI client
                         diff or commit
Response: (Sztabina) ---> [WebFlux (DataBufferLimitException)] --> (Sztab)
                          Buffer overflow!!!

2026-04-27T18:19:43.148Z WARN 1 --- [ctor-http-nio-6] r.netty.http.client.HttpClientConnect : [fc59af95-2, L:/172.18.0.4:53312 ! R:sztabina/172.18.0.2:8085] The connection observed an error
reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response
2026-04-27T18:19:43.718Z WARN 1 --- [0.0-8181-exec-7] c.s.sztabina.client.impl.SztabinaClient : Failed to diff refs via Sztabina
org.springframework.web.reactive.function.client.WebClientRequestException: Connection prematurely closed BEFORE response
	at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136) ~[spring-webflux-6.1.13.jar!/:6.1.13]
	Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
	*__checkpoint ⇢ Request to POST http://sztabina:8085/repos/compare/diff-by-url [DefaultWebClient]
		at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136) ~[spring-webflux-6.1.13.jar!/:6.1.13]
		at reactor.netty.http.client.HttpClientConnect$HttpObserver.onUncaughtException(HttpClientConnect.java:403) ~[reactor-netty-http-1.1.22.jar!/:1.1.22]
		at reactor.netty.ReactorNetty$CompositeConnectionObserver.onUncaughtException(ReactorNetty.java:708) ~[reactor-netty-core-1.1.22.jar!/:1.1.22]
		at reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onUncaughtException(DefaultPooledConnectionProvider.java:223) ~[reactor-netty-core-1.1.22.jar!/:1.1.22]
		at reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnection.onUncaughtException(DefaultPooledConnectionProvider.java:476) ~[reactor-netty-core-1.1.22.jar!/:1.1.22]
	Suppressed: java.lang.Exception: #block terminated with an error
		at com.sztab.sztabina.client.impl.SztabinaClient.diffByUrl(SztabinaClient.java:228) ~[!/:na]
		at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:126) ~[spring-security-web-6.3.3.jar!/:6.3.3]
		at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:120) ~[spring-security-web-6.3.3.jar!/:6.3.3]
Caused by: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response
2026-04-27T18:19:43.783Z WARN 1 --- [0.0-8181-exec-7] c.s.service.impl.PullRequestServiceImpl : Failed to fetch Git data for PR 2: Diff too large to process (exceeded in-memory buffer limit of 16777216 bytes).
2026-04-27T18:25:32.509Z WARN 1 --- [0.0-8181-exec-9] c.s.p.a.a.WorkflowAvailabilityAspect : Workflow availability check: workflow=PULL_REQUEST_MANAGEMENT, enabled=true, target=ResponseEntity com.sztab.controller.PullRequestController.getPullRequestDetail(Long,Authentication)
2026-04-27T18:25:47.369Z WARN 1 --- [0.0-8181-exec-9] c.s.sztabina.client.impl.SztabinaClient : Failed to compare commits via Sztabina
org.springframework.web.reactive.function.client.WebClientRequestException: No route to host: sztabina/172.18.0.2:8085
	at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136) ~[spring-webflux-6.1.13.jar!/:6.1.13]
	Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
		at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136) ~[spring-webflux-6.1.13.jar!/:6.1.13]
	Suppressed: java.lang.Exception: #block terminated with an error
		at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:126) ~[spring-security-web-6.3.3.jar!/:6.3.3]
		at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:120) ~[spring-security-web-6.3.3.jar!/:6.3.3]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: sztabina/172.18.0.2:8085
Caused by: java.net.NoRouteToHostException: No route to host
2026-04-27T18:25:47.478Z WARN 1 --- [0.0-8181-exec-9] c.s.service.impl.PullRequestServiceImpl : Failed to fetch Git data for PR 2: Failed to compare commits via Sztabina
^C
rksuma@Ramakrishnans-MacBook-Pro sztab %

What is happening:
- Sztabina computes the giant diff and starts streaming it back
- Sztab's WebFlux client starts buffering the response
- At 16MB the WebFlux buffer overflows — Sztab closes the connection abruptly
- Sztabina, mid-stream, suddenly has no client to write to — the connection is gone
- Sztabina's git diff subprocess may still be running and writing to a broken pipe
Now we need to replace WebFlux buffer with streaming.
-
Sztab is able to clone and metabolize large repos (~2.8GB):
- 6,987,304 objects pushed
- 5,698,354 deltas resolved
- 577,839 commits
- 2.59 GB transferred
- 2.8G resident in Sztabina

Next, I chose to create a PR for commits in the repo between a base that is far away from head. The diff is very large (>16MB) and while passing it to Sztab, Sztabina runs into the buffer overflow:
https://tigase.dev/sztab/~issues/120#IssueComment-131491
This is exactly what I wanted to establish before I replaced the buffered exchange between Sztabina and Sztab with streaming (SZ-125: https://tigase.dev/sztab/~issues/125).
{Sztabina} ==> [WebFlux buffer] ==> {Sztab backend} ==> UI/REST client
                  (overflow)

Following the axiom "measure first, fix second": I instrumented Prometheus, established the failure baseline with LLVM, and now have the evidence to justify the streaming fix.
I am now closing SZ-120 and starting on SZ-125:
rksuma@Ramakrishnans-MacBook-Pro sztab % git status
On branch wolnosc
Your branch is up to date with 'origin/wolnosc'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   sztabina/handlers/git_http_handler.go
	modified:   sztabina/service/gitservice.go
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.swo
	.swp
	deploy/docker/.swp
	deploy/helm/sztab/caddy/Caddyfile.bak
	docs/management/
no changes added to commit (use "git add" and/or "git commit -a")
rksuma@Ramakrishnans-MacBook-Pro sztab % git add sztabina/handlers/git_http_handler.go sztabina/service/gitservice.go
git commit -m "SZ-120: fix Sztabina internal repo handling for large diff validation

- Skip EnsureFullHistory() for internal repos — bare repo already has full history
- Forward Git-Protocol header to git-http-backend for protocol v2 support
- Confirmed DataBufferLimitException at 16MB WebFlux buffer ceiling with LLVM diff
- Evidence establishes streaming requirement for SZ-125"
[wolnosc be908c2] SZ-120: fix Sztabina internal repo handling for large diff validation
 2 files changed, 38 insertions(+), 48 deletions(-)
rksuma@Ramakrishnans-MacBook-Pro sztab %
| Type | Task |
| Priority | Normal |
| Assignee | |
| Version | 1.10.0 |
| Sprints | n/a |
| Customer | n/a |
Follow-up to the 16MB buffer fix. Want to validate that the system handles diffs from large real-world repos without hitting the buffer or falling over performance-wise.
Test candidate: LLVM — already resident in Sztabina locally (2.8GB, 577k commits). Start here before touching staging.
Things to check:
16MB will almost certainly be insufficient for large LLVM diffs. That's the point — confirm the failure mode is a clean DataBufferLimitException (per SZ-73) rather than a silent hang or OOM. Results drive the streaming decision.
Linux kernel and Chromium are candidates for staging but overkill for the initial local run. LLVM is enough to establish the failure threshold.