-
Bot Attack Surface Area

Description
1) Test Approach:
- Choose tools to measure impact of Bots
- Choose tools to induce Bot like stress
- Establish a baseline of resource usage under bot attack before applying bot guardrails
- After each layer is added, verify the layer holds under the same load. We will watch the sztab-backend and sztabina pods specifically during bot stress tests.
2) Identify tools to measure impact of Bot (CPU usage or I/O usage)
- Grafana + Prometheus — we already have this or it's easy to add to the cluster via Helm.
- Gives us CPU, memory, and network I/O per pod.
3) Identify tools to induce Bot-like stress
A) k6 — open source load testing tool
We can write scripts in TypeScript and simulate concurrent anonymous/bot traffic against specific endpoints.
Example:
```typescript
import http, { RefinedResponse, ResponseType } from 'k6/http';
import { check } from 'k6';

export default function (): void {
  const res: RefinedResponse<ResponseType> = http.get(
    'https://staging.sztab.com/api/projects/1/pulls/5/diff',
    {
      headers: { 'User-Agent': 'GPTBot/1.0' },
    }
  );
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
}
```

B) Java with Gatling
This is essentially the Java equivalent of k6. Since the broader Tigase team is Java-first, Gatling scripts would feel more natural to them and fit into Maven builds. Shall I use this option? This way the bot simulation scripts can be reused for other Tigase projects.
Kotlin developers can use Gatling in Kotlin; Java developers can use Gatling in Java
C) JMeter
JMeter test plans (JMX scripts) can serve a dual purpose:
- Bot simulation
- Stress test
However, k6 is frictionless and will work "out of the box".
4) Layered approach to Bot mitigation
a) Layer 1: Spring Security — anonymous request blocking (lowest effort, highest impact)
b) Layer 2: Caddy — rate limiting + bot filtering at the edge (before Spring even sees the request)
c) Layer 3: robots.txt (soft signal, respected by well-behaved bots)
d) Layer 4: Permission-based access (Artur's suggestion — most flexible)
4.1 Layer 1
The simplest approach is to identify the most expensive APIs and mandate authentication for the shortlisted endpoints.
With Spring this is easy: in the Spring Security policy, add `.authenticated()` for such endpoints. APIs that trigger git clone and git merge are candidates.
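A minimal sketch of what such a policy entry could look like, assuming Spring Security 6 style configuration; the endpoint paths and class name are illustrative, not the project's actual config:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

// Hypothetical config class: only the shortlisted expensive endpoints are
// forced to authenticate; everything else keeps the existing policy.
@Configuration
public class ExpensiveEndpointSecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                // git-backed, expensive: diffs and file trees
                .requestMatchers("/api/projects/*/pulls/*/diff",
                                 "/api/projects/*/files/**").authenticated()
                // the rest of the chain is left to the existing rules
                .anyRequest().permitAll());
        return http.build();
    }
}
```

Anonymous bots hitting the shortlisted endpoints then fail fast in the filter chain instead of triggering git work.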
4.2 Layer 2
Since Caddy is already our reverse proxy with `forward_auth`, we can add:

```
# Rate limiting for anonymous traffic
@anonymous not header Authorization *
@anonymous not header Cookie *
rate_limit @anonymous 10r/m

# Block known bot user agents
@bots header_regexp User-Agent `(?i)(GPTBot|ClaudeBot|CCBot|Bytespider|SemrushBot|AhrefsBot)`
respond @bots 403
```

This stops bots before they consume Spring Boot or Sztabina resources at all.
4.3 Layer 3 — robots.txt
Serve a `robots.txt` from Caddy directly, blocking AI crawlers:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /api/
Allow: /
```

This is a soft signal, respected only by well-behaved crawlers.
4.4 Layer 4 — Permission-based access
This is the existing ExternalUserPolicy / role system extended with a new dimension.
Instead of just authenticated vs anonymous, we gate by role.
Example:

```java
@PreAuthorize("hasPermission(#projectId, 'Project', 'READ_DIFFS')")
public DiffResponse getPullRequestDiff(...) { ... }
```

Roles like GUEST / COMMUNITY could be explicitly excluded from diff/search endpoints even if authenticated.
This is useful if we ever allow public read-only accounts but still want to protect expensive resources.
4.5 Layer 5 (Host Layer) — Using Host IDS (such as OSSEC)
OSSEC / Wazuh (OSSEC's modern fork) can help — it does log analysis, anomaly detection, and can trigger active responses (e.g. auto-banning an IP via iptables). But I think for now this may be overkill in Sztab's context.
-
I have assumed that the Bots/crawlers can cause performance issues alone by exhausting resources.
But bots can also attempt privilege escalation. Hence this issue is in part about security posture as well.
Data harvesting is another risk: A crawler indexing all the issues, PRs, comments, and code — even if read-only, this is a confidentiality problem for private projects and can be used for competitor intelligence gathering.
Please let me know if we should treat this as a performance issue alone in this rev.
-
Monitoring tool
Phase 1 (immediate) — kubectl top for CPU/memory across the three pods during stress tests. Free, zero setup, good enough to establish baseline.
Phase 2 (proper) — add node_exporter to the EC2 node for disk I/O, feed into Grafana alongside Caddy metrics. Full picture.
-
SZ-73 Bot Protection — Baseline Measurements
Purpose
Establish pre-mitigation resource usage baseline on staging, before any bot protection layers are applied. These numbers will be used to validate the effectiveness of each mitigation layer as it is implemented.
Environment
- Cluster: k3s on AWS EC2 (us-west-2)
- Host: ec2-35-87-145-56.us-west-2.compute.amazonaws.com
- Namespace: sztab-staging
- Image tag: sz73-bot-protection (rebased on wolnosc, no SZ-73 changes applied yet)
- Date: 2026-03-12
Idle Baseline (no load)
Captured via `kubectl top pods -n sztab-staging` with no active traffic.

| Pod | CPU (cores) | Memory |
| --- | --- | --- |
| sztab-backend | 5m | 369Mi |
| sztab-db | 4m | 46Mi |
| sztabina | 1m | 1Mi |
| caddy | 1m | 10Mi |
| sztab-ui | 1m | 2Mi |

Notes:
- `sztab-backend` memory at 369Mi reflects the normal Spring Boot JVM baseline (expected)
- `sztabina` and `caddy` are effectively idle
- `sztab-db` at 4m CPU reflects background PostgreSQL activity only
Bot Stress Baseline (under simulated load)
TODO: Run k6 stress test simulating anonymous bot traffic against expensive endpoints. Capture CPU and memory spike for sztab-backend, sztabina, and sztab-db.
Target Endpoints
| Endpoint | Why expensive |
| --- | --- |
| GET /api/projects/{id}/pulls/{id}/diff | Triggers git diff via Sztabina |
| GET /api/projects/{id}/issues?q=... | DSL query, DB-heavy |
| GET /api/projects/{id}/files/{branch} | Git tree traversal via Sztabina |

k6 Test Parameters
- Virtual users: TBD
- Duration: TBD
- User-Agent: `GPTBot/1.0` (simulates an AI crawler)
- Auth: none (anonymous)
Results
TODO: Fill in after k6 run.
| Pod | CPU (cores) | Memory | Delta vs idle |
| --- | --- | --- | --- |
| sztab-backend | - | - | - |
| sztab-db | - | - | - |
| sztabina | - | - | - |

Post-Mitigation Measurements
TODO: Re-run same k6 test after each layer is applied and record results here.
| Layer | Description | Backend CPU | Sztabina CPU | Notes |
| --- | --- | --- | --- | --- |
| Layer 1 | Spring Security `.authenticated()` | - | - | - |
| Layer 2 | Caddy rate limiting + bot UA blocking | - | - | - |
| Layer 3 | robots.txt | - | - | soft signal only |
| Layer 4 | Permission-based access (role gating) | - | - | - |

-
Next step: install k6 on my laptop:

```shell
rksuma@Ramakrishnans-MacBook-Pro sztab % brew install k6
//...
rksuma@Ramakrishnans-MacBook-Pro sztab % k6 version
k6 v1.6.1 (commit/devel, go1.26.0, darwin/arm64)
```

Now I'll write a TypeScript script targeting the three expensive endpoints with a GPTBot user agent, no auth, and enough virtual users to actually stress the backend.
-
-
Results of Layer 1 testing after locking down all expensive methods with `.authenticated()` (please disregard the spurious error at the end, in deleting the test project).
Essentially, since the bot does not authenticate itself, every hit returns HTTP 403 and hence makes no real difference to the resource usage of Sztab.
```
rksuma@Ramakrishnans-MacBook-Pro sztab % ADMIN_USER=admin ADMIN_PASSWORD=SztabStagingAdmin! ./scripts/stress-test/k6/run-stress-test.sh
[INFO] === SZ-73 Bot Stress Test ===
[INFO] Base URL: http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com
[INFO] Namespace: sztab-staging
[INFO] VUs: 50
[INFO] Duration: 60s
[INFO] --- Step 1: Login ---
[INFO] Login successful.
[INFO] Logged in as user id=1
[INFO] --- Step 2: Create Sztab project ---
[INFO] Project 'SZ73 Stress Test' already exists — looking up existing project...
[INFO] Found existing project: id=16
[INFO] --- Step 3: Create issue ---
[INFO] Issue created: id=3
[INFO] --- Step 4: Create pull request ---
[INFO] Pull request created: id=3
[INFO] --- Step 5: Baseline pod metrics (idle) ---
NAME                            CPU(cores)   MEMORY(bytes)
caddy-847774bbf9-xzvnv          1m           12Mi
sztab-backend-644c77d58-r46xd   2m           432Mi
sztab-db-fb967c9d5-fs84w        2m           44Mi
sztab-ui-57764ffc4f-r9hlg       1m           3Mi
sztabina-65b5cff756-kzl4f       1m           3Mi
[INFO] --- Step 6: Run k6 stress test ---
[INFO] Watch pod metrics in another terminal: kubectl top pods -n sztab-staging --watch

  execution: local
     script: /Users/rksuma/tigase/sztab/scripts/stress-test/k6/bot-stress-test.ts
     output: -
  scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * default: 50 looping VUs for 1m0s (gracefulStop: 30s)

  █ THRESHOLDS
    http_req_duration
    ✓ 'p(95)<5000' p(95)=134.53ms

  █ TOTAL RESULTS
    checks_total.......: 69856  1161.339279/s
    checks_succeeded...: 25.00% 17464 out of 69856
    checks_failed......: 75.00% 52392 out of 69856
    ✗ status is 200 (unprotected)   ↳ 0% — ✓ 0 / ✗ 17464
    ✗ status is 401 (auth required) ↳ 0% — ✓ 0 / ✗ 17464
    ✓ status is 403 (bot blocked)
    ✗ status is 429 (rate limited)  ↳ 0% — ✓ 0 / ✗ 17464

  HTTP
    http_req_duration....: avg=71.19ms min=29.65ms med=55.02ms max=422.05ms p(90)=124.52ms p(95)=134.53ms
    http_req_failed......: 100.00% 17464 out of 17464
    http_reqs............: 17464 290.33482/s

  EXECUTION
    iteration_duration...: avg=172.15ms min=130.26ms med=155.68ms max=522.51ms p(90)=225.17ms p(95)=235.3ms
    iterations...........: 17464 290.33482/s
    vus..................: 50 min=50 max=50
    vus_max..............: 50 min=50 max=50

  NETWORK
    data_received........: 7.8 MB 129 kB/s
    data_sent............: 2.3 MB 38 kB/s

running (1m00.2s), 00/50 VUs, 17464 complete and 0 interrupted iterations
default ✓ [======================================] 50 VUs  1m0s
[INFO] --- Step 7: Pod metrics (post-stress) ---
NAME                            CPU(cores)   MEMORY(bytes)
caddy-847774bbf9-xzvnv          99m          20Mi
sztab-backend-644c77d58-r46xd   252m         440Mi
sztab-db-fb967c9d5-fs84w        2m           45Mi
sztab-ui-57764ffc4f-r9hlg       1m           3Mi
sztabina-65b5cff756-kzl4f       1m           4Mi
[INFO] === Stress test complete. Teardown will run now. ===
[INFO] --- Teardown ---
[INFO] Deleting Sztab project 16...
[ERROR] Failed to delete project 16
[INFO] Teardown complete.
```

-
Baseline stress test results (pre-protection, 2026-03-14)
Ran the k6 stress test against staging (`ec2-35-87-145-56.us-west-2.compute.amazonaws.com`) with 50 VUs for 60s: 30 unauthenticated (anonymous bot simulation) and 20 authenticated (bot with DEVELOPER role, hitting issues/PR/branch endpoints).

Throughput: 279 req/s
Pod metrics (idle → under load)
| Pod | CPU idle | CPU load | Memory idle | Memory load |
| --- | --- | --- | --- | --- |
| sztab-backend | 2m | 370m | 443Mi | 544Mi |
| sztab-db | 4m | 137m | 46Mi | 77Mi |
| caddy | 1m | 117m | 23Mi | 23Mi |
| sztabina | 1m | 1m | 2Mi | 2Mi |

Observations
- Unauthenticated requests: 100% returning 403 -- Layer 1 (Spring Security) blocking all anonymous traffic correctly.
- Authenticated requests: 100% returning 200 -- DEVELOPER role has correct read access.
- Backend CPU peaks at 370m under load -- this is the baseline to beat after Caddy rate limiting is applied.
- DB CPU peaks at 137m -- issue/PR list queries are the likely driver.
- Sztabina unaffected -- git ops not triggered by read-only REST traffic.
Known limitations
- Authenticated scenario uses a single shared session cookie across all 20 VUs. Real bot farms distribute load across multiple accounts/sessions. A more realistic simulation would create 5-10 bot accounts and distribute cookies among VUs -- deferred to a later iteration.
Next steps
Implement Layer 2 (Caddy rate limiting) and re-run to measure impact.
-
Layer 2: Caddy-level rate limiting and bot blocking
Rejection is now pushed upstream to the reverse proxy, before requests ever reach the JVM. I added two defenses to the Caddyfile:
-
UA blocklist -- known self-identifying AI crawlers (GPTBot, ClaudeBot, CCBot, Bytespider, SemrushBot, AhrefsBot) are rejected with 403 at the proxy edge. Note that this check is easily sidestepped: adversarial scrapers that spoof their user agent will bypass it, which is why rate limiting is the primary defense.
-
Anonymous rate limiting -- unauthenticated traffic is capped at 30 requests/min per IP. Authenticated users (identified by session cookie or API token) are exempt. At 30 r/m, a human browsing casually has ample headroom; a bot hammering endpoints hits the ceiling immediately.
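To make the 30 r/m rule concrete, here is a minimal fixed-window counter in plain Java. This is a concept sketch only, not the caddy-ratelimit plugin's actual (sliding-window) implementation, and the IP address is a placeholder:

```java
import java.util.HashMap;
import java.util.Map;

// Concept sketch of per-IP anonymous rate limiting (fixed window).
class FixedWindowRateLimiter {
    private final int limitPerWindow;
    private final long windowMillis;
    // ip -> { windowStartMillis, countInWindow }
    private final Map<String, long[]> state = new HashMap<>();

    FixedWindowRateLimiter(int limitPerWindow, long windowMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request is allowed, false if it should get a 429.
    synchronized boolean allow(String ip, long nowMillis) {
        long[] s = state.computeIfAbsent(ip, k -> new long[] { nowMillis, 0 });
        if (nowMillis - s[0] >= windowMillis) {
            s[0] = nowMillis; // start a new window
            s[1] = 0;
        }
        if (s[1] < limitPerWindow) {
            s[1]++;
            return true;
        }
        return false;
    }
}

public class Main {
    public static void main(String[] args) {
        // 30 requests/min, mirroring the anonymous cap at the Caddy edge
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(30, 60_000);
        int allowed = 0;
        for (int i = 0; i < 100; i++) {
            if (limiter.allow("203.0.113.7", 0)) { // 100 requests in the same instant
                allowed++;
            }
        }
        System.out.println(allowed); // 30: everything past the cap is rejected
    }
}
```

Authenticated traffic would simply skip the limiter, matching the session-cookie/token exemption above.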
To support this, I built a custom Caddy image with the rate limiting plugin baked in, pinned to `v2.8.4` for reproducibility. The next stress test run will measure how much backend CPU drops as a result.

-
-
Layer 2 stress test results (Caddy rate limiting, 2026-03-14)
Setup
Same test as baseline: 50 VUs for 60s, 30 unauthenticated and 20 authenticated (DEVELOPER role). Rate limiting applied to anonymous traffic only (30 r/min per IP).
Pod metrics (idle → under load)

| Pod | CPU idle | CPU load | Memory idle | Memory load |
| --- | --- | --- | --- | --- |
| sztab-backend | 2m | 174m | 443Mi | 542Mi |
| sztab-db | 4m | 147m | 46Mi | 77Mi |
| caddy | 1m | 102m | 12Mi | 17Mi |
| sztabina | 1m | 1m | 2Mi | 2Mi |

Comparison vs baseline (Layer 1 only)
| Pod | Layer 1 | Layer 2 | Change |
| --- | --- | --- | --- |
| sztab-backend | 370m | 174m | -53% |
| sztab-db | 137m | 147m | ~flat (noise) |
| caddy | 117m | 102m | -13% |

Observations
- Backend CPU dropped by 53% -- anonymous bot traffic is now absorbed by Caddy before requests reach the JVM. The JVM no longer wakes up, allocates objects, or runs the filter chain for unauthenticated requests that exceed the rate limit.
- DB CPU is flat -- authenticated queries still run as expected. The reduction in backend CPU is entirely from eliminating the unauthenticated filter chain overhead.
- Caddy CPU is slightly lower too -- the rate limit decision short-circuits before the upstream proxy step, so Caddy does less work per rejected request than it did forwarding 403s from the backend.
- Memory is stable across both scenarios -- no sign of heap pressure or GC storms under load.
Next steps
Layer 3 (robots.txt) and Layer 4 (permission-based access gating) to follow.
-
Layer 3 Bot mitigation using robots.txt.
Test Results:
```
rksuma@Ramakrishnans-MacBook-Pro sztab % helm upgrade sztab deploy/helm/sztab -f deploy/helm/sztab/values-staging.yaml -n sztab-staging
Release "sztab" has been upgraded. Happy Helming!
NAME: sztab
LAST DEPLOYED: Sun Mar 15 10:39:36 2026
NAMESPACE: sztab-staging
STATUS: deployed
REVISION: 25
TEST SUITE: None
rksuma@Ramakrishnans-MacBook-Pro sztab % kubectl rollout restart deployment/caddy -n sztab-staging
deployment.apps/caddy restarted
rksuma@Ramakrishnans-MacBook-Pro sztab % kubectl rollout status deployment/caddy -n sztab-staging
Waiting for deployment "caddy" rollout to finish: 1 old replicas are pending termination...
deployment "caddy" successfully rolled out
rksuma@Ramakrishnans-MacBook-Pro sztab % kubectl get pods -n sztab-staging -w
NAME                            READY   STATUS    RESTARTS   AGE
caddy-6fbc5697cd-ll92p          1/1     Running   0          13s
sztab-backend-644c77d58-r46xd   1/1     Running   0          41h
sztab-db-fb967c9d5-fs84w        1/1     Running   0          18d
sztab-ui-57764ffc4f-r9hlg       1/1     Running   0          3d12h
sztabina-65b5cff756-kzl4f       1/1     Running   0          42h
^C
```

### Verify Caddy serves the robots.txt and the sitemap.xml

```
rksuma@Ramakrishnans-MacBook-Pro sztab % curl -s http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/robots.txt
User-agent: *
Disallow: /api/
Disallow: /git/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: PetalBot
Disallow: /

Sitemap: https://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/sitemap.xml
rksuma@Ramakrishnans-MacBook-Pro sztab % curl -s http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/</loc>
    <lastmod>2026-03-14</lastmod>
  </url>
  <url>
    <loc>https://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/docs</loc>
    <lastmod>2026-03-14</lastmod>
  </url>
</urlset>
```

-
Layer 4: Permission-based access gating — design rationale
Rate limiting (Layer 2) stops anonymous bots. A determined attacker creates an account and bypasses it. Blocking by INTERNAL/EXTERNAL user type is also insufficient — attacks can come from compromised or low-privilege internal accounts.
Layer 4 gates expensive endpoints (PR detail, diffs, branch list) by role:
| Tier | Roles | Access |
| --- | --- | --- |
| Light read | OBSERVER, CUSTOMER_SUPPORT | Issues list, basic project info |
| Full read | DEVELOPER, QA_ENGINEER, DOCUMENT_WRITER, UX_DESIGNER, SCRUM_MASTER | + PR detail, branch list, diffs |
| Write | PROJECT_MANAGER, RELEASE_MANAGER | + create/update |
| Admin | ADMIN | Everything |

Implementation: extend `ExternalUserPolicy` with `requireRole(auth, RoleName...)` and apply it at the controller layer. Boundaries are a starting point — raise concerns on this ticket if adjustments are needed.

-
SZ-73 – Layer 4 Bot Mitigation Performance Experiment
Objective
Evaluate the performance impact of removing per-request database lookups for user and role resolution in the authorization policy layer.
Previously, the security policy resolved the user type by querying the database via `UserService.getUserByUsername()` for each incoming request. Under bot load, this caused the backend to issue frequent database lookups even though the authenticated user's authorities are already present in the Spring Security `Authentication` object.

The change introduced in this experiment eliminates the database lookup from the request hot path and instead relies solely on authorities stored in the `SecurityContext`.

The goal is to verify the impact of this change under synthetic bot traffic.
Change Introduced
Previous behavior:

```
request
  |
  v
ExternalUserPolicy.resolveType()
  |
  v
UserService.getUserByUsername()
  |
  v
database lookup
```

New behavior:

```
request
  |
  v
SecurityContextHolder
  |
  v
Authentication.getAuthorities()
  |
  v
policy enforcement (no DB access)
```

The authorization aspects (`RequireRoleAspect`, `RequireInternalAspect`) now operate purely on the `Authentication` authorities.

Image tested: `tigase.dev/sztab/sztab-backend:sz73-bot-protection-v8`
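As a plain-Java illustration of the new behavior (no Spring dependency; the `ROLE_` prefix and authority names are assumptions for the sketch), the role check now runs against authority strings already held in memory:

```java
import java.util.List;
import java.util.Set;

public class Main {

    // Mirrors the in-memory check: does the caller hold any required role?
    // The authorities set stands in for Authentication.getAuthorities().
    static boolean hasAnyRole(Set<String> grantedAuthorities, List<String> requiredRoles) {
        for (String role : requiredRoles) {
            if (grantedAuthorities.contains("ROLE_" + role)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Authorities as they would already sit in the SecurityContext
        Set<String> authorities = Set.of("ROLE_DEVELOPER", "USERTYPE_INTERNAL");

        System.out.println(hasAnyRole(authorities, List.of("DEVELOPER", "ADMIN"))); // true
        System.out.println(hasAnyRole(authorities, List.of("ADMIN")));              // false
    }
}
```

The point of the change is that this check is a set lookup, not a database round trip.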
Test Environment
- Cluster: k3s staging cluster
- Namespace: `sztab-staging`
- Deployment topology: caddy, sztab-backend (1 replica), sztab-db, sztab-ui, sztabina
Load Generation
Load was generated using the project k6 stress test script: `scripts/stress-test/k6/run-stress-test.sh`

Configuration:
- 50 virtual users (20 authenticated, 30 unauthenticated)
- duration: 60 seconds

The traffic profile targets endpoints typically accessed by bot crawlers.
k6 Results
- http_req_duration: avg = 71.7 ms, min = 27.68 ms, median = 53.23 ms, max = 655.92 ms, p90 = 126.02 ms, p95 = 142 ms
- http_req_failed: 63.73% (11093 / 17405)
- http_reqs: 17405 requests, ≈ 289 requests/sec
- iteration_duration: avg = 172.51 ms, p95 = 243.16 ms

The failure rate is expected because bot mitigation intentionally rejects a large fraction of unauthenticated traffic.
Pod Metrics (Post-Stress)
```
NAME            CPU    MEMORY
caddy           105m   19Mi
sztab-backend   503m   452Mi
sztab-db        126m   74Mi
sztab-ui        1m     3Mi
sztabina        1m     2Mi
```
Comparison with Previous Run
Previous experiment (before removing DB lookups):
- sztab-backend CPU ≈ 808m
- sztab-db CPU ≈ 165m

After optimization:
- sztab-backend CPU ≈ 503m
- sztab-db CPU ≈ 126m

Observed improvements:
- backend CPU reduction ≈ 37–40%
- database CPU reduction ≈ 24%

Latency also improved:
- previous avg latency ≈ 175 ms
- new avg latency ≈ 71 ms

Note: the latency comparison should be considered indicative rather than strictly controlled. The earlier measurement and this run were conducted under slightly different runtime conditions, and the earlier measurement included the overhead of the per-request database lookup. While the improvement is directionally consistent with the removal of that lookup, the latency values should not be interpreted as a controlled A/B benchmark.
Interpretation
The previous design performed a database lookup on each request to determine user type. Under bot traffic (~290 req/s), this resulted in frequent database access for information that was already present in the authenticated security context.
By removing the database dependency from the hot path and relying on Authentication.getAuthorities() instead, the system now performs role checks purely in memory.
This change produced measurable improvements in:
- backend CPU utilization
- database load
- request latency
Importantly, bot mitigation behavior remained unchanged.
Conclusion
Removing per-request user lookups from the authorization policy significantly improved system efficiency under bot traffic.
At approximately 290 requests/sec:
- backend CPU dropped from ~808m to ~503m
- database CPU dropped from ~165m to ~126m
- average request latency dropped from ~175 ms to ~71 ms

This confirms that the security policy layer should operate exclusively on data already present in the SecurityContext and avoid database access in the request hot path.
This optimization improves the resilience of the system when subjected to high volumes of bot or crawler traffic.
-
Next: Git-level bot mitigation
The REST API and proxy layers are now protected. The remaining attack surface is the Git endpoint (`/git/*`), which is proxied directly to Sztabina.

A determined bot that obtains a valid PAT (or leverages anonymous access on a public project) can repeatedly issue clone, fetch, and diff operations. These are significantly more expensive than REST requests, as they trigger disk I/O and traversal of the git object graph.
This makes the Git surface a high-cost amplification vector compared to the API layer.
Planned mitigations:

1. Rate limiting on `/git/*` at the Caddy layer, independent from REST limits. Git operations are more expensive, so thresholds should be lower (e.g. 5–10 requests/min per IP).
2. PAT-scoped rate limiting to track usage per token rather than per IP. This helps mitigate bots that rotate source IPs while reusing credentials.
3. Sztabina-level request budgeting to enforce limits within the service itself, ensuring protection even if edge-layer controls are bypassed or misconfigured.
Before implementing these controls, SZ-78 will establish a baseline by stress testing git diff and related endpoints. This will measure CPU and I/O impact on Sztabina under load, using the same methodology previously applied to the REST API layer.
-
-
```
rksuma@Ramakrishnans-MacBook-Pro sztab % git checkout wolnosc
Already on 'wolnosc'
Your branch is up to date with 'origin/wolnosc'.
rksuma@Ramakrishnans-MacBook-Pro sztab % git pull origin wolnosc
From https://tigase.dev/sztab
 * branch            wolnosc -> FETCH_HEAD
Already up to date.
rksuma@Ramakrishnans-MacBook-Pro sztab % git checkout -b feature/SZ-73-git-bot-mitigation
Switched to a new branch 'feature/SZ-73-git-bot-mitigation'
```

-
SZ-73 Work Log
Summary
Implemented a four-layer bot mitigation strategy across the HTTP stack (Spring Security, Caddy edge controls, crawler directives, and AOP-based authorization). Established load-testing infrastructure (k6) and baseline measurements to quantify impact. Identified Git endpoints as the remaining high-cost attack surface, to be addressed next.
Total effort: ~27h
SZ-77 (blocker, fixed first)
Bug: ProjectService infers repo type from gitUrl presence instead of repoType field
- Identified root cause: `isExternalRepo = gitUrl != null` ignored the `repoType` field entirely
- Added a `RepoType` parameter to the `ProjectService.createProject()` interface and impl
- Updated `ProjectController` to pass `dto.effectiveRepoType()`
- Removed the deprecated `createProject(Project)` overload
- Updated tests — 297 passing
- Branch: `bugfix/SZ-77-repoType-inference` → merged to `wolnosc`
- Estimate: 2h
SZ-73 Layer 1: Spring Security audit
- Confirmed `.anyRequest().authenticated()` is already in place
- Identified actuator and Swagger exposure as follow-up items
- Estimate: 0.5h
SZ-73 Layer 2: Caddy rate limiting and UA blocklist
Custom Caddy image
- Wrote `deploy/helm/sztab/caddy/Dockerfile` with `xcaddy` + the `caddy-ratelimit` plugin
- Pinned to `caddy:2.8.4`, added build-time module verification
- Built a multi-platform image (`linux/amd64`, `linux/arm64`)
- Updated `values.yaml`, `values-staging.yaml`, and the Helm template for the new image
- Fixed `imagePullPolicy: Always` in the Helm template to avoid a stale image cache
Caddyfile
- Added the `@ai_bots` UA blocklist (GPTBot, ClaudeBot, CCBot, Bytespider, SemrushBot, AhrefsBot, Amazonbot, PetalBot)
- Added the `@anonymous` rate limit zone: 30 r/min per `{remote_ip}`
- Used `header_regexp` for JSESSIONID matching (handles multiple cookies correctly)
- Moved the Caddyfile to `deploy/helm/sztab/caddy/Caddyfile`
- Updated the Helm ConfigMap template path
- Updated the `build-release.sh` Caddyfile source path
Staging deployment issues resolved
- k3s image cache — added `imagePullPolicy: Always`
- ConfigMap empty — fixed the `.Files.Get` path in the Helm template
- Multiple Caddy CrashLoopBackOff cycles debugged
Estimate: 6h
SZ-73 Layer 3: robots.txt and sitemap.xml
- Added `handle /robots.txt` directly in the Caddyfile with per-agent rules
- Added `handle /sitemap.xml` with the `{env.SZTAB_DOMAIN}` placeholder
- Used the `SZTAB_DOMAIN` env var (already wired from `sztab.domain` in Helm values)
- Verified both endpoints return correct domain substitution on staging
- Updated `docker-compose.yml` with the Caddyfile path and custom image
- Estimate: 1.5h
Load testing infrastructure
k6 script (`bot-stress-test.ts`)
- Two scenarios: `unauthenticated_bots` (30 VUs), `authenticated_bots` (20 VUs)
- Unauthenticated: hits public project/issue/PR list endpoints
- Authenticated: hits issues, PR detail, branch list with a bot session cookie
- Named scenario exports with the `exec` field
- Fixed endpoint bugs: `api/projects/{id}/issues` → `api/issues?projectName=`, branches path
Runner script (`run-stress-test.sh`)
- Mac-compatible `curl_api` helper (no `head -n -1`)
- Admin login → fetch user ID → create bot user → assign DEVELOPER role → bot login
- Project creation (`repoType: LOCAL`) → issue → PR
- `kubectl top` before and after the k6 run
- Teardown: delete all PRs by project → delete feature branch → delete project → delete bot user
- Multiple teardown fixes: PR FK constraint, branch FK constraint, bulk PR deletion
Debugging cycles
- Mac shell incompatibilities (`head -n -1`, `local -n` nameref)
- Sztabina 409 handling in `SztabinaClient.createRepository()`
- Sztabina returning `text/plain` → fixed in the Go handler with `util.EncodeJSON`
- `RepositoryResponse` NPE on null return from the 409 handler
- k6 named scenario `exec` field missing
- Wrong issues endpoint, branch endpoint literal not interpolated
Estimate: 8h
Baseline measurements
| Layer | Backend CPU | DB CPU |
| --- | --- | --- |
| Idle | 2m | 4m |
| Layer 1 only | 370m | 137m |
| Layer 2 (Caddy RL) | 174m | 137m |
| Layer 4 v1 (DB lookup) | 808m | 165m |
| Layer 4 v2 (auth cache) | 503m | 126m |

Estimate: 1.5h
SZ-73 Layer 4: Permission-based access gating
Design
- Defined two access tiers: `LIGHT_READ` (all roles) and `FULL_READ` (DEVELOPER+)
- Decided against `UserType` as the primary boundary — insider threat applies equally
Implementation
- `AccessTier` enum in `com.sztab.policy.security.enums`
- `@RequireRole(AccessTier)` annotation in `com.sztab.annotations.security`
- `@RequireInternal` annotation in `com.sztab.annotations.security`
- `RequireRoleAspect` and `RequireInternalAspect` in `com.sztab.policy.security.aspect`
- Pre-allocated `RoleName[]` arrays in the aspects (no per-request allocation)
- Defensive auth null check in both aspects
- Applied annotations across `IssueController`, `BranchController`, `PullRequestController`
- Fixed the `User.hasRole()` bug: enum vs String comparison always returned false
- Added `UserTest` with a guard comment explaining the trap
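The `hasRole()` enum-vs-String trap is easy to reproduce in isolation. A hedged sketch with illustrative names (not the actual Sztab classes):

```java
import java.util.List;

enum RoleName { DEVELOPER, ADMIN }

class User {
    private final List<RoleName> roles = List.of(RoleName.DEVELOPER);

    // Buggy version: RoleName.equals(String) is always false,
    // so this never matches no matter what roles the user has.
    boolean hasRoleBuggy(String role) {
        return roles.stream().anyMatch(r -> r.equals(role));
    }

    // Fixed version: compare name to name (or accept a RoleName directly).
    boolean hasRoleFixed(String role) {
        return roles.stream().anyMatch(r -> r.name().equals(role));
    }
}

public class Main {
    public static void main(String[] args) {
        User u = new User();
        System.out.println(u.hasRoleBuggy("DEVELOPER")); // false: the trap
        System.out.println(u.hasRoleFixed("DEVELOPER")); // true
    }
}
```

Accepting the enum type in the signature removes the trap entirely, which is what the guard comment in `UserTest` warns about.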
Performance optimization
- The initial implementation caused an extra `getUserByUsername()` DB call per request → 808m CPU
- Fixed: resolved roles from `Authentication.getAuthorities()` — no DB access in the hot path
- Updated `CustomUserDetailsService` to include `USERTYPE_INTERNAL`/`EXTERNAL` as an authority
- Updated `ExternalUserPolicy` to use authorities — removed the `UserService` dependency
Estimate: 6h
Documentation
- Ticket comments: baseline results, Layer 2 results, Layer 4 results, design rationale
- Team update sent to Artur
- Work log
Estimate: 1.5h
Total
| Area | Hours |
| --- | --- |
| SZ-77 blocker fix | 2h |
| Layer 1 audit | 0.5h |
| Layer 2 (Caddy image + Caddyfile) | 6h |
| Layer 3 (robots.txt + sitemap.xml) | 1.5h |
| Load testing infrastructure | 8h |
| Baseline measurements + analysis | 1.5h |
| Layer 4 (AOP role gating + perf optimization) | 6h |
| Documentation | 1.5h |
| Total | 27h |
-
A new problem surfaced when subjecting Sztab to large repos.
Sztab and Sztabina are interfaced using Spring WebFlux:

```
{ Sztabina } ==> { Sztab }
```

When computing a diff, Sztabina pipes the computed diff to Sztab through a WebFlux buffer. By default the buffer size is 256KB. This was insufficient for the test I ran (I created a large repo).
This caused the default WebFlux buffer to overflow. When this happens, Sztabina is unaware of the issue and shows no error in its logs. It's the consumer (Sztab) that fails, but even there the exception happens inside Spring Flux, leading to a cryptic exception in the logs:

```
rksuma@Ramakrishnans-MacBook-Pro sztab % kubectl logs -n sztab-staging deployment/sztab-backend --tail=200 | grep -B2 -A10 "diff-by-url" | head -40
Defaulted container "sztab-backend" out of: sztab-backend, wait-for-db (init)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
    *__checkpoint ⇢ Body from POST http://sztabina:8085/repos/compare/diff-by-url [DefaultClientResponse]
Original Stack Trace:
    at org.springframework.core.io.buffer.LimitedDataBufferList.raiseLimitException(LimitedDataBufferList.java:99) ~[spring-core-6.1.13.jar!/:6.1.13]
    at org.springframework.core.io.buffer.LimitedDataBufferList.updateCount(LimitedDataBufferList.java:92) ~[spring-core-6.1.13.jar!/:6.1.13]
    at org.springframework.core.io.buffer.LimitedDataBufferList.add(LimitedDataBufferList.java:58) ~[spring-core-6.1.13.jar!/:6.1.13]
    at reactor.core.publisher.MonoCollect$CollectSubscriber.onNext(MonoCollect.java:103) ~[reactor-core-3.6.10.jar!/:3.6.10]
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[reactor-core-3.6.10.jar!/:3.6.10]
    at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:200) ~[reactor-core-3.6.10.jar!/:3.6.10]
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[reactor-core-3.6.10.jar!/:3.6.10]
    at reactor.netty.channel.FluxReceive.onInboundNext(FluxReceive.java:379) ~[reactor-netty-core-1.1.22.jar!/:1.1.22]
    at reactor.netty.channel.ChannelOperations.onInboundNext(ChannelOperations.java:425) ~[reactor-netty-core-1.1.22.jar!/:1.1.22]
```

Fix:
Increase the WebFlux buffer size to 16MB to support large diffs.
We can't fully know the required size in advance because diff size depends on user content. The right approach is a generous configurable default, a specifically caught exception with a clear error message, and monitoring.
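A sketch of where that limit would be raised, assuming the Sztabina call goes through a WebFlux `WebClient` (the bean and class names here are illustrative, not the actual Sztab code):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.reactive.function.client.WebClient;

// Hypothetical config: raise the in-memory codec limit for the client
// that fetches diffs from Sztabina. The WebFlux default is 256KB.
@Configuration
public class SztabinaWebClientConfig {

    @Bean
    public WebClient sztabinaWebClient() {
        return WebClient.builder()
                .baseUrl("http://sztabina:8085")
                // 16MB: a generous, configurable default for large diffs
                .codecs(c -> c.defaultCodecs().maxInMemorySize(16 * 1024 * 1024))
                .build();
    }
}
```

Keeping the value in configuration (rather than hard-coding it) makes it tunable per environment, which fits the "generous configurable default" approach above.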
-
Good catch. Yes, this is exactly why I am in favor of running tests against real world data. Testing on our largest repos is a good start but maybe this is not enough. How about testing the system against really big repos available out there. Like Linux repo?
I mean, how do you know that the cache size 16MB is enough? Is there a limit to what we can handle?
-
-
> Good catch. Yes, this is exactly why I am in favor of running tests against real world data. Testing on our largest repos is a good start but maybe this is not enough. How about testing the system against really big repos available out there. Like Linux repo?
> I mean, how do you know that the cache size 16MB is enough? Is there a limit to what we can handle?
16MB covers the overwhelming majority of real-world engineering team PRs based on first-principles sizing (500 files × 200 lines × 2 sides × 100 bytes ≈ 20MB worst case). It's a pragmatic default for the target audience.
But for Linux-scale repos the answer is not a bigger buffer — it's streaming.
bodyToMono() forces full in-memory buffering by design. The correct fix at that scale is to stream the diff response directly from Sztabina to the client using bodyToFlux() or server-sent events, bypassing the buffer entirely. That's a more significant architectural change and I am tracking that as a future improvement.
Testing against real large repos like Linux is a good idea for stress testing the git engine (SZ-78), but it will surface the streaming limitation as much as it tests the buffer size. I have opened https://tigase.dev/sztab/~issues/120 to track this.
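A minimal sketch of the streaming alternative described above, assuming a WebClient pointed at the Sztabina compare endpoint (the class name, request shape, and URL are illustrative, not the final design):

```java
import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class StreamingDiffClient {

    private final WebClient client = WebClient.create("http://sztabina:8085");

    // Streams the diff body as a Flux of DataBuffers instead of
    // aggregating it with bodyToMono(): buffers flow through to the
    // caller as they arrive, so no single in-memory buffer ever has
    // to hold the whole diff. A WebFlux controller can return this
    // Flux directly and the framework writes it out incrementally.
    public Flux<DataBuffer> streamDiff(String compareUrl) {
        return client.post()
                .uri("/repos/compare/diff-by-url")
                .bodyValue(compareUrl)
                .retrieve()
                .bodyToFlux(DataBuffer.class);
    }
}
```

With this shape the 16MB codec limit no longer applies to the diff path, which is why it is the right fix for Linux-scale repos rather than a larger buffer.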
-
-
SZ-78 Baseline: Sztabina diff endpoint under bot load (2026-03-17)
Setup
50 VUs for 60s — 30 unauthenticated (anonymous bot simulation) and 20 authenticated (DEVELOPER role). The authenticated scenario includes `GET /api/pullrequests/29/detail`, which triggers a real git diff computation in Sztabina against the TESTSZTAB repo (20 files, ~276KB unified diff).

Pod metrics (idle → under load)

| Pod | CPU idle | CPU load | Memory idle | Memory load |
| --- | --- | --- | --- | --- |
| sztab-backend | 2m | 294m | 443Mi | 462Mi |
| sztab-db | 4m | 36m | 46Mi | 75Mi |
| caddy | 1m | 66m | 12Mi | 19Mi |
| sztabina | 1m | 497m | 2Mi | 35Mi |

Observations
-
Sztabina is now the bottleneck — 497m CPU under load, exceeding the backend (294m). Previous runs showed Sztabina at 1m because no real diff work was being done. This confirms git diff computation is CPU-intensive.
-
DB CPU dropped to 36m — down from 126m in the Layer 4 baseline. Role lookups are now resolved from Spring Security authorities (SZ-79 fix), eliminating per-request DB hits.
-
106MB total data received during test run — confirms that diff payloads are flowing end-to-end and the 16MB buffer fix is not prematurely truncating responses.
-
p95 authenticated latency: 2.3s. This reflects CPU-bound git diff computation under 20 concurrent requests. Latency is expected to scale with diff size and concurrency; mitigation should focus on limiting concurrent diff execution rather than optimizing JVM paths.
-
86.5% request failure rate — expected and desired. The majority of unauthenticated bot traffic is intentionally rejected (429 rate limiting at Caddy, 403 at Spring Security). This indicates mitigation layers are actively protecting backend resources.
Comparison with previous baselines
| Scenario | Backend CPU | DB CPU | Sztabina CPU | Notes |
| --- | --- | --- | --- | --- |
| Layer 1 only | 370m | 137m | 1m | No real diff work |
| Layer 2 (Caddy RL) | 174m | 137m | 1m | No real diff work |
| Layer 4 (auth cache) | 503m | 126m | 1m | No real diff work |
| SZ-78 (real diffs) | 294m | 36m | 497m | Real git diff load |

Key finding
Git diff computation is CPU-bound and shifts the system bottleneck from the JVM to Sztabina. Under concurrent load, Sztabina saturates (~500m CPU) before backend or DB resources become constrained.
This establishes git diff execution as the dominant cost center in the system and justifies prioritizing rate limiting and concurrency control for diff endpoints.
🔴 Important: The system is not I/O-bound or DB-bound under load; it is compute-bound on git operations. All further scaling and mitigation decisions should be evaluated against this constraint.
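The concurrency control called for above could be sketched with a plain semaphore in front of diff execution; the class and method names are hypothetical, not from the codebase:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical guard: caps how many git diff computations run at once
// so concurrent requests cannot saturate Sztabina's CPU. Callers over
// the cap are shed immediately rather than queued.
public class DiffConcurrencyLimiter {

    private final Semaphore permits;

    public DiffConcurrencyLimiter(int maxConcurrentDiffs) {
        this.permits = new Semaphore(maxConcurrentDiffs);
    }

    // Runs the diff if a permit is available; returns null when the
    // limiter is saturated (the caller maps null to HTTP 429/503).
    public <T> T runIfCapacity(Callable<T> diffTask) throws Exception {
        if (!permits.tryAcquire()) {
            return null; // saturated: shed load instead of queueing
        }
        try {
            return diffTask.call();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        DiffConcurrencyLimiter limiter = new DiffConcurrencyLimiter(2);
        System.out.println(limiter.runIfCapacity(() -> "diff computed"));
        // prints "diff computed"
    }
}
```

Load-shedding at this layer complements the Caddy rate limits: the proxy caps request volume per client, while the semaphore caps total in-flight diff work regardless of how many distinct clients are active.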
Next steps
- Implement git-level rate limiting at Caddy (`/git/*` endpoints)
- Test against larger diffs (Linux kernel scale) per the SZ-79 task
- Monitor Sztabina CPU in production — consider horizontal scaling if diff load grows beyond single-pod capacity
-
-
Git endpoint rate limiting policy
Git operations (clone, fetch, diff) are significantly more expensive than REST requests — each one triggers disk I/O and git object graph traversal in Sztabina. Unlike REST endpoints, git operations are stateless from the client's perspective, meaning a bot with a valid PAT can hammer the same repo repeatedly without any server-side memory of prior requests.
Two rate limit zones are applied at the Caddy layer:
-
Anonymous git (no Authorization header): 5 requests/min per IP. Public repo access is permitted but tightly throttled. A legitimate user cloning a repo once is unaffected; a crawler hitting the endpoint repeatedly is blocked immediately.
-
Authenticated git (valid PAT): 30 requests/min per IP. Generous enough for CI pipelines and active developer workflows, tight enough to prevent a compromised or bot-controlled PAT from saturating Sztabina under repeated clone/fetch load.
PAT authentication proves identity but does not limit volume. Rate limiting at the proxy layer is the correct control for volume — it applies regardless of whether the requestor is human or automated, internal or external.
-
-
Git rate limiting directives in Caddy:
```
# ------------------------------
# Rate limiting — anonymous REST traffic only (Layer 2b)
# Authenticated requests (JSESSIONID cookie or Authorization header)
# bypass this — they are governed by Layer 1 (Spring Security) and
# Layer 4 (permission-based access).
# 30 events/min per IP gives legitimate anonymous browsers ample
# headroom while decisively blocking crawlers.
# {remote_ip} is used as the rate limit key — cheaper than {remote_host}
# which would trigger a reverse DNS lookup on every request.
#
# NOTE: /git/* is explicitly excluded here to ensure git traffic is
# governed solely by the git-specific rate limit zones below.
# This makes the zones mutually exclusive and order-independent.
# ------------------------------
@anonymous {
	not path /git/*
	not header_regexp Cookie JSESSIONID
	not header Authorization *
}
rate_limit @anonymous {
	zone anonymous_zone {
		key {remote_ip}
		events 30
		window 1m
	}
}

# ------------------------------
# Rate limiting — anonymous git traffic (Layer 2c)
# Git operations (clone, fetch, diff) are significantly more expensive
# than REST requests — each triggers disk I/O and git object graph
# traversal in Sztabina. Anonymous access to public repos is permitted
# but tightly throttled.
#
# 10 r/min accounts for git clone burst behavior — a single clone
# generates multiple HTTP requests (info/refs, pack negotiation, object
# fetch). 5 r/min was too tight; 10 r/min blocks sustained crawling
# while allowing legitimate one-time clones.
#
# Rate limit key is {remote_ip}{path} — scoped per IP per repository.
# Git cost is repo-specific: cloning repo A is independent of cloning
# repo B. A CI pipeline cloning multiple repos is not penalized the
# same as a bot hammering a single repo repeatedly.
#
# IP-only key for anonymous traffic — no token available.
# NAT/corporate network tradeoff accepted: anonymous git from a shared
# IP is already a suspicious pattern.
# ------------------------------
@git_anonymous {
	path /git/*
	not header Authorization *
}
rate_limit @git_anonymous {
	zone git_anonymous_zone {
		key {remote_ip}{path}
		events 10
		window 1m
	}
}

# ------------------------------
# Rate limiting — authenticated git traffic (Layer 2d)
# PAT authentication proves identity but does not limit volume.
# A compromised or bot-controlled PAT can saturate Sztabina with
# repeated clone/fetch operations. 30 r/min per IP per repo is
# generous enough for CI pipelines and active developer workflows
# while blocking bots.
#
# Rate limit key is {remote_ip}{path} — scoped per IP per repository.
# A developer or CI pipeline working across multiple repos is not
# penalized; a bot hammering a single repo is throttled.
#
# NOTE: Authorization header value (Base64-encoded Basic credentials)
# is intentionally NOT used as part of the key. The same credentials
# can produce different encodings across clients (whitespace, padding
# variations), making the raw header an unreliable key. IP + path
# is simpler, stable, and matches the actual cost model.
#
# Git HTTP protocol uses Authorization headers (Basic/PAT).
# Session cookies (JSESSIONID) are not used by git clients and
# are intentionally ignored here to avoid incorrect classification.
# ------------------------------
@git_authenticated {
	path /git/*
	header Authorization *
}
rate_limit @git_authenticated {
	zone git_authenticated_zone {
		key {remote_ip}{path}
		events 30
		window 1m
	}
}
```

Validation
Rate limiting can be verified manually by issuing repeated requests to the git info/refs endpoint and observing HTTP 429 responses once the configured thresholds are exceeded.
In addition, the existing k6 stress test will be extended to include git endpoints. Validation criteria:
- Anonymous git traffic is throttled at ~10 r/min per IP per path
- Authenticated git traffic is throttled at ~30 r/min per IP per path
- Legitimate single clone/fetch operations complete without 429
- Sztabina CPU usage decreases under bot load compared to baseline
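Since the broader Tigase team is Java-first (see the Gatling note above), the k6 extension could equally be sketched in Gatling's Java DSL. The host and thresholds below mirror the zones above but are assumptions, not a finished test plan:

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

// Sketch: one virtual user hits the anonymous git info/refs endpoint
// 15 times in a row. With the 10 r/min anonymous git zone in place,
// the first ~10 requests should return 200 and the rest 429; the
// check accepts only those two statuses so any other outcome fails.
public class GitRateLimitSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol =
            http.baseUrl("https://staging.sztab.com"); // assumed host

    ScenarioBuilder anonymousGit = scenario("anonymous git crawl")
            .repeat(15).on(
                    exec(http("info/refs")
                            .get("/git/TESTSZTAB.git/info/refs?service=git-upload-pack")
                            .check(status().in(200, 429))));

    {
        setUp(anonymousGit.injectOpen(atOnceUsers(1)))
                .protocols(httpProtocol);
    }
}
```

An authenticated scenario would add a Basic `Authorization` header and repeat 35 times against the 30 r/min zone, mirroring the manual curl tests below.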
-
Test git Rate Limiting: Unauthenticated Git
```
rksuma@Ramakrishnans-MacBook-Pro sztab % for i in $(seq 1 15); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/git/TESTSZTAB.git/info/refs?service=git-upload-pack")
  echo "Request $i: $STATUS"
done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
Request 5: 200
Request 6: 200
Request 7: 200
Request 8: 200
Request 9: 200
Request 10: 200
Request 11: 429
Request 12: 429
Request 13: 429
Request 14: 429
Request 15: 429
```

Test git Rate Limiting: Authenticated Git

```
rksuma@Ramakrishnans-MacBook-Pro sztab % for i in $(seq 1 35); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "Authorization: Basic $(echo -n 'admin:szt_6G9__eAmsumQr2C79F9c5ScAl5NkgwIySshIPE7v' | base64)" \
    "http://ec2-35-87-145-56.us-west-2.compute.amazonaws.com/git/TESTSZTAB.git/info/refs?service=git-upload-pack")
  echo "Request $i: $STATUS"
done
Request 1: 200
Request 2: 200
Request 3: 200
Request 4: 200
Request 5: 200
Request 6: 200
Request 7: 200
Request 8: 200
Request 9: 200
Request 10: 200
Request 11: 200
Request 12: 200
Request 13: 200
Request 14: 200
Request 15: 200
Request 16: 200
Request 17: 200
Request 18: 200
Request 19: 200
Request 20: 200
Request 21: 200
Request 22: 200
Request 23: 200
Request 24: 200
Request 25: 200
Request 26: 200
Request 27: 200
Request 28: 200
Request 29: 200
Request 30: 200
Request 31: 429
Request 32: 429
Request 33: 429
Request 34: 429
Request 35: 429
```

Summary:

- 30 consecutive requests → HTTP 200
- 31st request onward → HTTP 429

Confirms:

- authenticated git limit enforced at configured threshold
- Authorization-based classification working correctly
- no overlap with anonymous rate limit zone

-
-
In Progress
| Field | Value |
| --- | --- |
| Type | New Feature |
| Priority | Normal |
| Assignee | |
| Version | none |
| Sprints | n/a |
| Customer | n/a |
The main problem we faced was our servers being overloaded by AI bots and crawlers. The most sensible solution seems to be hiding resource-heavy operations from anonymous or guest access. I suggested making these operations accessible based on user permissions. This would give us the most flexibility.
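A sketch of what permission-gated access to a resource-heavy endpoint could look like with Spring Security method security; the endpoint path matches the diff URL used in the stress tests, but the SpEL expression and the `PermissionEvaluator` it relies on are assumptions, not the final design:

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Sketch: the expensive diff endpoint is only reachable by users who
// hold read permission on the project. Anonymous and guest requests
// are rejected before any git work starts in Sztabina, which is the
// point of this layer: the cost is avoided, not just rate limited.
@RestController
public class DiffController {

    @PreAuthorize("hasPermission(#projectId, 'PROJECT', 'READ')")
    @GetMapping("/api/projects/{projectId}/pulls/{pullId}/diff")
    public String diff(@PathVariable long projectId, @PathVariable long pullId) {
        return computeDiff(projectId, pullId);
    }

    private String computeDiff(long projectId, long pullId) {
        // placeholder for the real call into Sztabina
        return "";
    }
}
```

Requires `@EnableMethodSecurity` and a configured `PermissionEvaluator` that maps the `'PROJECT'`/`'READ'` pair onto the repository permission model; both are left out here.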