The brief
The platform that hosted Dkubex had grown organically: every service had its own ingress story, its own image base, its own observability hookup. Onboarding a new microservice took days. We needed a unified architecture.
What I did
- Helmfile as the source of truth. Every service (and every environment) expressed declaratively in a single Helmfile tree. New service onboarding collapsed from days to a single PR.
- Kubernetes Gateway API + Traefik. Migrated off the old per-app Ingress and onto Gateway API with Traefik as the controller. Per-route auth and rate-limit policies became routine instead of bespoke.
- Container size reduction. Audited the base images and found we were shipping the entire CUDA toolkit in inference images that only needed the runtime libraries. Multi-stage builds + a leaner base brought a 10 GB image down to 2 GB, and cold-start times dropped along with it.
- ClickStack observability. Stood up ClickStack (ClickHouse for storage, HyperDX as the UI) as a unified telemetry surface; every service emits to the same store. MTTR on inference-path incidents dropped meaningfully.
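To make the "routine instead of bespoke" point about per-route policies concrete, here is a minimal sketch of how a route with attached policies might be generated. It is an illustration, not the manifests used on the platform: the helper `http_route`, the service, gateway, and host names, and the middleware names `jwt-auth` and `rate-limit-100` are all hypothetical. It also assumes the policies are Traefik `Middleware` resources attached through Gateway API `ExtensionRef` filters, which is one way Traefik supports this.

```python
import json


def http_route(name, namespace, gateway, host, service, port, middlewares=()):
    """Build a Gateway API HTTPRoute manifest as a plain dict.

    Assumption: each policy is a Traefik Middleware in the same namespace,
    referenced from the route through an ExtensionRef filter.
    """
    filters = [
        {
            "type": "ExtensionRef",
            "extensionRef": {"group": "traefik.io", "kind": "Middleware", "name": m},
        }
        for m in middlewares
    ]
    rule = {
        "matches": [{"path": {"type": "PathPrefix", "value": "/"}}],
        "backendRefs": [{"name": service, "port": port}],
    }
    if filters:
        rule["filters"] = filters
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "hostnames": [host],
            "rules": [rule],
        },
    }


# Hypothetical service: every name below is a placeholder.
route = http_route(
    name="inference-api",
    namespace="serving",
    gateway="platform-gateway",
    host="inference.example.com",
    service="inference-api",
    port=8080,
    middlewares=["jwt-auth", "rate-limit-100"],
)

# kubectl accepts JSON as well as YAML, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(route, indent=2))
```

With a helper like this, adding auth or a rate limit to a route is a one-line change to the `middlewares` list rather than a per-app Ingress annotation.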
Outcome
- 80% container image size reduction (10 GB → 2 GB).
- 1.7–2.7× serving throughput (30 → 50–80 req / GPU) after batching + scheduler tuning on the leaner runtime.
- One observability surface across the entire platform.
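The headline ratios can be re-derived from the raw before/after figures quoted in the list. The short check below uses only those figures; the function names are my own.

```python
def reduction_pct(before, after):
    """Percentage by which a quantity shrank."""
    return 100 * (before - after) / before


def speedup(before, after):
    """Multiplicative gain of `after` over `before`."""
    return after / before


size = reduction_pct(10, 2)                    # image size, GB
low, high = speedup(30, 50), speedup(30, 80)   # req / GPU, low and high end

print(f"image size reduction: {size:.0f}%")
print(f"throughput gain: {low:.2f}x to {high:.2f}x")
# prints:
# image size reduction: 80%
# throughput gain: 1.67x to 2.67x
```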
What I learned
"Platform work" pays back in velocity, not features. The metric that matters is how quickly the next engineer can ship.