Realtime and Batch GPU Workloads: How SS&C Built an AI‑as‑a‑Service Platform on Private Cloud

Joseph Stein walks through the architecture, governance, and operational tactics that let SS&C run both low‑latency inference and high‑throughput batch jobs on a shared pool of NVIDIA H100/H200 GPUs. The article covers multi‑namespace scheduling, Valkey‑Lua priority queues, vLLM tuning, custom S3‑to‑Kafka proxies, and lessons for cost‑effective oversubscription in enterprise private clouds.

Technical announcement

SS&C Technologies has described its private‑cloud AI platform to the broader engineering community in an InfoQ presentation titled “Realtime and Batch Processing of GPU Workloads.” The platform delivers generative‑AI inference, embeddings, and Retrieval‑Augmented Generation (RAG) as a service while keeping all data on‑premises. The key claim: a single 8‑chip H100 chassis in Kansas City can simultaneously serve 250+ production tenants, 1,000+ use cases, and a growing batch pipeline without exceeding budgeted GPU spend.

Specifications

Hardware footprint

Component	Quantity	Specs
NVIDIA GPUs	80 (mix of H100 & H200)	4,000 tokens/s per H100 chip (peak), 80 GB HBM per H100, 141 GB HBM per H200
Compute nodes	12 (dual‑socket AMD EPYC)	256 GB RAM, NVMe local storage
Network	100 GbE leaf‑spine	Low‑latency intra‑region traffic
Storage	Multi‑region object store (S3‑compatible)	30 PB total capacity

All nodes are provisioned through a Terraform provider that mirrors public‑cloud APIs, allowing engineers to request Kubernetes clusters, Kafka topics, or dedicated GPU pods via a self‑service portal.

Software stack

Kubernetes 1.30 with custom node selectors for H100 vs H200 resources.
vLLM (open‑source LLM serving engine) as the primary inference runtime.
Valkey (Redis‑compatible) + Lua scripts for atomic rate‑limiting, priority queuing, and back‑pressure handling.
OpenAI‑compatible API gateway built on Envoy, enriched with per‑tenant guardrails (prompt‑injection detection, toxicity filters, FINRA compliance checks).
Kafka 3.4 for event streaming; all file uploads are first registered in a Valkey cache, then a custom S3‑proxy writes the object to an internal bucket and pushes a registration event to Kafka.
OPA (Open Policy Agent) with Rego policies for fine‑grained access control on the S3‑proxy.
ServiceNow integration for automated ticketing when a tenant exceeds its quota or triggers a security rule.

Core architectural patterns

Multi‑namespace GPU pool – each tenant’s workload runs in a distinct Kubernetes namespace. A single GPU‑pool proxy aggregates requests across namespaces, inspects the request’s tenant label, and applies a two‑dimensional priority:
- Environment priority (Prod > Prod‑DR > Demo > UAT > Dev)
- Namespace priority (high‑value services vs background batch jobs)
Atomic Lua‑based rate limiting – every incoming request hits a Valkey script that checks:
- Global token‑per‑second cap
- Per‑model token window (e.g., 8 B model ≤ 7 s, 70 B model ≤ 90 s)
- Tenant‑specific quota
- Current vLLM back‑pressure metric (queue_length from the /metrics endpoint)
If any check fails, the request is throttled or placed in a low‑priority queue.
Micro‑batching for embeddings – a thin gRPC service aggregates up to 650 pages of text into 32‑KB batches before feeding them to the GPU, reducing kernel launch overhead by ~30 %.
Batch‑window scheduling – file‑processing jobs (document OCR, audio transcription) are accepted with an SLA‑profile (e.g., complete by 20:00 UTC). The scheduler computes the expected token consumption, matches it against off‑peak GPU capacity, and either accepts the job immediately or defers it to a low‑utilisation window.
Disaster‑recovery namespace – a hot‑standby namespace mirrors the production GPU pool. In a fail‑over, traffic is re‑routed to the standby namespace without touching the underlying hardware selectors.
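
The two‑dimensional priority applied by the GPU‑pool proxy can be illustrated with a small in‑memory queue. This is a minimal sketch, not SS&C's implementation: the class name GpuPoolQueue, the numeric namespace priorities, and the choice to let environment rank dominate namespace rank are all assumptions made for illustration. Only the environment ordering (Prod > Prod‑DR > Demo > UAT > Dev) comes from the talk.

```python
import heapq
import itertools

# Environment ordering taken from the text; a lower rank is served first.
ENV_RANK = {"prod": 0, "prod-dr": 1, "demo": 2, "uat": 3, "dev": 4}


class GpuPoolQueue:
    """Hypothetical priority queue keyed on (environment, namespace, arrival)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, request_id, environment, namespace_priority):
        # Assumption: environment rank is compared first, then the
        # namespace priority (0 = high-value service, larger = batch).
        key = (ENV_RANK[environment], namespace_priority, next(self._arrival))
        heapq.heappush(self._heap, (key, request_id))

    def pop(self):
        _key, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)


queue = GpuPoolQueue()
queue.push("dev-chat", "dev", 0)
queue.push("prod-batch", "prod", 5)
queue.push("prod-chat", "prod", 0)
queue.push("uat-chat", "uat", 0)
queue.push("prod-chat-2", "prod", 0)

order = [queue.pop() for _ in range(len(queue))]
print(order)
```

In this sketch a production batch job still runs ahead of any UAT or Dev request; a real deployment might weight the two dimensions differently.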

Benchmarks & performance numbers

Workload	Model size (parameters)	Latency (95th percentile)	Throughput	GPU utilisation
Real‑time chat (Llama‑3.1‑8B)	8 B	120 ms	4 k req/s	68 %
RAG inference (Llama‑3.1‑70B)	70 B	1.2 s	850 req/s	74 %
Embedding micro‑batch (text‑2‑vec)	8 B	45 ms	9 k req/s	62 %
Audio transcription (Whisper‑large‑v2)	1.5 B	3.4 s	210 req/s	55 %

All numbers were collected on a mixed H100/H200 fleet running Ubuntu 22.04 with the NVIDIA driver 560.68. The vLLM engine was launched with --max-num-seqs=256 and a reported batch setting of --max-num-batches=64 to maximise parallelism (vLLM's documented batching flag is --max-num-batched-tokens).

Real‑world implications

Cost efficiency through oversubscription

SS&C’s approach shows that a single 8‑chip chassis can safely support 80 GPU‑equivalent workloads when you:

Slice traffic by environment and tenant.
Apply strict token‑based quotas.
Use back‑pressure signals from vLLM to trigger low‑priority queuing.

The result is a ~2.3× improvement in GPU utilisation compared to naïve per‑tenant provisioning, translating to an estimated $1.8 M annual OPEX saving on a $7 M hardware investment.
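
The three controls listed above can be combined into a single admission decision. The sketch below models that decision in plain Python so it can run without a Valkey server; in the platform described, the equivalent checks run atomically inside a Lua script. The fixed one‑second window, the AdmissionState fields, the threshold values, and the rule that quota failures throttle while back‑pressure diverts to a low‑priority queue are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AdmissionState:
    global_cap: int        # tokens allowed per one-second window, all tenants
    tenant_quota: dict     # tenant -> tokens allowed per window
    max_queue_length: int  # hypothetical back-pressure threshold
    window: int = -1       # index of the current one-second window
    global_used: int = 0
    tenant_used: dict = field(default_factory=dict)


def admit(state, tenant, tokens, now, queue_length):
    """Return "accept", "throttle" or "queue_low_priority"."""
    window = int(now)
    if window != state.window:  # a new window starts: reset the counters
        state.window = window
        state.global_used = 0
        state.tenant_used = {}

    used = state.tenant_used.get(tenant, 0)
    if state.global_used + tokens > state.global_cap:
        return "throttle"
    if used + tokens > state.tenant_quota.get(tenant, 0):
        return "throttle"
    if queue_length > state.max_queue_length:
        # Assumption: queued requests are charged only when dequeued.
        return "queue_low_priority"

    state.global_used += tokens
    state.tenant_used[tenant] = used + tokens
    return "accept"


state = AdmissionState(global_cap=1000,
                       tenant_quota={"a": 600, "b": 600},
                       max_queue_length=10)
decisions = [
    admit(state, "a", 500, now=0.1, queue_length=0),
    admit(state, "a", 200, now=0.2, queue_length=0),   # tenant quota hit
    admit(state, "b", 600, now=0.3, queue_length=0),   # global cap hit
    admit(state, "b", 400, now=0.4, queue_length=50),  # engine is backed up
    admit(state, "a", 500, now=1.2, queue_length=0),   # next window
]
print(decisions)
```

Keeping all checks in one function mirrors why a single atomic script is attractive: counters are read and updated together, so two concurrent requests cannot both slip under the same cap.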

Governance at scale

By front‑ending every request with a central gateway that audits payloads, enforces prompt‑injection detection, and logs all decisions to ServiceNow, SS&C addresses FINRA, ISO/IEC 27001, and the emerging OWASP Top 10 for LLM Applications requirements without sacrificing latency. The policy engine is version‑controlled in Git, enabling audit trails for compliance reviewers.

Batch workload optimisation

The S3‑to‑Kafka proxy decouples file ingestion from GPU processing, allowing the system to smooth spikes caused by large document uploads. Because the proxy stores file metadata in Valkey, the scheduler can predict token consumption and automatically reject jobs that would breach the SLA, avoiding costly out‑of‑band retries.
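
The accept, defer, or reject decision described here can be sketched as follows. This is an illustrative model only: the BatchJob fields, the TOKENS_PER_PAGE constant, the hourly capacity table, and the simplification that a job must fit inside a single hour are assumptions, not details from the talk.

```python
from dataclasses import dataclass

TOKENS_PER_PAGE = 500  # hypothetical estimate used to predict consumption


@dataclass
class BatchJob:
    job_id: str
    pages: int
    deadline_hour: int  # SLA: job must start before this hour (UTC, same day)


def estimate_tokens(job):
    return job.pages * TOKENS_PER_PAGE


def schedule(job, now_hour, spare_tokens_by_hour):
    """Return ("accept", hour), ("defer", hour) or ("reject", None).

    spare_tokens_by_hour maps an hour to the GPU token budget not
    needed by realtime traffic; it is updated when a job is placed.
    """
    needed = estimate_tokens(job)
    for hour in range(now_hour, job.deadline_hour):
        if spare_tokens_by_hour.get(hour, 0) >= needed:
            spare_tokens_by_hour[hour] -= needed  # reserve the capacity
            return ("accept" if hour == now_hour else "defer", hour)
    return ("reject", None)  # no window before the deadline can absorb it


spare = {14: 100_000, 15: 200_000, 18: 2_000_000}
first = schedule(BatchJob("ocr-1", pages=100, deadline_hour=20), 14, spare)
second = schedule(BatchJob("ocr-2", pages=3000, deadline_hour=20), 14, spare)
third = schedule(BatchJob("ocr-3", pages=3000, deadline_hour=20), 14, spare)
print(first, second, third)
```

Rejecting at submission time, rather than after the job has consumed GPU time and missed its deadline, is what avoids the out‑of‑band retries mentioned above.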

Future work

KV‑Cache exploitation – the team plans to integrate NVIDIA’s KV‑Cache extensions into vLLM to cut token‑generation latency by up to 15 % for long‑context models.
Model quantisation – exploring 4‑bit (INT4) kernels to increase the number of concurrent models per GPU without sacrificing the accuracy required for regulated workloads.
Dynamic GPU partitioning – the workloads Stein described do not currently use NVIDIA Multi‑Instance GPU (MIG) partitioning; adopting MIG across the H100/H200 fleet could enable true elastic partitioning, reducing the need for static node selectors.

Takeaways for engineers building private‑cloud AI services

Treat the GPU as a shared, rate‑limited resource – use a fast in‑memory store (Valkey) with Lua scripts to enforce per‑tenant quotas and global back‑pressure.
Separate the control plane (policy, registration) from the data plane (GPU inference) – this reduces coupling and makes it easy to swap out the inference engine (vLLM → Triton, SGLang, etc.) without rewriting governance.
Leverage Kubernetes namespace isolation for disaster‑recovery and environment segregation; a single GPU‑pool proxy can then make global scheduling decisions.
Schedule batch jobs during off‑peak windows – the same GPU fleet that serves sub‑second chat can also run nightly document‑intelligence pipelines, dramatically improving ROI.
Open‑source first – Stein’s preference for community‑maintained projects (vLLM, OPA) avoids vendor lock‑in and ensures that security patches can be applied quickly.
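
The control‑plane and data‑plane separation in the second takeaway can be illustrated with a small interface. This is a sketch under stated assumptions: InferenceBackend, GovernedGateway, and the two stand‑in backends are hypothetical names, and the substring check is a placeholder for real guardrails such as prompt‑injection detection.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    """Data plane: anything that can turn a prompt into text."""

    def generate(self, model: str, prompt: str) -> str: ...


class EchoBackend:
    """Stand-in for one engine; a real one would call its serving API."""

    def generate(self, model, prompt):
        return f"[{model}] {prompt}"


class ReversedBackend:
    """Stand-in for a replacement engine."""

    def generate(self, model, prompt):
        return prompt[::-1]


class GovernedGateway:
    """Control plane: policy and audit, unaware of the engine in use."""

    def __init__(self, backend: InferenceBackend, blocked_terms: set):
        self.backend = backend
        self.blocked_terms = blocked_terms
        self.audit_log = []

    def handle(self, tenant, model, prompt):
        if any(term in prompt.lower() for term in self.blocked_terms):
            self.audit_log.append((tenant, "rejected"))
            raise PermissionError("prompt rejected by policy")
        self.audit_log.append((tenant, "allowed"))
        return self.backend.generate(model, prompt)


gateway = GovernedGateway(EchoBackend(), {"ignore previous instructions"})
first_reply = gateway.handle("tenant-a", "m", "hello")

gateway.backend = ReversedBackend()  # swap the engine, keep governance
second_reply = gateway.handle("tenant-a", "m", "hello")
print(first_reply, second_reply, gateway.audit_log)
```

Because the gateway depends only on the generate method, replacing the engine does not touch the policy or audit code.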

By combining these patterns, SS&C demonstrates that enterprise‑grade AI services can be delivered on‑premises at a fraction of the cost of public‑cloud alternatives, while still providing the compliance guarantees required in finance and healthcare.