Cedana’s Forward‑Deployed Engineer Role Tackles AI‑HPC Reliability at Scale

Cedana, a YC‑S23 startup, is hiring a Forward‑Deployed Engineer to bring its kernel‑level GPU checkpointing system to research labs and enterprise clusters. The role blends deep Linux expertise with customer‑facing integration work, aiming to cut downtime and improve utilization for costly AI and HPC workloads.


The problem Cedana is trying to fix

High‑performance computing (HPC) and large‑scale AI training run on expensive GPU clusters that are often under‑utilized. A single node failure can stall weeks of training, costing both time and money. Operators juggle SLURM, Kubernetes, and proprietary orchestration tools, each with its own quirks, making it hard to keep workloads running smoothly. The industry therefore needs a way to move GPU jobs between machines without losing progress, while keeping the underlying software stack untouched.

Cedana’s approach

Cedana builds an automated checkpoint‑and‑migration layer that operates at the kernel/OS level. Using CRIU (Checkpoint/Restore In Userspace) together with custom NVIDIA driver hooks, the system can pause a running GPU job, serialize its state, and resume it on a different instance in seconds. Because the solution works beneath the container runtime, customers do not have to modify their code or change job scripts. It integrates with common orchestration layers (SLURM, Kubernetes, and NVIDIA Dynamo) through plugins and operators, letting clusters achieve higher utilization and lower failure impact without a major re‑architecture.
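To make the pause, serialize, resume cycle concrete, here is a minimal sketch of how stock CRIU is typically driven from its command line for an ordinary CPU process. It is not Cedana's implementation: the helper names (`criu_dump_cmd`, `criu_restore_cmd`), the PID, and the image directory are hypothetical, and capturing GPU state requires additional machinery beyond what is shown here. The sketch only builds the commands; it does not run them.

```python
import shlex
from pathlib import Path


def criu_dump_cmd(pid: int, images_dir: Path, leave_running: bool = False) -> list[str]:
    """Build a `criu dump` command that checkpoints the process tree rooted at `pid`.

    Hypothetical helper for illustration only.
    """
    cmd = [
        "criu", "dump",
        "--tree", str(pid),                # root of the process tree to checkpoint
        "--images-dir", str(images_dir),   # where the serialized state is written
        "--shell-job",                     # allow processes started from a shell
    ]
    if leave_running:
        # Keep the original process alive after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir: Path) -> list[str]:
    """Build a `criu restore` command that resumes from images in `images_dir`."""
    return ["criu", "restore", "--images-dir", str(images_dir), "--shell-job"]


if __name__ == "__main__":
    ckpt = Path("/tmp/ckpt")  # assumed location; must be reachable from the target host
    print(shlex.join(criu_dump_cmd(1234, ckpt)))
    print(shlex.join(criu_restore_cmd(ckpt)))
```

In a migration, the image directory would be copied or shared to the destination machine between the dump and the restore; running CRIU itself generally requires elevated privileges.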

Why the role matters

Cedana is still early‑stage (YC batch S23, founded 2023) but already has deployments in research universities, cloud inference providers, and a Fortune 100 pharma lab. The Forward‑Deployed Engineer will be the bridge between the product team and those customers. Responsibilities include:

Installing and configuring Cedana’s stack on diverse environments (bare‑metal SLURM clusters, Kubernetes nodes, hybrid setups).
Writing SLURM plugins and Kubernetes operators that expose the migration functionality to existing job workflows.
Measuring reliability gains and throughput improvements, feeding those metrics back into product development.
Building a repeatable install playbook so that the second customer in any segment can be onboarded faster than the first.
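As a rough illustration of the reliability measurement mentioned in the list above, the sketch below estimates how much work a job loses to failures under a given checkpoint interval. It uses a common back-of-the-envelope model (on average, half a checkpoint interval of work is redone per failure, plus restore time) and ignores the cost of writing checkpoints. All numbers and function names are hypothetical, not Cedana measurements.

```python
def expected_lost_hours(failures: float, checkpoint_interval_h: float, restore_h: float) -> float:
    """Expected hours lost to failures over a run.

    Assumes failures land uniformly within a checkpoint interval, so each one
    costs half an interval of redone work plus the time to restore.
    """
    return failures * (checkpoint_interval_h / 2 + restore_h)


def goodput(total_h: float, lost_h: float) -> float:
    """Fraction of wall-clock time spent on useful (non-redone) work."""
    return (total_h - lost_h) / total_h


if __name__ == "__main__":
    run_hours = 720.0   # hypothetical 30-day training run
    failures = 6.0      # hypothetical failure count over that run

    # Hypothetical baseline: application-level checkpoint every 4 h, 30 min restore.
    baseline_lost = expected_lost_hours(failures, 4.0, 0.5)
    # Hypothetical improved setup: checkpoint every 15 min, 3 min restore.
    improved_lost = expected_lost_hours(failures, 0.25, 0.05)

    print(f"baseline: {baseline_lost:.2f} h lost, goodput {goodput(run_hours, baseline_lost):.4f}")
    print(f"improved: {improved_lost:.2f} h lost, goodput {goodput(run_hours, improved_lost):.4f}")
```

Comparing the two goodput figures before and after a deployment is one simple way to express the gain in a form that product and customer teams can both track.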

The position demands a rare mix of deep Linux system knowledge—systemd, cgroups v2, namespaces, kernel modules—and hands‑on experience with HPC schedulers. Candidates who have led multi‑month deployments, debugged cgroup or driver issues at odd hours, and contributed to open‑source scheduler code will fit the bill.

Market context and funding

GPU clusters are a bottleneck for both AI research and enterprise inference. According to recent industry reports, the total spend on AI‑focused HPC infrastructure is approaching $30 billion annually, with a sizable portion allocated to redundancy and fault‑tolerance. Cedana’s solution targets the “middle” of that spend: customers who cannot afford full duplication of hardware but need to protect against costly downtime.

Cedana raised its seed round as part of Y Combinator’s S23 batch. The exact amount is not disclosed; YC‑backed startups typically secure $1‑2 million in seed funding, which would give Cedana enough runway to expand the engineering team and deepen integrations with major cloud providers. The company describes early traction in “leading inference platforms, neoclouds, enterprise, and research clusters,” which suggests that the product is already addressing a real pain point.

What you get as a candidate

Base salary between $140k and $180k, plus equity that could become meaningful if Cedana’s adoption curve continues upward.
Full remote work (US‑based) with about a quarter of the time spent traveling to customer sites.
Health, dental, vision coverage for employees and families, unlimited PTO, and a 401(k) plan.
Direct interaction with the founders—Neel Master and Niranjan Ravichandra—who bring a decade of systems engineering experience from places like Shopify and prior AI‑healthcare ventures.

How to evaluate the opportunity

If you enjoy troubleshooting low‑level Linux issues as much as you like translating those fixes into a polished customer experience, this role offers a chance to shape a product that could become a standard layer for AI‑HPC reliability. The position also provides exposure to a range of environments—from university supercomputers to Fortune 100 pharma clusters—giving you a breadth of experience that is hard to find in a pure SaaS role.

For more details on the company and the application process, see the YC batch page and the official job posting.

#HPC #GPU #Linux #job scheduling #AI training