Platform · Full-time · Remote (Nepal)

Staff Runtime Engineer

About the role

The agent task execution engine is the core of everything we ship. It decides how work moves through a system — what runs in parallel, what retries, what escalates, and what gets surfaced to a human. As the Staff Runtime Engineer, you own that engine: its architecture, its reliability posture, and its performance envelope across every client deployment.

This is not a maintenance role. We are actively rearchitecting the runtime to support more complex agentic workloads — multi-step reasoning chains, approval gates, real-time observability hooks, and cross-system handoffs. You will drive those architectural decisions, mentor engineers across the platform team, and be the technical voice when customers ask hard questions about how their systems behave under load.

We build with a Linear and Vercel aesthetic in mind: decisions should be defensible, systems should be boring in the best way, and the runtime should be the part of the stack that never makes it into a postmortem. If you care about making the infrastructure layer invisible so product teams can move fast, this is the role.

What you'll do

Own the design and evolution of the agent task execution engine, including scheduling, retry logic, cancellation, and concurrency controls.
Define and enforce SLOs for task throughput, latency, and error rates across production deployments — and own the on-call rotation for runtime incidents.
Build and maintain the observability layer: structured telemetry, distributed tracing, and the dashboards that let both our engineers and customers see what their agents are doing in real time.
Lead architectural reviews for new runtime capabilities, ensuring changes are backward-compatible, well-instrumented, and tested against adversarial workloads before they reach production.
Partner with the security and AppSec teams to harden the runtime against prompt injection, runaway tasks, and resource abuse vectors that are unique to agentic systems.
Mentor platform engineers, review critical code paths, and raise the team's standard for system design — particularly around distributed state and failure modes.

What we're looking for

8+ years of backend engineering experience, including at least 3 years in a senior or staff-level role owning distributed systems in production.
Deep familiarity with task queue patterns, orchestration engines (Temporal, Celery, BullMQ, or similar), and the failure modes that come with them at scale.
Strong understanding of observability primitives: structured logging, metrics, distributed tracing, and how to build dashboards that surface signal over noise.
Experience defining SLOs, running incident retrospectives, and making architectural changes based on production data rather than intuition.
Comfortable writing TypeScript or Go (our primary runtime languages) and reviewing code across language boundaries.

Nice to have

Prior experience building infrastructure for AI workloads — especially LLM inference coordination, agent orchestration, or multi-step pipeline execution.
Familiarity with Vercel Edge, Cloudflare Workers, or other edge compute environments where cold starts and memory limits are real constraints.
You have shipped a runtime or platform component that is now used by customers you have never met and still works correctly.

Interview process

1. Intro call (30 min)
A conversation with our engineering lead to understand your background and what you're optimizing for in your next role. No coding, no tricks.
2. System design interview (60 min)
We'll give you a real problem from our runtime and work through a design together. We care about your reasoning, not whether you arrive at our exact answer.
3. Technical deep-dive (60 min)
Walk us through a system you're proud of. We'll ask hard questions about the tradeoffs you made and what you'd do differently today.
4. Offer and references
If we're aligned, we'll move quickly. References are a conversation, not a checkbox.

Apply for this role

Fill out the form below. We read every application and respond within 5 business days.

Your information is used solely for evaluating your application and is stored securely.