Every millisecond and every cent matters at this scale. The work is making large models fast and cheap to serve, and knowing which of the two the workload in front of you actually needs.

What you will do

Profile and cut latency across the serving stack, from kernel to cluster.
Bring cost per token down without giving up quality.
Choose the right batching, caching, and quantization for the workload.
Keep the system fast when traffic is not polite.

What we look for

You have shipped inference at scale, not just benchmarked it.
You are comfortable from CUDA up to Kubernetes.
You can tell which optimizations are worth the complexity and which are not.

What stays open

The shape of the engagement is something we settle in conversation. Apply and we will figure out what fits.

More about the inference engineer track →

Inference Engineer
