Machine Learning Engineer — Inference Optimization

Featherless AI · Remote (world)

Remote · Mid-level · Machine Learning

About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production, turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do

- Optimize inference latency, throughput, and cost for large-scale ML models in production
- Profile GPU/CPU inference pipelines and identify bottlenecks (memory, kernels, batching, IO)
- Implement and tune techniques such as:
  - Quantization (fp16, bf16, int8, fp8)
  - KV-cache optimization & reuse
  - Speculative decoding, batching, and streaming
  - Model pruning or architectural simplifications for inference
- Collaborate with research engineers to productionize new model architectures
- Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)
- Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups
- Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For

- Strong experience in ML inference optimization or high-performance ML systems
- Solid understanding of deep learning internals (attention, memory layout, compute graphs)
- Hands-on experience with PyTorch (or similar) and model deployment
- Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
- Experience scaling inference for real users (not just research benchmarks)
- Comfortable working in fast-moving startup environments with ownership and ambiguity

Nice to Have

- Experience with LLM or long-context model inference
- Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
- Experience optimizing across different hardware vendors
- Open-source contributions in ML systems or inference tooling
- Background in distributed systems or low-latency services

Why Join Us

- Real ownership over performance-critical systems
- Direct impact on product reliability and unit economics
- Close collaboration with research, infra, and product
- Competitive compensation + meaningful equity at Series A
- A team that cares about engineering quality, not hype

Posted 2026-01-22