Design and build platform primitives (Python SDKs, platform APIs, and templates) that enable reproducible experiments, configuration-as-code workflows, model lineage, and artifact tracking, and that support seamless promotion from research to production.
Create developer tools (CLIs, UIs, dashboards, and visualization layers) that improve the developer experience and simplify platform operation and multi-stage workflows.
Implement and scale distributed training systems (multi-node GPU workloads) on top of Kubernetes and a cloud-based orchestration foundation.
Build large-scale evaluation frameworks for offline tests, shadow deployments, and A/B experimentation.
Implement model/dataset versioning, approvals, lineage tracking, retention, and compliance hooks.
Partner with AI/ML research, platform engineering/MLOps and infrastructure, and data engineering teams to generalize workflows into reusable frameworks.
Qualifications & Experience
BS in Computer Science, Mathematics, Engineering, or equivalent technical field. Master’s preferred.
Proven track record building large-scale distributed systems and integrated data and AI/ML platforms (e.g., training, serving, workflow orchestration, data pipelines).
Expert-level proficiency in Python and one of Go/Java/C++, plus experience building production-grade services/APIs/SDKs.