Develop, improve, and maintain the MLOps platform to support scalable, reproducible, and observable machine learning and generative AI workflows.
Design and operate core ML infrastructure including feature stores, model registries, CI/CD pipelines, and data pipelines using AWS services such as SageMaker, ECS/EKS, Lambda, and Step Functions.
Enable and support AI and ML development teams by providing best practices, tooling, and technical guidance on using the platform for training, fine-tuning, and deployment.
Drive architecture and technology decisions across the ML stack, including frameworks, orchestration, data processing, and observability tools.
Collaborate with AI engineering teams to integrate LLMs and generative AI capabilities into products using secure, standardized, and auditable infrastructure.
Continuously evaluate and adopt emerging tools and technologies such as LangChain, Ray, MLflow, Kubeflow, and Hugging Face to improve platform capability and developer productivity.
Qualifications & Experience
Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field.
5+ years of experience in ML/AI platform engineering, infrastructure engineering, or MLOps, preferably in enterprise or SaaS environments.
Strong hands-on experience with AWS cloud services, including SageMaker, ECS/EKS, S3, Lambda, and Step Functions, and with infrastructure-as-code tools such as Terraform or CloudFormation.
Proficiency in Python, Docker, and Kubernetes for building and operating scalable ML infrastructure.
Experience with MLOps tools and frameworks such as MLflow, Kubeflow, Vertex AI, Azure ML, or equivalent, and familiarity with LLM infrastructure, vector databases, and RAG-based systems.