As a staff engineer on the ML Compute team, your work will include:
- Drive large-scale training initiatives to support our most complex models.
- Operationalize large-scale ML workloads on Kubernetes.
- Enhance distributed cloud training techniques for foundation models.
- Design and integrate end-to-end lifecycles for distributed ML systems.
- Develop tools and services to optimize ML systems beyond model selection.
- Architect a robust MLOps platform to support seamless ML operations.
- Collaborate with cross-functional engineers to solve large-scale ML training challenges.
- Research and implement new patterns and technologies to improve system performance, maintainability, and design.
- Lead complex technical projects, defining requirements and tracking progress with team members.
- Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
- Cultivate a team centered on collaboration, technical excellence, and innovation.