Site Reliability Engineer (SRE) - LLM and Machine Learning


Job details
  • Techruiter
  • London
  • 5 days ago

We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.

As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimise infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.

Responsibilities

  • Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
  • Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
  • Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
  • Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
  • Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
  • Security and Compliance: Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
  • Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
  • Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

  • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
  • Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
  • Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerisation technologies (e.g., Docker, Kubernetes).
  • Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
  • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
  • Scripting and automation skills (e.g., Python, Bash).
  • Excellent problem-solving and troubleshooting skills.
  • Strong communication and collaboration skills.
