Lead Site Reliability Engineer

7 days ago

WorkFromHome, Colombia Masabi Full-time

Introducing Masabi

At Masabi, we’re driving the fare payment revolution, powering the journeys of millions all over the world. We build fare collection platforms that allow riders to seamlessly buy and present tickets for public transport on their mobile phones, from a ticket machine, or even by tapping their bank card to travel.

Role: Lead Site Reliability Engineer

We’re looking for a Lead Site Reliability Engineer to join our platform team: someone who’s confident working hands-on with infrastructure, but also ready to shape how we scale and operate as a global team. You’ll take ownership of key systems, lead cross-functional work, and help evolve the way we build for performance, reliability, and security. This role is ideal for those who enjoy solving complex problems, improving systems through automation, and supporting others as they grow. It’s a chance to have both technical depth and meaningful influence, while staying close to the work that matters.

Location

This role is available in a remote model to candidates based in Colombia.

What You’ll Be Doing

Build and automate reliable systems: Lead design discussions and make key architectural decisions for reliability, scalability, and performance. Establish SRE standards and best practices (IaC patterns, CI/CD maturity, observability, etc.) across teams. Design and manage infrastructure using Terraform and CloudFormation. Build and evolve CI/CD pipelines that support fast, safe, and frequent deployments. Automate manual tasks to reduce operational load and enable faster delivery. Help expand our infrastructure globally, scaling up new environments with care.

Improve visibility, scale and performance: Define and maintain SLIs, SLOs, and alerting strategies aligned with user experience. Implement monitoring solutions that give us clear, early signals during incidents. Lead capacity planning and performance tuning as our systems and teams grow. Identify opportunities to improve architecture for resilience and cost-effectiveness.

Own reliability and incident response: Lead or contribute to incident response, root cause analysis, and post-incident reviews. Design and maintain disaster recovery and failover strategies. Partner with compliance and security teams to meet frameworks like SOC 2 and PCI.

Support others and share your knowledge: Collaborate with engineers, architects, and product teams to embed SRE practices from the start and define long-term platform reliability strategy. Mentor others in areas like observability, incident readiness, and infrastructure-as-code. Document systems and processes clearly to support learning and long-term success. Take part in the on-call rotation, which is shared with the team and paid on top of salary.

About You

You’re an experienced SRE who combines technical depth with curiosity, care, and a desire to make things better for the platform, the team, and the people using our systems. You’ve worked in SRE, platform, or DevOps roles where reliability was business-critical (24/7). You have proven experience designing and evolving production-grade systems for scale and resilience. You’re comfortable designing and operating in AWS, with strong knowledge of cloud architecture, networking and security (VPC design, IAM, least privilege). You have hands-on experience with Terraform, infrastructure automation, and CI/CD systems. You’ve led or contributed to high-impact projects involving observability, performance, incident command and/or reliability (distributed tracing, log correlation, metrics maturity, etc.). You communicate clearly and drive cross-functional reliability improvements in distributed, async-first teams. You enjoy helping others grow and value a kind, collaborative engineering culture. You take pride in doing things the right way, but you’re pragmatic and focused on impact.

Nice To Have

Familiarity with PCI DSS v4 or similar compliance standards. Experience with container orchestration. AWS certifications.

Tools & Platforms

Monitoring & Observability: Grafana, Prometheus, CloudWatch, Pingdom, Kibana.
Infrastructure as Code: Terraform, CloudFormation.
Configuration Management & Logging: Puppet, Confluent Cloud.

Why Join Masabi?

Driven by Purpose: We believe in journeys made simple. The work isn’t always easy, but the best things never are.
Encouraged to Accelerate: Masabi is going places and our people are in the driving seat. Whether you’re taking the direct route or exploring new paths, we support your journey.
Advancing with Empathy: We put people first and foster a culture of learning, not blame. No matter your cargo, we share the load.

We’re already powering journeys. Are you ready to join us?

Remote Lead Site Reliability Engineer — Scale

7 days ago

WorkFromHome, Colombia Masabi Full-time

A leading fintech company is seeking a Lead Site Reliability Engineer to enhance system reliability. This remote role in Colombia involves designing reliable systems, contributing to incident response, and mentoring teams. Candidates should have substantial SRE or DevOps experience, particularly in AWS and infrastructure automation. A supportive and...
Senior Site Reliability Engineer — Remote

2 weeks ago

WorkFromHome, Colombia Truelogic Software LLC Full-time

A leading software development firm based in Colombia is looking for a Site Reliability Engineer to enhance the reliability of their AWS and Kubernetes systems. The engineer will focus on observability, operational improvements, and collaborate with various engineering teams. This position offers 100% remote work and a highly competitive USD salary, along...
Senior Site Reliability Engineer — Cloud

1 week ago

WorkFromHome, Colombia AgileEngine Full-time

A leading software development company in Colombia is seeking a Site Reliability Engineer to shape secure and scalable cloud-native systems. You will design resilient AWS infrastructure, lead CI/CD pipeline development, and mentor teams in DevSecOps practices. This role emphasizes innovation and collaboration with a focus on automation and observability....
Lead Site Reliability Engineer

2 days ago

WorkFromHome, Colombia EPAM Systems Full-time

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of...
Site Reliability Engineer ID45689

1 week ago

WorkFromHome, Colombia AgileEngine Full-time

AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best...
Senior Site Reliability Engineer — Cloud

2 weeks ago

WorkFromHome, Colombia AgileEngine Full-time

A leading software development firm in Colombia is seeking an experienced Site Reliability Engineer (SRE) to enhance cloud-native systems' reliability and efficiency. You will work closely with cross-functional teams, focusing on resilient AWS infrastructure and DevSecOps practices. Candidates should possess 8–10 years of experience in infrastructure or...
Site Reliability Engineer

2 weeks ago

WorkFromHome, Colombia BairesDev Full-time

Overview: Site Reliability Engineer at BairesDev (remote work). We are looking for a Site Reliability Engineer to administer and support the project infrastructure hosted in the cloud while implementing CI/CD pipelines to automate deployments. What You Will Do: Ensure high service availability, performance, security, and maintainability....
Senior Site Reliability Engineer — Remote

2 weeks ago

WorkFromHome, Colombia Truelogic Full-time

A leading nearshore staff augmentation firm in Bogotá seeks a Site Reliability Engineer to enhance the reliability of distributed systems on AWS and Kubernetes. Responsibilities include designing observability strategies, monitoring system behavior, and automating operational responses. The ideal candidate has over 5 years of experience in SRE/Platform...
Site Reliability Engineer

2 weeks ago

WorkFromHome, Colombia Canonical Full-time

Canonical is a leading provider of open-source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation, and IoT. With customers that include the world's leading public...
Remote Site Reliability Engineer — CI/CD

2 weeks ago

WorkFromHome, Colombia BairesDev Full-time

A technology solutions company is seeking a Site Reliability Engineer to manage cloud infrastructure and automate deployments. Responsibilities include ensuring service availability, implementing CI/CD pipelines, and troubleshooting issues. Ideal candidates have experience with Kubernetes, Ansible, and CI/CD tools, along with advanced English skills. Enjoy a...