Lead Site Reliability Engineer

7 days ago


WorkFromHome, Colombia Masabi Full-time

Introducing Masabi

At Masabi, we're driving the fare payment revolution, powering the journeys of millions all over the world. We build fare collection platforms that allow riders to seamlessly buy and present tickets for public transport on their mobile phones, from a ticket machine, or even by tapping their bank card to travel.

Role: Lead Site Reliability Engineer

We're looking for a Lead Site Reliability Engineer to join our platform team: someone who's confident working hands-on with infrastructure, but also ready to shape how we scale and operate as a global team. You'll take ownership of key systems, lead cross-functional work, and help evolve the way we build for performance, reliability, and security.

This role is ideal for those who enjoy solving complex problems, improving systems through automation, and supporting others as they grow. It's a chance to have both technical depth and meaningful influence, while staying close to the work that matters.

Location

This role is available in a remote model to candidates based in Colombia.

What You'll Be Doing

Build and automate reliable systems
• Lead design discussions and make key architectural decisions for reliability, scalability, and performance.
• Establish SRE standards and best practices (IaC patterns, CI/CD maturity, observability, etc.) across teams.
• Design and manage infrastructure using Terraform and CloudFormation.
• Build and evolve CI/CD pipelines that support fast, safe, and frequent deployments.
• Automate manual tasks to reduce operational load and enable faster delivery.
• Help expand our infrastructure globally, scaling up new environments with care.

Improve visibility, scale and performance
• Define and maintain SLIs, SLOs, and alerting strategies aligned with user experience.
• Implement monitoring solutions that give us clear, early signals during incidents.
• Lead capacity planning and performance tuning as our systems and teams grow.
• Identify opportunities to improve architecture for resilience and cost-effectiveness.

Own reliability and incident response
• Lead or contribute to incident response, root cause analysis, and post-incident reviews.
• Design and maintain disaster recovery and failover strategies.
• Partner with compliance and security teams to meet frameworks like SOC 2 and PCI.

Support others and share your knowledge
• Collaborate with engineers, architects, and product teams to embed SRE practices from the start and define long-term platform reliability strategy.
• Mentor others in areas like observability, incident readiness, and infrastructure-as-code.
• Document systems and processes clearly to support learning and long-term success.
• Take part in the on-call rotation, which is shared with the team and paid on top of salary.

About You

You're an experienced SRE who combines technical depth with curiosity, care, and a desire to make things better for the platform, the team, and the people using our systems.
• You've worked in SRE, platform, or DevOps roles where reliability was business-critical (24/7).
• You have proven experience designing and evolving production-grade systems for scale and resilience.
• You're comfortable designing and operating in AWS, with strong knowledge of cloud architecture, networking, and security (VPC design, IAM, least privilege).
• You have hands-on experience with Terraform, infrastructure automation, and CI/CD systems.
• You've led or contributed to high-impact projects involving observability, performance, incident command, and/or reliability (distributed tracing, log correlation, metrics maturity, etc.).
• You communicate clearly and drive cross-functional reliability improvements in distributed, async-first teams.
• You enjoy helping others grow and value a kind, collaborative engineering culture.
• You take pride in doing things the right way, but you're pragmatic and focused on impact.

Nice To Have
• Familiarity with PCI DSS v4 or similar compliance standards.
• Experience with container orchestration.
• AWS certifications.

Tools & Platforms
• Monitoring & Observability: Grafana, Prometheus, CloudWatch, Pingdom, Kibana.
• Infrastructure as Code: Terraform, CloudFormation.
• Configuration Management & Logging: Puppet, Confluent Cloud.

Why Join Masabi?
• Driven by Purpose: We believe in journeys made simple. The work isn't always easy, but the best things never are.
• Encouraged to Accelerate: Masabi is going places and our people are in the driving seat. Whether you're taking the direct route or exploring new paths, we support your journey.
• Advancing with Empathy: We put people first and foster a culture of learning, not blame. No matter your cargo, we share the load.

We're already powering journeys. Are you ready to join us?


