Principal Site Reliability Engineer

6 days ago


WorkFromHome, Colombia Groupon Full-time

Principal Site Reliability Engineer (AI-first SRE)
Remote - Colombia

Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.

Groupon is on a radical journey to transform our business with relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation. We’re a "best of both worlds" kind of company. We’re big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.

About the Role

Groupon is modernizing its global platform, and reliability is at the center of that transformation. We’re looking for a Principal Site Reliability Engineer to lead the evolution from reactive maintenance to predictive, AI-driven resilience. You’ll design intelligent, self-healing systems that prevent incidents before they happen, ensuring our customers enjoy fast, secure, and reliable experiences across millions of daily interactions.

Key Responsibilities

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLI/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
  • Mentor engineers and drive reliability standards across teams.
  • Partner with platform, data, and product teams to ensure stability aligns with business goals.
  • Support major incident response and incident review, and participate in on-call rotations.

Key Requirements

  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
  • Proficiency in Python or Go for automation and tooling.
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
  • Strong communication and influencing skills: data over hierarchy.

Nice to Have

  • Experience with MLOps or large-scale data infrastructure.
  • Exposure to FinOps or cloud cost optimization.
  • Previous leadership of global incident response or SRE transformation programs.

What Success Looks Like

  • 99.9%+ uptime sustained through predictive rather than reactive responses.
  • Faster MTTR via automated detection and auto-remediation.
  • Reliability insights used in leadership decisions.
  • Mentorship leading to stronger reliability practices across teams.

We Are Interested In

  • Technologists who see reliability as a product, not just a metric.
  • Engineers who use AI/ML as a tool for scale and insight.
  • Leaders who can balance innovation speed with operational excellence.
  • Engineers who understand the entire e-commerce stack and how it impacts revenue.

What We Offer

  • The opportunity to work with cutting-edge technologies in a transformative environment.
  • Professional growth and leadership development pathways tailored to your aspirations.
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems.

Join us to push the boundaries of platform reliability and drive meaningful change in a fast-evolving digital world.

Groupon is an AI-First Company

We’re committed to building smarter, faster, and more innovative ways of working, and AI plays a key role in how we get there. We encourage candidates to leverage AI tools during the hiring process where it adds value, and we’re always keen to hear how technology improves the way you work. If you’re passionate about AI or curious to explore how it can elevate your role, you’ll be right at home here.

Groupon’s purpose is to build strong communities through thriving small businesses. To learn more about the world’s largest local e-commerce marketplace, see the latest Groupon news and our DEI approach.

If all of this sounds like a great fit for you, apply and join us on a mission to become the ultimate destination for local experiences and services.



  • WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time

    Sr. Site Reliability Engineer. Location: Colombia (remote). Employment type: Full-Time Contract. Key Skills: Microsoft Technologies, IIS, Azure, AWS, Kubernetes (K8s); CI/CD Pipeline – Git Action; IaC – CloudFormation, Terraform; Monitoring – Grafana; Troubleshooting in SRE (engineering background preferred). Responsibilities: 80% – Production support under...


  • WorkFromHome, Colombia N-iX Full-time

    A leading technology firm located in Bogotá, Colombia is seeking a Site Reliability Engineer to enhance the reliability and scalability of software production environments, especially in onboarding new microservices. Responsibilities include automating workflows, managing service reliability, and collaborating across teams. The ideal candidate has strong...


  • WorkFromHome, Colombia Masabi Full-time

    A leading fintech company is seeking a Lead Site Reliability Engineer to enhance system reliability. This remote role in Colombia involves designing reliable systems, contributing to incident response, and mentoring teams. Candidates should have substantial SRE or DevOps experience, particularly in AWS and infrastructure automation. A supportive and...


  • WorkFromHome, Colombia Blankfactor Full-time

    This is a remote position as a full-time Colombia employee paid in COP. It requires a minimum of B2 English comprehension; please be sure to apply with your English CV. We are seeking a proactive and experienced Site Reliability Engineer (SRE) to join our team, focusing on maximizing the reliability, availability, and performance of our enterprise...


  • WorkFromHome, Colombia N-iX Full-time

    N-iX Bogota, D.C., Capital District, Colombia Overview Site Reliability Engineer (SRE) to help monitor, maintain, and scale software production environments, with a primary focus on onboarding new microservices. Work closely with development and platform teams to automate and program-managed onboarding lifecycle—from requirements and environment setup...


  • WorkFromHome, Colombia Truelogic Full-time

    A leading technology firm in Colombia seeks a Site Reliability Engineer to enhance the reliability of systems on AWS and Kubernetes. The role emphasizes observability and automated responses to system behavior. Candidates should have over five years of experience in SRE roles and expertise in AWS and Kubernetes. This position offers fully remote work,...


  • WorkFromHome, Colombia AgileEngine Full-time

    A leading software development company in Colombia is seeking a Site Reliability Engineer to design and deploy scalable cloud-native systems. The ideal candidate has over 8 years of experience in SRE, is highly proficient in AWS and Terraform, and excels in CI/CD pipelines. The role involves mentoring teams, improving system reliability, and implementing...


  • WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time

    A leading technology solutions provider is seeking a Senior Site Reliability Engineer to provide production support and drive DevOps activities. This remote position focuses on troubleshooting issues in production and maintaining CI/CD pipelines while leveraging Microsoft technologies, AWS, and Kubernetes. Ideal candidates have strong skills in production...


  • WorkFromHome, Colombia AgileEngine Full-time

    Site Reliability Engineer ID45689 at AgileEngine. AgileEngine is an Inc. 5000 company that creates award-winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in areas like application development and AI/ML, and our people-first culture has earned us multiple Best...


  • WorkFromHome, Colombia Masabi Full-time

    Lead Site Reliability Engineer Introducing Masabi // At Masabi, we’re driving the fare payment revolution, powering the journeys of millions all over the world. We build fare collection platforms that allow riders to seamlessly buy and present tickets for public transport either on their mobile phones, from a ticket machine, or even by tapping their bank...