Site Reliability Engineer

4 days ago

WorkFromHome, Colombia Blankfactor Full-time

This is a remote position as a full-time employee in Colombia, paid in COP. It requires a minimum of B2 English comprehension; please be sure to apply with your English CV.

We are seeking a proactive and experienced Site Reliability Engineer (SRE) to join our team, focusing on maximizing the reliability, availability, and performance of our enterprise applications hosted primarily on Azure. This role is central to shifting our operational model toward engineering excellence. Key responsibilities include building out comprehensive observability stacks using Azure Monitor/Application Insights, defining and managing error budgets, establishing performance baselines, and maturing our incident management and runbook procedures.

Responsibilities

Observability Implementation: Design, implement, and maintain robust, centralized observability solutions across the application and infrastructure stack, leveraging Azure Monitor and Application Insights.
SLO/Error Budget Management: Define Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical services and establish clear error budgets to guide engineering trade-offs between feature velocity and reliability.
Performance Engineering: Establish documented performance baselines for key transactions and resources; proactively identify and remediate bottlenecks before they impact users.
Incident Management Ownership: Own and drive the incident management process end-to-end, including on-call rotation participation, rapid response, effective communication, and post-mortem analysis.
Runbook Automation: Develop, document, and continuously refine high-quality, actionable runbooks for common failure scenarios, focusing on automating remediation steps where possible.
Toil Reduction: Systematically identify and eliminate manual, repetitive operational work (toil) through automation, scripting, and self-healing infrastructure.
System Design Consultation: Collaborate with development teams to review system designs, ensuring they meet reliability standards before deployment into production environments.
Cost Optimization: Monitor resource consumption and provide engineering guidance to optimize cloud infrastructure spend while maintaining target SLOs.

Qualifications

5+ years of experience in systems engineering, operations, or a dedicated Site Reliability Engineering role.
Extensive hands-on experience designing and managing complex observability solutions, specifically within the Azure ecosystem (Azure Monitor, Application Insights, Log Analytics).
Proven track record of defining, implementing, and enforcing SLOs/SLIs and error budgets for production services.
Direct experience managing critical production incidents, including formal incident management protocols (e.g., PagerDuty integration, clear communication paths).
Strong background in establishing performance baselines and using load testing or profiling tools to validate system behavior.
Proficiency in infrastructure automation using IaC tools such as Terraform or Bicep.
Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.

Skills and Competencies

In-depth understanding of cloud-native architecture patterns and services within Azure (or similar major cloud providers).
Expert knowledge of modern monitoring, logging, and tracing technologies (e.g., Prometheus, Grafana, Jaeger, distributed tracing concepts).
Strong scripting and automation skills (Python, PowerShell, Bash) for building automated remediation and tooling.
Deep knowledge of networking principles (DNS, HTTP, load balancing) as they relate to service availability.
Exceptional troubleshooting and analytical capabilities under pressure during high-severity incidents.
Excellent written communication skills, particularly in documenting clear, concise runbooks and thorough post-mortem reports.
Proven ability to prioritize reliability tasks and effectively negotiate technical debt and feature work with development teams.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Consulting and Information Technology
Industries: Software Development

Site Reliability Engineer

2 days ago

WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time

Sr. Site Reliability Engineer
Location: Colombia (Remote)
Employment type: Full-Time Contract
Key Skills: Microsoft Technologies, IIS, Azure, AWS; Kubernetes (K8s); CI/CD Pipeline – GitHub Actions; IaC – CloudFormation, Terraform; Monitoring – Grafana; Troubleshooting in SRE (engineering background preferred)
Responsibilities: 80% – Production support under...

Site Reliability Engineer: Microservices Onboarding

2 days ago

WorkFromHome, Colombia N-iX Full-time

A leading technology firm located in Bogotá, Colombia is seeking a Site Reliability Engineer to enhance the reliability and scalability of software production environments, especially when onboarding new microservices. Responsibilities include automating workflows, managing service reliability, and collaborating across teams. The ideal candidate has strong...

Remote Lead Site Reliability Engineer: Scale

2 weeks ago

WorkFromHome, Colombia Masabi Full-time

A leading fintech company is seeking a Lead Site Reliability Engineer to enhance system reliability. This remote role in Colombia involves designing reliable systems, contributing to incident response, and mentoring teams. Candidates should have substantial SRE or DevOps experience, particularly in AWS and infrastructure automation. A supportive and...

Site Reliability Engineer

2 days ago

WorkFromHome, Colombia N-iX Full-time

N-iX, Bogota, D.C., Capital District, Colombia. Overview: Site Reliability Engineer (SRE) to help monitor, maintain, and scale software production environments, with a primary focus on onboarding new microservices. Work closely with development and platform teams to automate and program-manage the onboarding lifecycle, from requirements and environment setup...

Senior Site Reliability Engineer: Cloud

2 days ago

WorkFromHome, Colombia AgileEngine Full-time

A leading software development company in Colombia is seeking a Site Reliability Engineer to design and deploy scalable cloud-native systems. The ideal candidate has over 8 years of experience in SRE, is highly proficient in AWS and Terraform, and excels at building CI/CD pipelines. The role involves mentoring teams, improving system reliability, and implementing...

Remote Site Reliability Engineer: SRE

2 days ago

WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time

A leading technology solutions provider is seeking a Senior Site Reliability Engineer to provide production support and drive DevOps activities. This remote position focuses on troubleshooting issues in production and maintaining CI/CD pipelines while leveraging Microsoft technologies, AWS, and Kubernetes. Ideal candidates have strong skills in production...

Lead Site Reliability Engineer

2 weeks ago

WorkFromHome, Colombia Masabi Full-time

Lead Site Reliability Engineer. Introducing Masabi: At Masabi, we’re driving the fare payment revolution, powering the journeys of millions all over the world. We build fare collection platforms that allow riders to seamlessly buy and present tickets for public transport either on their mobile phones, from a ticket machine, or even by tapping their bank...

Site Reliability Engineer

2 weeks ago

WorkFromHome, Colombia Truelogic Software Full-time

Site Reliability Engineer (AWS) - Technology. About Truelogic: At Truelogic, we are a leading provider of nearshore staff augmentation services headquartered in New York. For over two decades, we’ve been delivering top-tier technology solutions to companies of all...

Site Reliability Engineer

2 days ago

WorkFromHome, Colombia DCT Full-time

Overview: DCT, Bogota, D.C., Capital District, Colombia. Site Reliability Engineer. Responsibilities: Service & Infrastructure Management: Oversee and manage core platform web services, including API and database servers, to ensure optimal performance and health. System Monitoring & Emergency Response: Proactively monitor application and infrastructure health...

Site Reliability Engineer: Remote, Kubernetes

2 days ago

WorkFromHome, Colombia BairesDev Full-time

A leading technology solutions provider is seeking a Site Reliability Engineer to support and administer cloud project infrastructure. The role involves ensuring service availability and implementing CI/CD pipelines for automation. Candidates should have over 2 years of experience as an Infrastructure Engineer, familiarity with Kubernetes, and proficiency...
