Site Reliability Engineer
5 days ago
N-iX Bogota, D.C., Capital District, Colombia

Overview
Site Reliability Engineer (SRE) to help monitor, maintain, and scale software production environments, with a primary focus on onboarding new microservices. Work closely with development and platform teams to automate and program-manage the onboarding lifecycle, from requirements and environment setup through deployment, testing, documentation, and handover, ensuring reliability, scalability, performance, and compliance at every step.

Responsibilities
Lead and support the end-to-end onboarding process for new microservices into production environments.
Identify and automate gaps in the current onboarding workflow (deployment, configuration, monitoring, scaling, etc.).
Provide program management for onboarding activities, including timelines, dependencies, and stakeholder communication.
Collaborate with development and operations/platform teams to ensure smooth and consistent rollout of new services.
Design and implement monitoring, logging, and alerting for all onboarded services.
Ensure comprehensive metrics collection (e.g., availability, latency, error rates, throughput) to support SLOs/SLIs.
Tune alerts to minimize noise while ensuring rapid detection and response to production issues.
Perform load and stress testing to validate that services can scale to meet current and projected demand.
Implement and refine auto-scaling mechanisms and capacity planning practices.
Conduct ongoing performance tuning and optimization to achieve minimal latency and high throughput.
Drive high service reliability and uptime for all onboarded microservices.
Help teams design and implement fault-tolerant architectures, including failover and redundancy mechanisms.
Work with teams to adopt SRE best practices (e.g., error budgets, post-incident reviews, runbooks).
Ensure all onboarded services meet security and compliance requirements.
Integrate security best practices into deployment, monitoring, and operational processes.
Maintain audit trails and documentation for onboarding activities to support regulatory and internal compliance.
Create detailed documentation for the service onboarding process, including standards, patterns, and templates.
Develop and maintain runbooks, playbooks, and SOPs for ongoing operations.
Conduct training sessions and workshops for internal teams to enable self-service onboarding and long-term maintainability.
Participate in requirements analysis for new services; define onboarding success criteria and KPIs.
Develop onboarding plans outlining steps, timelines, responsibilities, and acceptance criteria; present plans to stakeholders for review and approval.
Prepare and validate environments, ensuring appropriate access, permissions, and tooling are in place.
Conduct comprehensive functional, performance, reliability, and security testing prior to go-live.
Provide post-onboarding support, monitoring services to ensure continued reliability and quickly addressing any issues that arise.

Required Qualifications
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role in microservices-based environments.
Strong understanding of microservices architecture, distributed systems, and cloud-native concepts.
Hands-on experience with:
  Production monitoring, logging, and alerting (e.g., metrics, tracing, log aggregation tools).
  Automation of deployment and operational workflows (e.g., scripts, pipelines, IaC, or similar).
  Load/performance testing and capacity planning.
Demonstrated ability to improve service reliability, scalability, and performance in production.
Familiarity with security best practices related to service deployment, monitoring, and operations.
Experience working across cross-functional teams (development, operations, security, compliance) to deliver complex initiatives.
Excellent documentation, communication, and stakeholder management skills.

Preferred Qualifications
Experience defining and tracking SRE KPIs/SLOs/SLIs for onboarding and production services.
Background in program or project management of technical initiatives (especially service onboarding or platform rollouts).
Prior experience in high-availability, regulated, or large-scale SaaS environments.

We offer
Flexible working format: remote, office-based, or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing
Education reimbursement
Memorable anniversary presents
Corporate events and team buildings
Other location-specific benefits
Not applicable for freelancers
-
Senior Site Reliability Engineer — Remote
2 days ago
WorkFromHome, Colombia Truelogic Software LLC Full-time
A leading software development firm based in Colombia is looking for a Site Reliability Engineer to enhance the reliability of their AWS and Kubernetes systems. The engineer will focus on observability and operational improvements, and will collaborate with various engineering teams. This position offers 100% remote work and a highly competitive USD salary, along...
-
Site Reliability Engineer
5 days ago
WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time
Sr. Site Reliability Engineer
Location: Colombia (REMOTE)
Employment type: Full Time Contract
Key Skills
Microsoft Technologies, IIS, Azure, AWS
Kubernetes (K8)
CI/CD Pipeline: Git Action
IaC: CloudFormation, Terraform
Monitoring: Grafana
Troubleshooting in SRE (Preferred engineering background)
Responsibilities
80%: Production support under...
-
WorkFromHome, Colombia N-iX Full-time
A leading technology firm located in Bogotá, Colombia is seeking a Site Reliability Engineer to enhance the reliability and scalability of software production environments, especially in onboarding new microservices. Responsibilities include automating workflows, managing service reliability, and collaborating across teams. The ideal candidate has strong...
-
Site Reliability Engineer
3 days ago
WorkFromHome, Colombia BairesDev Full-time
Overview
Site Reliability Engineer at BairesDev. We are looking for a Site Reliability Engineer to build and maintain highly reliable, scalable, and secure OpenShift/Kubernetes clusters. Approach production systems from a software engineering perspective with a focus on automation and reliability.
What you will do
Build, automate, and maintain...
-
Site Reliability Engineer
7 days ago
WorkFromHome, Colombia Blankfactor Full-time
This is a remote position as a full-time employee in Colombia, paid in COP. It requires a minimum B2 level of English comprehension; please be sure to apply with your English CV. We are seeking a proactive and experienced Site Reliability Engineer (SRE) to join our team, focusing on maximizing the reliability, availability, and performance of our enterprise...
-
Senior Site Reliability Engineer — Cloud
5 days ago
WorkFromHome, Colombia AgileEngine Full-time
A leading software development company in Colombia is seeking a Site Reliability Engineer to design and deploy scalable cloud-native systems. The ideal candidate has over 8 years of experience in SRE, is highly proficient in AWS and Terraform, and excels in CI/CD pipelines. The role involves mentoring teams, improving system reliability, and implementing...
-
Remote Site Reliability Engineer — SRE
5 days ago
WorkFromHome, Colombia Epsilon Solutions Ltd. SA de CV. Full-time
A leading technology solutions provider is seeking a Senior Site Reliability Engineer to provide production support and drive DevOps activities. This remote position focuses on troubleshooting issues in production and maintaining CI/CD pipelines while leveraging Microsoft technologies, AWS, and Kubernetes. Ideal candidates have strong skills in production...
-
Site Reliability Engineer
3 days ago
WorkFromHome, Colombia EPAM Systems Full-time
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams and contribute to a myriad of innovative projects that deliver the most...
-
Site Reliability Engineer
2 days ago
WorkFromHome, Colombia BairesDev Full-time
Overview
Site Reliability Engineer at BairesDev (remote work). We are looking for a Site Reliability Engineer to administer and provide support for the project infrastructure hosted in the cloud while implementing CI/CD pipelines for the automation of deployments.
What You Will Do
Ensure high service availability, performance, security, and maintainability....
-
Site Reliability Engineer
3 days ago
WorkFromHome, Colombia Monokera Full-time
We are Monokera, part of the Credicorp group, one of the first technology companies seeking to transform the insurance industry in LatAm. Our value proposition is based on offering an open digital insurance management platform that enables fast, frictionless integration of insurers, reinsurers, and services of...