Chaos SME

Job description

A leading IT company in Latin America is growing and is looking for:

Chaos SME

Roles & Responsibilities

Develop and implement chaos engineering principles to ensure system resilience and reliability.
Identify critical system components for resilience testing and failure simulations.
Design and execute chaos experiments to uncover vulnerabilities in systems.
Collaborate with development and operations teams to improve system stability.
Monitor and analyze system behavior during failure simulations and provide actionable insights.
Document findings, create reports, and recommend improvements based on chaos engineering results.
Develop tools, scripts, and frameworks for automated chaos testing.
Train and mentor teams on chaos engineering concepts and methodologies.
Ensure adherence to industry best practices for resilience testing.
Stay updated on the latest chaos engineering trends and tools.

Assessment Phase Responsibilities

Identify critical failure points and dependencies in system architecture.
Assess current system resilience capabilities and potential risks.
Collaborate with stakeholders to understand business requirements for reliability.
Perform a gap analysis to identify weaknesses in existing systems.
Define resilience metrics and key performance indicators (KPIs) for system stability.
Evaluate current monitoring tools and frameworks for coverage and effectiveness.
Create a high-level roadmap for introducing chaos engineering practices.
Assess the feasibility of implementing chaos engineering experiments in existing environments.
Document findings and provide recommendations to stakeholders for resilience improvement.
Align chaos engineering objectives with organizational goals and strategies.

Implementation Phase Responsibilities

Design and execute controlled chaos experiments to simulate failures and outages.
Collaborate with development and operations teams to implement resilience improvements.
Deploy monitoring and alerting tools to track system behavior during experiments.
Automate chaos testing workflows using tools such as Gremlin, Chaos Monkey, or LitmusChaos.
Analyze and interpret data from chaos experiments to identify areas for improvement.
Write detailed post-mortem reports and provide actionable recommendations.
Implement resilience strategies, such as failover mechanisms and load balancing.
Ensure compliance with security and regulatory standards during chaos testing.
Train teams on using chaos engineering tools and methodologies effectively.
Conduct post-implementation reviews to evaluate the success of resilience enhancements.

Required Skills and Qualifications

Proven experience in chaos engineering, system reliability, or resilience testing.
Strong knowledge of chaos engineering tools such as Gremlin, Chaos Monkey, or LitmusChaos.
Familiarity with cloud platforms (AWS, Azure, GCP) and container technologies (Docker, Kubernetes).
Expertise in monitoring and observability tools like Prometheus, Grafana, or Datadog.
Hands-on experience in automating failure simulations and resilience testing.
Strong analytical and problem-solving skills to interpret chaos experiment results.
Excellent communication and collaboration skills for working with cross-functional teams.
A bachelor's degree in computer science, engineering, or a related field.
Relevant certifications in reliability engineering or chaos engineering practices are a plus.

Advanced conversational English is essential (it will be evaluated).

Job type: On-site.

Location: Monterrey

Salary: $105,000 gross.

Benefits: Excellent, above-standard benefits.
