Chaos SME

  • On-site
    • Monterrey, Nuevo León, Mexico

Job description

A major IT company operating across Latin America is growing and is looking for a:

Chaos SME

Roles & Responsibilities
  • Develop and implement chaos engineering principles to ensure system resilience and reliability.
  • Identify critical system components for resilience testing and failure simulations.
  • Design and execute chaos experiments to uncover vulnerabilities in systems.
  • Collaborate with development and operations teams to improve system stability.
  • Monitor and analyze system behavior during failure simulations and provide actionable insights.
  • Document findings, create reports, and recommend improvements based on chaos engineering results.
  • Develop tools, scripts, and frameworks for automated chaos testing.
  • Train and mentor teams on chaos engineering concepts and methodologies.
  • Ensure adherence to industry best practices for resilience testing.
  • Stay updated on the latest chaos engineering trends and tools.
Assessment Phase Responsibilities
  • Identify critical failure points and dependencies in system architecture.
  • Assess current system resilience capabilities and potential risks.
  • Collaborate with stakeholders to understand business requirements for reliability.
  • Perform a gap analysis to identify weaknesses in existing systems.
  • Define resilience metrics and key performance indicators (KPIs) for system stability.
  • Evaluate current monitoring tools and frameworks for coverage and effectiveness.
  • Create a high-level roadmap for introducing chaos engineering practices.
  • Assess the feasibility of implementing chaos engineering experiments in existing environments.
  • Document findings and provide recommendations to stakeholders for resilience improvement.
  • Align chaos engineering objectives with organizational goals and strategies.
Implementation Phase Responsibilities
  • Design and execute controlled chaos experiments to simulate failures and outages.
  • Collaborate with development and operations teams to implement resilience improvements.
  • Deploy monitoring and alerting tools to track system behavior during experiments.
  • Automate chaos testing workflows using tools such as Gremlin, Chaos Monkey, or LitmusChaos.
  • Analyze and interpret data from chaos experiments to identify areas for improvement.
  • Write detailed post-mortem reports and provide actionable recommendations.
  • Implement resilience strategies, such as failover mechanisms and load balancing.
  • Ensure compliance with security and regulatory standards during chaos testing.
  • Train teams on using chaos engineering tools and methodologies effectively.
  • Conduct post-implementation reviews to evaluate the success of resilience enhancements.
Required Skills and Qualifications
  • Proven experience in chaos engineering, system reliability, or resilience testing.
  • Strong knowledge of chaos engineering tools such as Gremlin, Chaos Monkey, or LitmusChaos.
  • Familiarity with cloud platforms (AWS, Azure, GCP) and with containerization and orchestration tools (Docker, Kubernetes).
  • Expertise in monitoring and observability tools like Prometheus, Grafana, or Datadog.
  • Hands-on experience in automating failure simulations and resilience testing.
  • Strong analytical and problem-solving skills to interpret chaos experiment results.
  • Excellent communication and collaboration skills for working with cross-functional teams.
  • A bachelor's degree in computer science, engineering, or a related field.
  • Relevant certifications in reliability engineering or chaos engineering practices are a plus.


Advanced conversational English is essential (it will be evaluated).

Job type: On-site.

Location: Monterrey

Salary: $105,000 gross.

Benefits: Excellent benefits package.
