[Remote] Staff Site Reliability Engineer

Work from home | Full-time role | Hiring

Note: This is a remote position open to candidates in the USA. Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. The company is seeking a Staff Site Reliability Engineer to establish its SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities

  • Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
  • Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
  • Establish error budgets and use them to balance feature velocity with reliability investments
  • Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
  • Design and implement chaos engineering practices to proactively identify failure modes before they impact members
  • Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
  • Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
  • Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
  • Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
  • Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
  • Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
  • Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
  • Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
  • Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
  • Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
  • Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills

  • B.S. in Computer Science or equivalent professional experience
  • 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
  • Deep expertise in Kubernetes (K8s) — including cluster management, Helm charts, service meshes, and production-grade container orchestration
  • Strong systems engineering background with advanced proficiency in Linux administration
  • Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
  • Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
  • Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
  • Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
  • Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
  • Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
  • Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
  • Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
  • Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
  • Experience with Concourse CI, GitHub Actions (GHA), or similar deployment frameworks
  • Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
  • Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
  • Experience building and managing cost-optimization strategies for cloud infrastructure
  • Background in establishing SRE practices in organizations transitioning from traditional DevOps models
  • Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits

  • Comprehensive health benefits (medical, dental, vision, life and disability)
  • Competitive salary (DOE) + equity
  • 401(k) plan
  • 9 Observed Holidays
  • Flexible Paid Time Off
  • Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
  • Ability to work in our beautiful office in Playa Vista
  • Free Thrive Market membership with exclusive employee discount
  • Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview

  • Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013 and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.