[Remote] Senior Platform & Reliability Engineer (SRE)
Note: This is a remote position open to candidates in the USA. Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. They are seeking a Senior Platform & Reliability Engineer to own service reliability and keep the platform fast and resilient as it scales.
Responsibilities
- Set and enforce SLIs/SLOs/error budgets for critical user flows
- Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access
- Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety
- Own poison pill containment and workload isolation
- Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution)
- Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline
- Gate risky deploys and enforce reliability guardrails when production health is at risk
Skills
- Experience with Kubernetes-based production infrastructure
- Proficiency in setting and enforcing SLIs/SLOs/error budgets
- Ability to drive failure isolation across API, workers, queues, and dependencies
- Experience defining probe contracts, rollout/rollback standards, and graceful shutdown behavior
- Knowledge of queue and job safety, specifically with BullMQ and Redis
- Experience leading incident response for Sev1/Sev2 incidents
- Strong skills in observability, on-call effectiveness, and postmortem discipline
- Ability to gate risky deploys and enforce reliability guardrails
- Ability to act as a calm, structured incident commander under pressure
- Ability to think in failure modes and blast radius
- Pragmatic approach to stabilizing systems quickly and implementing durable fixes
- High ownership and strong written communication skills