[Remote] Incident Management, Reliability & SRE Consultant | $70–$110/hour
Note: This is a remote position open to candidates in the USA. 24-MAG LLC is offering a part-time consulting opportunity for professionals with expertise in incident management, reliability, and site reliability engineering (SRE). The role involves reviewing technical documents and artifacts for accuracy and quality, providing structured feedback, and upholding high standards in incident management and reliability practices.
Responsibilities
- Evaluate AI-generated documents, spreadsheets, and slide decks involving incident management, reliability engineering, SRE practices, post-incident reviews, RCA summaries, runbooks, and service health reporting
- Review incident and reliability materials for accuracy, completeness, rigor, clarity, and practical relevance
- Assess whether timelines, root-cause analysis, contributing factors, impact summaries, and remediation plans are logically supported
- Identify inaccurate assumptions, unclear incident logic, incomplete mitigation plans, weak reliability analysis, or poor linkage between evidence and recommendations
- Review materials involving SLOs, SLIs, SLAs, error budgets, monitoring, alerting, on-call workflows, severity classification, escalation paths, and customer impact summaries
- Assess whether reliability materials are clear, actionable, and suitable for technical or leadership audiences
- Evaluate dashboards, status summaries, incident communication materials, remediation plans, and executive-facing recommendations for rigor and usability
- Provide clear written feedback that improves incident management and reliability artifact quality
- Review spreadsheets for structure, logic, calculations, formatting, usability, and consistency
- Assess slide decks for organization, visual clarity, executive readability, and presentation quality
- Identify factual, aesthetic, formatting, and presentation errors across Microsoft Office and Google Workspace files
- Apply consistent review standards across documents, spreadsheets, and slide decks
Skills
- 5+ years of relevant professional experience in incident management, reliability engineering, site reliability engineering, platform engineering, production engineering, cloud infrastructure, observability, incident response, or related work
- Native or professional fluency in English
- High proficiency in Microsoft Office and Google Workspace
- Strong experience with Google Slides, PowerPoint, Excel, Google Sheets, Word, and Google Docs
- Ability to evaluate documents, spreadsheets, and slide decks with strong attention to detail
- Excellent written communication skills and ability to provide structured feedback
- Ability to work independently in a remote, project-based environment
- Academic backgrounds in computer science, software engineering, information systems, cloud infrastructure, systems engineering, data engineering, cybersecurity, or related fields may be relevant
- Advanced degree from a reputable institution may be valuable
- Professional training in SRE, reliability engineering, incident response, cloud systems, observability, or technical service management may also be relevant depending on project scope
- Master's degree or comparable technical credential
- Experience creating or reviewing post-incident reports, RCA documents, runbooks, SLO/SLA materials, reliability dashboards, alerting plans, service health summaries, or incident communication materials
- Familiarity with tools such as PagerDuty, incident.io, Datadog, New Relic, Grafana, Prometheus, Splunk, CloudWatch, Kubernetes, Terraform, Jira, or comparable reliability and observability tools
- Experience reviewing presentation decks for clarity, polish, and technical communication quality
- Strong ability to evaluate both technical substance and visual/presentation quality
Benefits
- Fully remote with flexible scheduling
- Weekly payments via Stripe or Wise
Company Overview