ai-sre-incident-response

Community

AI SRE Incident Response

AuthorBagelHole
Version1.0.0
Installs0

System Documentation

What problem does it solve?

This Skill establishes robust Site Reliability Engineering (SRE) practices tailored for AI systems, addressing unique incident classes like model outages, quality degradation, safety regressions, and cost overruns.

Core Features & Use Cases

  • AI Incident Classification: Defines and categorizes incidents specific to AI services (Availability, Quality, Safety, Cost).
  • Severity Framework: Provides a clear severity scale (SEV1-SEV3) for AI-related incidents.
  • Golden Signals: Identifies key metrics for monitoring AI service health.
  • Response Playbooks: Outlines specific actions for common AI incidents like model outages, quality regressions, and cost spikes.
  • Postmortem Requirements: Details essential elements for thorough incident reviews.

Quick Start

Apply SRE principles to AI systems by defining incident classes and response playbooks for AI outages and quality regressions.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: ai-sre-incident-response
Download link: https://github.com/BagelHole/DevOps-Security-Agent-Skills/archive/main.zip#ai-sre-incident-response

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 223,000+ vetted skills library on demand.