ai-sre-incident-response
CommunityAI SRE Incident Response
AuthorBagelHole
Version1.0.0
Installs0
System Documentation
What problem does it solve?
This Skill establishes robust Site Reliability Engineering (SRE) practices tailored for AI systems, addressing unique incident classes like model outages, quality degradation, safety regressions, and cost overruns.
Core Features & Use Cases
- AI Incident Classification: Defines and categorizes incidents specific to AI services (Availability, Quality, Safety, Cost).
- Severity Framework: Provides a clear severity scale (SEV1-SEV3) for AI-related incidents.
- Golden Signals: Identifies key metrics for monitoring AI service health.
- Response Playbooks: Outlines specific actions for common AI incidents like model outages, quality regressions, and cost spikes.
- Postmortem Requirements: Details essential elements for thorough incident reviews.
Quick Start
Apply SRE principles to AI systems by defining incident classes and response playbooks for AI outages and quality regressions.
Dependency Matrix
Required Modules
None requiredComponents
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: ai-sre-incident-response Download link: https://github.com/BagelHole/DevOps-Security-Agent-Skills/archive/main.zip#ai-sre-incident-response Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 223,000+ vetted skills library on demand.