operational-resilience

Community

Make failures visible and systems recoverable

Author: KAFKA2306
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Operational Resilience provides clear, enforceable infrastructure standards to prevent silent failures and make incidents immediately diagnosable, reducing time-to-resolution during production issues.

Core Features & Use Cases

  • LLM orchestration guidance: recommend Gemini Flash for multi-source processing, enforce JSON structured outputs, and delegate quota/key rotation to a dedicated orchestration service.
  • Workflow and checkpoint control: compress context between stages, extract summaries instead of passing full chat history, and require idempotent checkpoints so side-effecting steps can be safely retried.
  • Failure and service unit practices: prefer crash-fast behavior with system-level restarts, require absolute paths, and mandate service unit fields (User, Group, WorkingDirectory) and pre-flight permission checks.
  • Use Case: Evaluate a YT3 publish pipeline to ensure it produces Final Deliverable Metadata, uses idempotent checkpoints, and yields structured JSON outputs for publish validation.
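
The idempotent-checkpoint and crash-fast practices above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: `CheckpointStore`, `run_once`, and the JSON file layout are hypothetical names chosen for the example, and step results are assumed to be JSON-serializable.

```python
import json
import os
from pathlib import Path
from typing import Any, Callable, Dict


class CheckpointStore:
    """Records completed steps in a JSON file so reruns can skip them.

    Hypothetical helper for illustration only.
    """

    def __init__(self, path: Path) -> None:
        if not path.is_absolute():
            # Crash fast on a relative path instead of guessing a working directory.
            raise ValueError(f"checkpoint path must be absolute: {path}")
        self.path = path
        self.done: Dict[str, Any] = {}
        if path.exists():
            self.done = json.loads(path.read_text(encoding="utf-8"))

    def run_once(self, step: str, action: Callable[[], Any]) -> Any:
        """Run `action` only if `step` has no recorded result."""
        if step in self.done:
            return self.done[step]
        # Exceptions propagate to the caller: no silent failure, and the
        # step stays unrecorded so a restart retries it.
        result = action()
        self.done[step] = result
        self._save()
        return result

    def _save(self) -> None:
        # Write to a temporary file, then rename it over the real one, so a
        # crash during the write does not leave a half-written checkpoint.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.done), encoding="utf-8")
        os.replace(tmp, self.path)
```

Calling `run_once("publish", publish)` a second time with the same store path returns the recorded result without invoking `publish` again. A crash between the side effect and the save would still repeat the step, so a real pipeline would likely pair this with an idempotency key on the receiving service.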

Quick Start

Audit a deployment or LLM orchestration design for YT3 and produce a checklist verifying absolute paths, idempotent checkpoints, compressed context handoffs, JSON structured outputs, and final deliverable metadata.
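
Part of such a checklist could be produced mechanically. The sketch below assumes a systemd-style unit file and a JSON object as the pipeline output; the `audit` function, the check names, and the `final_deliverable_metadata` key are assumptions made for this example, not names defined by the skill or by YT3.

```python
import configparser
import json
from pathlib import PurePosixPath
from typing import Dict

# Fields the skill says every service unit must declare.
REQUIRED_UNIT_FIELDS = ("User", "Group", "WorkingDirectory")


def audit(unit_text: str, llm_output: str) -> Dict[str, bool]:
    """Return a pass/fail checklist for one unit file and one output string."""
    # interpolation=None keeps '%' specifiers literal; strict=False tolerates
    # repeated keys, which unit files permit.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key case (User, not user)
    parser.read_string(unit_text)
    service = parser["Service"] if parser.has_section("Service") else {}

    # Simplification: ExecStart prefixes such as '-' or '@' are not handled.
    exec_tokens = service.get("ExecStart", "").split()

    try:
        parsed = json.loads(llm_output)
    except json.JSONDecodeError:
        parsed = None

    return {
        "unit_fields_present": all(
            service.get(field, "").strip() for field in REQUIRED_UNIT_FIELDS
        ),
        "working_directory_absolute": PurePosixPath(
            service.get("WorkingDirectory", "")
        ).is_absolute(),
        "exec_start_absolute": bool(exec_tokens)
        and exec_tokens[0].startswith("/"),
        "output_is_json_object": isinstance(parsed, dict),
        # Hypothetical key name for the final deliverable metadata.
        "has_deliverable_metadata": isinstance(parsed, dict)
        and "final_deliverable_metadata" in parsed,
    }
```

Each entry in the returned dictionary maps to one line of the checklist. Checks that need human judgment, such as whether context handoffs are compressed or checkpoints are truly idempotent, are left out of this sketch.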

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: operational-resilience
Download link: https://github.com/KAFKA2306/yt3/archive/main.zip#operational-resilience

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.