AI-powered SRE and Security agent for OpenShift. Claude diagnoses, triages, and remediates — 24/7.
Full-spectrum cluster operations. Autonomous when you trust it.
Investigate pod failures, crash loops, OOM kills, scheduling problems. Correlate events, logs, and Prometheus metrics for root cause analysis.
Privileged containers, RBAC audit, network policy gaps, image security, SCC analysis, secret hygiene. Full cluster security posture in one scan.
18 scanners running every 60 seconds. Crashloops, pending pods, failed deployments, node pressure, cert expiry, firing alerts, OOM kills, image pull errors, degraded operators, DaemonSet gaps, HPA saturation, SLO burn rate + 5 audit scanners.
Trust level 3: safe fixes without approval. Trust level 4: full autonomous. Rollback snapshots for every action. Rate-limited with cooldown.
6-channel skill selector (keyword, component, historical, semantic, taxonomy, temporal). 7 skills with trigger patterns. Phased plan execution: triage, diagnose, remediate, verify, postmortem. Self-improving learned weights.
Learns from every interaction. Extracts runbooks from confirmed resolutions. Detects patterns. Augments prompts with relevant past incidents.
Every finding, investigation, and action includes a 0-100% confidence score. Reasoning chains with evidence and alternatives considered.
Structured error types with 7 categories. Thread-safe ring buffer tracking. Circuit breaker with auto-recovery. Database resilience decorators.
Time-aware greeting, action summary, investigation count, current cluster state. The 30-second standup replacement for your cluster.
73 PromQL recipes across 16 categories. Semantic auto-layout engine. discover_metrics → verify_query → build. Data-first, never empty charts.
Feeds analytics back into the system prompt. Query reliability scores, dashboard patterns, error hotspots, token efficiency. The agent gets smarter over time.
71% prompt reduction (28KB → 8KB). Selective component schemas. Selective runbooks. Token tracking per turn. Cache hit rate monitoring.
Full audit log of every tool invocation. Chain discovery via bigram analysis. Next-tool hints injected into prompts. Token tracking and usage stats API.
Multi-cluster operations across your fleet. 5 fleet tools for cross-cluster visibility, comparison, and coordinated actions.
6 ArgoCD integration tools. Application sync status, health checks, rollback operations, and drift detection across GitOps-managed clusters.
TF-IDF prediction from real usage, Haiku LLM fallback for cold start, co-occurrence bundles, mid-turn chain expansion. Negative signals suppress wasted tools. Self-improving — LLM trains itself out of a job.
11 eval suites, 98 scenarios. Release gate: 98.1% (75/75 PASS). Safety: 3/3 PASS. Selector: 23 scenarios, recall@5 >92%. Ablation testing, regression tracking, weekly digest.
36 focused modules across 3 packages. No file over 910 lines. Typo auto-correction for 130+ K8s terms. Centralized Pydantic config, zero raw os.environ.
Talk to your cluster like a colleague.
Progressive autonomy. You decide how much to trust.
Monitor Only
Observe & report
Suggest
Dry-run previews
Ask First
Propose + confirm
Auto-Fix Safe
Low-risk auto
Full Auto
All categories
From zero to scanning in 60 seconds.
Pair with OpenShift Pulse for the full experience, or run standalone as a CLI.