About SREBench by Parity
Read about how and why we created SREBench here.

Without anything like SWEBench for Kubernetes tasks, evaluating our agent's effectiveness was challenging. We created this site to announce a new SRE benchmark, SREBench, to the public. We're still in the early stages of putting together the full benchmark, but we're excited to share a sneak peek!

SREBench, inspired by MuSR, is a dataset of common Kubernetes issues and their root causes. We provided it to our agent to measure its performance. We created this competition as a fun way to see how well our agent performs compared to real experts, and to gather feedback on how to make the benchmark better. The questions in the test are a subset of those in our benchmark.

Here's how it works

1. Simulating the Cluster State

We use an LLM to simulate the state of the cluster. Whenever you input a command, an LLM with knowledge of the root cause produces simulated output that is consistent with both the root cause and the output history so far.

2. Evaluating Your Answer

When you submit your root-cause answer, an LLM acting as a judge scores how close your guess was to the actual root cause. We use this score, along with the judge's feedback, to determine whether the answer was correct, partially correct, or incorrect.

We're still early in developing this benchmark, so if you have suggestions or ideas on how to improve it, please reach out at founders@tryparity.com.
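For the curious, here is a minimal sketch of how the two steps above could be wired up. This is an illustration only, not our actual implementation: the model choice, prompts, score scale, and verdict cutoffs below are all assumptions made for the example.

```python
# Illustrative sketch only: the prompts, model name, and thresholds are
# assumptions for this example, not Parity's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"   # hypothetical model choice


def simulate_command(root_cause: str, history: list[dict], command: str) -> str:
    """Step 1: produce simulated command output consistent with the hidden
    root cause and with the outputs already shown to the player."""
    messages = [
        {
            "role": "system",
            "content": (
                "You simulate a Kubernetes cluster for an SRE exercise. "
                f"The hidden root cause is: {root_cause}. "
                "Return realistic command output that stays consistent with "
                "the root cause and with prior outputs in this conversation."
            ),
        },
        *history,  # prior {"role": ..., "content": ...} turns (commands and outputs)
        {"role": "user", "content": command},
    ]
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content


def judge_answer(root_cause: str, answer: str) -> tuple[int, str]:
    """Step 2: LLM-as-judge scores how close the submitted answer is to the
    actual root cause, returning a 0-10 score plus short feedback."""
    prompt = (
        f"Actual root cause: {root_cause}\n"
        f"Submitted answer: {answer}\n"
        "On the first line, output only an integer score from 0 to 10 for how "
        "closely the answer matches the root cause. On the following lines, "
        "give brief feedback."
    )
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    # Assumes the model follows the requested format (score on the first line).
    first_line, _, feedback = response.choices[0].message.content.partition("\n")
    return int(first_line.strip()), feedback.strip()


def verdict(score: int) -> str:
    """Map the judge's score to correct / partially correct / incorrect.
    The cutoffs here are placeholders, not the real grading thresholds."""
    if score >= 8:
        return "correct"
    if score >= 4:
        return "partially correct"
    return "incorrect"
```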