# RCA-100 v1.0

A blind benchmark of **103 chaos-drill–injected incidents** across a Kubernetes + OpenTelemetry demo store, with per-task observability dumps suitable for Root-Cause-Analysis (RCA) agent evaluation.

Download: `https://aiops-benchmark.oss-cn-hongkong.aliyuncs.com/rca/rca100/v1.0/`

| | |
|---|---|
| **Tasks** | 103 (opaque IDs `t001`..`t103`) |
| **Cluster** | OpenTelemetry demo store on Alibaba Cloud ACK |
| **Modalities** | metrics · logs · traces · events · alerts · topology |
| **Total size** | ~3.4 GB |

---

## 1. Layout

```
rca100/v1.0/
├── README.md
├── LICENSE                       (CC BY-NC-SA 4.0)
├── manifest.txt                  (t001..t103, one per line)
├── summary.json                  (per-task row counts)
└── cases/
    └── t001..t103/               × 103
        ├── task.json             (agent-facing contract: alert + window, NO answer)
        ├── metrics.parquet
        ├── logs.parquet
        ├── traces.parquet
        ├── events.parquet
        ├── alerts.parquet        (entry-alert lifecycle only)
        └── topology.json         (per-task UModel topology snapshot)
```

The parquet files carry no `task_id` column. A file's task is identified by its directory path, and by the `task_id` field of the sibling `task.json`.

The original fault label, root-cause entity, and causal chain are NOT included in the public release. They are distributed separately as an `answer_key/` package under restricted access. See §5 for evaluation flow.

---

## 2. Schemas

### 2.1 task.json (agent-facing contract)

```jsonc
{
  "task_id": "t034",
  "task_version": "1.0",
  "alert_event_id": "195ff92b621c1d2937d5fcd36e7747fa",
  "alert_title": "inventory接口流量下跌告警",  // English: "inventory API traffic-drop alert"
  "alert_trigger_time": "2026-04-20T11:21:54+08:00",
  "alert_window": {"start": "<ISO8601>", "end": "<ISO8601>"},
  "alert_entity": {
    "entity_id": "...",
    "entity_name": "inventory::/api/v1/inventory/{productId}",
    "entity_type": "apm.operation",
    "entity_domain": "apm"
  },
  "prompt_text": "<alert payload + user question>",
  "workspace": "rca-benchmark",
  "region_id": "cn-hongkong",
  "available_modalities": ["metrics","logs","traces","events","alerts","topology"],
  "scoring_note": "Output contract and fault taxonomy will be published in a follow-up release."
}
```
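The window bounds are ISO 8601 strings, while `metrics.time` is in unix microseconds (§2.2), so most agents will need a conversion before filtering. A minimal sketch, assuming the `alert_window` strings carry a UTC offset in the same style as `alert_trigger_time`; `iso_to_epoch_us` and `window_to_epoch_us` are hypothetical helper names, not part of the release:

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_us(ts: str) -> int:
    """Convert an offset-aware ISO 8601 timestamp to unix microseconds."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # Assumption: timestamps always carry an offset; refuse to guess otherwise.
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    # Integer division of timedeltas avoids float rounding.
    return (dt - EPOCH) // timedelta(microseconds=1)


def window_to_epoch_us(task: dict) -> Tuple[int, int]:
    """Return (start, end) of task['alert_window'] in unix microseconds."""
    window = task["alert_window"]
    return iso_to_epoch_us(window["start"]), iso_to_epoch_us(window["end"])
```

With the bounds in hand, a filter such as `metrics[(metrics.time >= start) & (metrics.time <= end)]` selects the rows inside the alert window.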

### 2.2 metrics.parquet (long-format, unified)

| column | type | notes |
|---|---|---|
| `time` | int64 | unix microseconds |
| `domain` | string | `apm` / `k8s` |
| `entity_set` | string | `apm.service.legacy` / `apm.operation` / `apm.instance` / `apm.metric.{jvm,thread,exception}` / `k8s.{node,deployment,namespace,cluster,pod}` |
| `entity_id` / `entity_name` | string | resolves to topology entity |
| `metric` | string | e.g. `node_cpu_usage_rate`, `request_count`, `error_rate` |
| `value` | float | |
| `metric_set_id` / `service` | string | source set id + service tag |

Every row's `entity_id` references an entity present in the corresponding task's `topology.json`. ~8.86M total rows across all tasks.
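Because the table is long-format, a per-entity time series needs a pivot. A minimal sketch, assuming pandas and the column names listed above; `entity_timeseries` is a hypothetical helper, and averaging duplicate samples is a choice made here rather than anything the dataset prescribes:

```python
import pandas as pd


def entity_timeseries(metrics: pd.DataFrame, entity_id: str) -> pd.DataFrame:
    """Pivot one entity's long-format rows into a wide, time-indexed frame."""
    mask = metrics["entity_id"] == entity_id
    rows = metrics.loc[mask, ["time", "metric", "value"]].copy()
    # `time` is unix microseconds per the schema table.
    rows["time"] = pd.to_datetime(rows["time"], unit="us", utc=True)
    # Assumption: duplicate (time, metric) samples are averaged.
    wide = rows.pivot_table(index="time", columns="metric",
                            values="value", aggfunc="mean")
    return wide.sort_index()
```

Each column of the result is one metric, so `entity_timeseries(metrics, some_id)["error_rate"]` yields a plain series, provided that metric exists for the entity.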

### 2.3 logs.parquet

OTel/SLS app logs. Key columns: `time`, `content`, `_pod_name_`, `_namespace_`, `_node_ip_`, `cluster_id`. ~53.6M total rows.

### 2.4 traces.parquet

OTel span schema. Key columns: `traceId`, `spanId`, `parentSpanId`, `kind`, `spanName`, `startTime` (ns), `endTime`, `duration` (ns), `serviceName`, `statusCode`, `statusMessage`, `resources`, `attributes`, `events`. ~51.1M total rows.
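The `spanId` / `parentSpanId` links are enough to rebuild a call tree for one trace. A minimal sketch that works on plain dicts (for example `traces[traces.traceId == tid].to_dict("records")`); it assumes a root span has an empty or missing `parentSpanId`, or a parent that was not captured in the dump. Both helper names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_children(spans: Iterable[dict]) -> Dict[str, List[dict]]:
    """Map each parentSpanId to its child spans, ordered by startTime."""
    children = defaultdict(list)
    for span in spans:
        parent = span.get("parentSpanId")
        if parent:  # skip spans with an empty or missing parent
            children[parent].append(span)
    for kids in children.values():
        kids.sort(key=lambda s: s["startTime"])
    return dict(children)


def root_spans(spans: Iterable[dict]) -> List[dict]:
    """Spans whose parent is absent from the given collection."""
    spans = list(spans)
    known = {s["spanId"] for s in spans}
    return [s for s in spans if s.get("parentSpanId") not in known]
```

Pass the spans of a single `traceId` at a time: span IDs are only expected to be unique within a trace.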

### 2.5 events.parquet (k8s events)

Scoped to OTel demo workload + node-level events. ~22.8K total rows. The k8s Event payload is JSON-encoded inside `eventId`; parse to access `reason`/`message`/`involvedObject` etc.
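A minimal sketch of that parsing step. It assumes `reason`, `message`, and `involvedObject` are top-level keys of the decoded object and that `involvedObject` carries the usual Kubernetes `kind` / `name` fields; `parse_event_payload` and `event_summary` are hypothetical helpers:

```python
import json
from typing import Optional


def parse_event_payload(raw: str) -> Optional[dict]:
    """Decode the JSON stored in `eventId`; None if it is not a JSON object."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        return None
    return payload if isinstance(payload, dict) else None


def event_summary(raw: str) -> dict:
    """Extract a few commonly used fields, tolerating missing keys."""
    payload = parse_event_payload(raw) or {}
    involved = payload.get("involvedObject")
    if not isinstance(involved, dict):
        involved = {}
    return {
        "reason": payload.get("reason"),
        "message": payload.get("message"),
        # Assumption: standard k8s ObjectReference field names.
        "kind": involved.get("kind"),
        "name": involved.get("name"),
    }
```

Applied column-wise, `events["eventId"].map(event_summary)` gives one dict per event, which `pd.json_normalize` can expand into columns.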

### 2.6 alerts.parquet (CMS alert center, CloudEvents 1.0)

Entry-alert lifecycle filtered by `data.transId` ∈ task's `trans_id`. ~3134 total rows across 103 tasks (~30 lifecycle events/task).

### 2.7 topology.json (per-task entity graph)

Per-task structured topology derived from Aliyun UModel. Snapshot at alert time, scoped to OTel demo workload (`cms-demo` namespace + 19 demo apm.services).

```jsonc
{
  "entities": [
    {"id": "<hash>", "type": "k8s.pod|k8s.node|apm.service|apm.operation|...",
     "name": "inventory-87855b9b9-md2jg",
     "first_observed": <epoch_s>, "last_observed": <epoch_s>,
     "props": {"namespace": "cms-demo", "cluster_id": "...", ...}}
  ],
  "edges": [
    {"src": "<id>", "src_type": "k8s.node", "dst": "<id>", "dst_type": "k8s.pod",
     "relation": "contains|hosts|calls|same_as", ...}
  ],
  "stats": {"entities_total": <int>, "edges_total": <int>, ...}
}
```

**Entity types**: `k8s.{cluster,namespace,node,deployment,service,pod,daemonset,configmap,ingress,job,cronjob,storageclass,persistentvolume}`, `apm.{service,instance,operation}`, `apm.external.{http_client,rpc_client,database,nosql,message}`.

**Relations**: `contains` (parent-child ownership), `hosts` (k8s → apm mapping), `calls` (service-to-service / service-to-middleware).

Per-task topology distribution: min 208 / median 230 / p95 267 / max 1903 entities; min 212 / median 269 / p95 341 / max 1909 edges.
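The edge list can be walked directly to find everything one entity depends on. A minimal sketch, assuming that a `calls` edge points from caller (`src`) to callee (`dst`); the README does not state the direction, so verify it on a known pair first. `adjacency` and `reachable` are hypothetical helpers:

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def adjacency(topology: dict, relation: str) -> Dict[str, List[str]]:
    """Build src -> [dst, ...] for edges with the given relation."""
    adj = defaultdict(list)
    for edge in topology["edges"]:
        if edge["relation"] == relation:
            adj[edge["src"]].append(edge["dst"])
    return dict(adj)


def reachable(adj: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search: entity ids reachable from `start`.

    `start` itself is included only if it lies on a cycle.
    """
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For example, `reachable(adjacency(topology, "calls"), service_id)` lists the transitive downstream set of one `apm.service` under the direction assumption above.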

### 2.8 Data integrity

All cross-modality entity references resolve to entities present in the corresponding task's `topology.json`:

| Reference surface | Coverage |
|---|---|
| `traces.serviceName` ↔ `apm.service` | 100% (1542 / 1542) |
| `events.pod_name` ↔ `k8s.pod` | 100% (958 / 958) |
| `logs._pod_name_` ↔ `k8s.pod` | 100% (3120 / 3120) |
| `metrics.entity_id` ↔ topology entity | 100% (7454 / 7454 unique) |
| `topology.edges` `src` / `dst` ↔ entities | 100% (no dangling edges) |
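These checks are easy to re-run locally after download. A minimal sketch for two of the rows above, assuming the `entities[].id` and `edges[].src` / `dst` fields shown in §2.7; `dangling_edges` and `unresolved_ids` are hypothetical helpers, not the maintainers' validation code:

```python
from typing import Iterable, List, Set


def dangling_edges(topology: dict) -> List[dict]:
    """Edges whose src or dst is missing from the entity list."""
    ids = {entity["id"] for entity in topology["entities"]}
    return [edge for edge in topology["edges"]
            if edge["src"] not in ids or edge["dst"] not in ids]


def unresolved_ids(topology: dict, entity_ids: Iterable[str]) -> Set[str]:
    """Entity ids (e.g. from metrics.entity_id) absent from the topology."""
    ids = {entity["id"] for entity in topology["entities"]}
    return set(entity_ids) - ids
```

Both should return empty results for every task, e.g. `unresolved_ids(topology, metrics["entity_id"].unique())`.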

---

## 3. Download

Public read-only at `https://aiops-benchmark.oss-cn-hongkong.aliyuncs.com/rca/rca100/v1.0/`. No credentials needed.

```bash
# Exported so the bash -c workers below can read it
export BASE=https://aiops-benchmark.oss-cn-hongkong.aliyuncs.com/rca/rca100/v1.0

# Top-level files, saved to the current directory (-f: fail on HTTP errors)
for f in README.md LICENSE manifest.txt summary.json; do
  curl -sfLO "$BASE/$f"
done

# Batch download all 103 tasks (~3.4 GB); -a is a GNU xargs option
xargs -I {} -P 16 -a manifest.txt bash -c '
  CID=$0
  mkdir -p "rca100/cases/$CID"
  for f in task.json metrics.parquet logs.parquet traces.parquet events.parquet alerts.parquet topology.json; do
    curl -sfL -o "rca100/cases/$CID/$f" "$BASE/cases/$CID/$f"
  done
' {}

# Verify
ls rca100/cases/ | wc -l                  # → 103
find rca100/cases -type f | wc -l         # → 721
```
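The file counts above do not say which task is incomplete when a download fails partway. A minimal sketch of a per-task check, assuming the layout from §1 and a `manifest.txt` with one task ID per line; `missing_files` and `EXPECTED` are hypothetical names:

```python
from pathlib import Path
from typing import Dict, List

EXPECTED = [
    "task.json", "metrics.parquet", "logs.parquet", "traces.parquet",
    "events.parquet", "alerts.parquet", "topology.json",
]


def missing_files(root: Path, manifest: Path) -> Dict[str, List[str]]:
    """Map task id -> expected files that are absent or empty."""
    missing: Dict[str, List[str]] = {}
    for line in manifest.read_text().splitlines():
        task_id = line.strip()
        if not task_id:
            continue  # tolerate blank lines
        case_dir = root / "cases" / task_id
        absent = [
            name for name in EXPECTED
            if not (case_dir / name).is_file()
            or (case_dir / name).stat().st_size == 0
        ]
        if absent:
            missing[task_id] = absent
    return missing
```

With the commands above, `missing_files(Path("rca100"), Path("manifest.txt"))` should return an empty dict; any entry names the files to re-fetch.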

---

## 4. Quick load (Python)

```python
import json
from pathlib import Path

import pandas as pd

ROOT = Path('rca100')
task_id = 't034'  # any from manifest
cdir = ROOT / 'cases' / task_id

task = json.loads((cdir / 'task.json').read_text())
print('Alert:        ', task['alert_title'])
print('Alert window: ', task['alert_window']['start'], '~', task['alert_window']['end'])
print('Alert entity: ', task['alert_entity']['entity_name'])

metrics = pd.read_parquet(cdir / 'metrics.parquet')
print(metrics.groupby('entity_set').size())
```

---

## 5. Evaluation

Submit a JSON prediction per task to be scored against the held-out answer key. The output contract (`prediction_schema.json`), fault taxonomy (`taxonomy.json`), and reference scorer will be published in a follow-up release. Contact the maintainers for early access to the answer key for trusted evaluation.

---

## 6. Cite

```bibtex
@misc{rca100_2026,
  title  = {RCA-100: A Chain-Reasoning Benchmark for Root Cause Analysis on Cloud-Native Microservices},
  author = {Wen, Xidao and Liu, Haibin and Liu, Guiyang and Zhang, Cheng and Situ, Fang and Zhou, Qi},
  year   = {2026},
  note   = {103 chaos-drill incidents, OpenTelemetry demo store, CC BY-NC-SA 4.0}
}
```

---

## 7. License

Data: **CC BY-NC-SA 4.0** (see `LICENSE`).

---

## 8. Contact

Maintainer: Xidao Wen — `wenxidao.wxd@alibaba-inc.com`

Contact the maintainer for evaluator integration questions, blind-evaluation coordination, or dataset extension proposals, or to report data issues.