ClawBench: Evaluating Browser Agents on Live Production Websites with Submission-Interception

Eval & Observability | arXiv:2604.08523

Tags: browser, live, submission, websites, production, interception, clawbench, sites

Benchmarks browser agents on 283 everyday tasks (V1 153 + V2 130) across 163 live production sites, with a Chrome-extension plus CDP layer that blocks only the final write request so agents can run end-to-end on real sites without real-world side effects. Two-stage scoring (interception + LLM judge); leaderboard at https://claw-bench.com.
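
The interception and scoring idea in the summary above can be sketched roughly as follows. This is a minimal illustration, not ClawBench's actual code: the `Request` type, the `should_block` and `score` functions, the glob-style `final_write_pattern`, and the rule that a task passes only when both stages agree are all assumptions made for this example. The real system operates inside a Chrome extension and CDP layer rather than on plain Python objects.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# HTTP methods that can change server-side state (assumption for this sketch).
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a network request seen by the interception layer."""
    method: str
    url: str


def should_block(req: Request, final_write_pattern: str) -> bool:
    """Return True only for the request that would commit the task's final write.

    Reads and intermediate writes pass through, so the agent still exercises
    the real site end-to-end. The glob pattern is a hypothetical per-task setting.
    """
    is_write = req.method.upper() in WRITE_METHODS
    return is_write and fnmatchcase(req.url, final_write_pattern)


def score(intercepted: bool, judge_approved: bool) -> bool:
    """Two-stage scoring, assuming a task passes only if the final write was
    intercepted AND an LLM judge accepted the submitted content."""
    return intercepted and judge_approved


# Hypothetical task: the final write is any request under /checkout/.
pattern = "https://shop.example/checkout/*"
page_load = Request("GET", "https://shop.example/checkout/review")
add_to_cart = Request("POST", "https://shop.example/cart/add")
place_order = Request("POST", "https://shop.example/checkout/submit")

blocked = [r.url for r in (page_load, add_to_cart, place_order)
           if should_block(r, pattern)]
```

In this sketch only `place_order` ends up in `blocked`: the page load is not a write, and the add-to-cart write does not match the final-write pattern, so both reach the live site unchanged.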

Status

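5-10 minutes. Title → abstract → intro → section headers → figures → conclusion only.
Decide: what problem does it solve / what is the core idea / is it relevant to my work?

~1 hour. Study the figures and tables closely. Skip the details of proofs and equations.
Output: one paragraph on "what they did and why it works."

Read as if reproducing the work. Question the assumptions. Only for papers you will cite or rebut directly.
Lens: "If I measured this on my own fleet, what could I show that the authors could not?"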

