AI Agent Papers

ClawBench: Evaluating Browser Agents on Live Production Websites with Submission-Interception

Benchmarks browser agents on 283 everyday tasks (V1 153 + V2 130) across 163 live production sites, with a Chrome-extension plus CDP layer that blocks only the final write request …

Eval & Observability 2604.08523 notes →

From Features to Actions: Explainability in Traditional and Agentic AI Systems

Compares attribution-based explanations with trace-based diagnostics across static and agentic settings to study how explainability methods translate to multi-step agent trajectori…

Eval & Observability 2602.06841 notes →

Agentic Uncertainty Reveals Agentic Overconfidence

Investigates whether agents can accurately predict their own success rates in agentic tasks.

Eval & Observability 2602.06948 notes →

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Introduces 20 research tasks from real ML papers covering idea generation, experiments, and refinement for benchmarking science agents.

Eval & Observability 2602.06855 notes →

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

Proposes evaluating agent outputs by decomposing responses into individual claims and checking each against expert knowledge.

Eval & Observability 2602.06486 notes →

Completing Missing Annotation: Multi-Agent Debate for Accurate Relevant Assessment

Explores using multi-agent debate to fill missing labels in information retrieval benchmarks.

Eval & Observability 2602.06526 notes →

TrajAD: Trajectory Anomaly Detection for Trustworthy LLM Agents

Proposes a specialized verifier that detects and locates errors in agent execution trajectories at runtime to enable precise rollback-and-retry.

Eval & Observability 2602.06443 notes →

Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents

Examines whether GPT-4/5 agents can reproduce aggregate human cognitive biases in interactive decision-making scenarios.

Eval & Observability 2602.05597 notes →

Capture the Flags: Family-Based Evaluation of Agentic LLMs

Proposes generating families of equivalent CTF challenges through code transformations to test whether agents truly understand exploits or just memorize patterns.

Eval & Observability 2602.05523 notes →

PieArena: Frontier Language Agents Achieve MBA-Level Negotiation

Introduces a negotiation benchmark where frontier LLM agents are evaluated against MBA students to reveal cross-model differences in deception, accuracy, and trustworthiness.

Eval & Observability 2602.05302 notes →

ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Benchmarks how well conversational agents retain and use personal information over long emotional support conversations.

Eval & Observability 2602.01885 notes →

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Introduces a benchmark that replays published human-subject experiments with LLM agents to test how well they simulate real participants.

Eval & Observability 2602.00685 notes →

Benchmarking Agents in Insurance Underwriting Environments

Proposes an expert-designed multi-turn insurance underwriting benchmark to evaluate agent performance under real-world enterprise conditions with noisy tools and proprietary knowle…

Eval & Observability 2602.00456 notes →

TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

Proposes automated state abstraction from agent execution traces using predicate trees and counterexample refinement for probabilistic runtime verification of agent behavior.

Eval & Observability 2601.22997 notes →

Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Compares three LLM agent frameworks (Aider, OpenHands, SWE-agent) on vulnerability false positive filtering to study how agent design and backbone model affect triage performance.

Eval & Observability 2601.22952 notes →

Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study

Analyzes 8,106 fix-related pull requests from five AI coding agents to catalog the reasons agent-generated contributions are closed without merging.

Eval & Observability 2602.00164 notes →

JAF: Judge Agent Forest

Proposes a judge agent framework that evaluates query-response pairs jointly across a cohort rather than in isolation, using in-context neighborhoods for cross-instance pattern det…

Eval & Observability 2601.22269 notes →

Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis

Evaluates LLM reasoning under ReAct and Plan-and-Execute agentic workflows across 48,000 simulated failure scenarios, producing a taxonomy of 16 common reasoning failures.

Eval & Observability 2601.22208 notes →

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Introduces a benchmark for evaluating LLM agent consistency, uncertainty handling, and capability awareness in multi-turn tool-using scenarios with incomplete or ambiguous user req…

Eval & Observability 2601.22027 notes →

More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests

Examines code quality, maintainability, and reviewer sentiment toward AI-agent-generated pull requests compared to human-authored contributions.

Eval & Observability 2601.21276 notes →

The Quiet Contributions: Insights into AI-Generated Silent Pull Requests

Analyzes silent (no-comment) AI-generated pull requests to examine their impact on code complexity, quality issues, and security vulnerabilities.

Eval & Observability 2601.21102 notes →

Agent Benchmarks Fail Public Sector Requirements

Analyzes over 1,300 agent benchmarks against public-sector requirements including process-based evaluation, realism, and domain-specific metrics.

Eval & Observability 2601.20617 notes →

Interpreting Emergent Extreme Events in Multi-Agent Systems

Applies Shapley values to attribute emergent extreme events in LLM multi-agent systems to specific agent actions across time, agent, and behavior dimensions.

Eval & Observability 2601.20538 notes →

Who Writes the Docs in SE 3.0? Agent vs. Human Documentation Pull Requests

Analyzes AI agent contributions to documentation pull requests and examines how human developers review and intervene in agent-authored documentation changes.

Eval & Observability 2601.20171 notes →

Are We All Using Agents the Same Way? An Empirical Study of Core and Peripheral Developers Use of Coding Agents

Examines how core and peripheral developers differ in their use, review, modification, and verification of coding-agent-generated pull requests.

Eval & Observability 2601.20106 notes →

DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

Introduces an end-to-end benchmark with 700+ real-world tasks across build, monitoring, issue resolving, and test generation for evaluating AI agents in full software DevOps workfl…

Eval & Observability 2601.20882 notes →

Toward Architecture-Aware Evaluation Metrics for LLM Agents

Proposes an architecture-informed evaluation approach that links agent components like planners, memory, and tool routers to observable behaviors and diagnostic metrics.

Eval & Observability 2601.19583 notes →

Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic AI Systems

Investigates whether smaller-scale language models can reduce energy consumption in multi-agent agentic AI systems without compromising task quality.

Eval & Observability 2601.19311 notes →

Understanding Dominant Themes in Reviewing Agentic AI-authored Code

Analyzes 19,450 inline review comments on agent-authored pull requests and derives a taxonomy of 12 review themes to understand how reviewers respond to AI-generated code.

Eval & Observability 2601.19287 notes →

Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests

Analyzes 40,214 developer and agentic pull requests to compare merge outcomes and identify how submitter attributes and review features differ between human and AI agent contributi…

Eval & Observability 2601.18749 notes →

Automated Structural Testing of LLM-Based Agents: Methods, Framework, and Case Studies

Presents structural testing methods for LLM-based agents using OpenTelemetry traces, mocking for reproducible behavior, and automated assertions for component-level verification.

Eval & Observability 2601.18827 notes →

When AI Agents Touch CI/CD Configurations: Frequency and Success

Analyzes how five AI coding agents interact with CI/CD configurations across 8,031 pull requests, examining modification frequency, merge rates, and build success.

Eval & Observability 2601.17413 notes →

Fingerprinting AI Coding Agents on GitHub

Identifies behavioral signatures of five AI coding agents from 33,580 pull requests using commit, PR structure, and code features for agent attribution.

Eval & Observability 2601.17406 notes →

Interpreting Agentic Systems: Beyond Model Explanations to System-Level Accountability

Assesses existing interpretability methods for agentic systems and identifies gaps in explaining temporal dynamics, compounding decisions, and context-dependent behaviors.

Eval & Observability 2601.17168 notes →

AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality

Investigates maintainability- and security-related build code smells in AI-agent-generated pull requests, analyzing 364 identified quality issues.

Eval & Observability 2601.16839 notes →

Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source

Examines long-term survival of AI-agent-generated code through survival analysis of 200,000+ code units across 201 open-source projects.

Eval & Observability 2601.16809 notes →

LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents

Develops an oracle counterfactual framework for multi-turn agentic tasks that measures the criticality of individual capabilities like planning and state tracking.

Eval & Observability 2601.16649 notes →

When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems

Presents a 12-category error taxonomy and diagnostic framework for evaluating tool-use reliability across open-weight and proprietary LLMs in multi-agent systems on edge hardware.

Eval & Observability 2601.16280 notes →

Agentic Confidence Calibration

Introduces the problem of agentic confidence calibration and proposes Holistic Trajectory Calibration, extracting process-level features across an agent's entire trajectory to diag…

Eval & Observability 2601.15778 notes →

Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats

Examines methodological challenges in evaluating AI agents across sensitive information leakage, fraud, and cybersecurity threats through a multi-national collaborative benchmarkin…

Eval & Observability 2601.15679 notes →

MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Dataset for RAG Evaluation

Introduces a multi-agent framework that generates verified, domain-specific, multimodal, multi-hop question-answer datasets for benchmarking retrieval-augmented generation systems.

Eval & Observability 2601.15487 notes →

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

Analyzes 1,187 bug reports from LLM agent software across seven frameworks to categorize bug types, root causes, and effects, and tests automated bug labeling with a ReAct agent.

Eval & Observability 2601.15232 notes →

The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution

Proposes a hierarchical framework for general agentic attribution that identifies internal factors driving agent actions through temporal likelihood dynamics and perturbation-based…

Eval & Observability 2601.15075 notes →

Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering

Analyzes token consumption patterns across software development lifecycle stages in a multi-agent system to identify where tokens are consumed and which stages drive cost.

Eval & Observability 2601.14470 notes →

APEX-Agents

Introduces a benchmark of 480 long-horizon, cross-application productivity tasks created by investment banking analysts, consultants, and lawyers for evaluating AI agent capabiliti…

Eval & Observability 2601.14242 notes →

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Introduces a benchmark of 600+ collaborative coding tasks to evaluate whether coding agents can coordinate as effective teammates under various coordination structures.

Eval & Observability 2601.13295 notes →

Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?

Investigates how RAG systems can game nugget-based LLM judge evaluations through metric overfitting, demonstrating near-perfect scores when evaluation elements are leaked or predic…

Eval & Observability 2601.13227 notes →

Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Introduces the Determinism-Faithfulness Assurance Harness for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using LLM agents across 74 configuratio…

Eval & Observability 2601.15322 notes →

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

Presents a process-aware and auditable multi-agent evaluation framework that plans, executes, and aggregates multi-step evaluations across heterogeneous agentic workflows under hum…

Eval & Observability 2601.11903 notes →

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Introduces a curated benchmark of 89 hard tasks in computer terminal environments with unique environments, human-written solutions, and comprehensive tests for evaluating frontier…

Eval & Observability 2601.11868 notes →

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Introduces a benchmark and evaluation framework for agentic task-oriented dialogue systems covering multi-goal coordination, dependency management, memory, adaptability, and proact…

Eval & Observability 2601.11854 notes →

What Do LLM Agents Know About Their World? Task2Quiz

Decouples task execution from environment understanding with a deterministic QA paradigm to study whether task success is actually a good proxy for how well agents understand their…

Eval & Observability 2601.09503 notes →

The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments

Evaluates frontier models on 150 workplace tasks to identify an empirical hierarchy of agentic capabilities spanning tool use, planning, adaptability, groundedness, and common-sens…

Eval & Observability 2601.09032 notes →

ViDoRe V3: A Comprehensive Evaluation of RAG in Complex Real-World Scenarios

Introduces a multimodal RAG benchmark with 26K pages and 3,099 queries in 6 languages to evaluate retrieval across non-textual elements and open-ended queries.

Eval & Observability 2601.08620 notes →

M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

Evaluates LLM agent social behaviors in mixed-motive games using process-aware analysis of both reasoning and communication rather than outcome-only metrics.

Eval & Observability 2601.08462 notes →

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Benchmarks whether agents can proactively use long-term memory to execute tool-based actions, rather than just passively retrieving facts on demand.

Eval & Observability 2601.19935 notes →

Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms

Proposes a formal framework for actively evaluating general-purpose agents across multiple tasks, selecting which tasks and agents to sample next to minimize ranking error over tim…

Eval & Observability 2601.07651 notes →

VirtualEnv: A Platform for Embodied AI Research

Introduces an Unreal Engine 5 simulation platform for benchmarking LLM-driven agents on embodied tasks including navigation, object manipulation, and multi-agent coordination in pr…

Eval & Observability 2601.07553 notes →

FROAV: A Framework for RAG Observation and Agent Verification

Presents an open-source platform combining visual workflow orchestration with LLM-as-a-Judge evaluation for prototyping and validating RAG-based agent pipelines without infrastruct…

Eval & Observability 2601.07504 notes →

Lost in the Noise: How Reasoning Models Fail with Contextual Distractors

Benchmarks model robustness across 11 RAG, reasoning, alignment, and tool-use tasks against diverse contextual noise types including random documents, irrelevant histories, and har…

Eval & Observability 2601.07226 notes →

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Introduces a project-oriented memory benchmark with 2,000+ cross-session dialogues across eleven scenarios to evaluate how well agents track evolving goals and dynamic context depe…

Eval & Observability 2601.06966 notes →

IDRBench: Interactive Deep Research Benchmark

Introduces the first benchmark for interactive deep research combining a modular multi-agent framework with on-demand user interaction, a scalable user simulator, and interaction-a…

Eval & Observability 2601.06676 notes →

ToolGym: an Open-world Tool-using Environment for Scalable Agent Testing and Data Curation

Introduces an open-world tool-using environment with 5,571 tools across 204 apps, a task engine for multi-tool workflows with wild constraints, and a state controller that injects …

Eval & Observability 2601.06328 notes →

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Introduces a tower defense environment for evaluating LLM agent planning and decision-making with low computational demands, multimodal observation, and hallucination assessment su…

Eval & Observability 2601.05899 notes →

MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Introduces a user-authored benchmark for memory-aware LLM agents in Minecraft with parametric task templates, machine-checkable validators, and bounded-knowledge evaluation under a…

Eval & Observability 2601.05215 notes →

Internal Representations as Indicators of Hallucinations in Agent Tool Selection

Proposes a framework for detecting tool-calling hallucinations in LLM agents by analyzing internal representations during a single forward pass, targeting incorrect tool selection,…

Eval & Observability 2601.05214 notes →

Agent-as-a-Judge

Surveys the evolution from LLM-as-a-Judge to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory f…

Eval & Observability 2601.05111 notes →

Arabic Prompts with English Tools: A Benchmark

Introduces the first benchmark for evaluating tool-calling and agentic capabilities of LLMs in Arabic, measuring functional accuracy and robustness in Arabic agentic workflows.

Eval & Observability 2601.05101 notes →

Effects of Personality Steering on Cooperative Behavior in LLM Agents

Examines how Big Five personality steering affects cooperative behavior in LLM agents using repeated Prisoner's Dilemma games across multiple model generations.

Eval & Observability 2601.05302 notes →

Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests

Analyzes message-code inconsistency in pull requests authored by AI coding agents across five agent systems to study trustworthiness of agent-generated PR descriptions.

Eval & Observability 2601.04886 notes →

GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Proposes a multi-agent framework for autonomous exploratory GUI testing that decouples navigation from verification via planning-execution and hierarchical reflection modules.

Eval & Observability 2601.04500 notes →

Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems

Introduces the concept of agent drift and a composite metric framework for quantifying semantic, coordination, and behavioral degradation in multi-agent LLM systems over extended i…

Eval & Observability 2601.04170 notes →

M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Domains and Modalities?

Introduces a unified benchmark for evaluating Multi-Agent Debate methods across multiple domains, modalities, and efficiency metrics including token consumption and inference time.

Eval & Observability 2601.02854 notes →

Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Documents six recurring failure modes across four end-to-end attempts at autonomous ML research using a pipeline of LLM agents mapped to stages of the scientific workflow.

Eval & Observability 2601.03315 notes →

LongDA: Benchmarking LLM Agents for Long-Document Data Analysis

Introduces a data analysis benchmark for evaluating LLM agents under documentation-intensive analytical workflows requiring long document navigation and multi-step computation.

Eval & Observability 2601.02598 notes →

The Rise of Agentic Testing: Multi-Agent Systems for Robust Software Quality Assurance

Proposes a closed-loop multi-agent testing framework with generation, execution analysis, and review optimization agents for autonomous software test refinement.

Eval & Observability 2601.02454 notes →

Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Proposes a causal framework using structural causal models and counterfactual interventions to audit whether reasoning traces in LLM agents are faithful generative drivers or post-…

Eval & Observability 2601.02314 notes →

ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

Introduces a benchmark for evaluating agent reliability across consistency, robustness to perturbations, and fault tolerance under chaos-engineering-style tool failure injection.

Eval & Observability 2601.06112 notes →

MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

Introduces an evaluation suite that standardizes MAS configuration and execution, exports framework-agnostic execution traces, and enables systematic reliability assessment across …

Eval & Observability 2601.00481 notes →

Beyond Perfect APIs: WildAGTEval

Introduces a benchmark for evaluating LLM agent function-calling under realistic API complexity including noisy outputs, detailed specifications, and runtime challenges.

Eval & Observability 2601.00268 notes →

Survey on Evaluation of LLM-based Agents

LLM-based agents represent a paradigm shift in AI, enabling autonomous systems to plan, reason, and use tools while interacting with dynamic environments. This paper provides the f…

Eval & Observability 2503.16416 notes → 💬 Tier 1. A full map of the agent evaluation landscape. Trajectory evaluation, MCP Atlas, Too…

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attem…

Eval & Observability 2603.29231 notes → 💬 Tier 1. A long-horizon framework that treats reliability as a science. Kyle's control-t…

Demystifying Evals for AI Agents (Anthropic Blog)

Anthropic blog. Provides intuition for why agent evaluation is hard. Reading this post for context before the Tier 2 evaluation papers makes them much easier to follow.

Eval & Observability blog/anthropic/demys notes → 💬 Tier 0: not a paper. For building intuition about agent evaluation. Tier 2 evaluation papers (2503.…

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fai…

Eval & Observability 2606.01365 notes → 💬 Tier 2. Early diagnosis of wasted computation in multi-agent systems: failure-aware observability. The run is recover…