AI Agent Papers

Internal Safety Collapse in Frontier Large Language Models

Reveals that AI agents produce harmful content (toxic text, exploits, dangerous data) as a side effect of completing normal professional tasks — no adversarial prompting needed. At…

AI Agent Security 2603.23509 notes →

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Trains an LLM to generate RAG poison that survives real-world content processing and query variation for stress-testing RAG defenses.

AI Agent Security 2602.06616 notes →

Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Analyzes 98K agent skills from community registries to study the prevalence and nature of malicious third-party agent plugins.

AI Agent Security 2602.06547 notes →

Subgraph Reconstruction Attacks on Graph RAG Deployments with Practical Defenses

Investigates whether attackers can reconstruct knowledge graphs from Graph RAG outputs through multi-turn probing.

AI Agent Security 2602.06495 notes →

Zero-Trust Runtime Verification for Agentic Payment Protocols

Proposes consume-once mandate semantics for AI agent payment protocols to prevent replay and redirect attacks in autonomous transactions.

AI Agent Security 2602.06345 notes →

Identifying Adversary Tactics and Techniques in Malware Binaries with an LLM Agent

Explores using an LLM agent to identify attack techniques in stripped malware binaries through incremental context retrieval.

AI Agent Security 2602.06325 notes →

Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy

Maps attack paths in agent-to-agent communication protocols for automotive LLM assistants, from driver distraction to unauthorized vehicle control.

AI Agent Security 2602.05877 notes →

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Explores using reinforcement learning to auto-generate prompt injection attacks that transfer across multiple frontier LLMs.

AI Agent Security 2602.05746 notes →

A Dual-Loop Agent Framework for Automated Vulnerability Reproduction

Proposes an LLM agent with dual feedback loops for strategy and code to automate vulnerability reproduction from CVE descriptions.

AI Agent Security 2602.05721 notes →

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Organizes agentic security risks into four layers (Core, Connection, Cognition, Compliance) to address trust and governance issues beyond prompt injection.

AI Agent Security 2602.01942 notes →

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Proposes a co-evolving RL game between an attacker and defender agent to stress-test safety alignment against novel attack patterns.

AI Agent Security 2602.01539 notes →

TxRay: Agentic Postmortem of Live Blockchain Attacks

Introduces an LLM agentic system that reconstructs blockchain exploit lifecycles from limited evidence and generates runnable proof-of-concept reproductions.

AI Agent Security 2602.01317 notes →

To Defend Against Cyber Attacks, We Must Teach AI Agents to Hack

Argues that AI-agent-driven cyber attacks are inevitable and proposes building frontier offensive AI capabilities responsibly as essential defensive infrastructure.

AI Agent Security 2602.02595 notes →

SMCP: Secure Model Context Protocol

Proposes protocol-level security improvements for the Model Context Protocol including unified identity management, mutual authentication, and fine-grained policy enforcement.

AI Agent Security 2602.01129 notes →

Persuasion Propagation in LLM Agents

Investigates how user persuasion during conversation can carry over and change how autonomous AI agents perform later tasks.

AI Agent Security 2602.00851 notes →

When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

Explores how collective false memories form in LLM-based multi-agent systems and proposes defenses including cognitive anchoring and alignment-based approaches.

AI Agent Security 2602.00428 notes →

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Proposes a black-box attack method that generates transferable adversarial tokens to manipulate LLM-based retrieval systems without needing access to the target's queries or model.

AI Agent Security 2602.00364 notes →

Introduces CacheAttack, a black-box framework that exploits the trade-off between locality and collision resistance in semantic caching to hijack LLM responses and manipulate agent…

AI Agent Security 2601.23088 notes →

TessPay: Verify-then-Pay Infrastructure for Trusted Agentic Commerce

Proposes a verify-then-pay infrastructure for agent transactions that locks funds in escrow, requires cryptographic proof of task execution, and releases payment only after verific…

AI Agent Security 2602.00213 notes →

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

Red-teams Google's Agent Payments Protocol via prompt injection attacks that manipulate product ranking and extract sensitive user data in agent-led purchase flows.

AI Agent Security 2601.22569 notes →

StepShield: When, Not Whether to Intervene on Rogue Agents

Introduces a benchmark for evaluating when, rather than just whether, agent violations are detected during execution, with temporal metrics for early intervention and tokens saved.

AI Agent Security 2601.22136 notes →

Delegation Without Living Governance

Argues that static compliance-based governance is insufficient for agentic AI at machine speed and proposes runtime governance to preserve human relevance in agent-driven decision-…

AI Agent Security 2601.21226 notes →

DRAINCODE: Stealthy Energy Consumption Attacks on Retrieval-Augmented Code Generation via Context Poisoning

Introduces an adversarial attack that poisons retrieval contexts in RAG-based code generation to force longer outputs, increasing GPU latency and energy consumption.

AI Agent Security 2601.20615 notes →

Securing AI Agents in Cyber-Physical Systems: A Survey of Environmental Interactions, Deepfake Threats, and Defenses

Surveys security threats targeting AI agents in cyber-physical systems, covering deepfake attacks, MCP-mediated vulnerabilities, and defense-in-depth architectures.

AI Agent Security 2601.20184 notes →

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

Explores AutoGen-based multi-agent coordination with specialized agents for static, dynamic, and network-level ransomware family classification using confidence-aware decisions.

AI Agent Security 2601.20346 notes →

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Introduces a multi-agent auto-healing defense framework with semantic similarity retrieval, pattern matching, and an evolving knowledge base for defending LLMs against resource exha…

AI Agent Security 2601.19174 notes →

AgenticSCR: An Autonomous Agentic Secure Code Review for Immature Vulnerabilities Detection

Explores agentic AI for pre-commit secure code review that uses autonomous decision-making, tool invocation, and security-focused semantic memories to detect immature vulnerabiliti…

AI Agent Security 2601.19138 notes →

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Introduces a three-dimensional taxonomy for agentic risks and a diagnostic guardrail framework that monitors agent trajectories with fine-grained root cause analysis beyond binary …

AI Agent Security 2601.18491 notes →

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

Examines how benign personal memories in personalized agents can bias intent inference and cause models to legitimize harmful queries through a previously unexplored safety vector.

AI Agent Security 2601.17887 notes →

Multi-Agent Collaborative Intrusion Detection for LAE-IoT

Proposes a multi-agent collaborative framework with specialized LLM-enhanced agents for intelligent data processing and adaptive intrusion classification in aerial IoT networks.

AI Agent Security 2601.17817 notes →

Faramesh: A Protocol-Agnostic Execution Control Plane for Autonomous Agent Systems

Introduces a protocol-agnostic execution control plane for autonomous agents that enforces authorization boundaries with canonical action representation and deterministic policy ev…

AI Agent Security 2601.17744 notes →

A Systemic Evaluation of Multimodal RAG Privacy

Examines privacy risks in multimodal RAG pipelines through inclusion inference and metadata leakage attacks during standard model prompting.

AI Agent Security 2601.17644 notes →

Breaking the Protocol: Security Analysis of the Model Context Protocol Specification

Presents the first security analysis of the Model Context Protocol specification, identifying three protocol-level vulnerabilities and proposing backward-compatible security extens…

AI Agent Security 2601.17549 notes →

Prompt Injection Attacks on Agentic Coding Assistants: A Systematic Analysis

Surveys 78 studies to systematize prompt injection attacks on agentic coding assistants with a three-dimensional taxonomy across delivery vectors, modalities, and propagation.

AI Agent Security 2601.17548 notes →

Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

Introduces RAGCrawler, a knowledge graph-guided attack that adaptively steals RAG corpus content through targeted queries to maximize coverage under a query budget.

AI Agent Security 2601.15678 notes →

Securing LLM-as-a-Service for Small Businesses: An Industry Case Study of a Distributed Chatbot Deployment Platform

Presents a multi-tenant chatbot deployment platform with container-based isolation and platform-level defenses against prompt injection attacks in RAG-based systems.

AI Agent Security 2601.15528 notes →

Interoperable Architecture for Digital Identity Delegation for AI Agents with Blockchain Integration

Introduces delegation grants and a canonical verification context for bounded, auditable identity delegation across human users and AI agents in heterogeneous identity ecosystems.

AI Agent Security 2601.14982 notes →

INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Proposes an infection-aware defense framework for multi-agent systems that distinguishes infected agents from attackers and applies topological constraints to halt malicious propag…

AI Agent Security 2601.14667 notes →

Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems

Proposes AGEA, an agentic framework using novelty-guided exploration and graph memory to steal latent entity-relation graphs from GraphRAG systems under strict query budgets.

AI Agent Security 2601.14662 notes →

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Introduces activation-space guardrails that detect privacy-violating intent in LLM agents through linear separation of internal representations, including drift detection across mu…

AI Agent Security 2601.14660 notes →

VirtualCrime: Evaluating Criminal Potential of Large Language Models via Sandbox Simulation

Proposes a three-agent sandbox simulation framework with 40 crime tasks across 13 objectives to evaluate the criminal capabilities of LLM agents in realistic scenarios.

AI Agent Security 2601.13981 notes →

PINA: Prompt Injection Attack against Navigation Agents

Introduces an adaptive prompt injection framework targeting navigation agents under black-box, long-context, and action-executable constraints across indoor and outdoor environment…

AI Agent Security 2601.13612 notes →

Prompt Injection Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Explores a multi-agent defense pipeline combining semantic similarity caching, nested learning, and observability-aware evaluation to mitigate prompt injection attacks while reduci…

AI Agent Security 2601.13186 notes →

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Introduces an overthinking attack framework for RAG systems with reasoning models, using multi-agent-constructed poisoning samples that cause excessive reasoning token consumption …

AI Agent Security 2601.13112 notes →

AgenTRIM: Tool Risk Mitigation for Agentic AI

Introduces a framework for detecting and mitigating tool-driven agency risks through offline interface verification and runtime per-step least-privilege tool access with adaptive f…

AI Agent Security 2601.12449 notes →

Efficient Privacy-Preserving Retrieval Augmented Generation with Distance-Preserving Encryption

Proposes a privacy-preserving RAG framework using conditional approximate distance-comparison-preserving encryption that enables similarity computation on encrypted embeddings in u…

AI Agent Security 2601.12331 notes →

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Proposes a mandatory access control framework for LLM agent systems that monitors agent-tool interactions via information flow graphs and enforces attribute-based policies against …

AI Agent Security 2601.11893 notes → 💬 Tier 2 (side reading). MAC framework for various types of privilege escalation. 2603.19469 …

Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

Introduces governance graphs as public, immutable manifests with enforceable sanctions and restorative paths to govern multi-agent LLM coordination and prevent harmful collusion.

AI Agent Security 2601.11369 notes →

SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Proposes a prompt-injection-resilient RAG framework that decouples security enforcement from generation by applying sanitization and policy-aware disclosure controls during the ret…

AI Agent Security 2601.11199 notes →

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Introduces a stealthy multi-turn economic DoS attack exploiting the agent-tool communication loop through MCP-compatible tool server modifications that inflate costs by up to 658x.

AI Agent Security 2601.10955 notes →

Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG

Introduces a benchmark and harness for evaluating web-facing RAG systems under indirect prompt injection and retrieval poisoning attacks with standardized end-to-end evaluation fro…

AI Agent Security 2601.10923 notes →

Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

Introduces a neuro-symbolic containment architecture that decouples normative reasoning from instrumental decision-making through a Moral Module, Decision-Making Module, and compli…

AI Agent Security 2601.10520 notes →

AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior

Presents a security framework that learns context-aware access-control policies from monitored execution traces to govern AI agent operations and detect malicious inputs while pres…

AI Agent Security 2601.10440 notes →

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Analyzes 42,447 agent skills from two major marketplaces to study the prevalence and types of security vulnerabilities spanning prompt injection, data exfiltration, privilege escal…

AI Agent Security 2601.10338 notes →

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

Proposes single-shot planning for Computer Use Agents that provides provable control flow integrity against prompt injection while preserving agent capability.

AI Agent Security 2601.09923 notes →

Blue Teaming Function-Calling Agents

Tests open-source function-calling LLMs against multiple attack types with various defenses to study the readiness of current models and mitigations for production deployment.

AI Agent Security 2601.09292 notes →

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Examines how commercial planning and web-use agents handle user-mediated attacks where the user themselves provides adversarial instructions without explicit safety requests.

AI Agent Security 2601.10758 notes →

Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant

Formalizes how propositions gain unwarranted trust by crossing architecturally trusted interfaces in agent systems, studying whether circular epistemic justification is inevitable …

AI Agent Security 2601.08333 notes →

Towards Verifiably Safe Tool Use for LLM Agents

Proposes applying System-Theoretic Process Analysis to identify hazards in agent tool-use workflows, deriving formal safety specifications enforced through a capability-enhanced Mo…

AI Agent Security 2601.08012 notes →

MCP-ITP: An Automated Framework for Implicit Tool Poisoning in MCP

Introduces an automated framework for implicit tool poisoning in MCP where a poisoned tool remains uninvoked but its metadata manipulates the agent into performing malicious operat…

AI Agent Security 2601.07395 notes →

Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems

Proposes a black-box attack that decomposes indirect prompt injection into trigger and attack fragments to study end-to-end IPI exploits under natural queries across RAG and agenti…

AI Agent Security 2601.07072 notes →

MemTrust: A Zero-Trust Architecture for Unified AI Memory System

Proposes a hardware-backed zero-trust architecture for AI memory systems that applies TEE protection across five functional layers with a cross-application sharing protocol for age…

AI Agent Security 2601.07004 notes →

SafePro: Evaluating the Safety of Professional-Level AI Agents

Introduces a benchmark for evaluating safety alignment of AI agents performing professional-level tasks across diverse domains, uncovering new unsafe behaviors in complex professio…

AI Agent Security 2601.06663 notes →

Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset

Demonstrates that off-the-shelf LLM agents with web search can re-identify participants in anonymized qualitative datasets using only natural-language prompts, lowering the technic…

AI Agent Security 2601.05918 notes →

Toward Safe and Responsible AI Agents: A Three-Pillar Model for Transparency, Accountability, and Trustworthiness

Proposes a conceptual and operational framework for safe AI agent development grounded in transparency, accountability, and trustworthiness, with progressive validation analogous t…

AI Agent Security 2601.06223 notes →

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Proposes a verify-before-commit protocol for defending LLM agents against tool stream injection, using speculative hypothesis generation and intent-grounded verification to balance…

AI Agent Security 2601.05755 notes →

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Evaluates memory poisoning attacks on memory-augmented LLM agents and proposes two defense mechanisms: input/output moderation with composite trust scoring and memory sanitization …

AI Agent Security 2601.05504 notes →

STELP: Secure Transpilation and Execution of LLM-Generated Programs

Proposes a secure transpiler and executor for LLM-generated code that detects vulnerabilities and safely executes code snippets in autonomous production AI systems without relying …

AI Agent Security 2601.05467 notes →

Conformity and Social Impact on AI Agents

Investigates conformity bias in AI agents under social pressure using adapted visual experiments from social psychology, studying sensitivity to group size, unanimity, task difficu…

AI Agent Security 2601.05384 notes →

Defense Against Indirect Prompt Injection via Tool Result Parsing

Proposes a tool result parsing method for defending LLM agents against indirect prompt injection by providing precise data while filtering out injected malicious code.

AI Agent Security 2601.04795 notes →

Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Surveys agent-blockchain interoperability patterns and threat models for agent-driven transaction pipelines, covering custody models, policy enforcement, and multi-agent workflows.

AI Agent Security 2601.04583 notes →

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Proposes a stage-aware framework for analyzing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows with cross-stage trigger propagation.

AI Agent Security 2601.04566 notes →

HoneyTrap: Deceiving LLM Attackers with Resilient Multi-Agent Defense

Proposes a deceptive defense framework using collaborative defender agents to counter multi-turn jailbreak attacks by strategically wasting attacker resources.

AI Agent Security 2601.04034 notes →

SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

Systematizes privacy risks, mitigation techniques, and evaluation strategies in RAG systems through a comprehensive literature review with a taxonomy and process diagram.

AI Agent Security 2601.03979 notes →

AgentMark: Utility-Preserving Behavioral Watermarking for Agents

Proposes a behavioral watermarking framework that embeds multi-bit identifiers into agent planning decisions for IP protection and regulatory provenance while preserving utility.

AI Agent Security 2601.03294 notes →

Structural Representations for Cross-Attack Generalization in AI Agent Threat Detection

Proposes structural tokenization that encodes execution-flow patterns instead of conversational content to improve cross-attack generalization in AI agent threat detection.

AI Agent Security 2601.01723 notes →

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Introduces a cognitive collusion attack where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels without covert commun…

AI Agent Security 2601.01685 notes →

MCP-SandboxScan: WASM-based Secure Execution and Runtime Analysis for MCP Tools

Proposes a lightweight framework that safely executes untrusted MCP tools inside a WebAssembly sandbox and produces auditable reports of external-to-sink exposures.

AI Agent Security 2601.01241 notes →

Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai

Analyzes toxicity adoption dynamics among LLM-driven agents on a fully AI-driven social platform, studying how cumulative toxic exposure affects the probability of toxic responses.

AI Agent Security 2601.01090 notes →

Trajectory Guard: A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI

Proposes a Siamese Recurrent Autoencoder with hybrid contrastive-reconstruction loss for real-time anomaly detection in agent action trajectories.

AI Agent Security 2601.00516 notes →

Mapping Human Anti-collusion Mechanisms to Multi-agent AI

Maps human anti-collusion mechanisms including sanctions, leniency, monitoring, and market design to potential interventions for multi-agent AI systems.

AI Agent Security 2601.00360 notes →

Making Theft Useless: Adulteration-Based Protection of Proprietary Knowledge Graphs in GraphRAG Systems

Proposes a data adulteration framework that pre-emptively injects plausible but false entries into knowledge graphs to make stolen GraphRAG KGs unusable to adversaries.

AI Agent Security 2601.00274 notes →

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Examines intergroup bias in LLM agents under minimal group cues and formalizes a Belief Poisoning Attack that manipulates agent identity beliefs to induce outgroup bias toward huma…

AI Agent Security 2601.00240 notes →

Runtime Governance for AI Agents: Policies on Paths

AI agents -- systems that plan, reason, and act using large language models -- produce non-deterministic, path-dependent behavior that cannot be fully governed at design time, wher…

AI Agent Security 2603.16586 notes → 💬 Tier 2. Theoretical neighbor of clawpatrol/ClawFleet. Policies on paths…

A Framework for Formalizing LLM Agent Security

Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruc…

AI Agent Security 2603.19469 notes → 💬 Tier 2. Framework for formalizing LLM agent security. grant-TTL/approval …