← Back

Internal Safety Collapse in Frontier Large Language Models

AI Agent Security arxiv arXiv:2603.23509 PDF β†—
frontiercollapsesafetymodelsinternaldangerousiclmodeneeded
Reveals that AI agents produce harmful content (toxic text, exploits, dangerous data) as a side effect of completing normal professional tasks β€” no adversarial prompting needed. At least one mode (single-turn, ICL, or agentic) succeeds on every frontier model tested. 56 cross-domain scenarios across 8+ disciplines.
5~10λΆ„. 제λͺ©β†’μ΄ˆλ‘β†’μΈνŠΈλ‘œβ†’μ„Ήμ…˜ν—€λ”β†’κ·Έλ¦Όβ†’κ²°λ‘ λ§Œ.
νŒλ‹¨: μ–΄λ–€ 문제λ₯Ό ν’€κ³  / 핡심 아이디어 / λ‚΄ μž‘μ—…κ³Ό κ΄€λ ¨ μžˆλ‚˜?
~1μ‹œκ°„. κ·Έλ¦ΌΒ·ν‘œλ₯Ό 꼼꼼히. 증λͺ…Β·μˆ˜μ‹ λ””ν…ŒμΌμ€ κ±΄λ„ˆλœ€.
μ‚°μΆœλ¬Ό: "이듀이 뭘 ν–ˆκ³  μ™œ 그게 ν†΅ν•˜λŠ”κ°€" ν•œ 문단.
μž¬ν˜„ν•˜λ“― 읽기. 가정을 μ˜μ‹¬. 직접 인용/λ°˜λ°•ν•  λ…Όλ¬Έλ§Œ.
렌즈: "λ‚΄ ν”Œλ¦Ώμ—μ„œ μΈ‘μ •ν•˜λ©΄ μ €μžκ°€ λͺ» ν•œ 무엇을 보여쀄 수 μžˆλ‚˜?"
View in Knowledge Graph β†’