[R] An attack class that passes every current LLM filter - no payload, no injection signature, no log trace

https://shapingrooms.com/research

I've been documenting what I'm calling postural manipulation: a specific class of language that installs an interpretive stance before a task arrives, producing measurably larger directional shifts in model outputs than matched control text of identical length and semantic similarity.

The core empirical claim: this is not ordinary context sensitivity. Matched controls produced significantly smaller shifts. Binary decision reversals documented with paired controls across four frontier models using a locked scoring rubric.

The mechanism as best I can characterize it from behavioral observation: the model reconstructs its orientation from everything in its context window at each step. Language that proposes how to interpret what follows gets absorbed into the reasoning state differently than language that reports facts. By the time the task arrives, the model is not weighing the primer against other evidence. It is reasoning from a stance the primer already shaped.

In agentic pipelines it propagates. Two confirmed propagation conditions: primer-present handoff (phrase survives summarization) and primer-absent directional carry (direction persists even when the phrase does not appear in the summary). Posture installed in Agent A had hardened into what read as independent expert judgment by Agent C.

Methodology is black-box observational via consumer interfaces. No model internals access. Small N on propagation findings. Limitations stated plainly. The behavior I'm documenting needs attention analysis and logit-level work from people with internals access to characterize the mechanism properly. This is the behavioral layer of that problem.

Paper published today following coordinated disclosure to frontier AI labs and CERT/CC.

Locked scoring rubric is in the paper appendix. Full dataset available on request for replication.

Demos: https://shapingrooms.com/demos

GitHub issue (OWASP): https://github.com/OWASP/www-project-top-10-for-large-language-model-applications/issues/807

submitted by /u/lurkyloon
[link] [comments]

[R] An attack class that passes every current LLM filter - no payload, no injection signature, no log trace

Want to read more?

Tagged with