[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive Auditing
Anthropic's AuditBench - 56 Llama 3.3 70B models with planted hidden behaviors - their best agent detects the behaviros 10-13% of the time (42% with a super-agent aggregating many parallel runs). a central finding is the "tool-to-agent gap" - white-box interpretability tools that work in standalone evaluation fail to help the agent in practice.
most auditing work uses the base model as a reference to compare against. i wanted to know if you can detect these modifications blind - no reference model, no training data, just the target model itself.
maybe you can? and the method is embarrassingly simple. LoRA fine-tuning tends to modify later layers more than earlier ones. so i train a Ridge regression from early-layer activations (~L12) to late-layer activations (~L60) and look at the residuals. wherever the late layers do something the early layers didn't predict, that's a candidate for planted behavior. the model rats itself out.
results on 4 AuditBench organisms with ground-truth labels, n=30, 200 permutations:
- hardcode_test_cases: 0.889 AUROC (p=0.005) - beats known-origin (0.800)
- animal_welfare: 0.844 (p=0.005) - close to known-origin (0.911)
- anti_ai_regulation: 0.833 (p=0.015)
- secret_loyalty: 0.800 (p=0.000) - matches known-origin exactly
3/4 match or exceed the version where you have the base model. not having the reference sometimes helps because the residual captures nonlinear interactions that a flat subtraction misses.
i also tried three unsupervised discovery methods (PCA, residual norms, centroid distances) to figure out WHICH topics are affected. all three failed. PCA captures reasoning style, not topic-specific bias. residual norms surface creative writing prompts. centroid distances can't find directional shifts at n=11 in 8192 dims.
what actually works is even simpler... ask the model to argue both sides of contentious topics and score the balance. the animal_welfare organism scores 5/5 imbalance on factory farming and flat-out refuses to argue for it under explicit framing. control topics score 2-3. no activations needed.
the interesting part is what else it catches. i built a topic funnel (three-level beam-like search over 243 topics, ~100 chat calls) and it surfaced Gender & Sexuality at 5/5 imbalance and Language & Cultural Identity at 5/5 too. these aren't planted LoRA behaviors. that's the base Llama 70B model's RLHF training - it argues more forcefully for progressive positions on socially sensitive topics and hedges or weakens arguments on the other side. the funnel can't tell the difference between "someone secretly fine-tuned this model" and "the base model has opinions from RLHF." so i built a post-funnel filter using probe specificity and cross-transfer to separate them. planted behaviors are narrow and topic-specific (factory farming probe doesn't fire on gender topics). RLHF biases are broad and fire across all progressive topics. the filter gives planted_score 7.6 for factory farming vs 2.0 for gender & sexuality. cleanly separated.
this means the funnel is potentially useful as a standalone RLHF auditing tool on any model, not just AuditBench organisms. you hand it a model and it tells you where the model's opinions are lopsided. nobody asked for that result but there it is.
main limitations: n=30 is small, only 4 organisms tested (pilot not benchmark), anti_ai_regulation is unstable under LOO, Sonnet scoring introduces subjectivity, all running in NF4 quantization.
building this into a full agentic auditing system next. code is here (i am in the middle of it, it is a complete mess at the moment, but i wanted to get it out there): https://github.com/bmarti44/reference-free-behavioral-discovery
full (er) writeup -> https://bmarti44.substack.com/p/rip-it-out-by-the-roots
where should i go next? is this completely off?
[link] [comments]
Want to read more?
Check out the full article on the original site