From Machine Learning

Seeking arXiv cs.CL endorser for first-time submission [D]

Hi all — I'm an independent researcher in rural Manitoba submitting my first paper to arXiv cs.CL and looking for an endorser.

Paper: The Multivac: Blind Peer Matrix Evaluation of Frontier Language Models

Methodology: A fully symmetric N×N multi-judge evaluation in which 10 frontier LLMs simultaneously generate responses and then blindly judge one another's outputs, with self-judgments excluded. It extends Verga et al.'s Panel of LLM Judges (PoLL) by making the panel fully symmetric, so every model serves as both respondent and judge.
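For readers new to the setup, here is a minimal sketch of how one question's peer matrix could be aggregated. It assumes scores are stored as a square matrix where `matrix[j][r]` is judge j's score for respondent r, with `None` marking a missing or invalid judgment. The function name `peer_scores` and this data layout are hypothetical and are not taken from the multivac-evaluation repo.

```python
from __future__ import annotations

import math
from statistics import mean


def peer_scores(matrix: list[list[float | None]]) -> list[float]:
    """Mean score each respondent receives from the *other* judges.

    Hypothetical layout: matrix[j][r] is judge j's score for respondent r.
    The diagonal (self-judgments) is skipped, and None marks a missing or
    invalid judgment, which is also skipped.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("peer matrix must be square (N x N)")
    means = []
    for r in range(n):
        # Collect column r, excluding the self-judgment at matrix[r][r].
        received = [
            matrix[j][r]
            for j in range(n)
            if j != r and matrix[j][r] is not None
        ]
        means.append(mean(received) if received else math.nan)
    return means


# Toy 3-model example: row = judge, column = respondent.
# The inflated diagonal values (self-scores) have no effect on the result.
example = [
    [9.0, 7.0, 8.0],
    [6.0, 9.5, 7.0],
    [8.0, 6.0, 9.9],
]
print(peer_scores(example))  # [7.0, 6.5, 7.5]
```

Masking the diagonal, rather than subtracting a self-score afterwards, keeps each respondent's mean defined purely by its N−1 peers, which is what "self-judgments excluded" implies.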

Scale: 286 evaluations, 198 de novo questions, 9 category pools (code, reasoning, analysis, communication, meta-alignment, edge cases, plus focused SLM/Qwen/MiniMax batches), 55 models in total across evaluations, and 22,254 valid judgments out of 27,540 total (the larger figure includes self-exclusions).

Key findings:

  • No single model dominates — 6 different models lead the 9 category pools. The model with the most first-place finishes (GPT-5.4, 53 wins) ranks 16th by mean score.
  • Same-family rating bias is statistically significant in all 8 families tested, ranging from +0.91 (Qwen) to −1.02 (Mistral). The negative bias pattern for Mistral and Google does not appear to have been reported before.
  • The top 4 frontier models are pairwise statistically indistinguishable (overlapping bootstrap 95% CIs, all p > 0.07), so aggregate leaderboard differences in the top tier aren't statistically meaningful.
  • Code evaluation has roughly 1.8× the judge disagreement of meta-alignment (σ = 1.27 vs 0.71). Overall inter-judge agreement is Krippendorff's α = 0.618.
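Krippendorff's α compares the disagreement observed among judges scoring the same item with the disagreement expected by chance across all scores. The post does not say which distance metric was used, so the sketch below assumes the interval metric (squared differences) and assumes each "unit" is one response holding the scores it received from different judges. The function names are hypothetical and not from the multivac-evaluation repo.

```python
from __future__ import annotations


def _sq_diff_sum(values: list[float]) -> float:
    """Sum of (v_i - v_j)^2 over all ordered pairs i != j, computed in O(n)
    via the identity 2 * n * sum(v^2) - 2 * (sum(v))^2."""
    n = len(values)
    s = sum(values)
    s2 = sum(v * v for v in values)
    return 2 * n * s2 - 2 * s * s


def krippendorff_alpha_interval(units: list[list[float]]) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    Assumed layout: each unit is one judged item (e.g. one response) and
    holds the scores it received; missing judgments are simply absent.
    Units with fewer than two scores are not pairable and are dropped.
    """
    pairable = [u for u in units if len(u) >= 2]
    pooled = [v for u in pairable for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one unit with two or more scores")
    # Observed disagreement: within-unit squared differences, each unit
    # weighted by 1 / (m_u - 1), normalised by the number of pairable values.
    d_o = sum(_sq_diff_sum(u) / (len(u) - 1) for u in pairable) / n
    # Expected disagreement: squared differences across all pooled values.
    d_e = _sq_diff_sum(pooled) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all scores identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Toy data: two items, each scored by two judges.
print(round(krippendorff_alpha_interval([[1, 2], [3, 3]]), 4))  # 0.7273
```

α = 1 means judges never disagree on an item, and α = 0 means within-item disagreement is no smaller than chance. Reading a per-category σ (as in the code vs meta-alignment comparison) alongside a pooled α is one way to see where that overall figure comes from.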

Open release: Full dataset (27,540 judgments with complete provenance), evaluation framework, and all 198 question prompts released under MIT license. Repo: https://github.com/themultivac/multivac-evaluation

The ask: Since this is my first arXiv cs.* submission, I need endorsement from an existing author. If you're eligible (2+ cs.CL papers on arXiv in the last 5 years) and willing to take a look, the endorsement code is S33JQD and the link is https://arxiv.org/auth/endorse?x=S33JQD — it takes about 30 seconds.

Happy to share the PDF with anyone willing to review. Also open to feedback/critique on the methodology before I finalize v1.

Thanks for considering.

submitted by /u/Silver_Raspberry_811
