PSA arXiv paper - image modeling briefing

Evaluation has two jobs.

Measure the score reliably. Discover the failures broadly. The paper shows these scale differently, which is exactly why VIVID matters for image AI product experience.

Source: Logarithmic Scores, Power-Law Discoveries, Jung & Na, 2026 preprint.

The problem

Single judges collapse user experience into one number.

Agent-based products fail through interaction paths: confusion accumulates, trust breaks, and different users expose different corners of the behavior space.

A model can score well and still leave decision-makers blind to the tail of user friction.

This is why the image modeling evaluation uses persona-specific journeys rather than one prompt, one image, one score.

Who tests the target?

The evaluator is a persona, not a generic judge prompt.

The paper starts by turning a judge into a structured virtual individual: traits, expertise, memory, tools, and repeated interaction behavior.

Persona Agent Judge

Big Five profile, task expertise, memory, and tool access.

Openness
Care
Risk
Expertise

32 diverse evaluator agents

A panel spans expertise, personality, and user context so discoveries do not collapse into one judge's preference.

Multi-round interaction: each persona uses and evaluates the target system over time.
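The persona description above (traits, expertise, memory, tools, repeated interaction) maps naturally onto a small data structure. The following is a minimal sketch, not the paper's implementation: the class name `PersonaJudge`, its field names, the 0-1 trait scale, and the prompt wording are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaJudge:
    """Hypothetical container for one persona agent judge."""

    name: str
    big_five: dict[str, float]  # trait -> value in [0, 1]; the scale is an assumption
    expertise: str  # e.g. "novice" or "expert"
    tools: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)  # notes carried across rounds

    def remember(self, observation: str) -> None:
        """Store an observation so later rounds can refer back to it."""
        self.memory.append(observation)

    def system_prompt(self) -> str:
        """Render the structured profile as conditioning text (wording is illustrative)."""
        traits = ", ".join(f"{k}={v:.2f}" for k, v in sorted(self.big_five.items()))
        notes = "; ".join(self.memory) if self.memory else "none yet"
        return (
            f"You are {self.name}, a {self.expertise} user. "
            f"Big Five profile: {traits}. "
            f"Tools: {', '.join(self.tools) or 'none'}. "
            f"Memory of earlier rounds: {notes}."
        )


judge = PersonaJudge(
    name="persona_07",
    big_five={
        "openness": 0.8,
        "conscientiousness": 0.6,
        "extraversion": 0.3,
        "agreeableness": 0.5,
        "neuroticism": 0.4,
    },
    expertise="expert",
    tools=["image_editor"],
)
judge.remember("Round 1: edited face no longer matched the reference photo.")
judge.remember("Round 2: product label text was garbled after resize.")
print(judge.system_prompt())
```

Keeping memory on the persona object is what makes multi-round evaluation different from rerunning a stateless judge prompt: each round's prompt reflects what that persona has already seen.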
Experiment Design

The experiment creates breadth without losing control.

The design crosses domains, complexity, model families, and output types. That structure lets the paper separate model quality from coverage of user-visible issues.

15 multi-domain tasks
Domains: SaaS, Dev, E-Com, Edu, Health
Complexity: Simple, Medium, Hard

× 2 model pairs: proprietary vs. open-source target systems.

2 output channels: quality score and qualitative insights.

960 sessions: the panel is large enough to study scaling behavior.
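The counts in this design are mutually consistent if the factors are fully crossed. The sketch below enumerates that crossing as a sanity check. It assumes one task per domain and complexity cell, and it treats "2 model pairs" as two target conditions; neither assumption is stated explicitly in this briefing, and the variable names are illustrative.

```python
from itertools import product

N_AGENTS = 32  # persona judges on the panel
DOMAINS = ["SaaS", "Dev", "E-Com", "Edu", "Health"]
COMPLEXITY = ["Simple", "Medium", "Hard"]
TARGETS = ["proprietary", "open-source"]  # assumption: one condition per target type

# Assumption: one task per (domain, complexity) cell.
tasks = list(product(DOMAINS, COMPLEXITY))

# Assumption: every agent runs every task against every target.
sessions = list(product(range(N_AGENTS), tasks, TARGETS))

print(len(tasks), "tasks")
print(len(sessions), "sessions")
```

Under these assumptions the enumeration yields 15 tasks and 960 sessions, matching the numbers quoted above.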
Primary Findings 1

The paper validates both agreement and discovery depth.

The validation layer asks whether persona agents behave like useful evaluators: human-aligned enough for measurement, diverse enough to find what a single judge misses.

p = 0.38: agent-human score differences are indistinguishable from human-human disagreement.
50%: lower score variance when using structured persona conditioning.
+47%: increase in qualitative insights from structured personas.
1.9x: expert agents find more unique categories of issues than novice agents.
Primary Findings 2

Scores saturate fast. Discoveries keep growing.

The paper's central result is the split between measurement reliability and issue discovery coverage. The figure should be read as the evidence, not decoration.

Figure: score-coverage dissociation.
82%: N=8 is already strong for score reliability.
42%: N=8 is still thin for discovery coverage.
b = 0.69: unique issue discovery follows a power-law tail (b is the exponent of the power-law discovery curve).
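To make the exponent concrete: if unique discoveries follow D(N) = a * N^b, then doubling the panel multiplies discoveries by 2^b, which is about 1.6 for b = 0.69. The sketch below computes that factor and shows how an exponent could be recovered from (panel size, discoveries) data with a log-log least-squares fit. The function names, the constant a = 5, and the synthetic data are illustrative assumptions, not values or methods taken from the paper.

```python
import math

B = 0.69  # power-law exponent quoted in the briefing


def discovery_growth(scale: float, b: float = B) -> float:
    """Factor by which D(N) = a * N**b grows when N is multiplied by `scale`."""
    return scale ** b


def fit_power_law_exponent(panel_sizes, discoveries) -> float:
    """Slope of the least-squares line through (log N, log D)."""
    xs = [math.log(n) for n in panel_sizes]
    ys = [math.log(d) for d in discoveries]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic, noise-free data with a hypothetical constant a = 5.
sizes = [1, 2, 4, 8, 16, 32]
synthetic = [5 * n ** B for n in sizes]

print(f"doubling the panel multiplies discoveries by {discovery_growth(2):.2f}")
print(f"recovered exponent: {fit_power_law_exponent(sizes, synthetic):.2f}")
```

A curve with this shape never flattens the way a logarithmic score curve does, which is the practical reason a panel that is large enough for scoring can still be too small for discovery.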
Why PSA matters (vs. a simple persona prompt)

Structured personas control variance while preserving discovery diversity.

Section 4.5 tests this directly: structured persona conditioning beats simple prompting, no-persona prompting, and repeating one agent.

Section 4.5 ablation
Structured PSA keeps the panel diverse without making scores noisy.

The paper compares four ways to create evaluators: structured personas, simple persona prompts, no persona, and repeating one agent.

96 sessions: 4 conditions, 3 tasks, 8 agents
0.087: lowest score SD
+47%: more insights vs. simple
d = -0.35: moderated expertise effect
Ablation matrix: Score SD is better when lower. Insights/session is better when higher.
Structured is the only balanced condition.

Condition                           Score SD ↓   Insights / session ↑   Expertise effect
Structured (full persona profile)   0.087        13.2                   d = -0.35 (controlled)
Simple (persona text only)          0.160        9.0                    d = -0.17 (weakly separated)
No Persona (generic evaluator)      0.164        8.6                    d = -1.03 (expertise dominates)
Repeated (same agent rerun)         0.151        12.8                   not a persona panel
Takeaway: simple prompts can create surface diversity, but structured personas give lower score variance and better issue coverage at the same panel size.
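The headline ablation numbers can be rechecked from the four reported rows. The sketch below is plain arithmetic on those values; the dictionary layout and the helper `relative_change` are illustrative names, not anything defined in the paper.

```python
# Values copied from the Section 4.5 ablation matrix in this briefing.
ABLATION = {
    "Structured": {"score_sd": 0.087, "insights": 13.2},
    "Simple": {"score_sd": 0.160, "insights": 9.0},
    "No Persona": {"score_sd": 0.164, "insights": 8.6},
    "Repeated": {"score_sd": 0.151, "insights": 12.8},
}


def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` relative to `base`."""
    return (new - base) / base


structured, simple = ABLATION["Structured"], ABLATION["Simple"]
insight_gain = relative_change(structured["insights"], simple["insights"])
sd_change = relative_change(structured["score_sd"], simple["score_sd"])

lowest_sd = min(ABLATION, key=lambda c: ABLATION[c]["score_sd"])
most_insights = max(ABLATION, key=lambda c: ABLATION[c]["insights"])

print(f"insights vs. simple: {insight_gain:+.0%}")
print(f"score SD vs. simple: {sd_change:+.0%}")
print(f"lowest SD: {lowest_sd}; most insights/session: {most_insights}")
```

On these numbers, the structured condition has both the lowest score SD and the highest insights per session, with the insight gain over simple prompts rounding to the +47% quoted above.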
How VIVID applies to you

Use persona evaluation to turn model behavior into debugging evidence.

VIVID shows which user segments keep trusting, disengage, or route around image AI after repeated edits.

Step 1

Product lens

VIVID simulates user journeys with memory, persona, and goals.

Step 2

Paper evidence

Structured persona panels separate reliable measurement from broad issue discovery.

Step 3

Evaluation evidence

S1-S5 show concrete friction: identity, memory, place, text, and commerce trust.

Appendix

Full paper experiment diagram.

32 persona agents, 960 sessions, 15 tasks, 5 domains, and 3 complexity levels.

32 persona judges
960 sessions
15 tasks
5 x 3 domains x complexity levels
Figure: paper overview.