PSA arXiv paper - image modeling briefing

Evaluation has two jobs.

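Measure the score reliably. Discover the failures broadly. The paper shows that these two jobs scale differently with panel size: scores stabilize quickly, while new failures keep surfacing as evaluators are added. That is why VIVID matters for image AI product experience.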

Source: Logarithmic Scores, Power-Law Discoveries, Jung & Na, 2026 preprint.

The problem

Single judges collapse user experience into one number.

Agent-based products fail through interaction paths: confusion accumulates, trust breaks, and different users expose different corners of the behavior space.

A model can score well and still leave decision-makers blind to the tail of user friction.

This is why the image modeling evaluation uses persona-specific journeys rather than one prompt, one image, one score.

Who tests the target?

The evaluator is a persona, not a generic judge prompt.

The paper starts by turning a judge into a structured virtual individual: traits, expertise, memory, tools, and repeated interaction behavior.

Persona Agent Judge

Big Five profile, task expertise, memory, and tool access.

Profile dimensions shown: Openness, Care, Risk, Expertise.

32 diverse evaluator agents

A panel spans expertise, personality, and user context so discoveries do not collapse into one judge's preference.

Multi-round interaction: each persona uses and evaluates the target system over time.
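
To make the idea of a structured virtual individual concrete, here is a minimal sketch of how such a persona could be represented and rendered into an evaluator prompt. The class name `Persona`, its fields, and the `to_prompt` method are illustrative assumptions, not the paper's implementation; only the ingredients (Big Five profile, expertise, memory, tools) come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical structured persona; field names are illustrative only."""

    name: str
    big_five: dict[str, float]  # trait name -> value in [0, 1]
    expertise: str              # e.g. "novice" or "expert"
    tools: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)

    def remember(self, observation: str) -> None:
        """Store an observation so later rounds can refer back to it."""
        self.memory.append(observation)

    def to_prompt(self) -> str:
        """Render the profile plus the three most recent memories as text."""
        traits = ", ".join(f"{k}={v:.1f}" for k, v in sorted(self.big_five.items()))
        recent = "; ".join(self.memory[-3:]) or "none"
        tools = ", ".join(self.tools) or "none"
        return (
            f"You are {self.name}, a {self.expertise} user. "
            f"Traits: {traits}. Tools: {tools}. Recent memory: {recent}."
        )


# Example values below are invented for illustration.
persona = Persona(
    name="Dana",
    big_five={"openness": 0.8, "conscientiousness": 0.4},
    expertise="expert",
    tools=["image_editor"],
)
persona.remember("Round 1: the face changed after an edit")
prompt = persona.to_prompt()
print(prompt)
```

Keeping memory on the persona object is one simple way to let each round of interaction see what earlier rounds observed, which is what separates a multi-round journey from a one-shot judge prompt.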

Experiment Design

The experiment creates breadth without losing control.

The design crosses domains, complexity, model families, and output types. That structure lets the paper separate model quality from coverage of user-visible issues.

15 multi-domain tasks

Domains: SaaS, Dev, E-Com, Edu, Health

Complexity: Simple, Medium, Hard

2 model pairs: proprietary vs. open-source target systems.

2 output channels: quality score and qualitative insights.

960 sessions: the panel is large enough to study scaling behavior.
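
The counts above can be checked with a small sketch. It assumes the session grid is a full cross of 32 personas, 15 tasks (5 domains by 3 complexity levels), and 2 target systems; that factorization is an assumption that happens to match the reported totals, and the paper may construct its sessions differently. All variable names are hypothetical.

```python
from itertools import product

DOMAINS = ["SaaS", "Dev", "E-Com", "Edu", "Health"]
COMPLEXITY = ["Simple", "Medium", "Hard"]
TARGETS = ["proprietary", "open-source"]  # assumed: one target per model family
N_PERSONAS = 32

# Each task is one (domain, complexity) combination.
tasks = list(product(DOMAINS, COMPLEXITY))

# Assumed full cross: every persona runs every task against every target.
sessions = list(product(range(N_PERSONAS), tasks, TARGETS))

print(len(tasks), len(sessions))  # prints "15 960"
```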

Primary Findings 1

The paper validates both agreement and discovery depth.

The validation layer asks whether persona agents behave like useful evaluators: human-aligned enough for measurement, diverse enough to find what a single judge misses.

p = 0.38: Agent-human score differences are indistinguishable from human-human disagreement.

50%: Lower score variance when using structured persona conditioning.

+47%: Increase in qualitative insights from structured personas.

1.9x: Expert agents find more unique categories of issues than novice agents.

Primary Findings 2

Scores saturate fast. Discoveries keep growing.

The paper's central result is the split between measurement reliability and issue discovery coverage. The figure should be read as the evidence, not decoration.

82%: N=8 is already strong for score reliability.

42%: N=8 is still thin for discovery coverage.

b = 0.69: Unique issue discovery follows a power-law tail (b is the exponent of the power-law discovery curve).
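
To see what an exponent of 0.69 implies for panel sizing, the sketch below assumes the discovery curve has the form D(N) = a * N**b, where D is the number of unique issues found by N evaluators. That functional form and the helper name `discovery_growth` are assumptions made for illustration; the briefing reports only the exponent, so the sketch computes ratios in which the unknown coefficient a cancels.

```python
B = 0.69  # exponent reported in the briefing


def discovery_growth(n_from: int, n_to: int, b: float = B) -> float:
    """Ratio D(n_to) / D(n_from) under the assumed form D(N) = a * N**b."""
    if n_from <= 0 or n_to <= 0:
        raise ValueError("panel sizes must be positive")
    return (n_to / n_from) ** b


# How much more is discovered when the N=8 panel is doubled or quadrupled?
for n in (16, 32):
    print(f"8 -> {n} evaluators: x{discovery_growth(8, n):.2f} unique issues")
```

Under this assumed form, each doubling of the panel multiplies unique discoveries by roughly 1.6, so a panel that is large enough for stable scores can still be leaving a substantial share of issues unfound.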

Why PSA matters (vs. a simple persona prompt)

Structured personas control variance while preserving discovery diversity.

Section 4.5 tests this directly: structured persona conditioning beats simple prompting, no-persona prompting, and repeating one agent.

Section 4.5 ablation

Structured PSA keeps the panel diverse without making scores noisy.

The paper compares four ways to create evaluators: structured personas, simple persona prompts, no persona, and repeating one agent.

96 sessions = 4 conditions × 3 tasks × 8 agents

0.087: lowest score SD

+47%: more insights vs. simple

d = -0.35: moderated expertise effect

Ablation matrix: Score SD is better when lower. Insights/session is better when higher.

Structured is the only balanced condition

Condition                            Score SD ↓   Insights / session ↑   Expertise effect
Structured (full persona profile)    0.087        13.2                   d = -0.35 (controlled)
Simple (persona text only)           0.160        9.0                    d = -0.17 (weakly separated)
No Persona (generic evaluator)       0.164        8.6                    d = -1.03 (expertise dominates)
Repeated (same agent rerun)          0.151        12.8                   n/a (not a persona panel)

Takeaway: simple prompts can create surface diversity, but structured personas give lower score variance and better issue coverage at the same panel size.
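
The "only balanced condition" reading can be checked against the ablation values with a short sketch. The numbers are the Score SD and Insights/session figures from the ablation matrix; the helper names (`dominates`, `pareto_front`, `relative_change`) and the choice of a Pareto comparison are illustrative assumptions, not the paper's analysis.

```python
# Score SD and insights per session, as reported in the ablation matrix.
ABLATION = {
    "Structured": {"score_sd": 0.087, "insights": 13.2},
    "Simple": {"score_sd": 0.160, "insights": 9.0},
    "No Persona": {"score_sd": 0.164, "insights": 8.6},
    "Repeated": {"score_sd": 0.151, "insights": 12.8},
}


def dominates(a: dict, b: dict) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a["score_sd"] <= b["score_sd"] and a["insights"] >= b["insights"]
    better = a["score_sd"] < b["score_sd"] or a["insights"] > b["insights"]
    return no_worse and better


def pareto_front(rows: dict) -> list:
    """Conditions that no other condition dominates."""
    return [
        name
        for name, row in rows.items()
        if not any(dominates(other, row) for o, other in rows.items() if o != name)
    ]


def relative_change(new: float, base: float) -> float:
    """Signed relative change of new with respect to base."""
    return (new - base) / base


front = pareto_front(ABLATION)
insight_gain = relative_change(
    ABLATION["Structured"]["insights"], ABLATION["Simple"]["insights"]
)
sd_change = relative_change(
    ABLATION["Structured"]["score_sd"], ABLATION["Simple"]["score_sd"]
)
print(front, f"{insight_gain:+.0%}", f"{sd_change:+.0%}")
```

With these values the structured condition is the only one not dominated on both metrics, and the insight gain over the simple condition rounds to the +47% quoted above. The score SD change is computed from the rounded table values, so it is an approximation.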

How VIVID applies to your product

Use persona evaluation to turn model behavior into debugging evidence.

VIVID shows which user segments keep trusting, disengage, or route around image AI after repeated edits.

Step 1

Product lens

VIVID simulates user journeys with memory, persona, and goals.

Step 2

Paper evidence

Structured persona panels separate reliable measurement from broad issue discovery.

Step 3

Evaluation evidence

S1-S5 show concrete friction: identity, memory, place, text, and commerce trust.

Appendix

Full paper experiment diagram.

32 persona agents, 960 sessions, 15 tasks, 5 domains, and 3 complexity levels.

32 persona judges

960 sessions

15 tasks

5 x 3 (domains x complexity levels)