Measure the score reliably. Discover the failures broadly. The paper shows that these two goals scale differently, which is why VIVID matters for the product experience of image AI.
Agent-based products fail through interaction paths: confusion accumulates, trust breaks, and different users expose different corners of the behavior space.
This is why the image model evaluation follows persona-specific journeys rather than one prompt, one image, one score.
The paper starts by turning a judge into a structured virtual individual: traits, expertise, memory, tools, and repeated interaction behavior.
Each virtual individual carries a Big Five personality profile, task expertise, memory, and tool access.
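To make the structure concrete, here is a minimal sketch of how such a virtual individual could be represented. The class name `PersonaAgent`, the field names, and the 0-to-1 trait scale are assumptions for illustration; the paper's actual schema is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The five standard Big Five trait names.
BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")


@dataclass
class PersonaAgent:
    """Hypothetical container for one structured persona (not the paper's schema)."""
    name: str
    big_five: Dict[str, float]           # trait -> score; the 0..1 scale is an assumption
    expertise: str                       # e.g. "novice" or "professional" (illustrative labels)
    tools: Tuple[str, ...] = ()          # tools this persona is allowed to call
    memory: List[str] = field(default_factory=list)  # events carried across interactions

    def __post_init__(self) -> None:
        # Reject personas that leave any Big Five trait unspecified.
        missing = set(BIG_FIVE) - set(self.big_five)
        if missing:
            raise ValueError(f"missing Big Five traits: {sorted(missing)}")

    def remember(self, event: str) -> None:
        """Append an observation so later turns can condition on it."""
        self.memory.append(event)
```

Keeping memory on the persona object, rather than in a single prompt string, is what lets repeated interactions accumulate confusion or trust over a journey.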
A panel spans expertise, personality, and user context so discoveries do not collapse into one judge's preference.
The design crosses domains, complexity, model families, and output types. That structure lets the paper separate model quality from coverage of user-visible issues.
The validation layer asks whether persona agents behave like useful evaluators: human-aligned enough for measurement, diverse enough to find what a single judge misses.
The paper's central result is the split between measurement reliability and issue discovery coverage. Read the figure as evidence, not decoration.
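The split can be illustrated with two separate quantities. The sketch below is not the paper's metric definition: it assumes reliability is summarized by the standard error of the mean panel score and coverage by the fraction of known issues found by at least one agent. The function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import stdev


def score_standard_error(scores):
    """Standard error of the mean score; shrinks as more agents score the same thing."""
    return stdev(scores) / sqrt(len(scores))


def issue_coverage(found_by_agent, all_issues):
    """Fraction of known issues reported by at least one agent in the panel."""
    found = set().union(*found_by_agent)
    return len(found & all_issues) / len(all_issues)


# Toy data (invented for illustration): five issue types, four agents.
all_issues = {"identity", "memory", "place", "text", "commerce"}

# Repeating one agent: every run reports the same two issues.
repeated = [{"identity", "text"}] * 4

# A diverse panel: each persona surfaces a different corner of the behavior space.
diverse = [{"identity"}, {"memory", "text"}, {"place"}, {"identity", "commerce"}]

coverage_repeated = issue_coverage(repeated, all_issues)  # 0.4
coverage_diverse = issue_coverage(diverse, all_issues)    # 1.0
```

In this toy setup, adding copies of the same agent can tighten the score estimate but leaves coverage flat, while a varied panel raises coverage. That is the sense in which the two goals scale differently.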
Section 4.5 tests this directly: structured persona conditioning beats simple prompting, no-persona prompting, and repeating one agent.
The paper compares four ways to create evaluators: structured personas, simple persona prompts, no persona, and repeating one agent.
VIVID shows which user segments keep trusting image AI, disengage from it, or route around it after repeated edits.
VIVID simulates user journeys with memory, persona, and goals.
Structured persona panels separate reliable measurement from broad issue discovery.
S1-S5 show concrete friction: identity, memory, place, text, and commerce trust.
32 persona agents, 960 sessions, 15 tasks, 5 domains, and 3 complexity levels.
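A quick sanity check on these counts, as a sketch. It assumes sessions are spread evenly across personas and tasks, which the summary does not state. Only the five totals come from the text; the derived ratios are arithmetic, and their interpretation is a guess.

```python
# Totals quoted in the text.
personas, sessions, tasks, domains, levels = 32, 960, 15, 5, 3

# Assumption: an even split of sessions across personas.
sessions_per_persona = sessions // personas                 # 960 / 32 = 30

# 15 tasks over 5 domains gives 3 per domain, which matches the 3 complexity
# levels. One task per domain per level is an assumption, not a stated fact.
tasks_per_domain = tasks // domains                         # 15 / 5 = 3

# Assumption: an even split across persona-task pairs. What the resulting
# factor of 2 corresponds to is not specified in the text.
sessions_per_persona_task = sessions // (personas * tasks)  # 960 / 480 = 2
```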