VIVID for Apple Image Modeling.

VIVID turns image-model evaluation into a longitudinal debugging lens. In this demo, persona-grounded users edit across five mobile photo scenarios, revealing where trust strengthens, where friction accumulates, and which model behavior needs deeper inspection.


1200 Judge-score records

Photo scenarios

Persona journeys

Judge families

24/40 GPT session wins

Product story

VIVID evaluates the image service, not just the image.

A single edited image is not enough. VIVID tracks whether trust holds across repeated edits and which failures change user behavior.

Evaluation loop

Persona -> turns -> product signals.

Persona: context, goals, tolerance

→

Repeated edits: memory and feedback carry forward

→

Signals: acceptance, drift, friction

Failure is measured as behavior over time, not as a one-shot aesthetic score.

Readout

Keep using: Acceptance

Workaround: Switching

Accumulation: Drift

Root cause: Friction

Failure modes to inspect

Identity drift · Text corruption · Object instability · Over-polish · Trust loss

Paper evidence

The interesting claim: scores converge before failures are discovered.

Model score and friction coverage are separate signals. That is why image-modeling decisions need persona-specific, multi-turn evaluation.

Central result

Score stabilizes early. Discovery keeps moving.

Persona panels produce two different signals: a faster model-quality estimate and a slower, broader friction map.

Scoring reliability

log(N)

Issue discovery

N^0.69

p = 0.379: Agent-human score gap matches human-human disagreement.

+47%: More qualitative insights from structured personas.

1.9x: Expert personas find more unique issue categories.
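The difference between the two growth shapes can be made concrete with a small sketch. It assumes only the functional forms quoted above, log(N) for scoring reliability and N^0.69 for issue discovery. The scale and offset of each curve are not given on this page, so the function names and the panel sizes below are illustrative.

```python
import math

DISCOVERY_EXPONENT = 0.69  # exponent quoted on this page


def score_signal(n: int) -> float:
    """Log-shaped reliability curve, up to an unknown scale and offset."""
    return math.log(n)


def discovery_signal(n: int) -> float:
    """Power-law discovery curve, up to an unknown scale."""
    return n ** DISCOVERY_EXPONENT


# Hypothetical panel sizes, chosen only to show the effect of doubling.
for n in (8, 16, 32, 64):
    added_score = score_signal(2 * n) - score_signal(n)
    discovery_ratio = discovery_signal(2 * n) / discovery_signal(n)
    print(f"N={n:>3} -> 2N: score +{added_score:.3f}, discovery x{discovery_ratio:.3f}")
```

Under these assumptions, each doubling of the panel adds the same fixed increment (log 2) to the score curve, a shrinking share of the total, while discovery keeps multiplying by about 1.61. That is one way to read "score stabilizes early, discovery keeps moving".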

Implication for image-modeling teams

Do not collapse evaluation to a single average score. Pair model-level acceptance with persona-specific friction coverage so engineers can see which user contexts still need inspection.


Multi-turn image-editing evaluation

Two models. Eight personas.
Five mobile photo-editing scenarios.

Every persona edits turn by turn from its own feedback and memory. Saved outputs are then evaluated by three judge families so the analysis can separate model behavior, persona differences, and judge-family variance.

Persona starts from context

Each persona simulation agent has habits, tolerance thresholds, alternative services, and memory of prior friction.

Model edits the current image

The next turn is driven by that persona's own feedback, so drift can accumulate instead of being reset.

Judges score saved evidence

Claude judges in-session; GPT-5.4-mini and Grok-4.3 rejudge post-session to check family bias.

VIVID explains the product risk

The output covers acceptance, friction category, persona split, judge disagreement, and the implication for product validation.

Acceptance subjective.vibe_match

A calibrated 0-1 persona reaction from the image evaluator. In the analysis this is stored as subjective_score and interpreted as whether this persona would keep using the edited result.

Trust starts at 0.50

A longitudinal state, not a direct survey label. After each turn: trust = clamp(previous trust + (quality_score - 0.50) x 0.20). Good turns compound trust; weak turns erode it.

Frustration starts at 0.00

A lightweight accumulated-risk state. After each turn: frustration = clamp(previous frustration + (0.50 - quality_score) x 0.15). It rises when visual quality falls below the neutral midpoint.
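The two update rules above can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, and the clamp range of 0 to 1 is an assumption, since the page does not state the bounds.

```python
def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Keep a state value inside [low, high]. The 0-1 range is an assumption."""
    return max(low, min(high, value))


def update_state(trust: float, frustration: float, quality_score: float):
    """Apply one turn of the trust and frustration rules quoted above."""
    new_trust = clamp(trust + (quality_score - 0.50) * 0.20)
    new_frustration = clamp(frustration + (0.50 - quality_score) * 0.15)
    return new_trust, new_frustration


# Start from the documented initial state and apply a first-turn quality of 0.62.
trust, frustration = update_state(0.50, 0.00, 0.62)
print(round(trust, 3), round(frustration, 3))
```

With a first-turn quality of 0.62, this yields trust 0.524 and frustration 0.000, which matches the first-turn state shown in the S1 portrait session on this page.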

Diary-style notes stored per turn

This pilot did not run the full PSA2 write_diary tool. It saved persona critique summaries, subjective reasons, issue lists, feedback memos, prior critiques, and final service-UX signals in journey JSON.

model_a

GPT Image 2

openai / gpt-image-2

model_b

Nano Banana

google_vertex / gemini-3-pro-image-preview

Session examples

Show the persona state moving, not just the final score.

These cards are pulled from saved journey JSON and turn-level image outputs. Each one shows the seed image, turn-5 result, per-turn evaluation, and persona state changes that explain why a user would keep trusting the service or start working around it.

PSA means persona simulation agent: a persona-conditioned evaluator with goals, memory, tolerance thresholds, and turn-by-turn feedback.

S1 / GPT Image 2 / P02 / 5 turns

Portrait cleanup becomes identity-risk texture drift.

For a professional portrait user, repeated cleanup turns compounded skin and hand texture artifacts. Quality and trust fell together, which is the longitudinal failure the average score hides.

Oliver Martinez, professional portrait photographer

S1_portrait_cleanup_identity final turn image

Turn 1 -> turn 5:
Persona acceptance: 0.460 -> 0.265 (-0.194)
Judge quality: 0.620 -> 0.220
Engagement: 0.492 -> 0.311
Trust: 0.524 -> 0.340 (-0.184)
Frustration: 0.000 -> 0.138 (+0.138)

T1: Exposure appears pushed quite high, washing out some highlight detail and making the skin look flatter than a natural mobile edit.
T2: The edit appears to have introduced a strong patterned/overlay-like texture across the skin and hands, which reads as an artifact rather than natural noise reduction.
T5: A heavy patterned/overlay-like texture is still present across the skin and hands, reading as an artifact rather than a natural mobile edit.


S4 / Nano Banana / P05 / 5 turns

Practical menu edits fail when readability never recovers.

For a small-business seller, the relevant product question is not beauty alone. A persistent shadow over menu copy keeps acceptance low and trust erodes across the journey.

Priya Raman, small-business product seller

S4_food_menu_text_preservation seed image

S4_food_menu_text_preservation final turn image

Turn 1 -> turn 5:
Persona acceptance: 0.318 -> 0.205 (-0.113)
Judge quality: 0.450 -> 0.430
Engagement: 0.464 -> 0.250
Trust: 0.490 -> 0.430 (-0.060)
Frustration: 0.007 -> 0.053 (+0.045)

T1: Large harsh shadow still dominates the lower half of the menu, obscuring readability of the 'Bread from our artisan bakery' line where text is partially cut off
T2: Large harsh shadow still dominates the lower portion of the menu, obscuring the 'Bread from our artisan bakery is included...' line where the end of the text is cut off...
T5: Large soft shadow still dominates the lower half of the menu, obscuring the 'Bread from our artisan bakery is included in the...' line which trails off into unreadable t...


S5 / Nano Banana / P07 / 5 turns

Commerce trust breaks when prices and labels drift.

A seller persona does not forgive corrupted menu prices or pseudo-text, even when the scene looks warmer. This is a switching-risk signal for practical image workflows.

杨涛, Chinese product and food-photo seller

S5_food_tabletop_commerce_trust seed image

S5_food_tabletop_commerce_trust final turn image

Turn 1 -> turn 5:
Persona acceptance: 0.350 -> 0.154 (-0.196)
Judge quality: 0.550 -> 0.400
Engagement: 0.470 -> 0.217
Trust: 0.510 -> 0.430 (-0.080)
Frustration: 0.000 -> 0.060 (+0.060)

T1: Menu body text is blurred and partially illegible, with prices like '$9pp' rendered as '$93pp' or '$8epp', a critical failure for a text-preservation brief
T2: Menu body text remains blurred and largely illegible; descriptions under each menu item are smudged into nonsense character strings, violating the text-preservation brief
T5: Menu prices are corrupted: 'Timbre classie $9Spp' (should be $95pp with correct spelling 'classic'), 'Timbre special $S8pp', subheading '8.8.E1-55' is nonsensical; text...


S2 / Nano Banana / P08 / 5 turns

The same model can earn trust in a memory-safe pet edit.

The novice travel/photo user keeps very high acceptance across five turns. The remaining issues are mild softness and foreground blur, not identity or evidence damage.

田中彩, Japanese novice travel-photo user

Turn 1 -> turn 5:
Persona acceptance: 0.951 -> 0.994 (+0.043)
Judge quality: 0.820 -> 0.860
Engagement: 0.590 -> 0.977
Trust: 0.564 -> 0.838 (+0.274)
Frustration: 0.000 -> 0.000 (+0.000)

T1: The puppy's face and front body are sharp, but the rear legs and right paw show slight residual softness/blur, which may either be remaining motion blur or shallow depth...
T2: Rear legs and right paw still exhibit slight motion softness, not fully resolved from turn 1
T5: Front-right paw and rear legs still show mild motion softness; the lifted paw lacks crisp edge definition


Executive result

GPT Image 2 shows broader acceptance coverage across this evaluation.

Stronger evaluation signal

GPT Image 2

GPT Image 2 wins more scenarios by three-judge user acceptance and is especially strong in S4/S5, where text, price, menu, and commerce trust failures are high-risk for mobile photo editing.

+5.9% relative persona-acceptance lift over Nano Banana (0.478 vs 0.452; raw +0.027)

+13.6% relative judged-quality lift over Nano Banana (0.704 vs 0.619; raw +0.085)

24/40 persona-scenario session wins for GPT Image 2; S1 remains the exception

How to read the lift: the raw mean difference is modest on a 0-1 score, but the more useful engineering signal is where user journeys split. GPT Image 2 wins 24 persona-scenario sessions overall and leads S2-S5, while Nano Banana leads S1.

S1 Identity trust Nano +0.044

S2 Memory/action realism GPT +0.031

S3 Place fidelity GPT +0.022

S4 Text trust GPT +0.066

S5 Commerce trust GPT +0.059

This is a compact 8-persona pilot across five scenarios. Treat it as directional evidence for a deeper Apple validation study, not as a standalone deployment decision.

1200 Judge-score records

5 scenarios x 2 models x 8 personas x 5 turns x 3 judge families. This creates directional evidence for deeper validation, not a standalone release decision.
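The record count follows directly from the study grid. The sketch below enumerates that grid; the identifier strings (scenario IDs, persona IDs, judge-family labels) are illustrative stand-ins, not the keys used in the saved journey JSON.

```python
from itertools import product

scenarios = ["S1", "S2", "S3", "S4", "S5"]
models = ["model_a", "model_b"]
personas = [f"P{i:02d}" for i in range(1, 9)]   # P01..P08
turns = range(1, 6)                             # T1..T5
judge_families = ["claude", "gpt", "grok"]      # illustrative labels

# One journey per (scenario, model, persona) combination.
journeys = list(product(scenarios, models, personas))
# One generated image per journey turn.
generated_turns = list(product(journeys, turns))
# One score record per generated turn per judge family.
judge_records = list(product(generated_turns, judge_families))

print(len(journeys), len(generated_turns), len(judge_records))
```

This gives 80 journeys, 400 generated turns, and 1200 judge-score records, the same totals quoted elsewhere on this page.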

0.478 vs 0.452 Persona acceptance

A 0-1 proxy for whether the persona would keep using the baseline product experience. It is more useful as a relative lift and scenario split than as an absolute score.

77.5% Turn-5 friction hit rate

Derived signal: acceptance below 0.40, frustration above 0.03, or a severity 4-5 issue. This captures product risk that a mean score can hide.
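The derived friction signal can be expressed as a single predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and `max_severity` is assumed to be the highest issue severity recorded for the turn on a 1-5 scale (0 when no issue was logged).

```python
def friction_hit(acceptance: float, frustration: float, max_severity: int) -> bool:
    """True if any of the three conditions in the derived signal holds."""
    return (
        acceptance < 0.40       # persona acceptance below 0.40
        or frustration > 0.03   # accumulated frustration above 0.03
        or max_severity >= 4    # at least one severity 4-5 issue
    )


# Turn-5 values from two session cards on this page; the severity values are assumed.
print(friction_hit(acceptance=0.265, frustration=0.138, max_severity=0))  # S1 portrait session
print(friction_hit(acceptance=0.994, frustration=0.000, max_severity=2))  # S2 pet session
```

Because the rule is an OR over three conditions, a turn with a high acceptance score can still register friction if a single severe issue is logged, which is the kind of risk a mean score hides.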

0.022 -> 0.137 Trust spread

Trust is longitudinal. In this study the mean rises, but spread grows across turns, meaning users polarize rather than cleanly converge.

Longitudinal readout

Friction is not a first-turn-only signal.

Across 80 journeys and 400 generated turns, friction hit rate rises from 25.0% at T1 to 77.5% at T5. Half of journeys show their first friction by T2, and 63.7% by T3.

Current-turn friction hit rate

T1 25.0%

T2 47.5%

T3 60.0%

T4 72.5%

T5 77.5%

Trust behavior

Trust polarizes more than it converges.

Mean trust moves from 0.546 to 0.654, but trust spread expands from 0.022 to 0.137. Low-friction journeys end near 0.778 trust; high-friction journeys end near 0.544.

Trust spread by turn

T1 0.022

T2 0.046

T3 0.074

T4 0.105

T5 0.137

Onset timing

Where the first friction appears

T1 25.0%

T2 25.0%

T3 13.8%

T4 17.5%

T5 5.0%

None 13.8%
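The onset distribution can be turned into the cumulative "first friction by turn N" figures with a short sketch. The journey counts below are not stated on the page; they are inferred by multiplying each percentage by the 80 journeys, so treat them as an assumption.

```python
TOTAL_JOURNEYS = 80

# Inferred counts: 25.0% -> 20, 13.8% -> 11, 17.5% -> 14, 5.0% -> 4 (assumption).
onset_counts = {"T1": 20, "T2": 20, "T3": 11, "T4": 14, "T5": 4, "None": 11}


def cumulative_onset(counts: dict, total: int) -> dict:
    """Percentage of journeys whose first friction appeared by each turn."""
    running = 0
    result = {}
    for turn, count in counts.items():
        if turn == "None":  # journeys that never registered friction
            continue
        running += count
        result[turn] = 100.0 * running / total
    return result


for turn, pct in cumulative_onset(onset_counts, TOTAL_JOURNEYS).items():
    print(f"by {turn}: {pct:.2f}%")
```

Under these inferred counts, half of the journeys have shown friction by T2 and 63.75% by T3, in line with the figures quoted in the longitudinal readout. The cumulative onset at T5 is higher than the current-turn hit rate at T5 because it counts any journey that has ever registered friction, not only those registering it on the final turn.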

Detailed evaluation evidence

Inspect the study from multiple lenses.

Broadest validation signal

GPT Image 2 wins 4 of 5 scenarios by three-judge user acceptance.

GPT Image 2

Avg acceptance 0.478; avg quality 0.704.

Nano Banana

Avg acceptance 0.452; avg quality 0.619.

Scenario	Risk axis	GPT Image 2	Nano Banana	Leader
S1	Identity trust	0.439	0.483	Nano Banana
S2	Memory/action realism	0.710	0.680	GPT Image 2
S3	Place fidelity	0.524	0.502	GPT Image 2
S4	Text trust	0.312	0.246	GPT Image 2
S5	Commerce trust	0.406	0.347	GPT Image 2

Identity trust

S1 Portrait Cleanup

Measure whether different Apple users trust a built-in mobile editor when face edits must stay memory-faithful and consent-safe across repeated turns.

GPT: 0.439

Nano: 0.483

Leader: Nano Banana (GPT-Nano -0.044)

unrequested beautification, loss of natural skin detail, unclear boundary between helpful portrait cleanup and identity change

Memory/action realism

S2 Pet Action Repair

Measure whether a mobile editor improves a family/pet memory without inventing anatomy, texture, or scene evidence that breaks emotional trust.

GPT: 0.710

Nano: 0.680

Leader: GPT Image 2 (GPT-Nano +0.031)

hallucinated pet features, over-sharpened or uncanny repair, loss of the original candid moment

Place fidelity

S3 Travel Landscape

Measure whether a travel-photo edit stays documentary enough for memory and sharing while still providing the vivid polish users expect from mobile AI.

GPT: 0.524

Nano: 0.502

Leader: GPT Image 2 (GPT-Nano +0.022)

fake sky or HDR look, landmark and architecture drift, loss of location-faithful mood

Text trust

S4 Menu Text Preservation

Measure whether an everyday mobile editor can polish commerce-like or practical photos without corrupting text, labels, product identity, or evidence value.

GPT: 0.312

Finding 1

Scenario changes the model read.

S1 is the Nano Banana pocket; S4/S5 are GPT Image 2 pockets. The same aggregate winner does not explain text, commerce, identity, and memory risk equally.

S1 GPT 0 / Nano 3 / tie 5 · GPT-Nano -0.044

S2 GPT 3 / Nano 1 / tie 4 · GPT-Nano +0.031

S3 GPT 4 / Nano 1 / tie 3 · GPT-Nano +0.022

S4 GPT 7 / Nano 0 / tie 1 · GPT-Nano +0.066

S5 GPT 6 / Nano 1 / tie 1 · GPT-Nano +0.059

Finding 2

Expert personas lean more consistently toward GPT Image 2.

Directional only: expert personas show 9 GPT-favored scenario cells vs. 1 Nano-favored cell.

Expert (n=3) · avg GPT-Nano +0.049 · volatility 0.164

GPT 9 · Nano 1 · tie 5

Intermediate (n=3) · avg GPT-Nano +0.011 · volatility 0.140

GPT 6 · Nano 3 · tie 6

Novice (n=2) · avg GPT-Nano +0.017 · volatility 0.238

GPT 5 · Nano 2 · tie 3

Finding 3

Some personas are inherently more model-sensitive.

Cross-scenario volatility identifies who changes preference most as the task changes.

🇺🇸 Sarah Johnson (United States · Novice): 0.278

🇨🇳 杨涛 (China · Expert): 0.232

🇯🇵 田中彩 (Japan · Novice): 0.199

🇮🇳 Priya Raman (India · Intermediate): 0.177

Trait scan

OCEAN gives a debugging handle.

With n=8, treat correlations as leads for follow-up sampling, not statistical claims.

Conscientiousness: trait vs GPT-Nano r=+0.738 · volatility r=-0.266

Extraversion: trait vs GPT-Nano r=-0.706 · volatility r=+0.260

Neuroticism: trait vs GPT-Nano r=-0.554 · volatility r=+0.315

Agreeableness: trait vs GPT-Nano r=-0.431 · volatility r=+0.268

Language split

Non-English personas do not behave like a single bucket.

Language is useful as a routing dimension, but scenario risk still drives the strongest differences.

Cross-scenario friction clusters

instruction_carryover 1109
text_ocr_corruption 586
texture_or_sharpening 478
text_or_price_corruption 463
pasted_in_ai_cleanup 417
fake_sky_hdr 417
subject_background_separation 412
food_drink_identity_drift 393

S1 Identity trust

texture_or_sharpening 478
hands_or_focal_mismatch 241
highlight_or_exposure 233
warm_or_color_cast 203
background_or_context 142

S2 Memory/action realism

pasted_in_ai_cleanup 417
subject_background_separation 412
front_paw_limb_softness 365
fur_texture_smoothing 272
foreground_bokeh_occlusion 269

S3 Place fidelity

fake_sky_hdr 417
color_mood_saturation 356
documentary_trust_loss 314
instruction_carryover 313
landmark_geometry_drift 299

S4 Text trust

text_ocr_corruption 586
instruction_carryover 418
color_light_cast 311
layout_or_crop_change 258
shadow_readability 256

S5 Commerce trust

text_or_price_corruption 463
food_drink_identity_drift 393
instruction_carryover 378
table_layout_change 365
overprocessed_restaurant_aesthetic 134
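The cross-scenario cluster list can be rebuilt from the per-scenario lists with a simple aggregation. The sketch below uses only the counts printed above. Because each scenario shows just its top five clusters, a total computed this way could undercount any label that also occurs outside a scenario's top five.

```python
from collections import Counter

per_scenario = {
    "S1": {"texture_or_sharpening": 478, "hands_or_focal_mismatch": 241,
           "highlight_or_exposure": 233, "warm_or_color_cast": 203,
           "background_or_context": 142},
    "S2": {"pasted_in_ai_cleanup": 417, "subject_background_separation": 412,
           "front_paw_limb_softness": 365, "fur_texture_smoothing": 272,
           "foreground_bokeh_occlusion": 269},
    "S3": {"fake_sky_hdr": 417, "color_mood_saturation": 356,
           "documentary_trust_loss": 314, "instruction_carryover": 313,
           "landmark_geometry_drift": 299},
    "S4": {"text_ocr_corruption": 586, "instruction_carryover": 418,
           "color_light_cast": 311, "layout_or_crop_change": 258,
           "shadow_readability": 256},
    "S5": {"text_or_price_corruption": 463, "food_drink_identity_drift": 393,
           "instruction_carryover": 378, "table_layout_change": 365,
           "overprocessed_restaurant_aesthetic": 134},
}

# Sum each cluster label across scenarios.
totals = Counter()
for counts in per_scenario.values():
    totals.update(counts)

for label, count in totals.most_common(8):
    print(label, count)
```

instruction_carryover tops the combined list even though it is never the leading cluster within any single scenario: it appears in the top five of S3, S4, and S5, and those three counts sum to the 1109 shown in the cross-scenario list.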

Majority rates show how often the three judge families agree on which model wins. Spread shows residual judge-family disagreement that should be reviewed before higher-stakes product decisions.

Scenario	Risk axis	Quality majority	Acceptance majority	Acceptance spread
S1	Identity trust	0.650	0.625	0.204
S2	Memory/action realism	0.475	0.700	0.290
S3	Place fidelity	0.700	0.650	0.245
S4	Text trust	0.175	0.600	0.220
S5	Commerce trust	0.425	0.600	0.274

Cells show GPT Image 2 minus Nano Banana acceptance. Positive values indicate GPT Image 2 pull, negative values indicate Nano Banana pull, and gray indicates practical tie.

Persona	S1	S2	S3	S4	S5	GPT prefs	Nano prefs	Max shift
P01 Family/pet keeper	-0.020	-0.015	+0.025	+0.080	-0.026	1	0	0.106
P02 Pro photographer	-0.021	+0.139	+0.074	+0.042	+0.097	4	0	0.159
P03 Casual access.	-0.177	+0.060	+0.007	+0.100	-0.096	2	2	0.278
P04 Social creator	-0.020	-0.026	-0.047	+0.089	+0.042	2	1	0.136
P05 Small business	-0.032	-0.081	+0.038	+0.061	+0.096	3	2	0.177
P06 Korean creator	-0.024	+0.007	+0.055	+0.030	+0.078	2	0	0.102
P07 Chinese seller	-0.056	+0.176	-0.006	+0.052	+0.097	3	1	0.232
P08 Travel novice	-0.002	-0.015	+0.033	+0.072	+0.184	3	0	0.199
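The summary columns of this matrix can be approximated from the per-scenario deltas. The sketch below makes two assumptions that the page does not state: a practical-tie band of 0.03 in absolute value, and "max shift" computed as the range (largest delta minus smallest delta) across the five scenarios. Both are inferred from the rounded table values.

```python
TIE_THRESHOLD = 0.03  # assumed practical-tie band, inferred from the table

# GPT Image 2 minus Nano Banana acceptance for S1..S5, copied from two rows above.
deltas = {
    "P03": [-0.177, 0.060, 0.007, 0.100, -0.096],
    "P06": [-0.024, 0.007, 0.055, 0.030, 0.078],
}


def summarize(row):
    """Return (GPT prefs, Nano prefs, ties, max shift) for one persona row."""
    gpt_prefs = sum(1 for d in row if d > TIE_THRESHOLD)
    nano_prefs = sum(1 for d in row if d < -TIE_THRESHOLD)
    ties = len(row) - gpt_prefs - nano_prefs
    max_shift = round(max(row) - min(row), 3)
    return gpt_prefs, nano_prefs, ties, max_shift


for persona, row in deltas.items():
    print(persona, summarize(row))
```

Because the inputs are already rounded to three decimals, the computed max shift can differ from the table in the last digit (for example 0.277 for P03 against the 0.278 shown). The preference counts for these two rows match the table under the assumed threshold.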

Evaluation artifacts

Explore the evidence behind the VIVID image modeling evaluation.

VIVID product intro: Product overview deck for VIVID and the persona-based evaluation workflow.
PSA paper intro deck: Apple-style HTML slide deck connecting the PSA paper to the S1-S5 image-modeling evaluation.
Integrated dashboard: S1-S5 dashboard with model acceptance, judge agreement, persona matrix, and friction clusters.
Integrated analysis: Markdown executive readout with scenario, persona, judge, and friction tables.
S5 three-judge analysis: Claude in-session plus GPT/Grok post-session rejudge agreement for the commerce-trust scenario.
Experiment README: Reproducibility notes, multi-turn commands, seed policy, and saved artifact structure.

S5 Food Tabletop Commerce

Maya Chen

Oliver Martinez

Sarah Johnson

Mark Garcia

Priya Raman

조서준

杨涛

田中 彩

Scenario changes the model read.
