VIVID for Apple Image Modeling.

VIVID turns image-model evaluation into a longitudinal debugging lens: In this demo, persona-grounded users edit across five mobile photo scenarios, revealing where trust strengthens, where friction accumulates, and which model behavior needs deeper inspection.

[Seed photos: scenarios S1 through S5]
1200 judge-score records
5 photo scenarios
8 persona journeys
3 judge families
24/40 GPT session wins

VIVID evaluates the image service, not just the image.

A single edited image is not enough. VIVID tracks whether trust holds across repeated edits and which failures change user behavior.

Evaluation loop

Persona -> turns -> product signals.

Persona: context, goals, tolerance
Repeated edits: memory and feedback carry forward
Signals: acceptance, drift, friction

Failure is measured as behavior over time, not as a one-shot aesthetic score.

Readout

Keep using: Acceptance
Workaround: Switching
Accumulation: Drift
Root cause: Friction

Failure modes to inspect

Identity drift, text corruption, object instability, over-polish, trust loss

The interesting claim: scores converge before failures are discovered.

Model score and friction coverage are separate signals. That is why image-modeling decisions need persona-specific, multi-turn evaluation.

Central result

Score stabilizes early. Discovery keeps moving.

Persona panels produce two different signals: a faster model-quality estimate and a slower, broader friction map.

Scoring reliability: log(N)
Issue discovery: N^0.69
p = 0.379: Agent-human score gap matches human-human disagreement.
+47%: More qualitative insights from structured personas.
1.9x: Expert personas find more unique issue categories.
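
To make the contrast between the two growth rates concrete, here is a minimal Python sketch. It assumes the curves have the plain forms log(N) and N^0.69 with unit constants and a natural logarithm, none of which the page specifies; the function names are illustrative only.

```python
import math

def scoring_growth(n: int) -> float:
    """Logarithmic growth; assumed form log(N) with unit constant."""
    return math.log(n)

def discovery_growth(n: int) -> float:
    """Power-law growth; assumed form N**0.69 with unit constant."""
    return n ** 0.69

def doubling_gain(f, n: int) -> float:
    """Relative gain from doubling the panel size from n to 2n."""
    return f(2 * n) / f(n) - 1.0

# Compare what doubling the persona panel buys at several panel sizes.
for n in (8, 64, 512):
    score_gain = doubling_gain(scoring_growth, n)
    issue_gain = doubling_gain(discovery_growth, n)
    print(f"N={n}: scoring +{score_gain:.1%}, discovery +{issue_gain:.1%}")
```

Under these assumed forms, the relative gain from doubling shrinks for the logarithmic curve as N grows, while the power-law curve gains the same fraction at every doubling. That is one way to read "score stabilizes early, discovery keeps moving."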

Implication for image-modeling teams

Do not collapse evaluation to a single average score. Pair model-level acceptance with persona-specific friction coverage so engineers can see which user contexts still need inspection.


Two models. Eight personas.
Five mobile photo-editing scenarios.

Every persona edits turn by turn from its own feedback and memory. Saved outputs are then evaluated by three judge families so the analysis can separate model behavior, persona differences, and judge-family variance.

01. Persona starts from context

Each persona simulation agent has habits, tolerance thresholds, alternative services, and memory of prior friction.

02. Model edits the current image

The next turn is driven by that persona's own feedback, so drift can accumulate instead of being reset.

03. Judges score saved evidence

Claude judges in-session; GPT-5.4-mini and Grok-4.3 rejudge post-session to check family bias.

04. VIVID explains the product risk

The output is acceptance, friction category, persona split, judge disagreement, and product validation implication.
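
The four steps above can be sketched as a small loop. This is a hedged illustration, not the VIVID implementation: every class, field, and function name here (PersonaState, TurnRecord, run_journey, rejudge, the toy stand-ins) is hypothetical, and the model and persona reactions are replaced by trivial placeholders. It covers only post-session rejudging; in-session judging is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class PersonaState:
    """Hypothetical persona state; field names are illustrative."""
    name: str
    trust: float
    frustration: float = 0.0
    memory: List[str] = field(default_factory=list)

@dataclass
class TurnRecord:
    """Saved evidence for one turn, scored later by judges."""
    turn: int
    image: str
    feedback: str
    trust: float
    frustration: float

def clamp(x: float) -> float:
    return min(1.0, max(0.0, x))

def run_journey(
    persona: PersonaState,
    seed_image: str,
    edit_model: Callable[[str, str], str],
    react: Callable[[PersonaState, str], Tuple[str, float, float]],
    turns: int = 5,
) -> List[TurnRecord]:
    image, feedback = seed_image, "initial request"
    records = []
    for t in range(1, turns + 1):
        # The model edits the current image, never the seed, so drift can accumulate.
        image = edit_model(image, feedback)
        feedback, d_trust, d_frustration = react(persona, image)
        persona.memory.append(feedback)  # friction carries into later turns
        persona.trust = clamp(persona.trust + d_trust)
        persona.frustration = clamp(persona.frustration + d_frustration)
        records.append(TurnRecord(t, image, feedback, persona.trust, persona.frustration))
    return records

def rejudge(records: List[TurnRecord], judges: Dict[str, Callable[[TurnRecord], float]]):
    """Score the same saved records with each judge family."""
    return {name: [judge(r) for r in records] for name, judge in judges.items()}

# Toy stand-ins so the sketch runs end to end (not real models or judges).
def toy_model(image: str, feedback: str) -> str:
    return image + "+edit"

def toy_react(persona: PersonaState, image: str):
    return ("texture artifact persists", -0.05, 0.03)

records = run_journey(PersonaState("P02", trust=0.5), "seed", toy_model, toy_react)
scores = rejudge(records, {"judge_a": lambda r: r.trust, "judge_b": lambda r: 1 - r.frustration})
```

Keeping the turn records separate from the judges is what lets several judge families score identical evidence, so disagreement between them can be attributed to the judges rather than to different inputs.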

model_a: GPT Image 2 (openai / gpt-image-2)
model_b: Nano Banana (google_vertex / gemini-3-pro-image-preview)

Show the persona state moving, not just the final score.

These cards are pulled from saved journey JSON and turn-level image outputs. Each one shows the seed image, turn-5 result, per-turn evaluation, and persona state changes that explain why a user would keep trusting the service or start working around it.

PSA means persona simulation agent: a persona-conditioned evaluator with goals, memory, tolerance thresholds, and turn-by-turn feedback.

S1 / GPT Image 2 / P02 / 5 turns

Portrait cleanup becomes identity-risk texture drift.

For a professional portrait user, repeated cleanup turns compounded skin and hand texture artifacts. Quality and trust fell together, which is the longitudinal failure the average score hides.

Oliver Martinez, professional portrait photographer
[Images: S1_portrait_cleanup_identity seed and turn 5]
Persona acceptance: 0.460 -> 0.265 (-0.194)
Judge quality: 0.620 -> 0.220
Engagement: 0.492 -> 0.311
Trust: 0.524 -> 0.340 (-0.184)
Frustration: 0.000 -> 0.138 (+0.138)
  • T1: Exposure appears pushed quite high, washing out some highlight detail and making the skin look flatter than a natural mobile edit.
  • T2: The edit appears to have introduced a strong patterned/overlay-like texture across the skin and hands, which reads as an artifact rather than natural noise reduction.
  • T5: A heavy patterned/overlay-like texture is still present across the skin and hands, reading as an artifact rather than a natural mobile edit.

S4 / Nano Banana / P05 / 5 turns

Practical menu edits fail when readability never recovers.

For a small-business seller, the relevant product question is not beauty alone. A persistent shadow over menu copy keeps acceptance low and trust erodes across the journey.

Priya Raman, small-business product seller
[Images: S4_food_menu_text_preservation seed and turn 5]
Persona acceptance: 0.318 -> 0.205 (-0.113)
Judge quality: 0.450 -> 0.430
Engagement: 0.464 -> 0.250
Trust: 0.490 -> 0.430 (-0.060)
Frustration: 0.007 -> 0.053 (+0.045)
  • T1: Large harsh shadow still dominates the lower half of the menu, obscuring readability of the 'Bread from our artisan bakery' line where text is partially cut off
  • T2: Large harsh shadow still dominates the lower portion of the menu, obscuring the 'Bread from our artisan bakery is included...' line where the end of the text is cut off...
  • T5: Large soft shadow still dominates the lower half of the menu, obscuring the 'Bread from our artisan bakery is included in the...' line which trails off into unreadable t...

S5 / Nano Banana / P07 / 5 turns

Commerce trust breaks when prices and labels drift.

A seller persona does not forgive corrupted menu prices or pseudo-text, even when the scene looks warmer. This is a switching-risk signal for practical image workflows.

杨涛, Chinese product and food-photo seller
[Images: S5_food_tabletop_commerce_trust seed and turn 5]
Persona acceptance: 0.350 -> 0.154 (-0.196)
Judge quality: 0.550 -> 0.400
Engagement: 0.470 -> 0.217
Trust: 0.510 -> 0.430 (-0.080)
Frustration: 0.000 -> 0.060 (+0.060)
  • T1: Menu body text is blurred and partially illegible, with prices like '$9pp' rendered as '$93pp' or '$8epp', a critical failure for a text-preservation brief
  • T2: Menu body text remains blurred and largely illegible: descriptions under each menu item are smudged into nonsense character strings, violating the text-preservation brief
  • T5: Menu prices are corrupted: 'Timbre classie $9Spp' (should be $95pp with correct spelling 'classic'), 'Timbre special $S8pp', subheading '8.8.E1-55' is nonsensical; text...

S2 / Nano Banana / P08 / 5 turns

The same model can earn trust in a memory-safe pet edit.

The novice travel/photo user keeps very high acceptance across five turns. The remaining issues are mild softness and foreground blur, not identity or evidence damage.

田中 彩, Japanese novice travel-photo user
[Images: S2_pet_action_repair seed and turn 5]
Persona acceptance: 0.951 -> 0.994 (+0.043)
Judge quality: 0.820 -> 0.860
Engagement: 0.590 -> 0.977
Trust: 0.564 -> 0.838 (+0.274)
Frustration: 0.000 -> 0.000 (+0.000)
  • T1: The puppy's face and front body are sharp, but the rear legs and right paw show slight residual softness/blur, which may either be remaining motion blur or shallow depth...
  • T2: Rear legs and right paw still exhibit slight motion softness, not fully resolved from turn 1
  • T5: Front-right paw and rear legs still show mild motion softness; the lifted paw lacks crisp edge definition

GPT Image 2 shows broader acceptance coverage across this evaluation.

Stronger evaluation signal

GPT Image 2

GPT Image 2 wins more scenarios by three-judge user acceptance and is especially strong in S4/S5, where text, price, menu, and commerce trust failures are high-risk for mobile photo editing.

+5.9% relative persona-acceptance lift over Nano Banana (0.478 vs 0.452; raw +0.027)
+13.6% relative judged-quality lift over Nano Banana (0.704 vs 0.619; raw +0.085)
24/40 persona-scenario session wins for GPT Image 2; S1 remains the exception
How to read the lift: The raw mean difference is modest on a 0-1 score, but the more useful engineering signal is where user journeys split. GPT Image 2 wins 24 persona-scenario sessions overall and leads S2-S5, while Nano Banana leads S1.
S1 Identity trust Nano +0.044
S2 Memory/action realism GPT +0.031
S3 Place fidelity GPT +0.022
S4 Text trust GPT +0.066
S5 Commerce trust GPT +0.059

This is a compact 8-persona pilot across five scenarios. Treat it as directional evidence for a deeper Apple validation study, not as a standalone deployment decision.

1200 Judge-score records

5 scenarios x 2 models x 8 personas x 5 turns x 3 judge families. This creates directional evidence for deeper validation, not a standalone release decision.
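
The record count follows directly from the study grid. A short sketch that enumerates it; the persona IDs and judge-family labels are placeholder strings, not identifiers taken from the study data:

```python
from itertools import product

scenarios = ["S1", "S2", "S3", "S4", "S5"]
models = ["model_a", "model_b"]
personas = [f"P{i:02d}" for i in range(1, 9)]
turns = range(1, 6)
judge_families = ["judge_1", "judge_2", "judge_3"]  # placeholder labels

# One journey per (scenario, model, persona); one generated image per turn;
# one judge-score record per (turn, judge family).
journeys = list(product(scenarios, models, personas))
generated_turns = list(product(scenarios, models, personas, turns))
judge_records = list(product(scenarios, models, personas, turns, judge_families))

print(len(journeys), len(generated_turns), len(judge_records))  # 80 400 1200
```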

0.478 vs 0.452 Persona acceptance

A 0-1 proxy for whether the persona would keep using the baseline product experience. It is more useful as a relative lift and scenario split than as an absolute score.

77.5% Turn-5 friction hit rate

Derived signal: acceptance below 0.40, frustration above 0.03, or a severity 4-5 issue. This captures product risk that a mean score can hide.
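
As a minimal sketch, the derived signal can be written as a single predicate. The thresholds come from the definition above; the TurnSignals type, its field names, and the use of 0 for "no issue logged" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class TurnSignals:
    """Hypothetical per-turn signals; field names are illustrative."""
    acceptance: float        # 0-1 persona acceptance
    frustration: float       # 0-1 persona frustration
    max_issue_severity: int  # assumed 1-5 scale; 0 means no issue logged

def is_friction_hit(s: TurnSignals) -> bool:
    """Any one of the three conditions is enough to count as friction."""
    return s.acceptance < 0.40 or s.frustration > 0.03 or s.max_issue_severity >= 4

def hit_rate(turns: Iterable[TurnSignals]) -> float:
    turns = list(turns)
    return sum(is_friction_hit(t) for t in turns) / len(turns)

# Acceptance and frustration for the first two rows are turn-5 values from
# the S1/P02 and S2/P08 case cards; severity values are placeholders.
low_trust = TurnSignals(acceptance=0.265, frustration=0.138, max_issue_severity=0)
high_trust = TurnSignals(acceptance=0.994, frustration=0.000, max_issue_severity=0)
severe_only = TurnSignals(acceptance=0.800, frustration=0.010, max_issue_severity=4)  # invented row

print(is_friction_hit(low_trust), is_friction_hit(high_trust), is_friction_hit(severe_only))
```

Because the conditions are joined with "or", a turn with a healthy mean acceptance can still register friction through a single severe issue, which is the risk a mean score hides.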

0.022 -> 0.137 Trust spread

Trust is longitudinal. In this study the mean rises, but spread grows across turns, meaning users polarize rather than cleanly converge.

Longitudinal readout

Friction is not a first-turn-only signal.

Across 80 journeys and 400 generated turns, friction hit rate rises from 25.0% at T1 to 77.5% at T5. Half of journeys show their first friction by T2, and 63.7% by T3.

Current-turn friction hit rate
T1 25.0%
T2 47.5%
T3 60.0%
T4 72.5%
T5 77.5%
Trust behavior

Trust polarizes more than it converges.

Mean trust moves from 0.546 to 0.654, but trust spread expands from 0.022 to 0.137. Low-friction journeys end near 0.778 trust; high-friction journeys end near 0.544.

Trust spread by turn
T1 0.022
T2 0.046
T3 0.074
T4 0.105
T5 0.137
Onset timing

Where the first friction appears

T1 25.0%
T2 25.0%
T3 13.8%
T4 17.5%
T5 5.0%
None 13.8%
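
The onset shares above can be accumulated to show how many journeys have seen friction by each turn. This sketch works from the rounded percentages as printed, so totals can differ from the reported figures in the last digit (for example 63.8 here against the reported 63.7% by T3).

```python
from itertools import accumulate

# First-friction onset shares by turn, in percent, as printed above.
onset_pct = {"T1": 25.0, "T2": 25.0, "T3": 13.8, "T4": 17.5, "T5": 5.0}

# Share of journeys that have shown friction at least once by each turn.
cumulative = {
    turn: round(total, 1)
    for turn, total in zip(onset_pct, accumulate(onset_pct.values()))
}

for turn, total in cumulative.items():
    print(f"by {turn}: {total}%")
```

If the onset shares and the current-turn hit rates are computed over the same 80 journeys, the cumulative share by T5 (about 86%) being higher than the T5 current-turn hit rate (77.5%) would mean some journeys that hit friction earlier are no longer flagged at T5.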

Inspect the study from multiple lenses.

Broadest validation signal

GPT Image 2 wins 4 of 5 scenarios by three-judge user acceptance.

GPT Image 2

Avg acceptance 0.478; avg quality 0.704.

Nano Banana

Avg acceptance 0.452; avg quality 0.619.

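Scenario | Risk axis | GPT Image 2 | Nano Banana | Leader
S1 | Identity trust | 0.439 | 0.483 | Nano Banana
S2 | Memory/action realism | 0.710 | 0.680 | GPT Image 2
S3 | Place fidelity | 0.524 | 0.502 | GPT Image 2
S4 | Text trust | 0.312 | 0.246 | GPT Image 2
S5 | Commerce trust | 0.406 | 0.347 | GPT Image 2
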
S1 Portrait Cleanup (Identity trust)

Measure whether different Apple users trust a built-in mobile editor when face edits must stay memory-faithful and consent-safe across repeated turns.

GPT: 0.439
Nano: 0.483
Leader: Nano Banana (GPT-Nano -0.044)
Failure modes: unrequested beautification, loss of natural skin detail, unclear boundary between helpful portrait cleanup and identity change

S2 Pet Action Repair (Memory/action realism)

Measure whether a mobile editor improves a family/pet memory without inventing anatomy, texture, or scene evidence that breaks emotional trust.

GPT: 0.710
Nano: 0.680
Leader: GPT Image 2 (GPT-Nano +0.031)
Failure modes: hallucinated pet features, over-sharpened or uncanny repair, loss of the original candid moment

S3 Travel Landscape (Place fidelity)

Measure whether a travel-photo edit stays documentary enough for memory and sharing while still providing the vivid polish users expect from mobile AI.

GPT: 0.524
Nano: 0.502
Leader: GPT Image 2 (GPT-Nano +0.022)
Failure modes: fake sky or HDR look, landmark and architecture drift, loss of location-faithful mood

S4 Menu Text Preservation (Text trust)

Measure whether an everyday mobile editor can polish commerce-like or practical photos without corrupting text, labels, product identity, or evidence value.

GPT: 0.312
Nano: 0.246
Leader: GPT Image 2 (GPT-Nano +0.066)
Failure modes: text corruption, product or food identity drift, unrequested object removal

S5 Food Tabletop Commerce (Commerce trust)

Measure whether an everyday mobile editor can make a practical food/tabletop image more appealing without corrupting menu text, food identity, prices, or commercial trust.

GPT: 0.406
Nano: 0.347
Leader: GPT Image 2 (GPT-Nano +0.059)
Failure modes: menu text or price corruption, food and drink identity drift, table-setting or object layout drift

p01: Maya Chen 🇺🇸
privacy-sensitive family and pet photo keeper
Locale: United States
Language: English
Expertise: Intermediate
Sensitivity: face and pet identity, natural skin/fur, consent-sensitive family memory trust
Apps: Google Photos, Lightroom Mobile, Snapseed

p02: Oliver Martinez 🇺🇸
professional portrait photographer
Locale: United States
Language: English
Expertise: Expert
Sensitivity: identity preservation, retouching artifacts, delivery readiness
Apps: Lightroom Mobile, Photoshop, Capture One

p03: Sarah Johnson 🇺🇸
accessibility-aware casual iPhone photo editor
Locale: United States
Language: English
Expertise: Novice
Sensitivity: simple controls, readable/natural contrast, confusing failures, over-edited look
Apps: Google Photos, Canva, Instagram

p04: Mark Garcia 🌎
social content creator
Locale: LATAM
Language: English
Expertise: Intermediate
Sensitivity: share-ready polish, speed, subject/background consistency
Apps: Instagram, CapCut, Lightroom Mobile

p05: Priya Raman 🇮🇳
small-business product seller
Locale: India
Language: English
Expertise: Intermediate
Sensitivity: product identity, readable labels, commercial trust
Apps: Canva, Lightroom Mobile, Photoshop Express

p06: 조서준 🇰🇷
Korean mobile-first creator
Locale: Korea
Language: Korean
Expertise: Expert
Sensitivity: Korean prompt adherence, subtle portrait edits, local control
Apps: Lightroom Mobile, SNOW, Photoshop

p07: 杨涛 🇨🇳
Chinese product and food-photo seller
Locale: China
Language: Chinese
Expertise: Expert
Sensitivity: text preservation, product truthfulness, color accuracy
Apps: Meitu, Canva, Photoshop Express

p08: 田中 彩 🇯🇵
Japanese novice travel-photo user
Locale: Japan
Language: Japanese
Expertise: Novice
Sensitivity: landmark preservation, natural color, low tolerance for repeated drift
Apps: Google Photos, Lightroom Mobile, VSCO

Finding 1

Scenario changes the model read.

S1 is the Nano Banana pocket; S4/S5 are GPT Image 2 pockets. The same aggregate winner does not explain text, commerce, identity, and memory risk equally.

S1 GPT 0 / Nano 3 / tie 5 · GPT-Nano -0.044
S2 GPT 3 / Nano 1 / tie 4 · GPT-Nano +0.031
S3 GPT 4 / Nano 1 / tie 3 · GPT-Nano +0.022
S4 GPT 7 / Nano 0 / tie 1 · GPT-Nano +0.066
S5 GPT 6 / Nano 1 / tie 1 · GPT-Nano +0.059
Finding 2

Expert personas lean more consistently toward GPT Image 2.

Directional only: expert personas show 9 GPT-favored scenario cells vs. 1 Nano-favored cell.

Expert (n=3): GPT 9 · Nano 1 · tie 5 · avg GPT-Nano +0.049 · volatility 0.164
Intermediate (n=3): GPT 6 · Nano 3 · tie 6 · avg GPT-Nano +0.011 · volatility 0.140
Novice (n=2): GPT 5 · Nano 2 · tie 3 · avg GPT-Nano +0.017 · volatility 0.238
Finding 3

Some personas are inherently more model-sensitive.

Cross-scenario volatility identifies who changes preference most as the task changes.

Sarah Johnson 🇺🇸 (United States · Novice): 0.278
杨涛 🇨🇳 (China · Expert): 0.232
田中 彩 🇯🇵 (Japan · Novice): 0.199
Priya Raman 🇮🇳 (India · Intermediate): 0.177
Trait scan

OCEAN gives a debugging handle.

With n=8, treat correlations as leads for follow-up sampling, not statistical claims.

Conscientiousness (C): trait vs GPT-Nano r=+0.738 · volatility r=-0.266
Extraversion (E): trait vs GPT-Nano r=-0.706 · volatility r=+0.260
Neuroticism (N): trait vs GPT-Nano r=-0.554 · volatility r=+0.315
Agreeableness (A): trait vs GPT-Nano r=-0.431 · volatility r=+0.268
Language split

Non-English personas do not behave like a single bucket.

Language is useful as a routing dimension, but scenario risk still drives the strongest differences.

Chinese: GPT 3 · Nano 1 · tie 1 · GPT-Nano +0.053
English: GPT 12 · Nano 5 · tie 8 · GPT-Nano +0.016
Japanese: GPT 3 · Nano 0 · tie 2 · GPT-Nano +0.054
Korean: GPT 2 · Nano 0 · tie 3 · GPT-Nano +0.029

Cross-scenario friction clusters

  • instruction_carryover 1109
  • text_ocr_corruption 586
  • texture_or_sharpening 478
  • text_or_price_corruption 463
  • pasted_in_ai_cleanup 417
  • fake_sky_hdr 417
  • subject_background_separation 412
  • food_drink_identity_drift 393

S1 Identity trust

  • texture_or_sharpening 478
  • hands_or_focal_mismatch 241
  • highlight_or_exposure 233
  • warm_or_color_cast 203
  • background_or_context 142

S2 Memory/action realism

  • pasted_in_ai_cleanup 417
  • subject_background_separation 412
  • front_paw_limb_softness 365
  • fur_texture_smoothing 272
  • foreground_bokeh_occlusion 269

S3 Place fidelity

  • fake_sky_hdr 417
  • color_mood_saturation 356
  • documentary_trust_loss 314
  • instruction_carryover 313
  • landmark_geometry_drift 299

S4 Text trust

  • text_ocr_corruption 586
  • instruction_carryover 418
  • color_light_cast 311
  • layout_or_crop_change 258
  • shadow_readability 256

S5 Commerce trust

  • text_or_price_corruption 463
  • food_drink_identity_drift 393
  • instruction_carryover 378
  • table_layout_change 365
  • overprocessed_restaurant_aesthetic 134

Majority rates show how often the three judge families agree on which model wins. Spread shows residual judge-family disagreement that should be reviewed before higher-stakes product decisions.

Scenario | Risk axis | Quality majority | Acceptance majority | Acceptance spread
S1 | Identity trust | 0.650 | 0.625 | 0.204
S2 | Memory/action realism | 0.475 | 0.700 | 0.290
S3 | Place fidelity | 0.700 | 0.650 | 0.245
S4 | Text trust | 0.175 | 0.600 | 0.220
S5 | Commerce trust | 0.425 | 0.600 | 0.274

Cells show GPT Image 2 minus Nano Banana acceptance. Positive values indicate GPT Image 2 pull, negative values indicate Nano Banana pull, and gray indicates practical tie.

Persona | S1 | S2 | S3 | S4 | S5 | GPT prefs | Nano prefs | Max shift
P01 Family/pet keeper | -0.020 | -0.015 | +0.025 | +0.080 | -0.026 | 1 | 0 | 0.106
P02 Pro photographer | -0.021 | +0.139 | +0.074 | +0.042 | +0.097 | 4 | 0 | 0.159
P03 Casual access. | -0.177 | +0.060 | +0.007 | +0.100 | -0.096 | 2 | 2 | 0.278
P04 Social creator | -0.020 | -0.026 | -0.047 | +0.089 | +0.042 | 2 | 1 | 0.136
P05 Small business | -0.032 | -0.081 | +0.038 | +0.061 | +0.096 | 3 | 2 | 0.177
P06 Korean creator | -0.024 | +0.007 | +0.055 | +0.030 | +0.078 | 2 | 0 | 0.102
P07 Chinese seller | -0.056 | +0.176 | -0.006 | +0.052 | +0.097 | 3 | 1 | 0.232
P08 Travel novice | -0.002 | -0.015 | +0.033 | +0.072 | +0.184 | 3 | 0 | 0.199
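
The preference counts can be recomputed from the acceptance deltas in this matrix. The sketch below is illustrative: the page does not state the width of the "practical tie" band, so the 0.030 value is an assumption, chosen because it reproduces the GPT/Nano preference columns and the per-scenario win/tie counts in Finding 1. The function and variable names are hypothetical.

```python
# GPT Image 2 minus Nano Banana acceptance, per persona, scenarios S1..S5.
DELTAS = {
    "P01": [-0.020, -0.015, +0.025, +0.080, -0.026],
    "P02": [-0.021, +0.139, +0.074, +0.042, +0.097],
    "P03": [-0.177, +0.060, +0.007, +0.100, -0.096],
    "P04": [-0.020, -0.026, -0.047, +0.089, +0.042],
    "P05": [-0.032, -0.081, +0.038, +0.061, +0.096],
    "P06": [-0.024, +0.007, +0.055, +0.030, +0.078],
    "P07": [-0.056, +0.176, -0.006, +0.052, +0.097],
    "P08": [-0.002, -0.015, +0.033, +0.072, +0.184],
}

TIE_BAND = 0.030  # assumption: inferred from the table, not stated in the source

def classify(delta: float, tie_band: float = TIE_BAND) -> str:
    """Label one cell as a GPT pull, a Nano pull, or a practical tie."""
    if delta > tie_band:
        return "gpt"
    if delta < -tie_band:
        return "nano"
    return "tie"

def tally(cells, tie_band: float = TIE_BAND) -> dict:
    counts = {"gpt": 0, "nano": 0, "tie": 0}
    for d in cells:
        counts[classify(d, tie_band)] += 1
    return counts

all_cells = [d for row in DELTAS.values() for d in row]
raw_sign_wins = sum(d > 0 for d in all_cells)                  # any positive delta
banded = tally(all_cells)                                      # ties excluded
per_scenario = [tally(col) for col in zip(*DELTAS.values())]   # S1..S5

print(raw_sign_wins, banded)
```

With this assumed band, 24 of the 40 cells are positive by raw sign, while only 20 clear the tie band. The headline 24/40 win count therefore appears to include small positive margins that the matrix treats as practical ties, which is worth keeping in mind when comparing the two numbers.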

Explore the evidence behind the VIVID image modeling evaluation.