# S1/S2/S3/S4/S5 Integrated Contextual Evaluation

Scope: five Apple mobile photo editing scenarios (S1-S5), each evaluated with three judge families over 2 models (GPT Image 2, Nano Banana) × 8 personas × 5 turns.

## Executive Readout

- Robust default from this pilot: `GPT Image 2`, which leads on acceptance in S2-S5; Nano Banana leads only in S1 (identity trust).
- S4/S5 add practical trust axes: readable text, menu/price stability, object layout, food identity, and commerce credibility.
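- The strongest VIVID claim is contextual: persona and scenario change the model risk profile more than a single global score suggests. For example, persona p03's Nano-GPT acceptance difference moves from 0.1774 in S1 to -0.1001 in S4.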
- Lens scores remain above User Acceptance across scenarios, supporting persona-based evaluation as the user-experience layer.

## Scenario Comparison

| scenario | risk axis | GPT Image 2 acceptance | Nano Banana acceptance | Nano-GPT difference | winner | acceptance majority | judge spread |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S1 | Identity trust | 0.4389 | 0.4829 | 0.044 | Nano Banana | 0.625 | 0.2038 |
| S2 | Memory/action realism | 0.7102 | 0.6796 | -0.0306 | GPT Image 2 | 0.7 | 0.2904 |
| S3 | Place fidelity | 0.5242 | 0.5019 | -0.0223 | GPT Image 2 | 0.65 | 0.2452 |
| S4 | Text trust | 0.3117 | 0.2459 | -0.0658 | GPT Image 2 | 0.6 | 0.2203 |
| S5 | Commerce trust | 0.4064 | 0.3474 | -0.059 | GPT Image 2 | 0.6 | 0.2744 |
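The fifth numeric column is consistent with Nano Banana acceptance minus GPT Image 2 acceptance, and the winner follows its sign. A minimal Python sketch of that arithmetic, using the values from the table above; the names `acceptance` and `compare` are illustrative and not part of the evaluation pipeline:

```python
# Acceptance scores copied from the Scenario Comparison table.
# Tuple order: (GPT Image 2, Nano Banana).
acceptance = {
    "S1": (0.4389, 0.4829),
    "S2": (0.7102, 0.6796),
    "S3": (0.5242, 0.5019),
    "S4": (0.3117, 0.2459),
    "S5": (0.4064, 0.3474),
}


def compare(gpt: float, nano: float) -> tuple[float, str]:
    """Return (Nano minus GPT difference, winner by sign of the difference)."""
    diff = round(nano - gpt, 4)
    # Assumption: an exact tie would default to GPT Image 2; no tie occurs here.
    winner = "Nano Banana" if diff > 0 else "GPT Image 2"
    return diff, winner


rows = {scenario: compare(gpt, nano) for scenario, (gpt, nano) in acceptance.items()}

for scenario, (diff, winner) in rows.items():
    print(f"{scenario}: diff={diff:+.4f} winner={winner}")
```

Only the sign of the difference decides the winner in this sketch; it does not reproduce the `acceptance majority` or `judge spread` columns, whose definitions are not given here.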

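## Persona Preference Across Scenarios

Each `Nano-GPT` column is Nano Banana acceptance minus GPT Image 2 acceptance for that persona, so positive values favor Nano Banana. `max shift` is the range (maximum minus minimum) of that difference across S1-S5. In `S5 pref`, the signs of the S5 differences indicate that `model_a` is GPT Image 2 and `model_b` is Nano Banana.

| persona | label | S1 Nano-GPT | S2 Nano-GPT | S3 Nano-GPT | S4 Nano-GPT | S5 Nano-GPT | max shift | S5 pref |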
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| p03 | Casual access. | 0.1774 | -0.0601 | -0.0067 | -0.1001 | 0.0961 | 0.2775 | model_b |
| p07 | Chinese seller | 0.0561 | -0.1759 | 0.0061 | -0.0522 | -0.0968 | 0.232 | model_a |
| p08 | Travel novice | 0.0022 | 0.0151 | -0.033 | -0.0725 | -0.1836 | 0.1987 | model_a |
| p05 | Small business | 0.0323 | 0.0808 | -0.0376 | -0.0609 | -0.0963 | 0.1771 | model_a |
| p02 | Pro photographer | 0.0206 | -0.1388 | -0.074 | -0.0419 | -0.0969 | 0.1594 | model_a |
| p04 | Social creator | 0.02 | 0.0261 | 0.0468 | -0.0888 | -0.042 | 0.1356 | model_a |
| p01 | Family/pet keeper | 0.0197 | 0.0146 | -0.0248 | -0.0804 | 0.0256 | 0.106 | tie |
| p06 | Korean creator | 0.0238 | -0.0067 | -0.0551 | -0.0298 | -0.0781 | 0.1019 | model_a |
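Reading the `Nano-GPT` columns as Nano Banana acceptance minus GPT Image 2 acceptance, the `max shift` values match the range of each persona's differences across S1-S5. The sketch below recomputes that range and derives a preference label from the S5 difference. The tie band of 0.03 is a hypothetical threshold chosen only because it happens to reproduce the `S5 pref` column; the report does not state the actual rule. The names `deltas`, `max_shift`, and `preference` are likewise illustrative.

```python
# Per-persona Nano-GPT differences for S1..S5, copied from the table above.
deltas = {
    "p03": [0.1774, -0.0601, -0.0067, -0.1001, 0.0961],
    "p07": [0.0561, -0.1759, 0.0061, -0.0522, -0.0968],
    "p08": [0.0022, 0.0151, -0.033, -0.0725, -0.1836],
    "p05": [0.0323, 0.0808, -0.0376, -0.0609, -0.0963],
    "p02": [0.0206, -0.1388, -0.074, -0.0419, -0.0969],
    "p04": [0.02, 0.0261, 0.0468, -0.0888, -0.042],
    "p01": [0.0197, 0.0146, -0.0248, -0.0804, 0.0256],
    "p06": [0.0238, -0.0067, -0.0551, -0.0298, -0.0781],
}

# Hypothetical tie band: NOT taken from the report, only consistent with it.
TIE_BAND = 0.03


def max_shift(values: list[float]) -> float:
    """Range of the Nano-GPT difference across scenarios."""
    return round(max(values) - min(values), 4)


def preference(delta: float, tie_band: float = TIE_BAND) -> str:
    """Label a single difference; model_a = GPT Image 2, model_b = Nano Banana."""
    if abs(delta) < tie_band:
        return "tie"
    return "model_b" if delta > 0 else "model_a"


summary = {
    persona: (max_shift(values), preference(values[-1]))
    for persona, values in deltas.items()
}

for persona, (shift, pref) in summary.items():
    print(f"{persona}: max_shift={shift:.4f} s5_pref={pref}")
```

Sorting `summary` by the shift value in descending order gives the same row order as the table, which suggests the table is ranked by how much each persona's preference moves across scenarios.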

## Interpretation For Apple

S1-S5 now cover portrait identity, pet/action realism, travel/documentary fidelity, menu/text trust, and food/tabletop commerce trust. S5 strengthens the product read because it combines subjective appetite/appeal with more objective failure modes such as menu text, price, object layout, and credibility drift.