# S5 Three-Judge Agreement Analysis

## Scope

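Scenario 5 (S5) tests commerce trust for food/tabletop images: 2 target image models × 8 personas × 5 turns, scored by 3 judges. That gives 80 scored images per judge, matching N = 80 in the pairwise tables.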

- Claude in-session judge: `claude-opus-4-7`
- GPT post-session rejudge: `gpt-5.4-mini`
- Grok post-session rejudge: `xai/grok-4.3`

No images were regenerated. GPT and Grok rejudged the saved S5 output images after the session, so all three judges scored the same outputs.

## Aggregate By Judge And Model

| Judge | Target | Quality | Quality T5-T1 | User Acceptance | Acceptance T5-T1 | Lens |
| --- | --- | --- | --- | --- | --- | --- |
| claude_opus_4_7 | model_a | 0.736 | -0.031 | 0.536 | -0.018 | 0.857 |
| claude_opus_4_7 | model_b | 0.483 | -0.190 | 0.364 | -0.166 | 0.684 |
| gpt_5_4_mini | model_a | 0.735 | -0.041 | 0.278 | 0.014 | 0.982 |
| gpt_5_4_mini | model_b | 0.549 | -0.165 | 0.232 | -0.064 | 0.981 |
| grok_4_3 | model_a | 0.752 | 0.031 | 0.405 | -0.026 | 0.925 |
| grok_4_3 | model_b | 0.657 | -0.007 | 0.447 | 0.082 | 0.806 |
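
The aggregation behind this table can be sketched as below. This is a minimal illustration, assuming `T5-T1` denotes the turn-5 mean minus the turn-1 mean; the field names (`judge`, `target`, `turn`, `quality`) and the toy rows are hypothetical and may not match the columns in the scores CSV.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_judge_and_model(rows, metric):
    """Return {(judge, target): (overall_mean, turn5_mean - turn1_mean)}.

    Assumes each row is a dict with hypothetical keys
    'judge', 'target', 'turn', plus the metric column.
    """
    by_series = defaultdict(list)
    for row in rows:
        by_series[(row["judge"], row["target"])].append(row)

    summary = {}
    for key, series in by_series.items():
        overall = mean(r[metric] for r in series)
        first = mean(r[metric] for r in series if r["turn"] == 1)
        last = mean(r[metric] for r in series if r["turn"] == 5)
        # Round to three decimals, as in the table above.
        summary[key] = (round(overall, 3), round(last - first, 3))
    return summary


# Toy data for illustration only (not taken from the S5 results).
rows = [
    {"judge": "j1", "target": "model_a", "turn": 1, "quality": 0.80},
    {"judge": "j1", "target": "model_a", "turn": 3, "quality": 0.70},
    {"judge": "j1", "target": "model_a", "turn": 5, "quality": 0.60},
]
summary = aggregate_by_judge_and_model(rows, "quality")
```

Under this reading, a negative delta means the score fell between the first and last turn.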

## Winner Agreement

| Metric | Unanimous winner rate | Majority winner rate | claude_opus_4_7 votes (model_a / model_b / tie) | gpt_5_4_mini votes (model_a / model_b / tie) | grok_4_3 votes (model_a / model_b / tie) |
| --- | --- | --- | --- | --- | --- |
| quality | 0.575 | 0.425 | 39 / 0 / 1 | 38 / 2 / 0 | 25 / 7 / 8 |
| acceptance | 0.25 | 0.6 | 35 / 1 / 4 | 23 / 8 / 9 | 15 / 20 / 5 |
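
The two rates can be computed as in the sketch below. One assumption is inferred from the table rather than stated in it: the rates sum to 1.0 for quality and 0.85 for acceptance, which is consistent with "majority" counting only 2-of-3 splits and the remainder being comparisons where all three judges disagree. The function name and the toy votes are hypothetical.

```python
from collections import Counter


def winner_agreement(votes_by_judge):
    """Return (unanimous_rate, split_majority_rate).

    votes_by_judge maps a judge name to a list of winners
    ('model_a', 'model_b', or 'tie'), one per comparison.
    Assumption: a comparison counts as 'majority' only when
    more than half of the judges agree but not all of them.
    """
    judges = list(votes_by_judge)
    n = len(votes_by_judge[judges[0]])
    unanimous = majority = 0
    for i in range(n):
        votes = Counter(votes_by_judge[j][i] for j in judges)
        top_count = votes.most_common(1)[0][1]
        if top_count == len(judges):
            unanimous += 1
        elif top_count > len(judges) / 2:
            majority += 1
    return unanimous / n, majority / n


# Toy votes for illustration only (not the S5 data).
votes = {
    "judge_1": ["model_a", "model_a", "model_a", "tie"],
    "judge_2": ["model_a", "model_a", "model_b", "model_a"],
    "judge_3": ["model_a", "model_b", "tie", "model_a"],
}
rates = winner_agreement(votes)
```

In the toy data, the third comparison has three different votes, so it counts toward neither rate.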

## Pairwise Judge Agreement

Mean signed diff is the first-named judge's score minus the second's. This matches the aggregate means above: Claude's mean quality of about 0.610 against GPT's 0.642 gives roughly -0.033.

| Metric | Pair | N | Pearson | Spearman | MAE | Mean signed diff |
| --- | --- | --- | --- | --- | --- | --- |
| quality | claude_opus_4_7__vs__gpt_5_4_mini | 80 | 0.7543 | 0.7352 | 0.0863 | -0.0328 |
| quality | claude_opus_4_7__vs__grok_4_3 | 80 | 0.5792 | 0.5881 | 0.1217 | -0.0955 |
| quality | gpt_5_4_mini__vs__grok_4_3 | 80 | 0.563 | 0.5416 | 0.0985 | -0.0628 |
| acceptance | claude_opus_4_7__vs__gpt_5_4_mini | 80 | 0.4836 | 0.5801 | 0.2111 | 0.1955 |
| acceptance | claude_opus_4_7__vs__grok_4_3 | 80 | 0.4465 | 0.3791 | 0.1401 | 0.0241 |
| acceptance | gpt_5_4_mini__vs__grok_4_3 | 80 | 0.3735 | 0.3416 | 0.1977 | -0.1714 |
| lens | claude_opus_4_7__vs__gpt_5_4_mini | 80 | 0.0708 | 0.1422 | 0.2158 | -0.2112 |
| lens | claude_opus_4_7__vs__grok_4_3 | 80 | 0.546 | 0.5682 | 0.1382 | -0.0945 |
| lens | gpt_5_4_mini__vs__grok_4_3 | 80 | 0.1051 | 0.2105 | 0.1246 | 0.1167 |
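
For reference, the four statistics in this table can be computed from two aligned score lists as sketched below. This is a standard-library illustration rather than the pipeline's actual code; the helper names and toy scores are hypothetical, and Spearman is computed as Pearson on average ranks.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pairwise_agreement(a, b):
    """Agreement statistics for judge a versus judge b (diffs are a - b)."""
    diffs = [x - y for x, y in zip(a, b)]
    return {
        "n": len(diffs),
        "pearson": pearson(a, b),
        "spearman": pearson(average_ranks(a), average_ranks(b)),
        "mae": mean(abs(d) for d in diffs),
        "mean_signed_diff": mean(diffs),
    }


# Toy scores for illustration only (not the S5 data).
judge_a = [0.9, 0.7, 0.5, 0.6]
judge_b = [0.8, 0.8, 0.4, 0.7]
stats = pairwise_agreement(judge_a, judge_b)
```

A low correlation can coexist with a small MAE when one judge's scores sit in a narrow band, so the columns are best read together.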

## Friction Taxonomy

Counts are keyword-based over raw issue descriptions. One issue can count in multiple clusters, so the cluster counts in a row can sum to more than Total issues.

| Series | Total issues | text_or_price_corruption | food_drink_identity_drift | table_layout_change | unrequested_object_removal | overprocessed_restaurant_aesthetic | commercial_trust_loss | instruction_carryover |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| claude_opus_4_7:model_a | 179 | 55 | 92 | 79 | 7 | 39 | 14 | 98 |
| claude_opus_4_7:model_b | 207 | 138 | 56 | 43 | 3 | 19 | 15 | 75 |
| gpt_5_4_mini:model_a | 144 | 90 | 67 | 67 | 6 | 26 | 26 | 66 |
| gpt_5_4_mini:model_b | 157 | 98 | 63 | 69 | 13 | 16 | 28 | 70 |
| grok_4_3:model_a | 114 | 36 | 56 | 50 | 0 | 15 | 4 | 36 |
| grok_4_3:model_b | 126 | 46 | 59 | 57 | 0 | 19 | 1 | 33 |
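
The keyword-based, multi-label counting described above can be sketched as follows. The cluster names come from the table, but the keyword lists and sample issue descriptions are hypothetical placeholders, not the ones used to produce these counts.

```python
from collections import Counter

# Hypothetical keyword lists; the real lists are not given in this report.
CLUSTER_KEYWORDS = {
    "text_or_price_corruption": ("menu text", "price", "garbled"),
    "food_drink_identity_drift": ("dish", "drink", "ingredient"),
    "table_layout_change": ("layout", "rearranged", "moved"),
}


def count_clusters(issue_descriptions):
    """Count issues per cluster via case-insensitive substring matching.

    An issue increments every cluster whose keywords it mentions,
    so one issue can contribute to several clusters.
    """
    counts = Counter()
    for description in issue_descriptions:
        text = description.lower()
        for cluster, keywords in CLUSTER_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                counts[cluster] += 1
    return counts


# Toy issue descriptions for illustration only.
issues = [
    "Price on the menu board is garbled after the edit",
    "The drink was moved and the price tag is unreadable",
    "Lighting looks fine",
]
counts = count_clusters(issues)
```

Substring matching is simple but coarse: it can miss paraphrases and over-count incidental mentions, so the counts are best treated as approximate signals rather than exact tallies.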

## Interpretation

S5 stresses commerce trust in mobile photo editing: can a model make a restaurant/tabletop image look more appealing without corrupting menu text, prices, object relationships, food identity, or review credibility?

For the Apple-facing readout, a visually appealing image should not count as a success when menu text, prices, or evidence value drift across turns. This is where VIVID's persona-based longitudinal loop is most useful: it exposes whether different user segments tolerate, notice, or escalate those failures differently.

## Artifact Outputs

- `results/analysis/s5_three_judge_agreement_analysis.json`
- `results/analysis/s5_three_judge_scores.csv`
- `results/analysis/s5_three_judge_agreement_analysis.md`
