# Apple Image Foundation Contextual Evaluation

SDK-native experiment skeleton for the Apple image foundation model talk/demo.

The experiment focuses on realistic mobile photo-editing workflows:

- portrait cleanup while preserving identity
- pet/action snapshot repair
- travel landscape sky/light adjustment
- food/menu/product text preservation

The primary study is now a persona-specific closed-loop evaluation: every
persona starts from the same seed photo, edits turn by turn, evaluates each
result, and feeds only its own feedback into its own next edit. This shape fits
Apple service UX questions: who keeps using the default image experience, who
switches to another app, and why.
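
The persona-specific loop described above can be sketched roughly as follows.
This is a minimal illustration, not the code in `run_closed_loop.py`: the
`PersonaJourney` class, `run_persona_specific_loop`, and the `edit` /
`evaluate` callables are hypothetical names, and the stubs only build strings
instead of calling a provider.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PersonaJourney:
    """Hypothetical container for one persona's trajectory."""
    persona_id: str
    images: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)


def run_persona_specific_loop(
    seed_image: str,
    persona_ids: List[str],
    turn_limit: int,
    edit: Callable[[str, str, int, Optional[str]], str],
    evaluate: Callable[[str, str, int], str],
) -> List[PersonaJourney]:
    journeys = []
    for persona_id in persona_ids:
        journey = PersonaJourney(persona_id)
        current = seed_image      # every persona starts from the same seed
        last_feedback = None      # feedback never crosses between personas
        for turn in range(1, turn_limit + 1):
            current = edit(current, persona_id, turn, last_feedback)
            last_feedback = evaluate(current, persona_id, turn)
            journey.images.append(current)
            journey.feedback.append(last_feedback)
        journeys.append(journey)
    return journeys


# Stub callables (assumptions for illustration; no provider calls).
def stub_edit(image, persona_id, turn, feedback):
    return f"{image}>{persona_id}:t{turn}"


def stub_evaluate(image, persona_id, turn):
    return f"{persona_id} feedback on turn {turn}"


journeys = run_persona_specific_loop(
    "seed", ["persona_a", "persona_b"], 2, stub_edit, stub_evaluate
)
```

Because `last_feedback` is reset per persona, each trajectory diverges from the
shared seed on its own, which is the property the primary study relies on.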

Replay evaluation remains as a lower-cost baseline path:
`ImageJourneyReplaySpec` + `run_image_journey_replay` consume fixed image
artifacts after generation.

## Current Gate

This folder is safe by default:

- no provider calls in `--mode dry`
- no provider calls in `--mode smoke --stub-eval`
- no DB writes
- no deployment
- no secrets required

The live image-generation fixture run has been executed once. Its outputs are
under:

- `results/fixture_runs/fixtures_full_4seed_v1/`
- `generated_images/`
- `results/analysis/analysis_full_fixture_v1/`

The first official VIVID SDK live replay attempt was blocked because this
checkout had no `VIVID_LICENSE_KEY`; that failed event is intentionally
preserved in `results/runs/eval_full_4seed_v1/events.jsonl`. The same run id
now also contains a successful live replay for `S1_portrait_cleanup_identity /
model_a` using `--internal-commercial-tier`, which records the override in the
saved artifacts.

## Commands

Validate all scenario/target specs and quote SDK replay cost:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_replay_eval.py --mode dry
```

Run one stubbed SDK replay smoke:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_replay_eval.py --mode smoke --stub-eval
```

Prepare 1024 x 1024 seed crops:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/prepare_seeds.py
```

Generate provider-backed image-edit fixtures, with resume support:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/generate_fixtures.py \
  --mode full \
  --run-id fixtures_full_4seed_v1 \
  --hard-cap-usd 8.0 \
  --resume-existing
```

Run one fixture-backed VIVID replay spec with an explicit internal demo tier
override:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_replay_eval.py \
  --mode full \
  --run-id eval_full_4seed_v1 \
  --scenario-id S1_portrait_cleanup_identity \
  --target-id model_a \
  --fixture-run-id fixtures_full_4seed_v1 \
  --resume-existing \
  --continue-on-error \
  --max-sdk-reference-usd 10.0 \
  --internal-commercial-tier
```

Remove `--scenario-id` and `--target-id` only when scheduling the full 8-spec
run. The first live spec took about 15 minutes on this checkout.

The `--internal-commercial-tier` flag records
`tier_override: "internal_commercial"` in the saved artifacts. For local
environment-key generation, see `docs/tutorial/vivid_local_license_keys.md`.

Build a replayable artifact index and contact sheet:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/summarize_results.py \
  --fixture-run-id fixtures_full_4seed_v1 \
  --eval-run-id eval_full_4seed_v1 \
  --analysis-id analysis_full_fixture_v1
```

Run a small closed-loop pilot where persona feedback from turn 1 is injected
into the turn 2 generation prompt:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_closed_loop.py \
  --run-id closed_loop_pilot_s1_model_a_v1 \
  --scenario-id S1_portrait_cleanup_identity \
  --target-id model_a \
  --turn-limit 2 \
  --persona-limit 2 \
  --hard-cap-usd 1.0 \
  --max-sdk-reference-usd 1.0 \
  --internal-commercial-tier
```

Current closed-loop pilot artifact:

- `results/closed_loop_runs/closed_loop_pilot_s1_model_a_v1/`
- `events.jsonl` proves the sequence:
  `generation_completed -> persona_eval_completed -> feedback_synthesized -> next generation_started`
- turn 2 prompt includes turn 1 persona feedback
- turn 2 evaluation caught worsening over-processing/stylization drift
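
One way to check the event sequence listed above is to read `events.jsonl` and
confirm the four event names appear in order. The sketch below is an
assumption-laden illustration: it assumes each line is a JSON object whose
event name lives under a key called `event`, and that the "next
generation_started" step is logged as `generation_started`. Adjust both to the
actual schema.

```python
import json
from pathlib import Path

# Expected order, taken from the sequence in this README.
EXPECTED_CYCLE = [
    "generation_completed",
    "persona_eval_completed",
    "feedback_synthesized",
    "generation_started",
]


def load_event_names(path, key="event"):
    """Read a JSONL file and return the event name from each non-empty line.

    The `key` default is an assumption about the log schema.
    """
    names = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                names.append(json.loads(line)[key])
    return names


def contains_in_order(names, expected):
    """True if `expected` occurs as an ordered subsequence of `names`."""
    remaining = iter(names)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later events.
    return all(step in remaining for step in expected)
```

Usage, assuming the pilot run directory from this section:
`contains_in_order(load_event_names("results/closed_loop_runs/closed_loop_pilot_s1_model_a_v1/events.jsonl"), EXPECTED_CYCLE)`.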

Run the primary Apple UX dry plan with persona-specific image trajectories:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_closed_loop.py \
  --dry-run \
  --run-id closed_loop_persona_specific_dry_check \
  --loop-mode persona_specific \
  --scenario-id S1_portrait_cleanup_identity \
  --target-id model_a \
  --turn-limit 2 \
  --persona-limit 2 \
  --hard-cap-usd 1.0 \
  --max-sdk-reference-usd 1.0 \
  --internal-commercial-tier
```

Run one live persona-specific smoke before any larger study:

```bash
python experiments/2026-05-15_apple_image_foundation_contextual_eval/run_closed_loop.py \
  --run-id closed_loop_persona_specific_smoke_s1_model_a_v1 \
  --loop-mode persona_specific \
  --scenario-id S1_portrait_cleanup_identity \
  --target-id model_a \
  --turn-limit 1 \
  --persona-limit 1 \
  --hard-cap-usd 0.5 \
  --max-sdk-reference-usd 0.5 \
  --internal-commercial-tier
```

Saved artifacts for persona-specific runs include:

- `events.jsonl`: every generation/evaluation/feedback event
- `inputs/*.snapshot.json`: scenario, persona, seed, and plan snapshots
- `images/<scenario>/<target>/<persona_id>/turn_N.png`
- `turns/<persona_id>/turn_N.json`
- `journeys/<persona_id>.json`
- `summary.json` with service-UX signals, friction points, trust breaks, and
  switching likelihood
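
The layout above can be turned into a small completeness check. This is a
sketch only: `persona_artifact_paths` and `missing_artifacts` are hypothetical
helpers, and it assumes `turn_N` uses an unpadded 1-based integer and that all
paths are relative to the run directory.

```python
from pathlib import Path


def persona_artifact_paths(run_dir, scenario, target, persona_id, turn):
    """Build the expected per-persona artifact paths for one turn."""
    root = Path(run_dir)
    return {
        "events": root / "events.jsonl",
        "image": root / "images" / scenario / target / persona_id / f"turn_{turn}.png",
        "turn": root / "turns" / persona_id / f"turn_{turn}.json",
        "journey": root / "journeys" / f"{persona_id}.json",
        "summary": root / "summary.json",
    }


def missing_artifacts(paths):
    """Return the sorted names of expected artifacts that do not exist on disk."""
    return sorted(name for name, path in paths.items() if not path.exists())
```

For example, calling `missing_artifacts(persona_artifact_paths(run_dir, "S1_portrait_cleanup_identity", "model_a", persona_id, 1))`
after the one-turn live smoke should return an empty list if the run saved
everything listed above.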

The first full provider-backed study should only happen after the dry plan,
one-turn live smoke, and seed provenance checks have passed.

## Seed Policy

Use 1024 x 1024 square crops. Keep provenance in `seed_candidates.json` and the
runtime fixture manifest:

- source URL
- author/source
- license observed date
- crop coordinates
- resize method
- SHA-256
- final local path
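
A provenance entry with the fields above could be assembled along these lines.
This is a sketch under assumptions: the key names and the
`build_provenance_record` helper are hypothetical and may not match the actual
schema of `seed_candidates.json`; only the SHA-256 computation uses a fixed,
standard-library method.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large images are not loaded into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance_record(
    local_path,
    source_url,
    author,
    license_observed_date,
    crop_box,
    resize_method,
):
    """Collect the provenance fields for one seed crop (key names are assumed)."""
    return {
        "source_url": source_url,
        "author": author,
        "license_observed_date": license_observed_date,
        "crop_coordinates": list(crop_box),  # e.g. (left, top, right, bottom)
        "resize_method": resize_method,
        "sha256": sha256_of_file(local_path),
        "local_path": str(local_path),
    }
```

Recomputing `sha256_of_file` on the final local path and comparing it with the
stored value is a cheap way to run the seed provenance check before a
provider-backed study.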

Unsplash is the default public-demo source. iStock images are on hold until
explicit AI/ML use authorization is available.
