April 6, 2026

How to Build the Same Images-First AI Video Automation Flow JSONClip Uses on Its Homepage

A long-read guide to the images-first automation pattern used in JSONClip's homepage agent mode: prompt check, project creation, scenario generation, schema draft, asset prompts, image generation, schema finalization, and final render.

Long-read guide

The JSONClip homepage agent mode is not just a chat box that magically returns a video. It is a concrete pipeline with explicit stages, explicit data flow, and an explicit render output. The homepage calls it “One prompt to full generation flow (images-first),” and that wording matters because the images-first part is a design choice, not an accident.

This article explains how that pipeline works and how to build the same kind of automation process yourself. The goal is not to copy a UI widget. The goal is to reproduce the operating model: validate the prompt, create project state, generate a scenario, draft a schema and asset list, generate asset prompts, create missing images, finalize the schema with stored assets, and only then render the video.

That approach is more disciplined than asking a model to jump straight from one prompt to one final MP4 in a single opaque response. It is easier to observe, easier to debug, easier to cache, and easier to improve. It also fits how real production systems need to behave. If a step fails, you want to know which step failed and what data it produced. If a user wants to edit the result, you want the project state and schema to exist as concrete artifacts. If image generation works but rendering fails, you want to reuse the assets rather than regenerate them.

Why the homepage flow is images-first in the first place

The homepage flow defaults to images-first because still-image generation is currently the most stable entry point for one-prompt automation when the system needs to deliver predictable results across many prompts. It lets the agent define the story beats, produce visual anchors for each beat, store those assets, and build a render schema around real files instead of around hypothetical ones.

That does not mean video generation is irrelevant. The homepage settings already let the user choose a preferred video provider and model. But the current operating mode still validates and stores those settings while keeping the primary asset-generation path images-first. That is a pragmatic choice. An images-first pipeline is easier to make deterministic, especially when the purpose is to generate a reliable first cut from a single prompt.

| Design choice | Why it exists | Operational upside |
| --- | --- | --- |
| Scenario before assets | The system needs a concrete narrative plan before it can ask for visuals. | Prompts stay coherent instead of producing random disconnected images. |
| Schema draft before image generation | The system needs to know which asset slots are actually required. | You only generate the missing images the schema needs. |
| Images before final render | Render should reference real stored media, not hypothetical assets. | Failures become easier to isolate and retries become cheaper. |
| Explicit step polling | Long-running jobs need visible progress and failure boundaries. | You avoid one giant timeout-prone request. |

The exact step sequence JSONClip uses

Homepage agent step order
1. prompt_check       -> Check prompt with OpenAI
2. project_create     -> Generate project name and create project
3. scenario_generate  -> Generate scenario
4. schema_generate    -> Generate JSON schema draft and asset list
5. asset_prompts      -> Generate prompts for assets
6. images_generate    -> Generate missing image assets
7. schema_finalize    -> Finalize render schema with generated assets
8. render_video       -> Render video

That step list is the core of the whole system. The point is not that the names sound nice. The point is that each step narrows uncertainty. Prompt validation reduces garbage input. Project creation makes state durable. Scenario generation defines the editorial plan. Schema generation defines the structure and exact asset list. Asset prompts convert the plan into generation instructions. Image generation resolves missing visuals into real files. Schema finalization replaces placeholders with real asset URLs. Render then happens only when the movie is concrete.
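The eight stages above can be sketched as a plain sequential runner that records each step's output before the next one begins. The step names follow the homepage order; the handler functions and the run-state shape are illustrative assumptions, not JSONClip's actual implementation.

```python
# Illustrative stage runner; step names match the homepage order, but the
# handler interface and state dict are assumptions for this sketch.
STEPS = [
    "prompt_check",
    "project_create",
    "scenario_generate",
    "schema_generate",
    "asset_prompts",
    "images_generate",
    "schema_finalize",
    "render_video",
]

def run_pipeline(prompt: str, handlers: dict) -> dict:
    """Run each stage in order, keeping per-step history for observability."""
    state = {"prompt": prompt, "history": []}
    for step in STEPS:
        output = handlers[step](state)      # each handler narrows uncertainty
        state["history"].append({"step": step, "output": output})
        state.update(output)                # later stages see earlier results
    return state
```

Because the runner appends every step's output to a history list, a failed run still tells you exactly which stage broke and what data it had.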

What the request into the flow looks like

The user-facing homepage form is simple, but the request is not vague. It carries the JSONClip API key, the raw prompt, and selected generation settings for image, video, and narration. This matters because the backend can treat the run as a real job with explicit configuration instead of as a soft suggestion.

Example agent start request
POST /ui/agent-render/start

{
  "api_key": "YOUR_JSONCLIP_API_KEY",
  "prompt": "30 seconds of London history in a cinematic fast-paced style",
  "image_provider": "openai",
  "image_model_id": "gpt-image-1",
  "video_provider": "google",
  "video_model_id": "veo-2",
  "audio_provider": "elevenlabs",
  "audio_model_id": "eleven_multilingual_v2",
  "audio_voice_id": "rachel"
}

If you are rebuilding the same process in your own internal tool, keep this request narrow. Do not put the entire workflow in one unstructured string if you can avoid it. The prompt itself can still be natural language, but provider, model, and voice settings should be explicit fields. That is how you keep runs reproducible and auditable.
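One way to keep the request narrow is to model it as a typed record whose fields mirror the example payload above, with a small validation pass before any work starts. The field names come from the example request; the validation thresholds are illustrative assumptions.

```python
from dataclasses import dataclass

# Sketch of a narrow start-request shape; fields mirror the example payload,
# but the validation rules here are assumptions, not JSONClip's.
@dataclass
class AgentStartRequest:
    api_key: str
    prompt: str
    image_provider: str
    image_model_id: str
    video_provider: str
    video_model_id: str
    audio_provider: str
    audio_model_id: str
    audio_voice_id: str

    def validate(self) -> list:
        """Return a list of problems; an empty list means the run can start."""
        problems = []
        if not self.api_key:
            problems.append("missing api_key")
        if len(self.prompt.strip()) < 10:
            problems.append("prompt too short to imply a scenario")
        return problems
```

Keeping provider, model, and voice as explicit fields is what makes a run reproducible: the same record replayed later produces the same configuration.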

Step 1: prompt_check

This step is easy to underestimate. Prompt checking is not only about stopping abusive or malformed requests. It is also about making sure the request is sufficiently specific to generate a coherent scenario. If someone types “make a cool video,” a production-grade system should treat that as under-specified and either fail early or reshape the request carefully.

  • Check that the prompt is long enough and descriptive enough to imply a story or structure.
  • Check obvious format hints such as portrait versus landscape if the user mentioned them.
  • Reject prompts that cannot plausibly produce a useful output contract.
  • Preserve the original prompt in the run history so later stages can be audited against it.

This is the first place where many naive agent systems go wrong. They treat the prompt as sacred input even when it is weak. A better system treats the prompt as something that has to pass a quality gate before expensive work begins.
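A minimal version of that quality gate can be sketched as a function that checks length and picks up format hints. The word-count threshold and the hint regexes are assumptions chosen for illustration, not JSONClip's actual rules.

```python
import re

def check_prompt(prompt: str) -> dict:
    """Illustrative prompt gate: reject under-specified prompts and
    normalize obvious orientation hints. Thresholds are assumptions."""
    text = prompt.strip()
    words = text.split()
    hints = {}
    if re.search(r"\b(portrait|vertical|9:16)\b", text, re.I):
        hints["orientation"] = "portrait"
    elif re.search(r"\b(landscape|horizontal|16:9)\b", text, re.I):
        hints["orientation"] = "landscape"
    if len(words) < 5:
        return {"ok": False, "reason": "under-specified prompt", "hints": hints}
    return {"ok": True, "reason": None, "hints": hints}
```

With this gate in front, “make a cool video” fails before any generation credits are spent, while a descriptive prompt passes with its format hints already normalized.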

Step 2: project_create

The homepage flow creates a project early. That is operationally correct. A long-running generation pipeline should not hold everything in memory until the last step. As soon as the run is real, the project should exist. Then generated assets, schema drafts, and final outputs all have a durable place to live.

This also makes the flow editable after the run. The output is not just a dead MP4. It is a project with media, timeline structure, and state that can be opened in the editor, changed, and re-rendered.

Step 3: scenario_generate

Scenario generation is where the prompt turns into an editorial plan. The output should not yet be a final schema. It should be a structured understanding of what the video is trying to do: the beat order, the tone, the implied duration, and the narrative job of each section. If the prompt was “30 seconds of London history,” the scenario should not jump straight to clip coordinates. It should first decide that the story probably moves from origin to expansion to crisis to modernity.

Example scenario and asset-list output
{
  "scenario": {
    "title": "30 seconds of London history",
    "duration_target_sec": 30,
    "format": "1280x720",
    "tone": "cinematic, fast-paced, educational",
    "beats": [
      "Roman Londinium",
      "Medieval trading city",
      "Great Fire of London",
      "Industrial expansion",
      "Modern skyline and cultural reset"
    ]
  },
  "asset_list": [
    { "kind": "image", "role": "hook", "description": "Ancient map of the Thames and early city layout" },
    { "kind": "image", "role": "history", "description": "Roman Londinium with bridge and soldiers" },
    { "kind": "image", "role": "history", "description": "Great Fire of London night inferno" },
    { "kind": "image", "role": "modern", "description": "Modern London skyline with Shard and river" }
  ]
}

That is a much stronger intermediate artifact than a freeform paragraph. It is machine-usable, but a human can still read it and decide whether the story direction makes sense before any generation credits are spent.

Step 4: schema_generate

The homepage flow then generates a JSON schema draft and exact asset list. This is one of the most important design decisions in the whole system. The schema draft exists before the images. That means the system decides what assets are needed because of the structure it wants, not because the image generator happened to produce something interesting.

Example schema draft with asset slots
{
  "movie": {
    "format": {
      "width": 1280,
      "height": 720,
      "fps": 30,
      "background_color": "#000000"
    },
    "scenes": [
      { "type": "image", "asset_slot": "hook_01", "duration_ms": 3200 },
      { "type": "image", "asset_slot": "history_02", "duration_ms": 3400 },
      { "type": "image", "asset_slot": "history_03", "duration_ms": 3600 },
      { "type": "image", "asset_slot": "modern_04", "duration_ms": 3200 }
    ],
    "overlays": [
      {
        "type": "text",
        "text": "30 Seconds of London History",
        "from_ms": 120,
        "to_ms": 2600,
        "position_px": { "x": 640, "y": 110 },
        "width_px": 980,
        "style": { "font": "Avenir Next", "size_px": 68, "bold": true, "align": "center", "color": "#ffffff" },
        "stroke": { "color": "#000000", "width_px": 4 }
      }
    ],
    "effects": [],
    "audio": [],
    "captions": {}
  }
}

Notice the placeholder form here. The scenes reference asset slots, not final URLs. That makes the draft easy to inspect and easy to compare with the asset list. It also means image generation can target exact slots instead of wandering creatively.
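That comparison between the draft and the asset list can be made mechanical. A sketch, assuming the schema shape shown above: collect every `asset_slot` the scenes reference and report the ones no generation prompt covers.

```python
def missing_slots(schema_draft: dict, asset_prompts: list) -> set:
    """Return scene asset slots that no generation prompt covers.
    Assumes the draft shape shown above (scenes with 'asset_slot' keys)."""
    needed = {s["asset_slot"] for s in schema_draft["movie"]["scenes"]
              if "asset_slot" in s}
    covered = {p["slot"] for p in asset_prompts}
    return needed - covered
```

Running this check between schema_generate and images_generate guarantees the image stage receives a complete, bounded work list.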

Step 5: asset_prompts

Once the system knows which assets are required, it can write generation prompts specifically for those slots. This separation matters. If the scenario says “Great Fire of London” and the schema says that is the third beat, the asset prompt generator can produce a prompt that is visually appropriate for that beat, format, and role. That is much stronger than asking one model prompt to improvise all visuals at once.

Example asset prompts
[
  {
    "slot": "hook_01",
    "prompt": "Ancient map of London and Thames, parchment texture, cinematic composition, no text, 16:9"
  },
  {
    "slot": "history_02",
    "prompt": "Roman Londinium riverside, bridge, soldiers, historical illustration, dramatic light, 16:9"
  },
  {
    "slot": "history_03",
    "prompt": "Great Fire of London burning rooftops, smoke, night glow, historical epic style, 16:9"
  },
  {
    "slot": "modern_04",
    "prompt": "Modern London skyline with river and Shard, warm cinematic dusk, 16:9"
  }
]

A good asset prompt stage uses slot names, role names, and target aspect ratio explicitly. That lets the system regenerate only one missing image later without rethinking the whole story.
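A per-slot prompt builder might look like the sketch below, which composes the beat description with a role-specific style suffix and the target aspect ratio. The role-to-style mapping is an illustrative assumption, not a fixed JSONClip template.

```python
def build_asset_prompt(slot: str, role: str, description: str,
                       aspect: str = "16:9") -> dict:
    """Compose one slot-targeted generation prompt from the scenario's
    description plus a role-appropriate style suffix (assumed mapping)."""
    style = {
        "hook": "cinematic composition, no text",
        "history": "historical illustration, dramatic light",
        "modern": "warm cinematic dusk",
    }.get(role, "cinematic composition")
    return {"slot": slot, "prompt": f"{description}, {style}, {aspect}"}
```

Because the builder takes one slot at a time, regenerating a single weak image later means one function call, not a rethink of the whole story.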

Step 6: images_generate

This is where the images-first strategy becomes concrete. The system generates only the missing visual assets and stores them in project storage. The result is not only an image response from a model API. The result is a stored asset with a durable URL that the render schema can actually use.

  • Generate only the asset slots that are missing, not the whole project every time.
  • Persist each generated asset in project storage immediately.
  • Keep the slot-to-file mapping explicit so later schema finalization is deterministic.
  • Store enough metadata to explain which prompt created which asset.

This is another place where simpler systems break down. They generate images, show them in a chat message, and then leave the rest of the workflow implicit. A production pipeline needs those assets to become first-class project files.
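The “only the missing slots” rule from the list above can be expressed in a few lines. A sketch, where `stored` is the slot-to-URL mapping from project storage and `generate_fn` stands in for the image-provider call (both assumed names):

```python
def generate_missing_images(asset_prompts, stored, generate_fn):
    """Generate only slots not already in project storage.
    `stored` maps slot -> URL; `generate_fn` is the provider call (assumed)."""
    for item in asset_prompts:
        slot = item["slot"]
        if slot in stored:
            continue                       # reuse the already-stored asset
        url = generate_fn(item["prompt"])  # expensive call, once per slot
        stored[slot] = url                 # persist the mapping immediately
    return stored
```

Writing each URL back into the mapping as soon as it exists is what makes partial-failure retries cheap: a rerun skips everything that already succeeded.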

Step 7: schema_finalize

Schema finalization takes the draft structure and replaces placeholder asset slots with the actual stored asset URLs. This is also the moment to add or tighten transitions, overlays, timings, and effect choices now that the visual material is real. That distinction matters because a good timing decision depends on what the generated assets actually look like, not only on the original prompt.

Example final movie JSON after asset generation
{
  "movie": {
    "format": {
      "width": 1280,
      "height": 720,
      "fps": 30,
      "background_color": "#000000"
    },
    "scenes": [
      {
        "type": "image",
        "src": "https://store.jsonclip.com/jsonclip/users/.../hook_01.png",
        "duration_ms": 3200,
        "transition_out": { "type": "white_strobe", "duration_ms": 220 }
      },
      {
        "type": "image",
        "src": "https://store.jsonclip.com/jsonclip/users/.../history_02.png",
        "duration_ms": 3400,
        "transition_out": { "type": "blur", "duration_ms": 320 }
      },
      {
        "type": "image",
        "src": "https://store.jsonclip.com/jsonclip/users/.../history_03.png",
        "duration_ms": 3600,
        "transition_out": { "type": "snap_back", "duration_ms": 240 }
      },
      {
        "type": "image",
        "src": "https://store.jsonclip.com/jsonclip/users/.../modern_04.png",
        "duration_ms": 3200
      }
    ],
    "overlays": [
      {
        "type": "text",
        "text": "30 Seconds of London History",
        "from_ms": 120,
        "to_ms": 2600,
        "position_px": { "x": 640, "y": 110 },
        "width_px": 980,
        "style": { "font": "Avenir Next", "size_px": 68, "bold": true, "align": "center", "color": "#ffffff" },
        "stroke": { "color": "#000000", "width_px": 4 }
      }
    ],
    "effects": [
      { "type": "zoom_in", "from_ms": 0, "to_ms": 1800, "settings": { "strength": 1.08 } },
      { "type": "warm_flash", "from_ms": 6500, "to_ms": 7600 }
    ],
    "audio": [
      {
        "src": "https://store.jsonclip.com/jsonclip/users/.../narration.mp3",
        "role": "voiceover",
        "from_ms": 0,
        "to_ms": 13200
      }
    ],
    "captions": {
      "style": "bold_bottom",
      "cues": [
        { "from_ms": 0, "to_ms": 1800, "text": "London began as Roman Londinium" },
        { "from_ms": 1900, "to_ms": 4200, "text": "It grew into a medieval trade power" }
      ]
    }
  },
  "env": "prod"
}

Once this step is complete, the movie is not a plan anymore. It is a renderable object. If a human opens it in the editor, they can see and change real assets. If the renderer runs it, it has real URLs. If a failure happens later, the earlier stages do not need to be repeated.
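The core of schema finalization, swapping placeholder slots for stored URLs, can be sketched as a pure function over the draft. The shapes mirror the draft and final JSON above; raising on an unresolved slot is an assumed but sensible safety rule so placeholders can never reach the renderer.

```python
import copy

def finalize_schema(draft: dict, assets: dict) -> dict:
    """Replace each scene's asset_slot with the stored URL for that slot.
    Raises if any slot is unresolved, so placeholders never reach render."""
    final = copy.deepcopy(draft)           # keep the draft intact for audit
    for scene in final["movie"]["scenes"]:
        slot = scene.pop("asset_slot", None)
        if slot is None:
            continue                       # scene already carries a real src
        if slot not in assets:
            raise ValueError(f"unresolved asset slot: {slot}")
        scene["src"] = assets[slot]
    return final
```

Deep-copying the draft means both artifacts survive: the draft documents what was planned, the final schema documents what was rendered.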

Step 8: render_video

Only after the schema is fully grounded in real assets does the system render the final MP4. That is the right order. Rendering should be the last mile, not the phase where missing planning decisions are silently invented. At the end of a successful run, the system returns the generated project and the final movie URL.

Example completed run response
{
  "run_id": "01JHOMEPAGEAGENTEXAMPLE",
  "status": "done",
  "result": {
    "project_id": "a2e40c84-47b2-4157-8b14-246110704266",
    "duration_ms": 13200,
    "movie_url": "https://renderer.jsonclip.com/jsonclip/movies/example.mp4"
  }
}

This matters operationally because the render result is now attached to explicit preceding steps. If rendering fails, the system still has a project, a scenario, asset prompts, and stored images. That is a recoverable state, not a dead end.
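On the client side, consuming a run like this usually means polling until a terminal status arrives. A sketch, where `get_status` wraps whatever status endpoint your system exposes and the `done`/`failed` values follow the example response above (both are assumptions about your own implementation):

```python
import time

def poll_run(get_status, run_id: str, interval_s: float = 2.0,
             timeout_s: float = 600.0) -> dict:
    """Poll a run until it reaches a terminal status or times out.
    `get_status` and the status vocabulary are assumed, not JSONClip's API."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(run_id)
        if status["status"] in ("done", "failed"):
            return status                  # terminal: hand back the full record
        time.sleep(interval_s)             # long jobs: poll, don't block one request
    raise TimeoutError(f"run {run_id} did not finish in {timeout_s}s")
```

Step-level polling like this is what the design table earlier calls a failure boundary: one slow stage shows up as visible progress, not a single giant request timeout.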

Why this step-by-step design is better than one giant prompt

| Approach | What it looks like | Main weakness |
| --- | --- | --- |
| One giant prompt straight to video | One request asks the model to invent everything end to end invisibly. | Hard to observe, hard to retry selectively, and hard to audit. |
| Images-first staged flow | Prompt, scenario, schema, prompts, generated images, final schema, render. | Slightly more orchestration work, but much stronger operationally. |
| Manual editor-first flow | Human edits by hand after an idea prompt. | Great for bespoke work, weak for repeatable automation. |

The staged flow wins when you care about production reliability. The point is not to add ceremony for its own sake. The point is to put boundaries around uncertainty. Each stage answers one question clearly before the next expensive stage begins.

How to build the same flow in your own system

  1. Create a start endpoint that accepts API key, prompt, and provider/model settings.
  2. Store each run with a real project as early as possible.
  3. Run prompt validation before generation work begins.
  4. Generate a structured scenario, not just freeform prose.
  5. Generate a schema draft with explicit asset slots and required media roles.
  6. Generate prompts only for the required asset slots.
  7. Persist generated images immediately into project storage.
  8. Finalize the schema only after the assets exist.
  9. Render last, and keep step-level input/output logs for every stage.

If you skip any of those steps, you can still make a demo, but you do not really have the same operating model as the homepage flow. The defining characteristics are explicit stages, durable project state, slot-driven asset generation, and render-last discipline.

What to log at each stage

The homepage flow shows full input and output data per stage for a reason. Long-running AI workflows need observability. If a stage fails or produces weak output, you need to inspect the actual structured input and output of that stage, not guess from a final error string.

| Stage | What to log | Why it matters |
| --- | --- | --- |
| prompt_check | Original prompt, validation result, normalized format hints | Lets you diagnose weak requests early. |
| project_create | Generated project name, project ID, created timestamp | Connects the run to durable state. |
| scenario_generate | Scenario JSON and beat list | Shows whether the narrative plan was good before assets existed. |
| schema_generate | Draft schema and asset list | Defines the contract for later asset generation. |
| asset_prompts | Per-slot generation prompts | Lets you explain why one generated asset looks the way it does. |
| images_generate | Prompt-to-asset mapping and stored media URLs | Makes asset generation auditable and reusable. |
| schema_finalize | Final render schema | Shows the exact movie that was sent to render. |
| render_video | Renderer response and movie URL | Closes the loop with actual output status. |

Common mistakes when teams try to copy this

  • They generate images before they have a schema draft and asset list.
  • They let the scenario stay as prose instead of structured beats.
  • They regenerate every image on every retry instead of only the missing ones.
  • They skip project creation and keep all run state ephemeral.
  • They finalize nothing and send placeholder slots to render.
  • They hide step data, which makes debugging much slower.

Those mistakes all come from one underlying problem: treating the pipeline as a magic conversation instead of as a staged system. The homepage agent mode is useful precisely because it is not pretending to be magic.

How retries should work in an images-first pipeline

A strong images-first system should not treat every failure as a reason to restart the run from zero. That is one of the major reasons to separate the stages. If prompt checking fails, nothing should be created. If scenario generation fails, the project can still exist but no expensive media work has started. If image generation partially succeeds, the successful assets should survive while only the missing slots are retried. If rendering fails after schema finalization, the final movie schema should remain inspectable and rerenderable without regenerating everything upstream.

| Failure point | What should be reused | What should be retried |
| --- | --- | --- |
| prompt_check | Nothing | Only prompt refinement or validation rerun |
| scenario_generate | Project state and original prompt | Scenario generation and later stages |
| schema_generate | Project state and scenario | Schema draft and later stages |
| asset_prompts | Scenario and schema draft | Prompt generation and later stages |
| images_generate | Already stored successful assets | Only failed or missing asset slots |
| schema_finalize | Stored assets and schema draft | Finalization and render |
| render_video | Whole finalized schema and stored assets | Only render |

That retry design is one of the clearest differences between a real workflow and a toy demo. A demo tends to restart everything because it does not preserve intermediate state. A serious pipeline preserves state specifically so retries become cheap and predictable.
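That resume rule can be encoded directly: a failed stage and everything after it reruns, everything before it is reused. The stage list follows the homepage order; treating images_generate as the resume point while only missing slots actually regenerate is the pattern described above, encoded here as an illustrative sketch.

```python
# Illustrative retry policy: resume from the failed stage onward.
# Stage order follows the homepage flow; the resume rule is an assumption
# that matches the reuse/retry table above.
STAGES = [
    "prompt_check", "project_create", "scenario_generate", "schema_generate",
    "asset_prompts", "images_generate", "schema_finalize", "render_video",
]

def stages_to_retry(failed_stage: str) -> list:
    """Return the stages to rerun after a failure at `failed_stage`.
    Earlier stages are reused; images_generate itself reruns but should
    regenerate only the missing slots, not every asset."""
    idx = STAGES.index(failed_stage)
    return STAGES[idx:]
```

A render failure therefore costs exactly one stage, while a prompt-check failure costs nothing because nothing was created yet.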

How humans should intervene without breaking the automation model

An images-first flow should still allow human intervention, but intervention needs to happen at the right layer. The best place for human review is usually after the scenario, after the asset prompts, or after the generated images exist but before the final render. Those checkpoints let a human improve the output without forcing the entire workflow back into manual editing.

  • Review the scenario if the editorial angle feels wrong.
  • Review asset prompts if the visual descriptions feel off-brand or too generic.
  • Review generated images if one beat is weak and should be regenerated before schema finalization.
  • Review the final schema if overlays, transitions, or pacing need minor cleanup before render.

What humans should not do is break the contract by editing random files without understanding the stage boundaries. If a human swaps one generated image manually, the project should still record that as the asset for that slot. If a human changes one transition, the final schema should still be the source of truth for render. The intervention should stay inside the same workflow, not become a side-channel.

Why project storage matters more than people think

Project storage is not just where the files happen to land. It is what makes the pipeline durable. Once generated images, audio, and schema files live inside one project, the whole run becomes editable, inspectable, and reproducible. Without project storage, the automation output is only a transient chat result. With project storage, it becomes a living artifact that can be reopened in the editor, revised in the API, or rerendered later.

This is also where the homepage flow quietly does the right thing. It does not stop at “the model replied with an idea.” It creates a project, stores generated assets, and builds around those files. That turns the run from a fragile one-off into a real production object. Teams that skip this step often end up rebuilding the same internal tooling later because the first version was impossible to audit.

What to monitor next if you operate a flow like this

Editorial rules for an images-first agent pipeline
- Keep the first version images-first unless video generation is genuinely required
- Make the scenario specific before generating assets
- Draft schema before generating images so the asset list is explicit
- Generate only the missing assets, not everything
- Replace asset slots with real stored URLs only after generation succeeds
- Render only after schema_finalize has produced a coherent movie

In practice, the most important metrics are not “did a video render.” They are earlier. How often does prompt_check reject or reshape weak prompts? How often does schema_generate request too many assets? How often do generated images need replacement? How often does schema_finalize produce a result that a human still wants to trim manually? Those are the metrics that tell you whether the pipeline itself is improving.

Conclusion

The homepage agent mode works because it turns one prompt into a sequence of explicit, inspectable decisions instead of one opaque leap. That is the real lesson. If you want the same kind of automation process in your own stack, copy the structure, not just the surface. Validate first. Create project state early. Generate scenario before schema. Generate schema before assets. Finalize only after assets exist. Render last.

That images-first pattern is not only easier to explain. It is easier to run, easier to debug, easier to edit, and easier to scale. The result is a pipeline that behaves like software instead of like a one-off AI trick.

FAQ

Why is the homepage flow images-first instead of video-first? Because images-first is currently the more deterministic path for prompt-to-video automation while still allowing later video-generation preferences to be stored and validated.

Can I use portrait output? Yes. The homepage flow defaults to 1280x720, but if the user asks for portrait or another format, that should be carried through the scenario and schema stages explicitly.

Why not render immediately after scenario generation? Because the scenario is only the editorial plan. The renderer needs a finalized schema with real asset URLs.

What is the main benefit of schema_generate before images_generate? It makes asset generation slot-driven and bounded, instead of open-ended.

Should I expose step logs to users? For operator-facing or developer-facing tools, usually yes. Observability is one of the biggest advantages of this design.

Methodology and internal references

This article is based on the live JSONClip homepage agent-mode flow, current step names and stage order in the web UI and backend, plus existing JSONClip render-schema patterns used by the service on April 6, 2026.