How to Turn a Photo into a Video with Claude Code and Wonda

By Thomas Gak-Deluen | May 19, 2026 | tutorials

[Image: Terminal showing a Claude Code conversation that animates a still photo into a vertical TikTok clip using Wonda]

A practical guide to animating a single photo into a TikTok-ready clip from the terminal. Claude Code picks the right model, writes the prompt, and chains the workflow. You stay focused on the photo and the motion you want.

Almost every short-form video you scroll past started as a still image at some point.

That used to mean a slideshow with crossfades. In 2026 it means a real video clip with camera motion, subject motion, ambient detail, and a vertical format that fills a phone screen. The shift is mainstream and the tooling has caught up. The interesting question is no longer "can I animate this photo?" It is "which model should run it, and how do I keep the identity stable across the clip?"

Both of those questions have specific answers, and they are exactly the kind of thing you want an agent to handle for you. This post is the terminal-first version of the photo-to-video workflow: Claude Code as the operator, Wonda as the execution surface, and a small set of routing rules that decide which model touches your image.

If you have not read the broader Claude Code plus Wonda argument yet, "You Don't Need to Learn the CLI: Let Claude Code Run Wonda for You" is the place to start. If you want the wider video-model landscape, "The Developer's Guide to AI Video Generation in 2026" covers the model lineup in more depth. And if you have long footage rather than a still photo, clipping it into captioned shorts is the companion workflow. This post is narrower: one photo in, one polished short clip out, fully driven from a single terminal session.

Key Takeaways

The choice of image-to-video model is not arbitrary. It depends on what is in the photo, and picking the wrong model is one of the fastest ways to get bad output.

For a photo with a visible person or face, Kling 3 Pro is the safest default. For other photos, Seedance 2 is the default.

Claude Code reads Wonda's skill file, picks the right model, writes a motion-only prompt, and chains the workflow. You describe the photo and the intent.

Photo-to-video is iterative. Plan to regenerate once or twice. The first pass exists to calibrate the prompt, not to be the final clip.

What Does "Photo to Video" Actually Mean in 2026?

The phrase covers four different operations that are easy to confuse.

The first is image animation: one still photo becomes a 5-10 second clip with camera motion and subject motion. This is what most people mean when they say "photo to video," and it is what we will spend most of this post on.

The second is photo montage: many photos become one longer video, with transitions, music, and pacing applied automatically. That is a different workflow, closer to slideshow assembly than image animation.

The third is motion transfer: you supply a still image and a separate reference video, and the model applies the motion from the reference to the subject in the image. Dance trends and character animation use this.

The fourth is talking photo: a portrait plus a script plus a voice produces a lip-synced presenter clip. Different stack again, closer to avatar generation than to image animation.

This post focuses on the first one and touches the third. The other two deserve their own walkthroughs.

Why Does the Model Choice Matter So Much?

Because the model is the biggest single factor in whether the identity in your photo survives the clip.

Most disappointing image-to-video output is not a prompt problem. It is a routing problem. You handed a portrait to a model that does not preserve faces well, or you handed a product shot to a model that drifts on small text and logos. The output looks "off" in a way that is hard to describe and hard to fix with a better prompt. The first fix is the right model.

Wonda's skill file documents the rule clearly. When you pass a reference image to wonda generate video with --attach, the routing guidance is strict:

Person or face visible in the reference image: use kling_3_pro. Kling preserves identity more reliably across motion. This is the safest default when a real human face is in the photo.
No person in the reference image: use seedance-2. Seedance handles products, landscapes, food, architecture, and abstract scenes well, and is faster.
Motion transfer (image plus a reference dance or gesture video): use kling_2_6_motion_control. This is a separate model entirely. It needs both inputs.

You do not need to memorize this. Claude Code reads it from the skill file and applies it based on what you tell it about the photo. The agent is the one that has to remember the rule. You just have to describe what is in the image.
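For readers who want to see the rule as logic, here is a minimal sketch of the routing as a shell function. The model names are the ones quoted above; the function name pick_model and the category labels (person, no-person, motion-transfer) are hypothetical, invented for this illustration, and are not part of the Wonda CLI or its skill file.

```shell
# Hypothetical helper: maps a description of the photo to the model names
# quoted in the routing guidance. The category labels are assumptions made
# for this sketch, not Wonda vocabulary.
pick_model() {
  case "$1" in
    person)          echo "kling_3_pro" ;;
    no-person)       echo "seedance-2" ;;
    motion-transfer) echo "kling_2_6_motion_control" ;;
    *)
      echo "unknown photo category: $1" >&2
      return 1
      ;;
  esac
}

pick_model person      # prints kling_3_pro
pick_model no-person   # prints seedance-2
```

The point of the sketch is that the decision depends only on what is in the photo, which is why an accurate description in your brief is enough for the agent to route correctly.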

How Do You Set Up the Workflow?

If you already have Wonda installed from a previous Claude Code session, skip this. If not, three commands get you to the starting line.

curl -fsSL https://wonda.sh/install.sh | bash
wonda auth login
wonda accounts tiktok   # or instagram, linkedin, whichever you plan to publish to

That is the whole installation surface. Open Claude Code in the same terminal and you are ready. The rest of this post is what you say, and what the agent runs underneath.

How Should You Pick the Photo?

Before any of the model routing matters, the photo itself has to be usable.

A useful frame: the model is not "moving the existing pixels." It is using your image as a reference and generating entirely new frames around it. A clearer reference gives the model less room to invent. Bad inputs produce uncanny outputs no matter how good the prompt is.

Use a photo where:

The main subject is clearly separated from the background
Lighting is even enough to reveal depth and texture
Resolution is high (the model has more to work with)
The face or product is sharp, not motion-blurred
Important text or tiny logos are not load-bearing for the shot

Avoid a photo where:

Hands are visible and have to remain stable (hands are still the hardest single thing for image-to-video models)
The subject is cropped at the head or cut off at the edge
The background is cluttered with overlapping shapes
Critical detail sits in the bottom 10% of the frame where mobile UI will cover it
There is text baked into the image that must remain readable

One more thing that is not technical but matters. If the photo includes a real person, only use images you have the right to animate. The line is straightforward: do not generate motion of a real person without their consent, especially anything that makes them appear to say or do something. The platforms are tightening on this and so should you.

What Does the Claude Code Loop Actually Look Like?

This is the part the agent is for. You describe the photo and the intent, Claude Code picks the model, writes the prompt, kicks off generation, captions the result, and hands you the final clip.

Step 1: Hand Claude Code the photo and describe the intent

You do not run an upload command. You describe what you want:

I have a portrait photo at ./assets/sara-portrait.jpg. I want an 8-second
vertical clip for TikTok. Slow camera push-in, hair moving gently in
the wind, warm late-afternoon light. Keep her face stable, no smile,
no dramatic motion. This is for a personal brand intro.

Three things in that brief do most of the work. The agent knows the photo contains a person, so it routes to kling_3_pro. It knows the format is 9:16 vertical. And it has explicit constraints on what should not change.

Under the hood, Claude Code uploads the file and submits the job. The command shape is:

REF=$(wonda media upload ./assets/sara-portrait.jpg --quiet)

JOB=$(wonda generate video \
  --model kling_3_pro \
  --attach "$REF" \
  --prompt "Slow camera push-in on the woman in the photo, hair moving gently in a soft breeze, warm late-afternoon golden light. Subject remains in place. Preserve identity, facial structure, expression, hair color, and clothing. Subtle, cinematic, photographic." \
  --aspect-ratio 9:16 \
  --duration 8 \
  --wait --quiet)

VID=$(wonda jobs get inference "$JOB" --jq '.outputs[0].media.mediaId')

That is shown for transparency, not as something you type. Doing it by hand means you have to remember the model name, the right aspect ratio flag, the prompt rules for Kling specifically, and how to capture both the job ID and the media ID. The agent owns all of that. You own the brief.

Step 2: Let the agent write a motion prompt, not a scene description

This is where most photo-to-video output goes wrong, and where Claude Code adds the most value.

The photo already encodes the subject, setting, lighting, composition, and colors. The model can see all of that. The prompt should describe only what is supposed to move, plus the constraints on what should not move. A scene description fights with the image. A motion description directs the image.

Claude Code translates your natural-language brief into a motion-first prompt using a small set of rules baked into Wonda's skill file:

One camera motion (push-in, dolly, orbit, tilt, parallax)
One subject motion at most (blink, slight smile, hair sway, breathing)
One environmental motion (wind, light shift, drifting clouds, particles)
One mood word (cinematic, editorial, warm, dreamy)
One or two constraints (preserve identity, preserve product shape, no morphing)

If you cram more than that in, the motions start competing and the output looks chaotic. The agent will resist adding extra verbs even when you ask for them.
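The "one of each" discipline can be sketched as a small shell function that accepts exactly five slots and refuses anything else. This is only an illustration of the rule: the name build_motion_prompt, the argument order, and the sentence template are assumptions, not the format Wonda's skill file or Claude Code actually uses.

```shell
# Hypothetical helper: joins one camera motion, one subject motion, one
# environmental motion, one mood word, and one constraint into plain prose.
# The template is an assumption for illustration only.
build_motion_prompt() {
  if [ "$#" -ne 5 ]; then
    echo "usage: build_motion_prompt CAMERA SUBJECT ENVIRONMENT MOOD CONSTRAINT" >&2
    return 1
  fi
  printf '%s, %s, %s. %s. %s.\n' "$1" "$2" "$3" "$4" "$5"
}

build_motion_prompt \
  "Slow camera push-in" \
  "hair moving gently" \
  "warm late-afternoon light" \
  "Cinematic" \
  "Preserve identity"
```

Requiring exactly five arguments is a blunt way to enforce the budget: a second camera move has nowhere to go, so it has to replace the first one instead of competing with it.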

There is also a model-specific subtlety worth knowing about, even though Claude Code handles it for you. Kling's prompt field is capped around 2,500 characters and responds badly to the Sora-style structured briefs that work elsewhere (the ones with SCENE: / SUBJECT: / MOTION: headers). Kling needs natural-language prose, ideally with the hero subject in the opening sentence. Wonda's skill file tells Claude Code to rewrite prompts when escalating from Seedance to Kling. You will not see this happen, but it is what keeps the output stable.
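If you ever do write a Kling prompt by hand, a pre-flight check along these lines can catch both problems before you spend a generation. It is a sketch under stated assumptions: the function name check_kling_prompt is hypothetical, the 2,500 figure is the approximate cap mentioned above rather than a documented constant, and the header list covers only the three examples named in this post.

```shell
# Hypothetical pre-flight check for a hand-written Kling prompt.
# 2500 is the approximate cap mentioned in this post, not a documented limit.
KLING_MAX_CHARS=2500

check_kling_prompt() {
  # $1 = the prompt text
  if [ "${#1}" -gt "$KLING_MAX_CHARS" ]; then
    echo "prompt is ${#1} characters, over the ~${KLING_MAX_CHARS} cap" >&2
    return 1
  fi
  # Reject structured-brief headers such as SCENE:, SUBJECT:, MOTION:
  if printf '%s\n' "$1" | grep -Eq '(^|[[:space:]])(SCENE|SUBJECT|MOTION):'; then
    echo "prompt uses structured headers; rewrite it as plain prose" >&2
    return 1
  fi
  return 0
}

check_kling_prompt "Slow camera push-in on the woman in the photo." && echo "ok"
```

The check is deliberately conservative. It does not judge whether the prose is good, only whether it has the shape Kling is described as tolerating.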

Step 3: Add captions and a hook overlay

Once the clip is generated, the rest of the loop is the same chain we use on every short-form video. You do not run the edits. You describe them.

On that clip, add a white-highlight hook overlay that says
"How I show up online" in the first second. Then layer the standard
red animated captions on top.

Claude Code chains the two edits together and threads the intermediate media IDs:

HOOK=$(wonda edit video \
  --operation textOverlay \
  --media "$VID" \
  --prompt-text "How I show up online" \
  --preset "TikTok White Highlight" \
  --wait --quiet)

HOOK_MEDIA=$(wonda jobs get editor "$HOOK" --jq '.outputs[0].mediaId')

CAP=$(wonda edit video \
  --operation animatedCaptions \
  --media "$HOOK_MEDIA" \
  --preset "TikTok Red Captions" \
  --wait --quiet)

FINAL=$(wonda jobs get editor "$CAP" --jq '.outputs[0].mediaId')

Animated captions and hook overlays are part of the daily short-form workflow because they make the clip understandable before the audio, the surrounding context, or the post caption loads. For this article, the important part is that Claude Code treats them as follow-on jobs after the animation succeeds.
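Because each step in the chain above feeds its ID into the next, one empty value can silently break everything downstream. A small guard like the following makes that failure loud. The helper name require_id is hypothetical, and treating the literal string "null" as missing assumes that wonda's --jq option behaves like jq itself when a path does not exist, which this post does not confirm.

```shell
# Hypothetical guard: fails when a captured job or media ID is empty or is
# the literal string "null" (what jq prints for a missing path; assumed here
# to also apply to wonda's --jq output).
require_id() {
  # $1 = label for the error message, $2 = captured value
  if [ -z "$2" ] || [ "$2" = "null" ]; then
    echo "missing $1; stopping before the next step" >&2
    return 1
  fi
}

# Inside the chain, a call would look like:
#   require_id "hook media ID" "$HOOK_MEDIA" || exit 1
require_id "example ID" "abc123" && echo "ok"
```

Placing one such call after each captured variable means a failed generation stops the script at the step that failed, rather than at a confusing error two commands later.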

Step 4: Publish with the right platform settings

The publish step is the same as any other short-form workflow. You confirm the destination, caption, and platform-specific settings in chat.

Publish that to TikTok with the caption:
"Same photo. Just doesn't feel still anymore."
Privacy public. AI disclosure on.

Claude Code looks up the account, attaches the media, and sets the TikTok --aigc flag. For Instagram, the same publish step uses Instagram-specific settings like caption, alt text, product tags, and share-to-feed. Letting the agent own the publish command is the right call because platform flags are the details you skip when tired.

When Should You Regenerate Instead of Editing?

Most first generations are calibration runs, not final outputs. The right reflex is to look at the result, decide whether it is fixable, and either regenerate or move on.

Regenerate when:

The face has drifted noticeably from the original photo (identity loss is rarely fixable in the edit)
The subject deformed or melted during motion (a prompt-level problem, not an edit-level one)
The motion is so chaotic that the clip feels accidentally generated
The background changed texture or color in a way that breaks the intent

Edit instead when:

The clip is technically clean but the hook copy needs work
The captions read poorly and the visual underneath is fine
The framing is close but you want to add a logo or call to action overlay
The pacing is right but the music is wrong

Telling Claude Code "regenerate with a slower push-in and add a constraint that her face must not move" is faster than you trying to fix identity drift in post. Telling Claude Code "the visual is fine, swap the hook to 'I changed my approach last year'" is faster than regenerating from scratch.
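The two lists above amount to a small decision table, sketched here as a shell function. The function name next_action and the issue labels are invented for this illustration; they are shorthand for the bullets above, not anything Wonda or Claude Code defines.

```shell
# Hypothetical triage helper: the issue labels are shorthand for the
# "regenerate when" and "edit instead when" lists in this section.
next_action() {
  case "$1" in
    face-drift|subject-deformed|chaotic-motion|background-changed)
      echo "regenerate" ;;
    hook-copy|captions|overlay|music)
      echo "edit" ;;
    *)
      echo "unknown issue: $1" >&2
      return 1
      ;;
  esac
}

next_action face-drift   # prints regenerate
next_action hook-copy    # prints edit
```

The split follows one question: is the problem in the generated pixels or in what was layered on top of them? Pixel-level problems go back to generation, and everything else stays in the edit.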

What About Multi-Photo and Motion Transfer Workflows?

The same operator pattern extends to the other workflows, with two changes worth knowing.

Multi-photo continuity. Kling 3 Pro supports start and end frame images. If you want a clip that moves from one photo to another (the same subject from two angles, a product in two states), you give the agent both images and describe the in-between motion. It is the same loop, just with two --attach references.

Motion transfer. When you want a person or character to perform a specific dance or gesture, the model changes. Claude Code routes to kling_2_6_motion_control, which takes both a reference image and a reference motion video. You describe the inputs:

Apply the motion from ./refs/dance-reference.mp4 onto the character in
./assets/character.png. 9:16, 5 seconds, preserve the character's outfit
and proportions.

This is the same model that powers TikTok dance-trend recreations, but you get to use it without leaving the terminal. The output works well when the reference video has clean, single-subject motion and the still image has a clear body and limbs.

Where Does This Workflow Genuinely Fail?

Three places, and they are worth naming because the failures look like the agent's fault when they are actually input or expectation problems.

1. The photo was never going to work

If the input has cropped hands, motion blur on the subject, or a face partly out of frame, no model will recover from it. The agent can still generate a clip, but it will be uncanny. Claude Code will sometimes warn you about this if you describe the photo accurately. If it does not, look at the input yourself before generating.

2. The motion you want is not physically simple

Asking for "she turns around, smiles, waves at the camera while the background shifts to night" is asking for three motions and a global change. Image-to-video models cannot reliably do that today. They are good at one camera motion plus one subtle subject motion plus one environmental shift. Anything more and you should script it as two clips and cut between them.

3. The brand requires absolute identity stability

Some product shots and some founder portraits cannot tolerate any drift. For those, accept that the AI clip is b-roll, not the hero shot. Use a slow, restrained motion. Keep important details (logos, faces, package text) inside the safe zone. Add overlays for anything that must remain exactly readable.

Frequently Asked Questions

Do I need to write prompts myself, or does Claude Code do that?

Claude Code does. You describe the photo and the intent in plain English. The agent translates that into the motion-only prompt that Kling or Seedance actually wants. If you have specific motion vocabulary in mind ("dolly-in," "rack focus," "parallax depth"), include it and the agent will preserve it.

Why can I not just use one model for everything?

Because identity preservation differs significantly across models. Kling 3 Pro handles faces in a way Seedance does not. Seedance handles non-person scenes faster and often more cleanly than Kling. Using the wrong one is the most common cause of bad output. Wonda's skill file makes the routing rule explicit so Claude Code applies it for you.

How long should the clip be?

5 to 8 seconds for a single-photo animation on TikTok, Reels, or Shorts. Long enough to feel intentional, short enough that the model does not run out of believable motion. If you need 15 to 30 seconds, plan it as two or three clips stitched together rather than one long generation.

Can I keep the same face exactly across multiple generated clips?

Mostly. Kling 3 Pro is the right model, and explicit constraints help (preserve identity, facial structure, hair color, clothing). Do not expect frame-perfect consistency across long sequences. If the consistency requirement is mission-critical (sponsored content, talent likeness), treat the generated clip as a draft and consult the talent before publishing.

Is there a free tier I can try this on?

Yes. Wonda's CLI exposes generation on logged-in free accounts and reserves skill commands for paid plans. Run wonda pricing list to see current model rates, or wonda pricing estimate before you generate. You can also check wonda.sh for current plan details.

The Bottom Line

Photo-to-video is one of the workflows where having Claude Code as the operator changes the experience the most.

The model routing is real and load-bearing. The prompt format is model-specific and not obvious. The chaining from animation to captions to publish involves four jobs and three intermediate media IDs. Doing it by hand for one clip is fine. Doing it daily for thirty days is the kind of friction that quietly kills a posting habit.

The right division of labor: you supply the photo and the intent. Claude Code picks the model, writes the motion prompt, chains the jobs, and hands you a clip ready to publish. You keep the parts that matter: which photo is worth animating, whether the result is on-brand, and what gets posted.

If you want to plug this into a daily posting rhythm, "How to Grow on TikTok Fast" covers the broader consistency loop. If you want to extend the same operator pattern to image-only workflows, "How to Generate AI Images from the Command Line" is the matching read.

#ai-video #claude-code #image-to-video #tutorial #wonda