Product · Jun 7, 2026

Facio's Media Generation Tools: How AI Agents Create Images and Videos Programmatically

AI agents that can only produce text are leaving half their potential on the table. Facio's generate_image and generate_video tools let agents create visual content programmatically — across OpenAI, Google Gemini, Replicate, and fal.ai — with provider-agnostic APIs, HITL approval gates, and direct delivery to any channel. Here's how autonomous visual content creation works in production.

Image Generation · Video Generation · Content Creation · Multi-Provider · AI Media

Most AI agent platforms are text-only. They write blog posts, generate reports, analyze spreadsheets, and send messages — all in text. But the internet is visual. Social media posts need images. Product launches need hero graphics. Marketing campaigns need video. An agent that can only produce text is an agent that can only do half the job.

Facio's generate_image and generate_video tools close that gap — giving agents the ability to create visual content programmatically, across multiple providers, with HITL oversight and direct channel delivery. Here's how the architecture works and what it enables.

The Architecture: Provider-Agnostic, Multi-Model

Both tools follow the same provider-neutral pattern as switch_model: the agent specifies a prompt and optional parameters, and the runtime routes to the configured provider.

generate_image supports:

OpenAI — gpt-image-1, dall-e-3
Google Gemini — gemini-2.5-flash-preview-image-generation
Replicate — black-forest-labs/flux-schnell, black-forest-labs/flux-dev, and any community model
fal.ai — fal-ai/flux/schnell, fal-ai/flux/dev, and other fal-hosted models

generate_video supports:

OpenAI Sora — sora-2, sora-2-pro
Google Veo — veo-3.0-generate-001, veo-3.1-generate-001
Replicate — Kling, Runway, Minimax, and other video models
fal.ai — fal-ai/minimax-video/video-01 and similar

The agent doesn't need provider-specific code. It writes a prompt, picks a model (or lets the default apply), and the runtime handles authentication, API negotiation, and file delivery.

generate_image(
    prompt="A professional hero image for a blog post about AI agent security, featuring a shield with circuit patterns on a dark blue background, clean tech aesthetic",
    model="gpt-image-1",
    size="1536x1024",
    output_format="png"
)

The result is a file path in the workspace — ready for the agent to read, analyze, attach to a message, or pass through an HITL review.
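As a rough illustration of the routing described above, the sketch below maps a model identifier to one of the four providers. The `PROVIDER_PREFIXES` table and the `resolve_provider` function are hypothetical names invented for this example. Facio's runtime may well resolve providers differently, for instance from configuration rather than from the model name, and a prefix table like this one would not cover arbitrary Replicate community models.

```python
# Hypothetical sketch: resolve a provider from a model identifier.
# The prefix table is an assumption derived from the model names listed
# above; it is not Facio's actual routing logic.
PROVIDER_PREFIXES = [
    ("fal-ai/", "fal.ai"),
    ("black-forest-labs/", "replicate"),
    ("gpt-image-", "openai"),
    ("dall-e-", "openai"),
    ("sora-", "openai"),
    ("gemini-", "google"),
    ("veo-", "google"),
]


def resolve_provider(model: str) -> str:
    """Return the provider name for a model ID, or raise if it is unknown."""
    for prefix, provider in PROVIDER_PREFIXES:
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider configured for model {model!r}")


print(resolve_provider("gpt-image-1"))
print(resolve_provider("fal-ai/flux/dev"))
```

The point of the sketch is only that the caller passes a model name and never touches provider-specific code; authentication and file delivery would sit behind the same lookup.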

Image-to-Image and Image-to-Video: Beyond Text Prompts

Both tools accept reference images as input — not just text prompts. This enables workflows that pure text-to-image generation can't handle:

Image-to-image generation:

generate_image(
    prompt="Apply a cyberpunk aesthetic to this product photo, adding neon accents and a dark background",
    image_url="https://example.com/product-photo.jpg",
    model="fal-ai/flux/dev"
)

Image-to-video generation:

generate_video(
    prompt="Smooth camera pan across this hero image with subtle particle effects",
    image_url="https://example.com/hero-image.png",
    model="sora-2",
    duration=6
)

The reference image workflow is critical for brand consistency. An agent generating social media content can start from a brand-approved template image and generate platform-specific variants — rather than creating visuals from scratch that might drift from brand guidelines.

Resolution, Quality, and Format Control

Both tools expose the parameters that production content work typically needs: size, quality, duration, output format, and batch size.

Parameter	`generate_image`	`generate_video`
Size/resolution	`1024x1024`, `1536x1024`, `1024x1536`, `auto`	`1280x720`, `1920x1080`
Quality	`low`, `medium`, `high`, `auto`	N/A (duration-based)
Duration	N/A	4, 6, 8, 16, 20 seconds
Output format	`png`, `jpeg`, `webp`	Provider-specific
Batch size (n)	1-4 images per call	1 video per call

The agent can generate four image variants in one call, pick the best, and request a high-resolution export of the winner — all within the same tool surface.
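An agent or wrapper could check a request against the table above before spending provider credits. The following is a minimal sketch under assumptions: the helper names `check_image_request` and `check_video_request` are hypothetical, and the keyword name `quality` is inferred from the table rather than confirmed by the examples in this post. Individual providers may accept only a subset of these values.

```python
# Allowed values copied from the parameter table above.
IMAGE_SIZES = {"1024x1024", "1536x1024", "1024x1536", "auto"}
IMAGE_QUALITIES = {"low", "medium", "high", "auto"}
IMAGE_FORMATS = {"png", "jpeg", "webp"}
VIDEO_SIZES = {"1280x720", "1920x1080"}
VIDEO_DURATIONS = {4, 6, 8, 16, 20}  # seconds


def check_image_request(size, quality, output_format, n):
    """Return a list of problems; an empty list means the request fits the table."""
    problems = []
    if size not in IMAGE_SIZES:
        problems.append(f"unsupported size: {size}")
    if quality not in IMAGE_QUALITIES:
        problems.append(f"unsupported quality: {quality}")
    if output_format not in IMAGE_FORMATS:
        problems.append(f"unsupported output_format: {output_format}")
    if not 1 <= n <= 4:
        problems.append(f"n must be between 1 and 4, got {n}")
    return problems


def check_video_request(size, duration):
    """Same idea for video: validate resolution and duration only."""
    problems = []
    if size not in VIDEO_SIZES:
        problems.append(f"unsupported size: {size}")
    if duration not in VIDEO_DURATIONS:
        problems.append(f"unsupported duration: {duration}")
    return problems


print(check_image_request("1536x1024", "high", "png", 4))
print(check_video_request("1920x1080", 12))
```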

HITL Integration: Human Review Before Publication

Visual content is high-stakes. A typo in a blog post is fixable. A generated image with mangled text or anatomical errors on a brand's social media feed is a reputational problem.

Facio's media tools integrate with the HITL pipeline through the same pattern as every other tool:

1. Agent generates candidate images: generate_image(prompt="...", n=3)
2. Agent attaches generated files to an approval request: ask_approval(title="Review hero images", media=["/path/to/img1.png", "/path/to/img2.png"])
3. Human reviews the images in their Placet.io inbox.
4. Human approves (or rejects, with comments for regeneration).
5. Agent publishes the approved image to the target channel via message(media=["..."]).

The approval gate sits between generation and publication — not before generation. The agent does the creative work (writing prompts, generating variants, selecting candidates) without blocking. The human only reviews the final candidates.
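The generate, review, publish loop can be sketched as ordinary control flow. Everything here is illustrative: `review_and_publish` is a hypothetical helper, the three callables stand in for generate_image, ask_approval, and message, and the shape of the approval result (a dict with `approved`, `selected`, and `comment` keys) is an assumption, since this post does not specify what ask_approval returns.

```python
from typing import Callable, Optional


def review_and_publish(
    generate: Callable[[str, int], list],
    ask: Callable[[str, list], dict],
    publish: Callable[[list], None],
    prompt: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate candidates, gate publication on approval, retry on rejection."""
    for _ in range(max_rounds):
        candidates = generate(prompt, 3)  # creative work happens before the gate
        decision = ask("Review hero images", candidates)  # the approval gate
        if decision["approved"]:
            chosen = decision["selected"]
            publish([chosen])  # only approved media reaches the channel
            return chosen
        # Assumed behavior: fold reviewer comments into the next prompt.
        prompt = f"{prompt}. Reviewer feedback: {decision['comment']}"
    return None


# Stand-in tools for a dry run; the real tools are provided by the runtime.
published = []
replies = iter([
    {"approved": False, "comment": "less clutter"},
    {"approved": True, "selected": "/tmp/img2.png"},
])


def fake_generate(prompt, n):
    return [f"/tmp/img{i}.png" for i in range(1, n + 1)]


def fake_ask(title, media):
    return next(replies)


result = review_and_publish(
    fake_generate, fake_ask, published.extend, "hero image for a security post"
)
print(result, published)
```

In the dry run the first round is rejected with a comment, the second is approved, and nothing is published until the approval arrives.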

Direct Channel Delivery

Generated media doesn't just sit on disk. The message tool delivers it directly to any configured channel:

message(
    content="New blog post hero image, ready for review.",
    media=["/workspace/output/hero-image-v3.png"],
    channel="placet"
)

The media parameter accepts local file paths. The files are uploaded and attached to the message automatically. No manual upload step, no separate CDN, no copy-paste workflow. The agent creates, reviews, and delivers — all within the same tool surface.
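Because the media parameter takes local file paths, an agent might verify that every path exists before calling message. This is a small sketch of such a pre-flight check; `missing_media` is a hypothetical helper, not part of Facio's tool surface, and the second path in the demo is deliberately one that should not exist.

```python
import tempfile
from pathlib import Path


def missing_media(paths):
    """Return the subset of paths that do not point at an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Demo: one real temporary file and one path that is assumed not to exist.
with tempfile.NamedTemporaryFile(suffix=".png") as tmp:
    report = missing_media([tmp.name, "/workspace/output/does-not-exist.png"])

print(report)
```

An empty list would mean every attachment is present and the message call can proceed.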

Production Patterns

Pattern 1: Social Media Content Pipeline

1. Agent writes a blog post (text).
2. Agent generates 4 hero image variants for the post.
3. Agent presents the best 2 to a human via ask_approval.
4. Human selects variant #2.
5. Agent publishes the post with the selected image.
6. Agent generates platform-specific variants (square 1024x1024 for Instagram, landscape 1536x1024 for LinkedIn).
7. Agent delivers each variant to the appropriate channel.

One content pipeline, multiple image generations, human oversight at the decision point — not at every generation step.

Pattern 2: Automated Thumbnail Generation

A video publishing workflow:

1. Agent uploads a video to a hosting platform.
2. Agent extracts a key frame as a reference image.
3. Agent generates 3 thumbnail variants with different text overlays.
4. Agent runs A/B testing logic, for example comparing click-through rates across variants, to select the best thumbnail.
5. Agent assigns the winning thumbnail to the video.

The agent handles the entire thumbnail workflow without a designer. The generated thumbnails are programmatic — consistent, on-brand, and optimized for click-through.
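The post does not spell out what the A/B testing logic in step 4 looks like. One plausible reading is a click-through comparison, sketched below. The `pick_winner` function, the `min_impressions` threshold, and the statistics are all made up for illustration; a production system would likely also want a significance test before declaring a winner.

```python
def pick_winner(stats, min_impressions=100):
    """Return the variant with the highest click-through rate.

    Variants with fewer than min_impressions are ignored, since a rate
    computed from a handful of views is too noisy to trust.
    Returns None when no variant has enough data.
    """
    rates = {
        name: s["clicks"] / s["impressions"]
        for name, s in stats.items()
        if s["impressions"] >= min_impressions
    }
    if not rates:
        return None
    return max(rates, key=rates.get)


# Illustrative numbers only, not measurements.
stats = {
    "thumb-a": {"impressions": 1000, "clicks": 42},
    "thumb-b": {"impressions": 1000, "clicks": 57},
    "thumb-c": {"impressions": 40, "clicks": 9},
}
print(pick_winner(stats))
```

Here thumb-c has the highest raw rate but too few impressions to count, which is the case the threshold exists to handle.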

Pattern 3: Multi-Format Campaign Asset Generation

A product launch campaign needs assets across formats:

1. generate_image → hero banner (1536x1024, for blog)
2. generate_image → square post (1024x1024, for Instagram)
3. generate_image → vertical story (1024x1536, for Stories/Reels)
4. generate_video → 6-second teaser (1920x1080, for LinkedIn/X)
5. generate_video → 16-second walkthrough (1280x720, for YouTube)

Five assets, two tools, one agent — delivering a complete multi-format campaign from a single creative brief.
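One way to picture the campaign above is as a declarative plan that the agent expands into tool calls. This is a sketch, not Facio's API: `CAMPAIGN_PLAN` and `build_calls` are hypothetical, and passing `size` to generate_video is an assumption based on the resolution row of the parameter table.

```python
# The five assets from the list above, expressed as data.
CAMPAIGN_PLAN = [
    {"tool": "generate_image", "name": "hero banner", "size": "1536x1024"},
    {"tool": "generate_image", "name": "square post", "size": "1024x1024"},
    {"tool": "generate_image", "name": "vertical story", "size": "1024x1536"},
    {"tool": "generate_video", "name": "teaser", "size": "1920x1080", "duration": 6},
    {"tool": "generate_video", "name": "walkthrough", "size": "1280x720", "duration": 16},
]


def build_calls(brief, plan):
    """Turn one creative brief into (tool name, keyword arguments) pairs."""
    calls = []
    for asset in plan:
        kwargs = {"prompt": f"{brief} ({asset['name']})", "size": asset["size"]}
        if "duration" in asset:  # only video assets carry a duration
            kwargs["duration"] = asset["duration"]
        calls.append((asset["tool"], kwargs))
    return calls


calls = build_calls("Launch visuals for a new product", CAMPAIGN_PLAN)
for tool, kwargs in calls:
    print(tool, kwargs)
```

Keeping the plan as data means adding a sixth format is a one-line change rather than new agent logic.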

Provider Selection: When to Use Which

Different providers excel at different things, and the agent can route accordingly:

Provider	Best for	Trade-off
OpenAI (gpt-image-1)	Photorealistic, detailed compositions	Higher cost per image
Google Gemini	Fast iteration, integrated with Gemini's multimodal reasoning	Less control over fine details
Replicate (Flux)	Open-source models, community fine-tunes, lower cost	Requires Replicate account and credits
fal.ai (Flux)	Fast inference, good for batch generation	Credit-based pricing
OpenAI Sora	Cinematic video, complex scene composition	Higher cost, longer generation time
Google Veo	Fast video generation, good for short-form content	Less cinematic than Sora

The agent makes these routing decisions at runtime — just like it routes between LLM providers with switch_model. A content pipeline might use Flux for batch thumbnail generation (cheap, fast) and Sora for the hero video (high quality, worth the cost).
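A runtime routing decision like this can be as simple as a lookup keyed on asset type and priority. The sketch below encodes the trade-offs from the table above. `ROUTES` and `choose_model` are hypothetical names, and the specific pairings are one reading of the table, not a recommendation from Facio.

```python
# (asset kind, priority) -> model, following the trade-offs in the table above.
ROUTES = {
    ("image", "speed"): "fal-ai/flux/schnell",
    ("image", "quality"): "gpt-image-1",
    ("video", "speed"): "veo-3.0-generate-001",
    ("video", "quality"): "sora-2",
}


def choose_model(kind, priority):
    """Pick a model for an asset kind ("image"/"video") and a priority."""
    try:
        return ROUTES[(kind, priority)]
    except KeyError:
        raise ValueError(
            f"no route for kind={kind!r}, priority={priority!r}"
        ) from None


print(choose_model("image", "speed"))    # batch thumbnails
print(choose_model("video", "quality"))  # hero video
```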

Bottom Line

Text-only agents are half-agents. The internet runs on visuals — hero images, thumbnails, social cards, product photos, teaser videos, explainer animations. An agent that can't produce these is an agent that needs a human to do the visual half of every job.

Facio's generate_image and generate_video tools give agents the full creative surface: multi-provider, multi-format, image-to-image and image-to-video pipelines, HITL-gated publication, and direct channel delivery. The agent writes the prompt, generates the asset, gets human approval, and publishes — all in one workflow.

Because "write a blog post" shouldn't mean "write a blog post, then wait for a human to make the hero image." The agent should do both.

See the media generation documentation for provider configuration, model-specific parameters, and HITL review patterns.
