Facio's Media Generation Tools: How AI Agents Create Images and Videos Programmatically
Most AI agent platforms are text-only. They write blog posts, generate reports, analyze spreadsheets, and send messages — all in text. But the internet is visual. Social media posts need images. Product launches need hero graphics. Marketing campaigns need video. An agent that can only produce text is an agent that can only do half the job.
Facio's generate_image and generate_video tools close that gap — giving agents the ability to create visual content programmatically, across multiple providers, with HITL oversight and direct channel delivery. Here's how the architecture works and what it enables.
The Architecture: Provider-Agnostic, Multi-Model
Both tools follow the same provider-neutral pattern as switch_model: the agent specifies a prompt and optional parameters, and the runtime routes to the configured provider.
generate_image supports:
- OpenAI: gpt-image-1, dall-e-3
- Google Gemini: gemini-2.5-flash-preview-image-generation
- Replicate: black-forest-labs/flux-schnell, black-forest-labs/flux-dev, and any community model
- fal.ai: fal-ai/flux/schnell, fal-ai/flux/dev, and other fal-hosted models
generate_video supports:
- OpenAI Sora: sora-2, sora-2-pro
- Google Veo: veo-3.0-generate-001, veo-3.1-generate-001
- Replicate: Kling, Runway, Minimax, and other video models
- fal.ai: fal-ai/minimax-video/video-01 and similar
The agent doesn't need provider-specific code. It writes a prompt, picks a model (or lets the default apply), and the runtime handles authentication, API negotiation, and file delivery.
generate_image(
prompt="A professional hero image for a blog post about AI agent security, featuring a shield with circuit patterns on a dark blue background, clean tech aesthetic",
model="gpt-image-1",
size="1536x1024",
output_format="png"
)
The result is a file path in the workspace — ready for the agent to read, analyze, attach to a message, or pass through an HITL review.
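To make the routing idea concrete, here is a minimal sketch of how a runtime might map a model name to a provider. The `resolve_provider` function and its prefix table are hypothetical, inferred only from the model names listed above. Facio's actual routing logic is not shown in this article and may well be driven by configuration rather than name prefixes.

```python
# Hypothetical routing table: maps a model-name prefix to a provider label.
# The prefixes are inferred from the model names listed in this article;
# they are an illustration, not Facio's real implementation.
PROVIDER_PREFIXES = [
    ("gpt-image-", "openai"),
    ("dall-e-", "openai"),
    ("sora-", "openai"),
    ("gemini-", "google"),
    ("veo-", "google"),
    ("fal-ai/", "fal"),
]


def resolve_provider(model: str) -> str:
    """Return a provider label for a model name, or raise if none matches."""
    for prefix, provider in PROVIDER_PREFIXES:
        if model.startswith(prefix):
            return provider
    # Assumption: any remaining "owner/name" identifier is a Replicate model,
    # e.g. black-forest-labs/flux-dev.
    if "/" in model:
        return "replicate"
    raise ValueError(f"No provider known for model {model!r}")
```

The fal.ai prefix is checked before the generic "owner/name" rule because fal model names also contain a slash.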
Image-to-Image and Image-to-Video: Beyond Text Prompts
Both tools accept reference images as input — not just text prompts. This enables workflows that pure text-to-image generation can't handle:
Image-to-image generation:
generate_image(
prompt="Apply a cyberpunk aesthetic to this product photo, adding neon accents and a dark background",
image_url="https://example.com/product-photo.jpg",
model="fal-ai/flux/dev"
)
Image-to-video generation:
generate_video(
prompt="Smooth camera pan across this hero image with subtle particle effects",
image_url="https://example.com/hero-image.png",
model="sora-2",
duration=6
)
The reference image workflow is critical for brand consistency. An agent generating social media content can start from a brand-approved template image and generate platform-specific variants — rather than creating visuals from scratch that might drift from brand guidelines.
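The brand-template workflow can be sketched as a small helper that fans one approved reference image out into per-platform requests. The helper name `variant_requests` and the `platform` bookkeeping key are assumptions made for illustration; only `prompt`, `image_url`, and `model` come from the calls shown above.

```python
from typing import Dict, List


def variant_requests(template_url: str, model: str,
                     platform_prompts: Dict[str, str]) -> List[dict]:
    """Build one image-to-image request per platform from a single template."""
    requests = []
    for platform, prompt in platform_prompts.items():
        requests.append({
            "platform": platform,       # bookkeeping only, not a tool argument
            "prompt": prompt,
            "image_url": template_url,  # same brand-approved reference each time
            "model": model,
        })
    return requests
```

Each resulting dictionary (minus the `platform` key) could then be passed to generate_image, so every variant starts from the same reference rather than from a fresh text prompt.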
Resolution, Quality, and Format Control
Both tools expose the parameters that content production typically needs: size, quality, duration, output format, and batch size.
| Parameter | generate_image | generate_video |
|---|---|---|
| Size/resolution | 1024x1024, 1536x1024, 1024x1536, auto | 1280x720, 1920x1080 |
| Quality | low, medium, high, auto | N/A (duration-based) |
| Duration | N/A | 4, 6, 8, 16, 20 seconds |
| Output format | png, jpeg, webp | Provider-specific |
| Batch size (n) | 1-4 images per call | 1 video per call |
The agent can generate four image variants in one call, pick the best, and request a high-resolution export of the winner — all within the same tool surface.
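An agent or wrapper could check parameters against the table before spending a provider call. The sketch below is one way to do that under stated assumptions: the allowed values are copied from the table above, while the function names, the default values, and the `quality` keyword are hypothetical and may not match the real tool signature.

```python
from typing import List

# Allowed values copied from the parameter table in this article.
IMAGE_SIZES = {"1024x1024", "1536x1024", "1024x1536", "auto"}
IMAGE_QUALITIES = {"low", "medium", "high", "auto"}
IMAGE_FORMATS = {"png", "jpeg", "webp"}
VIDEO_SIZES = {"1280x720", "1920x1080"}
VIDEO_DURATIONS = {4, 6, 8, 16, 20}


def check_image_params(size: str = "auto", quality: str = "auto",
                       output_format: str = "png", n: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the parameters look valid."""
    problems = []
    if size not in IMAGE_SIZES:
        problems.append(f"unsupported size: {size}")
    if quality not in IMAGE_QUALITIES:
        problems.append(f"unsupported quality: {quality}")
    if output_format not in IMAGE_FORMATS:
        problems.append(f"unsupported output_format: {output_format}")
    if not 1 <= n <= 4:
        problems.append(f"n must be between 1 and 4, got {n}")
    return problems


def check_video_params(size: str, duration: int) -> List[str]:
    """Same idea for video: validate resolution and duration in seconds."""
    problems = []
    if size not in VIDEO_SIZES:
        problems.append(f"unsupported size: {size}")
    if duration not in VIDEO_DURATIONS:
        problems.append(f"unsupported duration: {duration}")
    return problems
```

Individual providers may accept only a subset of these values, so a real check would likely also depend on the selected model.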
HITL Integration: Human Review Before Publication
Visual content is high-stakes. A typo in a blog post is fixable. A generated image with mangled text or anatomical errors on a brand's social media feed is a reputational problem.
Facio's media tools integrate with the HITL pipeline through the same pattern as every other tool:
- Agent generates candidate images: generate_image(prompt="...", n=3)
- Agent attaches generated files to an approval request: ask_approval(title="Review hero images", media=["/path/to/img1.png", "/path/to/img2.png"])
- Human reviews the images in their Placet.io inbox.
- Human approves (or rejects, with comments for regeneration).
- Agent publishes the approved image to the target channel via message(media=["..."]).
The approval gate sits between generation and publication — not before generation. The agent does the creative work (writing prompts, generating variants, selecting candidates) without blocking. The human only reviews the final candidates.
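The generate, review, publish loop can be sketched as a plain function that takes the three steps as callables. Everything here is an assumption for illustration: the function name `review_then_publish`, the `max_rounds` limit, and in particular the shape of the approval result (a dictionary with `approved` and `comment` keys), which this article does not specify.

```python
from typing import Callable, List, Optional


def review_then_publish(
    generate: Callable[[str, int], List[str]],
    ask_approval: Callable[[List[str]], dict],
    publish: Callable[[str], None],
    prompt: str,
    n: int = 3,
    max_rounds: int = 2,
) -> Optional[str]:
    """Generate candidates, gate on human approval, publish only what was approved."""
    for _ in range(max_rounds):
        candidates = generate(prompt, n)       # no approval needed to generate
        decision = ask_approval(candidates)    # blocks until the human responds
        approved = decision.get("approved")    # hypothetical: chosen path or None
        if approved is not None:
            publish(approved)                  # publication happens after the gate
            return approved
        comment = decision.get("comment")
        if comment:
            # Fold reviewer feedback into the next generation round.
            prompt = f"{prompt}\nReviewer feedback: {comment}"
    return None  # nothing approved within the allowed rounds
```

Passing the steps in as callables keeps the gate logic testable with fakes, and makes it explicit that `publish` is unreachable without an approval.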
Direct Channel Delivery
Generated media doesn't just sit on disk. The message tool delivers it directly to any configured channel:
message(
content="New blog post hero image, ready for review.",
media=["/workspace/output/hero-image-v3.png"],
channel="placet"
)
The media parameter accepts local file paths. The files are uploaded and attached to the message automatically. No manual upload step, no separate CDN, no copy-paste workflow. The agent creates, reviews, and delivers — all within the same tool surface.
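Because the media parameter takes local file paths, a cautious agent might confirm that each file exists before calling message. The helper below is a hypothetical pre-flight check, not part of Facio's API; it only uses the Python standard library.

```python
from pathlib import Path
from typing import Iterable, List


def missing_media(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]
```

An empty result means every attachment is present on disk; a non-empty one tells the agent which files to regenerate before sending.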
Production Patterns
Pattern 1: Social Media Content Pipeline
1. Agent writes a blog post (text).
2. Agent generates 4 hero image variants for the post.
3. Agent presents the best 2 to a human via ask_approval.
4. Human selects variant #2.
5. Agent publishes the post with the selected image.
6. Agent generates platform-specific variants (1:1 for Instagram, 16:9 for LinkedIn).
7. Agent delivers each variant to the appropriate channel.
One content pipeline, multiple image generations, human oversight at the decision point — not at every generation step.
Pattern 2: Automated Thumbnail Generation
A video publishing workflow:
1. Agent uploads a video to a hosting platform.
2. Agent extracts a key frame as a reference image.
3. Agent generates 3 thumbnail variants with different text overlays.
4. Agent runs A/B testing logic to select the best thumbnail.
5. Agent assigns the winning thumbnail to the video.
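The agent handles the entire thumbnail workflow without a designer. Because every thumbnail is generated from the same reference frame and prompt template, the variants stay consistent with each other, and the A/B step selects the one that performs best on click-through.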
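The article does not describe the A/B testing logic in step 4. As one plausible reading, the sketch below picks the thumbnail with the highest click-through rate among those with enough impressions. This is a naive highest-CTR selection, not a statistical significance test, and the function name, the input shape, and the `min_impressions` threshold are all assumptions.

```python
from typing import Dict, Tuple


def pick_winner(stats: Dict[str, Tuple[int, int]], min_impressions: int = 100) -> str:
    """Pick the thumbnail with the highest click-through rate.

    stats maps a thumbnail path to (impressions, clicks). Thumbnails with
    fewer than min_impressions are ignored as too noisy to judge.
    """
    ctr = {
        path: clicks / impressions
        for path, (impressions, clicks) in stats.items()
        if impressions >= min_impressions
    }
    if not ctr:
        raise ValueError("no thumbnail has enough impressions yet")
    return max(ctr, key=ctr.get)
```

A production version would likely add a significance check before declaring a winner, since small differences in click-through rate can be noise.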
Pattern 3: Multi-Format Campaign Asset Generation
A product launch campaign needs assets across formats:
1. generate_image → hero banner (1536x1024, for blog)
2. generate_image → square post (1024x1024, for Instagram)
3. generate_image → vertical story (1024x1536, for Stories/Reels)
4. generate_video → 6-second teaser (1920x1080, for LinkedIn/X)
5. generate_video → 16-second walkthrough (1280x720, for YouTube)
Five assets, two tools, one agent — delivering a complete multi-format campaign from a single creative brief.
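The five-asset list above can be expressed as data and expanded into tool calls from a single brief. This is a sketch under assumptions: `CAMPAIGN_PLAN` and `plan_calls` are hypothetical names, and passing `size` to generate_video is inferred from the parameter table rather than shown in any call in this article.

```python
from typing import List, Tuple

# The five assets from the list above, expressed as data.
CAMPAIGN_PLAN = [
    {"tool": "generate_image", "name": "hero banner", "size": "1536x1024"},
    {"tool": "generate_image", "name": "square post", "size": "1024x1024"},
    {"tool": "generate_image", "name": "vertical story", "size": "1024x1536"},
    {"tool": "generate_video", "name": "teaser", "size": "1920x1080", "duration": 6},
    {"tool": "generate_video", "name": "walkthrough", "size": "1280x720", "duration": 16},
]


def plan_calls(brief: str, plan: List[dict] = CAMPAIGN_PLAN) -> List[Tuple[str, dict]]:
    """Expand one creative brief into (tool_name, kwargs) pairs, one per asset."""
    calls = []
    for asset in plan:
        kwargs = {"prompt": f"{brief} ({asset['name']})", "size": asset["size"]}
        if "duration" in asset:  # only video assets carry a duration
            kwargs["duration"] = asset["duration"]
        calls.append((asset["tool"], kwargs))
    return calls
```

Keeping the plan as data means a new format is a one-line addition rather than a new code path.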
Provider Selection: When to Use Which
Different providers excel at different things, and the agent can route accordingly:
| Provider | Best for | Trade-off |
|---|---|---|
| OpenAI (gpt-image-1) | Photorealistic, detailed compositions | Higher cost per image |
| Google Gemini | Fast iteration, integrated with Gemini's multimodal reasoning | Less control over fine details |
| Replicate (Flux) | Open-source models, community fine-tunes, lower cost | Requires Replicate account and credits |
| fal.ai (Flux) | Fast inference, good for batch generation | Credit-based pricing |
| OpenAI Sora | Cinematic video, complex scene composition | Higher cost, longer generation time |
| Google Veo | Fast video generation, good for short-form content | Less cinematic than Sora |
The agent makes these routing decisions at runtime — just like it routes between LLM providers with switch_model. A content pipeline might use Flux for batch thumbnail generation (cheap, fast) and Sora for the hero video (high quality, worth the cost).
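A routing decision like the one described here could be as simple as a lookup keyed on asset kind and purpose. The table below is illustrative only: the Flux-for-batch and Sora-for-hero-video pairings follow the example in the text, while the other two entries, the `purpose` labels, and the `choose_model` name are assumptions.

```python
# Hypothetical routing table keyed on (asset kind, purpose).
ROUTES = {
    ("image", "batch"): "fal-ai/flux/schnell",    # cheap, fast thumbnails
    ("image", "hero"): "gpt-image-1",             # assumed: detailed compositions
    ("video", "short"): "veo-3.0-generate-001",   # assumed: short-form content
    ("video", "hero"): "sora-2",                  # high quality, worth the cost
}


def choose_model(kind: str, purpose: str) -> str:
    """Look up a model for the given asset kind and purpose."""
    try:
        return ROUTES[(kind, purpose)]
    except KeyError:
        raise ValueError(f"no route for {kind!r} / {purpose!r}") from None
```

In practice the agent might weigh cost, latency, and quality dynamically, but a static table is often enough to start with and is easy to audit.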
Bottom Line
Text-only agents are half-agents. The internet runs on visuals — hero images, thumbnails, social cards, product photos, teaser videos, explainer animations. An agent that can't produce these is an agent that needs a human to do the visual half of every job.
Facio's generate_image and generate_video tools give agents the full creative surface: multi-provider, multi-format, image-to-image and image-to-video pipelines, HITL-gated publication, and direct channel delivery. The agent writes the prompt, generates the asset, gets human approval, and publishes — all in one workflow.
Because "write a blog post" shouldn't mean "write a blog post, then wait for a human to make the hero image." The agent should do both.
See the media generation documentation for provider configuration, model-specific parameters, and HITL review patterns.