---
title: Multimodal Capabilities
description: Send images, PDFs, audio, and video to Agnic AI Gateway models
---

# Multimodal Capabilities

Agnic AI Gateway accepts more than text. You can send images, PDFs, audio, and video files to compatible models through the same unified API, and generate images from text prompts with models that support image output.

## Supported Modalities

### Images

Send images to vision-capable models for analysis, description, OCR, and more. Agnic supports multiple image formats and both URL-based and base64-encoded images.

[Learn more about image inputs →](/docs/ai-gateway/multimodal/images)

### Image Generation

Generate images from text prompts using models with image output capabilities.

[Learn more about image generation →](/docs/ai-gateway/multimodal/image-generation)

### PDFs

Process PDF documents with compatible models. Agnic's PDF parser extracts the text and handles both text-based and scanned documents.

[Learn more about PDF processing →](/docs/ai-gateway/multimodal/pdfs)

### Audio

Send audio files to speech-capable models for transcription, analysis, and processing. Agnic supports common audio formats with automatic routing to compatible models.

[Learn more about audio inputs →](/docs/ai-gateway/multimodal/audio)

### Video

Send video files to video-capable models for analysis, description, object detection, and action recognition. Agnic supports multiple video formats.

[Learn more about video inputs →](/docs/ai-gateway/multimodal/video)

---

## Getting Started

All multimodal inputs go through the same `/v1/chat/completions` endpoint. Each input is a typed part in the `content` array of a message in the `messages` parameter:

| Modality | Content Type | Format |
|----------|--------------|--------|
| Images | `image_url` | URL or base64 data URL |
| PDFs | `file` | URL or base64 data URL |
| Audio | `input_audio` | Base64 encoded |
| Video | `video_url` | URL or base64 data URL |

You can combine multiple modalities in a single request. The number of files allowed per request varies by provider and model.
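
As a rough sketch of how these content types fit into one message, the helpers below build one part per modality and combine them. The `type` values come from the table above. The nested field names (`file.filename`, `file.file_data`, `input_audio.data`, `input_audio.format`, `video_url.url`) are assumptions modeled on OpenAI-style content parts, and the helper names are hypothetical, so check the modality-specific pages for the exact schema.

```python
from typing import Any

Part = dict[str, Any]


def text_part(text: str) -> Part:
    return {"type": "text", "text": text}


def image_part(url: str) -> Part:
    # `url` may be an https URL or a base64 data URL.
    return {"type": "image_url", "image_url": {"url": url}}


def pdf_part(url: str, filename: str = "document.pdf") -> Part:
    # Assumed shape: the nested keys are not confirmed by this page.
    return {"type": "file", "file": {"filename": filename, "file_data": url}}


def audio_part(b64_data: str, fmt: str = "wav") -> Part:
    # Audio is sent as a raw base64 string plus a format, not as a URL.
    # The nested keys are an assumption.
    return {"type": "input_audio", "input_audio": {"data": b64_data, "format": fmt}}


def video_part(url: str) -> Part:
    # Assumed to mirror the image_url shape.
    return {"type": "video_url", "video_url": {"url": url}}


def user_message(*parts: Part) -> Part:
    # One user message whose content array mixes several modalities.
    return {"role": "user", "content": list(parts)}


message = user_message(
    text_part("Summarize the document and describe the image."),
    image_part("https://example.com/image.jpg"),
    pdf_part("https://example.com/document.pdf"),
)
```

The resulting `message` can be passed in the `messages` list of a chat completion request, provided the selected model supports every modality it contains.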

---

## Model Compatibility

Not all models support every modality. Agnic automatically routes requests to compatible models:

| Modality | Required Capability | Example Models |
|----------|---------------------|----------------|
| Images | Vision models | GPT-4o, Claude 3, Gemini |
| PDFs | File-compatible | Claude 3, Gemini 1.5 Pro |
| Audio | Audio-capable | GPT-4o Audio, Gemini |
| Video | Video-capable | Gemini 2.0, GPT-4o |

Use the [Models API](/docs/ai-gateway/models) to find models that support your desired input modalities by checking the `architecture.input_modalities` field.
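
The check described above can be sketched as a small filter. This is an illustration under assumptions: it expects each model entry to be a dictionary with an `id` and an `architecture.input_modalities` list, and the modality strings (`"image"`, `"audio"`) and sample entries are hypothetical rather than real Models API output.

```python
from typing import Any


def models_supporting(models: list[dict[str, Any]], *modalities: str) -> list[str]:
    """Return the IDs of models whose input modalities include all requested ones."""
    wanted = set(modalities)
    matches = []
    for model in models:
        supported = set(model.get("architecture", {}).get("input_modalities", []))
        if wanted <= supported:
            matches.append(model["id"])
    return matches


# Hypothetical sample entries, for illustration only; not real API output.
sample_models = [
    {"id": "example/vision-model", "architecture": {"input_modalities": ["text", "image"]}},
    {"id": "example/text-model", "architecture": {"input_modalities": ["text"]}},
    {"id": "example/omni-model", "architecture": {"input_modalities": ["text", "image", "audio"]}},
]

print(models_supporting(sample_models, "image"))
# → ['example/vision-model', 'example/omni-model']
```

In practice, the `models` argument would be the list of entries returned by the Models API instead of the sample data.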

---

## Input Format Support

Agnic accepts multimodal inputs as **direct URLs** or **base64-encoded data**. Audio is the exception: it must be sent base64-encoded.

### URLs (Recommended for public content)

```
Images: https://example.com/image.jpg
PDFs:   https://example.com/document.pdf
Video:  https://youtube.com/watch?v=... (provider-specific)
```

### Base64 Encoding (Required for local files)

```
Images: data:image/jpeg;base64,{base64_data}
PDFs:   data:application/pdf;base64,{base64_data}
Audio:  Raw base64 string with format specification
Video:  data:video/mp4;base64,{base64_data}
```

<Callout type="tip">
  URLs are more efficient for large files as they don't require local encoding and reduce request payload size. Base64 encoding is required for local files or when the content is not publicly accessible.
</Callout>
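
For local files, the encoding step might look like the following minimal sketch. It uses only the Python standard library, and the helper names (`to_data_url`, `file_to_data_url`, `to_raw_base64`) are hypothetical. The data URL form applies to images, PDFs, and video, while audio uses the raw base64 string without the `data:` prefix.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode bytes as a base64 data URL (images, PDFs, video)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a local file and encode it, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {path!r}")
    return to_data_url(Path(path).read_bytes(), mime_type)


def to_raw_base64(data: bytes) -> str:
    """Encode bytes as a raw base64 string (audio), with no data: prefix."""
    return base64.b64encode(data).decode("ascii")


print(to_data_url(b"%PDF-1.4", "application/pdf"))
# → data:application/pdf;base64,JVBERi0xLjQ=
```

Base64 increases payload size by roughly a third, which is one reason URLs are the better choice for large, publicly accessible files.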

---

## Quick Example

```python
from openai import OpenAI

client = OpenAI(
    api_key="agnic_tok_YOUR_TOKEN",
    base_url="https://api.agnic.ai/v1"
)

# Send an image for analysis
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
```

---

## Pricing

Multimodal content is priced based on the modality:

- **Images**: Typically priced per image or as input tokens
- **PDFs**: Priced based on page count or as input tokens
- **Audio**: Priced as input tokens based on duration
- **Video**: Priced as input tokens based on duration and resolution

<Callout type="info">
  For current pricing on all models, visit [**AI Gateway Pricing**](https://app.agnic.ai/ai-gateway/pricing).
</Callout>

---

## Next Steps

<Cards>
  <Card title="Image Inputs" href="/docs/ai-gateway/multimodal/images" />
  <Card title="Image Generation" href="/docs/ai-gateway/multimodal/image-generation" />
  <Card title="PDF Processing" href="/docs/ai-gateway/multimodal/pdfs" />
  <Card title="Audio Inputs" href="/docs/ai-gateway/multimodal/audio" />
  <Card title="Video Inputs" href="/docs/ai-gateway/multimodal/video" />
</Cards>
