Streaming
Real-time streaming responses from the AI Gateway
Streaming Responses
Stream AI responses in real time for a better user experience. Instead of waiting for the complete response, your application receives tokens as they're generated.
Why Streaming?
- Better UX - Users see responses immediately
- Perceived speed - App feels faster and more responsive
- Cancel early - Stop generation if response isn't helpful
- Memory efficient - Process tokens as they arrive
Python Streaming
Basic Streaming
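A minimal sketch of synchronous streaming, assuming the gateway exposes an OpenAI-compatible chat completions endpoint and you're using the `openai` Python SDK. The base URL, model name, and environment variable are placeholders; substitute your gateway's actual values.

```python
import os

from openai import OpenAI

# Placeholder base URL and env var -- use your gateway's real endpoint and token.
client = OpenAI(
    base_url="https://gateway.example.com/v1",
    api_key=os.environ["GATEWAY_API_TOKEN"],
)

stream = client.chat.completions.create(
    model="gpt-4o",  # any model your gateway routes to
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,  # ask for incremental chunks instead of one full response
)

for chunk in stream:
    # Each chunk carries a small delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```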
Async Streaming
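The async variant under the same assumptions, using `AsyncOpenAI` and `async for` so other tasks can run while chunks are in flight:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder gateway URL
    api_key=os.environ["GATEWAY_API_TOKEN"],
)

async def main() -> None:
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain SSE briefly."}],
        stream=True,
    )
    # Each chunk is awaited as it arrives instead of blocking the event loop.
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(main())
```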
Collecting Full Response
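When you need the complete text as well as live output, accumulate the deltas as they stream. A sketch reusing the `client` from the basic example:

```python
parts: list[str] = []

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize streaming in two sentences."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        parts.append(chunk.choices[0].delta.content)

# Join once at the end instead of concatenating strings per token.
full_response = "".join(parts)
print(full_response)
```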
JavaScript Streaming
Basic Streaming
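A server-side sketch for Node 18+ using `fetch()` and simplified manual SSE parsing, with the same placeholder URL, model, and token as the Python examples:

```javascript
const response = await fetch("https://gateway.example.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.GATEWAY_API_TOKEN}`, // placeholder env var
  },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Tell me a short story." }],
    stream: true,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by blank lines; keep any partial event buffered.
  const events = buffer.split("\n\n");
  buffer = events.pop();

  for (const event of events) {
    const data = event.replace(/^data: /, "").trim();
    if (!data) continue;
    if (data === "[DONE]") continue; // end-of-stream sentinel
    const delta = JSON.parse(data).choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
}
```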
Browser (React Example)
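A hypothetical React component that appends streamed text to state. It calls a backend proxy route (`/api/chat`, a placeholder) that holds the gateway token server-side and relays plain text back to the browser:

```jsx
import { useState } from "react";

function StreamingChat() {
  const [output, setOutput] = useState("");

  async function ask(prompt) {
    setOutput("");
    const response = await fetch("/api/chat", { // your backend proxy route
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt }),
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // Append each decoded chunk to state so React re-renders incrementally.
      setOutput((prev) => prev + decoder.decode(value, { stream: true }));
    }
  }

  return (
    <div>
      <button onClick={() => ask("Tell me a short story.")}>Ask</button>
      <pre>{output}</pre>
    </div>
  );
}

export default StreamingChat;
```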
Never expose API tokens in browser code in production. Use a backend proxy instead.
cURL Streaming
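The same request from the command line, with placeholder URL, model, and token:

```bash
curl -N https://gateway.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GATEWAY_API_TOKEN" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }'
```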
The -N flag disables buffering for real-time output.
Stream Event Format
Each chunk follows the Server-Sent Events (SSE) format:
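The payloads below assume the OpenAI-style chunk shape implied by the `delta` and `finish_reason` fields discussed here; the values are illustrative.

```text
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```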
Chunk Structure
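Pretty-printed, a single mid-stream chunk looks roughly like this (field values are placeholders):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1700000000,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "delta": { "content": "Hello" },
      "finish_reason": null
    }
  ]
}
```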
Final Chunk
The last chunk has finish_reason set and an empty delta (again assuming the OpenAI-style shape above):
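```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1700000000,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop"
    }
  ]
}
```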
Handling Stream Events
Python with Events
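A sketch that dispatches on what each chunk carries: the initial role, content deltas, and the closing finish_reason (same `client` as in the earlier examples):

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue  # e.g. a trailing usage-only chunk, if the gateway sends one
    choice = chunk.choices[0]

    if choice.delta.role:
        print(f"[stream opened, role={choice.delta.role}]")
    if choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason:
        print(f"\n[stream finished: {choice.finish_reason}]")
```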
Canceling Streams
Python
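With the sync SDK, breaking out of the loop and closing the stream releases the connection so generation can stop upstream. A sketch with an arbitrary cutoff:

```python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a very long essay."}],
    stream=True,
)

received = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        text = chunk.choices[0].delta.content
        print(text, end="", flush=True)
        received += len(text)
    if received > 500:   # arbitrary cutoff for this sketch
        stream.close()   # close the HTTP connection early
        break
```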
JavaScript
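In JavaScript, an AbortController cancels the in-flight fetch, which closes the stream. Here the abort fires on a timer for demonstration; in a UI you would wire it to a Stop button. Same placeholder URL and token as above:

```javascript
const controller = new AbortController();
setTimeout(() => controller.abort(), 5000); // cancel after 5s for this sketch

try {
  const response = await fetch("https://gateway.example.com/v1/chat/completions", {
    method: "POST",
    signal: controller.signal,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GATEWAY_API_TOKEN}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: "Write a very long essay." }],
      stream: true,
    }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Raw SSE passthrough for brevity; parse as in the basic example.
    process.stdout.write(decoder.decode(value, { stream: true }));
  }
} catch (err) {
  if (err.name === "AbortError") {
    console.log("\nStream canceled.");
  } else {
    throw err; // real error: rethrow
  }
}
```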
Error Handling in Streams
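Streams can fail after partial output, so keep whatever has already arrived. A Python sketch using the SDK's exception types and a per-request timeout:

```python
import openai

parts: list[str] = []
try:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a short story."}],
        stream=True,
        timeout=30,  # don't wait forever for the next chunk
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
except openai.APIConnectionError:
    # The stream dropped mid-response; keep the partial text.
    print("\n[connection lost -- showing partial response]")
except openai.APIError as err:
    print(f"\n[gateway error: {err}]")

print("".join(parts))
```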
Best Practices
- Always handle partial responses - Streams can disconnect mid-response
- Implement timeouts - Don't wait forever for chunks
- Show loading state - Indicate when waiting for first chunk
- Buffer for display - Some UI frameworks work better with small batches than per-token updates (see the sketch after this list)
- Track usage - Final chunk may include token usage info
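For the buffering point above, a small JavaScript sketch that batches deltas and flushes to the UI at most ~30 times per second; `appendToUI` is a placeholder for your render function:

```javascript
let pending = "";
let flushScheduled = false;

function onDelta(delta, appendToUI) {
  pending += delta;
  if (!flushScheduled) {
    flushScheduled = true;
    setTimeout(() => {
      appendToUI(pending); // one UI update per batch, not per token
      pending = "";
      flushScheduled = false;
    }, 33); // ~30 fps; tune for your framework
  }
}
```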