DeepInfra

North America Est. 2022 Pay-per-token; no minimum commitment; free tier for low-volume use.

inference-providerserverlessopen-source

Serverless GPU inference cloud for open-weight LLMs with an OpenAI-compatible API.

What They Do

DeepInfra runs popular open-source models (Llama, Mistral, DeepSeek, Qwen, etc.) on managed GPU clusters and exposes them through a drop-in-compatible OpenAI REST API. Developers pay only for tokens generated — there is no server allocation or idle cost.

Mission

Make open-source AI inference affordable and highly available for every developer.

FAQ

: Yes — base URL is https://api.deepinfra.com/v1/openai and it supports the /chat/completions, /completions, and /embeddings endpoints.