Serverless GPU inference cloud for open-weight LLMs with an OpenAI-compatible API.
What They Do
DeepInfra runs popular open-source models (Llama, Mistral, DeepSeek, Qwen, etc.) on managed GPU clusters and exposes them through a drop-in-compatible OpenAI REST API. Developers pay only for tokens generated — there is no server allocation or idle cost.
Mission
Make open-source AI inference affordable and highly available for every developer.
FAQ
Yes — base URL is https://api.deepinfra.com/v1/openai and it supports the /chat/completions, /completions, and /embeddings endpoints.