Groq

North America Est. 2016 Pay-per-token; free tier with generous daily rate limits.

inference-providerhardwarefast-inference

AI chip company and cloud provider offering the world's fastest LLM inference via its proprietary LPU hardware.

What They Do

Groq designs Language Processing Units (LPUs) — custom silicon optimised for autoregressive token generation, achieving 300–800 tokens/second on 70B models. GroqCloud is the hosted inference API offering pay-per-token access to popular open models.

Mission

Democratise access to low-latency, cost-efficient AI compute through purpose-built chip technology.

Available Models

Model	Context	Input /M	Output /M
allam-2-7b	—	—	—
canopylabs/orpheus-arabic-saudi	—	—	—
canopylabs/orpheus-v1-english	—	—	—
groq/compound	—	—	—
groq/compound-mini	—	—	—
llama-3.1-8b-instant	—	—	—
llama-3.3-70b-versatile	—	—	—
meta-llama/llama-4-scout-17b-16e-instruct	—	—	—
meta-llama/llama-prompt-guard-2-22m	—	—	—
meta-llama/llama-prompt-guard-2-86m	—	—	—
openai/gpt-oss-120b	—	—	—
openai/gpt-oss-20b	—	—	—
openai/gpt-oss-safeguard-20b	—	—	—
qwen/qwen3-32b	—	—	—
qwen/qwen3.6-27b	—	—	—
whisper-large-v3	—	—	—
whisper-large-v3-turbo	—	—	—

FAQ

: A Language Processing Unit is Groq's custom ASIC designed specifically for inference workloads. Unlike GPUs optimised for parallelism, the LPU uses a deterministic dataflow architecture that eliminates memory stalls during token generation.
: Yes. Change the base URL to https://api.groq.com/openai/v1 and existing code works unchanged.