
AI Inference

The process by which a trained AI model generates responses — the production stage, where the model processes inputs and returns results.

What is Inference?

Inference is the process in which a trained AI model processes input data and produces a result (a response, classification, or prediction). This is the "using" stage of the model's lifecycle — as opposed to training, which creates it.
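To make the distinction concrete, here is a minimal sketch in plain Python. The weights and bias below stand in for parameters that training would have produced; inference is just the forward pass that applies them to new input. All names and values are illustrative assumptions, not any particular framework's API.

```python
# Inference = applying already-trained parameters to new input.
# The parameters below are assumed to come from a prior training run.

def predict(features, weights, bias):
    """Forward pass of a tiny linear classifier: returns 1 or 0."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0

# Hypothetical trained parameters (for illustration only)
weights = [0.8, -0.4]
bias = -0.1

result = predict([1.0, 0.5], weights, bias)  # → 1 (0.8 - 0.2 - 0.1 > 0)
```

Training would adjust `weights` and `bias`; inference never changes them — it only reads them, which is why it can be optimized and deployed separately.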

Inference costs and performance

In production, inference is typically the dominant AI cost: every query consumes tokens, and tokens are billed via the API. Common optimizations include quantization (reducing the model's numeric precision, e.g., from FP16 to INT8, which roughly halves memory and cost), batching (grouping queries to improve hardware utilization), speculative decoding, and KV caching.
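The quantization idea can be sketched in a few lines. This is a simplified symmetric INT8 scheme, assumed for illustration (production libraries use more sophisticated per-channel or calibrated variants): each float weight maps to an 8-bit integer via a single scale factor, halving storage relative to FP16.

```python
# Illustrative symmetric INT8 quantization (assumed scheme, not a
# specific library's implementation).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 representation."""
    return [q * scale for q in quantized]

weights = [0.52, -1.27, 0.003, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# restored approximates weights to within one quantization step (= scale)
```

The error introduced is bounded by half a quantization step per weight, which is why INT8 usually costs little accuracy while substantially cutting memory traffic — often the real bottleneck in inference.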

Local vs cloud inference

Local inference (on company servers) eliminates API fees and data-privacy concerns but requires GPU hardware and operations expertise. Cloud inference is flexible and scales on demand but incurs per-token costs and can raise compliance risks. Multi-tier routing combines both approaches, directing each query to the most appropriate tier.
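A multi-tier router can be sketched as a simple policy function. The function name, inputs, and routing rule below are assumptions for illustration — real routers might score query complexity with a classifier — but the structure is the same: sensitive data stays local, hard queries go to a larger cloud model.

```python
# Hedged sketch of multi-tier routing between local and cloud inference.
# The routing criteria here (PII flag, complexity score) are assumed
# examples, not a specific product's logic.

def route(query: str, contains_pii: bool = False,
          complexity: float = 0.5) -> str:
    """Choose an inference tier for a query."""
    if contains_pii:
        return "local"   # keep regulated data on-premises
    if complexity > 0.8:
        return "cloud"   # large hosted model for hard queries
    return "local"       # default: cheaper on-prem inference

route("summarize this contract", contains_pii=True)  # → "local"
route("multi-step planning task", complexity=0.9)    # → "cloud"
```

The design choice: the cheap, private tier is the default, and the expensive cloud tier is an explicit escalation — which keeps both the API bill and the compliance surface small.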
