
AI Inference

The process by which a trained AI model generates responses — the production stage, where the model processes inputs and returns results.

What is Inference?

Inference is the process in which a trained AI model processes input data and produces a result (a response, classification, or prediction). This is the "using" stage of the model's lifecycle — as opposed to training, which creates it.
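To make the distinction concrete, here is a minimal sketch in plain Python. The weights and bias below stand in for parameters that training would have produced; inference is just the forward pass that applies them to new input. All names and values are illustrative assumptions, not any particular framework's API.

```python
# Inference = applying already-trained parameters to new input.
# The parameters below are assumed to come from a prior training run.

def predict(features, weights, bias):
    """Forward pass of a tiny linear classifier: returns 1 or 0."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if score > 0 else 0

# Hypothetical trained parameters (for illustration only)
weights = [0.8, -0.4]
bias = -0.1

result = predict([1.0, 0.5], weights, bias)  # → 1 (0.8 - 0.2 - 0.1 > 0)
```

Training would adjust `weights` and `bias`; inference never changes them — it only reads them, which is why it can be optimized and deployed separately.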

Inference costs and performance

In production, inference is typically the dominant AI cost: every query consumes tokens, and tokens are billed via the API. Common optimizations include quantization (reducing the model's numeric precision, e.g., from FP16 to INT8, which roughly halves memory and cost), batching (grouping queries to improve hardware utilization), speculative decoding, and KV caching.
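The quantization idea can be sketched in a few lines. This is a simplified symmetric INT8 scheme, assumed for illustration (production libraries use more sophisticated per-channel or calibrated variants): each float weight maps to an 8-bit integer via a single scale factor, halving storage relative to FP16.

```python
# Illustrative symmetric INT8 quantization (assumed scheme, not a
# specific library's implementation).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the INT8 representation."""
    return [q * scale for q in quantized]

weights = [0.52, -1.27, 0.003, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# restored approximates weights to within one quantization step (= scale)
```

The error introduced is bounded by half a quantization step per weight, which is why INT8 usually costs little accuracy while substantially cutting memory traffic — often the real bottleneck in inference.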

Local vs cloud inference

Local inference (on company servers) eliminates API fees and data-privacy concerns but requires GPU hardware and operations expertise. Cloud inference is flexible and scales on demand but incurs per-token costs and can raise compliance risks. Multi-tier routing combines both approaches, directing each query to the most appropriate tier.
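A multi-tier router can be sketched as a simple policy function. The function name, inputs, and routing rule below are assumptions for illustration — real routers might score query complexity with a classifier — but the structure is the same: sensitive data stays local, hard queries go to a larger cloud model.

```python
# Hedged sketch of multi-tier routing between local and cloud inference.
# The routing criteria here (PII flag, complexity score) are assumed
# examples, not a specific product's logic.

def route(query: str, contains_pii: bool = False,
          complexity: float = 0.5) -> str:
    """Choose an inference tier for a query."""
    if contains_pii:
        return "local"   # keep regulated data on-premises
    if complexity > 0.8:
        return "cloud"   # large hosted model for hard queries
    return "local"       # default: cheaper on-prem inference

route("summarize this contract", contains_pii=True)  # → "local"
route("multi-step planning task", complexity=0.9)    # → "cloud"
```

The design choice: the cheap, private tier is the default, and the expensive cloud tier is an explicit escalation — which keeps both the API bill and the compliance surface small.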
