Inference Engineering
Inference engineering focuses on serving models efficiently once they are trained and deployed, covering inference acceleration, quantization, and serving infrastructure.
Contents:
- Local Inference Deployment — ONNX Runtime, model conversion, edge deployment
- vLLM — PagedAttention, continuous batching, high-throughput LLM serving (a quick usage sketch follows this list)
- KV Cache & Long Context — KV cache management, positional encoding, long context optimization
- Inference Quantization — GPTQ, AWQ, GGUF, FP8 quantization methods
- TensorRT-LLM & TGI — NVIDIA and HuggingFace inference engines
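As a quick taste of the vLLM entry above, the sketch below runs offline batch inference with vLLM's Python API. The model name is only a placeholder for illustration; swap in whichever checkpoint you actually serve.

```python
# Minimal offline-inference sketch with vLLM (placeholder model name).
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# The LLM class loads the model and manages the KV cache via PagedAttention.
llm = LLM(model="facebook/opt-125m")  # placeholder; use your deployed model

# generate() batches the prompts internally and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For production serving, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server rather than the offline API; the later sections cover that deployment path in more detail.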