Inference Engineering
Inference engineering focuses on serving models efficiently once they are trained and deployed, covering inference acceleration, quantization, and serving infrastructure.
Contents:
- Local Inference Deployment — ONNX Runtime, model conversion, edge deployment
- vLLM — PagedAttention, continuous batching, high-throughput LLM serving (a quick usage sketch follows this list)
- KV Cache & Long Context — KV cache management, positional encoding, long context optimization
- Inference Quantization — GPTQ, AWQ, GGUF, FP8 quantization methods
- TensorRT-LLM & TGI — NVIDIA and HuggingFace inference engines
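As a quick taste of the vLLM entry above, the sketch below runs offline batch inference with vLLM's Python API. The model name is only a placeholder for illustration; swap in whichever checkpoint you actually serve.

```python
# Minimal offline-inference sketch with vLLM (placeholder model name).
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve throughput?",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# The LLM class loads the model and manages the KV cache via PagedAttention.
llm = LLM(model="facebook/opt-125m")  # placeholder; use your deployed model

# generate() batches the prompts internally and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For production serving, the same engine is typically exposed through vLLM's OpenAI-compatible HTTP server rather than the offline API; the later sections cover that deployment path in more detail.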