Overview
Inference is the 'production' phase of AI: while training is about learning patterns from data, inference is about applying that learned model to new, real-world inputs to produce predictions.
Optimization
Inference needs to be fast and efficient. Techniques such as quantization (storing weights at lower numerical precision, e.g. 8-bit integers instead of 32-bit floats) and pruning (removing weights that contribute little to the output) shrink models so they run faster during inference on devices like smartphones or edge nodes.
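To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit post-training quantization using NumPy. The function names and the per-tensor scaling scheme are illustrative assumptions, not the API of any particular framework; production toolchains typically use per-channel scales and calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale (illustrative)."""
    scale = np.max(np.abs(weights)) / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at a small accuracy cost:
print(q.nbytes / w.nbytes)                       # 0.25
print(float(np.max(np.abs(w - w_hat))) < scale)  # True: error bounded by one step
```

The trade-off is visible directly: memory drops by 4x, and the reconstruction error per weight is bounded by the quantization step, which is why accuracy usually degrades only slightly.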
Cost
For large-scale applications, the compute cost of serving inference can dominate total operating cost, since it is paid on every request rather than once during training. This has driven the development of specialized inference accelerators such as Google's TPUs and AWS Inferentia.
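A back-of-envelope calculation shows why inference cost scales with traffic. All figures below (request volume, per-request GPU time, hourly rate) are hypothetical placeholders, not real benchmarks or pricing.

```python
# Hypothetical inputs for a rough inference cost estimate.
requests_per_day = 10_000_000     # assumed daily traffic
gpu_seconds_per_request = 0.05    # assumed GPU time per request
gpu_cost_per_hour = 2.50          # assumed cloud rate, USD

gpu_hours_per_day = requests_per_day * gpu_seconds_per_request / 3600
daily_cost = gpu_hours_per_day * gpu_cost_per_hour

print(round(gpu_hours_per_day, 1))  # 138.9 GPU-hours/day
print(round(daily_cost, 2))         # 347.22 USD/day
```

Because the cost is linear in request volume, halving per-request GPU time (through quantization, batching, or dedicated hardware) halves the daily bill, which is the economic motivation behind specialized inference chips.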