Model evaluation is the process of measuring how well an AI or machine learning model performs, using metrics, benchmarks, and test sets to judge accuracy, robustness, and fitness for a task. It is how teams decide whether a model is good enough to ship and whether a change actually improved it.
Evaluation assumes the underlying data is stable. If the same model is scored against a different data state on each run, a swing in the metric cannot be attributed to any one cause: was it the model, the prompt, or an input state that quietly moved?
Reliable evaluation therefore has a precondition that comes before any metric: the data state a model ran against has to be reproducible, so that a new result can be compared with a past one on equal terms rather than guessed at.
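One way to make that precondition concrete is to record a fingerprint of the evaluation data next to every score, and to refuse a comparison when the fingerprints differ. The sketch below is a minimal illustration under stated assumptions, not a prescribed method: the names `fingerprint`, `EvalResult`, `evaluate`, and `accuracy_delta` are hypothetical, the records are assumed to be JSON-serializable dicts with `input` and `label` keys, and accuracy stands in for whatever metric a team actually uses.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(records):
    """Return a SHA-256 hex digest of the records in a canonical form."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the serialization independent of dict key order.
        line = json.dumps(record, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass(frozen=True)
class EvalResult:
    model_name: str
    accuracy: float
    data_fingerprint: str  # identifies the data state the score was computed on


def evaluate(model_name, predict, records):
    """Score predict() on records and tag the score with the data fingerprint."""
    if not records:
        raise ValueError("cannot evaluate on an empty test set")
    correct = sum(1 for r in records if predict(r["input"]) == r["label"])
    return EvalResult(model_name, correct / len(records), fingerprint(records))


def accuracy_delta(baseline, candidate):
    """Return candidate minus baseline accuracy, only if both used the same data."""
    if baseline.data_fingerprint != candidate.data_fingerprint:
        raise ValueError("results were scored against different data states")
    return candidate.accuracy - baseline.accuracy
```

In this sketch the fingerprint depends on record order, so the same records in a different order count as a different data state. That is a deliberate simplification; a team that treats order as irrelevant could sort the serialized lines before hashing.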