Beyond Benchmarks: Evaluating AI the Way Operators Actually Use It