How AI Gets Tested: Benchmarks, Scores, and What They Actually Measure