Current evaluation benchmarks suffer from the following problems:
The need for automated scoring forces a multiple-choice format, setting up the task for predictive solutions
Evaluation does not take into account what specific curated knowledge was used or the steps taken to produce the answers
They are rendered obsolete when researchers tune their systems to the benchmark without actually building the capability the benchmark was designed to test
The panelists will address the following questions:
Is there a way for us to design tests that address the above problems?
If automated scoring/cost were not an issue, what tests could we use?
What are natural tests that also take into account the computation performed by the system?
Could there be tests tailored to qualitative reasoning and reasoning about actions?
Ernie Davis has a nice survey of benchmarks in which he outlines several desiderata for good benchmarks and analyzes numerous problems with existing ones. The panel will be informed by this prior survey.