Interactive Math Evals

Interactive Evaluation of AI-Based Mathematics Assistants

Project Co-Leads: Katie Collins and Albert Jiang

This survey has been approved by the University of Cambridge Dept of Computer Science Ethics Committee

Welcome!

In this study, you will be helping evaluate state-of-the-art AI systems at theorem proving. In particular, you will evaluate how effective these systems can be as *mathematical assistants* through multi-step interactions.

If you already know how to solve a problem easily, please evaluate these systems from the perspective of how they *could* assist an *undergraduate-level* mathematician who does not immediately know how to solve the problem.

Your Task

You will get to interact with three different AI systems. You will interact with each model on some problem (from a topic of mathematics that you select, e.g., algebra or group theory) and then rank the systems by preference. You can repeat that process (interacting with each model on a problem, then ranking the systems) for up to three iterations. You can stop whenever you'd like.

The rating process for each model, for each problem, will be as follows:

First, you will indicate how confident you are that you could currently solve the problem by yourself, without any assistance.
You will then be presented with a chat box to interact with the model.
Here, please imagine how you would interact with the model to get its help in solving the task.
- If you already know how to solve the task, please imagine what kind of behavior you would have liked in a system when you were first approaching this problem, e.g., if you were an undergraduate mathematics student who did not immediately know how to solve the problem.
When you feel you are done interacting, you will rate each step of the interaction along two dimensions:
  - 1) You will score whether the generated text from the AI chatbot was more helpful or harmful towards assisting you in solving the problem;
  - 2) You will score whether the generated text was mathematically correct in response to your query (that is, if it contained any mathematical information).

Once you've repeated the above for all three models (you'll rate models on different problems), you will be asked to rank the models by how much you would prefer to use each as a mathematical assistant. You can then begin again with another round of interacting with the models and ranking them (the order of the models will be shuffled).

Age restriction

You must be 18 years old or above to participate in the study.

Privacy and Data Usage

We very much value your privacy! We will not save any identifying information, beyond your self-reported level of mathematical expertise and how much you have played with interactive AI systems (to help us ground the evaluations).

Please note that the interaction traces and your ratings will be saved. We plan to release them anonymously (tagged only with the self-report survey, per above) and openly for other researchers. The interaction traces, but not the ratings, might also be retained by the companies who provide the models. Please do not participate if you are not comfortable with those data sharing procedures.

Please note that in this study, you will be interacting with live AI systems. We do not know what they will generate. While we intend for all generations to stay in the realm of mathematics and theorem proving, it is possible that harmful language could be generated. You are welcome to leave the study at any time.

Your participation in this research is voluntary. You may discontinue participation at any time during the survey.

Please only proceed to the study if you are comfortable with the above, and acknowledge that you wish to participate in this research.

If you are comfortable with all of the above -- please read the instructions closely! -- then we welcome your participation in our benchmarking.

Please participate at the following link:

LINK

Please do not share this link externally.

Note, there may be a slight delay as the site is loading. When many people are interacting with the system, there may also be a delay per generation. This is because we're using sizable LLMs!

Thank you for your time!

Contact: kmc61@cam.ac.uk and qj213@cam.ac.uk