Team: Omar Fida, Noah Comabatir, Cyrus Yelverton, Suhail Ali, Nathaniel Oladele
Faculty/ Graduate Students: Huy Nghiem, Sandra Sandoval, Lingjun Zhao
Our approach:
To answer our question, we first picked twenty random riddles from RiddleSense. We then created a survey on Amazon Mechanical Turk (MTurk) containing these twenty riddles and sent it to three different people. Each respondent was given five choices per riddle and was required to pick one, giving us sixty responses in total (three respondents times twenty riddles).
We also tested three different programs: the ChatGPT API, RoBERTa, and our own random answer generator. Each of these programs produced varied and interesting results; a sketch of the random answer generator, the simplest of the three, appears below.
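As a sketch, a random answer generator can ignore the riddle entirely and pick one of the five lettered choices at random. The interface below (a riddle plus its choice texts, returning a letter) is an assumption we reuse in the later sketches, and the fixed seed is just for repeatable runs; none of this is the exact project code.

```python
import random

rng = random.Random(42)  # fixed seed so runs are repeatable

def random_baseline(riddle, choices):
    # Ignore the riddle text and pick one of the lettered slots at random.
    # With five choices, expected accuracy is 20%.
    return rng.choice("ABCDE"[:len(choices)])

print(random_baseline("What has keys but opens no locks?",
                      ["a door", "a piano", "a map", "a lock", "a car"]))
```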
RiddleSense is a project developed by five researchers at USC to study how language models answer riddles. It is a collection of over 5,700 riddles, each with five answer choices, and it is currently the largest riddle dataset.
RiddleSense is the source of all the riddles we used during our tests on both humans and machines.
Earlier, we mentioned MTurk. MTurk is a crowdsourcing platform developed by Amazon, used in situations where it is more economical to have humans do a task than machines.
In our case, we used MTurk to survey three people on the Internet, each answering twenty riddles. It was the only part of the project where we used HTML rather than Python.
We also mentioned the usage of ChatGPT, a now-famous program. ChatGPT is a highly versatile piece of software, used by many different people for many different purposes. Users can enter questions, prompts, or other kinds of input, and ChatGPT will generate a response. It has been trained on a large collection of data and also uses its conversations with humans to gather more information and improve its performance.
In our project, we used ChatGPT as one of three computer tests to see how programs compare to humans when it comes to answering riddles; a sketch of how a riddle can be sent to the API appears below.
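Here is a minimal sketch of sending one riddle to the ChatGPT API, assuming the v1 OpenAI Python client; the prompt wording and the model name are illustrative assumptions, not necessarily what we ran.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def ask_chatgpt(riddle, choices):
    # Format the five choices as lettered options and ask for one letter back.
    options = "\n".join(f"({label}) {text}" for label, text in zip("ABCDE", choices))
    prompt = (f"Riddle: {riddle}\n{options}\n"
              "Reply with only the letter of the best answer.")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; the exact model is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```

Asking for a single letter makes the reply easy to compare against the answer key later, though in practice the reply may need light cleanup (stripping stray punctuation, say) first.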
What is RoBERTa?
RoBERTa is a language model: a powerful artificial intelligence system that excels at understanding and generating human-like text. It is an improved version of its predecessor, BERT, built on similar principles but trained on even more data for a more comprehensive understanding of language. RoBERTa has been pre-trained on a massive amount of text from the Internet, making it capable of handling a variety of natural language processing tasks, such as comprehension, summarization, and question answering. Its efficiency and accuracy make it a popular choice for many language-related AI applications.
We used RoBERTa alongside ChatGPT and our random answer generator to see how computers respond to and answer riddles compared to humans; a sketch of how RoBERTa can score a riddle's five choices appears below.
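The sketch below shows one standard way to get a multiple-choice answer out of RoBERTa with the Hugging Face transformers library: score the riddle paired with each choice and keep the highest-scoring one. The "roberta-base" checkpoint is a placeholder; in practice a checkpoint fine-tuned for multiple-choice question answering is needed for better-than-random picks.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")

def roberta_answer(riddle, choices):
    # Pair the riddle with each of the five choices and tokenize the pairs.
    encoding = tokenizer([riddle] * len(choices), choices,
                         return_tensors="pt", padding=True, truncation=True)
    # The multiple-choice head expects shape (batch, num_choices, seq_len).
    inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_choices)
    # Return the letter of the highest-scoring choice.
    return "ABCDE"[logits.argmax(dim=-1).item()]
```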
Our Action Plan:
Step 1: Prepare Our Data
Import the RiddleSense dataset and the libraries we need (NumPy, Pandas, OpenAI).
Using MTurk, survey workers to collect human answers to these riddles. With this information, we will be able to evaluate human performance against the AI. (A sketch of the data-loading side of this step appears below.)
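A minimal sketch of the data loading, assuming the riddles have been exported to a local JSONL file; the file name is an assumption about how the download is stored.

```python
import pandas as pd

# Load the RiddleSense questions from a local export.
riddles = pd.read_json("riddlesense_dev.jsonl", lines=True)

# Sample twenty riddles with a fixed seed so the MTurk survey and the
# machine tests see exactly the same questions.
survey_riddles = riddles.sample(n=20, random_state=0)
```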
Step 2: Feature Engineer Our Data
Manipulate the rows and columns of the RiddleSense data to make it accessible to our models (ChatGPT, RoBERTa, RNG); a flattening sketch appears below.
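As a sketch of that manipulation, assuming RiddleSense's CommonsenseQA-style records (a question stem, five labeled choices, and an answer key), each nested record can be flattened into one row per riddle. The toy record below only illustrates the assumed layout.

```python
import pandas as pd

def flatten(record):
    # Turn one nested record into a flat row: stem, answer key,
    # and one column per lettered choice.
    row = {"stem": record["question"]["stem"],
           "answer": record["answerKey"]}
    for choice in record["question"]["choices"]:
        row[f"choice_{choice['label']}"] = choice["text"]
    return row

# A toy record in the assumed layout, standing in for the real dataset.
records = [{"answerKey": "B",
            "question": {"stem": "What has keys but opens no locks?",
                         "choices": [{"label": "A", "text": "a door"},
                                     {"label": "B", "text": "a piano"},
                                     {"label": "C", "text": "a map"},
                                     {"label": "D", "text": "a lock"},
                                     {"label": "E", "text": "a car"}]}}]
flat = pd.DataFrame(flatten(r) for r in records)
print(flat)
```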
Step 3: Train/Test Our Models
Feed our riddles through our models, storing the answers each model returns in a list of responses to compare against an answer key (a sketch of this loop appears below).
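A sketch of that loop, reusing the flat rows from the Step 2 sketch and any one of the three answerers sketched earlier (ask_chatgpt, roberta_answer, or random_baseline):

```python
def collect_answers(model_fn, riddles_df):
    # Run one answerer over every riddle, keeping the picks in order
    # so they line up index-for-index with the answer key.
    responses = []
    for _, row in riddles_df.iterrows():
        choices = [row[f"choice_{label}"] for label in "ABCDE"]
        responses.append(model_fn(row["stem"], choices))
    return responses

# For example, with the random baseline from earlier:
# rng_preds = collect_answers(random_baseline, flat)
```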
Step 4: Evaluation
Evaluate the accuracy of each model. Then compare the models' results to the human results and examine why the differences between models appear (a scoring sketch appears below).
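A sketch of the scoring, where each prediction list is a placeholder standing for the output of collect_answers() in Step 3 and answer_key holds the gold letters for the twenty riddles; the variable names just tie the earlier sketches together.

```python
def accuracy(predictions, answer_key):
    # Fraction of predicted letters that match the gold letters.
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

answer_key = list(flat["answer"])  # gold letters from Step 2's flat table
for name, preds in [("ChatGPT", gpt_preds), ("RoBERTa", roberta_preds),
                    ("Random", rng_preds)]:
    print(f"{name}: {accuracy(preds, answer_key):.0%}")
```

The same accuracy() call works on the human answers from the MTurk survey, which is what makes the head-to-head comparison straightforward.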