ArchReasoning Challenge
Testing the Limits of LLM Reasoning in Computer Architecture and Systems
Objective:
The goal of this competition is to evaluate the reasoning capabilities of modern large language models (LLMs) in computer architecture and systems. Participants will submit challenging, reasoning-based questions, and the best questions will be tested against top AI reasoning models. The objective is to identify questions where AI fails, revealing the limits of LLM reasoning in computer architecture and systems.
Submission Guidelines:
We will only evaluate 30 questions per participant. If you submit more than 30, we will randomly select 30 for evaluation, and the rest will not be considered. Choose your best questions carefully to maximize your chances of winning.
Note: We will use a text-embedding mechanism to detect substantially similar questions across all submissions. When similar questions are identified, whether from the same participant or from different participants, only the earliest submission will be considered, and duplicates may be filtered out by the judges.
Note: Submissions derived from exams, quizzes, or other copyrighted materials are acceptable only if you have obtained all necessary approvals and strictly adhere to your institution's licensing policies. We reserve the right to disqualify winners if it is later discovered that any submitted questions violate these policies.
To ensure quality and accountability, you must submit your affiliation (e.g., academic institutions, research labs, or industry organizations) for your submission to be accepted.
Questions must meet the following criteria:
Clear and Concise: Formulate a clear, concise, and complex reasoning-based question related to computer architecture and systems.
Computer Architecture Specific: The question must be deeply rooted in architecture concepts (e.g., pipelining, branch prediction, cache behavior, or your own research).
Reasoning-Based: The question must require logical reasoning, step-by-step deduction, or multi-concept synthesis, rather than factual recall.
Non-Trivial: The question should not be directly answerable from standard textbooks or documentation without reasoning.
Accompanied by a Rationale: Provide the correct answer in the rationale textbox, along with a detailed, step-by-step explanation of why it is correct. Include citations to relevant references (e.g., textbooks, research papers, blogs).
AI-Generated Questions: AI-generated questions are discouraged and may be removed at the judges’ discretion.
Why Participate?
Test the limits of AI reasoning in computer architecture.
Engage in a thought-provoking challenge of designing tough reasoning problems.
Earn recognition as an expert in designing hard architecture questions.
Top contributors will be acknowledged at MLArchSys 2025 and may have their questions featured in a future research paper or benchmark.
🏆 Winners of the competition will receive a monetary award at the workshop.
🏁 Evaluation Process:
Filtering Stage: Judges review submitted questions against the following criteria. Questions that do not meet the criteria will be filtered out.
Technical Accuracy: Is the question based on sound computer architecture principles?
Clarity: Is the question clear, unambiguous, and well-written?
Relevance: Does the question genuinely require reasoning, or can it be answered by simple lookup? Does it test logical deduction and multi-step synthesis, rather than exploiting known weaknesses or prompting failures in LLMs?
Rationale Quality: Is the rationale complete, correct, and convincing?
Duplicate and Redundant Questions: Does the submitted question significantly overlap with another in content, structure, or required reasoning? If multiple questions test nearly identical concepts or reasoning paths, only the best-phrased and most challenging one will be selected.
Diversity of Thought: Do the submissions test a broad range of reasoning skills and architectural concepts?
Difficulty: Would an established researcher or industry veteran, with access to arbitrary computing and internet resources, require more than one hour to solve this problem? If not, is it truly difficult enough to challenge LLM reasoning?
Practicality: Does the question reflect real-world challenges in computer architecture and systems? While difficult questions are encouraged, they should not be artificially complex or unrealistic solely to maximize LLM failure rates. Questions should be grounded in practical, meaningful reasoning tasks relevant to the field.
LLM Evaluation Stage:
The accepted questions will be tested against a few LLMs with reasoning capabilities.
The names of the LLMs will not be disclosed to participants to mitigate bias.
The goal is to assess whether the models can reason through the questions correctly.
Feel free to test your questions with any publicly accessible models before submission. However, once submitted, you cannot modify them.
Scoring & Winner Selection:
Category 1: "Total Defeat": Questions that all LLMs fail to answer correctly. The participant with the most such questions wins.
Category 2: "Persistent Challenge": Scored by the average failure rate across the reasoning models. The participant whose accepted questions (after filtering) have the highest average failure percentage wins. Scores are averaged over several samples per question.
The tie-breaking decision is made by the judges based on the quality and diversity of the questions.
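As an illustrative sketch (not the official scoring code), the Category 2 score described above can be computed as a mean of per-question failure rates, where each question's rate is itself averaged over several samples; all names here are made up for illustration:

```python
# Hypothetical sketch of the Category 2 ("Persistent Challenge") score:
# average the failure rate over several samples per question, then over
# all of a participant's accepted questions.

def category2_score(results):
    """results: one list per question, where each entry is 1 if the model
    failed that sample and 0 if it answered correctly."""
    per_question = [sum(samples) / len(samples) for samples in results]
    return sum(per_question) / len(per_question)

# Example: two questions, three samples each.
# Question 1 failed in 2/3 samples, question 2 failed in all 3.
print(category2_score([[1, 1, 0], [1, 1, 1]]))  # ≈ 0.83
```

This ordering (per-question first, then across questions) keeps each question equally weighted regardless of how many samples it received.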
📑 Submission Instructions:
We use the QuArch infrastructure (https://quarch.ai/) to collect your question submissions. Please follow these steps carefully:
Sign in to QuArch
Go to https://quarch.ai/mlarchsys/.
Sign in with Gmail or GitHub to create an account.
Make sure to update your Name and Affiliation when creating your account so you receive credit.
Write Your Question
Craft a clear, detailed, and reasoning-based question related to computer architecture and systems.
Ensure the question requires logical deduction or multi-step reasoning, rather than factual recall.
Provide all clarifying details that an architect would need to answer the question accurately.
Note: You may attach images to provide additional clarifications.
Provide Answer Options
Submit exactly four multiple-choice answers (one correct and three distractors).
Ensure all options are reasonable and technically valid to avoid obvious eliminations.
Note: The options may or may not be shown to our reasoning models during evaluation.
Select the Correct Answer
Clearly mark the correct answer among the four options.
Write a Detailed Rationale
Explain why the correct answer is right and why the other options are incorrect.
Use a step-by-step format to break down the reasoning process.
Provide citations to relevant sources (e.g., textbooks, research papers, technical blogs) if applicable.
Note: You may attach images to provide additional clarifications.
Submit Your Question
Review your submission for clarity and completeness.
Click “Submit” to finalize your question in the competition database.
Examples of Accepted vs. Disqualified Questions:
🚫 Fact-Based (Simple Recall)
In ___, each processor has its own local memory system.
(a) symmetric multiprocessing
(b) asymmetric multiprocessing
(c) core-based multiprocessing
(d) clustered multiprocessing
This is a factual question that can be answered in one shot, by simple recall rather than reasoning.
✅ Reasoning-Required (Multi-Step Deduction)
Assume a DRAM system with a burst size of 256 bytes and a peak bandwidth of 240 GB/s. Assume a thread block size of 256 and warp size of 32 and that A is a float array in the global memory. What is the maximal memory data access throughput we can hope to achieve in the following access to A?
int i = 4*blockIdx.x * blockDim.x + threadIdx.x;
float temp = A[i];
(A) 240 GB/s
(B) 120 GB/s
(C) 60 GB/s
(D) 30 GB/s
This question requires (complex) multi-step reasoning with intermediate steps and an understanding of the relationships between burst size, block size, warp size, memory bandwidth, etc.
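One way to sanity-check the reasoning such a question demands is to simulate it. The sketch below is an unofficial illustration, not an answer key: it enumerates the byte addresses the access pattern touches and measures what fraction of each fetched 256-byte burst is actually used, assuming a 4-byte float, blockDim.x = 256, burst-aligned arrays, and whole-burst fetches:

```python
# Illustrative sketch (not the official answer key): simulate which bytes
# the access  A[4*blockIdx.x*blockDim.x + threadIdx.x]  touches, and what
# fraction of each fetched 256-byte DRAM burst those bytes cover.
# Assumptions: float = 4 bytes, blockDim.x = 256, A starts at a burst
# boundary, and any burst containing a touched byte is fetched in full.

BURST = 256        # bytes per DRAM burst
BLOCK_DIM = 256    # threads per block
FLOAT_BYTES = 4

def burst_utilization(num_blocks):
    touched = set()
    for block in range(num_blocks):
        for tid in range(BLOCK_DIM):
            i = 4 * block * BLOCK_DIM + tid   # index expression from the question
            touched.add(i * FLOAT_BYTES)      # start byte of each 4-byte load
    # Every burst containing at least one touched byte is fetched whole.
    bursts = {addr // BURST for addr in touched}
    used_bytes = len(touched) * FLOAT_BYTES
    fetched_bytes = len(bursts) * BURST
    return used_bytes / fetched_bytes

print(burst_utilization(8))  # 1.0
```

Under these assumptions each block touches a contiguous, burst-aligned 1024-byte region, so every fetched burst is fully used; whether that conclusion carries over to the intended answer depends on the full system model the question assumes, which is exactly the kind of detail a good rationale should pin down.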
High-level questions that we think may be a good fit for this challenge:
Implement an arbiter in SystemVerilog with the given signature, such that it meets a specific loose fairness constraint and a particularly tight timing constraint.
Suppose there is a cache with given capacity/associativity/latency properties. Calculate the cache hit rate of this C++ program.
Perform P&R for this SystemVerilog module, with access to some open-source P&R tools, where the tool will fail to fit in the allotted area on the first naive pass.
Calculate the optimal layout and roofline performance for DeepSeek R1 on a cluster of DGX H100s given these specs.
Given this small C++ simulator harness and various options to adjust cache hierarchy design within area constraints, optimize performance for this MxNxK matmul shape.
If I pipeline my level-1 data cache, what is a possible consequence?
> You can draw inspiration from existing public university quizzes in computer architecture, such as Part B (Complex Pipelining).
> Consider exploring similar datasets, for example the AIME sample problems and answers, as a source of inspiration.
System Prompt:
We will use a similar system prompt for evaluating your submitted questions:
Act as an expert in computer architecture and system design. You have been asked the following question. Provide your answer with detailed rationale.
{Question}
⏳ Timeline:
Submission Deadline: May 2, 2025 (no extensions)
Results Announced: Jun 21, 2025
🏆 Awards and Recognition:
The top winners will be recognized at MLArchSys 2025 (co-located with ISCA 2025) and awarded monetary prizes.
The best questions and analysis may be incorporated into a research paper, with contributors acknowledged as co-authors.