Human beings have long recognized that aggregating intelligence can outperform relying on a single decision-maker. From juries and committees to prediction markets and crowdsourcing, collective judgment often exhibits higher accuracy, stability, and resistance to individual biases. This intuition that many imperfect views can combine into a stronger whole has been deeply ingrained in human decision-making for centuries.
In the field of Machine Learning, this intuition has been formalized through ensembling. Techniques such as bagging, boosting, and random forests demonstrate that combining multiple weak or noisy predictors can significantly improve robustness and generalization. Even when individual models are imperfect, their errors tend to cancel out under aggregation, yielding a predictor that is both more accurate and more statistically reliable than any single constituent model.
This same principle has naturally carried over to modern multi-agent LLM systems. Today, ensembling is widely used in agentic pipelines, often implemented via majority voting across multiple agents. When these agents are heterogeneous (e.g., different prompts, tools, or models), voting aggregates diverse reasoning paths. When they are homogeneous (i.e., multiple samples from the same LLM), this process is commonly referred to as Self-Consistency, where diversity arises from stochastic decoding rather than architectural differences. In both cases, majority voting acts as a simple yet powerful mechanism for stabilizing the answers to closed-ended questions. For example, consider a multiple-choice question where five answers are sampled from a set of LLM agents, and the final decision is made by majority voting. Suppose three of the five agents select choice A, while the remaining two select B, and the correct answer is indeed A. Although any individual agent answers correctly with probability only about 3/5=60%, selecting the majority vote substantially increases the system's overall probability of producing the correct answer.
NOTE: Strictly speaking, "majority voting" refers to a decision-making rule requiring more than half (over 50%) of the total votes cast to elect a candidate, and the voting protocol described in this post is actually "Plurality Voting". Nevertheless, following common usage in the literature, I will use the term "majority voting" throughout this post to refer to this plurality-based aggregation rule.
This so-called "Magnifying Effect" of majority voting has also been shown theoretically in this work by Choi et al. There, we showed that aggregation does more than merely average performance: it amplifies the probability of selecting the correct answer even when individual agents are only slightly better than random. In other words, small advantages at the individual level can translate into disproportionately large gains at the collective level. Specifically, the success probability of majority voting is lower-bounded by:
Theorem 1 in Choi et al. (NeurIPS 2025)
where N is the number of agents, K is the number of choices (you can regard this as the number of options in a multiple-choice question), and Δ is the belief difference between the most probable answer and the second most probable answer (roughly, you can regard this as the agent's relative level of certainty in its answer). As N goes to infinity, the right-hand side of the inequality asymptotically approaches 1, even when Δ is relatively small. In other words, if we take a majority vote among a sufficiently large set of agents, the system's probability of choosing the most probable answer is "magnified".
For example, let's say we have a 5-way multiple-choice question whose answer is "A". An agent's likelihood of choosing "A" is 0.201, and that of choosing "B" is 0.200. Then, the absolute probability of this single agent getting the correct answer is only 20.1%. But now let's say we have 10 copies of this agent. In this scenario, N=10, K=5, and Δ=0.01, yielding RHS=0.621. This means that the 10-agent system's probability of choosing the correct answer is suddenly at least 62.1%! Overall, this helps explain why majority voting remains remarkably strong even in noisy and ambiguous tasks, and why it continues to serve as a foundational primitive in multi-agent LLM systems today.
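The magnifying effect is also easy to verify empirically. Below is a minimal Monte Carlo sketch; to keep the trend visible at small N, it uses a hypothetical agent that puts 35% of its mass on the correct option among four choices (different numbers from the near-tie example above), and estimates how plurality-vote accuracy grows with the number of agents:

```python
import random
from collections import Counter

def plurality_vote(votes):
    """Return the plurality winner, breaking ties uniformly at random."""
    counts = Counter(votes)
    top = max(counts.values())
    winners = [c for c, n in counts.items() if n == top]
    return random.choice(winners)

def voting_accuracy(p, n_agents, trials=20000, correct=0):
    """Estimate P(plurality vote picks `correct`) when each agent samples
    independently from the categorical answer distribution p."""
    choices = range(len(p))
    hits = 0
    for _ in range(trials):
        votes = random.choices(choices, weights=p, k=n_agents)
        hits += plurality_vote(votes) == correct
    return hits / trials

# Hypothetical agent: 35% mass on the correct answer, the rest spread out.
p = [0.35, 0.25, 0.20, 0.20]
for n in (1, 5, 15, 31):
    print(n, round(voting_accuracy(p, n), 3))
```

With these weights, the single-agent accuracy hovers near 0.35 while the 31-agent ensemble lands well above chance, illustrating the same amplification the bound describes.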
Majority voting can be understood not only as a counting procedure, but also in terms of a geometric and relational structure. Let's again think about a multi-agent system comprising 5 individual agents, each responding to a shared multiple-choice question. Three of the agents chose "A", and two of them chose "B". Then, we can represent the relationship between these agent answers as a graph, like the following, where two answer nodes are connected with an undirected edge if the answers are identical.
Majority Voting is a Largest-Clique Identification problem.
Given this graphical representation of agent answers, the task of doing a majority vote becomes identical to "finding the largest clique in the agent graph". In other words, we want to find the biggest cluster or subgraph within the graph of agents, which represents a certain answer choice. From this perspective, majority voting implicitly selects the densest region of this graph space. When many agents independently arrive at similar conclusions, their outputs form a tight cluster.
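For the exact-match setting, this equivalence is easy to see in code. Because "gives the same answer" is transitive, each clique of the agent graph is exactly the set of agents behind one answer, so the largest clique is just the mode. A sketch with hypothetical votes:

```python
from collections import Counter
from itertools import combinations

answers = ["A", "A", "B", "A", "B"]  # hypothetical votes from five agents

# Connect two agents with an edge iff their answers are identical.
edges = {(i, j) for i, j in combinations(range(len(answers)), 2)
         if answers[i] == answers[j]}

# Since exact matching is transitive, each maximal clique is the group of
# agents giving one particular answer, so the largest clique is the mode.
largest_clique = max(
    ([i for i, a in enumerate(answers) if a == v] for v in set(answers)),
    key=len,
)

# Sanity check: every pair inside the clique is indeed connected.
assert all((i, j) in edges for i, j in combinations(largest_clique, 2))

print(largest_clique)              # -> [0, 1, 3]
print(answers[largest_clique[0]])  # -> A
```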
For closed-ended tasks like multiple-choice question answering, the graphical intuition behind majority voting is pretty straightforward. Each agent simply picks one option from a small, fixed set of answers, so counting votes and finding the "majority" is easy and intuitive. Now, consider open-ended tasks, where things get much more interesting! In these types of tasks, each agent's output is best viewed as a point in a high-dimensional text space. Instead of choosing from a finite list of options, an agent can generate responses of arbitrary length and content: anything from a block of code in a code-generation task to a free-form summary of a document. In this setting, the answer space explodes. Rather than voting over a handful of discrete choices, we are now dealing with an enormous, effectively unbounded space of possible outputs. This makes the idea of "majority voting" far less obvious: when every response looks slightly different, what does it even mean for one answer to be the majority?
Once we move beyond multiple-choice settings to free-form open-ended tasks, outputs rarely match exactly, yet many may still express the same underlying solution or intent. For example, two summaries may differ syntactically while conveying the same core information, or two code snippets may implement the same algorithm using different structures. So in such tasks, majority voting must be generalized. In generalized majority voting, the definition of "exact matching" of answer choices is replaced with a notion of similarity. Instead of asking whether two outputs are identical, we ask whether they are close enough according to some semantic, functional, or structural metric. Under this lens, voting becomes a problem of identifying the most representative group of outputs rather than tallying identical tokens.
So how do we do that? The answer is simple: first, we define connectivity between outputs. Instead of asking whether two responses are exactly the same, we ask how similar they are. For instance, we can measure the similarity between any pair of generated texts by computing the Jaccard similarity between their respective word sets. Doing this for all pairs of outputs gives us a similarity graph, where each node represents an agent's output and each edge weight reflects how similar the two outputs are. Equivalently, this graph can be represented by an adjacency matrix encoding pairwise similarities.
The relation of free-form outputs can be represented in terms of a similarity graph.
Once we have this graph, majority voting can be reinterpreted in a much more general way. Instead of counting identical answers or trying to find the largest clique, we look for regions of the graph where many outputs are densely connected. Intuitively, the “majority” answer is no longer a single discrete option, but a cluster of mutually similar responses that dominates the overall similarity structure. From this perspective, identifying the majority answer reduces to a graph problem: we perform clustering or graph cuts on the similarity graph and identify the resulting modal cluster as the "majority group".
Now, the next question is: how can we turn this idea into a working algorithm?
There are many ways one could do this, but in our recent work, we introduce a three-step process:
Let's say we have five different answers from an LLM (or from multiple LLMs) for an arbitrary task. For example, each answer is:
Answer 1: "The cat sat on the mat."
Answer 2: "A cat was sitting on the mat."
Answer 3: "A feline sat on the mat."
Answer 4: "On the mat, the cat sat."
Answer 5: "Dogs are barking loudly."
Using this set of texts, we can compute the lexical similarity between each pair using the Jaccard Index, which gives us the adjacency matrix defining a similarity graph. The actual values are computed as follows:
Similarity graph adjacency matrix.
Since the fifth answer talks about dogs barking, it is clearly far from the other four answers in the text space. That trend is evident in the adjacency matrix, where the fifth node (the fifth row and column) has lower affinity to the other four.
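The computation can be sketched as follows. This uses a simple lowercase, punctuation-stripped word-set Jaccard; the exact numbers depend on the tokenization, so they may differ slightly from the matrix shown above:

```python
import string

answers = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "A feline sat on the mat.",
    "On the mat, the cat sat.",
    "Dogs are barking loudly.",
]

def word_set(text):
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two word sets."""
    return len(a & b) / len(a | b)

sets = [word_set(t) for t in answers]
n = len(sets)
adj = [[jaccard(sets[i], sets[j]) for j in range(n)] for i in range(n)]

for row in adj:
    print([round(v, 2) for v in row])
```

Under this tokenization, the fifth row and column come out as all zeros off the diagonal, since the dog sentence shares no words with the cat sentences.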
Now that we have the weighted similarity graph of agent answers, the next challenge is to identify the representative (or "majority") cluster. There are many ways one could approach this, but we take a particularly elegant and well-studied route: spectral graph clustering, using the Fiedler vector.
The Fiedler vector is the eigenvector corresponding to the second-smallest eigenvalue of the graph Laplacian. While that may sound technical, its intuition is very simple. You can think of the Fiedler vector as providing a one-dimensional projection of the graph that tries to separate nodes while cutting as few strong connections as possible. In other words, it naturally reveals a "soft split" of the graph into two groups that are weakly connected to each other, but strongly connected internally. Concretely, each node (i.e., each agent's output) gets a scalar value from the Fiedler vector. By thresholding or sorting these values, we can partition the graph into clusters. Outputs that are highly similar tend to have similar Fiedler values and end up grouped together, while dissimilar outputs are pushed apart.
Spectral Graph Clustering renders the "majority" (i.e., "modal") node.
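As a concrete sketch, here is a sign-based Fiedler split on a small hand-made similarity matrix: four mutually similar outputs plus one weakly connected outlier. The weights are hypothetical, chosen only to make the split obvious:

```python
import numpy as np

def fiedler_split(W):
    """Split a weighted similarity graph into two groups using the sign
    of the Fiedler vector of the unnormalized Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]          # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= 0              # boolean group labels

# Nodes 0-3 are mutually similar (0.5); node 4 is a weak outlier (0.05).
W = 0.5 * (np.ones((5, 5)) - np.eye(5))
W[4, :] = W[:, 4] = 0.05
W[4, 4] = 0.0

groups = fiedler_split(W)
print(groups)  # node 4 lands in its own group, apart from nodes 0-3
```

The sign of an eigenvector is arbitrary, so which group gets `True` can flip; what matters is that the outlier and the tight cluster end up on opposite sides of the cut.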
At this point, spectral clustering gives us a clean way to split the similarity graph into two coherent groups. But in practice, one split might not be enough. The "majority" structure we're looking for may be nested inside a larger cluster, or mixed with several smaller, less consistent groups. To handle this, we apply recursive clustering.
The idea is simple: instead of stopping after a single spectral cut, we repeatedly apply the same clustering procedure to the resulting subgraphs. At each step, we take the current cluster, construct its induced similarity graph, compute the Fiedler vector, and split it again. This recursive process gradually peels away disagreement. Early splits tend to separate clearly different modes (e.g., fundamentally different interpretations of a task), while later splits refine the dominant mode into tighter and more coherent subclusters.
Importantly, we are not forcing the graph into a predefined number of clusters. The structure of the graph itself determines how many splits are needed. At each level of recursion, we can evaluate which subcluster best represents the "majority", for example by looking at its internal connectivity via the conductance ratio. The recursion naturally terminates when further splits no longer improve coherence, at which point the remaining cluster can be treated as the modal cluster.
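One way to sketch this recursion is below. For simplicity it uses average internal similarity as the coherence proxy and a depth cap as a safeguard; these are stand-ins, and the conductance-based criterion described above would slot in at the stopping test:

```python
import numpy as np

def fiedler_mask(W):
    """Sign-based split from the Fiedler vector of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.eigh(L)[1][:, 1] >= 0

def coherence(W, idx):
    """Average pairwise similarity inside a cluster (0 for singletons)."""
    n = len(idx)
    if n < 2:
        return 0.0
    sub = W[np.ix_(idx, idx)]
    return (sub.sum() - np.trace(sub)) / (n * (n - 1))

def modal_cluster(W, nodes, depth=3):
    """Recursively split with the Fiedler vector, keeping the more
    coherent side, until splitting stops improving coherence."""
    if depth == 0 or len(nodes) <= 2:
        return nodes
    sub = W[np.ix_(nodes, nodes)]
    mask = fiedler_mask(sub)
    if mask.all() or not mask.any():
        return nodes
    a, b = nodes[mask], nodes[~mask]
    best = a if coherence(W, a) >= coherence(W, b) else b
    if coherence(W, best) <= coherence(W, nodes):
        return nodes  # no coherence gain from splitting: stop recursing
    return modal_cluster(W, best, depth - 1)

# Same toy graph as before: a tight 4-node cluster plus a weak outlier.
W = 0.5 * (np.ones((5, 5)) - np.eye(5))
W[4, :] = W[:, 4] = 0.05
W[4, 4] = 0.0
print(modal_cluster(W, np.arange(5)))  # -> [0 1 2 3]
```

The first split peels off the outlier; the second split of the remaining complete subgraph yields no coherence gain, so the recursion stops and returns the tight cluster as the modal group.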
After the spectral graph clustering step, let's assume that three nodes (answers) survived:
Answer 1: "The cat sat on the mat."
Answer 2: "A cat was sitting on the mat."
Answer 4: "On the mat, the cat sat."
Now what? We know that these answers represent the "majority", but which one should be chosen in the end?
The goal of the final step is to select the answer that best represents the modal cluster. That is, we want to identify the "centroid" among these answers! To do this, we turn to the similarity graph adjacency again. The adjacency matrix for these three answers is:
Intuitively, the "centroid" among these nodes should have the strongest overall connection to the other nodes. That is, we compute the degree of each answer node and choose the node with the greatest value. In this case, the degrees of the three nodes are 2.18, 1.94, and 2.12, respectively. Thus, we choose the first answer: "The cat sat on the mat."
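In code, this step is a single argmax over node degrees. The off-diagonal similarities below are illustrative values back-solved to reproduce the degrees quoted above (assuming a unit self-similarity on the diagonal), not taken from the actual matrix:

```python
import numpy as np

# Pairwise similarities among the three surviving answers
# (order: Answer 1, Answer 2, Answer 4). Off-diagonal values are
# hypothetical, chosen so the node degrees match those quoted above.
A = np.array([
    [1.00, 0.50, 0.68],
    [0.50, 1.00, 0.44],
    [0.68, 0.44, 1.00],
])

degrees = A.sum(axis=1)          # total connection strength per node
centroid = int(np.argmax(degrees))

print(degrees)    # -> [2.18 1.94 2.12]
print(centroid)   # -> 0, i.e., "The cat sat on the mat."
```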
What do you think? Do you think the answer is a good representation of the "majority"?
The greatest strength of our approach (Mode Extraction; ModeX) is that it can be applied to ANY open-ended task. All you need is a set of diverse responses generated by multiple agents for the same prompt; ModeX takes care of the rest.
We evaluated ModeX across three very different open-ended tasks: news article summarization, code generation, and math reasoning. In each case, ModeX successfully identifies the dominant mode of agreement. It selects the most concise and focused summary, retrieves the most representative (or "typical") code solution, and isolates the most probable reasoning trace for solving the math problem. See below for the table.
Note: ModeX-Lite is a more efficient version of ModeX. Check out our paper for details!
If you find our work interesting and useful, consider citing our papers!
@article{choi2026modex,
title={ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation},
author={Choi, Hyeong Kyu and Li, Sharon},
journal={arXiv preprint arXiv:2601.02535},
year={2026}
}
@inproceedings{choi2025debate,
title={Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?},
author={Choi, Hyeong Kyu and Zhu, Jerry and Li, Sharon},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}