One of the main reasons it is so difficult to estimate the results of Indian elections is that they use a first-past-the-post system in a district-based setting, i.e. a particular seat is won by the party/candidate with the highest number of votes there. Such systems are often characterized by a lack of uniformity and linearity between vote shares and seat shares. The number of seats a party wins depends not only on how many votes it receives overall, but also on how those votes are spatially distributed across the districts. Recall the famous US Presidential election of 2016, when Donald Trump won despite receiving fewer votes overall than Hillary Clinton: her votes were geographically concentrated in populous states like California, while his were distributed over wider regions, allowing him to win more states and hence more electoral votes. In a multi-party system like India's, however, geographical concentration of votes is not necessarily a bad thing: smaller or weaker parties can concentrate their votes and win a few seats. It is this spatial heterogeneity that trips up most pollsters in India. Most polls do a reasonable job of estimating the overall vote shares, but they are often unable to estimate the seat shares, because there is no simple mathematical relation between the two.
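To make this nonlinearity concrete, here is a minimal toy example (invented numbers, not real election data) showing how, under first-past-the-post, a party with fewer total votes can still win more seats if its support is distributed efficiently:

```python
# Toy illustration: two parties contest 5 seats of 100 voters each.
# Party B wins far more total votes, but piles them up as surpluses in
# two seats, while Party A's votes are spread just efficiently enough
# to win three seats narrowly.
votes_per_seat = [
    {"A": 55, "B": 45},  # A wins narrowly
    {"A": 55, "B": 45},  # A wins narrowly
    {"A": 55, "B": 45},  # A wins narrowly
    {"A": 10, "B": 90},  # B racks up a huge surplus
    {"A": 10, "B": 90},  # B racks up a huge surplus
]

totals = {"A": 0, "B": 0}
seats = {"A": 0, "B": 0}
for seat in votes_per_seat:
    for party, v in seat.items():
        totals[party] += v
    winner = max(seat, key=seat.get)  # first-past-the-post: plurality wins
    seats[winner] += 1

print(totals)  # {'A': 185, 'B': 315} -> B has 63% of the votes
print(seats)   # {'A': 3, 'B': 2}    -> yet A wins a majority of seats
```

The same mechanism, scaled up to 543 seats and many parties, is why overall vote share alone cannot determine seat counts.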
For the past three years, I have been working on simulating voter behavior using agent-based models. The main aim of these models is to explore how the results of an election can change depending on how the voters of different parties are spatially distributed. The models have tunable parameters, and I have shown in my research papers that, for a proper choice of parameters, they can appropriately reproduce the vote share-seat share relation for multiple elections in India. For other parameter values, the models show what the results could have been had the spatial distribution of voters been different. The models essentially simulate an election, producing the number of votes each party gets in each seat. They are agent-based, i.e. they aim to mimic certain aspects of a voter's behavior, especially the fact that a person's vote is likely to be influenced by the district where they live (or vice versa). Once an election is simulated, a second model can simulate surveys on it, based on the principle of uniform random sampling. The aim of this article is to examine how well these models can reproduce the results of the 2024 Indian General Elections, and also what the survey results would likely be if surveys were done fairly, i.e. with uniform random sampling. Below, I present my observations not only at the all-India level but also for key states where the survey results greatly missed the mark. These results are based on the gross vote-share numbers available to me now, and hence may not be fully accurate.
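To give a flavour of what such a simulator does, here is a deliberately simplified, self-contained sketch. It is NOT the author's actual SPM or PCM; it only illustrates the general idea that a single "heterogeneity" parameter can turn national vote shares into a seat-by-seat outcome. Here each seat's party shares are the national shares perturbed by lognormal noise, with `sigma` (an assumed parameter name) controlling how lumpy each party's support is across seats:

```python
import random

def simulate_election(shares, n_seats, sigma, seed=0):
    """Illustrative sketch of a seat-level election simulator (not the
    author's SPM). shares: national vote shares per party. sigma=0
    makes every seat identical to the national picture; larger sigma
    makes each party's support spatially lumpier."""
    rng = random.Random(seed)
    seats = {p: 0 for p in shares}
    for _ in range(n_seats):
        # Perturb each party's national share independently in this seat.
        weights = {p: s * rng.lognormvariate(0, sigma)
                   for p, s in shares.items()}
        winner = max(weights, key=weights.get)  # first-past-the-post
        seats[winner] += 1
    return seats

# National vote shares roughly as in the all-India numbers below.
shares = {"NDA": 0.425, "INDIA": 0.406, "Others": 0.169}
print(simulate_election(shares, 543, sigma=0.5))
```

With `sigma=0` the largest party sweeps every seat; as `sigma` grows, trailing parties start winning seats where their noise draws are favourable. Fitting such a parameter to past elections is, in spirit, what the tuning described above involves.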
The results can be refined after the Election Commission releases full statistical reports of the elections, which will take a few months.
At the all-India level, there are 543 seats in total, and 645,442,929 people (64.5 crores) cast their votes according to ECI data. There were two large alliances: the N.D.A., which received 42.5% of the votes, and I.N.D.I.A., which received 40.6%. One caveat: not every seat had exactly one NDA candidate and one INDIA candidate. In particular, several seats had multiple INDIA candidates: 42 in West Bengal, 20 in Kerala, 13 in Punjab, and a few more elsewhere. For simplicity, we ignore this complication here.
When I ran my models with this data, my Seatwise Polarization Model (SPM) predicted 282-309 seats for NDA, 218-241 for INDIA, and 15-27 for the remaining unaligned parties. The mean result was NDA 293, INDIA 229, rest 21, remarkably close to the actual results: NDA 293, INDIA 234, others 16. This success serves as a sanity check for the model, so it can be trusted for further analysis. As already mentioned, these results are for a specific value of the tunable parameter, but I have a standard estimate for it which seems to do a good job for most Indian elections.
Now, let us simulate the survey. A uniformly random survey has two parameters: how many people were surveyed (the sample size), and how many of the seats/districts they were drawn from. Most polls released in the media reported sampling a few lakh people, i.e. about 0.1% of the total voters. They do not reveal the spatial extent of the survey, so we consider two cases: i) all seats were covered; ii) a random subset of 50% of the seats was covered. In the first case, our model says that such a survey can project 248-303 seats for NDA, 204-264 for INDIA and 20-52 for the rest. If only 50% of the seats were covered, it may project 238-332 seats for NDA and 188-282 for INDIA. This wide range arises because the results depend on which districts were chosen and which people were surveyed, which can vary from one survey to another. We report these numbers over 1,000 runs of the simulated surveys. Crucially, no simulated survey comes remotely close to projecting 400+ seats for NDA, as many of the exit polls did.
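A survey simulation of this kind can be sketched as follows. This is my own illustrative reconstruction under the stated assumptions (uniform random sampling, a seat-coverage fraction, winners extrapolated proportionally from covered seats), not the author's actual survey model; the function and parameter names are invented:

```python
import random

def run_surveys(seat_votes, sample_frac=0.001, seat_frac=1.0,
                runs=1000, seed=0):
    """Illustrative sketch of simulated uniformly random surveys.
    seat_votes: one {party: votes} dict per seat. For each run: pick
    seat_frac of the seats at random, sample sample_frac of each chosen
    seat's voters uniformly, call the seat for the sampled plurality
    winner, then scale seat counts up to the full house."""
    rng = random.Random(seed)
    n_seats = len(seat_votes)
    parties = list(seat_votes[0])
    projections = []
    for _ in range(runs):
        covered = rng.sample(range(n_seats),
                             max(1, round(seat_frac * n_seats)))
        seats = dict.fromkeys(parties, 0)
        for i in covered:
            electorate = seat_votes[i]
            k = max(1, round(sample_frac * sum(electorate.values())))
            # Uniform random respondents: each backs a party with
            # probability proportional to its true votes in this seat.
            sample = rng.choices(parties,
                                 weights=[electorate[p] for p in parties],
                                 k=k)
            counts = {p: sample.count(p) for p in parties}
            seats[max(counts, key=counts.get)] += 1
        scale = n_seats / len(covered)
        projections.append({p: round(s * scale) for p, s in seats.items()})
    return projections

# Ranges like "248-303 seats for NDA" would then be, e.g.:
# lo = min(pr["NDA"] for pr in projections)
# hi = max(pr["NDA"] for pr in projections)
```

The spread of these per-run projections is exactly the kind of min-max range quoted throughout this article.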
Considering only all-India numbers is never a great idea, because they hide spatial heterogeneities, and this election showed marked heterogeneity across the states. Let us now focus on a few states which threw up curious results for a variety of reasons.
India's most populous state, Uttar Pradesh (UP), where 87,680,220 people voted across 80 seats, produced results which totally contradicted both pre-election opinion polls and post-election exit polls. The state saw a 4-way contest between NDA, INDIA, BSP, and various smaller parties and independent candidates, the last group amalgamated into a single "others" party for simulation purposes. The vote shares were 43.7% for INDIA, 43.5% for NDA, and 9.4% for BSP. With these vote shares, my SPM predicted 34-45 seats for INDIA, 34-45 for NDA, 0-1 for BSP and 0-1 for others, with a mean result of INDIA 41, NDA 38, BSP 1, others 0. This is not at odds with the actual results (INDIA 43, NDA 36, others 1), but can we do better? We therefore considered the Partywise Concentration Model (PCM), which allows the voters of different parties to be spatially distributed in different ways (as in the 2016 US Presidential election). This is a more powerful model, but on the flip side it has more parameters, which are difficult to tune; I am currently exploring a suitable machine-learning algorithm to find its optimal parameters. Coming back to UP, we find that for the optimal parameter configuration, PCM predicts 42-48 seats for INDIA and 32-38 for NDA. Neither model could predict that "others" (Azad Samaj Party-Kanshi Ram) would win one seat with a very small overall vote share.
Now to the surveys. I do not yet have the full election results, so I carried out the survey simulation on the simulated election that most closely matched the actual results. Assuming about 0.1% of the voters were surveyed across all districts, surveys are likely to project 34-52 seats for INDIA and 28-46 for NDA. There is a 20% chance that a survey finds NDA winning more seats than INDIA. If the survey is limited to 50% of the seats, the range of possible projections is unchanged, but the probability of getting the winner wrong increases to 32%. Under no circumstance, however, do we find 65-70 seats being projected for NDA!
Another state where the surveys went horribly wrong was West Bengal, with 60,435,345 voters over 42 seats. This state had a 3-way contest between TMC, NDA, and the LF-INC alliance (INC, TMC, and LF are all part of INDIA at the all-India level). The vote shares turned out to be around 47% for TMC, 39% for NDA, and 11% for LF-INC. According to the SPM with the default parameter setting, this translates to a seat distribution of 23-29 for TMC, 13-19 for NDA, and 0-3 for LF-INC, which is consistent with the actual distribution of 29-12-1. According to the PCM, we can have 27-32 for TMC, 10-14 for NDA, and 0-1 for LF-INC.
Coming to surveys, most pre-election opinion polls had suggested a tight race between the first two parties, each projected to win around 20 seats, while the exit polls projected a clear win for NDA with 23-31 seats, which clearly failed. According to our survey simulations, a survey with a 0.1% voter sample over all 42 seats may project 25-31 seats for TMC, 9-17 for NDA, and 0-2 for LF-INC. Even if the survey is limited to only 50% of the seats, it may project 18-38 seats for TMC (a massive range!), 4-22 for NDA, and 0-4 for LF-INC. Even then, there is only about a 2% chance that it predicts the wrong winner, as all the exit polls did!
Note that in UP, a major reason for the survey failure may be that voters did not reveal their actual votes to surveyors, owing to the prevailing socio-political situation there. Furthermore, if the surveys adjusted their raw data with "priors" from the past few elections, the failure is understandable to an extent. The same cannot be said about West Bengal, which makes the exit-poll failure there astounding.
One more politically significant state is Maharashtra, which witnessed very complex political realignments over the past two years. Here, 57,231,341 voters cast their votes over 48 seats. The main players were NDA (locally called the Maha-yuti) and INDIA (locally called the Maha Vikas Aghadi), though there were smaller parties and independents, which we clubbed together as "others". The two main sides won 43.55% and 43.71% of the votes respectively, which suggests a keenly contested election, but the seat distribution (17 for NDA, 30 for INDIA, 1 for others) did not reflect this, indicating significantly different spatial distributions for the two sides. This result could not be reproduced by the SPM, so we used the more flexible PCM. At the optimal parameter configuration, PCM predicted 27-32 seats for INDIA, 16-21 for NDA, and 0-1 for others. For the most realistic election simulated by PCM, a survey of 0.1% of the voters covering all 48 seats could project 26-35 seats for INDIA, 13-22 for NDA, and 0 for others. If only 50% of the seats were covered, the projection could be 20-42 for INDIA and 6-28 for NDA, but this huge range is mostly due to outliers. There is only about a 1-2% chance that a survey done with uniform sampling gets the winning side wrong, as most of the exit polls and many of the opinion polls did.
A particularly challenging case was Punjab, which saw a 5-way contest between INC, AAP, BJP, SAD, and others (including independents); 13,474,616 people voted across 13 seats. There was no NDA or INDIA alliance in this state. The vote share was quite evenly distributed (27% for INC, 26% for AAP, 19% for BJP, 14% for SAD, and 14% for others), but the seat share (7-3-0-1-2) was not. Reproducing these results was a nightmare for the models. SPM projected a [3-7, 3-7, 0-4, 0-3, 0-3] seat distribution, i.e. large ranges for all sides. Running PCM was very difficult, as there were too many parameters to tune; its results could lie in the range of 5-9 seats for INC, 3-5 for AAP, 0 for BJP, 0-1 for SAD, and 1-3 for others. The fact that BJP won 0 seats despite having the third-largest vote share suggests their votes were spread thin all over the state, while SAD and "others" could win seats with fewer votes because theirs were concentrated, probably around individual candidates.
Coming to surveys, there was huge room for error. A survey of 0.1% of voters over all 13 seats could project 3-10 seats for INC, 0-8 for AAP, 1 for SAD, 0 for BJP, and 1-2 for others. There is a significant chance (13%) of predicting the wrong winner (AAP instead of INC). If the survey were limited to 50% of the seats, its margin of uncertainty would increase further, with a 17% risk of identifying the wrong winner. But at least according to the simulated election, there would be no chance of projecting any seat for BJP, while some exit polls predicted up to 4 seats for them!
In my opinion, the strangest results were seen in Telangana, where 21,798,661 people voted for 17 seats. There were four main parties (INC, BJP, BRS, AIMIM), with vote shares of (41%, 36%, 18%, 3%) respectively. Despite the full 5-percentage-point gap between INC and BJP, both won 8 seats each, while BRS won none despite significant popular support (probably spread thin, like BJP's support in Punjab). AIMIM, with its votes concentrated around one seat, managed to win it. None of our models could reproduce this seat share well, especially the INC-BJP tie. The closest PCM could get was 9 seats for INC, 8 for BJP, and 0 for the rest, but this was an exceptional run; most PCM simulations suggested 10-12 seats for INC. SPM suggested 8-12 for INC, 4-7 for BJP, and 0-2 for BRS. It does not make much sense to simulate surveys here, since we could not really reproduce the election results. Still, taking the simulated election that came closest (9-8-0-0), a survey over all 17 seats involving 0.1% of voters can project 8-11 seats for INC, 6-9 for BJP, and 0 for the rest. If limited to only 50% of the seats, it can project 4-16 seats for INC, 1-13 for BJP, and 0 for the rest. In the first case, there is a 14% risk of projecting BJP ahead of INC; in the second, this risk is as high as 37%! So at least here, the pollsters who called 11-12 seats for BJP and 6-7 for INC can be given the benefit of the doubt.
Two more states where most opinion/exit polls projected significantly more seats for NDA than it actually won are Rajasthan and Karnataka. Both had bipolar contests (NDA vs INDIA), and both results can be explained by the SPM under its default parameter setting. In Rajasthan, NDA won 14 of the 25 seats and INDIA 11, though most polls had predicted 18-25 seats for NDA. For uniformly random surveys in Rajasthan over all 25 seats, the projected ranges turn out to be quite large: 9-21 for NDA, 4-16 for INDIA, and 0-1 for others; with only 50% of the seats covered, the ranges are even wider. Similarly in Karnataka (NDA 19, INDIA 9), a survey over all 28 seats can project 12-25 seats for NDA and 3-16 for INDIA, while most media polls reported 22-25 for NDA. If the survey covers only 50% of the seats, the ranges grow to 10-28 for NDA and 0-18 for INDIA! This suggests that both states had very close contests, and the polls can be given the benefit of the doubt: though wrong, they were definitely within the range of possibilities.
Conclusion:
1) Except in Telangana, my agent-based models can simulate most of the election results (seat shares), given the vote shares.
2) We find that surveys, if done correctly according to the principles of uniform random sampling, are very unlikely to produce the results that were reported in the media as exit polls and opinion polls, both for states like UP, West Bengal, and Maharashtra, and at the all-India level. I won't hazard a guess as to what went wrong with the media polls; possible reasons include faulty randomization in selecting voters and insufficient seat coverage, not to mention extraneous factors.
Disclaimer: The estimates regarding surveys are based on simulated data, not actual data for the full elections. Once the actual seat-level data is available from the ECI reports, this article will be revised.