Abstract

Ecological inference methods are devised to estimate the unknown inner cells of two-way contingency tables by inferring conditional probability distributions. This is one of the longest-standing problems in the social sciences, and it is especially common in political science and sociology. To solve the problem, ecological inference algorithms assume an asymmetric relationship in which a main characteristic (e.g. race or social class), mapped to rows, affects a dependent variable, usually the vote, mapped to columns. Because the models are asymmetric, different solutions are reached depending on how variables are assigned to rows and columns. In this paper, we propose two new sets of ecological inference algorithms and explore whether accuracy can be improved by handling the problem in a symmetric way. We assess the accuracy of the proposed methods using real data from more than 550 concurrent elections where the true district-level cross-classifications of votes (straight- and split-tickets) are known. Our empirical assessment clearly identifies the symmetric solutions as more accurate. They outperform asymmetric methods 90% of the time and reduce error, on average, by 11%. Our results are based on data from simultaneous elections, so further research is required to see whether our conclusions can be maintained in other ecological inference contexts. Interested readers can easily use the proposed methods as they are implemented in the R package lphom.

1. Introduction

In elections, aggregate-level data are abundant and reliable, whereas individual-level data are largely unavailable due to the secret ballot. When individual-level data are available, they tend to be too inexact and scarce to estimate voter transitions (changes in voters’ choices between elections), as they usually originate from polls (Romero et al., 2020). Despite this, many agents, including political parties, the media, and US legal practitioners of voting rights, are still interested in knowing the voting behaviour of different subgroups of people. Ecological inference has been devised to solve this issue by exploiting the aggregate-level data relationships (King, 1997).

Ecological inference refers to the process of inferring individual-level relationships from aggregate (‘ecological’) data when individual-level data are not available. It is routinely employed in many disciplines, from economics and epidemiology to sociology and political science (Pavía, 2022), despite being exposed to the so-called ecological fallacy (Robinson, 1950). For instance, ecological inference algorithms are used to estimate vote transfer matrices between elections, infer split-ticket voting behaviours or reveal social and racial voting patterns (Barreto et al., 2022; O’Loughlin, 2000; Park et al., 2014; Romero & Pavía, 2021).

Using a classical contingency table representation in which individuals are classified by rows according to the groups to which they belong (e.g. social class, previous vote or race) and by columns by, for instance, their votes, the unknown cross-distributions are estimated using as data the observed row and column marginal distributions in a set of non-overlapping (geographical) units (e.g. precincts). The difficulty arises because substantially different inner cell counts can give rise to the same aggregated row and column totals, and this indeterminacy cannot be completely removed: it is intrinsic to the problem (see, e.g. Forcina & Pellegrino, 2019; Greiner, 2007; Manski, 2007). The solution depends on the assumptions made, which to a large extent are unverifiable with the available information (Gelman et al., 2001; Glynn & Wakefield, 2010; Wakefield, 2004), an issue that has led to festering disputes (see, e.g. Freedman et al., 1998, 1999; Greiner, 2007; Tam Cho, 1998; Tam Cho & Gaines, 2004). Practitioners customarily hypothesize similar/related conditional row (underlying) probability/fraction distributions across tables (sometimes with the help of covariates), relying on the common observation that people belonging to the same group tend to follow, probabilistically, similar behaviour patterns within the same political context (Pavía & Romero, 2022).

Although an extensive list of different ecological inference procedures has been suggested over time from fields as diverse as frequentist and Bayesian statistics (e.g. Brown & Payne, 1986; Goodman, 1953, 1959; Greiner & Quinn, 2009; King, 1997; King et al., 2004; Klima et al., 2019; Rosen et al., 2001), mathematical optimization (e.g. Hawkes, 1969; Pavía & Romero, 2022; Tziafetas, 1986), or information theory (e.g. Bernardini-Papalia & Fernández-Vázquez, 2020; Judge et al., 2004), they all consider analogous underlying assumptions and share a similar framework. This framework is based on a hypothesis of similarity/relationship in (electoral) behaviour by group across units and a scheme with an explanatory and a response variable.

Ecological inference may, therefore, be viewed as an inverse problem where the goal is to estimate (infer) the conditional row fraction (underlying probability) distributions using the observed count marginal distributions as data (Jiang et al., 2020). Once the estimates of the probability distributions are attained, cross-classification estimates of counts are obtained by multiplying the observed row margins by the estimated probabilities. In some models, such as that of Greiner and Quinn (2009) and its extension (Greiner & Quinn, 2010), counts are directly inferred.

The above general scheme means that different estimates are reached depending on which variable is assigned to rows and which to columns. In other words, even using the same method and the same data, a different solution is obtained if rows and columns are flipped. Thomsen’s model (1987), which assumes that voters’ choices are driven by individual latent factors, is the exception. The models proposed by Johnston and colleagues (e.g. Johnston et al., 1983; Johnston & Pattie, 2003), based on entropy maximization, while treating tables symmetrically, cannot be classified as genuine ecological inference approaches since they require prior information about the target cross-distributions to be applied.

In many applications, deciding which classification should be assigned to rows and which to columns is straightforward. For example, in a study of polarized voting in which we want to know how different collectives support different candidates, characteristics of the electorate (such as race or gender) are naturally assigned to rows while the categories of the columns are defined by the candidates. Equally, if we want to know the levels of voters’ loyalty (and switching) by party between two consecutive elections, the natural way is to assign the electoral options of the first and second elections to rows and columns, respectively.

In the above examples, the implicit assumption is that somehow there is a causality relationship or, at least, a natural origin-destination temporal arrangement. But what happens in simultaneous elections? In this case, the answer might not be so straightforward.

When two elections are held simultaneously, it is usually considered that there is a first-order election and a subordinate second-order election, so the relationship is studied in that order. It is implicitly accepted that the majority of voters make a sequential choice: they first decide their vote for the first-order election to subsequently, in a second step, choose their vote for the subordinate election (Pavía & Cantarino, 2017). Sometimes, the order of primacy is clear, such as when a general election and a referendum coincide. Other times, however, such as when electors vote simultaneously in a national and in a regional election or for a party list and for a candidate, this is not so clear even though we can argue over an order of primacy. Indeed, the argument itself, the existence of a hierarchy between elections, is being progressively challenged by the literature (see, e.g. Schakel & Romanova, 2020). Anyway, although we can reasonably argue the presence of a first-order and a second-order election, the question is whether this is the proper way to proceed. In other words, can we improve the accuracy of the estimated counts by considering both elections at the same level? More generally, given that (almost) all the ecological inference models exploit correlations and not causal relationships, we should ask ourselves whether the average accuracy of ecological inference estimates may be improved using methods that deal with rows and columns symmetrically. The aim of this paper is to provide answers to these questions.

To answer these questions, we propose two new sets of methods whose solutions do not depend on how classifications are assigned to rows and columns. On the one hand, we propose a new set of algorithms that achieve their solutions handling rows and columns symmetrically from the outset, by definition. This type of models can be logically specified from a mathematical optimization framework. Both dual constraints as well as congruency constraints can be naturally introduced in a linear programme, within the same optimization problem. Hence, in this research we adopt a linear programming approach. It is important to note that, although these models could also be specified within a quadratic programming framework, we prefer to state our models within the linear framework because, as Tziafetas (1986) shows, linear approaches are more efficient than quadratic ones in this context. On the other hand, we also consider a group of algorithms based on asymmetric solutions. This involves methods that reach their estimates by combining the solutions attained after applying an asymmetric ecological inference model to the two possible ways of assigning classifications to rows and columns.

In a recent paper, Pavía and Romero (2022) propose two new models, tslphom and nslphom, that, according to the authors, ‘place the linear programming approach once again in a prominent position in the ecological inference toolkit’. These new models, based on linear programming, generate estimates for each unit table and solve the tendency of lphom, the basic linear programming model, to estimate extreme probabilities (zeros and ones). They have also been shown to be at least as accurate as, and significantly simpler to use than (Pavía & Romero, 2023), the multinomial-Dirichlet R × C ecological inference statistical model, previously identified as the best in the literature (Katz, 2014; Klima et al., 2016; Plescia & De Sio, 2018). In this paper, we propose new algorithms that deal with rows and columns symmetrically by building on the Pavía and Romero (2022) models in two directions.

On the one hand, using either lphom, tslphom, or nslphom, we suggest three new methods, each of them generating three reasonable solutions, by solving the two underlying dual problems, swapping origin and destination. We identify these new methods by adding the suffix ‘_dual’ to the corresponding base algorithm and their respective new solutions by adding the suffixes ‘_min’, ‘_dual_a’, and ‘_dual_w’. On the other hand, we directly modify lphom, tslphom, and nslphom, looking for the joint congruent optimal solution of both dual problems. We specify new linear programmes in which the optimal solutions, being congruent, simultaneously verify the constraints imposed by the two dual ecological inference specifications. We identify these new models and their corresponding solutions adding the suffix ‘_joint’ to the respective base algorithm. The main innovation of this paper, therefore, lies in proposing, as far as we know for the first time in the literature, ecological inference models that explicitly deal with rows and columns in a symmetric fashion. Conceptually, this is a novelty that could also be explored from other frameworks.

The performance of all the proposed algorithms is assessed using the data available in the R package ei.Datasets (Pavía, 2022), which contains the global true party-candidate cross-classification tables corresponding to more than 500 mixed-member elections. After comparing actual tables and estimates from all the new algorithms and also considering, as baseline, the solutions obtained using the base asymmetric models, we can assert that, at least for these datasets, more accurate solutions are obtained exploiting the information both ways: parties to candidates and candidates to parties.

The rest of the paper is organized as follows. Section 2 states the problem and introduces the notation. Section 3 details the methods based on lphom, tslphom, and nslphom proposed to obtain the dual solutions. Section 4 revises the basic linear programme model (lphom), adapting it to the symmetric case. Section 5 goes further and reworks tslphom and nslphom after modifying the lphom_local algorithm of Pavía and Romero (2022) to make it symmetric. Section 6 presents the results and discusses the findings of the empirical comparisons. Section 7 explores in greater depth the factors that affect the accuracy of the estimates. Section 8 concludes, discusses, and suggests directions for further research.

2. Mathematical representation of the problem

For convenience and without loss of generality, from now on we adopt a framework of simultaneous elections where voters, grouped in a set of units that jointly define a partition of the electoral space, cast two votes; one for a party list and another for a candidate. We denote by I, J, and K the number of units, parties, and candidates, respectively.

The observed data are the votes $x_{ij}$ recorded for each of the $j = 1, \dots, J$ parties and the votes $y_{ik}$ gained by each of the $k = 1, \dots, K$ candidates in each of the $i = 1, \dots, I$ voting units. The basic unknowns are the numbers of voters $v_{jk}$ who simultaneously voted for party $j$ and candidate $k$ in the entire population. Also of interest are the equivalent numbers in each polling unit, $v_{jk}^{i}$.

As intermediate $J \times K$ and $K \times J$ unknowns, we denote by $p_{jk}$ the proportion of voters in the entire electoral space who voted for candidate $k$ among those who voted for party $j$ and, equivalently, by $q_{kj}$ the proportion of voters in the entire electoral space who voted for party $j$ among those who voted for candidate $k$. Likewise, we define the corresponding proportions at the unit level as $p_{jk}^{i}$ and $q_{kj}^{i}$. Note that, under a probabilistic approach, the intermediate unknowns could be expressed as $p_{k|j} = P(Y = k \mid X = j)$ and $q_{j|k} = P(X = j \mid Y = k)$, with $X$ and $Y$ representing the party and candidate vote variables, respectively. Similarly, one can define probabilities at the unit level. Ecological inference algorithms are designed to estimate these intermediate unknowns. The $p_{jk}$ (and $p_{jk}^{i}$) proportions are the usual outputs when the $x_{ij}$ and $y_{ik}$ votes are assigned, respectively, to rows and columns, while the $q_{kj}$ (and $q_{kj}^{i}$) proportions are the outputs in the dual case, transposing rows and columns; see Figure 1.

Figure 1. The two dual ways of representing the basic problem in a typical unit $i$: parties to candidates (left panel) and candidates to parties (right panel). Inner quantities are the unobserved proportions, the (intermediate) targets of ecological inference. In general, different solutions are reached when solving the problem from parties to candidates than from candidates to parties. Symmetric algorithms generate congruent $P_{i, J \times K}$ and $Q_{i, K \times J}$ matrices: $x_{ij} \hat{p}_{jk}^{i} = y_{ik} \hat{q}_{kj}^{i}$.


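All the above quantities are closely related. Among other relationships, the following equalities hold: $x_{ij} = \sum_k v_{jk}^{i}$, $y_{ik} = \sum_j v_{jk}^{i}$, $v_{jk} = \sum_i v_{jk}^{i}$, $p_{jk} = v_{jk} / \sum_i x_{ij}$, $q_{kj} = v_{jk} / \sum_i y_{ik}$, $p_{jk}^{i} = v_{jk}^{i} / x_{ij}$, and $q_{kj}^{i} = v_{jk}^{i} / y_{ik}$. Given these relationships, once we have the estimates $\hat{p}_{jk}$, $\hat{q}_{kj}$, $\hat{p}_{jk}^{i}$, and $\hat{q}_{kj}^{i}$, the estimates $\hat{v}_{jk}$ and $\hat{v}_{jk}^{i}$ can be directly obtained by employing the observed quantities $x_{ij}$ and $y_{ik}$.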
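The relationships above can be checked numerically on a small example. The following is only an illustrative sketch: the counts in `v` are made-up numbers (not taken from any dataset used in the paper), and all variable names are hypothetical. It derives the margins and both sets of row-standardized proportions from known unit tables.

```python
# Toy unit-level cross tables v[i][j][k] (hypothetical counts):
# I = 2 units, J = 2 parties, K = 3 candidates.
v = [
    [[30, 10, 0], [5, 20, 15]],
    [[12, 8, 20], [0, 25, 5]],
]
I, J, K = len(v), len(v[0]), len(v[0][0])

# Observed margins: x[i][j] sums over candidates, y[i][k] sums over parties.
x = [[sum(v[i][j]) for j in range(J)] for i in range(I)]
y = [[sum(v[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]

# Global table and global totals by party and by candidate.
v_glob = [[sum(v[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
x_tot = [sum(x[i][j] for i in range(I)) for j in range(J)]
y_tot = [sum(y[i][k] for i in range(I)) for k in range(K)]

# Row-standardized global proportions in both directions.
p = [[v_glob[j][k] / x_tot[j] for k in range(K)] for j in range(J)]  # J x K
q = [[v_glob[j][k] / y_tot[k] for j in range(J)] for k in range(K)]  # K x J

# Unit-level proportions in both directions.
p_unit = [[[v[i][j][k] / x[i][j] for k in range(K)] for j in range(J)]
          for i in range(I)]
q_unit = [[[v[i][j][k] / y[i][k] for j in range(J)] for k in range(K)]
          for i in range(I)]

print(x, y, v_glob)
```

In a real ecological inference problem only `x` and `y` are observed; the sketch starts from `v` merely to make the identities visible.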

In general, different estimates for $v_{jk}$ and $v_{jk}^{i}$ are obtained depending on the assignment of parties and candidates to rows and columns. We define an algorithm as symmetric if it produces the same estimates when rows and columns are interchanged: $x_{ij} \hat{p}_{jk}^{i} = y_{ik} \hat{q}_{kj}^{i}$. These methods are characterized by the fact that their row-conditional probability estimates verify Bayes' theorem with the observed proportions, i.e. $\hat{p}_{jk}^{i} p_{j}^{i} = \hat{q}_{kj}^{i} q_{k}^{i}$, where $p_{j}^{i} = x_{ij} / \sum_r x_{ir}$ and $q_{k}^{i} = y_{ik} / \sum_c y_{ic}$. We refer to this property as the probabilistic symmetric condition.
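As an illustration of the symmetric condition, the sketch below tests whether a pair of unit-level proportion matrices implies the same table of counts in both directions. The function names and the numbers are hypothetical; `P_alt` is simply another row-stochastic matrix that reproduces the same candidate totals, which also shows why the margins alone do not pin down the inner cells.

```python
def unit_proportions(v_i):
    """Margins and both row-standardized matrices implied by a J x K table."""
    J, K = len(v_i), len(v_i[0])
    x_i = [sum(row) for row in v_i]
    y_i = [sum(v_i[j][k] for j in range(J)) for k in range(K)]
    P_i = [[v_i[j][k] / x_i[j] for k in range(K)] for j in range(J)]
    Q_i = [[v_i[j][k] / y_i[k] for j in range(J)] for k in range(K)]
    return x_i, y_i, P_i, Q_i


def is_congruent(x_i, y_i, P_i, Q_i, tol=1e-9):
    """True when x_ij * p_jk equals y_ik * q_kj for every (j, k) cell."""
    return all(
        abs(x_i[j] * P_i[j][k] - y_i[k] * Q_i[k][j]) <= tol
        for j in range(len(x_i))
        for k in range(len(y_i))
    )


# Hypothetical unit table with J = 2 parties and K = 3 candidates.
x_i, y_i, P_i, Q_i = unit_proportions([[30, 10, 0], [5, 20, 15]])

# A different row-stochastic matrix with the same margins (x_i P_alt = y_i).
P_alt = [[0.5, 0.25, 0.25], [0.375, 0.5, 0.125]]

print(is_congruent(x_i, y_i, P_i, Q_i))
print(is_congruent(x_i, y_i, P_alt, Q_i))
```

The first pair comes from a single table and is therefore congruent; the second mixes two admissible but different solutions, which is the situation that arises when an asymmetric method is run in each direction.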

In matrix form, we denote by $X = [x_{ij}]$ the $I \times J$ observed matrix of party votes, by $Y = [y_{ik}]$ the $I \times K$ observed matrix of candidate votes, by $P = [p_{jk}]$ the $J \times K$ unknown matrix of parties-to-candidates global proportions, by $P_i = [p_{jk}^{i}]$ the $J \times K$ unknown matrix of parties-to-candidates proportions of unit $i$, by $Q = [q_{kj}]$ the $K \times J$ unknown matrix of candidates-to-parties global proportions, and by $Q_i = [q_{kj}^{i}]$ the $K \times J$ unknown matrix of candidates-to-parties proportions of unit $i$. We use hat symbols to represent estimates of the corresponding unknown proportions and matrices.

Likewise, we denote by $V = [v_{jk}]$ and $V_i = [v_{jk}^{i}]$, respectively, the global and unit $i$ unknown $J \times K$ matrices (contingency tables) of votes, by $\hat{V} = [\hat{p}_{jk} \sum_i x_{ij}]$ and $\hat{V}_i = [\hat{p}_{jk}^{i} x_{ij}]$ the corresponding estimates obtained using parties-to-candidates proportion estimates, and by $\tilde{V} = [\hat{q}_{kj} \sum_i y_{ik}]^{T}$ and $\tilde{V}_i = [\hat{q}_{kj}^{i} y_{ik}]^{T}$ the ones attained using candidates-to-parties proportion estimates, where the superscript $T$ denotes the transpose.

Additionally, we use $I_n$ to denote the identity matrix of order $n$, $0_{n \times m}$ and $1_{n \times m}$ to represent the $n \times m$ matrices of zeros and ones, respectively, and $\mathrm{diag}(a)$, $\mathrm{row}_h(A)$, and $\mathrm{vec}(A)$ to denote the diagonalization, row-extractor, and matrix-vectorization operators. They operate as follows. Given a vector $a$ of size $n$ and a matrix $A$ of order $n \times m$, $\mathrm{diag}(a)$ builds a matrix of order $n$ with $a$ on its main diagonal, $\mathrm{row}_h(A)$ extracts the $h$th row of $A$ as a column vector of order $m$, and $\mathrm{vec}(A)$ is the $nm \times 1$ column vector obtained by stacking the rows of the matrix $A$ on top of one another in row-wise order. We use $\otimes$ and $\odot$ to denote, respectively, the Kronecker product and the Hadamard (or element-wise matrix multiplication) product operators.
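For readers who prefer code to notation, the operators just described can be sketched with plain nested lists. The function names below are hypothetical and simply mirror the verbal definitions in the text; note in particular that vectorization here is row-wise, unlike the column-wise convention common elsewhere.

```python
def diag(a):
    """Square matrix with the vector a on its main diagonal."""
    n = len(a)
    return [[a[r] if r == c else 0 for c in range(n)] for r in range(n)]


def row(A, h):
    """h-th row of A (1-based) returned as a column vector."""
    return [[value] for value in A[h - 1]]


def vec(A):
    """Column vector stacking the rows of A in row-wise order."""
    return [[value] for r in A for value in r]


def kron(A, B):
    """Kronecker product of two matrices."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def hadamard(A, B):
    """Element-wise product of two matrices of the same order."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[1, 2], [3, 4]]
print(diag([2, 3]))
print(row(A, 2))
print(vec(A))
print(kron([[1, 2]], [[1, 0], [0, 1]]))
print(hadamard(A, A))
```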

3. Solutions based on asymmetric methods

Given that the symmetric models we propose build on the lphom, tslphom, and nslphom algorithms, we limit ourselves to using the same framework for defining our asymmetric-based dual methods. In our view, this makes comparisons fairer. This self-imposed constraint could of course be relaxed, and asymmetric-based dual solutions could be defined using other R × C ecological inference procedures as building blocks, such as the Rosen et al. (2001) Bayesian-based multinomial-Dirichlet model (Lau et al., 2020), the iterative version of the 2 × 2 model proposed by King (Collingwood et al., 2020; King, 1997), or the generalization of the Goodman regression method (Collingwood et al., 2016, 2020), to name just some possibilities.

Before defining our asymmetric-based dual methods in Section 3.2, we first detail the lphom, tslphom, and nslphom models in Section 3.1. To do this, we focus on the nslphom model. Although lphom, tslphom, and nslphom were suggested as different models, Pavía and Romero (2023) show that the lphom and tslphom solutions can be attained as by-products of the nslphom algorithm, as intermediate outputs of its iterative process: lphom after solving its first linear system (iteration 0) and tslphom after solving its first $2I+1$ linear systems (iteration 1).

In this section, we introduce the equations when $x_{ij}$ and $y_{ik}$ are assigned, respectively, to rows and columns (see left panel in Figure 1). The equations of the dual problem (see right panel of Figure 1) are detailed in Sections 4 and 5, during the process of specifying the genuine symmetric models. There, we will adopt a more compact matrix representation. Despite the density of some of the mathematical formulae, we include all the equations of the models because, in our view, this is the most precise way of defining them. Nevertheless, for those readers who prefer to skip the more mathematical details, we also offer explanations and intuitions for all of the equations as well as insights about the statistical and rational basis of the proposed algorithms.

3.1 Asymmetric linear programming models

The nslphom procedure is an iterative algorithm that, for simultaneous elections and typical ecological inference problems, uses equations (1)–(15) to attain its solution. In its iteration zero, nslphom solves the basic lphom linear system (Romero et al., 2020) defined by equations (1)–(5), whose unknowns are the unobserved global proportions $p_{jk}$ as well as the (non-negative) auxiliary quantities $e_{ik}^{+}$ and $e_{ik}^{-}$. The lphom model estimates the $p_{jk}$ values by solving a linear programme in which the solutions, besides fulfilling the obvious constraints of sum 1 for each row of the district-level proportion matrix and non-negativity (equations (2) and (1)), must be perfectly congruent with the observed outcomes at the global level (equation (3)) and minimize the sum of $L_1$-discrepancies at the unit level (equations (4) and (5)). Solving the system (1)–(5) generates an initial solution, $\hat{P}^{0} = [{}^{0}\hat{p}_{jk}]$, of the matrix of global proportions, $P$, which is used to start the iterative process that characterizes nslphom. The estimate $\hat{P}^{0}$ is the (only) matrix that, in addition to verifying all the aggregate row and column constraints, minimizes the sum across units of the expected count discrepancies, in $L_1$-norm, between the observed and expected unit column totals when $\hat{P}^{0}$ is applied to the observed unit row totals.

(1)
(2)
(3)
(4)
(5)
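To make the verbal description of the lphom programme more concrete, the sketch below evaluates, for a given candidate matrix of global proportions, the constraints and the objective described in the text: row sums equal to one, non-negativity, an exact match with the global candidate totals, and the sum over units of the absolute differences between observed and expected candidate votes. It does not solve the linear programme; it only scores a candidate matrix. The function names and the data are hypothetical.

```python
def expected_columns(x_i, P):
    """Expected column totals when P is applied to the row totals x_i."""
    K = len(P[0])
    return [sum(x_i[j] * P[j][k] for j in range(len(x_i))) for k in range(K)]


def l1_discrepancy(X, Y, P):
    """Sum over units and candidates of |y_ik - sum_j x_ij p_jk|."""
    total = 0.0
    for x_i, y_i in zip(X, Y):
        fitted = expected_columns(x_i, P)
        total += sum(abs(obs - fit) for obs, fit in zip(y_i, fitted))
    return total


def satisfies_global_constraints(X, Y, P, tol=1e-9):
    """Row-stochastic P that reproduces the global candidate totals."""
    J, K = len(P), len(P[0])
    rows_ok = all(min(r) >= 0 and abs(sum(r) - 1) <= tol for r in P)
    x_tot = [sum(x_i[j] for x_i in X) for j in range(J)]
    y_tot = [sum(y_i[k] for y_i in Y) for k in range(K)]
    fitted = expected_columns(x_tot, P)
    return rows_ok and all(abs(f - t) <= tol for f, t in zip(fitted, y_tot))


# Hypothetical data: I = 2 units, J = 2 parties, K = 3 candidates.
X = [[40, 40], [40, 30]]
Y = [[35, 30, 15], [12, 33, 25]]
P = [[42 / 80, 18 / 80, 20 / 80], [5 / 70, 45 / 70, 20 / 70]]

print(satisfies_global_constraints(X, Y, P))
print(l1_discrepancy(X, Y, P))
```

A linear programming solver would search over all matrices passing `satisfies_global_constraints` for the one with the smallest `l1_discrepancy`; the auxiliary unknowns in the text are the standard device for writing the absolute values as linear constraints.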

In the next iterations, for $l = 1, \dots, n_s$ (where $n_s$ is a fixed exogenous parameter), the algorithm generates recursive estimates of the unit proportions, $p_{jk}^{i}$, and of the global proportions, $p_{jk}$, through a two-step process. In the first step, the currently available estimates of the global proportions, ${}^{l-1}\hat{p}_{jk}$, are employed to obtain unit proportion estimates, ${}^{l}\hat{p}_{jk}^{i}$, by solving the $2I$ sequential linear systems defined, respectively, by equations (6)–(10) and (11)–(13), where $\varepsilon_{jk}^{i+}$, $\varepsilon_{jk}^{i-}$, $\delta_{jk}^{i+}$, and $\delta_{jk}^{i-}$ are auxiliary (non-negative) unknowns. In particular, the observed votes in each unit together with the estimated global proportions ${}^{l-1}\hat{p}_{jk}$ are used as inputs to solve, in each unit, another linear programme in which the unit-level proportions, $p_{jk}^{i}$, are estimated by imposing a perfect match with the observed outcomes at the unit level (equation (8)) and minimizing the sum of the $L_1$-distances between the values that would be obtained by applying to $x_{ij}$ either $p_{jk}^{i}$ or ${}^{l-1}\hat{p}_{jk}$ (equations (9) and (10)). As this programme is indeterminate, the final solutions for $p_{jk}^{i}$ are reached through a second linear programme (equations (11)–(13)) that chooses, among all the solutions of the (6)–(10) linear programme (equation (11)), the set of $p_{jk}^{i}$ that minimizes the sum of the $L_1$-distances between all the ($p_{jk}^{i}$, ${}^{l-1}\hat{p}_{jk}$) pairs (equations (12) and (13)).

(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)

In the second step, the unit proportion estimates are in turn used to update the global proportion estimates, ${}^{l}\hat{p}_{jk}$, via equation (14), where the global proportions are expressed as a weighted average of unit proportions.

(14)
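The update step can be illustrated with a short sketch. It assumes that the weights of the average are the observed party votes $x_{ij}$, which is the choice consistent with the accounting identities of Section 2 ($v_{jk} = \sum_i v_{jk}^{i}$ and $p_{jk} = v_{jk} / \sum_i x_{ij}$); the function name and the numbers are hypothetical.

```python
def global_from_units(X, P_units):
    """Global proportions as an x_ij-weighted average of unit proportions.

    X[i][j] holds the votes for party j in unit i and P_units[i][j][k]
    the unit-level proportion estimates (assumed weighting scheme).
    """
    I, J, K = len(X), len(P_units[0]), len(P_units[0][0])
    x_tot = [sum(X[i][j] for i in range(I)) for j in range(J)]
    return [
        [sum(X[i][j] * P_units[i][j][k] for i in range(I)) / x_tot[j]
         for k in range(K)]
        for j in range(J)
    ]


# Hypothetical inputs: I = 2 units, J = 2 parties, K = 3 candidates.
X = [[40, 40], [40, 30]]
P_units = [
    [[0.75, 0.25, 0.0], [0.125, 0.5, 0.375]],
    [[0.3, 0.2, 0.5], [0.0, 25 / 30, 5 / 30]],
]

P_global = global_from_units(X, P_units)
print(P_global)
```

Because each row of every unit matrix sums to one and the weights are non-negative, each row of the updated global matrix also sums to one.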

In each iteration, the statistic defined by equation (15), which measures the distance to homogeneity of the $l$th solution, is also calculated. This measure is employed to select the nslphom solution (Pavía & Romero, 2022), which corresponds to the estimates associated with the iteration $l^*$ that minimizes (15).

(15)

Once the iterative process is finished, nslphom provides the lphom, tslphom, and nslphom solutions. The matrix $\hat{P}^{0}$ corresponds to the lphom solution. The set of matrices $\hat{P}^{1}$ and $\hat{P}^{1}_{i} = [{}^{1}\hat{p}_{jk}^{i}]$, for $i = 1, \dots, I$, achieved at the end of iteration 1 defines the tslphom solution. The nslphom solution is composed of the set of matrices $\hat{P}^{l^*}$ and $\hat{P}^{l^*}_{i} = [{}^{l^*}\hat{p}_{jk}^{i}]$, for $i = 1, \dots, I$, selected taking into account (15). Note that the lphom solution only contains estimates of the global proportions.

The above algorithm, programmed in the function nslphom of the R package lphom (Pavía & Romero, 2022), has one parameter to be chosen: the number of iterations, $n_s$. The default option of the nslphom function considers only 10 iterations since the minimum of equation (15) is typically reached in very few iterations. At the end of the iterative process, the matrices $\hat{P}^{l^*}_{i}$ ($i = 1, \dots, I$) comprise the set of simultaneously updated unit matrices closest to $\hat{P}^{0}$ that, verifying all the aggregate unit row and column constraints, yield the compound matrix $\hat{P}^{l^*}$ that minimizes the sum across units of the expected count discrepancies, in $L_1$-norm, between the observed and expected unit column totals when $\hat{P}^{l}$ is applied to the observed unit row totals.

3.2 Symmetric models based on dual solutions

In the same way that equations (1)–(15) are defined, a similar linear system can be specified to obtain estimates of the proportion matrices $Q = [q_{kj}]$ and $Q_i = [q_{kj}^{i}]$, for $i = 1, \dots, I$. The problem, however, lies in the fact that these two dual representations of the same underlying ecological inference problem generate non-congruent solutions. The set of estimates $\hat{P}$ and $\hat{P}_i$ obtained using the $x_{ij}$ votes as predictor variables and the $y_{ik}$ votes as response variables is not congruent with the set of estimates $\hat{Q}$ and $\hat{Q}_i$ that would be attained by switching $X$ and $Y$. In general, the $P$-based and $Q$-based estimates for the number of votes of any generic $(j, k)$ entry will be different; that is, $(\sum_{i=1}^{I} x_{ij}) \hat{p}_{jk} \neq (\sum_{i=1}^{I} y_{ik}) \hat{q}_{kj}$ and $x_{ij} \hat{p}_{jk}^{i} \neq y_{ik} \hat{q}_{kj}^{i}$. Therefore, once the problem is solved both ways, $X \to Y$ and $Y \to X$, we have to look for some (ad hoc) way of reconciliation if we want to exploit the information contained in both solutions.

We envisage two basic reconciliation approaches maintaining the accounting constraints that delimit the problem. On the one hand, just one of the two solutions could be selected. On the other hand, both solutions could be combined in a manner that preserves the constraints among votes and proportions. Below, we explore solutions for both types.

As an alternative to the classical approach of reasoning a logical (causal, temporal, or primacy) order between classifications, equation (15) could be employed to choose between solutions. Given the central role played by the heterogeneity index in determining linear programming solutions and its previously reported close relationship with the accuracy of the (global) estimates (Pavía & Romero, 2022, 2023; Romero et al., 2020), we propose selecting between the sets $\{\hat{P}, \hat{P}_i\}$ and $\{\hat{Q}, \hat{Q}_i\}$ the one with the smallest HETe. In the case of lphom solutions, the expression $\sum_{ik} |e_{ik}^{+} + e_{ik}^{-}| / \sum_{ij} x_{ij}$, suggested by Romero et al. (2020), could be used to approximate HETe. We identify these solutions with the suffix ‘_min’.

The above ‘combination’ of asymmetric solutions scarcely exploits the information contained in one of the two dual solutions, so we propose building combined solutions as averages, since averages, compared to individual estimates, tend to reduce expected errors. We consider either unweighted or weighted averages of the two dual solutions, with the weighted solutions attained by employing the inverses of the HETe statistics as weights. We denote with the suffix ‘_dual_a’ the solutions associated with the sets of vote estimates $\{\bar{V}^{a}, \bar{V}_i^{a}\}$ given by equations (16) and (17) and with the suffix ‘_dual_w’ the weighted solutions associated with the sets of vote estimates $\{\bar{V}^{w}, \bar{V}_i^{w}\}$ obtained using equations (18) and (19), where $\mathrm{HET}_p$ and $\mathrm{HET}_q$ represent the values of the HETe statistic of the respective solutions $X \to Y$ and $Y \to X$. Note that for lphom_dual solutions, equations (17) and (19) do not apply.

(16)
(17)
(18)
(19)
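The three reconciliation rules described in this section can be sketched in a few lines. This is an illustration rather than the package implementation: it assumes that both dual solutions have already been expressed as $J \times K$ tables of votes (the candidates-to-parties one transposed), that the weights are the inverses of the two heterogeneity statistics normalized to sum to one, and that ties in the ‘_min’ rule go to the parties-to-candidates solution. All names and numbers are hypothetical.

```python
def combine(V_hat, V_tilde, w_p=0.5):
    """Cell-wise convex combination of two J x K tables of votes."""
    return [
        [w_p * a + (1 - w_p) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(V_hat, V_tilde)
    ]


def inverse_het_weight(het_p, het_q):
    """Normalized weight of the X -> Y solution (assumed normalization)."""
    return (1 / het_p) / (1 / het_p + 1 / het_q)


def select_min(V_hat, V_tilde, het_p, het_q):
    """Keep the solution with the smaller heterogeneity statistic."""
    return V_hat if het_p <= het_q else V_tilde


# Two hypothetical dual solutions sharing the same row and column totals.
V_hat = [[60, 20], [10, 110]]    # from parties to candidates
V_tilde = [[50, 30], [20, 100]]  # from candidates to parties, transposed
het_p, het_q = 2.0, 6.0          # made-up heterogeneity values

V_min = select_min(V_hat, V_tilde, het_p, het_q)
V_avg = combine(V_hat, V_tilde)
V_wgt = combine(V_hat, V_tilde, inverse_het_weight(het_p, het_q))
print(V_min, V_avg, V_wgt)
```

Since both inputs satisfy the same accounting constraints and the combination is convex, the averaged tables keep the observed row and column totals.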

4. A symmetric linear model: lphom for simultaneous elections

The methods proposed in the previous section build ad hoc symmetric (congruent) solutions (i.e. solutions that do not depend on which classification is mapped to rows and which to columns) by selecting or averaging solutions from asymmetric methods. In this section, we modify the lphom model to directly produce global (optimal) symmetric solutions. In the next section, we extend the approach to also embrace the tslphom and nslphom models.

To make the notation more compact, we adopt a matrix representation. In particular, in matrix form, the linear programme, $X \to Y$, defined by equations (1)–(5) can be expressed by

and, equally, the dual problem, $Y \to X$, can be expressed as

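where $f_p^{T} = [0_{1 \times JK}, 1_{1 \times 2IK}]$, $f_q^{T} = [0_{1 \times KJ}, 1_{1 \times 2IJ}]$, $A_p$ is the $(J+K+IK) \times (JK+2IK)$ matrix given by equation (20), $A_q$ is the $(J+K+IJ) \times (JK+2IJ)$ matrix stated in equation (21), and the rest of the components are defined in (22), with $E_p^{+} = [{}^{p}e_{ik}^{+}]$ and $E_p^{-} = [{}^{p}e_{ik}^{-}]$, and $E_q^{+} = [{}^{q}e_{ij}^{+}]$ and $E_q^{-} = [{}^{q}e_{ij}^{-}]$, being, respectively, $I \times K$ and $I \times J$ matrices of auxiliary non-negative unknowns.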

(20)
(21)
(22)

Stacking both systems in the same programme and solving the resulting linear programme is not enough to guarantee congruency among the P and Q proportion estimates. To do this, restrictions detailed in equation (23) must be included as additional constraints. Equation (23) states that the global matrix of votes should be the same independent of what row-standardized proportion matrix is used to compute it, either P or Q.

(23)

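Obviously, these constraints can also be expressed in matrix form. In particular, $[B_p \; B_q][\mathrm{vec}(P)^{T}, \mathrm{vec}(Q)^{T}]^{T} = 0_{JK \times 1}$, where $B_p = \mathrm{diag}(1_{1 \times I} X) \otimes I_K$ and $B_q = -\sum_{k=1}^{K} (1_{1 \times I} Y)_k \, \mathrm{row}_k(I_K)^{T} \otimes (I_J \otimes \mathrm{row}_k(I_K))$, with $(1_{1 \times I} Y)_k$ being the $k$th component of the vector $1_{1 \times I} Y$.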
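The matrix form of the congruence constraints can be checked numerically. The sketch below builds the two blocks from the global party and candidate totals and verifies that the product vanishes for a congruent pair of proportion matrices. It assumes row-wise vectorization, as defined in Section 2, and a negative sign on the candidates-to-parties block so that the two vote tables cancel; the function names and data are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def congruence_matrix(x_tot, y_tot):
    """JK x 2JK matrix [B_p B_q] acting on [vec(P); vec(Q)]."""
    J, K = len(x_tot), len(y_tot)
    diag_x = [[x_tot[r] if r == c else 0 for c in range(J)] for r in range(J)]
    B_p = kron(diag_x, identity(K))
    B_q = [[0] * (K * J) for _ in range(J * K)]
    for k in range(K):
        e_col = [[1 if r == k else 0] for r in range(K)]    # K x 1
        e_row = [[1 if c == k else 0 for c in range(K)]]    # 1 x K
        term = kron(e_row, kron(identity(J), e_col))        # JK x KJ
        for r in range(J * K):
            for c in range(K * J):
                B_q[r][c] -= y_tot[k] * term[r][c]
    return [rp + rq for rp, rq in zip(B_p, B_q)]


# Hypothetical global table and the proportions it implies.
V = [[42, 18, 20], [5, 45, 20]]
x_tot = [sum(r) for r in V]
y_tot = [sum(V[j][k] for j in range(2)) for k in range(3)]
P = [[V[j][k] / x_tot[j] for k in range(3)] for j in range(2)]
Q = [[V[j][k] / y_tot[k] for j in range(2)] for k in range(3)]

B = congruence_matrix(x_tot, y_tot)
z = [val for r in P for val in r] + [val for r in Q for val in r]
residual = [sum(b * zc for b, zc in zip(b_row, z)) for b_row in B]
print(max(abs(r) for r in residual))
```

Each row of the product equals $(\sum_i x_{ij}) p_{jk} - (\sum_i y_{ik}) q_{kj}$ for one $(j, k)$ pair, so a residual that is zero up to rounding confirms that both proportion matrices imply the same global table of votes.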

Combining all the equations, a new linear programme system emerges. The programme given by (24) describes the lphom_joint model, which corresponds to the symmetric version of the lphom model.

(24)

In this model, the unknowns $p_{jk}$ and $q_{kj}$ are jointly estimated using a single linear programme that includes the constraints of an lphom model in which the unknowns are the $p_{jk}$ and those of another lphom model in which the unknowns are the $q_{kj}$, as well as $JK$ additional restrictions to guarantee congruence (i.e. that the probabilistic symmetric condition is fulfilled) between the $p_{jk}$ and the $q_{kj}$ estimates for each $(j, k)$ pair. These additional constraints force the $v_{jk}$ estimates from both sets of unknowns to coincide. That is, referring to the solutions of both models in terms of matrices of votes, $V$, not in terms of matrices of proportions, it can be seen that lphom_joint verifies $\mathrm{lphom\_joint}(X, Y) = \mathrm{lphom\_joint}(Y, X)^{T}$, in contrast to lphom, where $\mathrm{lphom}(X, Y) \neq \mathrm{lphom}(Y, X)^{T}$.

The estimate $\hat{V}^{0}$ of (24) is the (only) matrix that, in addition to verifying all the aggregate/global row and column constraints, minimizes, in $L_1$-norm, the sum across units of the accumulation of (i) the expected count discrepancies between the observed and expected unit column totals when $\mathrm{diag}(\hat{V}^{0} 1_{K \times 1})^{-1} \hat{V}^{0}$ is applied to the observed unit row totals and (ii) the expected count discrepancies between the observed and expected unit row totals when $\mathrm{diag}(1_{1 \times J} \hat{V}^{0})^{-1} \hat{V}^{0T}$ is applied to the observed unit column totals.

5. Estimating local transfer matrices: nslphom for simultaneous elections

Although the lphom_joint model solves the inconsistency of the lphom procedure by generating estimates of matrices of votes that do not depend on how classifications are mapped to rows and columns, it still shares a crucial limitation of lphom: it only estimates global (aggregate) matrices, yet having estimates of local (unit) matrices ($P_i$, $Q_i$, or $V_i$) is also useful. Having local estimates is not only of interest in itself in many applications, but it also helps to obtain better aggregate estimates (King, 1997). The process of obtaining local estimates calls for a more intensive exploitation of all the unit observations, an approach that, as a rule and regardless of the methodological framework used, leads to more accurate global solutions. Indeed, the tslphom and nslphom algorithms have also been shown to be significantly more accurate than the lphom procedure for a very wide range of scenarios (Pavía & Romero, 2023).

Continuing on from the previous sections, this section amends the tslphom and nslphom algorithms by proposing new versions of the procedures that, through their specification, force consistency between the estimates of all the (Pi,Qi) pairs. We anticipate that the new models (which we call tslphom_joint and nslphom_joint), in addition to generating local estimates, will also produce more accurate global solutions.

Both tslphom and nslphom are based on the lphom_local procedure (Pavía & Romero, 2022), a procedure devised to estimate unit matrices that is at the core of both algorithms. Indeed, their equations, starting from an arbitrary initial global transfer matrix, are for a fixed i similar to equations (6)–(13). Therefore, we first adapt, in Section 5.1, the lphom_local algorithm to build the lphom_local_joint method. Subsequently, in Section 5.2, we build the _joint versions of tslphom and nslphom on this.

5.1 The lphom_local_joint algorithm

The lphom_local algorithm for simultaneous elections can be viewed as a two-step linear programme procedure that takes as inputs a row-standardized proportion matrix $T$ (or $U$) of order $J\times K$ (or $K\times J$) and two non-negative vectors $r=[r_1,r_2,\ldots,r_J]$ and $s=[s_1,s_2,\ldots,s_K]$, verifying $\sum_j r_j=\sum_k s_k$, and delivers as output a new row-standardized proportion matrix, $\hat{T}$ (or $\hat{U}$), as close as possible to $T$ (or $U$), verifying $r\hat{T}=s$ (or $s\hat{U}=r$). The second linear programme is required because the solution of the first programme is often indeterminate.

For a fixed unit i, the first two dual linear programmes, which correspond to equations (6)–(10), can be expressed with our notation as

where $A_{pi}$ and $A_{qi}$ are the $(J+K+JK)\times 3JK$ matrices given by (25), and the rest of the components are defined in (26) and (27); $P_G$ and $Q_G$ being, respectively, a parties-to-candidates and a candidates-to-parties proportion transfer matrix, $x_i=\mathrm{row}_i(X)$ and $y_i=\mathrm{row}_i(Y)$ the observed column-vectors of votes to parties and candidates, and $E_{pi}^{+}=[{}_{p}\varepsilon_{jki}^{+}]$ and $E_{pi}^{-}=[{}_{p}\varepsilon_{jki}^{-}]$, and $E_{qi}^{+}=[{}_{q}\varepsilon_{kji}^{+}]$ and $E_{qi}^{-}=[{}_{q}\varepsilon_{kji}^{-}]$, respectively, $J\times K$ and $K\times J$ matrices of auxiliary non-negative unknowns.

(25)
(26)
(27)

As with lphom, independently solving the two dual problems does not assure congruency between both dual solutions, even starting from a congruent $(P_G,Q_G)$ pair of proportion matrices. To guarantee congruency, we also need to include the constraints given by (28), which state that the value for the votes $v_{jki}$ must be the same regardless of whether we compute them following a parties-to-candidates route or the reverse, candidates-to-parties. Equation (28) can be expressed in matrix form as $[B_{pi}\;\,{-B_{qi}}]\,[\mathrm{vec}(P_G)^{T},\mathrm{vec}(Q_G)^{T}]^{T}=0_{JK\times 1}$, where $B_{pi}=\mathrm{diag}(x_i)\otimes I_K$ and $B_{qi}=\sum_{k=1}^{K}\mathrm{row}_k(I_K)^{T}\otimes(I_J\otimes \mathrm{row}_k(I_K)\,y_{ik})$.

(28)

Similarly, the second two dual linear programmes, which correspond to equations (11)–(13), can be stated as

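where $h^{T}=[0_{1\times 3JK},1_{1\times 2JK}]$, $C_{pi}$ and $C_{qi}$ are the $(J+K+1+2JK)\times 5JK$ matrices defined in equation (29), and the rest of the components are defined in (30); $\Delta_{pi}^{+}=[{}_{p}\delta_{jki}^{+}]$ and $\Delta_{pi}^{-}=[{}_{p}\delta_{jki}^{-}]$, and $\Delta_{qi}^{+}=[{}_{q}\delta_{kji}^{+}]$ and $\Delta_{qi}^{-}=[{}_{q}\delta_{kji}^{-}]$ being, respectively, $J\times K$ and $K\times J$ matrices of auxiliary non-negative unknowns.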

(29)
(30)

Putting together all the pieces, we propose as lphom_local_joint algorithm the two-step linear programme procedure that yields as outcome the result of sequentially solving the programmes defined by the systems (31) and (32).

(31)

Once the minimum value for ωi is achieved solving the system (31), which characterizes the set of potential congruent solutions also verifying the two first dual local linear systems, the lphom_local_joint solution is attained by solving the extended system given by equation (32).

(32)

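where $w_i=[z_{pi}^{T},z_{qi}^{T},\mathrm{vec}(\Delta_{pi}^{+})^{T},\mathrm{vec}(\Delta_{pi}^{-})^{T},\mathrm{vec}(\Delta_{qi}^{+})^{T},\mathrm{vec}(\Delta_{qi}^{-})^{T}]^{T}$.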

This model generates solutions of the estimated unit level proportions that, verifying all the ecological accounting equalities, including constraints (28), equivalent to the probabilistic symmetric condition, are closest (in L1-norm) to the PG and QG matrices.

As with the lphom_joint model, the lphom_local_joint algorithm also verifies the symmetry property. That is, given a matrix of counts $V$ of order $J\times K$ and a fixed unit $i$, the equality lphom_local_joint($P_G$, $x_i$, $y_i$) = lphom_local_joint($Q_G$, $y_i$, $x_i$)$^{T}$ holds, whereas the relationship lphom_local($P_G$, $x_i$, $y_i$) = lphom_local($Q_G$, $y_i$, $x_i$)$^{T}$ does not hold in general; here $P_G=\mathrm{diag}(V1_{K\times 1})^{-1}V$ and $Q_G=\mathrm{diag}(1_{1\times J}V)^{-1}V^{T}$, and the outputs of the algorithms refer to local matrices of votes, not of proportions.
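To make the congruence condition concrete, the following minimal Python sketch derives $P_G$ and $Q_G$ from a small matrix of counts and checks that both routes yield the same votes, i.e. that $p_{jk}x_j=q_{kj}y_k$ for every (j,k) pair. The function names and the toy matrix are illustrative assumptions; they are not part of the lphom package, and the sketch does not solve any linear programme.

```python
def global_proportions(V):
    """Return (P, Q): row-standardized J x K and K x J proportion matrices of V.

    Assumes every row and column total of V is strictly positive.
    """
    J, K = len(V), len(V[0])
    row_tot = [sum(V[j]) for j in range(J)]
    col_tot = [sum(V[j][k] for j in range(J)) for k in range(K)]
    P = [[V[j][k] / row_tot[j] for k in range(K)] for j in range(J)]
    Q = [[V[j][k] / col_tot[k] for j in range(J)] for k in range(K)]
    return P, Q


def is_congruent(P, Q, x, y, tol=1e-9):
    """True when p_jk * x_j equals q_kj * y_k (up to tol) for every (j, k)."""
    J, K = len(P), len(P[0])
    return all(
        abs(P[j][k] * x[j] - Q[k][j] * y[k]) <= tol
        for j in range(J)
        for k in range(K)
    )


# Toy (hypothetical) matrix of votes: rows are parties, columns are candidates.
V = [[40.0, 10.0], [5.0, 45.0]]
P, Q = global_proportions(V)
x = [sum(row) for row in V]            # row totals (votes to parties)
y = [sum(col) for col in zip(*V)]      # column totals (votes to candidates)

congruent_pair = is_congruent(P, Q, x, y)
# A row-standardized Q that is unrelated to V breaks the condition.
incongruent_pair = is_congruent(P, [[0.5, 0.5], [0.5, 0.5]], x, y)
```

Proportion matrices derived from one and the same matrix of votes are congruent by construction; the _joint algorithms impose this condition on the estimates through the additional constraints.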

5.2 The nslphom_joint model

Once lphom and lphom_local are adapted to make them symmetric, nslphom can easily be modified in the same way; that is, we can build the nslphom_joint model. In particular, for a fixed ns, the natural definition of the nslphom_joint model is given by the following iterative procedure:

  • Iteration 0. Apply lphom_joint to $X$ and $Y$ to obtain initial congruent estimates $P_0$ and $Q_0$ of the global proportion matrices, from which the aggregate matrix of votes can be derived through: $V_0=P_0\circ[(1_{1\times I}X)^{T}1_{1\times K}]=[Q_0\circ[(1_{1\times I}Y)^{T}1_{1\times J}]]^{T}$.

  • Iteration $l$, for $l=1,\ldots,n_s$. Compute $P_{l-1}=\mathrm{diag}(V_{l-1}1_{K\times 1})^{-1}V_{l-1}$ and $Q_{l-1}=\mathrm{diag}(1_{1\times J}V_{l-1})^{-1}V_{l-1}^{T}$ and apply lphom_local_joint using as inputs $P_{l-1}$, $Q_{l-1}$, $x_i=\mathrm{row}_i(X)$, and $y_i=\mathrm{row}_i(Y)$, for $i=1,\ldots,I$. This produces congruent estimates of proportion matrices $P_{li}$ and $Q_{li}$ for each unit $i$, from which an estimate of the global matrix of votes is built by aggregating the estimates of the corresponding unit matrices: $V_l=\sum_{i=1}^{I}V_{li}$, where $V_{li}=P_{li}\circ(x_i1_{1\times K})=[Q_{li}\circ(y_i1_{1\times J})]^{T}$.
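The aggregation step at the end of each iteration can be sketched in Python as follows. This is only an illustration under stated assumptions: the unit matrices $P_{li}$ are taken as given (in the actual procedure they come from lphom_local_joint, which is not reproduced here), the function name aggregate_units and the toy numbers are hypothetical, and all row and column totals are assumed to be positive.

```python
def aggregate_units(P_units, X):
    """Build V_l = sum_i V_li with V_li[j][k] = P_li[j][k] * x_i[j],
    then derive the global proportion matrices P_l and Q_l from V_l."""
    J, K = len(P_units[0]), len(P_units[0][0])
    V = [[0.0] * K for _ in range(J)]
    for P_i, x_i in zip(P_units, X):
        for j in range(J):
            for k in range(K):
                V[j][k] += P_i[j][k] * x_i[j]
    row_tot = [sum(V[j]) for j in range(J)]
    col_tot = [sum(V[j][k] for j in range(J)) for k in range(K)]
    P = [[V[j][k] / row_tot[j] for k in range(K)] for j in range(J)]
    Q = [[V[j][k] / col_tot[k] for j in range(J)] for k in range(K)]
    return V, P, Q


# Two hypothetical units, two parties (rows) and two candidates (columns).
P_units = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
X = [[100.0, 50.0], [60.0, 40.0]]   # row i holds the party votes of unit i

V_l, P_l, Q_l = aggregate_units(P_units, X)
```

The matrices P_l and Q_l returned by the sketch play the role of the inputs that the next iteration would pass to the local algorithm.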

During each iteration, the statistic given by equation (33), where ${}_{l}v_{jki}$, ${}_{l}p_{jk}$, and ${}_{l}q_{kj}$ are the generic (j,k) entries of the matrices $V_{li}$, $P_l$, and $Q_l$, is also calculated.

(33)

In a similar vein to nslphom, the lphom_joint, tslphom_joint and nslphom_joint solutions are all attained once the above iterative process is finished. The set of matrices $\{P_0,Q_0,V_0\}$ corresponds to the lphom_joint solution. The sets of matrices $\{P_1,Q_1,V_1\}$ and $\{P_{1i},Q_{1i},V_{1i}\}_{i=1}^{I}$ achieved at the end of iteration 1 constitute the tslphom_joint solution. Finally, the nslphom_joint solution is made up of the sets of matrices $\{P_{l^*},Q_{l^*},V_{l^*}\}$ and $\{P_{l^*i},Q_{l^*i},V_{l^*i}\}_{i=1}^{I}$, with $l^*$ corresponding to the iteration for which (33) reaches its minimum in the $n_s$ iterations.

At the end of the iterative process, the estimates $P_{l^*i}$, $Q_{l^*i}$, $V_{l^*i}$ ($i=1,\ldots,I$) comprise the set of simultaneously updated unit matrices closest to $V_0$ that, verifying all the aggregate unit row and column constraints, generate the compound matrix $V_{l^*}$ that minimizes, in L1-norm, the sum across units of the accumulation of (i) the expected count discrepancies between the observed and expected unit column totals when $P_{l^*i}$ is applied to the observed unit row totals and (ii) the expected count discrepancies between the observed and expected unit row totals when $Q_{l^*i}$ is applied to the observed unit column totals.

All the new algorithms introduced in Sections 3–5 are available in functions with suffixes ‘_dual’ and ‘_joint’ in the R package lphom.

6. Assessment of procedures

In Sections 3–5, two new sets of symmetric algorithms based on linear programming have been suggested to solve the ecological inference problem. These procedures represent an alternative to the asymmetric methods, where rows and columns of the ecological matrices are not interchangeable. In this section, we assess the performance of the proposed procedures, focusing on accuracy. In addition to comparing the different solutions they provide, we also assess them against the equivalent asymmetric solutions, which act as baselines.

Although it is possible to also include other asymmetric models in the assessments, we have restricted the comparisons to the asymmetric algorithms from which the symmetric methods stem. In our view, this makes comparisons fairer and does not entail a limitation, as lphom-family algorithms are among the most accurate methods to estimate vote transition matrices. First, Klima et al. (2016) identify the ei.MD.bayes algorithm (Lau et al., 2020; Rosen et al., 2001) as the most accurate after comparing it to (i) a modification of classical ecological regression (Goodman, 1953), (ii) Thomsen's approach (Thomsen, 1987), and (iii) two recursive 2 × 2 approaches: the ones proposed by Andreadis and Chadjipadelis (2009) and by Kellermann (2011). Second, Plescia and De Sio (2018) also identify ei.MD.bayes as the best option after comparing it to (i) the RxCEcolInf method (Greiner et al., 2021; Greiner & Quinn, 2009) and (ii) classical ecological regression. Third, Barreto et al. (2022) conclude that King’s iterative EI:RxC (Collingwood et al., 2020; King, 1997) and ei.MD.bayes can be used interchangeably when assessing precinct-level voting patterns in Voting Rights Act cases, with Ferree (2004) and Katz (2014) considering iterative EI:RxC inferior. Finally, Pavía and Romero (2023) extensively compare the lphom-family algorithms with ei.MD.bayes solutions and find the former to be at least as accurate as ei.MD.bayes, and more accurate when the information is scarce or convergence cannot be guaranteed for ei.MD.bayes. In summary, if the new symmetric methods improve on the asymmetric lphom-based solutions, by transitivity they should also improve on the rest of the asymmetric methods.

6.1 Data

Due to the very nature of the problem, datasets with known true vote transfer matrices are scarce, so practical research typically relies on artificial, simulated data. In this paper, however, we gauge the different approaches using real data. The results of this section are based on comparing estimated and true matrices of votes belonging to more than 550 elections available in the R package ei.Datasets (Pavía, 2022). The ei.Datasets package contains the matrices of party votes, X, candidate votes, Y, and party and candidate cross-distribution tables of votes at district/election-level, V, corresponding to 565 mixed-member elections held between 2002 and 2020 in New Zealand (NZ) and in 2007 in Scotland (SCO). True cross-classification of votes at polling station (unit) level is not available. At unit level only the marginal distributions of votes (matrices X and Y) are known. New Zealand employs a mixed-member proportional system for electing parliament members, while Scotland uses an additional member voting system. Both systems involve voters casting two votes: one for a regional or national party list and another for a local candidate (who need not come from the chosen party). The candidate with the most votes in each district is automatically elected, and the remaining seats are allocated based on proportional party votes. In New Zealand, national compensatory seats ensure that each party’s share of seats aligns closely with its share of votes across the country.

Although, as far as we know, this is the first time in the literature that all the datasets included in ei.Datasets are employed for ecological inference assessment, some of these elections have already been utilized for evaluating ecological inference procedures (Pavía & Romero, 2022, 2023; Plescia & De Sio, 2018). In using these data, we follow in the footsteps of the above authors and, as is usual practice when handling real data (e.g. Barreto et al., 2022; Klein, 2019; Klima et al., 2016), we have only considered sizeable populations and grouped very small election options in ‘Others’. We have merged into ‘Other parties’ and ‘Other candidates’ those parties and candidates, respectively, that individually do not attain at least 3% of the district vote. Despite this operation notably reducing the sizes of the ecological tables (from 147.6 to 28.2 in average number of cells), we still retain a large diversity of table sizes, with tables of 23 different sizes ranging from a minimum of 15 cells (5 × 3) to a maximum of 56 cells (8 × 7); 8 and 3 being the maximum and minimum number for both rows and columns. Before merging, ei.Datasets tables range from a minimum of 51 cells (17 × 3) to a maximum of 300 cells (20 × 15), 27 being the maximum number of rows for a single table and 15 the maximum number of columns. In terms of district size, measured in number of units (polling stations), districts range from a minimum of 22 to a maximum of 833 units, with an average of 90.5 units per district. More details about the datasets can be found in Pavía (2022).
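The merging rule just described can be illustrated with a short Python sketch. It assumes a matrix of unit-level counts with one column per party (or candidate) and pools into a single ‘Others’ column every option whose share of the district vote falls below 3%. The function name merge_small_options and the toy figures are hypothetical; this is not the preprocessing code actually used for the paper.

```python
def merge_small_options(votes, labels, threshold=0.03, other_label="Others"):
    """Pool columns whose district-level vote share is below `threshold`.

    votes  : list of unit rows, each a list of counts (one per option)
    labels : option names, in column order
    Returns (merged_votes, merged_labels); row totals are preserved.
    """
    totals = [sum(col) for col in zip(*votes)]      # district totals per option
    grand_total = sum(totals)
    keep = [j for j, t in enumerate(totals) if t / grand_total >= threshold]
    small = [j for j in range(len(totals)) if j not in keep]

    merged_labels = [labels[j] for j in keep]
    merged_votes = [[row[j] for j in keep] for row in votes]
    if small:
        merged_labels.append(other_label)
        for new_row, row in zip(merged_votes, votes):
            new_row.append(sum(row[j] for j in small))
    return merged_votes, merged_labels


# Toy district with two polling stations and five parties.
votes = [
    [500, 440, 30, 20, 10],
    [480, 460, 50, 5, 5],
]
labels = ["A", "B", "C", "D", "E"]

merged_votes, merged_labels = merge_small_options(votes, labels)
```

In this toy district, parties D and E hold 1.25% and 0.75% of the vote and are pooled, whereas party C, with 4%, is kept as a separate column.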

6.2 Error measure

Regarding the metric utilized to assess the estimated matrices of votes, $\check{V}=[\check{v}_{jk}]$, we rely on the error index, EI. This statistic, formulated in equation (34), accounts for the percentage of votes incorrectly allocated in the estimated matrix, that is, the minimum number of votes that should be relocated among entries of the matrix to achieve a perfect fit. Although this statistic has been defined by other authors omitting the 0.5 coefficient (e.g. Klima et al., 2016), we prefer to follow the specification proposed by Romero et al. (2020) because it avoids counting every wrongly assigned vote twice.

(34)
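As a hedged illustration of how such an index can be computed, the Python sketch below assumes that EI equals 100 times one half of the sum of absolute cell differences between the true and estimated matrices, divided by the total number of votes. This reading is inferred from the verbal description above (a percentage, with a 0.5 coefficient so that each misallocated vote is counted once); the function name error_index and the toy matrices are hypothetical.

```python
def error_index(V_true, V_est):
    """Percentage of votes misallocated in V_est relative to V_true.

    Assumed form: 100 * 0.5 * sum_jk |v_jk - v_est_jk| / sum_jk v_jk.
    """
    total_votes = sum(sum(row) for row in V_true)
    abs_diff = sum(
        abs(t - e)
        for row_true, row_est in zip(V_true, V_est)
        for t, e in zip(row_true, row_est)
    )
    return 100.0 * 0.5 * abs_diff / total_votes


# Toy example: 100 votes in total, 10 of them placed in the wrong cell.
V_true = [[40, 10], [5, 45]]
V_est = [[35, 15], [10, 40]]

ei_value = error_index(V_true, V_est)
ei_perfect = error_index(V_true, V_true)
```

In the toy example, moving 10 votes between cells turns the estimate into the true matrix, while the sum of absolute differences is 20; the 0.5 coefficient reconciles both figures.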

6.3 Results

Tables 1 and 2 offer a summary of the accuracy of the different algorithms. The summaries are presented after grouping the elections following two different categorizations. In Table 1, the elections have been grouped by country and by year in which they were held. In our view, this is the most natural way of grouping these elections since all the datasets belonging to the same year and country share the same political environment. They all correspond to district elections held in the context of the same national general election. Nevertheless, according to Pavía (2022), another natural way of grouping the NZ elections is by type of district, either Māori or regular.

Table 1.

Averages of EI errors by group of elections

| Country year | NZ 2002 | NZ 2005 | SCO 2007 | NZ 2008 | NZ 2011 | NZ 2014 | NZ 2017 | NZ 2020 |
|---|---|---|---|---|---|---|---|---|
| # of elections (N) | 69 | 69 | 73 | 70 | 70 | 71 | 71 | 72 |
| Avg. # of units (Ī) | 83.2 | 81.8 | 70.2 | 84.1 | 85.7 | 81.2 | 101.9 | 134.9 |
| Avg. # of cells | 39.5 | 23.8 | 35.2 | 23.4 | 26.2 | 27.9 | 24.8 | 24.5 |
| lphom-based solutions | | | | | | | | |
| lphom(X, Y) | 16.88 | 12.14 | 12.92 | 12.22 | 12.99 | 12.95 | 12.20 | 14.02 |
| lphom(Y, X) | 16.14 | 12.74 | 10.78 | 11.87 | 13.40 | 13.70 | 12.08 | 12.18 |
| lphom_min | 16.30 | 11.86 | 11.73 | 11.58 | 12.65 | 12.80 | 11.61 | 12.59 |
| lphom_dual_a | 14.87 | 11.47 | 10.50 | 11.03 | 12.14 | 12.03 | 11.19 | 12.16 |
| lphom_dual_w | 14.89 | 11.37 | 10.41 | 10.97 | 12.05 | 12.01 | 11.10 | 12.11 |
| lphom_joint | 15.36 | 11.65 | 10.52 | 11.31 | 12.51 | 12.35 | 11.48 | 12.60 |
| tslphom-based solutions | | | | | | | | |
| tslphom(X, Y) | 14.80 | 10.91 | 11.00 | 10.89 | 11.50 | 11.66 | 10.91 | 12.59 |
| tslphom(Y, X) | 14.52 | 11.50 | 9.46 | 10.77 | 12.22 | 12.50 | 10.95 | 11.07 |
| tslphom_min | 14.06 | 10.65 | 9.64 | 10.30 | 11.20 | 11.61 | 10.36 | 11.18 |
| tslphom_dual_a | 13.18 | 10.27 | 8.96 | 9.87 | 10.83 | 10.83 | 10.03 | 10.92 |
| tslphom_dual_w | 13.15 | 10.17 | 8.87 | 9.79 | 10.74 | 10.78 | 9.95 | 10.85 |
| tslphom_joint | 14.07 | 10.73 | 9.42 | 10.40 | 11.46 | 11.45 | 10.49 | 11.54 |
| nslphom-based solutions | | | | | | | | |
| nslphom(X, Y) | 12.79 | 9.47 | 8.85 | 9.11 | 9.46 | 9.69 | 8.91 | 9.82 |
| nslphom(Y, X) | 12.71 | 9.89 | 8.11 | 9.17 | 10.86 | 11.15 | 9.29 | 9.39 |
| nslphom_min | 12.55 | 9.22 | 8.42 | 8.63 | 9.41 | 9.67 | 8.48 | 9.07 |
| nslphom_dual_a | 11.10 | 8.63 | 7.11 | 8.08 | 9.01 | 9.08 | 8.04 | 8.52 |
| nslphom_dual_w | 11.14 | 8.58 | 7.11 | 8.04 | 8.91 | 9.01 | 7.96 | 8.48 |
| nslphom_joint | 11.54 | 8.86 | 7.53 | 8.44 | 9.30 | 9.39 | 8.23 | 8.99 |

Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). In the case of nslphom-based models, ns is always fixed at 10. The smaller the number, the better the accuracy. Values in bold correspond to the most accurate solutions, on average, in each set of elections.

Table 2.

Averages of EI errors of nslphom-based solutions by alternative groupings of elections

| Group of elections | NZ – regular | NZ – Māori | NZ – all | SCO 2007 | NZ + SCO |
|---|---|---|---|---|---|
| # of elections (N) | 443 | 49 | 492 | 73 | 565 |
| Avg. # of units (Ī) | 65.9 | 343.1 | 93.5 | 70.2 | 90.5 |
| Avg. # of cells | 26.8 | 29.9 | 27.1 | 35.2 | 28.2 |
| nslphom(X, Y) | 9.85 | 10.20 | 9.89 | 8.85 | 9.75 |
| nslphom(Y, X) | 10.47 | 9.14 | 10.34 | 8.11 | 10.05 |
| nslphom_min | 9.54 | 9.83 | 9.57 | 8.42 | 9.42 |
| nslphom_dual_a | 9.02 | 7.99 | 8.92 | 7.11 | 8.68 |
| nslphom_dual_w | 8.96 | 8.04 | 8.87 | 7.11 | 8.64 |
| nslphom_joint | 9.40 | 7.85 | 9.24 | 7.53 | 9.02 |

Source: Compiled by the authors after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms based on nslphom, described in Sections 3–5, with ns = 10 to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). The smaller the number, the better the accuracy. Values in bold correspond to the most accurate solutions, on average, in each set of elections.

Table 2 presents the summaries grouping the NZ elections as alternatively suggested by Pavía (2022), but focusing only on the most accurate algorithms: those based on the nslphom model. To offer more context and make comparisons easier, Table 2 also includes the results attained by pooling all the elections, by combining all the NZ elections in a single group and, again, the Scottish results. The upper panels of both tables contain some summary statistics of the corresponding groups of elections. Here, one of the distinguishing characteristics of the Māori districts stands out: the relatively higher number of units into which their electorates are split. Figure 2 shows graphically the average errors attained by the nslphom-family algorithms in all the groupings.

Figure 2.

Graphical representation of average values of EI error measures grouped by election and nslphom-family algorithm. The smaller the number, the better the accuracy. Individual solutions are attained after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). Detailed figures are available in Tables 1 and 2.

In light of the empirical assessments, two clear findings emerge. First, our results confirm for the symmetric models the order of preference already stated by Pavía and Romero (2022, 2023) for the asymmetric procedures. As with asymmetric models, tslphom-based methods systematically improve on lphom-based ones and, equally, nslphom-based models consistently outperform tslphom-based ones (see Table 1). Second, and more importantly given our main research question, by comparing asymmetric and symmetric models we can assert that symmetric algorithms produce, conditional on the underlying base algorithm, consistently more accurate estimates than asymmetric procedures (at least for the data at hand). Overall, symmetric solutions beat asymmetric solutions almost 90% of the time.

The family of the algorithm (i.e. the underlying procedure, either lphom, tslphom, or nslphom), however, has a higher impact on the accuracy of the estimates than the character, either symmetric or asymmetric, of the model. tslphom-based models, including asymmetric ones, are on average more accurate than lphom-based models, and the same relationship holds for nslphom-based models with regard to tslphom-based ones. In short, the nslphom-based models are clearly the most accurate, so henceforth we focus only on these by concentrating our analysis on the results available in Table 2 and Figure 2 and in the last panel of Table 1. In particular, pooling all the elections and algorithms, nslphom symmetric solutions are seen to be, on average, almost 11% more accurate than nslphom asymmetric ones.

Focusing now on the relationships within the symmetric solutions, we observe some general, although not completely systematic, patterns. Pooling all the results, we see that, on average, the nslphom_dual_w solutions are the most accurate, followed by nslphom_dual_a, nslphom_joint, and nslphom_min solutions in that order (see Table 2 and Figure 2). For some of the groups, however, nslphom_dual_w solutions are not the best on average; they are beaten by nslphom_dual_a (in the 2002 elections group) and by nslphom_joint (in the Māori districts group).

In any case, what we can affirm is that nslphom_dual_w generates the most robust solutions. This algorithm produces, on average, the most accurate solutions despite being, of the four, the algorithm that least often attains the smallest EI error. Indeed, even nslphom_min, the algorithm whose average error is the greatest of the four, produces the best solution more often than nslphom_dual_w. In fact, nslphom_min at the individual level is the best 25% of the time, the same figure recorded for nslphom_dual_a, while the figures for nslphom_joint and nslphom_dual_w are 30% and 20%, respectively. Looking in more depth at the individual matrices, we also find that, as expected, the nslphom_dual_w and nslphom_dual_a solutions are quite close, lying midway between the nslphom_min and the nslphom_joint solutions, although a little closer to the latter.

The results for the Māori electorates are particularly interesting (see Table 2 and Figure 2). In all the groups except the Māori, the dual solutions built as means are the best on average, whereas for the Māori the best on average are the predictions based on the joint algorithm. At first glance, one could think that this is a consequence of some of their distinguishing characteristics. According to Pavía (2022), the Māori districts are remarkably different in many respects; for instance, compared to the rest of the districts of the database, their electorates are split into a greater number of units, their transfer matrices record weaker relationships between row and column options, the electoral behaviour of their voters shows a greater heterogeneity across units, and their datasets show more across-unit variance for both parties and candidates. A closer look at the individual errors, however, reveals that almost all the relative advantage of nslphom_joint compared to nslphom_dual_w and nslphom_dual_a is grounded on the 2002 elections. Indeed, nslphom_joint is only more accurate on average than nslphom_dual_w and nslphom_dual_a in the 2002 and 2005 elections. As a rule, therefore, we can say that, at least for datasets with similar characteristics to the ones analysed in this research, the nslphom_dual_w solutions are preferable. They are more accurate, on average, and quite robust.

Finally, focusing on computational times (see Table 3), we see that all the methods are quite fast. As expected, the nslphom solutions are slower, but they only require a few seconds to reach their solutions. The slowest algorithm is nslphom_joint but, on average, it takes only slightly more than 20 s to reach its solution on these datasets. In general, the computational burden grows with the number of units and the number of cells of the corresponding ecological matrix.

Table 3.

Averages of computational burden (in seconds) by group of elections

| Country year | NZ 2002 | NZ 2005 | SCO 2007 | NZ 2008 | NZ 2011 | NZ 2014 | NZ 2017 | NZ 2020 |
|---|---|---|---|---|---|---|---|---|
| lphom-based algorithms | | | | | | | | |
| lphom(X, Y) | 0.20 | 0.18 | 0.06 | 0.12 | 0.20 | 0.19 | 0.18 | 0.38 |
| lphom(Y, X) | 0.23 | 0.21 | 0.06 | 0.24 | 0.33 | 0.34 | 0.30 | 0.36 |
| lphom_dual | 0.42 | 0.40 | 0.12 | 0.31 | 0.48 | 0.42 | 0.44 | 0.74 |
| lphom_joint | 1.08 | 0.98 | 0.25 | 0.76 | 1.14 | 1.01 | 1.05 | 1.81 |
| tslphom-based algorithms | | | | | | | | |
| tslphom(X, Y) | 0.69 | 0.60 | 0.46 | 0.51 | 0.62 | 0.63 | 0.60 | 1.01 |
| tslphom(Y, X) | 0.72 | 0.61 | 0.47 | 0.62 | 0.72 | 0.69 | 0.77 | 0.97 |
| tslphom_dual | 1.41 | 1.23 | 0.89 | 1.12 | 1.35 | 1.27 | 1.33 | 1.94 |
| tslphom_joint | 3.85 | 2.78 | 2.36 | 2.43 | 3.10 | 2.89 | 2.95 | 4.29 |
| nslphom-based algorithms | | | | | | | | |
| nslphom(X, Y) | 5.11 | 4.35 | 4.07 | 4.11 | 4.73 | 4.55 | 4.61 | 6.43 |
| nslphom(Y, X) | 5.07 | 4.39 | 3.96 | 4.18 | 4.79 | 4.61 | 4.82 | 6.43 |
| nslphom_dual | 10.22 | 8.81 | 8.06 | 8.33 | 9.03 | 8.86 | 9.52 | 12.87 |
| nslphom_joint | 29.03 | 19.39 | 21.44 | 18.09 | 21.90 | 21.39 | 20.94 | 28.99 |

Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). In the case of nslphom-based models, ns is always fixed at 10. The computations have been performed on a laptop with an Intel Core i7-6820HK CPU (4 cores) at 2.70 GHz and 64 GB of RAM.

7. On the factors impacting on the accuracy of the estimates

In the previous section, we compared the average accuracy of the different solutions and set aside their variability. The accuracy of the estimates, however, differs not only across methods but also across districts and elections. In this section, we consider several observed features, specific to each election, and study whether and to what extent they explain the observed differences in accuracy across elections. To do so, we focus on the nslphom_dual_w and nslphom_joint solutions. The former are the most accurate in the analysed database. The latter are obtained with the theoretically finest method, as it solves the problem while handling all the constraints simultaneously.

Figure 3, which displays the distributions of EI across the 565 elections analysed, clearly shows important accuracy differences across elections in both groups of solutions. For instance, for the nslphom_dual_w solutions, EI ranges from a minimum of 2.45 to a maximum of 20.01. The minimum is observed in an election where voters are distributed across 90 polling stations and the problem consists of estimating a 5×5 transfer matrix; the maximum is recorded in an election with data from just 38 polling stations to estimate a 7×7 matrix. This contrast exemplifies a pattern typical of ecological inference estimators, which tend to be more accurate the larger the ratio between units and unknowns, R = I/JK. In particular, in our application, the average EI values for nslphom_dual_w and nslphom_joint are, respectively, 12.47 and 12.43 when R ≤ 1; 9.46 and 9.97 when 1 < R < 2; and 8.08 and 8.43 when R ≥ 2.
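As a quick check on the two extreme cases just quoted, the ratio R = I/JK can be computed directly. The snippet below is a minimal sketch in Python; the function names are illustrative and the three bands are read as R ≤ 1, 1 < R < 2, and R ≥ 2.

```python
def units_per_unknown(n_units: int, n_rows: int, n_cols: int) -> float:
    """R = I / (J * K): polling stations available per cell to be estimated."""
    return n_units / (n_rows * n_cols)


def ratio_band(r: float) -> str:
    """Label the band of R used when averaging EI (assumed band limits)."""
    if r <= 1:
        return "R <= 1"
    if r < 2:
        return "1 < R < 2"
    return "R >= 2"


# The two extreme elections quoted for the nslphom_dual_w solutions.
best_case = units_per_unknown(90, 5, 5)    # election with EI = 2.45
worst_case = units_per_unknown(38, 7, 7)   # election with EI = 20.01

print(f"best:  R = {best_case:.2f} ({ratio_band(best_case)})")
print(f"worst: R = {worst_case:.2f} ({ratio_band(worst_case)})")
```

The most accurate election has more than three polling stations per unknown, whereas the least accurate one has fewer than one, which is in line with the band averages reported above.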

Figure 3.

Histograms of the distributions of the error indexes (EI) corresponding to the nslphom_dual_w and nslphom_joint solutions attained in the 565 elections analysed. The discontinuous vertical lines place the means of the distributions.

The accuracy of the estimates therefore depends on structural characteristics of the particular election under study. In what follows, we analyse whether the observed variables previously identified in the literature as explaining accuracy for asymmetric methods also work for the proposed symmetric models. In particular, the list of features recognized in the literature (e.g. King, 1997; Klima et al., 2016; Park et al., 2014; Pavía & Romero, 2023; Plescia & De Sio, 2018; Wakefield, 2004) as factors affecting the accuracy of the estimates includes: (i) the amount of available information, I; (ii) the complexity of the problem, JK; (iii) the level of heterogeneity among the local transfer matrices, HET; (iv) the strength of the relationship between the row and column categories, χ²; (v) the degrees of within-unit diversity/variability in both row and column marginal distributions, DWR and DWC, which account for how similar or different the sizes of the groups/categories are; and (vi) the extent of the across-unit variability in both row and column marginal distributions, VAR and VAC, which measure how similar or different the marginal distributions are across tables.

As a rule, ceteris paribus, the larger the number of polling stations and the smaller the contingency tables (the number of coefficients to be estimated), the more accurate the estimates are, with the impact of each of these two features being conditioned by the value taken by the other. Indeed, it is customarily stated that a ratio of at least two units (polling stations) per coefficient is necessary for a proper estimation of the vote transfer matrix (Plescia & De Sio, 2018). Thus, the impacts of I and JK are usually assessed together, using as a joint indicator the number of polling stations divided by the product of the number of rows and the number of columns: R = I/JK.

The values of HET and χ² are commonly unknown, as they both depend on the actual district vote transfer matrix. Although in our case they could be computed exactly, we prefer to consider a typical situation and to study the impact of these features relying on their estimates, which are contingent on the specific solution reached. In particular, we approximate HET using the HETe coefficient given by equation (33) and χ² through $\chi^2_e = \sum_{j,k}\left[\left(\hat{v}_{jk} - \sum_{j}\hat{v}_{jk}\sum_{k}\hat{v}_{jk}\right)^2 \big/ \left((J-1)(K-1)\sum_{j}\hat{v}_{jk}\sum_{k}\hat{v}_{jk}\right)\right]$, where the divisor $(J-1)(K-1)$ accounts for the different number of cells that each matrix has.
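The χ²e statistic can be illustrated with a small computation. The following is a minimal sketch, assuming that the estimated table is first normalized to proportions so that the product of a row margin and a column margin is the cell share expected under independence; the scaling used in the paper may differ, and the function name is illustrative.

```python
def chi2_e(v_hat):
    """Pearson-type association of an estimated J x K table, divided by
    (J - 1)(K - 1). Assumes all row and column margins are positive."""
    total = sum(sum(row) for row in v_hat)
    p = [[cell / total for cell in row] for row in v_hat]  # cell proportions
    n_rows, n_cols = len(p), len(p[0])
    row_margin = [sum(row) for row in p]
    col_margin = [sum(p[j][k] for j in range(n_rows)) for k in range(n_cols)]
    stat = 0.0
    for j in range(n_rows):
        for k in range(n_cols):
            expected = row_margin[j] * col_margin[k]  # share under independence
            stat += (p[j][k] - expected) ** 2 / expected
    return stat / ((n_rows - 1) * (n_cols - 1))


independent = [[10, 20], [30, 60]]   # rows proportional: no association
diagonal = [[50, 0], [0, 50]]        # perfect straight-ticket pattern

print(chi2_e(independent))
print(chi2_e(diagonal))
```

Under this normalization the statistic is zero when rows and columns are independent and grows as votes concentrate in a few cells.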

To approximate district within-unit diversities, different statistics have been tried in the literature. They have been measured as (a) (weighted) averages of the standard deviations of the unit marginal distributions (using as weights the number of voters per unit), (b) the standard deviations of the district marginal distributions, used directly, and (c) the formula $1-\sum_{h}\pi_h^2$, with $\pi_h$ representing the proportion of votes gained by option h in the total population. In practice, these three measures are (almost) equivalent. In our dataset, the correlations between (a) and (b), (a) and (c), and (b) and (c) are, respectively, 0.99, −0.99, and −0.99 for parties and 0.99, −0.99, and −0.98 for candidates. We approximate DWR and DWC using (b), the standard deviations of the district marginal distributions of parties and candidates, respectively. Finally, to measure the extent of the diversity/similarity of party and candidate distributions across polling stations, we have used the compositional total variance (Pawlowsky-Glahn et al., 2015): $\mathrm{VAR}=(2J)^{-1}\sum_{j,j'}^{J}\mathrm{Var}\left(\{\log(x_{ij}/x_{ij'})\}_i\right)$ and $\mathrm{VAC}=(2K)^{-1}\sum_{k,k'}^{K}\mathrm{Var}\left(\{\log(y_{ik}/y_{ik'})\}_i\right)$, where the variances are computed across the I units. Note that, in any application, both VAR and VAC need to be different from zero for the ecological inference problem to have a solution, as all the methods learn from the statistical covariations between the marginal distributions of the polling stations.
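Measure (c) and the compositional total variance can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names are hypothetical, the sample variance (divisor I − 1) is used, and all entries are taken to be strictly positive. It also checks the identity $1-\sum_h\pi_h^2 = 1 - 1/H - H\sigma^2$, where $\sigma^2$ is the population variance of the H shares, which is consistent with the strongly negative correlations reported between measures (b) and (c).

```python
import math
from statistics import pstdev, variance


def simpson_diversity(shares):
    """Measure (c): 1 - sum_h pi_h^2 for district-level vote shares."""
    return 1.0 - sum(p * p for p in shares)


def total_variance(units):
    """Compositional total variance: (2D)^-1 times the sum, over ordered
    pairs (a, b), of the variance across units of log(x_ia / x_ib).

    `units` is a list of I rows with D strictly positive entries each.
    Using the sample variance is an assumption of this sketch.
    """
    d = len(units[0])
    total = 0.0
    for a in range(d):
        for b in range(d):
            log_ratios = [math.log(row[a] / row[b]) for row in units]
            total += variance(log_ratios)
    return total / (2 * d)


shares = [0.5, 0.3, 0.2]                 # made-up district shares
h = len(shares)
identity_rhs = 1 - 1 / h - h * pstdev(shares) ** 2

votes = [[120, 80, 40], [60, 90, 30], [200, 50, 150]]  # made-up unit counts
print(simpson_diversity(shares), identity_rhs, total_variance(votes))
```

Because only log-ratios enter the computation, rescaling any unit (for example, a larger polling station with the same composition) leaves the total variance unchanged, and units with identical compositions yield a value of zero.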

Having defined the features, we study their impact on the accuracy of the nslphom_dual_w and nslphom_joint solutions using linear regression models. To do so, we adopt a two-step strategy. First, we analyse the marginal impact of each feature, allowing for the possibility that its (increasing/decreasing) effect operates at a decreasing/increasing rate. That is, we allow for nonlinear effects in the one-feature models by also including the square of the feature as a potential regressor. Second, we jointly estimate the impact on accuracy of all the regressors identified as statistically significant in the marginal (one-feature) models. To fit the models, we have excluded the Māori electorates from the analysis. Although the aggregate results are quite similar when these elections are included, we have opted not to consider them when fitting the models because their singular characteristics would undermine the impact of the ratio I/JK. Indeed, the Māori elections can be regarded as non-standard, as their electorates are distributed across a large number of geographically dispersed polling stations, the majority of which have very few voters.

Table 4 presents a statistical summary of the values of the abovementioned features in the 516 districts that remain in the database after excluding the Māori electorates. As can be observed, almost all variables show positively skewed (right-skewed) distributions. In terms of correlations, the largest are observed for the pairs I/JK and DWR (0.74), I/JK and χ²e (0.53), HETe_d and DWC (−0.64), and HETe_j and DWR (−0.55). Given the large sample size, we do not expect these correlations to pose a problem for interpreting the models obtained.

Table 4.

Statistical summary of the explicative variables, excluding Māori electorates (N = 516)

| | I/JK | HETe_d | HETe_j | χ²e_d | χ²e_j | DWR | DWC | VAR | VAC |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 2.56 | 4.25 | 4.01 | 2932 | 2909 | 0.18 | 0.21 | 0.98 | 0.85 |
| Standard deviation | 1.19 | 1.25 | 0.95 | 966 | 960 | 0.04 | 0.04 | 0.36 | 0.35 |
| Minimum | 0.52 | 2.42 | 2.16 | 377 | 347 | 0.06 | 0.09 | 0.29 | 0.12 |
| Maximum | 7.73 | 9.44 | 8.45 | 6297 | 6613 | 0.34 | 0.33 | 2.77 | 2.12 |
| Skewness | 1.16 | 1.05 | 0.94 | 0.40 | 0.40 | 0.06 | −0.13 | 1.10 | 0.74 |

Source: Compiled by the authors through $\chi^2_e = \sum_{j,k}\left[\left(\hat{v}_{jk} - \sum_{j}\hat{v}_{jk}\sum_{k}\hat{v}_{jk}\right)^2 \big/ \left((J-1)(K-1)\sum_{j}\hat{v}_{jk}\sum_{k}\hat{v}_{jk}\right)\right]$, $\mathrm{VAR}=(2J)^{-1}\sum_{j,j'}^{J}\mathrm{Var}\left(\{\log(x_{ij}/x_{ij'})\}_i\right)$, and $\mathrm{VAC}=(2K)^{-1}\sum_{k,k'}^{K}\mathrm{Var}\left(\{\log(y_{ik}/y_{ik'})\}_i\right)$, and using equation (33) and the standard deviations of the district marginal distributions of parties and candidates to calculate HETe, DWR, and DWC, respectively. The suffixes _d and _j stand for dual and joint, respectively.


To facilitate the interpretation of the estimated coefficients and of the relative impact of their associated features on accuracy, all the explanatory variables have been standardized to zero mean and unit standard deviation. In this way, each estimated coefficient gives the expected change in the response variable, relative to its mean, associated with a change of one standard deviation in the corresponding variable. From the one-feature models (not shown), one can infer that (i) I/JK is the feature with the largest impact on accuracy, followed by DWC and χ²e; (ii) two variables (I/JK and χ²e) show a significant curvature in their relationship with EI, i.e. accuracy improves but at decreasing rates as both variables grow; and (iii) accuracy also improves when either row or column within-unit or across-unit variances increase.
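The standardization step can be sketched as follows. This is a minimal illustration in Python with made-up I/JK values; the use of the sample standard deviation and the construction of the quadratic regressor as the square of the standardized feature are assumptions of the sketch, not details confirmed by the text.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale a feature to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


ratios = [0.8, 1.5, 2.1, 2.6, 3.4, 5.0]   # hypothetical I/JK values
z = standardize(ratios)
z_squared = [v * v for v in z]            # candidate regressor for curvature

print([round(v, 2) for v in z])
```

After this transformation, a coefficient of −1.60 on a feature reads as a fall of 1.60 EI points per standard deviation of that feature, which makes coefficients comparable across features measured on very different scales.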

Some changes between the marginal and joint impacts of the features emerge when we fit the full multivariate specifications. Table 5 presents the coefficients of the fitted linear regression models. All variables, except the square of χ²e and DWR, show a statistically significant impact. Together, these variables explain around 25% of the observed variability of EI. As expected, larger I/JK ratios lead to more reliable estimates; this feature is, moreover, the one with the largest impact. The other two variables that increase accuracy (i.e. reduce EI) are the column within-unit and across-unit variabilities. Whereas in the one-feature models both row and column within-unit and across-unit variabilities improve accuracy in a similar fashion, when we consider all the features simultaneously DWR loses its statistical significance and VAR reverses its sign. This result may seem puzzling at first glance, given the symmetric nature of the algorithms, but the different impacts in this database are a consequence of the differences between the row and column distributions of variabilities/diversities. Finally, the models also show that, controlling for the rest of the variables, the larger the heterogeneities or the stronger the relationships between row and column categories, the less accurate the solutions are.
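To make the reading of the I/JK coefficients concrete, the short sketch below evaluates the fitted EI from the intercept, I/JK and (I/JK)² estimates reported in Table 5, holding every other standardized feature at its mean (zero). It assumes that the quadratic term is the square of the standardized ratio; the function names are illustrative.

```python
# Intercept, I/JK and (I/JK)^2 estimates as reported in Table 5.
COEFFICIENTS = {
    "nslphom_dual_w": {"intercept": 8.23, "ratio": -1.60, "ratio_sq": 0.32},
    "nslphom_joint": {"intercept": 8.72, "ratio": -1.89, "ratio_sq": 0.37},
}


def fitted_ei(method, z_ratio):
    """Fitted EI when only the standardized I/JK differs from its mean."""
    c = COEFFICIENTS[method]
    return c["intercept"] + c["ratio"] * z_ratio + c["ratio_sq"] * z_ratio ** 2


def turning_point(method):
    """Standardized I/JK at which the fitted quadratic stops decreasing."""
    c = COEFFICIENTS[method]
    return -c["ratio"] / (2 * c["ratio_sq"])


for name in COEFFICIENTS:
    at_mean = fitted_ei(name, 0.0)
    one_sd_up = fitted_ei(name, 1.0)
    print(name, round(at_mean, 2), round(one_sd_up, 2),
          round(turning_point(name), 2))
```

Under these assumptions, moving one standard deviation above the mean ratio lowers the fitted EI from 8.23 to 6.95 for nslphom_dual_w, and the positive quadratic term means that further gains shrink as the ratio grows, flattening out at roughly 2.5 standard deviations above the mean.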

Table 5.

Impact of different features on solutions’ accuracy

Response variable: error index (EI)

| Variable | nslphom_dual_w: Estimate | nslphom_dual_w: p-value | nslphom_joint: Estimate | nslphom_joint: p-value |
|---|---|---|---|---|
| Intercept | 8.23 | <0.0001 | 8.72 | <0.0001 |
| I/JK | −1.60 | <0.0001 | −1.89 | <0.0001 |
| (I/JK)² | 0.32 | <0.0001 | 0.37 | <0.0001 |
| HETe | 0.46 | 0.0009 | 0.84 | <0.0001 |
| χ²e | 0.78 | <0.0001 | 1.15 | <0.0001 |
| (χ²e)² | 0.14 | 0.0583 | 0.05 | 0.5368 |
| VAR | 0.67 | <0.0001 | 0.43 | 0.0027 |
| VAC | −0.66 | <0.0001 | −0.51 | 0.0001 |
| DWR | 0.23 | 0.1904 | 0.34 | 0.0549 |
| DWC | −0.44 | 0.0302 | −0.52 | 0.0042 |
| Adjusted R² (%) | 24.90 | | 25.73 | |
| Residual Std. error | 2.21 | | 2.33 | |

Source: Compiled by the authors. All the predictor variables were standardized before fitting the models to make comparisons of coefficients easier.


8. Discussion, conclusions, and future research

Ecological inference methods are devised to infer conditional distribution probabilities from marginal distributions. In doing so, they consider a main characteristic variable (e.g. race or social class) impacting on a response variable, usually the vote. This scheme fits the majority of ecological inference problems. In some instances, however, such as in simultaneous elections, this general scheme can be questionable. The issue is that different solutions are achieved depending on which variable is considered the factor (origin, explanatory, cause) and which the response. The methods are asymmetric. In this paper, we ask whether we can obtain some advantage in terms of accuracy by dealing with this inverse problem in a symmetric way. That is, by solving it as a purely mathematical puzzle that must verify some congruence properties, omitting the possible presence of a natural a priori relationship (i.e. a logical mapping of the variables to rows and columns). Afterwards, the researcher can recover the meaningful order when presenting the outcomes.

To answer this research question, this paper builds, within the linear programming framework, two new families of algorithms whose solutions do not depend on how variables are mapped to rows and columns. These are symmetric methods. From a statistical standpoint, symmetric approaches offer the advantage of adhering to the probabilistic symmetry condition, meaning that their solutions conform to Bayes' theorem. It should be noted, however, that although the symmetry condition can be considered a desirable property, it alone does not guarantee accurate solutions. For example, an algorithm that assumes conditional independence between rows and columns given the unit (equivalent to the nonlinear neighbourhood model; Freedman et al., 1998) would multiply the average error by more than four in the datasets analysed in this paper, despite satisfying the symmetry condition.
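The symmetry property can be illustrated with a toy example. The sketch below is not the lphom implementation: it assumes two already computed asymmetric solutions (made-up 2×2 vote tables with matching margins) and combines them with a plain unweighted average, whereas the dual variants in the paper may weight the two solutions differently. The helper names are hypothetical.

```python
def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def average_of_duals(v_xy, v_yx):
    """Unweighted cell-by-cell average of the (X, Y) solution and the
    transposed (Y, X) solution. Hypothetical helper, not the lphom API."""
    v_yx_t = transpose(v_yx)
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(v_xy, v_yx_t)]


# Toy vote tables sharing the same margins (made-up numbers).
v_xy = [[30, 10], [20, 40]]   # rows: X categories, columns: Y categories
v_yx = [[26, 24], [14, 36]]   # rows: Y categories, columns: X categories

sym_xy = average_of_duals(v_xy, v_yx)   # X mapped to rows
sym_yx = average_of_duals(v_yx, v_xy)   # Y mapped to rows

print(sym_xy)
print(transpose(sym_yx))
```

Both mappings return the same table up to transposition, so the row-conditional and column-conditional proportions derived from it are tied together by Bayes' theorem, and the original margins are preserved because both inputs share them.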

We evaluate the accuracy of the proposed methods using real data corresponding to 565 simultaneous elections where the true district-level cross-classifications of votes are known. Our empirical assessment clearly points to the proposed symmetric solutions as being more accurate than the equivalent asymmetric approaches, with the ratio between available information and the complexity/difficulty of the problem being the characteristic with the greatest impact on accuracy. Overall, the optimal solutions reached with the joint specifications, those which solve linear programmes that simultaneously include all the constraints, are not the most accurate on average. They are outperformed by the methods that build their solutions as an average of asymmetric solutions. We conclude that, for the datasets at hand, the nslphom_dual_w solutions are, on average, the most accurate. This method owes its better average accuracy to its greater robustness, since it is the one that attains the smallest error least often among the nslphom symmetric models. Indeed, no symmetric algorithm systematically beats the rest of the symmetric methods, so a question to be answered is whether the circumstances themselves could indicate which symmetric method to choose. In other words, can we find specific configurations derivable from the observed data that call for the recommendation of a specific method for a particular dataset?

Given that mathematical programming offers the natural methodological setting for simultaneously handling all the constraints linked to a symmetric treatment of the problem, we have considered only algorithms within the linear programming framework where we have built the symmetric solutions from asymmetric methods. This has the advantage of maintaining all the proposals within the same framework, making comparisons fairer and producing methods easy to use and robust to claims of hacking. The user only needs to input a set of ecological data and a maximum number of iterations, and the procedures automatically return a sensible solution. Nevertheless, it would be worth extending the analysis to other methodological settings, i.e. to study whether similar results would be obtained if symmetric methods were built from scratch or using asymmetric algorithms from other frameworks. For example, the following question could be addressed: can inferences attained as an average of the two dual solutions of the Rosen et al. (2001) R × C model systematically improve the asymmetric solution achieved by choosing the most logical approach (of the two)? Indeed, in our view, the new angle for tackling the ecological inference problem offered by this research should also be systematically explored from other conceptual frameworks. This, however, demands new theoretical developments. More specifically, just as ecological inference linear programming requires new developments to introduce information from polls into its models (Pavía, 2023), an effort is also required to develop genuine symmetric models from other frameworks, such as the Bayesian and frequentist ecological inference ones. In this case, we consider that models based on conditional multivariate hypergeometric distributions or full-table multinomial distributions could prove successful.

Our empirical results are based on data from simultaneous elections so it would also be worth exploring whether our conclusions can be confirmed in other contexts, such as voting rights litigation or literacy studies. Despite electoral datasets with true answers in other target application areas being scarce and typically not available, it would be interesting to replicate the study of Barreto et al. (2022) on racially polarized voting and analyse the impact of using symmetric approaches on substantive conclusions and on the accuracy in estimating the inner-cells values of their simulated datasets. Likewise, the symmetric approach could also be tested estimating the tables of the datasets employed by Jiang et al. (2020). These datasets include data on US mortality rates by gender and race, or literacy rates and educational attendance by gender in India.

Finally, from a more methodological perspective, it would be interesting to study what impact initializing the iterative process with a different matrix of votes would have on the accuracy of the nslphom_joint optimal solutions; for instance, what the impact would be of initializing with the nslphom_dual_w solution instead of with the lphom_joint solution. Furthermore, given the lack of theoretical results about the (asymptotic) distributions of the estimators, we consider that the bootstrap approach (e.g. Efron & Tibshirani, 1994) could be used to measure the precision of estimates. Although lphom-based specifications can be theoretically regarded as a linear absolute fitting problem after applying Theorem 1 of chapter 6 in Bloomfield and Steiger (1983, pp. 164–165) and, therefore, their estimators can ‘usually’ be considered as having ‘a limiting normal distribution’ (Bloomfield & Steiger, 1983, p. 44), in our view this does not apply as a rule in the ecological inference problem and definitely cannot be used in the tslphom- and nslphom-based specifications. On the one hand, it is difficult to accept that the assumptions required by the different asymptotic theorems hold in the ecological inference problem, due to, among other issues, the discrete character of the margins and the cross structures of relationships that the row and column aggregations impose. On the other hand, and more importantly, the tslphom- and nslphom-based specifications do not fulfil the hypothesis of the abovementioned Theorem 1, as it requires that the number of summands (auxiliary variables) in the objective function be greater than the number of equality constraints. The unit-table algorithms, lphom_local and lphom_local_joint, that are at the core of the tslphom- and nslphom-family models solve linear programmes where the number of auxiliary variables is smaller than the number of equality constraints.
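The bootstrap idea mentioned above can be sketched generically: polling units are resampled with replacement and the estimator is recomputed on each resample. The code below is only an illustration under stated assumptions. The estimator is a deliberately simple stand-in (a pooled vote share) rather than an ecological inference algorithm, the data are made up, and a percentile interval is one of several possible bootstrap summaries.

```python
import random


def bootstrap_interval(units, estimator, n_boot=500, level=0.90, seed=1):
    """Percentile bootstrap interval obtained by resampling units."""
    rng = random.Random(seed)
    n = len(units)
    draws = []
    for _ in range(n_boot):
        resample = [units[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(resample))
    draws.sort()
    lower = draws[int((1 - level) / 2 * (n_boot - 1))]
    upper = draws[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper


def pooled_share(units):
    """Stand-in estimator: pooled share of option A across units.
    A real application would call an ecological inference routine here."""
    return sum(a for a, _ in units) / sum(total for _, total in units)


# Made-up (votes for option A, total votes) per polling unit.
units = [(120, 300), (80, 250), (200, 420), (45, 150), (310, 600), (95, 280)]
low, high = bootstrap_interval(units, pooled_share)
print(round(low, 3), round(pooled_share(units), 3), round(high, 3))
```

Replacing `pooled_share` with a function that runs the chosen algorithm on the resampled units and returns one cell of the estimated transfer matrix would give an interval for that cell, at the cost of one model fit per resample.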

The above issues do not imply that ecological inference linear programming approaches lack a statistical interpretation or that the estimated coefficients necessarily exhibit undesirable statistical properties. On the one hand, in addition to the established links between linear programming solutions and minimization of expected discrepancies, as previously mentioned when discussing the proposed models, it is important to note that tslphom- and nslphom-based approaches, like many other ecological inference models, rely on the underlying assumption of independence across units of the two-way distributions of votes, given their aggregate two-way joint distribution. In other words, the set of unit two-way fraction distributions can be viewed as a simple random sample from a common probability distribution. This entails that, without covariates, estimates can only be reliable when models are applied in homogeneous political regions. On the other hand, we believe that, similar to how sampling properties of quadratic programming solutions can be obtained through the links between quadratic programming and inequality-constrained normal regression (Geweke, 1986; Judge & Takayama, 1966), the connection between linear programming and inequality-constrained linear absolute fitting with Laplace-distributed errors could be explored to derive statistical properties of basic ecological inference linear programming solutions. In addition, we also find it remarkable that the iterative algorithm that characterizes nslphom-based methods can be understood in terms of an expectation-maximization algorithm (Dempster et al., 1977). In this comparison, equations (6)–(13) would correspond to the expectation step, equation (14) to the maximization step, and equations (1)–(5) would be required for generating an initial estimate of the expected transition probabilities. This connection becomes more evident when examining the variants introduced in the lclphom algorithm (Pavía, 2024).

Acknowledgements

The authors wish to thank the editors and four anonymous reviewers for their valuable comments and suggestions and M. Hodkinson for revising the English of the paper.

Funding

This research has been supported by Conselleria de Educación, Universidades y Empleo, Generalitat Valenciana [grant number AICO/2021/257] and by Ministerio de Economía e Innovación [grant number PID2021-128228NB-I00].

Data availability

The data used in this research is publicly available on the R package ei.Datasets (version 0.0.1-1) accessible on CRAN. The reproducible ad hoc R-code employed, based on functions included in the R package lphom (version 0.3.0-7), is available in the attached online supplementary material.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series A.

References

Andreadis, I., & Chadjipadelis, T. (2009). A method for the estimation of voter transition rates. Journal of Elections, Public Opinion and Parties, 19(2), 203–218.

Barreto, M., Collingwood, L., Garcia-Rios, S., & Oskooii, K. A. R. (2022). Estimating candidate support in voting rights act cases: Comparing iterative EI and EI-R×C methods. Sociological Methods & Research, 51(1), 271–304.

Bernardini-Papalia, R., & Fernández-Vázquez, E. (2020). Entropy-based solutions for ecological inference problems: A composite estimator. Entropy, 22(7), 781.

Bloomfield, P., & Steiger, W. L. (1983). Least absolute deviations: Theory, applications and algorithms. Birkhäuser-Springer.

Brown, P. J., & Payne, C. D. (1986). Aggregate data, ecological regression and voting transitions. Journal of the American Statistical Association, 81(394), 452–460.

Collingwood, L., Decter-Frain, A., Murayama, H., Sachdeva, P., & Burke, J. (2020). eiCompare: Compares ecological inference, Goodman, rows by columns estimates. R package version 3.0.0. https://CRAN.R-project.org/package=eiCompare

Collingwood, L., Oskooii, K., Garcia-Rios, S., & Barreto, M. (2016). eiCompare: Comparing ecological inference estimates across EI and EI:R×C. The R Journal, 8(2), 92–101.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Chapman and Hall/CRC.

Ferree, K. E. (2004). Iterative approaches to RxC ecological inference problems: Where they can go wrong and one quick fix. Political Analysis, 12(2), 143–159.

Forcina, A., & Pellegrino, D. (2019). Estimation of voter transitions and the ecological fallacy. Quality & Quantity, 53(4), 1859–1874.

Freedman, D. A., Klein, S. P., Ostland, M., & Roberts, M. R. (1998). A solution to the ecological inference problem (book review). Journal of the American Statistical Association, 93(444), 1518–1522.

Freedman, D. A., Ostland, M., Roberts, M. R., & Klein, S. P. (1999). Reply to G. King. Journal of the American Statistical Association, 94(445), 355–357.

Gelman, A., Park, D. K., Ansolabehere, S., Price, L. C., & Minnite, L. C. (2001). Models, assumptions and model checking in ecological regression. Journal of the Royal Statistical Society, Series A, 164(1), 101–118.

Geweke, J. (1986). Exact inference in the inequality constrained normal linear regression model. Journal of Applied Econometrics, 1(2), 127–141.

Glynn, A., & Wakefield, J. (2010). Ecological inference in the social sciences. Statistical Methodology, 7(3), 307–322.

Goodman, L. A. (1953). Ecological regressions and the behavior of individuals. American Sociological Review, 18(6), 663–664.

Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of Sociology, 64(6), 610–625.

Greiner, D. J. (2007). Ecological inference in voting rights act disputes: Where are we now, and where do we want to be? Jurimetrics, 47(2), 115–167. https://www.jstor.org/stable/29762964

Greiner, D. J., Baines, P., & Quinn, K. M. (2021). RxCEcolInf: R x C ecological inference with optional incorporation of survey information. R package version 0.1-5. https://CRAN.R-project.org/package=RxCEcolInf

Greiner, D. J., & Quinn, K. M. (2009). R×C ecological inference: Bounds, correlations, flexibility, and transparency of assumptions. Journal of the Royal Statistical Society, Series A, 172(1), 67–81.

Greiner, D. J., & Quinn, K. M. (2010). Exit polling and racial bloc voting: Combining individual level and RxC ecological data. The Annals of Applied Statistics, 4(4), 1774–1796.

Hawkes, A. (1969). An approach to the analysis of electoral swing. Journal of the Royal Statistical Society, Series A, 132(1), 68–79.

Jiang, W., King, G., Schmaltz, A., & Tanner, M. A. (2020). Ecological regression with partial identification. Political Analysis, 28(1), 65–86.

Johnston, R. J., Hay, A. M., & Rumley, D. (1983). Entropy-maximizing method for estimating voting data: A critical test. Area, 15(1), 35–41. https://www.jstor.org/stable/20001867

Johnston, R. J., & Pattie, C. (2003). Evaluating an entropy-maximizing solution to the ecological inference problem: Split-ticket voting in New Zealand, 1999. Geographical Analysis, 35(1), 1–23.

Judge, G. G., Miller, D. J., & Tam Cho, W. K. (2004). An information theoretic approach to ecological estimation and inference. In G. King, O. Rosen, & M. A. Tanner (Eds.), Ecological inference. New methodological strategies (pp. 162–187). Cambridge University Press.

Judge, G. G., & Takayama, T. (1966). Inequality restrictions in regression analysis. Journal of the American Statistical Association, 61(313), 166–181.

Katz, J. N. (2014). Expert report on voting in the city of Whittier. Superior Court of the State of California.

Kellermann, T. (2011). Vom Wahlergebnis zur Wählerwanderung: Welche Wähler wechselten wie ihre Entscheidung. Stadtforschung und Statistik, 2011(1), 34–40.

King, G. (1997). A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data. Princeton University Press.

King, G., Rosen, O., & Tanner, M. A. (Eds.) (2004). Ecological inference. New methodological strategies. Cambridge University Press.

Klein, J. M. (2019). Estimation of voter transitions in multi-party systems. Quality of credible intervals in (hybrid) multinomial-Dirichlet models [Master thesis dissertation]. Ludwig-Maximilians-Universität München.

Klima, A., Schlesinger, T., Thurner, P. W., & Küchenhoff, H. (2019). Combining aggregate data and exit polls for the estimation of voter transitions. Sociological Methods & Research, 48(2), 296–325.

Klima, A., Thurner, P. W., Molnar, C., Schlesinger, T., & Küchenhoff, H. (2016). Estimation of voter transitions based on ecological inference: An empirical assessment of different approaches. AStA-Advances in Statistical Analysis, 100(2), 133–159.

Lau, O., Moore, O. R. T., & Kellermann, M. (2020). eiPack: Ecological inference and higher-dimension data management. R package version 0.2-1. https://CRAN.R-project.org/package=eiPack

Manski, C. F. (2007). Identification for prediction and decision. Harvard University Press.

O’Loughlin, J. (2000). Can King’s ecological inference method answer a social scientific puzzle: Who voted for the Nazi party in Weimar Germany? Annals of the Association of American Geographers, 90(3), 592–601.

Park, W.-H., Hanmer, M. J., & Biggers, D. R. (2014). Ecological inference under unfavorable conditions: Straight and split-ticket voting in diverse settings and small samples. Electoral Studies, 36, 192–203.

Pavía, J. M. (2022). ei.Datasets: Real datasets for assessing ecological inference algorithms. Social Science Computer Review, 40(1), 247–260.

Pavía, J. M. (2023). Adjustment of initial estimates of voter transition probabilities to guarantee consistency and completeness. SN Social Sciences, 3(5), 75.

Pavía, J. M. (2024). A local convergent ecological inference algorithm for RxC tables.

Pavía, J. M., & Cantarino, I. (2017). Dasymetric distribution of votes in a dense city. Applied Geography, 86, 22–31.

Pavía, J. M., & Romero, R. (2022). Improving estimates accuracy of voter transitions. Two new algorithms for ecological inference based on linear programming. Sociological Methods & Research.

Pavía, J. M., & Romero, R. (2023). Data wrangling, computational burden, automation, robustness and accuracy in ecological inference forecasting of RxC tables. SORT-Statistics and Operations Research Transactions, 47(1), 151–186.

Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2015). Modeling and analysis of compositional data. John Wiley & Sons, Ltd.

Plescia, C., & De Sio, L. (2018). An evaluation of the performance and suitability of RxC methods for ecological inference with known true values. Quality & Quantity, 52(2), 669–683.

Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3), 351–357.

Romero, R., & Pavía, J. M. (2021). Estimating vote party entries and exits by ecological inference. Mathematical programming versus Bayesian statistics. Boletín de Estadística e Investigación Operativa, 37(2), 85–97.

Romero, R., Pavía, J. M., Martín, J., & Romero, G. (2020). Assessing uncertainty of voter transitions estimated from aggregated data. Application to the 2017 French presidential election. Journal of Applied Statistics, 47(13–15), 2711–2736.

Rosen, O., Jiang, W., King, G., & Tanner, M. A. (2001). Bayesian and frequentist inference for ecological inference: The RxC case. Statistica Neerlandica, 55(2), 134–156.

Schakel, A. H., & Romanova, V. (2020). Vertical linkages between regional and national electoral arenas and their impact on multilevel democracy. Regional and Federal Studies, 30(3), 323–342.

Tam Cho, W. K. (1998). Iff the assumption fits…: A comment on the King ecological inference solution. Political Analysis, 7, 143–163.

Tam Cho, W. K., & Gaines, B. J. (2004). The limits of ecological inference: The case of split-ticket voting. American Journal of Political Science, 48(1), 152–171.

Thomsen, S. R. (1987). Danish elections, 1920–79: A logit approach to ecological analysis and inference. Politica.

Tziafetas, G. (1986). Estimation of the voter transition matrix. Optimization, 17(2), 275–279.

Wakefield, J. (2004). Ecological inference for 2×2 tables (with discussion). Journal of the Royal Statistical Society, Series A, 167(3), 385–445.

Author notes

Conflicts of interest: none declared.

This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://dbpia.nl.go.kr/pages/standard-publication-reuse-rights).

Supplementary data