Jose M Pavía, Rafael Romero, Symmetry estimating R × C vote transfer matrices from aggregate data, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 187, Issue 4, October 2024, Pages 919–943, https://doi.org/10.1093/jrsssa/qnae013
Abstract
Ecological inference methods are devised to estimate unknown inner-cells of 2-way contingency tables by inferring conditional distribution probabilities. This outlines one of the more long-standing social science problems, chiefly frequent in political science and sociology. To solve the problem, ecological inference algorithms consider an asymmetric relationship, with a main characteristic (e.g. race or social class) mapped to rows impacting on a dependent variable, usually the vote, mapped to columns. The problem arises because different solutions are reached depending on how variables are assigned to rows and columns. The models are asymmetric. In this paper, we propose 2 new sets of ecological inference algorithms and explore if accuracy could be improved by handling the problem in a symmetric way. We assess the accuracy of the proposed methods using real data from more than 550 concurrent elections where the true district-level cross-classifications of votes (straight- and split-tickets) are known. Our empirical assessment clearly identifies the symmetric solutions as more accurate. They outperform asymmetric methods 90% of the time and reduce error, on average, by 11%. Our results are based on data from simultaneous elections, so further research is required to see whether our conclusions can be maintained in other ecological inference contexts. Interested readers can easily use the proposed methods as they are implemented in the R package lphom.
1. Introduction
In elections, aggregate-level data are abundant and reliable whereas individual-level data are largely unavailable due to the secret ballot, and when they are available they tend to be too inexact and too scarce to estimate voter transitions (changes in voters’ choices between elections), as they usually originate from polls (Romero et al., 2020). Despite this, many agents, including political parties, the media, and US legal practitioners of voting rights, are still interested in knowing the voting behaviour of different subgroups of people. Ecological inference has been devised to solve this issue by exploiting the aggregate-level data relationships (King, 1997).
Ecological inference refers to the process of inferring individual-level relationships from aggregate (‘ecological’) data when individual-level data are not available. It is routinely employed in many disciplines, from economics and epidemiology to sociology and political science (Pavía, 2022), despite being exposed to the so-called ecological fallacy (Robinson, 1950). For instance, ecological inference algorithms are used to estimate vote transfer matrices between elections, infer split-ticket voting behaviours or reveal social and racial voting patterns (Barreto et al., 2022; O’Loughlin, 2000; Park et al., 2014; Romero & Pavía, 2021).
Using a classical contingency table representation in which individuals are classified by rows according to the groups to which they belong (e.g. social class, previous vote or race) and by columns by, for instance, their votes, the unknown cross-distributions are estimated using as data the observed row and column marginal distributions in a set of non-overlapping (geographical) units (e.g. precincts). The difficulty arises because substantially different inner cell counts can give rise to the same aggregated row and column totals, and this indeterminacy cannot be completely removed: it is intrinsic to the problem (see, e.g. Forcina & Pellegrino, 2019; Greiner, 2007; Manski, 2007). The solution depends on the assumptions made, which are to a large extent unverifiable with the available information (Gelman et al., 2001; Glynn & Wakefield, 2010; Wakefield, 2004), an issue that has led to festering disputes (see, e.g. Freedman et al., 1998, 1999; Greiner, 2007; Tam Cho, 1998; Tam Cho & Gaines, 2004). Practitioners customarily hypothesize similar/related conditional row (underlying) probability/fraction distributions across tables (sometimes with the help of covariates), relying on the common observation that people belonging to the same group tend to follow, probabilistically, similar behaviour patterns within the same political context (Pavía & Romero, 2022).
Although an extensive list of different ecological inference procedures has been suggested over time from fields as diverse as frequentist and Bayesian statistics (e.g. Brown & Payne, 1986; Goodman, 1953, 1959; Greiner & Quinn, 2009; King, 1997; King et al., 2004; Klima et al., 2019; Rosen et al., 2001), mathematical optimization (e.g. Hawkes, 1969; Pavía & Romero, 2022; Tziafetas, 1986), or information theory (e.g. Bernardini-Papalia & Fernández-Vázquez, 2020; Judge et al., 2004), they all consider analogous underlying assumptions and share a similar framework. This framework is based on a hypothesis of similarity/relationship in (electoral) behaviour by group across units and a scheme with an explanatory and a response variable.
Ecological inference may, therefore, be viewed as an inverse problem where the goal is to estimate (infer) the conditional row fraction (underlying probability) distributions using the observed count marginal distributions as data (Jiang et al., 2020). Once the estimates of the probability distributions are attained, cross-classification estimates of counts are obtained by multiplying the observed row margins by the estimated probabilities. In some models, such as in Greiner and Quinn (2009) and in its extension (Greiner & Quinn, 2010), counts are directly inferred.
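As a minimal illustration of this last step, the following Python sketch turns a matrix of estimated conditional row distributions into estimated counts. The numbers, variable names, and the helper `counts_from_proportions` are hypothetical and chosen only for illustration; they are not taken from any dataset or package discussed in this paper.

```python
# Hypothetical toy numbers (not data from the paper).
row_margins = [600.0, 400.0]   # observed votes in each row group (e.g. each party)
row_probs = [                  # estimated conditional row distributions
    [0.8, 0.2],                # each row sums to 1
    [0.3, 0.7],
]


def counts_from_proportions(margins, probs):
    """Scale each row of proportions by its observed row margin."""
    return [[m * p for p in row] for m, row in zip(margins, probs)]


counts = counts_from_proportions(row_margins, row_probs)

# Column totals implied by the estimated counts; in an ecological inference
# problem these are compared with the observed column margins.
col_totals = [sum(col) for col in zip(*counts)]
```

Because each row of proportions sums to one, the estimated counts reproduce the observed row margins by construction; only the implied column totals can deviate from the observed ones.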
The above general scheme means that different estimates are reached depending on which variable is assigned to rows and which to columns. In other words, even using the same method and the same data, a different solution is obtained if rows and columns are flipped. Thomsen’s model (1987), which assumes that voters’ choices are driven by individual latent factors, is the exception. The models proposed by Johnston and colleagues (e.g. Johnston et al., 1983; Johnston & Pattie, 2003), based on entropy maximization, while treating tables symmetrically, cannot be classified as genuine ecological inference approaches since they require prior information about the target cross-distributions to be applied.
In many applications, deciding which classification should be assigned to rows and which to columns is straightforward. For example, in a study of polarized voting in which we want to know how different collectives support different candidates, characteristics of the electorate (such as race or gender) are naturally assigned to rows while the categories of the columns are defined by the candidates. Equally, if we want to know the levels of voters’ loyalty (and switching) by party between two consecutive elections, the natural way is to assign the electoral options of the first and second elections to rows and columns, respectively.
In the above examples, the implicit assumption is that there is a causal relationship or, at least, a natural origin-destination temporal arrangement. But what happens in simultaneous elections? In this case, the answer might not be so straightforward.
When two elections are held simultaneously, it is usually considered that there is a first-order election and a subordinate second-order election, so the relationship is studied in that order. It is implicitly accepted that the majority of voters make a sequential choice: they first decide their vote for the first-order election to subsequently, in a second step, choose their vote for the subordinate election (Pavía & Cantarino, 2017). Sometimes, the order of primacy is clear, such as when a general election and a referendum coincide. Other times, however, such as when electors vote simultaneously in a national and in a regional election or for a party list and for a candidate, this is not so clear even though we can argue over an order of primacy. Indeed, the argument itself, the existence of a hierarchy between elections, is being progressively challenged by the literature (see, e.g. Schakel & Romanova, 2020). Anyway, although we can reasonably argue the presence of a first-order and a second-order election, the question is whether this is the proper way to proceed. In other words, can we improve the accuracy of the estimated counts by considering both elections at the same level? More generally, given that (almost) all the ecological inference models exploit correlations and not causal relationships, we should ask ourselves whether the average accuracy of ecological inference estimates may be improved using methods that deal with rows and columns symmetrically. The aim of this paper is to provide answers to these questions.
To answer these questions, we propose two new sets of methods whose solutions do not depend on how classifications are assigned to rows and columns. On the one hand, we propose a new set of algorithms that achieve their solutions handling rows and columns symmetrically from the outset, by definition. This type of model can be naturally specified within a mathematical optimization framework. Both dual constraints and congruency constraints can be introduced in a linear programme, within the same optimization problem. Hence, in this research we adopt a linear programming approach. It is important to note that, although these models could also be specified within a quadratic programming framework, we prefer to state our models within the linear framework because, as Tziafetas (1986) shows, linear approaches are more efficient than quadratic ones in this context. On the other hand, we also consider a group of algorithms based on asymmetric solutions. This involves methods that reach their estimates by combining the solutions attained after applying an asymmetric ecological inference model to the two possible ways of assigning classifications to rows and columns.
In a recent paper, Pavía and Romero (2022) propose two new models, tslphom and nslphom, that, according to the authors, ‘place the linear programming approach once again in a prominent position in the ecological inference toolkit’. These new models, based on linear programming, generate estimates for each unit table and solve the tendency of lphom, the basic linear programming model, to estimate extreme probabilities (zeros and ones). They have also been shown to be at least as accurate and significantly simpler to use (Pavía & Romero, 2023) than the multinomial-Dirichlet R × C ecological inference statistical model, previously identified as the best in the literature (Katz, 2014; Klima et al., 2016; Plescia & De Sio, 2018). In this paper, we propose new algorithms that deal with rows and columns symmetrically by building on the Pavía and Romero (2022) models in two directions.
On the one hand, using either lphom, tslphom, or nslphom, we suggest three new methods, each of them generating three reasonable solutions, by solving the two underlying dual problems, swapping origin and destination. We identify these new methods by adding the suffix ‘_dual’ to the corresponding base algorithm and their respective new solutions by adding the suffixes ‘_min’, ‘_dual_a’, and ‘_dual_w’. On the other hand, we directly modify lphom, tslphom, and nslphom, looking for the joint congruent optimal solution of both dual problems. We specify new linear programmes in which the optimal solutions, being congruent, simultaneously verify the constraints imposed by the two dual ecological inference specifications. We identify these new models and their corresponding solutions adding the suffix ‘_joint’ to the respective base algorithm. The main innovation of this paper, therefore, lies in proposing, as far as we know for the first time in the literature, ecological inference models that explicitly deal with rows and columns in a symmetric fashion. Conceptually, this is a novelty that could also be explored from other frameworks.
The performance of all the proposed algorithms is assessed using the data available in the R package ei.Datasets (Pavía, 2022), which contains the global true party-candidate cross-classification tables corresponding to more than 500 mixed-member elections. After comparing actual tables and estimates from all the new algorithms and also considering, as baseline, the solutions obtained using the base asymmetric models, we can assert that, at least for these datasets, more accurate solutions are obtained exploiting the information both ways: parties to candidates and candidates to parties.
The rest of the paper is organized as follows. Section 2 states the problem and introduces the notation. Section 3 details the methods based on lphom, tslphom, and nslphom proposed to obtain the dual solutions. Section 4 revises the basic linear programme model (lphom), adapting it to the symmetric case. Section 5 goes further and reworks tslphom and nslphom after modifying the lphom_local algorithm of Pavía and Romero (2022) to make it symmetric. Section 6 presents the results and discusses the findings of the empirical comparisons. Section 7 explores in more depth the factors affecting the accuracy of the estimates. Section 8 concludes, discusses, and suggests directions for further research.
2. Mathematical representation of the problem
For convenience and without loss of generality, from now on we adopt a framework of simultaneous elections where voters, grouped in a set of units that jointly define a partition of the electoral space, cast two votes; one for a party list and another for a candidate. We denote by I, J, and K the number of units, parties, and candidates, respectively.
The observed data are the votes
As intermediate

The two dual ways of representing the basic problem in a typical i unit: parties to candidates (left panel) and candidates to parties (right panel). Inner quantities are the unobserved proportions, the (intermediate) targets of ecological inference. In general, different solutions are reached solving the problem from parties-to-candidates than from candidates-to-parties. Symmetric algorithms generate congruent
All the above quantities are closely related. Among other relationships, the following equalities hold:
In general, different estimates for
In matrix form, we denote by
Likewise, we denote by
Additionally, we use
3. Solutions based on asymmetric methods
Given that the symmetric models we propose build on the lphom, tslphom, and nslphom algorithms, we limit ourselves to using the same framework for defining our asymmetric-based dual methods. In our view, this makes comparisons fairer. This self-imposed constraint could of course be relaxed, and asymmetric-based dual solutions could be defined using other R × C ecological inference procedures as building blocks, such as the Rosen et al. (2001) Bayesian-based multinomial-Dirichlet model (Lau et al., 2020), the iterative version of the 2 × 2 model proposed by King (Collingwood et al., 2020; King, 1997), or the generalization of the Goodman regression method (Collingwood et al., 2016, 2020), to name just some possibilities.
Before defining our asymmetric-based dual methods in Section 3.2, we first detail the lphom, tslphom, and nslphom models in Section 3.1. To do this, we focus on the nslphom model given that, although lphom, tslphom, and nslphom were suggested as different models, as Pavía and Romero (2023) show, the lphom and the tslphom solutions can be attained as by-products of the nslphom algorithm, these being intermediate outputs of its iterative process: lphom after solving its first linear system (iteration 0) and tslphom after solving its firsts
In this section, we introduce the equations when
3.1 Asymmetric linear programming models
The nslphom procedure is an iterative algorithm that, for simultaneous elections and typical ecological inference problems, uses equations (1)–(15) to attain its solution. In its iteration zero, nslphom solves the basic lphom linear system (Romero et al., 2020) defined by equations (1)–(5)—whose unknowns are the
In the next iterations, for
In the second step, the unit proportion estimates are in turn used to update the global proportion estimates,
In each iteration, the statistic defined by equation (15), which measures the distance to homogeneity of the lth-solution, is also calculated. This measure is employed to select the nslphom solution (Pavía & Romero, 2022), which corresponds to the estimates associated with the iteration
Once the iterative process is finished, nslphom provides the lphom, tslphom, and nslphom solutions. The matrix
The above algorithm, programmed in the function nslphom of the R package lphom (Pavía & Romero, 2022), has one parameter to be chosen: the number of iterations, ns. The default option of the nslphom function considers only 10 iterations since the minimum of equation (15) is typically reached in very few iterations. At the end of the iterative process, the matrices
3.2 Symmetric models based on dual solutions
In the same way that equations (1)–(15) are defined, a similar linear system can be specified to obtain estimates of the proportion matrices
We envisage two basic reconciliation approaches maintaining the accounting constraints that delimit the problem. On the one hand, just one of the two solutions could be selected. On the other hand, both solutions could be combined in a manner that preserves the constraints among votes and proportions. Below, we explore solutions for both types.
As an alternative to the classical approach of reasoning a logical (causal, temporal, or primacy) order between classifications, equation (15) could be employed to choose between solutions. Given the central role played by the heterogeneity index determining linear programming solutions and its previously reported close relationship with the accuracy of the (global) estimates (Pavía & Romero, 2022, 2023; Romero et al., 2020), we propose selecting between the sets
The above ‘combination’ of asymmetric solutions scarcely exploits the information contained in one of the two dual solutions, so we propose building combined solutions as averages, since averages compared to individual estimates tend to reduce expected errors. We consider either unweighted or weighted averages, of the two dual solutions, with the weighted solutions attained employing the inverses of the HETe statistics as weights. We denote with the suffix ‘_dual_a’ the solutions associated with the sets of votes estimates
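The three reconciliation rules described in this section can be sketched in a few lines of Python. This is only an illustrative reading under stated assumptions: the two dual estimates are taken to be matrices of votes that both reproduce the observed margins, the averages are computed on votes, and the weights are the inverses of the heterogeneity statistics normalized to sum to one. All numbers and names (`v_pc`, `v_cp`, `het_pc`, `het_cp`, `combine`) are hypothetical and do not correspond to the interface of the lphom package.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def combine(v_a, v_b, w_a):
    """Convex combination w_a * v_a + (1 - w_a) * v_b of two same-shaped matrices."""
    return [[w_a * a + (1.0 - w_a) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(v_a, v_b)]


def margins(m):
    """Row totals and column totals of a matrix."""
    return [sum(r) for r in m], [sum(c) for c in zip(*m)]


# Hypothetical dual estimates of the same table of votes.
v_pc = [[50.0, 10.0], [20.0, 20.0]]   # parties x candidates (parties-to-candidates)
v_cp = [[46.0, 24.0], [14.0, 16.0]]   # candidates x parties (candidates-to-parties)
het_pc, het_cp = 4.0, 6.0             # assumed heterogeneity statistics of each solution

v_cp_t = transpose(v_cp)              # express both estimates as parties x candidates

# '_min': keep the solution with the smaller heterogeneity statistic.
v_min = v_pc if het_pc <= het_cp else v_cp_t

# '_dual_a': unweighted average of the two dual solutions.
v_avg = combine(v_pc, v_cp_t, 0.5)

# '_dual_w': weights proportional to the inverses of the heterogeneity statistics.
w_pc = (1.0 / het_pc) / (1.0 / het_pc + 1.0 / het_cp)
v_w = combine(v_pc, v_cp_t, w_pc)
```

Since both inputs share the same row and column totals, any convex combination of them does too, which is why averaging keeps the accounting constraints among votes intact.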
4. A symmetric linear model: lphom for simultaneous elections
The methods proposed in the previous section build ad hoc symmetric (congruent) solutions (i.e. solutions that do not depend on which classification is mapped to rows and which to columns) by selecting or averaging solutions from asymmetric methods. In this section, we modify the lphom model to directly produce global (optimal) symmetric solutions. In the next section, we extend the approach to also embrace the tslphom and nslphom models.
To make the notation more compact, we adopt a matrix representation. In particular, in matrix form, the linear programme,
and, equally, the dual problem,
where
Stacking both systems in the same programme and solving the resulting linear programme is not enough to guarantee congruency among the
Obviously, these constraints can also be expressed in matrix form. In particular,
Combining all the equations, a new linear programme system emerges. The programme given by (24) describes the lphom_joint model, which corresponds to the symmetric version of the lphom model.
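To make the notion of congruency concrete, the Python sketch below checks it on a toy example. It assumes that congruency means that both proportion matrices imply the same matrix of votes, that is, the party total times the party-to-candidate proportion equals the candidate total times the candidate-to-party proportion in every cell. The helper `is_congruent` and all numbers are hypothetical.

```python
def is_congruent(x, y, p, q, tol=1e-9):
    """True when x[j] * p[j][k] == y[k] * q[k][j] (up to tol) for every cell (j, k)."""
    return all(abs(x[j] * p[j][k] - y[k] * q[k][j]) <= tol
               for j in range(len(x)) for k in range(len(y)))


x = [60.0, 40.0]                      # party totals
y = [70.0, 30.0]                      # candidate totals
v = [[50.0, 10.0], [20.0, 20.0]]      # a table of votes with those margins

# Proportion matrices derived from the same table are congruent by construction.
p = [[v[j][k] / x[j] for k in range(2)] for j in range(2)]   # parties -> candidates
q = [[v[j][k] / y[k] for j in range(2)] for k in range(2)]   # candidates -> parties

# A candidates -> parties matrix that also reproduces the party totals
# (70 * 0.6 + 30 * 0.6 = 60 and 70 * 0.4 + 30 * 0.4 = 40) but implies
# a different table of votes than p does.
q_other = [[0.6, 0.4], [0.6, 0.4]]

congruent_pair = is_congruent(x, y, p, q)          # True
incongruent_pair = is_congruent(x, y, p, q_other)  # False
```

The second pair shows why stacking the two dual systems is not sufficient: each proportion matrix satisfies its own accounting constraints, yet the two disagree about the inner cells.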
In this model, the unknowns
The estimate
5. Estimating local transfer matrices: nslphom for simultaneous elections
Although the lphom_joint model solves the inconsistency of the lphom procedure by generating estimates of matrices of votes that do not depend on how classifications are mapped to rows and columns, it still shares a crucial limitation of lphom. lphom_joint only estimates global (aggregate) matrices, yet having estimates of local (unit) matrices (
Continuing on from the previous sections, this section amends the tslphom and nslphom algorithms by proposing new versions of the procedures that, through their specification, force consistency between the estimates of all the (
Both tslphom and nslphom are based on the lphom_local procedure (Pavía & Romero, 2022), a procedure devised to estimate unit matrices that is at the core of both algorithms. Indeed, their equations, starting from an arbitrary initial global transfer matrix, are for a fixed i similar to equations (6)–(13). Therefore, we first adapt, in Section 5.1, the lphom_local algorithm to build the lphom_local_joint method. Subsequently, in Section 5.2, we build the _joint versions of tslphom and nslphom on this.
5.1 The lphom_local_joint algorithm
The lphom_local algorithm for simultaneous elections can be observed as a two-step linear programme procedure that takes as inputs a row-standardized proportion matrix
For a fixed unit i, the first two dual linear programmes, which correspond to equations (6)–(10), can be expressed with our notation as
where
As with lphom, independently solving the two dual problems does not assure the congruency between both dual solutions, even starting from a congruent
Similarly, the second two dual linear programmes, which correspond to equations (11)–(13), can be stated as
where
Putting together all the pieces, we propose as lphom_local_joint algorithm the two-step linear programme procedure that yields as outcome the result of sequentially solving the programmes defined by the systems (31) and (32).
Once the minimum value for
where
This model generates solutions of the estimated unit level proportions that, verifying all the ecological accounting equalities, including constraints (28), equivalent to the probabilistic symmetric condition, are closest (in L1-norm) to the
As with the lphom_joint model, the lphom_local_joint algorithm also verifies the symmetry property. That is, given a matrix of counts
5.2 The nslphom_joint model
Once lphom and lphom_local are adapted to make them symmetric, nslphom can easily be modified in the same way; that is, we can build the nslphom_joint model. In particular, for a fixed ns, the natural definition of the nslphom_joint model is given by the following iterative procedure:
Iteration 0. Apply lphom_joint to the observed matrices of party and candidate votes to obtain initial congruent estimates of the two global proportion matrices, from which the aggregate matrix of votes can be derived.
Iteration l, for l = 1, …, ns. Compute the two global proportion matrices implied by the aggregate matrix of votes from the previous iteration and apply lphom_local_joint, using as inputs these two matrices and the observed party and candidate votes of the unit, for each unit i. This produces congruent estimates of the two proportion matrices for each unit i, from which an estimate of the global matrix of votes is built by aggregating the estimates of the corresponding unit matrices.
During each iteration, the statistic given by equation (33), where
In a similar vein to nslphom, the lphom_joint, tslphom_joint, and nslphom_joint solutions are also attained once the above iterative process is finished. The set of matrices
At the end of the iterative process, the estimates
All the new algorithms introduced in Sections 3–5 are available in functions with suffixes ‘_dual’ and ‘_joint’ in the R package lphom.
6. Assessment of procedures
In Sections 3–5, two new sets of symmetric algorithms based on linear programming have been suggested to solve the ecological inference problem. These procedures represent an alternative to the asymmetric methods, where rows and columns of the ecological matrices are not interchangeable. In this section, we assess, focused on accuracy, the performance of the proposed procedures. In addition to comparing the different solutions they provide, we also assess them in comparison to the equivalent asymmetric solutions, which act as baseline solutions.
Although it is possible to also include other asymmetric models in the assessments, we have restricted the comparisons to the asymmetric algorithms from which the symmetric methods stem. In our view, this makes comparisons fairer and does not entail a limitation, as lphom-family algorithms are among the most accurate methods to estimate vote transition matrices. First, Klima et al. (2016) identify the ei.MD.bayes algorithm (Lau et al., 2020; Rosen et al., 2001) as the most accurate after comparing it to (i) a modification of classical ecological regression (Goodman, 1953), (ii) Thomsen's approach (Thomsen, 1987), and (iii) two recursive 2 × 2 approaches, the ones proposed by Andreadis and Chadjipadelis (2009) and by Kellermann (2011). Second, Plescia and De Sio (2018) also identify ei.MD.bayes as the best option after comparing it to (i) the RxCEcolInf method (Greiner et al., 2021; Greiner & Quinn, 2009) and (ii) classical ecological regression. Third, Barreto et al. (2022) conclude that King’s iterative EI:RxC (Collingwood et al., 2020; King, 1997) and ei.MD.bayes can be used interchangeably when assessing precinct-level voting patterns in Voting Rights Act cases, with Ferree (2004) and Katz (2014) considering iterative EI:RxC inferior. Finally, Pavía and Romero (2023) extensively compare the lphom-family algorithms with ei.MD.bayes solutions and find the former to be at least as accurate as ei.MD.bayes, and more accurate when the information is scarce or convergence cannot be guaranteed for ei.MD.bayes. In summary, if the new symmetric methods improve on the asymmetric lphom-based solutions, they can reasonably be expected, by a transitivity argument, to also improve on the rest of the asymmetric methods.
6.1 Data
Due to the very nature of the problem, datasets with known true vote transfer matrices are scarce, so practical research typically relies on artificial, simulated data. In this paper, however, we gauge the different approaches using real data. The results of this section are based on comparing estimated and true matrices of votes belonging to more than 500 elections available in the R package ei.Datasets (Pavía, 2022). The ei.Datasets package contains the matrices of party votes,
Although, as far as we know, this is the first time in the literature that all the datasets included in ei.Datasets are employed for ecological inference assessment, some of these elections have already been utilized for evaluating ecological inference procedures (Pavía & Romero, 2022, 2023; Plescia & De Sio, 2018). In using these data, we follow in the footsteps of the above authors and, as is usual practice when handling real data (e.g. Barreto et al. 2022; Klein, 2019; Klima et al., 2016), we have only considered sizeable populations and grouped very small election options in ‘Others’. We have merged in ‘Other parties’ and ‘Other candidates’ those parties and candidates, respectively, that individually do not attain at least 3% of the district vote. Despite this operation notably reducing the sizes of the ecological tables (from 147.6 to 28.2 in average number of cells), we still retain a large diversity of table sizes, with tables of 23 different sizes ranging from a minimum of 15 cells (5 × 3) to a maximum of 56 cells (8 × 7); 8 and 3 being the maximum and minimum number for both rows and columns. Before merging, ei.Datasets tables range from a minimum of 51 cells (17 × 3) to a maximum of 300 cells (20 × 15), 27 being the maximum number of rows for a single table and 15 the maximum number of columns. In terms of district size, measured in number of units (polling stations), we have districts ranging from a minimum of 22 to a maximum of 833, with an average of 90.5 units per district. More details about the datasets can be found in Pavía (2022).
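The merging rule can be sketched as follows in Python. This is a hypothetical helper written for illustration, not the preprocessing code used to produce the results in this paper: it computes each option's share of the district vote across all units and collapses the options below the threshold into a single residual column.

```python
def merge_small_options(labels, votes_by_unit, threshold=0.03, other_label="Others"):
    """Collapse options whose district-level vote share is below `threshold`.

    labels: option names (parties or candidates), one per column.
    votes_by_unit: one row per unit (polling station), one column per option.
    Assumes the district has at least one vote in total.
    """
    totals = [sum(col) for col in zip(*votes_by_unit)]
    grand_total = sum(totals)
    keep = [i for i, t in enumerate(totals) if t / grand_total >= threshold]
    small = [i for i in range(len(labels)) if i not in keep]

    new_labels = [labels[i] for i in keep]
    merged = [[row[i] for i in keep] for row in votes_by_unit]
    if small:
        # Append one residual column holding the votes of all small options.
        new_labels.append(other_label)
        for row_out, row_in in zip(merged, votes_by_unit):
            row_out.append(sum(row_in[i] for i in small))
    return new_labels, merged


# Hypothetical district with two units and four parties; C and D each hold
# 25 of 1950 votes (about 1.3%), so both fall below the 3% threshold.
labels = ["A", "B", "C", "D"]
votes = [[500, 450, 20, 10],
         [480, 470, 5, 15]]
new_labels, merged = merge_small_options(labels, votes)
# new_labels is ['A', 'B', 'Others'] and merged is [[500, 450, 30], [480, 470, 20]]
```

The unit totals are unchanged by the merge, so the accounting constraints between the party and candidate margins of each unit are preserved.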
6.2 Error measure
Regarding the metric utilized to assess the estimated matrices of votes,
6.3 Results
Tables 1 and 2 offer a summary of the accuracy of the different algorithms. The summaries are presented after grouping the elections following two different categorizations. In Table 1, the elections have been grouped by country and by year in which they were held. In our view, this is the most natural way of grouping these elections since all the datasets belonging to the same year and country share the same political environment. They all correspond to district elections held in the context of the same national general election. Nevertheless, according to Pavía (2022), another natural way of grouping the NZ elections is by type of district, either Māori or regular.
Country year | NZ 2002 | NZ 2005 | SCO 2007 | NZ 2008 | NZ 2011 | NZ 2014 | NZ 2017 | NZ 2020 |
---|---|---|---|---|---|---|---|---|
# of elections | N = 69 | N = 69 | N = 73 | N = 70 | N = 70 | N = 71 | N = 71 | N = 72 |
Avg. # of units | ||||||||
Avg. # of cells | ||||||||
lphom-based solutions | ||||||||
lphom(X, Y) | 16.88 | 12.14 | 12.92 | 12.22 | 12.99 | 12.95 | 12.20 | 14.02 |
lphom(Y, X) | 16.14 | 12.74 | 10.78 | 11.87 | 13.40 | 13.70 | 12.08 | 12.18 |
lphom_min | 16.30 | 11.86 | 11.73 | 11.58 | 12.65 | 12.80 | 11.61 | 12.59 |
lphom_dual_a | 14.87 | 11.47 | 10.50 | 11.03 | 12.14 | 12.03 | 11.19 | 12.16 |
lphom_dual_w | 14.89 | 11.37 | 10.41 | 10.97 | 12.05 | 12.01 | 11.10 | 12.11 |
lphom_joint | 15.36 | 11.65 | 10.52 | 11.31 | 12.51 | 12.35 | 11.48 | 12.60 |
tslphom-based solutions | ||||||||
tslphom(X, Y) | 14.80 | 10.91 | 11.00 | 10.89 | 11.50 | 11.66 | 10.91 | 12.59 |
tslphom(Y, X) | 14.52 | 11.50 | 9.46 | 10.77 | 12.22 | 12.50 | 10.95 | 11.07 |
tslphom_min | 14.06 | 10.65 | 9.64 | 10.30 | 11.20 | 11.61 | 10.36 | 11.18 |
tslphom_dual_a | 13.18 | 10.27 | 8.96 | 9.87 | 10.83 | 10.83 | 10.03 | 10.92 |
tslphom_dual_w | 13.15 | 10.17 | 8.87 | 9.79 | 10.74 | 10.78 | 9.95 | 10.85 |
tslphom_joint | 14.07 | 10.73 | 9.42 | 10.40 | 11.46 | 11.45 | 10.49 | 11.54 |
nslphom-based solutions | ||||||||
nslphom(X, Y) | 12.79 | 9.47 | 8.85 | 9.11 | 9.46 | 9.69 | 8.91 | 9.82 |
nslphom(Y, X) | 12.71 | 9.89 | 8.11 | 9.17 | 10.86 | 11.15 | 9.29 | 9.39 |
nslphom_min | 12.55 | 9.22 | 8.42 | 8.63 | 9.41 | 9.67 | 8.48 | 9.07 |
nslphom_dual_a | 11.10 | 8.63 | 7.11 | 8.08 | 9.01 | 9.08 | 8.04 | 8.52 |
nslphom_dual_w | 11.14 | 8.58 | 7.11 | 8.04 | 8.91 | 9.01 | 7.96 | 8.48 |
nslphom_joint | 11.54 | 8.86 | 7.53 | 8.44 | 9.30 | 9.39 | 8.23 | 8.99 |
Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). For the nslphom-based models, ns is always fixed at 10. The smaller the number, the better the accuracy. Values in bold correspond to the solutions that are most accurate on average in each set of elections.
Averages of EI errors of nslphom-based solutions by alternative groupings of elections
Group of elections . | NZ—regular . | NZ—Māori . | NZ—all . | SCO 2007 . | NZ + SCO . |
---|---|---|---|---|---|
# of elections | N = 443 | N = 49 | N = 492 | N = 73 | N = 565 |
Avg. # of units | |||||
Avg. # of cells | |||||
nslphom(X, Y) | 9.85 | 10.20 | 9.89 | 8.85 | 9.75 |
nslphom(Y, X) | 10.47 | 9.14 | 10.34 | 8.11 | 10.05 |
nslphom_min | 9.54 | 9.83 | 9.57 | 8.42 | 9.42 |
nslphom_dual_a | 9.02 | 7.99 | 8.92 | 7.11 | 8.68 |
nslphom_dual_w | 8.96 | 8.04 | 8.87 | 7.11 | 8.64 |
nslphom_joint | 9.40 | 7.85 | 9.24 | 7.53 | 9.02 |
Source: Compiled by the authors after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms based on nslphom, described in Sections 3–5, with ns = 10 to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). The smaller the number, the better the accuracy. Values in bold correspond to the solutions that are most accurate on average in each set of elections.
Table 2 presents the summaries obtained when the NZ elections are grouped in the alternative way suggested by Pavía (2022), focusing only on the most accurate algorithms: those based on the nslphom model. To offer more context and make comparisons easier, Table 2 also includes the results attained when pooling all the elections, those obtained when combining all the NZ elections into a single group and, again, the Scottish results. The upper panels of both tables contain some summary statistics of the corresponding groups of elections. Here, one of the distinguishing characteristics of the Māori districts stands out: the relatively high number of units into which their electorates are split. Figure 2 shows graphically the average errors attained by the nslphom-family algorithms in all the groupings.

Graphical representation of average values of EI error measures grouped by election and nslphom-family algorithm. The smaller the number, the better the accuracy. Individual solutions are attained after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). Detailed figures are available in Tables 1 and 2.
In light of the empirical assessments, two clear findings emerge. First, our results confirm for the symmetric models the order of preference already stated by Pavía and Romero (2022, 2023) for the asymmetric procedures. As with asymmetric models, tslphom-based methods systematically improve on lphom-based ones and, equally, nslphom-based models consistently outperform tslphom-based ones (see Table 1). Second, and more importantly given our main research question, comparing asymmetric and symmetric models shows that symmetric algorithms produce, conditional on the underlying base algorithm, consistently more accurate estimates than asymmetric procedures (at least for the data at hand). Overall, symmetric solutions beat asymmetric solutions almost 90% of the time.
The family of the algorithm (i.e. the underlying procedure, either lphom, tslphom, or nslphom), however, has a greater impact on the accuracy of the estimates than the character of the model, either symmetric or asymmetric. tslphom-based models, including asymmetric ones, are on average more accurate than lphom-based models, and the same relationship holds for nslphom-based models with regard to tslphom-based ones. In short, the nslphom-based models are clearly the most accurate, so henceforth we focus only on these, concentrating our analysis on the results available in Table 2 and Figure 2 and in the last panel of Table 1. In particular, pooling all the elections and algorithms, nslphom symmetric solutions are seen to be, on average, almost 11% more accurate than nslphom asymmetric ones.
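The size of this gap can be roughly checked against the pooled (NZ + SCO) column of Table 2. The sketch below is only an illustration, not the authors' computation: it assumes an unweighted mean over the two asymmetric and the four symmetric nslphom averages, and it reports two possible readings of "more accurate", since the text does not state which ratio is used.

```python
# Pooled (NZ + SCO) average EI errors copied from the last column of Table 2.
asymmetric = {"nslphom(X, Y)": 9.75, "nslphom(Y, X)": 10.05}
symmetric = {
    "nslphom_min": 9.42,
    "nslphom_dual_a": 8.68,
    "nslphom_dual_w": 8.64,
    "nslphom_joint": 9.02,
}


def mean(values):
    """Unweighted arithmetic mean (an assumption about how methods are pooled)."""
    values = list(values)
    return sum(values) / len(values)


asym_mean = mean(asymmetric.values())  # about 9.90
sym_mean = mean(symmetric.values())    # about 8.94

# Reading 1: how much larger the asymmetric error is, relative to the symmetric one.
excess_pct = 100 * (asym_mean / sym_mean - 1)
# Reading 2: how much the error falls, relative to the asymmetric one.
reduction_pct = 100 * (1 - sym_mean / asym_mean)

print(f"asymmetric mean EI: {asym_mean:.2f}")
print(f"symmetric mean EI:  {sym_mean:.2f}")
print(f"asymmetric error exceeds symmetric by {excess_pct:.1f}%")
print(f"symmetric error is lower by {reduction_pct:.1f}%")
```

Under these assumptions, the first reading gives a value just under 11%, in line with the figure quoted in the text, while the second gives a value just under 10%.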
Focusing now on the relationships within the symmetric solutions, we observe some general, although not completely systematic, patterns. Pooling all the results, we see that, on average, the nslphom_dual_w solutions are the most accurate, followed by the nslphom_dual_a, nslphom_joint, and nslphom_min solutions, in that order (see Table 2 and Figure 2). For some of the groups, however, nslphom_dual_w solutions are not the best on average: they are beaten by nslphom_dual_a in the 2002 elections group and by nslphom_joint in the Māori districts group.
In any case, what we can affirm is that nslphom_dual_w generates the most robust solutions. This algorithm produces, on average, the most accurate solutions despite being, of the four, the one that attains the smallest EI error least often. Indeed, even nslphom_min, the algorithm with the largest average error of the four, yields the best solution more often than nslphom_dual_w. At the individual level, nslphom_min is the best 25% of the time, the same figure recorded for nslphom_dual_a, while the figures for nslphom_joint and nslphom_dual_w are 30% and 20%, respectively. Looking in more depth at the individual matrices, we also find that, as expected, the nslphom_dual_w and nslphom_dual_a solutions are quite close to each other and lie midway between the nslphom_min and the nslphom_joint solutions, although a little closer to the latter.
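That a method can be the most accurate on average while being the individual winner least often may seem counterintuitive. The toy sketch below uses invented error values for two hypothetical methods (they are not taken from the paper's data) to show how a single large miss can outweigh several narrow wins.

```python
# Invented EI errors for two hypothetical methods in five elections.
# These numbers are illustrative only and do not come from the paper.
errors = {
    "method_A": [8.0, 8.1, 7.9, 8.2, 8.0],   # never the best, never far off
    "method_B": [7.8, 7.9, 7.7, 8.0, 12.0],  # usually the best, one large miss
}

n_elections = len(errors["method_A"])

# Average error per method (smaller is better).
averages = {name: sum(vals) / len(vals) for name, vals in errors.items()}

# Number of elections in which each method attains the smallest error.
wins = {name: 0 for name in errors}
for i in range(n_elections):
    best = min(errors, key=lambda name: errors[name][i])
    wins[best] += 1

for name in errors:
    print(f"{name}: mean error {averages[name]:.2f}, best in {wins[name]} of {n_elections}")
```

In this toy case the steadier method has the lower mean error even though it never attains the smallest error, which is the sense in which robustness can drive average accuracy.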
The results for the Māori electorates are particularly interesting (see Table 2 and Figure 2). In all the groups except the Māori one, the dual solutions built as means are the best on average, whereas for the Māori electorates the best on average are the predictions based on the joint algorithm. At first glance, one could think that this is a consequence of some of their distinguishing characteristics. According to Pavía (2022), the Māori districts are remarkably different in many respects; for instance, compared to the rest of the districts in the database, their electorates are split into a greater number of units, their transfer matrices record weaker relationships between row and column options, the electoral behaviour of their voters shows greater heterogeneity across units, and their datasets show larger across-unit variances for both parties and candidates. A closer look at the individual errors, however, reveals that almost all the relative advantage of nslphom_joint over nslphom_dual_w and nslphom_dual_a stems from the 2002 elections. Indeed, nslphom_joint is more accurate on average than nslphom_dual_w and nslphom_dual_a only in the 2002 and 2005 elections. As a rule, therefore, we can say that, at least for datasets with characteristics similar to those analysed in this research, the nslphom_dual_w solutions are preferable. They are more accurate, on average, and quite robust.
Finally, focusing on computational times (see Table 3), we see that all the methods are quite fast. As expected, the nslphom solutions are the slowest, but they require only a few seconds. The slowest algorithm is nslphom_joint, which, on average, takes a little over 20 s to reach its solution in these datasets. In general, the computational burden grows with the number of units and the number of cells of the corresponding ecological matrix.
Country year . | NZ 2002 . | NZ 2005 . | SCO 2007 . | NZ 2008 . | NZ 2011 . | NZ 2014 . | NZ 2017 . | NZ 2020 . |
---|---|---|---|---|---|---|---|---|
lphom-based algorithms | ||||||||
lphom(X, Y) | 0.20 | 0.18 | 0.06 | 0.12 | 0.20 | 0.19 | 0.18 | 0.38 |
lphom(Y, X) | 0.23 | 0.21 | 0.06 | 0.24 | 0.33 | 0.34 | 0.30 | 0.36 |
lphom_dual | 0.42 | 0.40 | 0.12 | 0.31 | 0.48 | 0.42 | 0.44 | 0.74 |
lphom_joint | 1.08 | 0.98 | 0.25 | 0.76 | 1.14 | 1.01 | 1.05 | 1.81 |
tslphom-based algorithms | ||||||||
tslphom(X, Y) | 0.69 | 0.60 | 0.46 | 0.51 | 0.62 | 0.63 | 0.60 | 1.01 |
tslphom(Y, X) | 0.72 | 0.61 | 0.47 | 0.62 | 0.72 | 0.69 | 0.77 | 0.97 |
tslphom_dual | 1.41 | 1.23 | 0.89 | 1.12 | 1.35 | 1.27 | 1.33 | 1.94 |
tslphom_joint | 3.85 | 2.78 | 2.36 | 2.43 | 3.10 | 2.89 | 2.95 | 4.29 |
nslphom-based algorithms | ||||||||
nslphom(X, Y) | 5.11 | 4.35 | 4.07 | 4.11 | 4.73 | 4.55 | 4.61 | 6.43 |
nslphom(Y, X) | 5.07 | 4.39 | 3.96 | 4.18 | 4.79 | 4.61 | 4.82 | 6.43 |
nslphom_dual | 10.22 | 8.81 | 8.06 | 8.33 | 9.03 | 8.86 | 9.52 | 12.87 |
nslphom_joint | 29.03 | 19.39 | 21.44 | 18.09 | 21.90 | 21.39 | 20.94 | 28.99 |
Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). For the nslphom-based models, ns is always fixed at 10. The computations were performed on a laptop with an Intel Core i7-6820HK CPU (4 cores, 2.70 GHz) and 64 GB of RAM.
7. On the factors impacting on the accuracy of the estimates
In the previous section, we compared the average accuracies of the different solutions, leaving aside their variability. However, the accuracy of the estimates differs not only across methods but also across districts/elections. In this section, we consider several observed features, specific to each election, and study whether and to what extent they can explain the observed differences in accuracy across elections. To do that, we focus only on the nslphom_dual_w and nslphom_joint solutions. On the one hand, the former proved to be the most accurate in the analysed database. On the other hand, the latter are obtained with the theoretically finest method, as it solves the problem dealing with all the constraints simultaneously.
Figure 3, where the distributions obtained for EI are drawn in the 565 elections analysed, clearly shows the existence of important accuracy differences across elections in both groups of solutions. For instance, focusing on nslphom_dual_w solutions, we have that EI ranges from a minimum of 2.45 to a maximum of 20.01; with the minimum being observed in an election where voters are distributed into 90 polling stations and the problem consists in estimating a transfer matrix of size

Histograms of the distributions of the error indexes (EI) corresponding to the nslphom_dual_w and nslphom_joint solutions attained in the 565 elections analysed. The discontinuous vertical lines place the means of the distributions.
The accuracy of the attained estimates, therefore, depends on structural characteristics of the particular election under study. In what follows, we analyse whether the observed variables previously identified in the literature as explainers of accuracy for asymmetric methods also work for the proposed symmetric models. In particular, the list of features recognized in the literature (e.g. King, 1997; Klima et al., 2016; Park et al., 2014; Pavía & Romero, 2023; Plescia & De Sio, 2018; Wakefield, 2004) as factors impacting on the accuracy of the estimates include: (i) the amount of available information, I; (ii) the complexity of the problem, JK; (iii) the level of heterogeneity among the local transfer matrices, HET; (iv) the strength of the relationship between the row and column categories,
As a rule, ceteris paribus, the larger the number of polling stations and the smaller the contingency tables (the number of coefficients to be estimated), the more accurate the estimates are, with the impact of each of these two features being conditioned by the value taken by the other. Indeed, it is customarily stated that a ratio of at least two units (polling stations) per coefficient is necessary for a proper estimation of the vote transfer matrix (Plescia & De Sio, 2018). Thus, the impacts of I and JK are usually assessed together, using as a joint indicator the number of polling stations divided by the product of the number of rows and the number of columns, I/(JK).
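As a minimal sketch of this indicator, the snippet below computes the number of units per coefficient and checks it against the rule of thumb of two units per coefficient. The function names and the example dimensions are hypothetical and chosen only for illustration; they are not part of the lphom package.

```python
def units_per_coefficient(n_units: int, n_rows: int, n_cols: int) -> float:
    """Return I / (J * K): polling units available per transfer-matrix cell."""
    if min(n_units, n_rows, n_cols) <= 0:
        raise ValueError("all dimensions must be positive")
    return n_units / (n_rows * n_cols)


def meets_rule_of_thumb(n_units: int, n_rows: int, n_cols: int,
                        threshold: float = 2.0) -> bool:
    """Check the customary requirement of at least `threshold` units per cell."""
    return units_per_coefficient(n_units, n_rows, n_cols) >= threshold


# Hypothetical districts (invented dimensions): same 5 x 6 table, different unit counts.
well_informed = units_per_coefficient(120, 5, 6)   # 120 / 30 = 4.0
poorly_informed = units_per_coefficient(40, 5, 6)  # 40 / 30, about 1.33

print(well_informed, meets_rule_of_thumb(120, 5, 6))
print(round(poorly_informed, 2), meets_rule_of_thumb(40, 5, 6))
```

Holding the table size fixed, the ratio grows linearly with the number of units, which is why the two features are usually assessed jointly rather than one at a time.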
The values HET and
For approximating district within-unit diversities, different statistics have been tried in the literature. They have been measured as (a) (weighted) averages of the standard deviations of the unit marginal distributions (using as weights the number of voters per unit), (b) directly using the standard deviations of the district marginal distributions, and (c) employing the formula
Once the features are defined, we use linear regression models to study their impact on the accuracy of the nslphom_dual_w and nslphom_joint solutions. To do that, we adopt a two-step strategy. First, we analyse the marginal impact of each feature, allowing for the possibility that its (increasing/decreasing) effect operates at a decreasing/increasing rate. That is, we allow for nonlinear effects in the one-feature models by also including the square of the feature as a potential regressor. Second, we jointly estimate the impact on accuracy of all the regressors identified as statistically significant in the marginal (one-feature) models. To fit the models, we have excluded the Māori electorates from the analysis. Although the aggregate results are quite similar when these elections are also included in the analysis, we have opted not to consider them when fitting the model because of their singular characteristics that would undermine the impact of the ratio
Table 4 presents a statistical summary of the values corresponding to the abovementioned features in the 516 districts that remain in the database after excluding the Māori electorates. As can be observed, almost all variables show positively skewed (right-skewed) distributions. In terms of correlations, the largest ones are observed for the pairs of variables
Statistical summary of the explicative variables, excluding M
. | DWR . | DWC . | VAR . | VAC . | |||||
---|---|---|---|---|---|---|---|---|---|
Mean | 2.56 | 4.25 | 4.01 | 2932 | 2909 | 0.18 | 0.21 | 0.98 | 0.85 |
Standard deviation | 1.19 | 1.25 | 0.95 | 966 | 960 | 0.04 | 0.04 | 0.36 | 0.35 |
Minimum | 0.52 | 2.42 | 2.16 | 377 | 347 | 0.06 | 0.09 | 0.29 | 0.12 |
Maximum | 7.73 | 9.44 | 8.45 | 6297 | 6613 | 0.34 | 0.33 | 2.77 | 2.12 |
Skewness | 1.16 | 1.05 | 0.94 | 0.40 | 0.40 | 0.06 | −0.13 | 1.10 | 0.74 |
Source: Compiled by the authors through
To facilitate the interpretation of the estimated coefficients and the relative impact of their associated features on accuracy, all the explanatory variables have been standardized to zero mean and unit standard deviation. In this way, each estimated coefficient gives the expected change from the mean in the response variable due to a change of one standard deviation in the corresponding variable. From the one-feature models (not shown), one can infer that (i)
Some changes between the marginal and joint impacts of the features happen when we model the full multivariate specifications. Table 5 presents the coefficients of the fitted linear regression models. All variables, except
Variable . | Response variable: error index (EI) . | |||
---|---|---|---|---|
nslphom_dual_w . | nslphom_joint . | |||
Estimate . | p-Value . | Estimate . | p-Value . | |
8.23 | <0.0001 | 8.72 | <0.0001 | |
−1.60 | <0.0001 | −1.89 | <0.0001 | |
0.32 | <0.0001 | 0.37 | <0.0001 | |
HETe | 0.46 | 0.0009 | 0.84 | <0.0001 |
0.78 | <0.0001 | 1.15 | <0.0001 | |
0.14 | 0.0583 | 0.05 | 0.5368 | |
VAR | 0.67 | <0.0001 | 0.43 | 0.0027 |
VAC | −0.66 | <0.0001 | −0.51 | 0.0001 |
DWR | 0.23 | 0.1904 | 0.34 | 0.0549 |
DWC | −0.44 | 0.0302 | −0.52 | 0.0042 |
Adjusted R2 (%) | 24.90 | 25.73 | ||
Residual Std. error | 2.21 | 2.33 |
Source: Compiled by the authors. All the predictor variables were standardized before fitting the models to make comparisons of coefficients easier.
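The effect of standardizing the predictors can be seen in a small one-feature example. The sketch below uses invented data (the values of x and y are hypothetical, not the paper's features or errors) and ordinary least squares in closed form: with a standardized predictor, the intercept equals the mean of the response and the slope is the expected change in the response per one standard deviation of the feature.

```python
from math import sqrt

# Invented data for illustration only: a feature x and an error index y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.0, 10.0, 9.0, 8.5, 8.0]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n
# Sample standard deviation of x (n - 1 in the denominator).
x_sd = sqrt(sum((v - x_mean) ** 2 for v in x) / (n - 1))

# Standardize the predictor to zero mean and unit standard deviation.
z = [(v - x_mean) / x_sd for v in x]

# Closed-form OLS for the model y = intercept + slope * z.
z_mean = sum(z) / n  # zero up to rounding error
slope = (sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
         / sum((zi - z_mean) ** 2 for zi in z))
intercept = y_mean - slope * z_mean

print(f"intercept = {intercept:.3f} (mean of y = {y_mean:.3f})")
print(f"slope per one standard deviation of x = {slope:.3f}")
```

The same property carries over to multiple regression whenever every regressor is centred, which is what makes coefficients of standardized predictors directly comparable in size.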
8. Discussion, conclusions, and future research
Ecological inference methods are devised to infer conditional distribution probabilities from marginal distributions. In doing so, they consider a main characteristic variable (e.g. race or social class) impacting on a response variable, usually the vote. This scheme fits the majority of ecological inference problems. In some instances, however, such as in simultaneous elections, this general scheme can be questionable. The issue is that different solutions are achieved depending on which variable is considered the factor (origin, explanatory, cause) and which the response. The methods are asymmetric. In this paper, we ask whether we can obtain some advantage in terms of accuracy by dealing with this inverse problem in a symmetric way. That is, by solving it as a purely mathematical puzzle that must verify some congruence properties, omitting the possible presence of a natural a priori relationship (i.e. a logical mapping of the variables to rows and columns). Afterwards, the researcher can recover the meaningful order when presenting the outcomes.
To answer the above research question, this paper builds, within the linear programming framework, two new families of algorithms whose solutions do not depend on how variables are mapped to rows and columns. These are symmetric methods. From a statistical standpoint, symmetric approaches offer the advantage of adhering to the probabilistic symmetry condition, meaning that their solutions conform to Bayes' theorem. It should be noted, however, that while the symmetry condition could be considered a desirable property, it alone does not guarantee accurate solutions. For example, an algorithm that assumes conditional independence between rows and columns given the unit (equivalent to the nonlinear neighbourhood model; Freedman et al., 1998) would more than quadruple the average error in the datasets analysed in this paper, despite satisfying the symmetry condition.
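The symmetry condition can be made concrete with a small example. The sketch below uses two invented 2 × 3 vote tables with identical margins (they are hypothetical, not data from the paper) and a hypothetical helper, satisfies_bayes, to check whether a pair of row-conditional and column-conditional matrices is mutually consistent, that is, whether P(column | row) P(row) = P(row | column) P(column) holds in every cell.

```python
def conditionals(votes):
    """Return (P(col | row), P(row | col), row totals, col totals) for a count table."""
    row_totals = [sum(row) for row in votes]
    col_totals = [sum(col) for col in zip(*votes)]
    p_col_given_row = [[v / row_totals[j] for v in row] for j, row in enumerate(votes)]
    p_row_given_col = [[v / col_totals[k] for k, v in enumerate(row)] for row in votes]
    return p_col_given_row, p_row_given_col, row_totals, col_totals


def satisfies_bayes(p_col_given_row, p_row_given_col, row_totals, col_totals, tol=1e-9):
    """Check P(col | row) * P(row) == P(row | col) * P(col) cell by cell."""
    total = sum(row_totals)
    for j, p_row in enumerate(p_col_given_row):
        for k, p in enumerate(p_row):
            left = p * row_totals[j] / total
            right = p_row_given_col[j][k] * col_totals[k] / total
            if abs(left - right) > tol:
                return False
    return True


# Two invented tables with the same row margins (500, 500) and column margins (400, 400, 200).
table_a = [[300, 150, 50], [100, 250, 150]]
table_b = [[280, 170, 50], [120, 230, 150]]

rows_a, cols_a, row_tot, col_tot = conditionals(table_a)
_, cols_b, _, _ = conditionals(table_b)

# Both conditionals derived from one joint table: consistent.
consistent = satisfies_bayes(rows_a, cols_a, row_tot, col_tot)
# Conditionals taken from two different tables with equal margins: inconsistent.
mixed = satisfies_bayes(rows_a, cols_b, row_tot, col_tot)

print(consistent, mixed)
```

The second check mimics what can happen when the two mappings of variables to rows and columns are estimated separately: each estimate respects the observed margins, yet the pair does not correspond to a single joint table.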
We evaluate the accuracy of the proposed methods using real data corresponding to 565 simultaneous elections where the true district-level cross-classifications of votes are known. Our empirical assessment clearly points to the proposed symmetric solutions as being more accurate than the equivalent asymmetric approaches, with the ratio between available information and the complexity/difficulty of the problem being the characteristic with the greatest impact on accuracy. Overall, the optimal solutions reached with the joint specifications, those which solve linear programmes that simultaneously include all the constraints, are not the most accurate on average. They are outperformed by the methods that build their solutions as an average of asymmetric solutions. We conclude that, for the datasets at hand, the nslphom_dual_w solutions are, on average, the most accurate. This method owes its better average accuracy to its greater robustness: among the nslphom symmetric models, it is the one that attains the smallest error the fewest times. Indeed, no symmetric algorithm systematically beats the rest of the symmetric methods, so a question to be answered is whether the circumstances themselves could indicate which symmetric method to choose. In other words, can we find specific configurations, derivable from the observed data, that call for the recommendation of a specific method for a particular dataset?
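The idea of building a symmetric solution as an average of two asymmetric solutions can be sketched as follows. This is an illustration under stated assumptions, not the nslphom_dual_w algorithm: the two vote tables are hypothetical, the function names are invented for the example, and the weighting rule used by the package variants is not reproduced (a plain convex combination with a user-supplied weight is assumed).

```python
def average_tables(a, b, weight=0.5):
    """Convex combination of two vote tables of the same shape."""
    return [
        [weight * x + (1 - weight) * y for x, y in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]

def margins(table):
    return [sum(r) for r in table], [sum(c) for c in zip(*table)]

def transpose(table):
    return [list(col) for col in zip(*table)]

# Hypothetical estimates of the same 2 x 2 table of votes.
# `direct`: rows treated as origin. `dual_transposed`: roles swapped,
# then transposed back so both tables share the same orientation.
direct = [[70, 30], [20, 80]]
dual_transposed = [[60, 40], [30, 70]]

combined = average_tables(direct, dual_transposed)

# Swapping the roles of rows and columns only transposes the result.
swapped = average_tables(transpose(dual_transposed), transpose(direct))

print(combined)           # [[65.0, 35.0], [25.0, 75.0]]
print(margins(combined))  # ([100.0, 100.0], [90.0, 110.0])
```

Two properties are visible in this sketch. Since both inputs share the observed margins, any convex combination of them preserves those margins. With equal weights, exchanging the roles of rows and columns yields the transposed table, which is the symmetry the text refers to.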
Given that mathematical programming offers the natural methodological setting for simultaneously handling all the constraints linked to a symmetric treatment of the problem, we have considered only algorithms within the linear programming framework where we have built the symmetric solutions from asymmetric methods. This has the advantage of maintaining all the proposals within the same framework, making comparisons fairer and producing methods easy to use and robust to claims of hacking. The user only needs to input a set of ecological data and a maximum number of iterations, and the procedures automatically return a sensible solution. Nevertheless, it would be worth extending the analysis to other methodological settings, i.e. to study whether similar results would be obtained if symmetric methods were built from scratch or using asymmetric algorithms from other frameworks. For example, the following question could be addressed: can inferences attained as an average of the two dual solutions of the Rosen et al. (2001) R × C model systematically improve the asymmetric solution achieved by choosing the most logical approach (of the two)? Indeed, in our view, the new angle for tackling the ecological inference problem offered by this research should also be systematically explored from other conceptual frameworks. This, however, demands new theoretical developments. More specifically, just as ecological inference linear programming requires new developments to introduce information from polls into its models (Pavía, 2023), an effort is also required to develop genuine symmetric models from other frameworks, such as the Bayesian and frequentist ecological inference ones. In this case, we consider that models based on conditional multivariate hypergeometric distributions or full-table multinomial distributions could prove successful.
Our empirical results are based on data from simultaneous elections, so it would also be worth exploring whether our conclusions can be confirmed in other contexts, such as voting rights litigation or literacy studies. Although datasets with true answers in other target application areas are scarce and typically not available, it would be interesting to replicate the study of Barreto et al. (2022) on racially polarized voting and analyse the impact of using symmetric approaches on substantive conclusions and on the accuracy in estimating the inner-cell values of their simulated datasets. Likewise, the symmetric approach could also be tested by estimating the tables of the datasets employed by Jiang et al. (2020). These datasets include data on US mortality rates by gender and race, and on literacy rates and educational attendance by gender in India.
Finally, from a more methodological perspective, it would be interesting to study what impact initializing the iterative process with a different matrix of votes would have on the accuracy of the nslphom_joint optimal solutions; for instance, what the impact would be of initializing with the nslphom_dual_w solution instead of with the lphom_joint solution. Furthermore, given the lack of theoretical results about the (asymptotic) distributions of the estimators, we consider that the bootstrap approach (e.g. Efron & Tibshirani, 1994) could be used to measure the precision of estimates. It is true that lphom-based specifications can be theoretically viewed as a linear absolute fitting problem after applying Theorem 1 of chapter 6 in Bloomfield and Steiger (1983, pp. 164–165) and that, therefore, their estimators could ‘usually’ be considered as having ‘a limiting normal distribution’ (Bloomfield & Steiger, 1983, p. 44). In our view, however, this does not apply as a rule in the ecological inference problem and definitely cannot be used in the tslphom- and nslphom-based specifications. On the one hand, it is difficult to accept that the assumptions required by the different asymptotic theorems hold in the ecological inference problem, due to, among other issues, the discrete character of the margins and the cross structures of relationships that the row and column aggregations impose. On the other hand, and more importantly, the tslphom- and nslphom-specifications do not fulfil the hypothesis of the abovementioned Theorem 1, as it requires that the number of summands (auxiliary variables) in the objective function be greater than the number of equality constraints. The unit-table algorithms, lphom_local and lphom_local_joint, that are at the core of the tslphom- and nslphom-family models solve linear programmes where the number of auxiliary variables is smaller than the number of equality constraints.
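The bootstrap suggestion made in this paragraph could be sketched as follows. Several assumptions are made here that do not come from the paper: units are resampled with replacement (one possible resampling scheme), a percentile interval is reported, the four units are hypothetical, and `transfer_share` is a simple stand-in estimator based on within-unit independence, which in practice would be replaced by the chosen ecological inference method.

```python
import random

def transfer_share(units, j, k):
    """Stand-in estimator of P(column k | row j); not an lphom method."""
    num = sum(rows[j] * cols[k] / sum(rows) for rows, cols in units)
    den = sum(rows[j] for rows, _ in units)
    return num / den

def bootstrap_interval(units, j, k, replicates=500, level=0.90, seed=1):
    """Percentile interval from resampling units with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(replicates):
        sample = [rng.choice(units) for _ in units]
        stats.append(transfer_share(sample, j, k))
    stats.sort()
    lo_idx = int((1 - level) / 2 * (replicates - 1))
    hi_idx = int((1 + level) / 2 * (replicates - 1))
    return stats[lo_idx], stats[hi_idx]

# Hypothetical units: (row margins, column margins), equal totals per unit.
units = [
    ([60, 40], [50, 50]),
    ([30, 70], [20, 80]),
    ([55, 45], [65, 35]),
    ([80, 20], [70, 30]),
]

point = transfer_share(units, 0, 0)
low, high = bootstrap_interval(units, 0, 0)
print(point, low, high)
```

The width of the interval only reflects variability across units under the assumed resampling scheme. It does not capture the aggregation bias inherent to ecological inference.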
The above issues do not imply that ecological inference linear programming approaches lack a statistical interpretation or that the estimated coefficients necessarily exhibit undesirable statistical properties. On the one hand, in addition to the established links between linear programming solutions and minimization of expected discrepancies, as previously mentioned when discussing the proposed models, it is important to note that tslphom- and nslphom-based approaches, like many other ecological inference models, rely on the underlying assumption of independence across units of the two-way distributions of votes, given their aggregate two-way joint distribution. In other words, the set of unit two-way fraction distributions can be viewed as a simple random sample from a common probability distribution. This entails that, without covariates, estimates can only be reliable when models are applied in homogeneous political regions. On the other hand, we believe that, similar to how sampling properties of quadratic programming solutions can be obtained through the links between quadratic programming and inequality-constrained normal regression (Geweke, 1986; Judge & Takayama, 1966), the connection between linear programming and inequality-constrained linear absolute fitting with Laplace-distributed errors could be explored to derive statistical properties of basic ecological inference linear programming solutions. In addition, we also find it remarkable that the iterative algorithm that characterizes nslphom-based methods can be understood in terms of an expectation-maximization algorithm (Dempster et al., 1977). In this comparison, equations (6)–(13) would correspond to the expectation step, equation (14) to the maximization step, and equations (1)–(5) would be required for generating an initial estimate of the expected transition probabilities. This connection becomes more evident when examining the variants introduced in the lclphom algorithm (Pavía, 2024).
Acknowledgements
The authors wish to thank the editors and four anonymous reviewers for their valuable comments and suggestions and M. Hodkinson for revising the English of the paper.
Funding
This research has been supported by Conselleria de Educación, Universidades y Empleo, Generalitat Valenciana [grant number AICO/2021/257] and by Ministerio de Economía e Innovación [grant number PID2021-128228NB-I00].
Data availability
The data used in this research are publicly available in the R package ei.Datasets (version 0.0.1-1), accessible on CRAN. The reproducible ad hoc R code employed, based on functions included in the R package lphom (version 0.3.0-7), is available in the attached online supplementary material.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series A.
References
Author notes
Conflicts of interest: none declared.