Symmetry estimating R × C vote transfer matrices from aggregate data

Table 1.

Averages of EI errors by group of elections

Country year	NZ 2002	NZ 2005	SCO 2007	NZ 2008	NZ 2011	NZ 2014	NZ 2017	NZ 2020
# of elections	N = 69	N = 69	N = 73	N = 70	N = 70	N = 71	N = 71	N = 72
Avg. # of units	$\bar{I}$ = 83.2	$\bar{I}$ = 81.8	$\bar{I}$ = 70.2	$\bar{I}$ = 84.1	$\bar{I}$ = 85.7	$\bar{I}$ = 81.2	$\bar{I}$ = 101.9	$\bar{I}$ = 134.9
Avg. # of cells	$\bar{J K}$ = 39.5	$\bar{J K}$ = 23.8	$\bar{J K}$ = 35.2	$\bar{J K}$ = 23.4	$\bar{J K}$ = 26.2	$\bar{J K}$ = 27.9	$\bar{J K}$ = 24.8	$\bar{J K}$ = 24.5
lphom-based solutions
lphom(X, Y)	16.88	12.14	12.92	12.22	12.99	12.95	12.20	14.02
lphom(Y, X)	16.14	12.74	10.78	11.87	13.40	13.70	12.08	12.18
lphom_min	16.30	11.86	11.73	11.58	12.65	12.80	11.61	12.59
lphom_dual_a	14.87	11.47	10.50	11.03	12.14	12.03	11.19	12.16
lphom_dual_w	14.89	11.37	10.41	10.97	12.05	12.01	11.10	12.11
lphom_joint	15.36	11.65	10.52	11.31	12.51	12.35	11.48	12.60
tslphom-based solutions
tslphom(X, Y)	14.80	10.91	11.00	10.89	11.50	11.66	10.91	12.59
tslphom(Y, X)	14.52	11.50	9.46	10.77	12.22	12.50	10.95	11.07
tslphom_min	14.06	10.65	9.64	10.30	11.20	11.61	10.36	11.18
tslphom_dual_a	13.18	10.27	8.96	9.87	10.83	10.83	10.03	10.92
tslphom_dual_w	13.15	10.17	8.87	9.79	10.74	10.78	9.95	10.85
tslphom_joint	14.07	10.73	9.42	10.40	11.46	11.45	10.49	11.54
nslphom-based solutions
nslphom(X, Y)	12.79	9.47	8.85	9.11	9.46	9.69	8.91	9.82
nslphom(Y, X)	12.71	9.89	8.11	9.17	10.86	11.15	9.29	9.39
nslphom_min	12.55	9.22	8.42	8.63	9.41	9.67	8.48	9.07
nslphom_dual_a	11.10	8.63	7.11	8.08	9.01	9.08	8.04	8.52
nslphom_dual_w	11.14	8.58	7.11	8.04	8.91	9.01	7.96	8.48
nslphom_joint	11.54	8.86	7.53	8.44	9.30	9.39	8.23	8.99

Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms described in Sections 3–5 to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). For nslphom-based models, ns is always fixed at 10. The smaller the number, the better the accuracy. Values in bold correspond to the solutions that are most accurate on average in each set of elections.

Table 2.

Averages of EI errors of nslphom-based solutions by alternative groupings of elections

Group of elections	NZ—regular	NZ—Māori	NZ—all	SCO 2007	NZ + SCO
# of elections	N = 443	N = 49	N = 492	N = 73	N = 565
Avg. # of units	$\bar{I}$ = 65.9	$\bar{I}$ = 343.1	$\bar{I}$ = 93.5	$\bar{I}$ = 70.2	$\bar{I}$ = 90.5
Avg. # of cells	$\bar{J K}$ = 26.8	$\bar{J K}$ = 29.9	$\bar{J K}$ = 27.1	$\bar{J K}$ = 35.2	$\bar{J K}$ = 28.2
nslphom(X, Y)	9.85	10.20	9.89	8.85	9.75
nslphom(Y, X)	10.47	9.14	10.34	8.11	10.05
nslphom_min	9.54	9.83	9.57	8.42	9.42
nslphom_dual_a	9.02	7.99	8.92	7.11	8.68
nslphom_dual_w	8.96	8.04	8.87	7.11	8.64
nslphom_joint	9.40	7.85	9.24	7.53	9.02

Source: Compiled by the authors after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new nslphom-based algorithms described in Sections 3–5, with ns = 10, to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). The smaller the number, the better the accuracy. Values in bold correspond to the solutions that are most accurate on average in each set of elections.

Table 2 presents the summaries obtained when the NZ elections are grouped in the alternative way suggested by Pavía (2022), focusing only on the most accurate algorithms: those based on the nslphom model. To offer more context and ease comparisons, Table 2 also includes the results attained when all the NZ elections are combined in a single group and when all the elections are pooled, together with, again, the Scottish results. The upper panels of both tables contain some summary statistics of the corresponding groups of elections. Here, one of the distinctive characteristics of the Māori districts stands out: the much larger number of units into which their electorates are split (343.1 on average, against 65.9 in the regular NZ districts). Figure 2 shows graphically the average errors attained by the nslphom-family algorithms in all the groupings.

Figure 2.

Graphical representation of average values of EI error measures grouped by election and nslphom-family algorithm. The smaller the number, the better the accuracy. Individual solutions are attained after applying, with default options, the function nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms, described in Sections 3–5, to the data from the New Zealand electoral commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). Detailed figures are available in Tables 1 and 2.

In light of the empirical assessments, two clear findings emerge. First, our results confirm for the symmetric models the order of preference already stated by Pavía and Romero (2022, 2023) for the asymmetric procedures. As with asymmetric models, tslphom-based methods systematically improve on lphom-based ones and, equally, nslphom-based models consistently outperform tslphom-based ones (see Table 1). Second, and more importantly given our main research question, comparing asymmetric and symmetric models shows that symmetric algorithms produce, conditional on the underlying base algorithm, consistently more accurate estimates than asymmetric procedures (at least for the data at hand). Overall, symmetric solutions beat asymmetric solutions almost 90% of the time.

The family of the algorithm (i.e. the underlying procedure: lphom, tslphom, or nslphom), however, has a greater impact on the accuracy of the estimates than the symmetric or asymmetric character of the model. tslphom-based models, including asymmetric ones, are on average more accurate than lphom-based models, and nslphom-based models stand in the same relationship to tslphom-based ones. In short, the nslphom-based models are clearly the most accurate, so henceforth we focus only on these, concentrating our analysis on the results available in Table 2 and Figure 1 and in the last panel of Table 1. In particular, pooling all the elections and algorithms, nslphom symmetric solutions are, on average, almost 11% more accurate than nslphom asymmetric ones.
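The order of magnitude of the "almost 11%" figure can be checked against the pooled column (NZ + SCO) of Table 2. The Python sketch below is one possible reading of that comparison and not necessarily the exact computation behind the figure: it averages the two asymmetric and the four symmetric nslphom errors and expresses the difference relative to the symmetric mean. The variable names and the definition of the relative gain are assumptions made for illustration.

```python
# Pooled (NZ + SCO) average EI errors of the nslphom family, copied from Table 2.
asymmetric = {"nslphom(X, Y)": 9.75, "nslphom(Y, X)": 10.05}
symmetric = {
    "nslphom_min": 9.42,
    "nslphom_dual_a": 8.68,
    "nslphom_dual_w": 8.64,
    "nslphom_joint": 9.02,
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mean_asym = mean(asymmetric.values())  # average error of the asymmetric solutions
mean_sym = mean(symmetric.values())    # average error of the symmetric solutions

# Extra error of the asymmetric solutions relative to the symmetric ones
# (assumed definition of "more accurate").
relative_gain = (mean_asym - mean_sym) / mean_sym

print(f"asymmetric mean EI: {mean_asym:.2f}")
print(f"symmetric mean EI:  {mean_sym:.2f}")
print(f"relative gain:      {100 * relative_gain:.1f}%")
```

Under this definition the gain is about 10.7%, in line with the figure quoted in the text.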

Focusing now on the relationships within the symmetric solutions, we observe some general, although not completely systematic, patterns. Pooling all the results, we see that, on average, the nslphom_dual_w solutions are the most accurate, followed by the nslphom_dual_a, nslphom_joint, and nslphom_min solutions, in that order (see Table 2 and Figure 2). For some of the groups, however, nslphom_dual_w solutions are not the best on average: they are beaten by nslphom_dual_a in the 2002 elections group and by nslphom_joint in the Māori districts group.

In any case, what we can affirm is that nslphom_dual_w generates the most robust solutions. This algorithm produces, on average, the most accurate solutions despite being, of the four, the one that attains the smallest EI error the fewest times. Indeed, even nslphom_min, the algorithm with the greatest average error of the four, produces the best solution more often than nslphom_dual_w. At the individual level, nslphom_min is the best 25% of the time, the same figure recorded for nslphom_dual_a, while the figures for nslphom_joint and nslphom_dual_w are 30% and 20%, respectively. Looking in more depth at the individual matrices, we also find that, as expected, the nslphom_dual_w and nslphom_dual_a solutions are quite close to each other, lying midway between the nslphom_min and the nslphom_joint solutions, although a little closer to the latter.
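That a method can be the best on average while being the individual winner least often may seem counterintuitive. The toy Python sketch below illustrates the mechanism with invented numbers (they are not taken from the analysed database, and the method names are hypothetical): a method that is never far from the best solution can have the smallest mean error while winning fewer elections than a more erratic competitor.

```python
# Hypothetical EI errors of three methods in four elections. The numbers are
# invented purely for illustration and do not come from the paper.
errors = {
    "method_a": [4.0, 12.0, 12.0, 4.0],   # erratic: very good or very bad
    "method_b": [12.0, 4.0, 12.0, 12.0],  # erratic and usually bad
    "method_w": [6.0, 6.0, 6.0, 6.0],     # stable: never far from the best
}

n_elections = len(next(iter(errors.values())))

# Average error per method.
mean_error = {m: sum(e) / len(e) for m, e in errors.items()}

# Number of elections in which each method attains the smallest error.
wins = {m: 0 for m in errors}
for i in range(n_elections):
    best = min(errors, key=lambda m: errors[m][i])
    wins[best] += 1

best_on_average = min(mean_error, key=mean_error.get)
most_wins = max(wins, key=wins.get)

print("mean error:", mean_error)
print("wins:", wins)
print("best on average:", best_on_average, "| most wins:", most_wins)
```

In this toy example the stable method has the lowest mean error although another method wins twice as many elections, which is the sense in which a solution is called robust here.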

The results for the Māori electorates are particularly interesting (see Table 2 and Figure 2). In all the groups except the Māori one, the dual solutions built as means are the best on average, whereas for the Māori electorates the best on average are the predictions based on the joint algorithm. At first glance, one could think that this is a consequence of some of their distinctive characteristics. According to Pavía (2022), the Māori districts are remarkably different in many respects; for instance, compared to the rest of the districts of the database, their electorates are split into a greater number of units, their transfer matrices record weaker relationships between row and column options, the electoral behaviour of their voters shows greater heterogeneity across units, and their datasets show larger across-unit variances for both parties and candidates. A closer look at the individual errors, however, reveals that almost all the relative advantage of nslphom_joint over nslphom_dual_w and nslphom_dual_a stems from the 2002 elections. Indeed, nslphom_joint is more accurate on average than nslphom_dual_w and nslphom_dual_a only in the 2002 and 2005 elections. As a rule, therefore, we can say that, at least for datasets with characteristics similar to the ones analysed in this research, the nslphom_dual_w solutions are preferable. They are more accurate, on average, and quite robust.

Finally, focusing on computational times (see Table 3), we see that all the methods are quite fast. As expected, the nslphom solutions are slower, but they require only a few seconds to reach their solutions. The slowest algorithm is nslphom_joint but, on average, it takes only a little over 20 s on these datasets. In general, the computational burden grows with the number of units and the number of cells of the corresponding ecological matrix.

Table 3.

Averages of computational burden (in seconds) by group of elections

Country year	NZ 2002	NZ 2005	SCO 2007	NZ 2008	NZ 2011	NZ 2014	NZ 2017	NZ 2020
lphom-based algorithms
lphom(X, Y)	0.20	0.18	0.06	0.12	0.20	0.19	0.18	0.38
lphom(Y, X)	0.23	0.21	0.06	0.24	0.33	0.34	0.30	0.36
lphom_dual	0.42	0.40	0.12	0.31	0.48	0.42	0.44	0.74
lphom_joint	1.08	0.98	0.25	0.76	1.14	1.01	1.05	1.81
tslphom-based algorithms
tslphom(X, Y)	0.69	0.60	0.46	0.51	0.62	0.63	0.60	1.01
tslphom(Y, X)	0.72	0.61	0.47	0.62	0.72	0.69	0.77	0.97
tslphom_dual	1.41	1.23	0.89	1.12	1.35	1.27	1.33	1.94
tslphom_joint	3.85	2.78	2.36	2.43	3.10	2.89	2.95	4.29
nslphom-based algorithms
nslphom(X, Y)	5.11	4.35	4.07	4.11	4.73	4.55	4.61	6.43
nslphom(Y, X)	5.07	4.39	3.96	4.18	4.79	4.61	4.82	6.43
nslphom_dual	10.22	8.81	8.06	8.33	9.03	8.86	9.52	12.87
nslphom_joint	29.03	19.39	21.44	18.09	21.90	21.39	20.94	28.99

Source: Compiled by the authors after applying, with default options, the functions lphom, tslphom, and nslphom of the R package lphom (Pavía & Romero, 2022) and the new algorithms described in Sections 3–5 to the data from the New Zealand Electoral Commission and the Scotland Electoral Office available in the R package ei.Datasets (Pavía, 2022). For nslphom-based models, ns is always fixed at 10. The computations were performed on a laptop with an Intel Core i7-6820HK CPU (4 cores, 2.70 GHz) and 64 GB of RAM.

7. On the factors impacting on the accuracy of the estimates

In the previous section, we compared the average accuracies of the different solutions, omitting their variability. However, the accuracy of the estimates differs not only across methods but also, obviously, across districts/elections. In this section, we consider several observed features, specific to each election, and study whether and to what extent they can explain the observed differences in accuracy across elections. To do that, we focus on just the nslphom_dual_w and nslphom_joint solutions. On the one hand, the former have proved the most accurate in the analysed database. On the other hand, the latter have been obtained with the theoretically finest method, as it solves the problem dealing with all the constraints simultaneously.

Figure 3, which displays the distributions of EI over the 565 elections analysed, clearly shows the existence of important accuracy differences across elections in both groups of solutions. For instance, focusing on nslphom_dual_w solutions, EI ranges from a minimum of 2.45 to a maximum of 20.01. The minimum is observed in an election where voters are distributed into 90 polling stations and the problem consists in estimating a transfer matrix of size $5 \times 5$; the maximum is recorded in an election in which there are data from just 38 polling stations to estimate a $7 \times 7$ matrix. This contrast exemplifies a pattern typically shown by ecological inference estimators, which tend to be more accurate the larger the ratio between units and unknowns, $R = I / JK$. In particular, in our application, the EI average values for, respectively, nslphom_dual_w and nslphom_joint are 12.47 and 12.43 when $R \leq 1$; 9.46 and 9.97 when $1 < R < 2$; and 8.08 and 8.43 when $R \geq 2$.

Figure 3.

Histograms of the distributions of the error indexes (EI) corresponding to the nslphom_dual_w and nslphom_joint solutions attained in the 565 elections analysed. The discontinuous vertical lines place the means of the distributions.

The accuracy of the attained estimates, therefore, depends on structural characteristics of the particular election under study. In what follows, we analyse whether the observed variables previously identified in the literature as explaining accuracy for asymmetric methods also work for the proposed symmetric models. In particular, the list of features recognized in the literature (e.g. King, 1997; Klima et al., 2016; Park et al., 2014; Pavía & Romero, 2023; Plescia & De Sio, 2018; Wakefield, 2004) as factors affecting the accuracy of the estimates includes: (i) the amount of available information, I; (ii) the complexity of the problem, JK; (iii) the level of heterogeneity among the local transfer matrices, HET; (iv) the strength of the relationship between the row and column categories, $χ^{2}$; (v) the degrees of within-unit diversity/variability in both row and column marginal distributions, DWR and DWC, which account for how similar/different the sizes of the different groups/categories are; and (vi) the extent of the across-unit variability in both row and column marginal distributions, VAR and VAC, which measure how similar/different the marginal distributions are across tables.

As a rule, ceteris paribus, the larger the number of polling stations and the smaller the contingency tables (the number of coefficients to be estimated), the more accurate the estimates are, with the impact of each of these two features being conditioned by the value taken by the other. Indeed, it is customarily stated that a ratio of at least two units (polling stations) per coefficient is necessary for a proper estimation of the vote transfer matrix (Plescia & De Sio, 2018). Thus, the impacts of I and JK are usually assessed together, using as a joint indicator the number of polling stations divided by the number of rows times the number of columns: $R = I / JK$.
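As a small worked example, the ratio can be computed for the two extreme elections quoted earlier in this section (90 polling stations with a $5 \times 5$ matrix, and 38 polling stations with a $7 \times 7$ matrix). The Python sketch below uses hypothetical helper names; the bands are the ones used above to report the average EI by ratio.

```python
def units_per_coefficient(n_units: int, n_rows: int, n_cols: int) -> float:
    """Ratio R = I / (J * K) between units and cells of the transfer matrix."""
    return n_units / (n_rows * n_cols)


def ratio_band(r: float) -> str:
    """Band of R used when reporting the average EI by ratio."""
    if r <= 1:
        return "R <= 1"
    if r < 2:
        return "1 < R < 2"
    return "R >= 2"


# Election with the smallest EI: 90 polling stations, 5 x 5 transfer matrix.
r_best = units_per_coefficient(90, 5, 5)
# Election with the largest EI: 38 polling stations, 7 x 7 transfer matrix.
r_worst = units_per_coefficient(38, 7, 7)

print(f"R = {r_best:.2f} -> {ratio_band(r_best)}")
print(f"R = {r_worst:.2f} -> {ratio_band(r_worst)}")
```

The first election has 3.6 units per coefficient, well above the customary threshold of two, whereas the second has fewer units than coefficients (about 0.78).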

The values HET and $χ^{2}$ are commonly unknown as they both depend on the actual district vote transfer matrix. Although in our case they could be exactly computed, we prefer to consider a typical situation and to study the impact of these features relying on their estimates, which are contingent on the specific solution reached. In particular, we approximate HET using the HETe coefficient given by equation (33) and $χ^{2}$ through $χ^{2} e = \sum_{j, k} \frac{\left(\hat{v}_{jk} - \sum_{j} \hat{v}_{jk} \sum_{k} \hat{v}_{jk}\right)^{2}}{(J - 1)(K - 1) \sum_{j} \hat{v}_{jk} \sum_{k} \hat{v}_{jk}}$, where the divisor $(J - 1)(K - 1)$ accounts for the different number of cells that each matrix has.
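The association index $χ^{2} e$ can be sketched in a few lines of Python. The function below is an illustration under stated assumptions rather than the code used for the paper: the function name is hypothetical, and the expected cells are computed as row total times column total divided by the grand total, which reduces to the expression above when the estimated transfers $\hat{v}_{jk}$ are expressed as proportions that sum to one.

```python
def chi2_e(v_hat):
    """Pearson-type association index scaled by (J - 1)(K - 1).

    v_hat: J x K list of lists with the estimated transfers (non-negative).
    Expected cells are row total * column total / grand total; dividing by
    the grand total is an assumption of this sketch and has no effect when
    v_hat already sums to one.
    """
    n_rows, n_cols = len(v_hat), len(v_hat[0])
    row_tot = [sum(row) for row in v_hat]
    col_tot = [sum(v_hat[j][k] for j in range(n_rows)) for k in range(n_cols)]
    total = sum(row_tot)

    stat = 0.0
    for j in range(n_rows):
        for k in range(n_cols):
            expected = row_tot[j] * col_tot[k] / total
            if expected > 0:  # empty rows or columns contribute nothing
                stat += (v_hat[j][k] - expected) ** 2 / expected
    return stat / ((n_rows - 1) * (n_cols - 1))


# No association: every cell equals the product of its margins.
independent = [[0.12, 0.28], [0.18, 0.42]]
# Perfect association: each row option maps to a single column option.
diagonal = [[0.5, 0.0], [0.0, 0.5]]

print(chi2_e(independent), chi2_e(diagonal))
```

The index is zero when rows and columns are unrelated and grows with the strength of the relationship between row and column options.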

To approximate district within-unit diversities, different statistics have been tried in the literature. They have been measured (a) as (weighted) averages of the standard deviations of the unit marginal distributions (using as weights the number of voters per unit), (b) directly using the standard deviations of the district marginal distributions, and (c) employing the formula $1 - \sum_{h} π_{h}^{2}$, with $π_{h}$ representing the proportion of votes gained by option h in the total population. In practice, these three measures are (almost) equivalent. In our dataset, the correlations between (a) and (b), (a) and (c), and (b) and (c) are, respectively, 0.99, −0.99, and −0.99 for parties and 0.99, −0.99, and −0.98 for candidates. We approximate DWR and DWC using (b), the standard deviations of the district marginal distributions of parties and candidates, respectively. Finally, to measure the extent of the diversity/similarity of party and candidate distributions across polling stations, we have used the compositional total variance (Pawlowsky-Glahn et al., 2015): $VAR = (2J)^{-1} \sum_{j, j'}^{J} \mathrm{Var}\left(\{\log(x_{ij} / x_{ij'})\}_{i}\right)$ and $VAC = (2K)^{-1} \sum_{k, k'}^{K} \mathrm{Var}\left(\{\log(y_{ik} / y_{ik'})\}_{i}\right)$, where the variances are computed across the I units. Note that in any application both VAR and VAC need to be different from zero for the ecological inference problem to have a solution, as all the methods learn from the statistical covariation between the marginal distributions of the polling stations.
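The following Python sketch illustrates measures (b) and (c) and the compositional total variance on toy inputs. It is a minimal illustration under assumptions that the text does not settle: the function names are hypothetical, measure (b) uses the population standard deviation, the variances across units use the sample divisor $I - 1$, and all shares are assumed strictly positive so that the log-ratios exist.

```python
import math
from statistics import pstdev, variance


def within_sd(shares):
    """Measure (b): standard deviation of a marginal distribution of shares."""
    return pstdev(shares)


def one_minus_sum_squares(shares):
    """Measure (c): 1 - sum_h pi_h^2."""
    return 1.0 - sum(p * p for p in shares)


def total_variance(units):
    """Compositional total variance (2D)^-1 * sum_{j,j'} Var_i(log(x_ij / x_ij')).

    units: list of I compositions, each a list of D strictly positive shares.
    The sample variance (divisor I - 1) is an assumption of this sketch.
    """
    n_parts = len(units[0])
    acc = 0.0
    for j in range(n_parts):
        for jp in range(n_parts):
            log_ratios = [math.log(u[j] / u[jp]) for u in units]
            acc += variance(log_ratios)
    return acc / (2 * n_parts)


# Measures (b) and (c) move in opposite directions as shares become unequal.
balanced, unbalanced = [0.5, 0.5], [0.9, 0.1]
print(within_sd(balanced), one_minus_sum_squares(balanced))
print(within_sd(unbalanced), one_minus_sum_squares(unbalanced))

# Identical margins in every unit carry no information: total variance is zero.
same_everywhere = [[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]
# Two units whose log-ratios are 0 and 1.
two_units = [[0.5, 0.5], [math.e / (1 + math.e), 1 / (1 + math.e)]]
print(total_variance(same_everywhere), total_variance(two_units))
```

The toy inputs reproduce the negative relationship between (b) and (c) noted above, and show why a zero total variance leaves the methods with nothing to learn from.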

Having defined the features, we use linear regression models to study their impact on the accuracy of the nslphom_dual_w and nslphom_joint solutions. To do that, we adopt a two-step strategy. First, we analyse the marginal impact of each feature, not excluding the possibility that its (increasing/decreasing) effect operates at a decreasing/increasing rate. That is, we allow for nonlinear effects in the one-feature models by also including the square of the feature as a potential regressor. Second, we jointly estimate the impact on accuracy of all the regressors identified as statistically significant in the marginal (one-feature) models. To fit the models, we have excluded the Māori electorates from the analysis. Although the aggregate results are quite similar when these elections are also included, we have opted not to consider them when fitting the models because of their singular characteristics, which would undermine the impact of the ratio $I / JK$. In fact, the Māori elections can be regarded as non-standard, as they are characterized by having their electorates distributed into a large number of polling stations, geographically quite spread out, the majority of which have very few voters.
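The first step, a one-feature model with a squared term, can be sketched as follows. This is an illustration only: the data are synthetic and invented for the example, the function name is hypothetical, and the fit is obtained by solving the normal equations directly rather than with the software used for the paper. A negative linear coefficient together with a positive quadratic coefficient corresponds to an error that decreases at a decreasing rate.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1 * x + b2 * x**2 via the normal equations."""
    # Design matrix with columns 1, x, x^2.
    rows = [[1.0, xi, xi * xi] for xi in x]
    # Augmented matrix [X'X | X'y] of size 3 x 4.
    aug = [
        [sum(r[a] * r[b] for r in rows) for b in range(3)]
        + [sum(r[a] * yi for r, yi in zip(rows, y))]
        for a in range(3)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        rest = sum(aug[r][c] * beta[c] for c in range(r + 1, 3))
        beta[r] = (aug[r][3] - rest) / aug[r][r]
    return beta


# Synthetic, noise-free illustration (invented numbers): the error falls as
# the standardized feature grows, but more and more slowly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [9.0 - 1.5 * xi + 0.3 * xi * xi for xi in x]

b0, b1, b2 = fit_quadratic(x, y)
print(f"intercept={b0:.3f}, linear={b1:.3f}, quadratic={b2:.3f}")
```

With real data the squared term would be kept only when it is statistically significant, which requires standard errors that this sketch does not compute.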

Table 4 presents a statistical summary of the values of the abovementioned features in the 516 districts that remain in the database after excluding the Māori electorates. As can be observed, almost all variables show positively skewed (right-skewed) distributions. In terms of correlations, the largest ones are observed for the pairs of variables $I / JK$ and DWR (0.74), $I / JK$ and $χ^{2} e$ (0.53), $HETe_d$ and DWC (−0.64), and $HETe_j$ and DWR (−0.55). Given the large sample size, we do not expect this to represent a problem in interpreting the models obtained.

Table 4.

Statistical summary of the explanatory variables, excluding Māori electorates ($N = 516$)

	$I / J K$	$H E T e_d$	$H E T e_j$	$χ^{2} e_d$	$χ^{2} e_j$	DWR	DWC	VAR	VAC
Mean	2.56	4.25	4.01	2932	2909	0.18	0.21	0.98	0.85
Standard deviation	1.19	1.25	0.95	966	960	0.04	0.04	0.36	0.35
Minimum	0.52	2.42	2.16	377	347	0.06	0.09	0.29	0.12
Maximum	7.73	9.44	8.45	6297	6613	0.34	0.33	2.77	2.12
Skewness	1.16	1.05	0.94	0.40	0.40	0.06	−0.13	1.10	0.74

Source: Compiled by the authors through $χ^{2} e = \sum_{j, k} \frac{\left(\hat{v}_{jk} - \sum_{j} \hat{v}_{jk} \sum_{k} \hat{v}_{jk}\right)^{2}}{(J - 1)(K - 1) \sum_{j} \hat{v}_{jk} \sum_{k} \hat{v}_{jk}}$, $VAR = (2J)^{-1} \sum_{j, j'}^{J} \mathrm{Var}\left(\{\log(x_{ij} / x_{ij'})\}_{i}\right)$, and $VAC = (2K)^{-1} \sum_{k, k'}^{K} \mathrm{Var}\left(\{\log(y_{ik} / y_{ik'})\}_{i}\right)$, and using equation (33) and the standard deviations of the district marginal distributions of parties and candidates to calculate HETe, DWR, and DWC, respectively. The suffixes _d and _j stand for dual and joint, respectively.

In order to facilitate the interpretation of the estimated coefficients and the relative impact of their associated features on accuracy, all the explanatory variables have been standardized to zero mean and unit standard deviation. In this way, each estimated coefficient gives the expected change in the response variable, relative to its mean, associated with a one-standard-deviation change in the corresponding variable. From the one-feature models (not shown), one can infer that (i) $I / J K$ is the feature with the largest impact on accuracy, followed by DWC and $χ^{2} e$, (ii) two variables ($I / J K$ and $χ^{2} e$) show a significant curvature in their relationship with EI, i.e. accuracy improves but at decreasing rates as both variables grow, and (iii) accuracy also improves when either row or column within-units or across-units variances increase.
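
The role of standardization can be illustrated with a minimal sketch on synthetic data. The variable names (`ratio`, `ei`), the data-generating values, and the choice to square the standardized predictor are assumptions made for illustration; this is not the authors' R code and the numbers are not taken from the paper.

```python
import numpy as np

def standardize(x):
    """Rescale a 1-D array to zero mean and unit (sample) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def fit_ols(columns, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic illustration only: these values are NOT the paper's data.
rng = np.random.default_rng(0)
ratio = rng.gamma(shape=4.0, scale=0.6, size=500)   # hypothetical stand-in for I/JK
z = standardize(ratio)
# Hypothetical response with a linear and a quadratic effect of the z-score.
ei = 8.0 - 1.5 * z + 0.3 * z**2 + rng.normal(0.0, 0.5, size=500)

intercept, slope, curvature = fit_ols([z, z**2], ei)
```

With the predictor on the z-score scale, `slope` reads directly as the expected change in the response per one standard deviation of the predictor (at the predictor's mean), which is what makes coefficients of differently scaled features comparable.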

Some changes between the marginal and joint impacts of the features emerge when we fit the full multivariate specifications. Table 5 presents the coefficients of the fitted linear regression models. All variables, except $χ^{2} e^{2}$ and DWR, show a statistically significant impact. Together these variables explain around 25% of the observed variability of EI. As expected, larger $I / J K$ ratios lead to more reliable estimates. This feature is, moreover, the one with the largest impact. The other two variables that increase accuracy (i.e. reduce EI) are the column within-units and across-units variabilities. Whereas in the one-feature models both row and column within-units and across-units variabilities improve accuracy in a similar fashion, when we consider all the features simultaneously DWR loses its statistical significance and VAR reverses its sign. This result may seem puzzling at first glance, given the symmetric nature of the algorithms, but the different impact in this database is a consequence of the differences between the row and column distributions of variabilities/diversities. Finally, the models also show that, when we control for the rest of the variables, the larger the heterogeneities or the relationships between row and column categories, the less accurate the solutions are, ceteris paribus.

Table 5.

Impact of different features on solutions’ accuracy

Variable	Response variable: error index (EI)
	nslphom_dual_w		nslphom_joint
	Estimate	p-Value	Estimate	p-Value
Intercept	8.23	<0.0001	8.72	<0.0001
$I / J K$	−1.60	<0.0001	−1.89	<0.0001
$(I / J K)^{2}$	0.32	<0.0001	0.37	<0.0001
HETe	0.46	0.0009	0.84	<0.0001
$χ^{2} e$	0.78	<0.0001	1.15	<0.0001
$χ^{2} e^{2}$	0.14	0.0583	0.05	0.5368
VAR	0.67	<0.0001	0.43	0.0027
VAC	−0.66	<0.0001	−0.51	0.0001
DWR	0.23	0.1904	0.34	0.0549
DWC	−0.44	0.0302	−0.52	0.0042
Adjusted R² (%)	24.90		25.73
Residual Std. error	2.21		2.33

Source: Compiled by the authors. All the predictor variables were standardized before fitting the models to make comparisons of coefficients easier.

8. Discussion, conclusions, and future research

Ecological inference methods are devised to infer conditional distribution probabilities from marginal distributions. In doing so, they consider a main characteristic variable (e.g. race or social class) impacting on a response variable, usually the vote. This scheme fits the majority of ecological inference problems. In some instances, however, such as in simultaneous elections, this general scheme can be questionable. The issue is that different solutions are achieved depending on which variable is considered the factor (origin, explanatory, cause) and which the response. The methods are asymmetric. In this paper, we ask whether we can obtain some advantage in terms of accuracy by dealing with this inverse problem in a symmetric way. That is, by solving it as a purely mathematical puzzle that must verify some congruence properties, omitting the possible presence of a natural a priori relationship (i.e. a logical mapping of the variables to rows and columns). Afterwards, the researcher can recover the meaningful order when presenting the outcomes.

To answer the above research question, this paper builds within the linear programming framework two new families of algorithms whose solutions do not depend on how variables are mapped to rows and columns. These are symmetric methods. From a statistical standpoint, symmetric approaches offer the advantage of adhering to the probabilistic symmetry condition, meaning that their solutions conform to Bayes' theorem. It should be noted, however, that while the symmetry condition could be considered a desirable property, it alone does not guarantee the attainment of accurate solutions. For example, an algorithm that assumes conditional independence between rows and columns given the unit (equivalent to the nonlinear neighbourhood model; Freedman et al., 1998) would increase the average error more than fourfold in the datasets analysed in this paper, despite satisfying the symmetry condition.
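
The symmetry condition can be made concrete with a small numerical sketch. The table of counts is hypothetical and the helper `is_symmetric` is a name introduced here for illustration: two estimated transfer matrices, one per direction, are compatible with Bayes' theorem when both imply the same joint distribution.

```python
import numpy as np

# Hypothetical 3 x 2 table of joint vote counts (rows: options in the first
# election, columns: options in the second). Illustrative numbers only.
votes = np.array([[120.0, 30.0],
                  [40.0, 160.0],
                  [25.0, 25.0]])

total = votes.sum()
p_row = votes.sum(axis=1) / total                           # p(j)
p_col = votes.sum(axis=0) / total                           # p(k)
row_to_col = votes / votes.sum(axis=1, keepdims=True)       # p(k | j), rows sum to 1
col_to_row = (votes / votes.sum(axis=0, keepdims=True)).T   # p(j | k), rows sum to 1

def is_symmetric(p_k_given_j, p_j_given_k, p_j, p_k, tol=1e-8):
    """Check p(k | j) p(j) == p(j | k) p(k) for every cell (Bayes' theorem)."""
    joint_from_rows = p_k_given_j * p_j[:, None]
    joint_from_cols = (p_j_given_k * p_k[:, None]).T
    return bool(np.allclose(joint_from_rows, joint_from_cols, atol=tol))

# Both matrices come from the same joint table, so the condition holds.
consistent = is_symmetric(row_to_col, col_to_row, p_row, p_col)

# An arbitrary row-conditional estimate generally breaks the condition.
arbitrary = np.full((3, 2), 0.5)
inconsistent = is_symmetric(arbitrary, col_to_row, p_row, p_col)
```

An asymmetric method run in each direction plays the role of two separately estimated matrices, which in general fail this check; a symmetric method returns a single joint table, from which both conditionals follow.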

We evaluate the accuracy of the proposed methods using real data corresponding to 565 simultaneous elections where the true district-level cross-classifications of votes are known. Our empirical assessment indubitably points to the proposed symmetric solutions as being more accurate than the equivalent asymmetric approaches, with the ratio between available information and complexity/difficulty of the problem as the characteristic having the most impact on accuracy. Overall, the optimal solutions reached with the joint specifications, those which solve linear programmes that simultaneously include all the constraints, are not the most accurate on average. They are outperformed by the methods that build their solutions as an average of asymmetric solutions. We conclude that, for the datasets at hand, the nslphom_dual_w solutions are, on average, the most accurate. This method grounds its better accuracy on its greater robustness since it is the one generating the smallest error fewer times among the nslphom symmetric models. Indeed, there is no symmetric algorithm that systematically beats the rest of the symmetric methods, so a question to be answered is whether the circumstances themselves could indicate which symmetric method to choose. In other words, can we find specific configurations derivable from the observed data that call for the recommendation of a specific method for a particular dataset?
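
The idea of building a symmetric solution as an average of two asymmetric ones can be sketched as follows. The two arrays are hypothetical stand-ins for the vote matrices returned by an asymmetric method run in each direction, and the plain unweighted mean is an assumption made for illustration; the weighting scheme behind the `_dual_w` variants is not reproduced here.

```python
import numpy as np

def average_dual(est_xy, est_yx):
    """Average a J x K estimate with the transpose of the K x J estimate
    obtained after swapping the roles of rows and columns."""
    est_xy = np.asarray(est_xy, dtype=float)
    est_yx = np.asarray(est_yx, dtype=float)
    if est_xy.shape != est_yx.T.shape:
        raise ValueError("est_yx must have the transposed shape of est_xy")
    return 0.5 * (est_xy + est_yx.T)

# Hypothetical estimated vote counts; both respect the same margins.
est_xy = np.array([[110.0, 40.0], [50.0, 150.0], [30.0, 20.0]])   # rows -> columns
est_yx = np.array([[130.0, 30.0, 30.0], [20.0, 170.0, 20.0]])     # columns -> rows

combined = average_dual(est_xy, est_yx)
swapped = average_dual(est_yx, est_xy)   # same data with the roles exchanged
```

Two properties are visible in this toy case: the averaged table keeps the row and column margins shared by both inputs, and exchanging the roles of the two variables only transposes the result, which is the sense in which the averaged solution does not depend on the mapping to rows and columns.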

Given that mathematical programming offers the natural methodological setting for simultaneously handling all the constraints linked to a symmetric treatment of the problem, we have considered only algorithms within the linear programming framework where we have built the symmetric solutions from asymmetric methods. This has the advantage of maintaining all the proposals within the same framework, making comparisons fairer and producing methods easy to use and robust to claims of hacking. The user only needs to input a set of ecological data and a maximum number of iterations, and the procedures automatically return a sensible solution. Nevertheless, it would be worth extending the analysis to other methodological settings, i.e. to study whether similar results would be obtained if symmetric methods were built from scratch or using asymmetric algorithms from other frameworks. For example, the following question could be addressed: can inferences attained as an average of the two dual solutions of the Rosen et al. (2001) R × C model systematically improve the asymmetric solution achieved by choosing the most logical approach (of the two)? Indeed, in our view, the new angle for tackling the ecological inference problem offered by this research should also be systematically explored from other conceptual frameworks. This, however, demands new theoretical developments. More specifically, just as ecological inference linear programming requires new developments to introduce information from polls into its models (Pavía, 2023), an effort is also required to develop genuine symmetric models from other frameworks, such as the Bayesian and frequentist ecological inference ones. In this case, we consider that models based on conditional multivariate hypergeometric distributions or full-table multinomial distributions could prove successful.

Our empirical results are based on data from simultaneous elections so it would also be worth exploring whether our conclusions can be confirmed in other contexts, such as voting rights litigation or literacy studies. Despite electoral datasets with true answers in other target application areas being scarce and typically not available, it would be interesting to replicate the study of Barreto et al. (2022) on racially polarized voting and analyse the impact of using symmetric approaches on substantive conclusions and on the accuracy in estimating the inner-cells values of their simulated datasets. Likewise, the symmetric approach could also be tested estimating the tables of the datasets employed by Jiang et al. (2020). These datasets include data on US mortality rates by gender and race, or literacy rates and educational attendance by gender in India.

Finally, from a more methodological perspective, it would be interesting to study what impact initializing the iterative process with a different matrix of votes would have on the accuracy of the nslphom_joint optimal solutions; for instance, what the impact would be of initializing with the nslphom_dual_w solution instead of with the lphom_joint solution. Furthermore, given the lack of theoretical results about the (asymptotic) distributions of the estimators, we consider that the bootstrap approach (e.g. Efron & Tibshirani, 1994) could be used to measure the precision of estimates. Although lphom-based specifications can theoretically be viewed as a linear absolute fitting problem after applying Theorem 1 of chapter 6 in Bloomfield and Steiger (1983, pp. 164–165) and, therefore, their estimators can ‘usually’ be considered as having ‘a limiting normal distribution’ (Bloomfield & Steiger, 1983, p. 44), in our view, this does not apply as a rule in the ecological inference problem and definitely cannot be used in the tslphom- and nslphom-based specifications. On the one hand, it is difficult to accept that the assumptions required by the different asymptotic theorems hold in the ecological inference problem, due to, among other issues, the discrete character of the margins and the cross structures of relationships that the row and column aggregations impose. On the other hand, and more importantly, the tslphom- and nslphom-specifications do not fulfil the hypothesis of the abovementioned Theorem 1, as it requires that the number of summands (auxiliary variables) in the objective function be greater than the number of equality constraints. The unit-table algorithms, lphom_local and lphom_local_joint, which are at the core of the tslphom- and nslphom-family models, solve linear programmes where the number of auxiliary variables is smaller than the number of equality constraints.
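
A unit-level bootstrap of the kind suggested above could look like the following sketch. Everything here is an assumption for illustration: the data are synthetic, resampling whole units with replacement is only one possible scheme, and `placeholder_estimator` is a deliberately simple stand-in (independence within each unit), not an lphom-family algorithm. Any function mapping the two margin matrices to a J × K matrix could be passed in its place.

```python
import numpy as np

def placeholder_estimator(x, y):
    """Stand-in estimator (NOT lphom): assumes independence of rows and columns
    within each unit, aggregates, and row-normalises to transfer proportions."""
    n = x.sum(axis=1, keepdims=True)                      # voters per unit
    counts = (x[:, :, None] * y[:, None, :] / n[:, :, None]).sum(axis=0)
    return counts / counts.sum(axis=1, keepdims=True)

def bootstrap_se(x, y, estimator, n_boot=200, seed=0):
    """Resample units with replacement and return the cell-wise standard
    deviation of the re-estimated matrices."""
    rng = np.random.default_rng(seed)
    n_units = x.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_units, size=n_units)      # bootstrap sample of units
        draws.append(estimator(x[idx], y[idx]))
    return np.std(np.stack(draws), axis=0, ddof=1)

# Synthetic margins for 60 units, 3 row options and 2 column options.
rng = np.random.default_rng(1)
x = rng.multinomial(500, [0.5, 0.3, 0.2], size=60).astype(float)
y = rng.multinomial(500, [0.55, 0.45], size=60).astype(float)

estimate = placeholder_estimator(x, y)
se = bootstrap_se(x, y, placeholder_estimator)
```

The returned array has one bootstrap standard error per cell of the transfer matrix; whether unit resampling is appropriate for a given dataset depends on how plausible it is to treat units as exchangeable.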

The above issues do not imply that ecological inference linear programming approaches lack a statistical interpretation or that the estimated coefficients necessarily exhibit undesirable statistical properties. On the one hand, in addition to the established links between linear programming solutions and minimization of expected discrepancies, as previously mentioned when discussing the proposed models, it is important to note that tslphom- and nslphom-based approaches, like many other ecological inference models, rely on the underlying assumption of independence across units of the two-way distributions of votes, given their aggregate two-way joint distribution. In other words, the set of unit two-way fraction distributions can be viewed as a simple random sample from a common probability distribution. This entails that, without covariates, estimates can only be reliable when models are applied in homogeneous political regions. On the other hand, we believe that, similar to how sampling properties of quadratic programming solutions can be obtained through the links between quadratic programming and inequality-constrained normal regression (Geweke, 1986; Judge & Takayama, 1966), the connection between linear programming and inequality-constrained linear absolute fitting with Laplace-distributed errors could be explored to derive statistical properties of basic ecological inference linear programming solutions. In addition, we also find it remarkable that the iterative algorithm that characterizes nslphom-based methods can be understood in terms of an expectation-maximization algorithm (Dempster et al., 1977). In this comparison, equations (6)–(13) would correspond to the expectation step, equation (14) to the maximization step, and equations (1)–(5) are required for generating an initial estimate of the expected transition probabilities. This connection becomes more evident when examining the variants introduced in the lclphom algorithm (Pavía, 2024).

Acknowledgements

The authors wish to thank the editors and four anonymous reviewers for their valuable comments and suggestions and M. Hodkinson for revising the English of the paper.

Funding

This research has been supported by Conselleria de Educación, Universidades y Empleo, Generalitat Valenciana [grant number AICO/2021/257] and by Ministerio de Economía e Innovación [grant number PID2021-128228NB-I00].

Data availability

The data used in this research is publicly available on the R package ei.Datasets (version 0.0.1-1) accessible on CRAN. The reproducible ad hoc R-code employed, based on functions included in the R package lphom (version 0.3.0-7), is available in the attached online supplementary material.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series A.

References

Andreadis, I., & Chadjipadelis, T. (2009). A method for the estimation of voter transition rates. Journal of Elections, Public Opinion and Parties, 19(2), 203–218. https://doi.org/10.1080/17457280902799089

Barreto, M., Collingwood, L., Garcia-Rios, S., & Oskooii, K. A. R. (2022). Estimating candidate support in voting rights act cases: Comparing iterative EI and EI-R×C methods. Sociological Methods & Research, 51(1), 271–304. https://doi.org/10.1177/0049124119852394

Bernardini-Papalia, R., & Fernández-Vázquez, E. (2020). Entropy-based solutions for ecological inference problems: A composite estimator. Entropy, 22(7), 781.

Bloomfield, P., & Steiger, W. L. (1983). Least absolute deviations: Theory, applications and algorithms. Birkhäuser-Springer.

Brown, P. J., & Payne, C. D. (1986). Aggregate data, ecological regression and voting transitions. Journal of the American Statistical Association, 81(394), 452–460. https://doi.org/10.1080/01621459.1986.10478290

Collingwood, L., Decter-Frain, A., Murayama, H., Sachdeva, P., & Burke, J. (2020). eiCompare: Compares ecological inference, Goodman, rows by columns estimates. R package version 3.0.0. https://CRAN.R-project.org/package=eiCompare

Collingwood, L., Oskooii, K., Garcia-Rios, S., & Barreto, M. (2016). eiCompare: Comparing ecological inference estimates across EI and EI:R×C. The R Journal, 8(2), 92–101.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Chapman and Hall/CRC.

Ferree, K. E. (2004). Iterative approaches to RxC ecological inference problems: Where they can go wrong and one quick fix. Political Analysis, 12(2), 143–159.

Forcina, A., & Pellegrino, D. (2019). Estimation of voter transitions and the ecological fallacy. Quality & Quantity, 53(4), 1859–1874. https://doi.org/10.1007/s11135-019-00845-1

Freedman, D. A., Klein, S. P., Ostland, M., & Roberts, M. R. (1998). A solution to the ecological inference problem (book review). Journal of the American Statistical Association, 93(444), 1518–1522.

Freedman, D. A., Ostland, M., Roberts, M. R., & Klein, S. P. (1999). Reply to G. King. Journal of the American Statistical Association, 94(445), 355–357.

Gelman, A., Park, D. K., Ansolabehere, S., Price, L. C., & Minnite, L. C. (2001). Models, assumptions and model checking in ecological regression. Journal of the Royal Statistical Society, Series A, 164(1), 101–118. https://doi.org/10.1111/1467-985X.00190

Geweke, J. (1986). Exact inference in the inequality constrained normal linear regression model. Journal of Applied Econometrics, 1(2), 127–141. https://doi.org/10.1002/jae.3950010203

Glynn, A., & Wakefield, J. (2010). Ecological inference in the social sciences. Statistical Methodology, 7(3), 307–322. https://doi.org/10.1016/j.stamet.2009.09.003

Goodman, L. A. (1953). Ecological regressions and the behavior of individuals. American Sociological Review, 18(6), 663–664.

Goodman, L. A. (1959). Some alternatives to ecological correlation. American Journal of Sociology, 64(6), 610–625.

Greiner, D. J. (2007). Ecological inference in voting rights act disputes: Where are we now, and where do we want to be? Jurimetrics, 47(2), 115–167. https://www.jstor.org/stable/29762964

Greiner, D. J., Baines, P., & Quinn, K. M. (2021). RxCEcolInf: R x C ecological inference with optional incorporation of survey information. R package version 0.1-5. https://CRAN.R-project.org/package=RxCEcolInf

Greiner, D. J., & Quinn, K. M. (2009). R×C ecological inference: Bounds, correlations, flexibility, and transparency of assumptions. Journal of the Royal Statistical Society, Series A, 172(1), 67–81. https://doi.org/10.1111/j.1467-985X.2008.00551.x. https://www.jstor.org/stable/20001867

Greiner, D. J., & Quinn, K. M. (2010). Exit polling and racial bloc voting: Combining individual level and RxC ecological data. The Annals of Applied Statistics, 4(4), 1774–1796.

Hawkes, A. (1969). An approach to the analysis of electoral swing. Journal of the Royal Statistical Society, Series A, 132(1), 68–79.

Jiang, W., King, G., Schmaltz, A., & Tanner, M. A. (2020). Ecological regression with partial identification. Political Analysis, 28(1), 65–86.

Johnston, R. J., Hay, A. M., & Rumley, D. (1983). Entropy-maximizing method for estimating voting data: A critical test. Area, 15(1), 35–41.

Johnston, R. J., & Pattie, C. (2003). Evaluating an entropy-maximizing solution to the ecological inference problem: Split-ticket voting in New Zealand, 1999. Geographical Analysis, 35(1), 1–23. https://doi.org/10.1111/j.1538-4632.2003.tb01098.x

Judge, G. G., Miller, D. J., & Tam Cho, W. K. (2004). An information theoretic approach to ecological estimation and inference. In G. King, O. Rosen, & M. A. Tanner (Eds.), Ecological inference. New methodological strategies (pp. 162–187). Cambridge University Press.

Judge, G. G., & Takayama, T. (1966). Inequality restrictions in regression analysis. Journal of the American Statistical Association, 61(313), 166–181. https://doi.org/10.1080/01621459.1966.10502016

Katz, J. N. (2014). Expert report on voting in the city of Whittier. Superior Court of the State of California.

Kellermann, T. (2011). Vom Wahlergebnis zur Wählerwanderung: Welche Wähler wechselten wie ihre Entscheidung. Stadtforschung und Statistik, 2011(1), 34–40.

King, G. (1997). A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data. Princeton University Press.

King, G., Rosen, O., & Tanner, M. A. (Eds.) (2004). Ecological inference. New methodological strategies. Cambridge University Press.

Klein, J. M. (2019). Estimation of voter transitions in multi-party systems. Quality of credible intervals in (hybrid) multinomial-Dirichlet models [Master thesis dissertation]. Ludwig-Maximilians-Universität München.

Klima, A., Schlesinger, T., Thurner, P. W., & Küchenhoff, H. (2019). Combining aggregate data and exit polls for the estimation of voter transitions. Sociological Methods & Research, 48(2), 296–325. https://doi.org/10.1177/0049124117701477

Klima, A., Thurner, P. W., Molnar, C., Schlesinger, T., & Küchenhoff, H. (2016). Estimation of voter transitions based on ecological inference: An empirical assessment of different approaches. AStA-Advances in Statistical Analysis, 100(2), 133–159. https://doi.org/10.1007/s10182-015-0254-8

Lau, O., Moore, O. R. T., & Kellermann, M. (2020). eiPack: Ecological inference and higher-dimension data management. R package version 0.2-1. https://CRAN.R-project.org/package=eiPack

Manski, C. F. (2007). Identification for prediction and decision. Harvard University Press.

O’Loughlin, J. (2000). Can King’s ecological inference method answer a social scientific puzzle: Who voted for the Nazi party in Weimar Germany? Annals of the Association of American Geographers, 90(3), 592–601. https://doi.org/10.1111/0004-5608.00213

Park, W.-H., Hanmer, M. J., & Biggers, D. R. (2014). Ecological inference under unfavorable conditions: Straight and split-ticket voting in diverse settings and small samples. Electoral Studies, 36, 192–203. https://doi.org/10.1016/j.electstud.2014.08.006

Pavía, J. M. (2022). ei.Datasets: Real datasets for assessing ecological inference algorithms. Social Science Computer Review, 40(1), 247–260. https://doi.org/10.1177/08944393211040808

Pavía, J. M. (2023). Adjustment of initial estimates of voter transition probabilities to guarantee consistency and completeness. SN Social Sciences, 3(5), 75. https://doi.org/10.1007/s43545-023-00658-y

Pavía, J. M. (2024). A local convergent ecological inference algorithm for RxC tables.

Pavía, J. M., & Cantarino, I. (2017). Dasymetric distribution of votes in a dense city. Applied Geography, 86, 22–31. https://doi.org/10.1016/j.apgeog.2017.06.021

Pavía, J. M., & Romero, R. (2022). Improving estimates accuracy of voter transitions. Two new algorithms for ecological inference based on linear programming. Sociological Methods & Research. https://doi.org/10.1177/00491241221092725

Pavía, J. M., & Romero, R. (2023). Data wrangling, computational burden, automation, robustness and accuracy in ecological inference forecasting of RxC tables. SORT—Statistics and Operations Research Transactions, 47(1), 151–186. https://doi.org/10.57645/20.8080.02.4

Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2015). Modeling and analysis of compositional data. John Wiley & Sons, Ltd.

Plescia, C., & De Sio, L. (2018). An evaluation of the performance and suitability of RxC methods for ecological inference with known true values. Quality & Quantity, 52(2), 669–683. https://doi.org/10.1007/s11135-017-0481-z

Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3), 351–357.

Romero, R., & Pavía, J. M. (2021). Estimating vote party entries and exits by ecological inference. Mathematical programming versus Bayesian statistics. Boletín de Estadística e Investigación Operativa, 37(2), 85–97.

Romero, R., Pavía, J. M., Martín, J., & Romero, G. (2020). Assessing uncertainty of voter transitions estimated from aggregated data. Application to the 2017 French presidential election. Journal of Applied Statistics, 47(13–15), 2711–2736. https://doi.org/10.1080/02664763.2020.1804842

Rosen, O., Jiang, W., King, G., & Tanner, M. A. (2001). Bayesian and frequentist inference for ecological inference: The RxC case. Statistica Neerlandica, 55(2), 134–156. https://doi.org/10.1111/1467-9574.00162

Schakel, A. H., & Romanova, V. (2020). Vertical linkages between regional and national electoral arenas and their impact on multilevel democracy. Regional and Federal Studies, 30(3), 323–342. https://doi.org/10.1080/13597566.2020.1774750

Tam Cho, W. K. (1998). Iff the assumption fits…: A comment on the King ecological inference solution. Political Analysis, 7, 143–163.

Tam Cho, W. K., & Gaines, B. J. (2004). The limits of ecological inference: The case of split-ticket voting. American Journal of Political Science, 48(1), 152–171. https://doi.org/10.1111/j.0092-5853.2004.00062.x

Thomsen, S. R. (1987). Danish elections, 1920–79: A logit approach to ecological analysis and inference. Politica.

Tziafetas, G. (1986). Estimation of the voter transition matrix. Optimization, 17(2), 275–279. https://doi.org/10.1080/02331938608843128

Wakefield, J. (2004). Ecological inference for 2×2 tables (with discussion). Journal of the Royal Statistical Society, Series A, 167(3), 385–445. https://doi.org/10.1111/j.1467-985x.2004.02046.x