Christian Baden, Giovanni Motta, Evolutionary correspondence analysis of the semantic dynamics of frames, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 187, Issue 4, October 2024, Pages 1065–1095, https://doi.org/10.1093/jrsssa/qnae022
Abstract
We introduce and implement a novel dimension-reduction method for high-dimensional, time-varying contingency tables: Evolutionary Correspondence Analysis (ECA). ECA enables a comparative analysis of high-dimensional, diachronic processes by identifying a small number of shared latent variables that shape co-evolving data patterns. ECA offers new opportunities for the study of complex social phenomena, such as co-evolving public debates: its capacity to inductively extract time-varying latent variables from the observed contents of evolving debates permits an analysis of meanings shared by linked sub-discourses, such as linked national public spheres or the discourses led by distinct political camps within a shared public sphere. We illustrate the utility of our approach by studying how the Greek and German right-, centre-, and left-leaning news coverage of the European financial crisis evolved from its outbreak in 2009 until its institutional containment in 2012. Comparing the use of 525 unique concepts in six German and Greek outlets with different political leanings over an extended period of time, we identify two common factors accounting for those evolving meanings and analyse how the different sub-discourses influenced one another over time. We allow the factor loadings to be time-varying, and fit to the latent factors a time-varying vector-auto-regressive model with time-varying mean.
1 Introduction
The growing availability of ‘big data’ capturing complex social behaviour has raised numerous challenges for statistical analysis. In social science text analysis, one key application of statistical modelling is the study of public debates. Generally, public debates are characterized by a rapid succession of events and topics, embedded within a slowly evolving process of cultural meaning-making. At any time, salient issues in the news are interpreted against a finite set of widely recognized frames (Motta & Baden, 2013), which might (for instance) cast rising inflation as an economic threat to personal livelihoods, a challenge for monetary policy making, or an indicator of macro-economic developments. Frames are constantly negotiated and evolve in a path-dependent manner, responding to both external events and to concurrent debates on related issues, which are interconnected across national, cultural, and linguistic borders (Trenz, 2004; Wessler et al., 2016).
In recent years, a growing awareness of such transnational inter-dependencies in public debates has resulted in a comparative turn in social science text analysis (Baden et al., 2022). Relying on automated Natural Language Processing technologies to extract high-dimensional co-occurrence and covariance data from recorded debates (e.g. Burscher et al., 2016), however, most comparative strategies restrict available analyses to cross-sectional data, requiring researchers to either disregard important over-time variation or to construct artificial phase-wise comparative designs. As a consequence, such approaches face important limitations whenever comparable contents do not appear synchronously across juxtaposed debates, and are unable to identify diachronic influences (e.g. agenda building, frame building; Sheafer & Gabay, 2009) and inter-dependencies (e.g. synchronization, consensus formation, polarization; Baden & Tenenboim-Weinblatt, 2017; Yardi & Boyd, 2010). Inversely, most longitudinal strategies are unfit for comparative analysis and remain restricted to low-dimensional data, such as the time-varying salience of few pre-defined topics, frames, or sentiments (e.g. Vliegenthart & Walgrave, 2008). One strategy that has recently gained popularity is Structural Topic Modelling (Jacobi et al., 2016; Roberts et al., 2014), which permits a treatment of time as a covariate in the extraction of regular patterns from high-dimensional evolutionary data. However, this strategy obscures the long-term evolutionary, path-dependent nature of dynamic framing processes, while its dependency on language limits its utility for cross-national comparison (Chan et al., 2020). Where researchers aim to study the time-varying qualities and configurations of meanings expressed and exchanged across linked debates, there is a need for new tools that enable a comparative analysis of time-changing, internally complex processes.
Most available statistical models run into difficulties when deployed to study dynamic processes that are both high-dimensional (big N) and non-stationary (big T). When N is large compared to T, coefficient estimates suffer from high variance. When N is larger than T, standard methods (such as ordinary least squares) become unavailable. Supervised answers to these problems are penalized regression methods such as the lasso (Beltran et al., 2021), variable selection, and partial least squares (Fatema et al., 2022). When N is large and no response variable is available, it is desirable to reduce the cross-sectional dimension.
When, in addition to N, T is also large, parameters are unlikely to remain constant over time (Motta et al., 2011). In contrast to the classical stationary approach, where parameters are estimated globally, non-stationarity requires estimating the parameters locally in time (Burscher et al., 2016; Rosin & Radinsky, 2019). At the same time, without an explicit modelling of time-dependent dynamics, such local estimation risks misrepresenting evolving frames in the debate as discrete patterns in time, separated by the constraints of the chosen estimation method rather than by true discontinuities (Baumgartner et al., 2008).
In this paper, we introduce Evolutionary Correspondence Analysis (ECA), a model-based unsupervised method for high-dimensional time-varying contingency tables. In ECA, both T (time-series size) and N (cross-section size) are permitted to be large, while the innate contingency table structure supports a structured comparison of time-changing similarities and differences, inter-dependencies and persistent idiosyncrasies between co-evolving high-dimensional diachronic processes. Statistically, this translates into a double asymptotic framework where both T and N diverge to infinity, with N growing at least as fast as T, so as to meet the characteristic challenge presented by ‘big data’. By proposing an inductive, unsupervised method, we specifically address the common situation that the nature of key patterns in the data (such as frames) is neither known ex ante nor deductively definable, owing both to the scale and evolutionary nature of the data.
Conceptually, ECA draws upon the basic idea behind Evolutionary Factor Analysis (EFA), a method for the identification of time-varying latent structures in high-dimensional diachronic data (Motta & Baden, 2013). However, it extends the basic EFA framework in two critical ways.
ECA permits researchers to jointly model the distinct and common latent structures shaping multiple co-evolving high-dimensional processes (such as public debates). It thereby permits a structured comparative analysis of the ways in which different discourses debate the same issues, identifying similarities and differences on two levels of abstraction: (i) the configuration of specific latent patterns (e.g. similar but not fully identical, evolving frames) within each debate; and (ii) the relative contribution of such patterns to structuring each debate over time.
ECA extends existing approaches to time-series analysis to enable the study of both synchronous and (cross-)lagged, directed, or mutual influences between latent processes extracted from co-evolving high-dimensional processes. In this way, ECA allows identifying inter-dependencies between linked debates both in the form of semantic inductions (e.g. the spill-over of specific associations and frames from one debate to another) and dialogical interactions (e.g. the formation or rise of specific frames in one debate, in response to different frames emerging in another debate).
In analogy to EFA, ECA thus extends conventional Correspondence Analysis (CA) to account for the fact that many real-world processes undergo meaningful over-time transformation. Hence, where CA assumes that any associations captured in a contingency table remain valid over the entire period under observation, ECA permits a dynamic, inductive analysis that reveals both areas and times of relative stability and change within the high-dimensional data.
In the following, we introduce ECA and illustrate its contribution to the analytic toolbox using data from the Greek and German news coverage about the European financial crisis. We start by presenting the data used for this demonstration (Section 2). We then review the basic idea of EFA (Section 3), and develop our novel ECA (Section 4) as a time-varying version of CA. We then provide a factor-model representation of the (rescaled) matrix of counts (Section 5.1), and illustrate the identification properties of our model (Section 5.2).
In Section 6, we detail four important benefits of interpretation that CA enjoys as compared to Principal Components Analysis (PCA) from a statistical point of view, and explain how ECA offers important new avenues for the comparative study of high-dimensional diachronic processes. Finally, we discuss key implications for social scientific research and methodology (Section 7).
The online supplementary material contains three appendices. In Appendix A, we provide simulation results that illustrate the performance of our estimates. In Appendix B, we derive a novel transition formula between the eigenvectors of the covariance matrix and the eigenvectors of the product between row-profiles and column-profiles. In Appendix C, we derive a new asymptotic expression for the mean squared error of the estimated smooth loadings, which can be used to select the smoothing parameter.
2 The data at hand
For the present application, we study how the German and Greek public debate interpreted the 2009–2012 financial crisis. Both countries were pivotal players in the development and resolution of the crisis, directly affecting one another’s policy decisions and paying close attention to evolving public debates in both countries. In this context, studying the evolution and alignment of frames between both debates offers valuable insights into the varying degrees of public resonance of proposed crisis policies and may reveal valuable windows of opportunity for joint problem definition and policy responses (Schön, 1994). We accessed all news published between 1 October 2009 (four days after the elections in Germany, three days before the early elections in Greece) and 30 June 2012 (two weeks after the June elections in Greece, the second elections called in that year) in three Greek and three German major news broadsheets (Kleinnijenhuis et al., 2015). In both countries, we selected leading news outlets to represent the preferred framing strategies in politically left-leaning, centrist, and right-leaning sub-discourses: Within Greece, we collected all news coverage pertaining to the European financial crisis in the left-leaning Eleftherotypia (ET), the centrist Ta Nea (TN), and the right-leaning Kathimerini (KA). Within Germany, we did the same for the left-leaning Frankfurter Rundschau (FR) and Tageszeitung (TZ), the centrist Süddeutsche Zeitung (SZ), and the right-leaning Die Welt (including its Sunday edition Welt am Sonntag; DW). In total, 43,589 relevant articles were included in the analysis.
Within this coverage, we used a large dictionary to automatically recognize the presence of unique semantic concepts, using a coding routine implemented in the AmCAT autocoding software environment (van Atteveldt, 2008). Coded concepts included key actors (e.g. politicians, banks, international organizations) and issues (e.g. debt, unemployment, specific policies), as well as a wide variety of actions, qualities, and evaluations relevant to the financial crisis. This extraction of concepts prior to estimation serves to focus the analysis on meaningful patterns in the text, thus avoiding that the factors to be estimated are dominated by variations in the use of common but uninformative words (e.g. Nicholls & Culpepper, 2021). Concept frequencies were aggregated per sub-discourse and week, treating the two German left-leaning outlets as one to counterbalance their lesser volume of relevant coverage. In this way, we obtained a combined, three-dimensional array (concepts × sub-discourses × weeks), representing the presence of the N concepts in the contents of the P sub-discourses over the T weeks covered by the analysis. At time t, each row of the combined matrix refers to one high-dimensional, diachronic process, representing the path-dependent representation of the financial crisis in the left-leaning, centrist, or right-leaning sub-discourse within the Greek or the German public debate. That is, the combined matrix juxtaposes a Greek and a German block,
each comprising the three corresponding sub-discourses. For our subsequent analysis, we rely on a subset of concepts for which we registered sufficient variance in their presence across all six sub-discourses. The time-varying matrix of counts has as its general entry the frequency of occurrences of concept n in sub-discourse p at week t, with $n = 1, \ldots, N$, $p = 1, \ldots, P$, and $t = 1, \ldots, T$. In particular, for our application we have $P = 6$ sub-discourses (three for Greece and three for Germany), $N = 525$ concepts selected among the original 975 available ones, and T weeks covered by our analysis. Illustrating the analytic contribution of ECA, we study the co-evolution and inter-dependencies between (a) the respective sub-discourses both within Greece and within Germany, and (b) the aggregated media discourses in the Greek and German news as a whole. In particular, we use ECA to reduce the dimensionality of the data and extract both shared and idiosyncratic latent factors accounting for the bulk of variance in the use of each concept, within each sub-discourse, at each point in time. In place of 525 variables expressing the highly correlated use of many manifest concepts, we thus obtain a small number of latent variables expressing time-varying meaning. Each latent variable represents one set of concepts whose occurrence temporarily covaries within the observed data, which can be interpreted as a potential frame or framing choice (e.g. when a factor distinguishes between two opposing ways of framing an issue).1 The main factors jointly express those major variations in framing that structure each sub-discourse at a specific point in time. For each factor, ECA then detects the time-varying structure of relationships between these variables, which can be used to directly compare the respective frames and debates, or to study the simultaneous and cross-lagged influences between debates over time.
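To make the data structure concrete, the following minimal Python sketch assembles such a weekly concept-by-sub-discourse array from a tidy table of coded mentions. The record layout, column names, and toy values are illustrative assumptions, not the actual AmCAT output format.

```python
# Illustrative sketch (not the actual AmCAT pipeline): building the N x P x T
# array of weekly concept frequencies from a tidy table of coded mentions.
import numpy as np
import pandas as pd

# Hypothetical record layout: one row per (week, sub-discourse, concept) count.
records = pd.DataFrame({
    "week":    [0, 0, 1, 1, 2],
    "outlet":  ["DE_centre", "GR_left", "DE_centre", "GR_right", "GR_left"],
    "concept": ["debt", "austerity", "debt", "troika", "debt"],
    "count":   [12, 7, 9, 3, 5],
})

concepts = sorted(records["concept"].unique())   # the N retained concepts
outlets  = sorted(records["outlet"].unique())    # the P sub-discourses
T = int(records["week"].max()) + 1               # number of weeks

# X[n, p, t] = frequency of concept n in sub-discourse p at week t
X = np.zeros((len(concepts), len(outlets), T))
for _, row in records.iterrows():
    n = concepts.index(row["concept"])
    p = outlets.index(row["outlet"])
    X[n, p, int(row["week"])] += row["count"]
```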
Throughout the paper, we use bold upright letters for matrices, bold italic letters for vectors, and normal (non-bold) letters for scalars. We denote by $\mathbf{A}^{\top}$ the transpose of the matrix $\mathbf{A}$, by $\lVert \mathbf{A} \rVert$ the Frobenius (Euclidean) norm of $\mathbf{A}$, by $\mathbf{0}$ the null matrix, by $\mathbf{I}_m$ the identity matrix of order m, by $\mathbf{1}_m$ the vector of ones, and by $\mathbb{1}\{\cdot\}$ the indicator function. Raw (or unsmoothed) estimators are denoted by hats, whereas smoothed estimators are denoted by tildes.
3 The logic behind the evolutionary factor analysis
Linear factor models have been widely applied inside and outside communication research to model high-dimensional data whose structures can be explained by means of a few common latent factors (e.g. Doise et al., 1993; Landauer & Dumais, 1997; Leydesdorff & Vaughan, 2006; Semetko & Valkenburg, 2000). Consider the following model for the observations $\boldsymbol{X}_t$:
$$\boldsymbol{X}_t = \boldsymbol{\chi}_t + \boldsymbol{\varepsilon}_t,$$
where $\boldsymbol{\chi}_t$ is the vector of common components and $\boldsymbol{\varepsilon}_t$ is the vector of idiosyncratic components. Neither $\boldsymbol{\chi}_t$ nor $\boldsymbol{\varepsilon}_t$ is observed. Linear factor analysis is based on the idea that the common components are linear functions of a few common latent low-dimensional factors $\boldsymbol{F}_t$ through the so-called loadings $\boldsymbol{\Lambda}$,
$$\boldsymbol{\chi}_t = \boldsymbol{\Lambda}\,\boldsymbol{F}_t,$$
whereas the idiosyncratic errors explain information that is specific to each particular series, and thus uncorrelated with the latent factors. Factor analysis allows for dimension reduction since $\boldsymbol{\Lambda}$ is $N \times R$ and $\boldsymbol{F}_t$ is $R \times 1$, with $R$ much smaller than $N$.
The EFA was introduced by Motta (2009), and its asymptotic properties have been studied by Motta et al. (2011). So far EFA has been employed in Econometrics (Eichler et al., 2011), Biostatistics (Motta & Ombao, 2012) and Communication Science (Motta & Baden, 2013). Motta (2017) summarizes the main opportunities for comparative analysis in social science and distinguishes between different versions of EFA for public debates. EFA derives from three simple propositions regarding the dynamic structure of high-dimensional data:
In high-dimensional, time-ordered data, a large number of observations is structured by latent processes, which can be detected based on the patterns of systematic co-variation over time (Baumgartner et al., 2008; Hellsten et al., 2010);
These latent processes change over time, in an evolutionary, i.e. gradual, path-dependent manner relative to their own past (Kleinnijenhuis & Fan, 1999; Leydesdorff, 2011);
Relatively few latent processes suffice to explain the bulk of informative co-variation in high-dimensional, dynamic data (Motta & Baden, 2013).
As a consequence of these propositions, the evolutionary, latent structure of high-dimensional data can be modelled. As a starting point, we need to modify basic linear factor models to take into account the diachronic variability of the data and latent processes. Evolutionary Factor Analysis extends the modelling and estimation methodology of linear factor analysis to the case where the parameters are time-varying. In Motta and Baden (2013), we permit time-varying loadings $\boldsymbol{\Lambda}_t$ on otherwise stationary and static factors $\boldsymbol{F}_t$:
$$\boldsymbol{X}_t = \boldsymbol{\Lambda}_t\,\boldsymbol{F}_t + \boldsymbol{\varepsilon}_t, \qquad \boldsymbol{F}_t = \mathbf{D}\,\boldsymbol{Z}_t, \qquad (1)$$
where $\mathbf{D}$ is a diagonal matrix with time-invariant parameters and $\boldsymbol{Z}_t$ is a zero-mean, unit-variance process uncorrelated over time, that is,
$$\operatorname{E}(\boldsymbol{Z}_t) = \mathbf{0}_R, \qquad \operatorname{E}\!\left(\boldsymbol{Z}_t\,\boldsymbol{Z}_t^{\top}\right) = \mathbf{I}_R, \qquad (2)$$
where $\mathbf{0}_R$ and $\mathbf{I}_R$ are, respectively, the vector of zeros and the identity matrix of order R. The matrix $\mathbf{D}$ is assumed to be time-invariant and diagonal to reflect, respectively, the stationarity of the latent factors and their mutual orthogonality. As a consequence of (2), $\operatorname{E}(\boldsymbol{F}_t) = \mathbf{0}_R$ and $\operatorname{E}\!\left(\boldsymbol{F}_t\,\boldsymbol{F}_t^{\top}\right) = \mathbf{D}^2$ do not depend on time.
Since $\mathbf{D}$ is time-invariant, the factors are stationary in the sense that their means, their variances, and their auto-covariances do not change over time. Since $\boldsymbol{Z}_t$ is uncorrelated over time, the factors are static in the sense that they are uncorrelated over time.
In this paper, we generalize the approach introduced by Motta and Baden (2013) in three different directions. Firstly, we distinguish, within the same set of observations, between P different sub-discourses; that is, our observed matrix is $N \times P$ rather than $N \times 1$. This allows us to distinguish, within the same set of latent factors, between P different sub-discourses; hence, our vector of latent factors has size $RP$ rather than $R$. Secondly, we allow the latent factors to be auto-correlated. Finally, we permit the factors to be non-stationary. As explained below, these three generalizations can be understood by comparing the representations of the factors in (1) and (5).
In Motta and Baden (2013) we adopted a solution with static, stationary factors and smoothly time-varying loadings. As a consequence, in model (1) the dynamics is fully captured by the time-varying, high-dimensional loadings $\boldsymbol{\Lambda}_t$. In this paper, we still allow the loadings to be time-varying,
$$\mathbf{X}_t = \boldsymbol{\Lambda}_t\,\mathbf{F}_t + \boldsymbol{\varepsilon}_t, \qquad (3)$$
and, moreover, we allow the latent factors to have a time-varying auto-regressive structure:
$$\mathbf{F}_t - \boldsymbol{\mu}_t = \mathbf{A}_t\left(\mathbf{F}_{t-1} - \boldsymbol{\mu}_{t-1}\right) + \boldsymbol{\xi}_t, \qquad (4)$$
with $\operatorname{E}(\boldsymbol{\xi}_t) = \mathbf{0}$ and $\operatorname{E}\!\left(\boldsymbol{\xi}_t\,\boldsymbol{\xi}_t^{\top}\right) = \boldsymbol{\Sigma}_t$ for all $t$.
Model (4) aims at modelling multivariate time series characterized by non-stationarity. VARIMA models are popular tools that are widely employed for this purpose. One important advantage of VAR models is that they can be written in a standard form that admits a unique representation. This property is in sharp contrast to the VARMA (or VARIMA) case, where a standard form is not unique (see Lütkepohl, 2005, Chapter 12). Moreover, VAR models can be estimated by least squares, whereas parameter estimation for VARMA (or VARIMA) models requires iterative algorithms. The non-stationarity modelled by ARIMA (or VARIMA) processes is limited to the case where the (multivariate) time series is a random walk. For example, a VARIMA(0,1,0) is a VAR(1) with auto-regressive matrix equal to the identity matrix. The 'I' in ARIMA stands for 'integrated', and differencing the time series is often used to make the time series stationary by eliminating linear time-trends. The non-stationarity is modelled in a more adaptive and flexible way by a locally stationary VAR, for two reasons: it allows for (i) a non-diagonal and time-varying auto-regressive matrix rather than the diagonal and time-invariant identity matrix, and (ii) a non-linear time-trend.
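As a concrete illustration of the kind of non-stationarity such a model can capture, the short sketch below simulates a VAR(1) whose mean and (non-diagonal) coefficient matrix vary smoothly in rescaled time u = t/T. The particular curves and noise scale are arbitrary choices for demonstration, not the model fitted in this paper.

```python
# Sketch: simulating a locally stationary VAR(1) with a smoothly time-varying
# mean mu(u) and a non-diagonal coefficient matrix A(u), u = t/T (toy curves).
import numpy as np

rng = np.random.default_rng(0)
T, P = 300, 3

def mu(u):
    # smooth, non-linear time trend
    return np.array([np.sin(2 * np.pi * u), u ** 2, 0.5 * u])

def A(u):
    # smooth, non-diagonal AR matrix with spectral radius well below one
    base = np.array([[0.5, 0.1, 0.0],
                     [0.0, 0.4, 0.2],
                     [0.1, 0.0, 0.3]])
    return (0.6 + 0.3 * u) * base

F = np.zeros((T, P))
F[0] = mu(0.0)
for t in range(1, T):
    u, u_prev = t / T, (t - 1) / T
    F[t] = mu(u) + A(u) @ (F[t - 1] - mu(u_prev)) + 0.1 * rng.standard_normal(P)
```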
In Section 5.1, we prove that the factors are identified by the main principal components of the covariance matrix of the (rescaled) observations, whereas in Section 5.2 we prove that the factors are invariant to changes in the scale of the observations.
We assume that model (4) holds for all $j = 1, \ldots, R$ and all $t$. For all j and all t, the vector $\mathbf{F}_{j,t}$ is $P \times 1$ and the corresponding matrix $\mathbf{A}_{j,t}$ is $P \times P$. Since the factors are mutually orthogonal, the cross-factor blocks of the auto-regressive matrix are matrices of zeros. As a consequence, we have, for all $j$ and all $t$, the following Vector Auto-Regressive (VAR) model for each of the latent factors:
$$\left(\mathbf{I}_P - \mathbf{A}_{j,t}\,B\right)\left(\mathbf{F}_{j,t} - \boldsymbol{\mu}_{j,t}\right) = \boldsymbol{\xi}_{j,t}, \qquad (5)$$
where $\boldsymbol{\mu}_{j,t} = \operatorname{E}(\mathbf{F}_{j,t})$, and B is the back-shift operator: $B\,\mathbf{F}_{j,t} = \mathbf{F}_{j,t-1}$. The factors in (4) are dynamic and non-stationary. They are dynamic because they are auto-correlated over time, that is, the structure of each factor at time t depends on its own past values through the auto-regressive matrix $\mathbf{A}_{j,t}$. The factors in (4) are also non-stationary in the sense that their mean and their auto-covariance are time-varying. Comparing the representations of the factors in (1) and (5), it is easy to see that $\mathbf{A}_{j,t}$ is the jth observation-set-specific, dynamic, and time-varying version of the indistinct-sets, static, and time-invariant matrix $\mathbf{D}$. Model (3)–(4) reduces to (1) when the time-varying means vanish and the auto-regressive structure disappears for all j and all t.
In order to have meaningful asymptotic theory, we work in the framework of local stationarity introduced by Dahlhaus (1997), where the time-varying parameters are defined in rescaled time $u = t/T \in [0, 1]$. Consider the locally stationary version of the vector auto-regression of order 1 in (5),
$$\mathbf{F}_{t,T} - \boldsymbol{\mu}\!\left(\tfrac{t}{T}\right) = \mathbf{A}\!\left(\tfrac{t}{T}\right)\left(\mathbf{F}_{t-1,T} - \boldsymbol{\mu}\!\left(\tfrac{t-1}{T}\right)\right) + \boldsymbol{\xi}_{t}, \qquad (6)$$
where the parameter curves $\boldsymbol{\mu}(\cdot)$ and $\mathbf{A}(\cdot)$ are defined on $[0, 1]$, with $u = t/T$. For ease of notation, we skip the dependence on the index j that refers to the factor we are smoothing. Locally stationary means that if the functions $\boldsymbol{\mu}(\cdot)$ and $\mathbf{A}(\cdot)$ are 'smooth' in u and T is large, then the parameters, and hence the behaviour of the process, are approximately constant for values in real time close to each other. Model (6) is a locally stationary process written in rescaled time in a way that, for all t, as T grows we observe more and more 'observations' of the same type around t. In terms of representation, the triangular array (which is implicit in the locally stationary framework) provides a unique definition of the transfer function of the process in the time and frequency domains. In terms of estimation, defining the process in rescaled time is useful for non-parametric estimation.
In a factor model such as (3), the entries of the observed matrix are continuous measures, whereas CA deals with the observed frequency table, obtained by dividing the contingency table defined in Section 2 by the sum of its elements. In Section 4, we define ECA as a time-varying version of CA, and then in Section 5 we derive a novel factor-model representation of the (appropriately rescaled) frequencies, see (34).
4 Evolutionary correspondence analysis
Correspondence analysis was developed by Benzécri (1992). For an introductory description of CA, we refer to Greenacre (2016). Beh and Lombardo (2019) provide an overview of the many variants of CA, accompanied by an extensive list of references, and discuss both benefits and limitation of this technique. Beh and Lombardo (2021) explain how to extend CA to more than two categorical variables, and provide introductory remarks about non-symmetrical correspondence analysis (D’Ambra & Lauro, 1989).
CA is a technique designed for dimension reduction of frequency tables. It is conceptually similar to PCA, but scales the data (which should be non-negative) such that rows and columns are treated equivalently. In this paper, we generalize the CA to the case where the observed frequency table is time-varying.
Let the contingency table at time t be given and assume, without loss of generality, that $P \leq N$. Similarly to PCA, CA reduces the dimension from P to a smaller number of latent axes. From the table of counts, we obtain the matrix of (relative) frequencies: at each t, the sum of all its entries is equal to 1, and its $(p, n)$ entry is defined as the number of occurrences of concept n in sub-discourse p divided by the total number of counts at time t.
Let the marginal time-varying row-frequencies and the marginal column-frequencies be defined, respectively, as the row sums and the column sums of the matrix of frequencies. From these, we obtain two diagonal matrices: the $P \times P$ matrix containing the marginal row-frequencies and the $N \times N$ matrix containing the marginal column-frequencies. The main diagonal of the former contains the 'typical concept', which represents a hypothetical concept with average characteristics (average with respect to the sub-discourses). The main diagonal of the latter contains the 'typical set of observations', which represents a hypothetical sub-discourse with average characteristics (average with respect to the concepts). The matrix of frequencies is rescaled by the inverses of these two diagonal matrices, obtaining, respectively, the row-profiles and the column-profiles.
The rescaling allows comparing rows (sets of observations) with columns (concepts). For all t, the ECA divides:
- the pth row (the pth set of observations) by the corresponding time-varying row-frequency;
- the nth column (the nth concept) by the corresponding time-varying column-frequency.
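These profile computations are easy to carry out numerically. The sketch below applies them to a single, made-up weekly count table (rows are sub-discourses, columns are concepts); it illustrates the definitions above and is not the paper's code.

```python
# Row- and column-profiles for one time slice of a (toy) count table.
import numpy as np

counts_t = np.array([[12.,  0.,  3.,  5.,  1.],   # rows: sub-discourses
                     [ 4.,  8.,  0.,  2.,  6.],   # columns: concepts
                     [ 7.,  1.,  9.,  0.,  2.],
                     [ 0.,  5.,  2.,  4.,  3.]])

P_t = counts_t / counts_t.sum()      # relative frequencies; entries sum to 1
r_t = P_t.sum(axis=1)                # marginal row-frequencies (sub-discourse masses)
c_t = P_t.sum(axis=0)                # marginal column-frequencies (concept masses)

D_r = np.diag(r_t)                   # diagonal matrix of marginal row-frequencies
D_c = np.diag(c_t)                   # diagonal matrix of marginal column-frequencies

row_profiles = np.linalg.inv(D_r) @ P_t   # each row sums to 1 (cf. Table 1, in %)
col_profiles = P_t @ np.linalg.inv(D_c)   # each column sums to 1 (cf. Table 2, in %)
```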
Table 1 reports the row-profiles at a given time point, for the six sub-discourses and a selection of concepts. The last line is the Average Row-Profile, the main diagonal of the matrix of marginal column-frequencies. From Table 1, we can see that at this time point, the first concept (C1) makes up 0.0324% of all concepts mentioned by the centrist German newspaper Süddeutsche Zeitung (SZ), but only 0.0004% of those mentioned by the right-leaning paper Die Welt (DW). Independently of the sub-discourse considered, the first concept constitutes about 0.04% of all references across all newspapers at this point in time. Among the concepts shown in the table, the third concept is the one mentioned most by SZ (0.2271%).
Table 1. Row-profiles (in %) at a given week.

| Sub-discourses | C1 | C2 | C3 | C4 | … | C525 | Total |
|---|---|---|---|---|---|---|---|
| Germany, left (FR) | 0.0011 | 0.1134 | 0.1134 | 0.1134 | … | 0.0011 | 100 |
| Germany, centre (SZ) | 0.0324 | 0.0003 | 0.2271 | 0.0649 | … | 0.0973 | 100 |
| Germany, right (DW) | 0.0004 | 0.0856 | 0.1285 | 0.3854 | … | 0.1285 | 100 |
| Greece, left (ET) | 0.0569 | 0.0003 | 0.0003 | 0.0285 | … | 0.0285 | 100 |
| Greece, centre (TN) | 0.0006 | 0.0006 | 0.0006 | 0.0006 | … | 0.0649 | 100 |
| Greece, right (KA) | 0.1390 | 0.0007 | 0.0007 | 0.0007 | … | 0.0007 | 100 |
| Average Profile | 0.0393 | 0.0238 | 0.0862 | 0.1018 | … | 0.0627 | 100 |
Table 2 reports the column-profiles at the same time point, for the six sub-discourses and a selection of concepts. The last column is the Average Column-Profile, the main diagonal of the matrix of marginal row-frequencies. From Table 2, we can see that of all the references to the first concept recorded at this time point, almost 40% each derive from the Greek left-leaning Eleftherotypia (ET) and the right-leaning Kathimerini (KA), with another 20% found in the German centrist SZ. Across all concepts, the Greek centrist Ta Nea (TN) accounts for about 12% of all concept references.
Table 2. Column-profiles (in %) at the same week.

| Sub-discourses | C1 | C2 | C3 | C4 | … | C525 | Average Profile |
|---|---|---|---|---|---|---|---|
| Germany, left (FR) | 0.1988 | 32.8947 | 9.0662 | 7.6805 | … | 0.1247 | 6.8945 |
| Germany, centre (SZ) | 19.8807 | 0.3289 | 63.4633 | 15.3610 | … | 37.4065 | 24.0928 |
| Germany, right (DW) | 0.1988 | 65.7895 | 27.1985 | 69.1244 | … | 37.4065 | 18.2505 |
| Greece, left (ET) | 39.7614 | 0.3289 | 0.0907 | 7.6805 | … | 12.4688 | 27.4697 |
| Greece, centre (TN) | 0.1988 | 0.3289 | 0.0907 | 0.0768 | … | 12.4688 | 12.0494 |
| Greece, right (KA) | 39.7614 | 0.3289 | 0.0907 | 0.0768 | … | 0.1247 | 11.2431 |
| Total | 100 | 100 | 100 | 100 | … | 100 | 100 |
It is easy to verify that weighting each row-profile by the corresponding marginal row-frequency and summing over the rows recovers the Average Row-Profile; symmetrically, weighting each column-profile by the corresponding marginal column-frequency and summing over the columns recovers the Average Column-Profile.
Table 3 shows that the Average Row-Profile, which is the main diagonal of the matrix of marginal column-frequencies as well as the last line in Table 1, is a weighted average of the row-profiles with weights given by the Average Column-Profile, which is the main diagonal of the matrix of marginal row-frequencies as well as the last column in Table 2. Analogously, the Average Column-Profile is a weighted average of the column-profiles with weights given by the Average Row-Profile.
Table 3. The six row-profiles of Table 1 and the weights with which they combine into the Average Row-Profile; the weights are the entries of the Average Column-Profile of Table 2.

| Row-profile | Weight |
|---|---|
| Germany, left (FR) | × 0.068945 |
| Germany, centre (SZ) | × 0.240928 |
| Germany, right (DW) | × 0.182505 |
| Greece, left (ET) | × 0.274697 |
| Greece, centre (TN) | × 0.120494 |
| Greece, right (KA) | × 0.112431 |
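A quick numerical check of this barycentric relation, using the same kind of toy table as above (an illustration only, not the paper's data):

```python
# Check: the Average Row-Profile equals the weighted average of the row-profiles,
# with weights given by the marginal row-frequencies (toy data).
import numpy as np

counts_t = np.array([[12.,  0.,  3.,  5.,  1.],
                     [ 4.,  8.,  0.,  2.,  6.],
                     [ 7.,  1.,  9.,  0.,  2.]])

P_t = counts_t / counts_t.sum()
r_t = P_t.sum(axis=1)                     # row masses (Average Column-Profile)
c_t = P_t.sum(axis=0)                     # column masses (Average Row-Profile)
row_profiles = P_t / r_t[:, None]

assert np.allclose(r_t @ row_profiles, c_t)   # weighted average of row-profiles
```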
4.1 Extracting factors and loadings
Correspondence analysis is based on the spectral decomposition of the rescaled and centred frequencies. The matrix of frequencies is rescaled on both sides, dividing rows and columns by the square roots of the corresponding marginal frequencies, and the (time-varying) sample covariance matrix of the rescaled data is then formed. It is possible to show (see Saporta, 2006) that the largest eigenvalue of this covariance matrix is identically equal to one (identically meaning for all t), because of the constraint that the frequencies sum up to one. The same constraint implies that the corresponding factor is trivial (constant), and therefore the first principal component is useless for the interpretation of the results.2 For this reason, practitioners perform CA on the (first rescaled and then) centred data matrix, computed by subtracting from the frequencies the outer product of the marginal row- and column-frequencies before rescaling, cf. Section 5.1. Consider the spectral decomposition of the resulting matrix, with eigenvectors collected as columns of an orthonormal matrix and eigenvalues collected in decreasing order in a diagonal matrix. In our application $P = 6$ and $N = 525$, and therefore the rank of this matrix is at most $P - 1 = 5$ for all t. In order to choose the number of latent factors to retain, we calculated the time-varying trace ratio, that is, the sum of the leading eigenvalues divided by the sum of all eigenvalues, which measures the amount of variance that is captured by the first axes. In our application, the first two eigenvalues can explain more than 70% of the total variability in the data, and thus we retain two axes and focus on two factors: the first and the second.
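The per-week eigen-analysis and the trace-ratio criterion can be sketched as follows (toy data; the sketch diagonalizes the small problem on the sub-discourse side, whose non-zero eigenvalues coincide with those of the covariance matrix discussed above):

```python
# One time slice of the ECA eigen-analysis and the trace-ratio criterion (toy data).
import numpy as np

counts_t = np.array([[12.,  0.,  3.,  5.,  1.],
                     [ 4.,  8.,  0.,  2.,  6.],
                     [ 7.,  1.,  9.,  0.,  2.],
                     [ 0.,  5.,  2.,  4.,  3.]])

P_t = counts_t / counts_t.sum()
r = P_t.sum(axis=1)
c = P_t.sum(axis=0)

# Centring with the outer product of the margins removes the trivial unit eigenvalue.
Z = np.diag(r ** -0.5) @ (P_t - np.outer(r, c)) @ np.diag(c ** -0.5)

eigvals = np.linalg.eigvalsh(Z @ Z.T)[::-1]      # small P x P eigen-problem
eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negatives

k = 2
trace_ratio = eigvals[:k].sum() / eigvals.sum()  # share of variability kept by k axes
```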
Correspondence analysis seeks to represent the interrelationships of categories of row and column variables (the sub-discourses and the concepts, respectively) on a two-dimensional map (more generally, in a lower-dimensional map). It can be thought of as trying to plot a cloud of data points (the cloud having height, width, and thickness) on a single plane, so as to give a reasonable summary of the relationships and variation within them. More concretely, loadings and factors can be represented on the same axes, see Figure 4. Moreover, for each of the P rows (the sub-discourses) and each of the N columns (the concepts), we can compute the contribution (of that row or that column) to the variability of the new (latent) axes. The estimated time-varying factors and loadings are defined in (10) and (11), respectively, for all t. It is easy to verify (see Benzécri, 1992) the formulas in (12), which explain the meaning of the so-called barycentric (or dual) relationship that exists between factors and loadings. The duality in (12) is a direct consequence of the symmetric role played by row- and column-profiles. Performing PCA on the row-profiles is equivalent to a PCA on the column-profiles: the factors of one analysis are (up to a rescaling) the loadings of the other analysis, and therefore they can be represented on the same factorial space (see Bouroche & Saporta, 1980, pp. 93–94). In Section 6.3, we apply this joint representation to our data and interpret the results.
For all $j$, the time-varying contributions of the sets of observations and of the concepts to the variability of the jth principal component (the jth new axis) are defined in (13). In particular, for the trivial axis:
- the contribution of the pth set of observations to the variability of the trivial factor is the pth row profile, and
- the contribution of the nth concept to the variability of the trivial loading is the nth column profile.
It is possible that a point that is close to the origin has a stronger contribution than a point that is far from the origin. Indeed:
- The strength of the contribution of the pth set of observations to the variability of the jth axis depends on both its mass and its squared coordinate on that axis. Evaluating its position on the jth axis will thus not be sufficient to determine its contribution.
- The strength of the contribution of the nth concept to the variability of the jth axis depends on both its mass and its squared coordinate on that axis. Evaluating its position on the jth axis will thus not be sufficient to determine its contribution.
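In standard CA these contributions are computed as mass times squared principal coordinate, divided by the inertia of the axis; the sketch below applies those textbook formulas to a toy table as a stand-in for the time-varying version in (13).

```python
# Standard CA contributions of rows (sub-discourses) and columns (concepts)
# to the inertia of one axis, on a toy table.
import numpy as np

counts_t = np.array([[12.,  0.,  3.,  5.,  1.],
                     [ 4.,  8.,  0.,  2.,  6.],
                     [ 7.,  1.,  9.,  0.,  2.]])
P_t = counts_t / counts_t.sum()
r = P_t.sum(axis=1)
c = P_t.sum(axis=0)

Z = np.diag(r ** -0.5) @ (P_t - np.outer(r, c)) @ np.diag(c ** -0.5)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

F = np.diag(r ** -0.5) @ U * s           # principal row coordinates (factors)
G = (np.diag(c ** -0.5) @ Vt.T) * s      # principal column coordinates (loadings)

j = 0                                    # first non-trivial axis
row_contrib = r * F[:, j] ** 2 / s[j] ** 2   # sums to 1 across rows
col_contrib = c * G[:, j] ** 2 / s[j] ** 2   # sums to 1 across columns
```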
In the next section, we present two important transition formulas for the computation of the matrices of factors and loadings defined in (10) and (11), respectively. These formulas permit the transition between the space of the variables (the concepts) and the space of the observations (the sub-discourses). Equation (14) is useful whenever one of the sizes (P or N) is large compared with the other, whereas (15) allows us to obtain factors and loadings directly from row-profiles and column-profiles.
4.2 Transition formulas
For ease of notation, let us drop the time dependence and define, analogously to the matrix diagonalized in Section 4.1, its lower-dimensional counterpart, whose eigenvalues are likewise collected in decreasing order in a diagonal matrix. Analogously to Principal Components Analysis (cf. Härdle & Simar, 2015), it is possible to prove that the factors and loadings in (10) and (11) can be obtained, via (14), from the matrix whose columns are the eigenvectors corresponding to the largest eigenvalues, collected in decreasing order in the associated diagonal matrix. In our case $P \ll N$ and, therefore, it is computationally faster to diagonalize the $P \times P$ matrix (rather than the $N \times N$ one) and then obtain factors and loadings using (14). In Proposition 4 of Appendix B of the online supplementary material, we show that three matrices (the covariance of the rescaled frequencies and the products between row-profiles and column-profiles, in either order) share the same eigenvalues, and that the orthonormal eigenvectors of the covariance matrix can be obtained, via (15), from the unit-length eigenvectors of the two products. Both factors and loadings in (10) and (11) depend on the eigenvectors of the covariance matrix. Equation (15) can therefore be useful to obtain these eigenvectors from the eigenvectors of either the row-profiles or the column-profiles. The result in (15) represents a novel contribution to CA: it permits obtaining loadings and factors from row-profiles and column-profiles rather than from the matrix of frequencies. Indeed, from row- and column-profiles we can obtain the required eigenvectors through (15); then, since the three matrices share the same eigenvalues, we can find the factors and the loadings via (10) and (11), respectively.
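The computational point, that the small eigen-problem suffices when P is much smaller than N, can be illustrated with a generic SVD-duality argument; the sketch below is a stand-in for, not a reproduction of, the transition formulas (14)–(15).

```python
# Recovering the N-dimensional eigenvectors from the small P x P eigen-problem.
import numpy as np

rng = np.random.default_rng(1)
P, N = 6, 525
Z = rng.standard_normal((P, N))      # stands in for the rescaled, centred matrix

vals, U = np.linalg.eigh(Z @ Z.T)    # small P x P problem
order = np.argsort(vals)[::-1]
vals, U = vals[order], U[:, order]

# Eigenvectors of the large N x N matrix Z.T @ Z, obtained without
# diagonalizing it directly:
V = Z.T @ U / np.sqrt(vals)

big_vals = np.linalg.eigvalsh(Z.T @ Z)[::-1][:P]
assert np.allclose(big_vals, vals)                    # shared non-zero eigenvalues
assert np.allclose(V.T @ V, np.eye(P), atol=1e-8)     # orthonormal columns
```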
4.3 Smoothing the loadings
In principle, there are two approaches to smoothing the loadings: the first is the one we followed in Motta and Baden (2013), the second is the one presented in this paper.
In the first approach, given our observations, we compute the cross-products and smooth them to obtain the time-varying covariance matrix. We then obtain the latent matrices (loadings and factors) from the smoothed covariance matrix. Here the smoothing is applied in the first step, to the cross-products; that is:
we first compute and smooth the matrix of cross-products, and
we then derive the latent matrices from the smoothed covariance matrix.
In the second approach, given our observations, we compute the cross-products (without smoothing) to obtain the time-varying raw covariance matrix defined in (8). We then obtain the latent quantities (loadings and factors) from this raw (or unsmoothed) covariance matrix, and finally we smooth the latent quantities. Here the smoothing is not applied to the cross-products; that is:
we first derive the latent matrices from the unsmoothed cross-products, and then
we smooth loadings and factors.
It is well known that the unit-norm eigenvectors of a time-invariant matrix are unique up to a ± sign. In the case of time-invariant matrices, this indeterminacy can be resolved by fixing the first row of the eigenvector matrix to be positive (see Lawley & Maxwell, 1971, page 18). However, when dealing with matrix-valued functions, following this convention leads to unsmooth eigenvector functions. A matrix-valued function is a path of matrices whose entries depend on a real variable (see, e.g. Bunse-Gerstner et al., 1991). If the matrix-valued function is smooth in t, there exist a smooth orthonormal matrix function and a smooth diagonal matrix function providing its eigendecomposition for all t (see Dieci & Eirola, 1999, Proposition 2.4). Nevertheless, although mathematically such a smooth eigendecomposition does exist, the identification up to sign renders the eigenvectors computationally unsmooth. That is, the eigenvectors obtained from a conventional eigendecomposition computed at each t are generally not differentiable in t. In Appendix A.2 of the online supplementary material, we illustrate this phenomenon through a numerical example.
In order to remove the discontinuities over time of the time-varying loadings, we use the criterion in (16): for a given time point, we change the sign of the jth vector of raw loadings if it is closer to the negative of the loadings at the preceding time point than to those loadings themselves.
For more details about the estimator in (16), we refer the reader to Motta et al. (2023). When using (16) to pick 'the right sign' of the loadings at a certain point in time, we also need to change the sign of the corresponding value of the factors at that point in time, since at each time point the relationship between factors and loadings must hold.
We apply the criterion in (16) to the 'raw loadings' rather than to the smoothed loadings defined in (17) below, since we are then more likely to detect abrupt changes (smoothing reduces the size of the jumps at the points of discontinuity). To smooth the loadings, we follow two steps: in step (a) we make sure to avoid spurious discontinuities that are only due to the up-to-sign misidentification of eigenvectors, and in step (b) we smooth the loadings in a way that accounts only for present and past values of the raw loadings; a sketch of both steps follows below.
- (a) For each time point, we use the criterion in (16) to 'pick the right sign' of the jth vector of raw loadings, and we change the sign of the corresponding value of the factor. We do not need to flip all the factors at once: we do it for each j separately.
- (b) Define the smoothed loading matrix at time t as the one-sided weighted average in (17) of the present and the m − 1 most recent past values of the raw loadings, where m is the length of our one-sided window and the weights are decreasing in ℓ, with the denominator chosen so that they sum to one.
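The two-step treatment of the raw loadings can be sketched as follows; the sign rule (flip when the inner product with the previous loading vector is negative) and the linearly decreasing one-sided weights are plausible stand-ins for (16) and (17), not the exact formulas of the paper.

```python
# Sketch of the two-step treatment of the raw loadings: sign alignment, then
# one-sided smoothing with decreasing weights.
import numpy as np

def align_signs(raw_loadings):
    """raw_loadings: (T, N) array; flip the sign at t when the vector is closer
    to the negative of its aligned predecessor (negative inner product)."""
    aligned = raw_loadings.copy()
    signs = np.ones(len(aligned))
    for t in range(1, len(aligned)):
        if np.dot(aligned[t], aligned[t - 1]) < 0:
            aligned[t] = -aligned[t]
            signs[t] = -1.0          # the factor value at t must be flipped too
    return aligned, signs

def smooth_one_sided(aligned, m):
    """One-sided weighted moving average over the present and m - 1 past values."""
    T, N = aligned.shape
    w = np.arange(m, 0, -1, dtype=float)      # linearly decreasing weights
    w /= w.sum()
    out = np.empty_like(aligned)
    for t in range(T):
        lo = max(0, t - m + 1)
        block = aligned[lo:t + 1][::-1]       # present first, then the past
        ww = w[:len(block)] / w[:len(block)].sum()
        out[t] = ww @ block
    return out

rng = np.random.default_rng(2)
raw = np.cumsum(rng.standard_normal((120, 10)), axis=0)
raw /= np.linalg.norm(raw, axis=1, keepdims=True)    # unit-norm 'eigenvectors'
raw[40:80] *= -1                                     # artificial sign flips
aligned, signs = align_signs(raw)
smoothed = smooth_one_sided(aligned, m=17)
```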
The properties of the smoothed loadings in (17) depend on the smoothing parameter m. In Appendix C of the online supplementary material, we derive an asymptotic expression for the mean squared error (MSE) of the smoothed loadings.
The MSE is the sum of two terms: the first is the squared bias, whereas the second is the variance. The larger the value of m, the larger the bias and the smaller the variance (and vice versa). A small value of m allows capturing the local curvature of the loadings, at the cost of a larger variance. In non-parametric statistics, the bandwidth is a sequence that depends on the sample size T. In order to balance the trade-off between bias and variance, we select the optimal value of the smoothing parameter m as the minimizer of the MSE. This minimizer shows how the optimal value of the parameter m (adopted to smooth the loadings) relates to the overall length T of the time series.
For our application, we find that the MSE is approximately flat for m between 16 and 18 weeks, see Figure 1. We choose a window of one quadrimester of weekly observations, which lies within this range.

Figure 1. The mean squared error of the smoothed loadings as a function of the smoothing parameter m; we minimize it with respect to m to select the window length.
4.4 Smoothing the factors
In our application, we extract two factors, the first and the second, each of dimension P at every time point. After extracting the factors in (10) and adjusting their sign according to (16), we fit to each factor a time-varying VAR(1) model according to the specification in (5): equation (18) relates the demeaned factor at time t to its own lagged value through the time-varying auto-regressive matrix, where the time-varying means are defined in (19) and the time-varying VAR(1) coefficient curves in (20). For each j, the jth estimated factor is extracted using CA, its time-varying mean is estimated as in (19), and the jth demeaned estimated factor is obtained by subtracting the estimated mean from the estimated factor. For example, writing out one row of (18) gives, as in (21), the equation of a single sub-discourse, whose current value depends on the lagged values of all P sub-discourses.
For all j, we estimate jointly the P entries of the time-varying mean vector and the $P^2$ entries of the time-varying auto-regressive matrix in (18) by means of local linear regression, according to the weighted least-squares (WLS) approach introduced by Motta (2021). For a fixed j, consider the locally stationary VAR(1) in (18), written in rescaled time as in (22), with smooth parameter curves. If the largest eigenvalue of the auto-regressive matrix lies inside the unit circle for all rescaled times, the process is locally stationary and causal. Our goal is to estimate the parameter curves at a fixed rescaled time point u by WLS. We can rewrite (22) in regression form, see (23), stack the intercept and the lagged values into a single design, see (24), and define the estimator as the minimizer of the weighted loss function in (25), where the local weights are rescaled kernel functions and the bandwidth sequence tends to zero more slowly than $1/T$. Using the local-linear approximation of the parameter curves in a neighbourhood of u, Motta (2021) proved that the local-linear minimizer of (25) has the closed form given in (26)–(27). The WLS equations (26) and (27) generalize the LS estimators of the time-invariant VAR(1) model (see Reinsel, 1997, Section 4.3.1) to the locally stationary framework. For our application, the means defined in (19) are obtained from (27) and are plotted in Figure 4. The VAR(1) curves defined in (20) and estimated according to (27) are presented in Figure 6. We emphasize that the VAR matrix in (20) is 'full' rather than diagonal: the VAR(1) coefficients of sub-discourse k in (21) explain how sub-discourse k depends on its own lagged value as well as on the lagged values of the other five sub-discourses.
We estimate the matrix in (20) by the last P columns of the matrix obtained in (26)–(27). The VAR(1) matrix depends on the bandwidth h, the smoothing parameter. Using the theory of local polynomials (Fan & Gijbels, 1996), it is possible to prove the asymptotic expression (28) for the mean squared error, in which the bias is of order $h^2$ and the variance is of order $(Th)^{-1}$. The mean squared error in (28) shows how the variance of the smoothed factors relates to the overall length T of the time series. The optimal bandwidth minimizing (28) has the usual rate $T^{-1/5}$, and the corresponding optimal rate of the Mean Squared Error in (28) is $T^{-4/5}$.
In practice the bandwidth is chosen by cross-validation. For non-parametric regression, in the case of dependent observations, cross-validation is known to be severely affected by dependence. In order to adjust for the effect of (possible) dependence on bandwidth selection, we select the smoothing parameter h by means of the ‘leave-(2ℓ +1)-out’ version of cross-validation by Chu and Marron (1991).
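The following is a hedged sketch of a kernel-weighted, local-linear least-squares fit of a time-varying VAR(1) with a time-varying intercept, in the spirit of the WLS estimator (26)–(27); the design, kernel, fixed bandwidth, and toy data are illustrative choices (in practice the bandwidth would be selected by the modified cross-validation just described).

```python
# Local-linear, kernel-weighted least-squares fit of a time-varying VAR(1).
import numpy as np

def epanechnikov(x):
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x ** 2), 0.0)

def local_linear_var1(F, u0, h):
    """F: (T, P) factor path. Returns (local intercept, A(u0)) at rescaled time u0."""
    T, P = F.shape
    y = F[1:]                                  # responses F_t
    x_lag = F[:-1]                             # lagged values F_{t-1}
    u = np.arange(1, T) / T                    # rescaled time of each response
    w = epanechnikov((u - u0) / h) / h         # kernel weights

    # Design: [1, F_{t-1}] plus the same block times (u - u0) for the local-linear part.
    base = np.hstack([np.ones((T - 1, 1)), x_lag])
    design = np.hstack([base, base * (u - u0)[:, None]])

    WX = design * w[:, None]
    theta = np.linalg.lstsq(WX.T @ design, WX.T @ y, rcond=None)[0]
    intercept = theta[0]                       # local intercept of the fit
    A_u0 = theta[1:P + 1].T                    # P x P time-varying VAR(1) matrix
    return intercept, A_u0

rng = np.random.default_rng(3)
F = np.cumsum(0.1 * rng.standard_normal((400, 3)), axis=0)   # toy factor paths
intercept_mid, A_mid = local_linear_var1(F, u0=0.5, h=0.15)
```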
5 Understanding ECA as a time-varying factor model: identification and invariance
Correspondence analysis may be defined as a special case of principal components analysis (PCA) applied to the rows and columns of a table, especially a cross-tabulation. However, CA and PCA are used under different circumstances: PCA is used for tables consisting of continuous measurements, whereas CA is applied to contingency tables (i.e. cross-tabulations). In Section 5.1, we establish the connection between CA and PCA, and show that it is possible to write the rows-and-columns rescaled matrix of frequencies as a factor model, see (33). Also, the same matrix can be approximated by a factor model of lower rank, see (34). In Section 5.2, we show that in our model, changes to the observations translate into changes of the same scale in the latent factors.
5.1 Correspondence analysis as a factor model: identifying the common components
For all t, define the rescaled frequencies by dividing each frequency by the square roots of the corresponding marginal row- and column-frequencies. In this section, we show that there exists a representation of the rescaled frequencies as a factor model, with loadings given by the eigenvectors of their sample covariance matrix, factors given by the corresponding principal components, and a low-rank idiosyncratic component. Without loss of generality, in what follows we assume that $P \leq N$. Then define, for all t, the matrix whose columns are the principal components corresponding to the eigenvalues of the sample covariance matrix, as in (31). Using the 'reconstruction' formula in (32) (see Benzécri, 1992; van der Heijden & de Leeuw, 1985), we can prove the exact factor representation of the rescaled frequencies given in (33). Hence, retaining only the leading axes, we can approximate the rescaled frequencies by the truncated expansion in (34). Therefore, the eigenvectors and the principal components in (34) are, respectively, loadings and factors in model (3). If all axes are retained, we reconstruct the matrix exactly (no dimension reduction). Notice that equations (31) and (33), together with the ortho-normality of the columns of the eigenvector matrix, imply that the idiosyncratic components are orthogonal to the loadings. Equation (32) above shows that CA decomposes the departure from independence in a contingency table. Indeed, under the assumption of independence between rows and columns (sub-discourses and concepts, respectively), each frequency would equal the product of the corresponding marginal row- and column-frequencies, which would imply that the centred, rescaled matrix vanishes.
If we define the matrix of common components, of rank equal to the number of retained axes, and the matrix of idiosyncratic components, of rank one, we can rewrite (33) and (34) as the decomposition in (37), where the approximation is due to the truncation of the smallest axes (with no truncation, the decomposition is exact). The two matrices in (37) are, respectively, the common components and the idiosyncratic components in model (3). In other words, the idiosyncratic part contains the (rescaled) frequencies we would observe if rows and columns were independent, and hence the common part measures the departure from independence, or commonness, between rows and columns. The approximation in (37) shows that
- the first principal component of the rescaled frequencies, which corresponds to the unit eigenvalue, represents the idiosyncratic components, whereas
- the main principal components of the centred, rescaled frequencies, which correspond to the largest eigenvalues, represent the common components.
The approximation in (37) is thus a time-varying factor model for the rescaled frequencies, with
- idiosyncratic components of rank one and
- common components of rank equal to the number of retained axes.
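This decomposition into an 'independence' part and a low-rank 'commonness' part is easy to verify numerically; the following sketch uses a toy table and generic SVD notation rather than the paper's equations (33)–(37).

```python
# Rank-one 'independence' part plus a low-rank common part recovered from the
# leading axes of the centred, rescaled frequencies (toy table).
import numpy as np

counts_t = np.array([[12.,  0.,  3.,  5.,  1.],
                     [ 4.,  8.,  0.,  2.,  6.],
                     [ 7.,  1.,  9.,  0.,  2.],
                     [ 0.,  5.,  2.,  4.,  3.]])
P_t = counts_t / counts_t.sum()
r = P_t.sum(axis=1)
c = P_t.sum(axis=0)

indep = np.outer(r, c)                         # frequencies under independence
Z = np.diag(r ** -0.5) @ (P_t - indep) @ np.diag(c ** -0.5)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

k = 2                                          # keep the k leading axes
Z_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
P_approx = indep + np.diag(r ** 0.5) @ Z_k @ np.diag(c ** 0.5)

# With all axes retained, the reconstruction is exact:
Z_full = U @ np.diag(s) @ Vt
assert np.allclose(indep + np.diag(r ** 0.5) @ Z_full @ np.diag(c ** 0.5), P_t)
```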
5.2 Correspondence analysis as a factor model: invariance of the scale of the factors
The modelling of time-varying latent factors suggests that the constructs being evaluated are changing with time. The use of time-varying loadings allows both the scale and the meaning of the latent variables to change across time. In this section, we show that, although both loadings and factors are time-varying, it is still possible to distinguish whether observed changes are due to changes in the latent variables (i.e. the factors) or to changes in what is being measured (i.e. the observations). Recalling (31), it follows from the definition of the Frobenius norm and from the ortho-normality of the loadings that the norm of the (centred, rescaled) observations decomposes, as in (38), into the contribution of the factors plus that of the idiosyncratic components. When all non-trivial axes are retained, (38) becomes (39). The decomposition in (38) allows us to measure what proportion of the observed changes is due to true changes in the latent variables, and shows that the observed changes are not affected by the scale of the ruler (i.e. the scale of the loadings). Equation (39) shows that allowing for time-variation in both latent factors and loadings simultaneously does not pose identification problems. Indeed, for all t the scale of the factors is uniquely determined by the scale of the observations; that is, the observed changes are not affected by the scale of the ruler (i.e. by the scale of the loadings).
Suppose now we multiply the observations by a scalar a. Due to the orthogonality between the common and the idiosyncratic parts (cf. Section 5.1), it follows from (8) that the covariance of the rescaled observations has the same eigenvectors as before and eigenvalues multiplied by $a^2$. Hence, applying (31) to the rescaled observations, we obtain (40) for all t. Equation (40) shows that multiplying the observations by a scalar translates into the same rescaling of the factors. Using the same arguments, it is possible to prove that (40) holds for observations premultiplied by any orthogonal matrix. That is, our approach disentangles changes in constructs (the factors) from changes in measurement (the loadings).
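A generic numerical illustration of this scale-invariance property (a PCA-style check on toy data, not the paper's exact construction):

```python
# Multiplying the observations by a scalar rescales the extracted principal
# components by the same scalar (up to the usual sign indeterminacy).
import numpy as np

rng = np.random.default_rng(4)
Y = rng.standard_normal((6, 50))
Y -= Y.mean(axis=1, keepdims=True)

def principal_components(Y, k):
    vals, vecs = np.linalg.eigh(Y @ Y.T)
    idx = np.argsort(vals)[::-1][:k]
    return vecs[:, idx].T @ Y          # k principal-component scores

a = 3.7
pc = principal_components(Y, 2)
pc_scaled = principal_components(a * Y, 2)
assert np.allclose(np.abs(pc_scaled), np.abs(a * pc))
```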
6 Using ECA to understand the co-evolution and inter-dependency of linked public debates
Compared to EFA in equations (1)–(2) of Section 3, ECA offers four key analytic opportunities that add depth to 'big data' analysis and interpretation. First, the decomposition of the overall observed variance into the contributions of each sub-discourse to each latent factor represents the extent to which different sub-discourses make use of the same or different combinations of concepts to interpret the studied issue. Second, ECA enables the identification of distinct factors for each sub-discourse, which permits a direct analysis of the extent to which different sub-discourses make use of similar or different interpretations over time. Third, the time-varying location of each sub-discourse and the time-varying loadings of concepts on each factor in the same barycentric representation capture the specific content of those interpretations distinguished by each factor. Fourth, the estimated cross-lagged auto-regressive coefficients express how a sub-discourse's use of specific interpretations at a given time is related to the earlier presence of interpretations in the same sub-discourse as well as in other sub-discourses' coverage. Importantly, the analysis does not focus on any specific constructs to be measured, but rather identifies evolving yet systematic patterns of concept references that account for variation in the news coverage, indicating differential framing choices. Unlike conventional CA, ECA permits all of the above analyses to depend on time, such that the information captured by each factor is permitted to evolve, as are any associations and influences between these factors.
In the following, we will discuss each of the above points in turn, using our application to the German and Greek news debates regarding the 2009–2012 financial crisis to illustrate possible interpretations of the presented data.3
6.1 Latent factors can be interpreted as the contributions of variables and cases
Figure 2a shows the distribution of the 'Typical' (or average) concept with respect to the six sub-discourses, obtained by averaging characteristics over the concepts.4 The distribution of the typical concept for sub-discourse p is the time-varying contribution in (13), that is, the row-profile of sub-discourse p. Figure 2b,c shows the time-varying contributions in (13) to the first and the second factor, respectively.

Figure 2. Horizontal axis: time in weeks, running from 1 October 2009 to 30 June 2012. Panel a: time-varying distribution of the 'Average Concept' with respect to the six sub-discourses. Panels b, c: time-varying contributions of the six sub-discourses to Factors 1 and 2, respectively.
For each factor, the contributions add up to 1 across both columns and rows at each given point in time. Accordingly, ECA provides a fast way for measuring the (time-varying) extent to which each concept or each sub-discourse are responsible for the variability of each latent factor.
The distribution of the average concept (Figure 2a) captures the rapid evolution of concepts required to describe the constantly shifting news agenda, which is shared across sub-discourses but may be present to a time-varying extent: the German centrist and the Greek leftist sub-discourses dominate the debate until the Greek leftist sub-discourse vanishes. After the left-leaning Eleftherotypia ceases to publish following its bankruptcy in December 2011, its role is (to some extent) taken over by the centrist Ta Nea. The German debate plays a slightly larger role in defining the common news agenda, especially at the beginning and end of the observed period, with a pronounced early influence of the conservative Welt, culminating with the initial riots in Greece; after that, the situation definitions presented by the centrist Süddeutsche Zeitung take over as the most typical formulation of the common news agenda.
The first and second factors account for differences in framing that distinguish between different sub-discourses over time: As Figure 2 shows, all sub-discourses contribute to Factor 1 in relatively stable proportions, suggesting the presence of evolutionary, but reasonably enduring, differences that distinguish the observed sub-discourses. These differences, whose nature cannot yet be known from this display alone, are most prominent initially in the discourse of the German centre and the Greek left, and later, after the latter's termination, in both of the remaining Greek sub-discourses. Factor 2, by contrast, draws upon the different sub-discourses in a rapidly changing fashion, suggesting a succession of different distinctions that organize the coverage over time: The longest phase marked by a relatively consistent contribution of different sub-discourses extends from week 59 to week 73, and appears to focus on distinct framing choices within the Greek media (primarily between the centre and right), while the differences identified in this phase barely matter for the German coverage. At most times, the second factor extracts differences that distinguish primarily within the German or the Greek sub-discourses, but not both. While the factor's varying attention to either Germany or Greece primarily responds to where more pronounced, evolving patterns emerge in the public debate, it plausibly reflects the changing power relations within either domestic debate. For example, Greece's centrist sub-discourse dominates throughout most of 2011, but loses dominance when the Social Democratic government collapses in week 110 (November 2011).
To illustrate the added value of ECA versus any arbitrary contingency summary, we have computed a contingency table of the number of news articles about the eurocrisis by party family (left, centrist, right) and by week, see Figure 3. This contingency table has the same visual appearance as Figure 2, but the percentages of Factor 2 by week (Figure 2c) are markedly different from the percentages in Figure 3, especially in the 'predominantly white' period. The contributions (of the six sub-discourses) obtained through ECA provide a synthesis which is joint and multi-dimensional at the same time. It is joint because it involves all the concepts; it is multi-dimensional because those contributions are distributed over two different factors, each capturing a separate, specific aspect of the phenomenon. In contrast, in Figure 3, we select five key concepts that capture competing ways of labelling the crisis to provide an individual and uni-dimensional synthesis: As Figure 3 shows, labels foregrounding the economic damage ('Economic crisis') are far more commonly invoked in Greece than in Germany, mostly by the Greek left; the Greek centre temporarily suspends its use of the label while the Social-democratic government tries to push its austerity agenda, but re-appropriates it as the crisis deepens. By contrast, interpretations as a 'Financial crisis' are irrelevant in Greece, but persistently dominate in Germany. The focus on 'Debt' is persistently shared across both countries, while interpretations as 'Bank' or 'Euro crisis' gain and lose focus in both countries over time. For example, with the introduction of a technocratic government in Greece after week 111 (December 2011), dominant use of the 'Euro crisis' label shifts from Greece to the German discourse. We can also see that under Greece's social-democratic government, emphasis shifted from economics to debt in weeks 51–58 (October and November 2010), just at the time when its centrist sub-discourse gained dominance on Factor 2 in Figure 2. While a conventional CA could also have identified the enduring contributions made by each concept and sub-discourse, it is evident that flattening the analysis into a single phase under investigation loses much of the nuance and insight offered by ECA.

Horizontal axis: time in weeks, running from 1 October 2009 to 30 June 2012. Vertical axis: weekly percentage of news articles about the five crisis labels, split by party family. Panel a, bank crisis; Panel b, debt crisis; Panel c, economic crisis; Panel d, euro crisis; Panel e, financial crisis.
6.2 Separate sets of factors can be obtained for each case
While factors are extracted jointly from all the sub-discourses, as in EFA, ECA permits us to derive a different set of factors for each case (in our case, each sub-discourse) directly from the analysis. Unlike EFA (Motta & Baden, 2013), where all counts are summed up across cases (so we would lose the distinction between sub-discourses) to obtain a single matrix of counts per concept, with ECA we are able to 'retain' the specific information brought by the different sub-discourses by looking at a time-varying frequency table with one row per sub-discourse and one column per concept (here, 6 × 525). As a result, for each sub-discourse, we obtain time-varying means (see Figure 4a) and time-varying auto-regressive coefficients (see Figure 6). By plotting the location of each sub-discourse relative to the joint factors, as can be seen in Figure 4a, we can analyse how the respective sub-discourses align with one another over time. We now interpret the smoothed factors in (19), plotted in Figure 4a, whereas in Section 6.3 we interpret the smoothed loadings in (17), presented in panel b of the same figure.
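As a rough illustration of this idea (a simplified sketch in Python with NumPy, not the authors' estimator in (27); the Gaussian kernel, the bandwidth, and the toy dimensions are our own choices), one can smooth the weekly count tables over time and apply a CA decomposition to each smoothed table, which yields one trajectory of factor scores per sub-discourse:

```python
import numpy as np

def smooth_tables(counts, bandwidth=6.0):
    """Gaussian-kernel smoothing over time of a T x P x N array of weekly count tables.
    A simplified stand-in for local estimation; the bandwidth here is arbitrary."""
    T = counts.shape[0]
    t = np.arange(T)
    out = np.empty(counts.shape)
    for s in range(T):
        w = np.exp(-0.5 * ((t - s) / bandwidth) ** 2)
        out[s] = np.tensordot(w / w.sum(), counts, axes=(0, 0))
    return out

def ca_axes(X, k=2):
    """First k principal row and column coordinates of a CA of a single table X."""
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))          # standardized residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U * d / np.sqrt(r)[:, None])[:, :k]                    # rows: sub-discourses
    G = (Vt.T * d / np.sqrt(c)[:, None])[:, :k]                 # columns: concepts
    return F, G

# counts[t, p, n]: week t, sub-discourse p, concept n (toy data in place of the 6 x 525 tables)
rng = np.random.default_rng(1)
counts = rng.poisson(3.0, size=(140, 6, 30)).astype(float)
smoothed = smooth_tables(counts)
factors = np.stack([ca_axes(smoothed[s])[0] for s in range(len(smoothed))])   # T x 6 x 2 trajectories
```

In practice, the axes obtained at neighbouring time points must additionally be aligned, since the SVD determines each axis only up to sign (and, under near-equal inertias, up to rotation); the sketch above omits this step.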

Sub-discourses and primary crisis labels on common Factors 1 and 2 (barycentric representation). The integers denote the week numbers running from 1 October 2009 to 30 June 2012. Panel a: Sub-discourses as barycentres of their uses of concepts. We are plotting the curves defined in (19) and obtained from the estimates in (27). The projection of the pth row-profile on the jth axis is obtained by a weighted average of the coordinates of the pth row-profile, with weights given by the loadings on the jth axis. Panel b: Crisis labels as barycentres of their use across sub-discourses. We are plotting five selected smoothed loadings, that is, five selected rows of the matrix in (17). The projection of the nth column-profile on the jth axis is obtained by a weighted average of the 6 coordinates of the nth column-profile, with weights given by the factors on the jth axis.
Focusing on those two factors that represent time-varying distinctions between those frames foregrounded by different media, Figure 4a displays the position of each outlet’s coverage. The plot shows that Factor 1 durably distinguishes between the coverage presented by the German outlets (black, yellow, and red lines), which are persistently located on the right side of the figure, and the Greek outlets (blue, grey, and light blue lines), which remain on the left side at all times. Within each country, all three sub-discourses remain closely aligned with one another over time on this factor. We can thus identify Factor 1 as capturing persistent, country-specific perspectives and idiosyncrasies in the coverage.
By contrast, those time-varying distinctions captured by Factor 2 reflect a wide range of different similarities and contrasts in the debate regarding different ways of interpreting the crisis (notably, as a debt, Euro, or banking crisis; the distinction between interpretations as a financial or economic crisis is, as we have seen in Figure 3, part of the country-specific differences). At the outset, before the crisis escalated, for instance, the factor contrasts the discourse of the Greek right (blue, week 1) against that of the German left (red), with the German centre and right (yellow, red) leaning toward the Greek right, the Greek left (light blue) halfway toward the German left, and the Greek centrist Ta Nea (grey) at a neutral midpoint. As the crisis develops (ca. week 20, February–March 2010), the contrast between the Greek right and the other Greek sub-discourses remains, while all three German papers find themselves in between; with the Europeanization of the crisis (starting with Ireland around week 55, October 2010), the contrast between German left and Greek right is restored, with the remaining outlets lined up in between. In the run-up to the planned (and later cancelled) Greek referendum over the controversial European rescue plan (ca. week 110), the entire German discourse as well as the Greek right contrast against the Greek left. Within Greece, shifting frames mostly reconstitute a right–left conflict over changing concerns, with the centrist sub-discourse mostly taking an intermediate position (except for summer 2010, around week 44, when the conflict is between Ta Nea and the conservative Kathimerini). In Germany, we see phases opposing both left and right against the centrist SZ, led initially by the left-leaning outlets and later, on a different conflict, by the right-leaning Welt. Toward the end of the debate, both conservative media assume a distinctive position, while the rest gather close to the centre. Factor 2 captures time-varying distinctions in the framing of the news, which adjoin and separate different sub-discourses over time.
In order to show the added value of (the evolutionary) ECA versus (the time-invariant) CA, we have computed a standard (or 'ordinary') CA and plotted the results in Figure 5. In Section 6.3 we interpret and compare ECA (Figure 4) with CA (Figure 5), and highlight how ECA captures the evolutionary paths of smoothed loadings and smoothed factors, as compared to the static nature of CA.

Joint representation of the six political Sub-discourses and five crisis-related concepts, obtained by applying a time-invariant (standard) CA. The size of a circle is the contribution to the first factor, whereas the size of an asterisk is the contribution to the second factor.
6.3 Variables and cases can be represented dynamically within the same factor space
Unlike PCA, CA enjoys a symmetry property between factors and loadings: we can represent loadings and factors jointly, that is, on the same factorial space. This symmetry is due to the important barycentric property which is shared by the loadings and factors extracted by CA. More precisely, for a given axis $j$, the factors are the barycentre of the loadings, and the loadings are the barycentre of the factors, up to a scale factor given by the square root of the $j$th eigenvalue: writing $f_{pj}(t)$ for the factor of sub-discourse $p$ and $\ell_{nj}(t)$ for the loading of concept $n$ on axis $j$ at time $t$,
$$
f_{pj}(t) = \frac{1}{\sqrt{\lambda_j(t)}} \sum_{n=1}^{N} r_{pn}(t)\, \ell_{nj}(t),
\qquad
\ell_{nj}(t) = \frac{1}{\sqrt{\lambda_j(t)}} \sum_{p=1}^{6} c_{np}(t)\, f_{pj}(t),
$$
where the weights $r_{pn}(t)$ (with $\sum_{n=1}^{N} r_{pn}(t)=1$) and $c_{np}(t)$ (with $\sum_{p=1}^{6} c_{np}(t)=1$) are the time-varying row-profiles and column-profiles defined in (7), respectively.
Geometrically, this means that we can project the row-profiles in the factorial space of the loadings, and the column-profiles in the factorial space of the sub-discourses, such that both variables (our concepts) and cases (the six sub-discourses) can be located within the same coordinate system, and their locations can be interpreted in equivalent ways. In Figure 4, we make use of this property: Below panel a, which represents the row-profiles as the barycentres of the columns (see above), panel b represents the column-profiles as the barycentres of the rows on the same two factors as above. This display permits the joint interpretation of both representations, wherein the time-varying concept loadings can be tied to the composition of the news coverage of the respective sub-discourses, and each sub-discourse can be characterized based on its alignment within the space spanned by the respective concepts. The origin of the axes represents, at any point in time, the 'average (or typical) sub-discourse' and the 'average (or typical) concept', respectively. As the semantic meaning captured by each dimension is expressed by the time-varying factor loadings of all 525 concepts, tracking how key labels and charged concepts load on the respective axes offers fast and informative access to interpreting what framing choices are expressed by each identified factor. Illustrating the time-varying factor loadings of all 525 concepts would result in a dense configuration of overlapping curves. In Figure 4b, we thus represent the time-varying trajectories of five concepts selected to illustrate five different ways of framing the crisis. Specifically, we represent the time-varying loadings of the main labels used to define the crisis that had been coded among the 525 concepts: as Euro crisis (green line), debt crisis (black), financial crisis (pink), banking crisis (red), or economic crisis (blue). Reading the figure in conjunction with the insights obtained from the location of sub-discourses on the same axes (panel a), we can see that the Greek debate persistently focuses on the economic crisis (positive loadings on Factor 1), whereas the German debate is more concerned about those aspects related to the financial and banking system, as well as the debt crisis (negative loadings). At the same time, the between-country differences expressed by Factor 1 are not fully stable, as numerous other concepts align in different constellations over time, such as references to a Euro crisis, which contribute more to the German papers' framing (positive values) throughout the second year of the crisis (dominated by the controversy over Euro-bonds) but are more associated with the Greek coverage (negative values) before and after that.
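The following self-contained sketch (Python with NumPy; the toy table and all symbols are our own, not taken from the paper) verifies the two transition relations above numerically for a single table: the row coordinates equal the profile-weighted average of the column coordinates divided by the square root of the corresponding eigenvalue, and vice versa.

```python
import numpy as np

# toy contingency table: 6 sub-discourses (rows) by 10 concepts (columns)
rng = np.random.default_rng(4)
X = rng.poisson(4.0, size=(6, 10)) + 1.0

P = X / X.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                    # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, d, Vt = np.linalg.svd(S, full_matrices=False)
F = U * d / np.sqrt(r)[:, None]                        # row (sub-discourse) coordinates
G = Vt.T * d / np.sqrt(c)[:, None]                     # column (concept) coordinates

R = P / P.sum(axis=1, keepdims=True)                   # row profiles: each row sums to 1
C = (P / P.sum(axis=0, keepdims=True)).T               # column profiles: each row sums to 1

j = 0                                                  # leading (non-trivial) axis
assert np.allclose(F[:, j], (R @ G)[:, j] / d[j])      # factors = barycentre of loadings / sqrt(eigenvalue)
assert np.allclose(G[:, j], (C @ F)[:, j] / d[j])      # loadings = barycentre of factors / sqrt(eigenvalue)
```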
Regarding those distinctions in the framing of the crisis captured by Factor 2, Figure 4 shows that none of the crisis labels contribute to the distinctive framing at the outset of the crisis, before it was recognized as such and further defined. By week 20 (February–March 2010), references to the banking crisis load positively, where the sub-discourse of the Greek left was located (see above), while the other Greek sub-discourses prefer a focus on the Euro crisis, which loads negatively. By week 55, references to an economic and Euro crisis (positive loadings, associated with the Greek right) contrast against references to a banking crisis (negative loadings, associated with the German left). In week 110 (November 2011), all crisis labels are again in a position close to the centre, reflecting their secondary role in defining present framing conflicts; financial, banking, and debt crisis load slightly above zero (associated with the shared view of the German debate and the Greek right), and economic crisis slightly below, associated with the Greek left. Beyond these five labels, which of course offer only a very cursory understanding of the respective interpretations, we could further examine the loadings of the remaining 520 concepts to reconstruct how the differences in interpretation captured by each factor evolve and are associated with the coverage of different outlets over time. While we will not discuss this in detail here, the analysis of concept loadings shows that a focus on micro- versus macro-economic effects and mechanisms structures the meaning expressed on Factor 2 at most times, albeit with shifting emphases (e.g. on employment, productivity, inflation, etc.). The analysis also shows that there is no persistent right–left cleavage, either within the Greek or within the German public debate.
As explained in Section 4, ECA is performed on the time-varying matrix of counts. It is interesting to investigate the results obtained by applying an 'ordinary' (time-invariant) CA to the same data. To this end, we extract loadings and factors from the time-invariant (aggregated) matrix of counts, and plot the results in Figure 5. The figure correctly identifies the two dominant meanings of both factors: Factor 1 still distinguishes the Greek from the German public debate, while Factor 2 captures differences of perspective between the domestic political camps. However, the loss of the temporal dimension hides not only the important contribution of the German domestic debate to Factor 2 (which can be seen plainly from Figure 2), but also the important evolution of cleavages and alliances in the domestic debates: As we have shown in Figure 4, the Greek right is not consistently opposed to the Greek left and centre, as Figure 5 indicates, nor are the three German sub-discourses remotely as well aligned over time as their positioning in Figure 5 would suggest. The time-invariant view also prevents us from recognizing the close alignment between the Euro crisis narrative and the Greek right, or the very variable use of the 'Bank crisis' and 'Euro crisis' labels among German media.
6.4 For each factor, separate auto-regressive coefficients measure dynamic causality within domestic sub-discourses as well as between foreign sub-discourses
Because in ECA, unlike in EFA, separate factors can be obtained for each sub-discourse, it is also possible to separately estimate the auto-regressive coefficients representing the relative evolution of these factors over time.
Granger (1969) defined a concept of causality which, under suitable conditions, is fairly easy to deal with in the context of VAR models. Granger causality is a statistical concept derived from the notion that causes cannot occur after effects, and that if one variable is the cause of another, knowing the status of the cause at an earlier point in time can enhance prediction of the effect at a later point in time (see Lütkepohl, 2005, Section 2.3). The VAR model has been widely employed in econometric analyses (Granger & Newbold, 1986) to elucidate underlying mechanisms using Granger causality. In a VAR model, such as our equation (18), causalities and non-causalities can be determined by looking at the coefficients of the auto-regressive matrix in (20). Self-influences are measured by the diagonal entries of this matrix, whereas cross-influences are measured by its off-diagonal entries. Cross-influences should be assumed only when changes cannot be predicted on the basis of auto-regression.
For our application, we compute confidence bands for the VAR coefficients obtained by fitting to the factors a time-invariant (or standard) VAR model (see Reinsel, 1997, Section 4.3.1), and we focus on those time-varying coefficients that (i) are significantly different from zero, and (ii) take values outside the bands. With this approach, we achieve two goals: (a) we can interpret the coefficients in terms of Granger causality, and (b) we can appreciate the added value of ECA versus CA.
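A minimal sketch of this diagnostic logic follows (ours, in Python with NumPy; it does not reproduce the local estimator in (27), but contrasts a kernel-weighted local VAR(1) fit with a global fit and an approximate 95% band around the latter; the simulated series, bandwidth, and function names are illustrative assumptions):

```python
import numpy as np

def var1_ols(Y, weights=None):
    """(Weighted) OLS fit of a VAR(1) with intercept, Y_t = mu + A Y_{t-1} + e_t.
    Returns mu, A, and approximate standard errors for the stacked coefficients."""
    X = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])          # regressors: [1, Y_{t-1}]
    Z = Y[1:]
    W = np.eye(len(Z)) if weights is None else np.diag(weights[1:])
    XtWX = X.T @ W @ X
    B = np.linalg.solve(XtWX, X.T @ W @ Z)                      # (k+1) x k, one column per equation
    resid = Z - X @ B
    sigma = (resid.T @ W @ resid) / (W.trace() - X.shape[1])
    se = np.sqrt(np.diag(np.kron(sigma, np.linalg.inv(XtWX)))).reshape(B.shape, order='F')
    return B[0], B[1:].T, se                                    # mean, VAR matrix A, std. errors

# toy factor series with a smoothly drifting cross-influence of series 2 on series 1
rng = np.random.default_rng(2)
T, k = 140, 2
Y = np.zeros((T, k))
for t in range(1, T):
    A_t = np.array([[0.5, 0.4 * np.sin(2 * np.pi * t / T)], [0.0, 0.3]])
    Y[t] = A_t @ Y[t - 1] + rng.normal(scale=0.3, size=k)

mu_g, A_g, se_g = var1_ols(Y)                                   # global (time-invariant) VAR(1)
band = 1.96 * se_g[1:].T                                        # ~95% band for the entries of A

grid, bw = np.arange(T), 15.0
A_local = np.empty((T, k, k))
for s in grid:                                                  # kernel-weighted local VAR(1) per week
    w = np.exp(-0.5 * ((grid - s) / bw) ** 2)
    _, A_local[s], _ = var1_ols(Y, weights=w)

outside = np.abs(A_local[:, 0, 1] - A_g[0, 1]) > band[0, 1]     # cross-influence: series 2 -> series 1
print(outside.sum(), "weeks outside the time-invariant band")
```

Weeks at which the local cross-coefficient leaves the band around the global fit are the candidates for time-varying Granger-causal influence in the sense of criteria (i) and (ii) above.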
In Figure 6, we show these evolutionary VAR coefficients (coloured curves) alongside the coefficients obtained by an 'ordinary', time-invariant model (black flat lines). These coefficients can be interpreted as the extent to which one sub-discourse's use of the framing represented by the respective factor influences the coverage of the same and other sub-discourses in the subsequent week. Disregarding the very beginning and end of the lines, which respond heavily to the initial and final states of the recorded data, Figure 6a shows a profound influence of the Greek left onto the Greek right sub-discourse on Factor 1 (the country-specific perspective; weeks 23–58, from March to November 2010), marking the formation of the Greek popular opposition discourse against the social-democratic government's rapidly escalating measures to contain the growing crisis. However, as can be seen in panel b, the emerging shared interpretation of the economic crisis does not include the left's emphasis on the role of banks and debt: At no time is there a significant influence in the same direction on Factor 2 (the specific interpretation of the crisis). Instead, the Greek left-wing sub-discourse temporarily exerts its influence upon the Greek centre's interpretation of the crisis (Factor 2, panel c), raising pressure to consult the electorate about planned measures, and culminating in a referendum proposed in week 105, immediately preceding the collapse of the government. Earlier, during the rise of the debt- and austerity-focused discourse among the Greek centre, the centrist sub-discourse had markedly distanced itself from the left, which was at that time fanning mass protests against the government (week 72, April 2011). Panel d, finally, shows that following the establishment of Papademos' technocratic government in Greece (week 111, December 2011), the sub-discourse of the Greek right exerted a persistent positive influence upon its German counterpart, reflecting efforts by the German conservative government to support Greece's efforts at managing the crisis. Beyond these exemplary influences, an analysis of all possible interactions shows a rich web of temporary, mutual or one-sided influences, which are not limited to sub-discourses in one country alone, but also illustrate close inter-dependencies between the public debates in both countries. By contrast, the time-invariant measure of influences identifies only very few enduring influences (mostly of the Greek left on Factor 1), and some auto-regressive dynamics within the same sub-discourse on Factor 2, and it misrepresents the time-changing directionality and strength of mutual inter-dependencies.

Horizontal axis: time in weeks, running from 1 October 2009 to 30 June 2012. Vertical axis: selected entries of the time-varying VAR matrices defined in (20), estimated according to (27). Panel a: first factor of Greece-right caused by the first factor of Greece-left. Panel b: second factor of Greece-right caused by the second factor of Greece-left. Panel c: second factor of Greece-centre caused by the second factor of Greece-left. Panel d: second factor of Germany-right caused by the second factor of Greece-right.
7 Conclusions
In this paper, we identify and interpret the low-dimensional latent factors underlying the evolution over time of a high-dimensional set of semantic concepts with respect to three ideologically differentiated sub-discourses (left, centre, and right) in two countries (Greece and Germany). Our objective is threefold: (i) reducing the number of correspondences involved in a contingency table, (ii) describing the evolutionary (or smoothly time-varying) structure of the relationships between the reduced variables, and (iii) allowing N to grow at least as fast as T, so as to meet the typical challenge presented by 'big data'. The novel ECA introduced in this paper meets this threefold objective.
ECA enables an analysis of the co-evolution of linked debates that goes beyond the capacities of existing methodological approaches in several critical ways. Drawing upon the evolutionary perspective adopted from EFA (Motta & Baden, 2013), ECA permits a rigorous comparative analysis of expressed meanings, despite the fact that the specific contents of the debate are constantly changing (Baumgartner et al., 2008). By distilling latent variables as the dominant factors structuring co-variation in a much higher-dimensional data process, it enables an analysis of auto-regressive and cross-lagged dependencies without the need to focus on only a few, static, and pre-selected variables. Through the joint estimation of underlying factors, it is possible to delineate which meanings are shared (the trivial factor), which distinguish between enduringly different perspectives (Factor 1; notably, in our example, the labelling as primarily an economic or a financial crisis), and which constitute specific, transient frames (Factor 2, which in our case captured a variety of temporarily salient disagreements about the interpretation of the crisis). In particular, the barycentric properties of ECA allow an analysis that locates both the changing associations between semantic concepts and the variable alignments between different sub-discourses within the same low-dimensional space. In this way, we can understand not only how different interpretations shape the co-evolution of debates at different times, but also how each sub-discourse contributes to these controversies and where it positions itself therein.
Smoothness is a key assumption in our model. The loading matrix in (3), as well as the mean vector and the VAR matrix in (4), are allowed to vary smoothly (i.e. slowly) over time. Another assumption of our approach is that the factors in (4) are regressed upon their own first lagged value only, thus excluding the possible dependence of the factors upon higher-order lagged values. Our method might therefore not be appropriate when the underlying parameters undergo transitions that are abrupt (i.e. discontinuous) over time, and/or in the case of long-memory latent factors.
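To fix ideas, the following sketch (our own illustrative parametrization in Python with NumPy; the exact specifications of (3) and (4) are given in the paper, and the functional forms, dimensions, and noise scales below are assumptions) simulates smoothly time-varying loadings together with factors following a first-order VAR with time-varying mean and coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, k = 140, 30, 2
u = np.arange(T) / T                                   # rescaled time in [0, 1)

# smoothly (slowly) time-varying parameters: illustrative functional forms only
mu = np.column_stack([np.sin(2 * np.pi * u), 0.5 * u])                          # T x k mean
A = np.stack([np.array([[0.5, 0.3 * v], [0.0, 0.4 - 0.2 * v]]) for v in u])     # T x k x k VAR matrix
Lam = np.stack([np.column_stack([np.cos(np.pi * v + np.linspace(0, 3, N)),
                                 np.sin(np.pi * v + np.linspace(0, 3, N))])
                for v in u])                                                    # T x N x k loadings

# factors: first-order VAR with time-varying mean and coefficients (schematic version of (4))
f = np.zeros((T, k))
f[0] = mu[0]
for t in range(1, T):
    f[t] = mu[t] + A[t] @ (f[t - 1] - mu[t - 1]) + rng.normal(scale=0.2, size=k)

# observations: time-varying loadings applied to the factors, plus noise (schematic version of (3))
Z = np.einsum('tnk,tk->tn', Lam, f) + rng.normal(scale=0.1, size=(T, N))
```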
Given the methodological focus of the present paper and the limitations of space, we have only sketched some illustrative analyses above, which fall far short of the full analysis enabled by ECA. The presented analyses serve to demonstrate the specific capacities added by ECA: They illustrate how ECA enables us to trace complex patterns of textual co-variation across different sub-discourses and over time to understand the time-changing focus of salient controversies; to identify the differential power relations among sub-discourses, which gain and lose dominance in their capacity to define prevalent perspectives over time; and to reconstruct the inter-dependencies between the framing choices found in different sub-discourses, reflecting the formation of temporary alliances (e.g. among the Greek left and right early into the crisis) and the exertion of political pressure (e.g. forcing the Greek government to call a referendum in response to popular protests). Our findings not only underscore the critical value of modelling ongoing, evolutionary changes in these debates, but additionally reveal a rich web of reciprocal influences between different sub-discourses, which remain hidden in conventional CA. The variable grain of analysis afforded by ECA lends itself to both further statistical analysis (e.g. adding covariates, such as econometric data) and a nuanced qualitative investigation (e.g. adding event timelines or information about editorial lines) of comparative patterns.
Extending the gaze to social scientific research more generally, there exists a large variety of phenomena that share a similar data structure to the case documented here: From intergovernmental negotiations (e.g. delegations whose members hold time-varying preferences on numerous issues), to group-dynamic interactions (e.g. social media platforms whose users interact with one another in time-changing ways), to psychological processes (e.g. individuals whose beliefs or attitudes toward many objects evolve in inter-dependent ways), there are many social processes that can be characterized by time-variant data on a large number of variables obtained in equivalent fashion from multiple sites for the purpose of comparative analysis. The same is true for applications beyond the social sciences, be they complex patterns of neuronal activation, meteorological data, market surveys, or radio signals. ECA offers an avenue for studying such data in ways that do not require researchers to assume that the underlying processes responsible for complex observations are known and stable in order to obtain low-dimensional time-series data; it does not require the collapse of diachronic changes in order to obtain cross-sectional data matrices accessible to conventional dimension-reduction techniques; nor is it limited to the case-wise analysis of complex evolutionary change, as EFA has been. Despite its focus on the macro-level dynamics of the time-varying interrelations between a finite set of latent variables, ECA maintains the link to the underlying micro-level patterns of evolving manifest data. As the procedure is entirely formal and makes no assumptions about the specific nature of the represented processes, it can easily be adapted to the analysis of other linked, high-dimensional evolutionary processes both within and beyond the social sciences. Given that such processes are in fact quite common, and that existing methodological tools are often ill-equipped to permit a comparative analysis of complex, co-evolving data, ECA offers a valuable addition to the methodological toolbox.
Acknowledgments
Christian Baden would like to thank Dimitra Dimitrakopoulou for her help with collecting and interpreting the data.
Funding
Data collection for this study has been supported by the European Union, Marie Skłodowska-Curie Grant No. 627682.
Data availability
The dataset used in this article is available as supplementary material. Source code for the reproduction of our results is available at https://github.com/giovanni-motta/.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series A.
References
Footnotes
It should be noted that no text-driven method can directly identify frames, which are cultural objects that transcend the text (Van Gorp, 2007); rather, any frame invoked inevitably leaves traces in the text, which can be recognized as lexical indicators of referenced semantic meanings. The interpretability of inductively detected patterns as frames must be subsequently validated against available cultural meanings (Baden, 2018; Nicholls & Culpepper, 2021).
The usual CA definition subtracts out the expected values at the start so that the so-called 'trivial' factor is eliminated; the remaining dimensions are then labelled from 1 onwards.
All interpretations are exemplary and serve to illustrate those analytic opportunities enabled by ECA. We do not aim to offer a fully-developed analysis of the data. For all discussed framing choices, we did ascertain that these are indeed interpretable as semantically coherent frames.
A figure with the 'typical sub-discourse' would be difficult to interpret, as we would have to distinguish among as many different colours as there are concepts.
Author notes
Conflict of interest: None declared.