Mark Schweizer, Comparing holistic and atomistic evaluation of evidence, Law, Probability and Risk, Volume 13, Issue 1, March 2014, Pages 65–89, https://doi.org/10.1093/lpr/mgt013
Abstract
Fact finders in legal trials often need to evaluate a mass of weak, contradictory and ambiguous evidence. There are two general ways to accomplish this task: by holistically forming a coherent mental representation of the case, or by atomistically assessing the probative value of each item of evidence and integrating the values according to an algorithm. Parallel constraint satisfaction models of cognitive coherence posit that a coherent mental representation is created by discounting contradicting evidence, inflating supporting evidence and interpreting ambivalent evidence in a way coherent with the emerging decision. This leads to inflated support for whichever hypothesis the fact finder accepts as true. Using a Bayesian network to model the direct dependencies between the evidence, the intermediate hypotheses and the main hypothesis, parameterised with (conditional) subjective probabilities elicited from the subjects, I demonstrate experimentally how an atomistic evaluation of evidence leads to a convergence of the computed posterior degrees of belief in the guilt of the defendant of those who convict and those who acquit. The atomistic evaluation preserves the inherent uncertainty that largely disappears in a holistic evaluation. Since the fact finders’ posterior degree of belief in the guilt of the defendant is the relevant standard of proof in many legal systems, this result implies that using an atomistic evaluation of evidence, the threshold level of posterior belief in guilt required for a conviction may often not be reached.
1. Introduction
In legal trials, fact finders need to form a conviction regarding the truth of factual statements based on a mass of often incomplete, ambivalent and contradicting evidence. There are two fundamentally different ways the fact finder can go about this difficult task: she can either assess the evidentiary strength of each item of evidence and then integrate her individual assessments according to some general rule to arrive at a conclusion, or she can assess the whole mass of evidence globally, forming a holistic overall impression of the case. The former method is sometimes referred to as ‘atomistic’ evaluation of evidence, while the latter is called ‘holistic’ (Twining, 2006, p. 309). ‘Atomistic’ and ‘holistic’ can only describe the fact finding process at a very general level; a number of different approaches to the evaluation of evidence fall into each category. In this article, one currently popular model of holistic evaluation of evidence based on cognitive coherence is contrasted with a leading theory of atomistic evaluation of evidence, namely subjective probability theory.
Holistic evaluation of evidence assumes that legal decision making is based on constructing and evaluating coherent interpretations or stories from the available items of evidence (see Pennington & Hastie, 1992 for a classic approach). Cognitive coherence theories understand evaluation of evidence as a process of forming a coherent mental representation of the evidence, integrating it with the background knowledge of the subject (Simon et al., 2004). A more coherent mental representation leads to higher subjective confidence that the representation is correct (Glöckner et al., 2010, p. 219). During the evaluation of the evidence, coherence is maximized by discounting contradicting evidence, inflating supporting evidence and interpreting ambivalent evidence in a way that is coherent with the emerging decision (Simon, 2004, p. 522). One important empirical prediction of cognitive coherence theories of evidence evaluation is that the mental model of the case ‘shifts’ during the decision process towards an interpretation coherent with the emerging decision (Holyoak & Simon, 1999; Carlson & Russo, 2001; Engel & Glöckner, 2012). The result of this process, referred to as ‘coherence shift’, is that even when the evidence has little probative weight, the fact finder has a high degree of confidence in having made the correct decision (Simon et al. 2004, p. 819). If the standard of proof that has to be met before a fact finder may accept a factual proposition as true is understood as a degree of conviction, or belief, in the truth of the allegation, cognitive coherence theories of evidence evaluation imply that the threshold value may be reached even when the evidence is ambivalent, weak or partially missing (Simon, 2004, p. 519).
Subjective probability theory is also based on a notion of coherence, but on an entirely different one. According to subjective probability theory, the partial beliefs of a subject are coherent if they do not violate the axioms of probability theory, namely positivity, certainty and additivity (Finetti, 1937). The subject is assumed to hold some prior belief in the truth of a proposition, which he updates when he learns of new evidence. If the subject is to remain coherent in the sense of subjective probability theory, the updating must be done according to Bayes' rule, which is why people who believe that degrees of belief should conform to the axioms of probability theory are often referred to as 'Bayesians'. Subjective probability theory is primarily a normative theory of forming a conviction in the truth of a proposition; nobody claims (anymore) that it accurately describes the actual psychological process of belief formation (Kaye, 1988, p. 178, but see Lagnado, 2011). Whether it is applicable in the context of evidence evaluation by judicial fact finders is subject to a decades-old controversy (see Tillers, 2011 for an overview). Ensuring the coherence of partial beliefs, in the sense of subjective probability theory, quickly becomes computationally intractable (Callen, 1982). However, in the late 1980s, algorithms for inference in so-called 'Bayesian networks' were developed, which allow the compact representation of the full joint probability distribution using a directed graph and conditional probabilities (Pearl, 1988). A number of authors have suggested using Bayesian inference networks for the evaluation of evidence in legal contexts (Edwards, 1991; Robertson and Vignaux, 1992; Kadane and Schum, 1996; Taroni et al., 2006; Fenton and Neil, 2011; Juchli et al., 2012; Lagnado et al., 2012; Fenton et al., 2012). Evidence evaluation with the help of a Bayesian network is atomistic as it requires an assessment of the probative value of each item of evidence, and the prior probability of each hypothesis. It is holistic in the sense that it calculates the overall impact of the interrelated items of evidence on the probability of the hypothesis of interest.
Both cognitive coherence theories and subjective probability theory are models of belief formation. Both have been proposed as normative theories of evidential reasoning in law (see references in Sections 3 to 5). The coherentist approach is certainly the more natural one, and it has been suggested that it may be superior to a probabilistic approach for this reason (Allen, 1991, pp. 410, 413; Amaya, 2008, p. 307). There is indeed ample empirical evidence that cognition is a process of coherence maximization, and that fast, automatic, unconscious, i.e. intuitive, processing of information is coherentist (see references in Section 2). This does not, however, imply that a coherentist, holistic, account of evidence evaluation is necessarily superior to a probabilistic account from a normative point of view. This article investigates how the posterior belief in the truth of a main hypothesis, in this case whether the defendant is guilty of taking money from a safe, differs when the evidence is evaluated holistically, through maximization of cognitive coherence, versus atomistically, within a probabilistic network. For the atomistic evaluation of the evidence, the prior beliefs and the likelihoods for each item of evidence and each intermediate hypothesis are elicited from the subjects. The resulting parameters are then integrated using a Bayesian network, allowing the computation of the posterior belief in the truth of the main hypothesis for each subject based on her own partial beliefs. This posterior degree of belief is the degree of belief the subject should have, provided her partial beliefs are coherent in the sense of subjective probability theory, and can be contrasted with the degree of belief in the guilt of the defendant based on a holistic assessment of the case. The main result of the experiment reported in this article is that the average degree of belief in the guilt of the defendant is lower for those who convict when the evidence is assessed atomistically rather than holistically, while it is higher for those who acquit. While the subjects interpret the same evidence in completely different ways when they assess it holistically, their computed posterior probabilities of guilt converge when their atomistic assessments are integrated according to the logic of subjective probability theory.
The rest of this article is structured as follows: Section 2 briefly explains the Parallel Constraint Satisfaction (PCS) model of cognitive coherence, Section 3 provides a cursory introduction to subjective probability theory, and Section 4 introduces Bayesian networks as decision aids for the evaluation of evidence. Section 5 considers the implications of the empirically observed coherence shifts for normative models of holistic evaluation of evidence, and Section 6 sets out the hypotheses to be tested experimentally. Section 7 describes the experiment, its results and its limitations. The conclusion summarizes the main contributions of this article.
2. Holistic evaluation of evidence
2.1 The story model of evidence evaluation
The story model of juror decision making is an early holistic account of evidence evaluation. Pennington and Hastie, building on the work of Bennett and Feldman (Bennett and Feldman, 1981), propose that jurors construct stories to 'make sense' of the evidence presented at trial (Pennington and Hastie, 1986; Pennington and Hastie, 1988; Pennington and Hastie, 1993). A story—a chain of events connected by physical or intentional causality (Pennington and Hastie, 1993, p. 196)—will be found convincing if it covers (explains) all or most of the evidence, is coherent, and is unique (Pennington and Hastie, 1993, p. 198 sq.). The coherence of a story has three components: consistency, plausibility and completeness. A story is consistent if it does not contain internal contradictions; it is plausible if it corresponds with the decision maker's background beliefs; and it is complete if it has all the parts of an episodic structure (initiating events, psychological responses, goals, actions and consequences; Pennington and Hastie, 1993, p. 197). Finally, a story is unique if it is the only one judged as coherent; a lack of uniqueness will lead to reduced confidence in a story and a decision (Pennington and Hastie, 1993, p. 199).
According to the story model, the construction of the story is guided by the evidence, but it also influences the interpretation of the evidence. Decision makers will generate evidence that does not actually exist but is coherent with their constructed story (Pennington and Hastie, 1986, p. 249; Pennington and Hastie, 1988, p. 526 sq.), and will ignore or downplay the importance of evidence that is incoherent with the emerging story (Wagenaar et al., 1993, p. 45). This may lead to the acceptance of plausible stories that are insufficiently anchored in evidence (Wagenaar et al., 1993, p. 75).
2.2 Coherence construction by parallel constraint satisfaction (PCS)
In the last decade or so, the PCS model of coherence-based reasoning has been advanced as an alternative descriptive model of the fact finder’s decision making process (Holyoak and Simon, 1999; Simon et al., 2004; Simon, 2004; Engel and Glöckner, 2012). The story model of evidential reasoning and models of coherence maximization by PCS are closely related (Byrne, 1995; Simon et al., 2004, p. 816; Thagard, 2004, p. 243; Engel and Glöckner, 2012, p. 1). The PCS model of coherence construction is both more specific and more general than the (less formal) story model. It is more specific in the sense that it provides a computational algorithm that allows computing the coherence of a mental representation, and it is more general since it does not posit that coherence is necessarily constructed by forming a narrative structure covering the evidence. PCS models are hence applicable to decision making in situations where it may seem contrived to speak of a ‘story’ or ‘narrative’ structuring the evidence (Simon et al., 2004, p. 816).
In line with a basic claim from Gestalt psychology, cognitive coherence theories regard the assessment of evidence as holistic and relying at least partially on an automatic process that has been adapted from perception (Simon et al., 2004). The process of constructing cognitive coherence can be computationally implemented as a PCS process (Thagard, 1989; Thagard and Verbeurgt, 1998; Holyoak and Simon, 1999). A constraint is a relationship between two cognitions (propositions). The coherence problem consists of dividing the set of propositions into two sub-sets of accepted (or true) and rejected (or false) propositions in a way that satisfies the most constraints. If two propositions are coherent (fit together), the constraint is positive and it is satisfied if the two statements connected by it are in the same sub-set, while an incoherent relationship is represented by a negative constraint which is satisfied if the two statements connected by it are in different sub-sets (see Thagard, 2000, p. 16 sq., for a full exposition). In Thagard's theory of explanatory coherence, two propositions fit together or cohere if one is an explanation for the other or the two propositions together explain a further proposition. Two propositions are incoherent if they are logically contradictory or if they compete to explain a further proposition (i.e. if either of them is sufficient to explain the further proposition; see Thagard, 1989, p. 436 sqq.; Thagard, 2000, p. 43 sqq., for a full exposition). The strength of the (in)coherence is expressed as the weight of the constraint. In most cases, not all the constraints can be satisfied at the same time. The goal is to divide the propositions into 'accepted' and 'rejected' propositions so that the total weight of the satisfied constraints is maximized.
There can be no general algorithm that exactly solves all PCS problems in polynomial time (Thagard and Verbeurgt, 1998). However, a number of algorithms for approximate solutions are available; the most popular, and the one almost exclusively used in psychological research, uses a representation of the problem in a connectionist network (Read et al., 1997). In a connectionist network, positively linked variables excite each other while negatively linked variables inhibit each other. In an iterative process, activation spreads through the network. Each and every element influences, and is influenced by, the entire network, so that every processing cycle results in a slightly modified state of the network. The core feature of constraint satisfaction mechanisms is that the connectionist network will reconfigure itself until it settles at a point of maximal coherence. In complex decisions, this process forces coherence upon a mental representation of the task that is initially incoherent. Since the links between nodes in a connectionist network are bidirectional, the evidence influences the hypotheses, but the activation of the hypotheses also influences the interpretation of the evidence (Holyoak and Simon, 1999). The formation of coherence in an iterative process therefore leads to a polarization of the evidence: evidence that supports the emerging decision is strongly endorsed while contradicting evidence is dismissed, rejected or ignored. These so-called 'coherence shifts' or more generally 'predecisional information distortions' (Russo et al., 2008) have been demonstrated in a variety of decision making tasks (Brownstein, 2003), most notably also for legal decision making (Holyoak and Simon, 1999; Carlson and Russo, 2001; Hope et al., 2004; Lundberg, 2004; Simon et al., 2004; Engel and Glöckner, 2012; Glöckner and Engel, 2013). Coherence shifts occur unconsciously and affect evidence that is logically unrelated (Holyoak and Simon, 1999, p. 11 sq., p. 18). For example, Simon et al. show that when mock jurors receive, after tentatively expressing which verdict they lean towards, a new piece of evidence that is either strongly exculpatory or incriminatory, those who change their verdict from the tentative to the final stage also change their evaluation of the other evidence, logically unrelated to the new piece of (DNA) evidence (Simon et al., 2004, p. 824 sqq.).
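To make the mechanism concrete, the following is a minimal sketch of coherence maximization by iterative activation updating in a connectionist network, in the spirit of the algorithms cited above; the node names, weights and update constants are illustrative assumptions of mine, not the parameters of any published model.

```python
# Minimal parallel constraint satisfaction sketch (illustrative parameters).
# Propositions are nodes; symmetric weights encode coherence (positive) or
# incoherence (negative). Activations are updated iteratively and kept
# between -1 (rejected) and +1 (accepted).

nodes = ["guilty", "innocent", "technician_testimony", "alibi_testimony"]

weights = {
    ("guilty", "technician_testimony"): 0.4,    # testimony coheres with guilt
    ("innocent", "alibi_testimony"): 0.4,       # alibi coheres with innocence
    ("guilty", "innocent"): -0.6,               # contradictory hypotheses
    ("technician_testimony", "EVIDENCE"): 0.6,  # both items are anchored in
    ("alibi_testimony", "EVIDENCE"): 0.3,       # observation (unequally so)
}

def w(a, b):
    """Symmetric weight between two nodes (0 if unconnected)."""
    return weights.get((a, b), weights.get((b, a), 0.0))

activation = {n: 0.01 for n in nodes}
activation["EVIDENCE"] = 1.0     # special node clamped at maximal activation

decay, floor, ceil = 0.05, -1.0, 1.0
for _ in range(200):             # iterate until the network settles
    updated = {}
    for n in nodes:
        net = sum(w(n, m) * a for m, a in activation.items() if m != n)
        if net > 0:
            new = activation[n] * (1 - decay) + net * (ceil - activation[n])
        else:
            new = activation[n] * (1 - decay) + net * (activation[n] - floor)
        updated[n] = max(floor, min(ceil, new))
    activation.update(updated)   # EVIDENCE is never updated and stays clamped

# After settling, the propositions accepted together have high activation,
# while the losing hypothesis and the evidence supporting it are discounted:
# the 'coherence shift' in miniature.
print(activation)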
2.3 Coherence shifts lead to inflated confidence
The devaluation of contradicting evidence and the inflation of supporting evidence, as well as the interpretation of ambiguous elements as supportive of the emerging decision, lead to an inflated confidence in having made the correct decision. In other words, although the evidence of the case is objectively weak, as is shown by the fact that the decision makers are split over whether the evidence supports a guilty verdict, both those who find the defendant guilty and those who find him innocent are quite confident that they have made the right choice (Holyoak and Simon, 1999; Simon et al., 2004; Glöckner and Engel, 2013). For example, under a scenario describing the case against a person accused of taking money from a safe, subjects are split over whether to convict or acquit the defendant (Simon et al., 2004; Engel and Glöckner, 2012; Glöckner and Engel, 2013). But the distribution of the confidence levels of the subjects is skewed towards high confidence in both those who convict and those who acquit the defendant (Simon et al., 2004, p. 819). Persons who find the defendant guilty express an average posterior degree of belief in the guilt of the defendant which is about twice as high as that expressed by those who find him innocent (roughly 80% versus 40%, see Glöckner and Engel, 2013, p. 241). In a holistic evaluation of the evidence, subjective confidence in the truth of the main hypothesis is mostly the result of coherence shifts during the decision making process, and is therefore a questionable standard of proof (but see Glöckner and Engel, 2013, p. 238, which shows that raising the standard of proof does have the desired effect of reducing the number of convictions given the same evidence).
3. Atomistic evaluation of evidence
3.1 Subjective probability theory as a normative model for the evaluation of evidence
According to the subjective interpretation of probability, probability is a degree of belief (Finetti, 1937). Unlike the frequentist interpretation, the subjective interpretation of probability allows one to speak intelligibly of the 'probability' of a single case (Hacking, 2008, p. 136). 'Subjectivists' or 'Bayesians', or, as I prefer to call them, 'coherentists', believe that the partial beliefs of a subject should (normatively) not violate the axioms of probability theory, i.e. positivity (probability is a non-negative real number), certainty (the probability of a certain event is 1) and additivity (the probability of one of several mutually exclusive events occurring is the sum of their individual probabilities). From positivity and certainty it follows immediately that probabilities are normalized, i.e. bounded between 0 and 1. A variety of arguments can be made why degrees of belief should conform to the axioms of probability theory. The least technical one is that unless the beliefs of a subject conform to the axioms of probability theory, the subject can be made the victim of a 'Dutch book', a set of bets that inflicts a certain loss on him, no matter how the state of the world turns out (Finetti, 1937; Christensen, 2007, p. 116 seq.).
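A minimal numerical illustration of the Dutch book argument (the numbers are chosen purely for concreteness): suppose a subject violates additivity by holding Pr(A) = 0.6 and Pr(not-A) = 0.6, and treats these degrees of belief as fair betting rates, i.e. she is willing to pay 0.6 for a ticket that pays 1 if A turns out true, and 0.6 for a ticket that pays 1 if A turns out false. An opponent who sells her both tickets collects

\[ 0.6 + 0.6 = 1.2 \]

and pays out exactly 1 whichever way A turns out, so the subject suffers a guaranteed loss of 0.2.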
The central importance of Bayes' theorem for subjective probability theory stems from the fact that, according to this theorem, the subject should update her prior belief in A when she learns that B is the case. For Bayesians, Bayes' theorem is a normative rule for the rational updating of beliefs (Good, 1950, p. 61).
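In the notation used in the next sentence, with A the hypothesis (taking the states a_true and a_false) and B the evidence, the odds form of Bayes' rule reads

\[ \frac{\Pr(a_{\mathrm{true}} \mid B)}{\Pr(a_{\mathrm{false}} \mid B)} = \frac{\Pr(B \mid a_{\mathrm{true}})}{\Pr(B \mid a_{\mathrm{false}})} \times \frac{\Pr(a_{\mathrm{true}})}{\Pr(a_{\mathrm{false}})}, \]

that is, posterior odds equal the likelihood ratio multiplied by the prior odds.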
This form of Bayes' rule makes transparent that it is the ratio Pr(B|a_true)/Pr(B|a_false), called the likelihood ratio, that determines the degree of change from prior to posterior odds, or from prior to posterior probability. In subjective probability theory, the likelihood ratio is therefore a measure of evidentiary strength (Good, 1983, p. 132).
Whether subjective probability theory is a useful model for the formation of a belief in the context of the forensic evaluation of evidence is subject to a debate that has been likened to a 40-year war (Park et al., 2010, p. 1) and has been reinvigorated by the decision of the Court of Appeal of England and Wales in R v. T ([2010] EWCA Crim 2439; for an introduction to the latest round of the controversy see Aitken, 2012). Some people take issue with the betting paradigm of subjective probability theory (Cohen, 1977, p. 90), while others are convinced that the mathematical expression of degrees of belief that are not grounded in observed relative frequencies will lead to 'wholly inaccurate, and misleadingly precise, conclusions' (Tribe, 1971, p. 1359; this is essentially also the position of the Court of Appeal in R v. T). As Taroni et al. have put it, the proof of the pudding is in the eating—the demonstration of the practical use of Bayesian inference should convince sceptics (Taroni et al., 2006, p. 23).
4. Bayesian networks as decision aids for the evaluation of evidence
Holding partial beliefs that are coherent in the sense of subjective probability theory quickly becomes impossible without some sort of decision aid (Charniak, 1991, p. 55). Bayesian networks, also referred to as ‘belief nets’ (Darwiche, 2009, p. 71), are a graphical representation of the direct dependencies among a set of variables and force coherence in the sense of subjective probability theory on the set of partial beliefs represented by the network (Charniak, 1991, p. 55).
A Bayesian network is a directed acyclic graph in which a node (variable) is connected by a directed edge to another node if the variable represented by the node has a direct influence on the other variable (for a general introduction to Bayesian networks see Taroni et al., 2006, p. 33 seq.). A directed graph is acyclic if there is no way to start at some node A and follow a sequence of edges that leads back to node A (colloquially, it does not contain a 'feedback cycle', Jensen and Nielsen, 2007, p. 34). A conditional probability table is associated with each node (root nodes are only associated with 'unconditional' or 'prior' probabilities), which gives the probability for each mutually exclusive state of the variable given its parents (a parent of a node is an immediate ancestor of this node, i.e. any node from which an edge points directly to it). In the network used here, each variable can take on only two states, which can be interpreted as 'true' and 'false'. Using the concept of conditional independence, Bayesian networks can represent all direct and indirect dependencies of the problem domain by only explicitly showing the direct dependencies.
The following simple example, adapted from Taroni et al. (2006, p. 39), may illustrate the concept. The subject holds some prior beliefs about the fairness of a coin, which can be fair (heads and tails on opposite sides), tails only or heads only. H is the variable that represents this prior belief, and H = fair, H = only heads and H = only tails are the three mutually exclusive states it can take. The subject now observes the outcome of a first throw of this coin (she cannot examine the coin). The variable E1 represents this evidence (while E2, E3 stand for the second and third toss), and it can take the states E1 = heads or E1 = tails. This obviously tells the subject something about the fairness of the coin, and this in turn will influence her expectation that the next toss of the coin lands on heads (see Fig. 1a). However, given the state of the coin, the outcome of the first toss will not tell the subject anything about the further outcomes. The variables E1, E2 and E3 are conditionally independent given H. This knowledge of the conditional independencies is brought to the table by the human expert, who knows that the first toss of a fair coin tells her nothing about the probable outcome of the second toss; it allows the more parsimonious representation of the problem given in Fig. 1b. Figure 1b also shows the (conditional) probability tables associated with each node of the network.
Fig. 1. Bayesian network with all dependencies (a) and only the direct dependencies (b) represented by directed arcs.
Using the probability tables of Fig. 1b, after observing three tosses that fall on heads in a row, the subject's belief that the coin is fair is reduced from 0.95 to 0.75. For more complex queries, the calculation is tedious with paper and pencil even for small networks and impossible for large networks. Algorithms have been developed that perform these calculations efficiently for large networks (Pearl, 1986; Lauritzen and Spiegelhalter, 1988). For certain classes of networks, an exact solution is impossible, but algorithms for approximate solutions exist (Darwiche, 2009, p. 340 seq.).
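As a concrete illustration, the following sketch computes this posterior by brute-force enumeration. Since the probability tables of Fig. 1 are not reproduced above, the prior over H used here (0.95/0.04/0.01) is an assumption, chosen only so that the result is consistent with the reduction from 0.95 to 0.75 mentioned in the text.

```python
# Exact inference for the coin network of Fig. 1b by enumeration.
# The prior over H is an assumption (see the note above); the tosses are
# conditionally independent given H, so the likelihood factorises.

priors = {"fair": 0.95, "only_heads": 0.04, "only_tails": 0.01}   # assumed Pr(H)
p_heads_given_h = {"fair": 0.5, "only_heads": 1.0, "only_tails": 0.0}

def posterior_fair(observed_tosses):
    """Pr(H = fair | observed tosses)."""
    joint = {}
    for h, prior in priors.items():
        likelihood = 1.0
        for toss in observed_tosses:
            p = p_heads_given_h[h]
            likelihood *= p if toss == "heads" else 1.0 - p
        joint[h] = prior * likelihood           # Pr(H = h) * Pr(tosses | H = h)
    return joint["fair"] / sum(joint.values())  # normalise over all states of H

print(round(posterior_fair(["heads", "heads", "heads"]), 2))   # -> 0.75
```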
For the user of Bayesian networks, knowledge of the algorithms is as unnecessary as knowledge of the internal workings of a calculator is for using a calculator (Fenton and Neil, 2011, p. 131). It is sufficient to know that the algorithms have been accepted by the scientific community as correct, and that different implementations lead to the same results. There are a number of both commercial and free software programs available for probabilistic inference using Bayesian networks. All calculations for this article were performed with SamIam (Sensitivity Analysis, Modeling, Inference And More) 3.0, developed by the Automated Reasoning Group of Professor Adnan Darwiche at UCLA. This software is free and well documented by a number of scientific papers and a book (Darwiche, 2009); however, the same results could have been obtained with any number of programs.
It must be noted that subjective probability theory is by no means the only 'atomistic' model of evidence evaluation. Cohen's 'inductive probabilities' (Cohen, 1977), the evidentiary value model of Ekelöf/Halldén/Edman (Ekelöf, 1964; Edman, 1973; Halldén, 1973) and Shafer-Dempster belief functions (Shafer, 1976) are also 'atomistic' models of the evaluation of evidence, in the sense that they require the assessment of the probative value of each item of evidence, and the overall assessment of the case is generated computationally by an algorithm. I do not wish to detract from the merits of these models. However, none of these models allows the computation of a normative degree of belief; they do not purport to be models of belief formation, but rather models of evidentiary support. Therefore, their results are not directly comparable to a holistic degree of belief in the guilt of the defendant. Only subjective probability theory allows a meaningful comparison of a degree of belief that has been arrived at intuitively with one based on normative rules.
5. The normative status of holistic models of evidential reasoning
While the story model was developed as a descriptive model of juror decision making (Pennington and Hastie, 1993, p. 192), legal scholars have long suggested that narrative theories of judicial proof may also serve as normative models of fact finding (Allen, 1991, 1994, 1997; MacCormick, 2005, p. 220 sqq.). According to Allen, jurors in civil cases compare the plausibility of the competing stories advanced by the plaintiff and defendant and decide for the side that advances the more plausible overall story, or explanation, of the evidence. The 'relative plausibility theory' is decidedly holistic—'… the battle is fought at the level of competing visions, not at the level of individual fact, and factual details are put to the service of establishing that the organizing vision is more likely than those offered in opposition' (Allen, 1991, p. 390). In criminal trials, the prosecution's story is to be accepted only if 'there is no plausible account consistent with innocence' (Allen, 1991, p. 382)—'consistent' is in this context to be understood as 'coherent', i.e. inconsistency is not limited to a logical contradiction, but includes accounts that do not fit well with background beliefs.
A number of authors have also suggested that PCS models of coherence construction can justify the acceptance of a hypothesis by a court of law as true (Thagard, 2004; Thagard, 2006; Amaya, 2008, p. 307, 2009, p. 142 sqq.; Pardo and Allen, 2008, p. 230 [referencing Thagard for criteria of a good explanation]). Critics point to the danger that the holistic approach may obscure the inconclusive and inconsistent nature of the evidence (Menashe and Shamash, 2005; Griffin, 2012), and that (merely) providing a best explanation is too permissive a standard that has historically led to the acceptance as true of many theories that have later turned out to be incorrect (Laudan, 2007).
In criminal matters, where the applicable standard of proof is 'beyond any reasonable doubt' (In re Winship, 397 U.S. 358 [1970]; for English law see Miller v. Minister of Pensions [1947] 2 All E.R. 372), holistic theories such as the story model or PCS models of coherence maximization face another problem. It is not sufficient for a guilty verdict that the state's story is more plausible than the defendant's story. Rather, the state's story must be true 'beyond any reasonable doubt'. Allen therefore requires that 'there is no plausible account consistent with innocence' for a guilty verdict. Clearly, it cannot be required that there is no account consistent with innocence—the reasonable doubt standard only requires the absence of any reasonable doubt, not the absence of any doubt, or any account, however unreasonable or implausible, consistent with innocence (Ninth Circuit Model Criminal Jury Instructions, 2003 ed., § 3.5—Reasonable Doubt—Defined: 'It is not required that the government prove guilt beyond all possible doubt. A reasonable doubt is a doubt based upon reason and common sense and is not based purely on speculation.'). Hence, normative story models of criminal proof must define how plausible the alternative account may be to still permit a guilty verdict, or in other words, how large the gap between the confidence in the story supporting a guilty verdict and the confidence in an alternative account inconsistent with a guilty verdict has to be (Laudan, 2007, p. 299 sq.). Thagard, who proposes his theory of explanatory coherence (a PCS model of coherence maximization) as a normative theory of criminal proof, admits as much in a remarkable act of intellectual candour: 'From the perspective of the theory of explanatory coherence, reasonable doubt might be viewed as an additional constraint… requiring that hypotheses concerning guilt must be substantially more plausible than one concerning innocence' (Thagard, 2003, p. 366 sq.). The difficulty, of course, lies in defining how substantial the gap in plausibility must be to support a guilty verdict.
Accepting that PCS models of decision making are excellent descriptive accounts of the fact finders' decision making process in legal trials raises another concern. The polarization of evidence predicted by PCS models—leading to inflated support for whichever hypothesis the decision maker accepts—suggests that the gap between the accepted story and an alternative story inconsistent with guilt may be perceived to be large even in cases where the evidence is objectively ambiguous and inconclusive. In other words, the subjective confidence in having found the right explanation for the evidence may be higher than justified by the evidence itself, leading to convictions in cases where there is in fact reasonable doubt. An open question is whether an atomistic evaluation of evidence, namely one that models the decision problem in a Bayesian network, counteracts the tendency in intuitive or holistic assessments of the evidence to find inflated support for whichever hypothesis is accepted as true.
6. Hypotheses
The experiment reported in this article sought to test three main hypotheses. The first hypothesis is that a holistic, intuitive assessment of the case would lead to highly different subjective probabilities of guilt between those who convict and those who acquit the defendant. This is based on the prediction of PCS models of evidence evaluation that people will maximize the coherence of the evidence and the verdict, leading to inflated support for whichever verdict they choose. The second hypothesis proposes that an atomistic evaluation of evidence reduces the inflated confidence in the truth of whichever hypothesis the subject accepts. I hypothesize that forcing the subjects to assess the likelihoods for each individual item of evidence and integrating the obtained values using a Bayesian network would lead to reduced 'coherence shifts'. This is based on the observation that hypothetical thinking helps reduce coherence shifts (Simon, 2004, p. 544). Thinking in a likelihood framework forces hypothetical thinking upon the subjects by making them consider that the observation may also have been made if the hypothesis to be tested were not true.
The third main hypothesis is that computing the posterior probability of guilt reduces the variability in the assessment of guilt compared to a holistic, intuitive assessment. This hypothesis is based on results by Schum and Martin who report that when the evaluation of evidence is decomposed into individual items, inter-individual differences in the evaluation of the evidence are reduced (Schum and Martin, 1982).
7. Experiment
7.1 Method
7.1.1 Participants
I invited 120 subjects to the Hermann Ebbinghaus lab at the University of Erfurt, Germany, for completion of a computer-based questionnaire. Sessions lasted about one hour. Subjects were paid €6 for participation. Six subjects could not complete the questionnaire because of a computer malfunction. Sixteen subjects provided values that resulted in networks that could not be queried (see Section 7.2 for an explanation) and were excluded from further analysis, leaving 98 subjects. The subjects were students between the ages of 19 and 47, with an average age of just under 24 (median 23); 72% were women. An overwhelming majority majored in pedagogy or psychology.
7.1.2 Material and procedure
Subjects first read the instructions for completing the questionnaire and answered corresponding test questions before reading the case material. They were instructed that they could imagine a subjective probability of x% as the expectation of blindly drawing a red ball from an urn containing 100 balls, x of which are red. Additionally, the meaning of conditional probability was explained, and some examples were given that were not from the domain of legal evidence. It was explained that likelihoods need not add up to one.1 Subjects were asked to state probabilities as percentages from 0% to 100%, this being more natural than the mathematical convention of expressing probabilities as numbers between 0 and 1.
Subjects then read the scenario of a case involving the (alleged) theft of money from a safe (the 'Jason Wells/Hans H. case'), a scenario that has been used in a number of psychological studies (Simon et al., 2004; Engel and Glöckner, 2012; Glöckner and Engel, 2013; a full transcript has been published in the last two references). The scenario, of just over 700 words, describes Hans, a 34-year-old married man with two children, employed at a construction company. Hans has recently been denied a promotion. He has a prior criminal record for attempted burglary at age 18, but has not since come into conflict with the law. One day, €5200 is missing from the company's safe. Eight people, among them Hans, have access to the safe, which was last opened at 7.14 pm. A technician testifies that he saw Hans leave the office in which the safe is located at about 7.15 pm. A surveillance video shows a car of the rare kind Hans drives leaving the office building at 7.17 pm, but the license plate is illegible. Another witness, Silvia, testifies that she saw Hans at a school function at 8 pm wearing different clothes than the ones he wore at work, and it would be difficult to get from the office to the school in less than 40 min at that time of day. The day after the disappearance of the money from the safe, Hans repays a bank loan of €4870. Hans claims he received this money from his sister-in-law, who owns a flower shop, but he cannot produce a receipt for the transaction. He explains this by pointing to the practice in the flower business of 'occasionally' doing business without issuing receipts.
The Hans case contains contradictory as well as missing evidence. Missing is the receipt, for which Hans offers an explanation, but Hans also fails to call his sister-in-law as a witness. The case is silent on why Hans does not offer the testimony of his sister-in-law, and the subjects are not expressly alerted to the omission.
After reading the scenario, subjects indicate whether they consider Hans guilty of taking the money or not. They then state their subjective probability of guilt ('holistic before'). Next, subjects indicate the subjective probability of guilt they think is required for a criminal conviction ('own standard'). They then read the following definition of the criminal standard of proof used by the German Federal Supreme Court (e.g. BGH, 30 July 2009 – 3 StR 273/09 = BeckRS 2009, 25658; translation into English by the author):

'The conviction of the judge does not require an absolute certainty that excludes other possibilities with logical necessity. An adequate degree of certainty that overcomes reasonable doubt is sufficient. The judge is not prohibited from drawing possible, albeit not cogent, inferences from facts if such inferences are supported.'

They indicate the subjective probability they think this standard requires ('legal standard'). They are then asked to give their prior probability of guilt. First, they are asked to state their prior belief given that Hans is one of eight people who have access to the safe ('objective prior'—this prior is of course also a subjective probability, but unlike the other subjective probabilities in this case, it is based on a known relative frequency). The subjects are then asked to state their prior belief in guilt given that Hans is one of eight people with access, has been denied a promotion and has a prior criminal record ('subjective prior', taking into account Hans' character and motive). Subjects also indicate their prior belief in Hans having received the money from his sister-in-law (the other root node of the network).
The likelihood ratio for each item of evidence was then elicited from the subjects using natural language questions. For example, for the witness statement of the technician, subjects have to answer the questions: ‘How likely is it that the technician testifies he saw Hans leaving the office, given that Hans left the office?’ and ‘How likely is it that the technician testifies he saw Hans leaving the office, given that Hans did not leave the office?’. Subjects assessed a total of 11 likelihood ratios; this allowed the computation of two different versions of the Bayesian network (see below). At the end of the questionnaire, subjects were asked again what their holistic subjective probability for Hans’ guilt was (‘holistic after’).
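For illustration, taking the convictors' mean responses for the technician's testimony reported in Table 2 below, the two elicited conditional probabilities correspond to a likelihood ratio of

\[ \frac{\Pr(tt \mid ot)}{\Pr(tt \mid of)} \approx \frac{0.893}{0.236} \approx 3.8, \]

so, following the odds form of Bayes' rule given in Section 3.1, this item of evidence on its own multiplies the prior odds that Hans left the office at about 7.15 pm by a factor of roughly four.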
7.1.3 Computation of the posterior probability of guilt
The posterior probability of guilt was computed for each subject using the parameters obtained from that subject and the structure of the Bayesian network given in Fig. 2. Evidence variables, i.e. variables the state of which is observed, are shown in dashed rectangles. The hypothesis variable (or 'query variable'; Darwiche, 2009, p. 84) is the variable of interest; it is shown in a rectangle with a thick border. Intermediate variables are variables that cannot be observed and mediate the influence of the evidence variables on the hypothesis variable; they are shown in solid-line rectangles. The two evidence variables 'refused promotion' and 'old criminal record' are not required for the computation of the network, as their state is known and they are parents of the hypothesis variable. They are included in Fig. 2 for the sake of completeness. The evidence regarding motive and character is of course reflected in the subjective prior probability for Hans' guilt and therefore influences the computed posterior probability of guilt, but it need not be added to the network as distinct variables.
Fig. 2. Bayesian network for the Jason Wells/Hans H. case. Evidence variables with dashed borders, intermediate variables with solid borders, hypothesis variable with thick border.
It would be highly impractical to carry out the actual summations required for the solution by hand, and all calculations were therefore performed using the software SamIam 3.0. A total of four posterior subjective probabilities of Pr(ht) were computed for each subject: the first one using the structure given above with all the evidence variables instantiated (‘computed posterior 1’) and the second one with the same structure, but without the evidence variable W (calls sister-in-law as witness) being instantiated (‘computed posterior 2’). The third and fourth posterior probabilities were computed with an alternative structure of the network in which the variable S (school function) is a child of H (took money). This is incorrect, as the probability of reaching the school in time is not directly dependent on taking the money, only on leaving the building in time (simply put, Hans is not slower across town with money in his pockets), but it is an intuitive representation of the variables’ dependency structure. In this version of the network, Pr(F|H) replaces Pr(F|B), everything else remaining the same. Again, two posterior probabilities Pr(ht) were computed, one with the evidence variable W instantiated (‘alternative computed posterior 1’), one without instantiation of W (‘alternative computed posterior 2’).
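For readers who want to reproduce such a query without SamIam, the following sketch computes 'computed posterior 1' by direct enumeration over the network of Fig. 2. The variable names follow my reading of Fig. 2 (H = took money, M = money from the sister-in-law, O = left the office, T = technician's testimony, B = left the building, V = surveillance video, F = at the school function, S = Silvia's testimony, C = repays the loan, R = receipt, W = calls the sister-in-law as witness). The numerical parameters are the convictors' mean values from Tables 1 and 2, except the prior for M, which is not reported as a group mean and is set to an arbitrary placeholder; note that the article computes the posterior for each subject from her own parameters, so plugging in group means is only illustrative and will not reproduce the averages in Table 1.

```python
# Posterior Pr(H = true | evidence) for the network of Fig. 2, computed by
# brute-force enumeration over all 2^11 truth assignments.
from itertools import product

params = {
    "H": 0.242,                           # subjective prior Pr(H = t), Table 1
    "M": 0.50,                            # ASSUMED prior Pr(M = t), placeholder
    "O": {True: 0.747, False: 0.418},     # Pr(O = t | H)
    "T": {True: 0.893, False: 0.236},     # Pr(T = t | O)
    "B": {True: 0.786, False: 0.297},     # Pr(B = t | O)
    "V": {True: 0.874, False: 0.228},     # Pr(V = t | B)
    "F": {True: 0.539, False: 0.617},     # Pr(F = t | B)
    "S": {True: 0.934, False: 0.192},     # Pr(S = t | F)
    "R": {True: 0.358, False: 0.074},     # Pr(R = t | M)
    "W": {True: 0.906, False: 0.350},     # Pr(W = t | M)
    "C": {(True, True): 0.625, (False, True): 0.661,
          (True, False): 0.636, (False, False): 0.144},   # Pr(C = t | H, M)
}

VARS = ["H", "M", "O", "T", "B", "V", "F", "S", "R", "W", "C"]

def bern(value, p_true):
    """Probability of a Boolean value, given Pr(value = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a):
    """Joint probability of a full assignment, factorised along Fig. 2."""
    return (bern(a["H"], params["H"]) * bern(a["M"], params["M"]) *
            bern(a["O"], params["O"][a["H"]]) * bern(a["T"], params["T"][a["O"]]) *
            bern(a["B"], params["B"][a["O"]]) * bern(a["V"], params["V"][a["B"]]) *
            bern(a["F"], params["F"][a["B"]]) * bern(a["S"], params["S"][a["F"]]) *
            bern(a["R"], params["R"][a["M"]]) * bern(a["W"], params["W"][a["M"]]) *
            bern(a["C"], params["C"][(a["H"], a["M"])]))

def posterior(query, evidence):
    """Pr(query = true | evidence) by summing the joint over all assignments."""
    num = den = 0.0
    for values in product([True, False], repeat=len(VARS)):
        a = dict(zip(VARS, values))
        if any(a[v] != state for v, state in evidence.items()):
            continue
        p = joint(a)
        den += p
        num += p if a[query] else 0.0
    return num / den

# 'Computed posterior 1': all evidence instantiated, including the uncalled
# witness (W = False); drop "W" from the dict for 'computed posterior 2'.
evidence = {"T": True, "V": True, "S": True, "C": True, "R": False, "W": False}
print(round(posterior("H", evidence), 3))
```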
7.2 Results
Of the 114 subjects who completed the questionnaire, 16 gave values for the likelihoods that made computation of the posterior for Pr(ht) impossible. This occurs when subjects indicate probabilities that are inconsistent with the evidence. For example, one subject indicated that the probability that Hans would be at the school function whether or not he left the office building at 7.17 pm was 0%. She also indicated that the probability of Silvia testifying that Hans was at the school function, given that Hans was not at the school function, was 0%. Under these assumptions, it is impossible that Silvia testifies that Hans is at the school function, but we know that Silvia testified to this. Therefore, the conditional probabilities are inconsistent with the evidence and the network cannot be queried. The 16 subjects with networks that could not be queried were excluded from further analysis. Of those who were excluded, 11 (69% versus 62% of the non-excluded subjects) would have convicted Hans. The average holistic probability of guilt for those 11 subjects was 80.2%, which is not significantly different from the 80.4% average holistic probability of guilt for the non-excluded convictors. The average holistic probability of guilt for those five excluded subjects who acquitted Hans was 23.1%, which is below the 45% for the non-excluded acquitters. If anything, including these 16 subjects in the analysis would therefore have increased the observed coherence shift.
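The pattern described above can be illustrated with a minimal sketch (the variable names are mine): if the elicited parameters assign probability zero to testimony that was in fact observed, the normalizing constant Pr(evidence) is zero and no posterior is defined.

```python
# Values in the spirit of the excluded subject described above (as probabilities):
p_at_school = 0.0                      # Pr(F = t), stated as 0% whether or not
                                       # Hans left the building in time
p_testifies_if_at_school = 0.9         # Pr(S = t | F = t), any value; not reported
p_testifies_if_not_at_school = 0.0     # Pr(S = t | F = f), stated as 0%

# Probability of the observed evidence (Silvia did testify):
p_evidence = (p_at_school * p_testifies_if_at_school
              + (1 - p_at_school) * p_testifies_if_not_at_school)

if p_evidence == 0.0:
    print("Pr(evidence) = 0: the elicited probabilities are inconsistent "
          "with the observed testimony, so the network cannot be queried.")
```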
Sixty-one subjects (62%) found Hans guilty of taking the money ('convictors') and 37 acquitted him ('acquitters'). Table 1 reports the average values for the priors, the holistic posterior probabilities of guilt, and the four computed posteriors (computed as explained in the last paragraph of Section 7.1.3 above). The mean holistic probability of guilt of those who convict the defendant, 80.4, is almost twice as high as that of those who acquit, 45.0. After the subjects have answered all the likelihood questions, the gap in the holistic probability of guilt between convictors and acquitters decreases slightly: the mean holistic probability of guilt of the acquitters remains the same, while that of the convictors decreases to 75.7. The difference, however, remains large and highly significant. No significant differences are found between the priors of those who convict and those who acquit.
Table 1. Mean prior, holistic and computed posterior probabilities, by subjects who convict and acquit (standard deviation)

| | Objective prior | Subjective prior | Holistic before | Holistic after | Computed post. 1 | Computed post. 2 | Alt. computed post. 1 | Alt. computed post. 2 |
|---|---|---|---|---|---|---|---|---|
| Convictors | 12.8 (2.5) | 24.2 (18.4) | 80.4 (20.6) | 75.7 (19.6) | 69.0 (36.3) | 50.5 (33.3) | 67.9 (37.1) | 46.7 (33.9) |
| Acquitters | 12.6 (2.4) | 20.6 (18.3) | 45.0 (27.0) | 45.1 (26.6) | 62.8 (36.8) | 40.5 (29.7) | 59.7 (37.2) | 34.5 (28.8) |
| Average | 12.8 (2.5) | 22.9 (16.7) | 67.0 (28.8) | 64.2 (26.8) | 66.7 (36.4) | 46.7 (32.2) | 64.8 (37.2) | 42.1 (32.4) |
| Difference | 0.2 | 3.6 | 35.4*** | 30.4*** | 6.2 | 10.0+ | 8.2 | 12.2+ |

***p < 0.001, +p < 0.1 (using a two-sided t-test; in order to check for robustness, a Wilcoxon rank-sum test was also used and the results remain the same).
The large difference between the posterior probability of guilt of the convictors and the acquitters becomes much smaller (from 35.4 and 30.4 percentage points to 6.2 and 8.2 percentage points, respectively) and statistically insignificant for the posteriors calculated with the Bayesian networks that take into account that Hans did not call his sister-in-law as a witness (see columns 5 and 7 of Table 1). The differences in the computed posteriors of guilt that do not take the missing witness into account become much smaller, too (10.0 and 12.2 percentage points, respectively), but remain marginally significant (see columns 6 and 8 of Table 1). Taking into account or ignoring that Hans failed to call his sister-in-law results in significant differences in the mean computed posteriors. In a random-effects regression with the computed posterior as the dependent variable, the dummy variable for the instantiation of variable W of the network has predictive power (b = 19.98, z(98) = 6.92, p < 0.001). No effect of the verdict (convict or acquit) was found (b = −8.1, z(98) = 1.2, p = 0.214).
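A sketch of how such a random-effects (mixed) regression could be specified; the long-format layout, the column names (subject, posterior, w_instantiated, convicted) and the toy values are hypothetical and do not correspond to the author's original analysis.

```python
# Hypothetical long-format data: several computed posteriors per subject,
# a dummy for whether W (calls sister-in-law) was instantiated, and a dummy
# for the subject's holistic verdict.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "subject":        [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "posterior":      [69.0, 50.0, 63.0, 41.0, 68.0, 47.0,
                       72.0, 55.0, 60.0, 38.0, 65.0, 44.0],   # toy values
    "w_instantiated": [1, 0] * 6,
    "convicted":      [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

# Random intercept per subject, fixed effects for the W dummy and the verdict.
model = smf.mixedlm("posterior ~ w_instantiated + convicted",
                    df, groups=df["subject"])
print(model.fit().summary())
```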
Figure 3 graphically displays the main results reported above and in Table 1. It shows how both convictors and acquitters share almost the same priors, but differ strongly in their holistic posterior probability of guilt. Assessing the probative value of each individual item of evidence without integrating the individual values according to the rules of subjective probability theory does not, by itself, lead to convergence: it decreases the holistic posterior for the convictors somewhat but has no effect on the holistic posterior of the acquitters. However, computing the posterior using the likelihoods given by the subjects greatly increases the posterior for the acquitters and decreases the posterior for the convictors, bringing the two groups close together. Computing the posterior without taking into account that Hans did not call his sister-in-law as a witness further decreases the posterior probability for both groups. However, the main effect, the closing of the gap between the assessments of the case, remains.
Fig. 3. Mean posterior probabilities, by subjects who convict and acquit (error bars indicate 95% confidence intervals).
Table 2 reports the likelihoods for each individual item of evidence and each intermediate hypothesis, split by convictors and acquitters. As the computed posteriors do not fully converge, it is of interest to examine whether convictors and acquitters differ in the assessment of specific items of evidence. With the exception of the difference in means for the likelihood Pr(ot|ht) that Hans leaves the office at 7.15 pm, given that he took the money, and the likelihood Pr(tt|of) that the technician testifies that he saw Hans leaving the office, given that Hans did not leave the office, which were significant at a level of p < 0.05 and p < 0.1, respectively, none of the differences in mean likelihoods are statistically significant. Adjusting the α-level for the multiple comparisons, e.g. by a Bonferroni correction, results in none of the differences remaining statistically significant. However, this is slightly misleading as the differences in the assessment of the probative values of the individual items of evidence, although small, all go in the same direction, with the convictors giving each item of evidence a higher probative value (see Fig. 4).
Fig. 4. Mean likelihood ratios for each item of evidence and each intermediate hypothesis, by subjects who convict and acquit.
Table 2. Mean likelihoods for each item of evidence and each intermediate hypothesis, by subjects who convict and acquit (standard deviation)

| | Pr(ot|ht) | Pr(ot|hf) | Pr(tt|ot) | Pr(tt|of) | Pr(bt|ot) | Pr(bt|of) | Pr(vt|bt) | Pr(vt|bf) |
|---|---|---|---|---|---|---|---|---|
| Convictors | 74.7 (25.5) | 41.8 (28.8) | 89.3 (15.7) | 23.6 (25.5) | 78.6 (22.1) | 29.7 (26.4) | 87.4 (24.7) | 22.8 (32.8) |
| Acquitters | 63.7 (28.2) | 40.5 (28.4) | 85.5 (19.8) | 33.2 (29.6) | 77.1 (22.9) | 36.6 (28.3) | 90.6 (16.9) | 34.5 (39.4) |
| Difference | 11.0** | 1.3 | 3.8 | 9.6+ | 1.5 | 6.9 | 3.2 | 11.7 |

| | Pr(ft|bt) | Pr(ft|bf) | Pr(st|ft) | Pr(st|ff) | Pr(rt|mt) | Pr(rt|mf) | Pr(wt|mt) | Pr(wt|mf) |
|---|---|---|---|---|---|---|---|---|
| Convictors | 53.9 (29.6) | 61.7 (31.8) | 93.4 (14.7) | 19.2 (26.2) | 35.8 (31.0) | 7.4 (15.2) | 90.6 (19.4) | 35.0 (29.6) |
| Acquitters | 47.6 (28.3) | 65.9 (28.3) | 87.2 (20.1) | 27.1 (26.9) | 34.5 (28.9) | 9.4 (17.4) | 86.6 (22.9) | 35.1 (29.1) |
| Difference | 6.3 | 4.2 | 6.3 | 7.9+ | 1.3 | 2.0 | 4.0 | 0.1 |

| | Pr(ft|ht) | Pr(ft|hf) | Pr(ct|ht, mt) | Pr(ct|hf, mt) | Pr(ct|ht, mf) | Pr(ct|hf, mf) |
|---|---|---|---|---|---|---|
| Convictors | 53.1 (29.6) | 76.9 (24.3) | 62.5 (35.8) | 66.1 (30.4) | 63.6 (25.9) | 14.4 (25.8) |
| Acquitters | 45.9 (27.4) | 75.4 (24.5) | 67.8 (36.5) | 74.6 (27.1) | 64.1 (31.1) | 18.6 (29.0) |
| Difference | 7.2 | 1.5 | 5.3 | 8.5 | 0.5 | 4.2 |

**p < 0.05, +p < 0.1 (using a two-sided t-test; in order to check for robustness, a Wilcoxon rank-sum test was also used and the results remain the same).
Figure 4 shows graphically that the atomistic evaluations of the probative force of the evidence by convictors and acquitters follow the same pattern, i.e. both groups tend to agree on the direction and strength of the probative value. However, those who convict tend to rate the probative force of each item of evidence higher than those who acquit; interestingly, this also holds for exculpatory items of evidence, such as the witness statement by Silvia (ratio_silvia). Note that the seemingly large difference in the evaluation of the probative force of the (lack of a) receipt is deceptive; it is driven by a small and insignificant difference in the assessed likelihood of Hans having a receipt although he did not receive the money from his sister-in-law.
Table 3 shows the subjective posterior probability of guilt that the subjects believe is required for a conviction in a criminal case (‘own standard’) and the subjects’ interpretation of the criminal standard of proof in Germany as a threshold subjective probability (‘legal standard’).
Table 3. Mean subjective probability thresholds required for a conviction, by subjects who convict and acquit (standard deviation)
| | Own standard | Own standard (w/o 100%) | Legal standard |
|---|---|---|---|
| Convictors | 93.1 (15.7) | 87.6 (19.5) | 73.1 (13.0) |
| Acquitters | 95.9 (6.2) | 93.0 (6.8) | 76.4 (20.4) |
| Difference | 2.8 | 5.4 | 3.3 |
To investigate whether the high unguided standard of proof was driven by the 42 subjects who indicated that a certainty of 100% is necessary, I excluded these subjects in an additional calculation of the mean own standard; under no legal rule is absolute certainty required for a conviction. Excluding them leads to the average threshold posteriors reported in the middle column of Table 3. There are no significant differences in the mean thresholds required for a conviction between those who acquit and those who convict.
Participants should only convict if their posterior probability of guilt meets or exceeds the threshold value required for a conviction. Table 4 therefore compares the posteriors (column headings) with the probability thresholds required for a conviction (row headings) and counts the instances where the posterior meets or exceeds the threshold. Four different threshold values are used. Row 1 uses the personal, unguided standard, i.e. the threshold value that the individual participant stated as being required for a conviction. Row 2 uses the participant's own interpretation of the German Federal Supreme Court's verbal standard of proof as a subjective posterior probability. Row 3 uses the average unguided expression of the standard of proof as a posterior probability (averaged across all participants, convictors and acquitters alike, as there were no significant differences; see Table 3). Finally, row 4 uses the average interpretation of the Federal Supreme Court's verbal standard of proof as the threshold required for a conviction.
Table 4. Instances of posteriors meeting or exceeding the threshold probability for conviction
| | Holistic posterior (before likelihood questions) | | Computed posterior 1 (with adverse inference) | | Computed posterior 2 (without adverse inference) | |
| | Convictors (n = 61) | Acquitters (n = 37) | Convictors (n = 62) | Acquitters (n = 37) | Convictors (n = 62) | Acquitters (n = 37) |
|---|---|---|---|---|---|---|
| Personal own standard | 11 | 0 | 19 | 8 | 10 | 1 |
| Personal legal standard | 38 | 4 | 36 | 16 | 19 | 7 |
| Average own standard | 19 | 2 | 30 | 14 | 11 | 3 |
| Average legal standard | 46 | 6 | 39 | 19 | 20 | 5 |
In columns 1 and 2, the threshold value is compared to the subject’s own holistic posterior belief in guilt, expressed before answering the likelihood questions. In columns 3 and 4, the comparison standard is the subject’s own computed posterior probability of guilt, taking into account the adverse inference based on the missing witness. In columns 5 and 6, the comparison standard is the subject’s own computed posterior probability of guilt without the adverse inference based on the missing witness.
If the posterior meets or exceeds the personal threshold value, the subject should (by his or her own standard) convict; if not, acquit. The top left cell of Table 4 shows that the holistic posterior (elicited before the likelihood questions) meets or exceeds the personal threshold for a conviction for only 11 subjects. For 42 subjects (38 who actually convicted and 4 who acquitted), the holistic posterior meets or exceeds their own interpretation of the legal standard of proof. The last two rows of Table 4 compare the individual posterior probabilities of guilt with the average threshold probability and the average legal standard across all subjects and count all instances where the posterior meets or exceeds the average threshold.
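As a minimal sketch of the counting behind Table 4, the comparison reduces to a per-subject threshold test. The per-subject posteriors and personal thresholds below are invented for illustration; only the 74.4% average legal standard is taken from the text.

```python
# Minimal sketch of the comparison behind Table 4: count how many subjects'
# posterior probability of guilt meets or exceeds a conviction threshold.
# The per-subject posteriors and personal thresholds below are invented for
# illustration; only the 0.744 average legal standard is taken from the text.

posteriors = [0.95, 0.80, 0.62, 0.55, 0.90]        # hypothetical per-subject posteriors
own_thresholds = [1.00, 0.90, 0.75, 0.95, 0.85]    # hypothetical personal standards
average_legal_standard = 0.744                      # subjects' mean legal standard (74.4%)

meets_own = sum(p >= t for p, t in zip(posteriors, own_thresholds))
meets_avg = sum(p >= average_legal_standard for p in posteriors)

print(f"meet personal standard:      {meets_own} of {len(posteriors)}")
print(f"meet average legal standard: {meets_avg} of {len(posteriors)}")
```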
Figure 5 shows the instances where different posteriors meet or exceed different threshold levels for conviction as proportions of the total number of possible convictions. Of the subjects, 62% actually convicted Hans. As is evident, only a fraction of those 61 subjects should have convicted him had they adhered to their own standard of proof.

Figure 5. Proportion of cases where the posterior meets or exceeds the threshold probability required for a conviction.
7.3 Discussion
The holistic posterior probability of the defendant’s guilt for the subjects convicting the defendant is roughly twice as high as the holistic posterior probability of guilt for the subjects acquitting the defendant. This closely replicates results by Glöckner and Engel (2013) and indicates that under a holistic evaluation of the evidence, convictors and acquitters really ‘interpreted the same case in completely different ways’ (Glöckner and Engel, 2013, p. 241), as predicted by PCS models of cognitive coherence. Coherence shifts are uniquely predicted by bidirectional models of decision making, which assume that the cues (evidence) and the emerging decision (verdict choice) influence each other (Glöckner et al., 2010, p. 441). All the ‘fast and frugal’ heuristics from the ‘adaptive toolbox’ (Todd, 2002) suggested so far take the input information as stable and apply different search, stopping and weighting rules (Glöckner et al., 2010, p. 441). Still, one could surmise that the ‘take the best’ rule, which posits that decision makers rely on the single best cue (evidence) and ignore all other cues (Gigerenzer and Goldstein, 1999, p. 81), could lead to inflated confidence, since evidence contradicting the best evidence is ignored entirely. The data from the current study do not allow this possibility to be excluded.
When the subjects evaluate the evidence atomistically, the assessments of acquitters and convictors converge: computing the posterior probability of guilt using a Bayesian network, with the parameters for the prior probabilities and the likelihoods obtained from the subjects, makes the difference in the evaluation of the case between convictors and acquitters largely disappear. Merely answering the likelihood questions is not sufficient to achieve this effect; it only marginally decreases the posterior probability of guilt for the convictors and has no effect on the posterior probability of guilt of the acquitters. The data therefore support the first main hypothesis: an atomistic evaluation of evidence in a likelihood framework leads to the disappearance of coherence shifts.
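To make the computation concrete, the following is a minimal sketch of how a posterior probability of guilt can be obtained by enumeration from an elicited prior and elicited likelihoods. The two-item structure, the node names and all numbers are hypothetical simplifications, not the network or the parameters used in the study.

```python
# Minimal sketch: posterior probability of guilt computed by enumeration from
# an elicited prior and elicited likelihoods. The tiny two-item structure
# (Guilt -> E1, Guilt -> E2), the node names and all numbers are hypothetical
# stand-ins, not the network or the parameters used in the study.

prior_guilt = 0.5                      # elicited prior Pr(guilt)
likelihoods = {                        # elicited Pr(item observed | guilt)
    "e1": {True: 0.85, False: 0.30},   # e.g. a fairly diagnostic item
    "e2": {True: 0.60, False: 0.55},   # e.g. a weakly diagnostic item
}
observed = {"e1": True, "e2": True}    # the evidence actually presented

def joint(guilt: bool, evidence: dict) -> float:
    """Pr(guilt, evidence), assuming the items are independent given guilt."""
    p = prior_guilt if guilt else 1.0 - prior_guilt
    for name, value in evidence.items():
        p_true = likelihoods[name][guilt]
        p *= p_true if value else 1.0 - p_true
    return p

posterior = joint(True, observed) / (joint(True, observed) + joint(False, observed))
print(f"Pr(guilt | evidence) = {posterior:.3f}")   # 0.756 with these numbers
```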
This conclusion is further supported by the data for the likelihood ratios for each item of evidence. The differences in the mean likelihoods between convictors and acquitters are not significant at the p < 0.05 level for all but one likelihood. A comparison of the likelihood ratios ‘ratio credit 1’ and ‘ratio credit 2’ shows that the subjects correctly interpret the repayment of the loan (the day after the disappearance of the money from the safe) as incriminating evidence given that Hans has not received the money from his sister-in-law (‘ratio credit 2’); however, the subjects assign no probative value to the repayment given that Hans has received the money from his sister-in-law (‘ratio credit 1’). In the latter case, the repayment is adequately explained even without Hans having taken the money from the safe.
It is also noteworthy that, on average, all the evidence was judged to be of limited strength. None of the likelihood ratios computed by dividing the average likelihoods exceeded 5. According to the verbal scale for forensic evidence suggested by Evett et al., a likelihood ratio between 1 and 10 can be verbally expressed as ‘limited evidence to support’ (Evett et al., 2000, p. 236). A different picture emerges when individual likelihood ratios are computed for each subject, using the likelihoods provided by that subject. These likelihood ratios were sometimes very large, indicating very strong evidence (see Section 7.4 below for a discussion of the large observed inter-individual differences in the assessment of the likelihoods).
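For illustration, the likelihood ratios for two items can be computed directly from the mean likelihoods reported in Table 2, and a rough verbal label attached in the spirit of Evett et al. (2000). The band boundaries and labels below are a simplified paraphrase of that scale, not the official wording, and the dictionary keys are only mnemonic shorthand.

```python
# Sketch: likelihood ratios from two pairs of mean likelihoods in Table 2,
# with a rough verbal label in the spirit of Evett et al. (2000). The band
# boundaries and labels are a simplified paraphrase, not the official scale,
# and the dictionary keys are only mnemonic shorthand for the variables.

mean_likelihoods = {     # (mean Pr(E | parent true), mean Pr(E | parent false)) in %
    "o_given_h": (74.7, 41.8),
    "v_given_b": (87.4, 22.8),
}

def verbal(lr: float) -> str:
    if lr <= 1:
        return "no support"
    if lr <= 10:
        return "limited support"
    if lr <= 100:
        return "moderate support"
    return "strong or stronger support"

for item, (p_true, p_false) in mean_likelihoods.items():
    lr = p_true / p_false
    print(f"{item}: LR = {lr:.1f} -> {verbal(lr)}")   # both stay well below 10
```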
Taking into account or ignoring the fact that Hans failed to call his sister-in-law as a witness results in a large difference of about 20 percentage points in the computed posteriors. This reflects the intuition that not calling the witness permits the inference that her testimony would have been unfavourable to Hans. The intuition is reflected in U.S. case law going back to Graves v. United States, where the U.S. Supreme Court stated: ‘The rule even in criminal cases is that if a party has it peculiarly within his power to produce witnesses whose testimony would elucidate the transaction, the fact that he does not do it creates the presumption that the testimony, if produced, would be unfavourable’ (Graves v. United States, 150 U.S. 118, 121 [1893]). A party has the power to produce a witness if ‘[it] had the physical ability to locate and produce the witness and there was such a relationship, in legal status or on the facts as claimed by the party as to make it natural to expect the party to have called the witness’ (Thomas v. United States, 447 A.2d 52, 57 [D.C. 1982]). Given that the witness in question is Hans’ sister-in-law and would have first-hand knowledge of the relevant issue, namely whether Hans received the money from her, these conditions appear to be met. The data support the conclusion that the subjects took the mere omission to call the sister-in-law as a witness as evidence against the truth of the proposition that Hans received the money to pay back the credit from her, which in turn increases the probability of Hans having taken the money, because the alternative explanation becomes less probable. That Bayesian networks can model such relatively complex chains of inference is one of their strengths (see Hahn and Oaksford, 2007, p. 707 sqq., for Bayesian models of other forms of adverse inference).
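A minimal numerical sketch of this chain of inference (all probabilities below are hypothetical and chosen only for illustration, not the elicited values): conditioning additionally on the fact that the sister-in-law was not called lowers the probability that Hans received the money from her, which weakens the innocent explanation of the repayment and raises the posterior probability that he took the money.

```python
# Numerical sketch of the adverse-inference chain, with purely hypothetical
# probabilities: G = Hans took the money, M = he received money from his
# sister-in-law, R = he repaid the loan (observed), W = he did not call her
# as a witness (observed). R depends on G and M; W depends only on M.

from itertools import product

p_g = 0.5                                          # prior Pr(G)
p_m = 0.5                                          # prior Pr(M)
p_r = {(True, True): 0.95, (True, False): 0.90,    # Pr(R = true | G, M)
       (False, True): 0.90, (False, False): 0.10}
p_w = {True: 0.20, False: 0.80}                    # Pr(W = true | M): not calling
                                                   # her is unlikely if she paid

def posterior_guilt(use_adverse_inference: bool) -> float:
    """Pr(G = true | R = true [, W = true]) by enumeration over G and M."""
    num = den = 0.0
    for g, m in product([True, False], repeat=2):
        p = (p_g if g else 1 - p_g) * (p_m if m else 1 - p_m) * p_r[(g, m)]
        if use_adverse_inference:
            p *= p_w[m]                            # also condition on W = true
        den += p
        if g:
            num += p
    return num / den

print(f"Pr(G | repayment)                     = {posterior_guilt(False):.3f}")  # ~0.65
print(f"Pr(G | repayment, witness not called) = {posterior_guilt(True):.3f}")   # ~0.78
```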
However, the adverse inference is not permissible in this case, since the prosecutor could also have called the witness. U.S. courts have applied the missing witness inference rule in criminal cases provided that the state cannot reasonably locate the missing witness (U.S. v. Anchondo-Sandoval, 910 F.2d 1234, 1238 [5th Cir. 1990]). This prerequisite is most probably not met (the scenario is silent on the issue), because there is no reason to think that the state could not have located Hans’ sister-in-law and called her as a witness. In defence of the subjects, it must be stressed that they were only asked about the likelihood of Hans not calling the witness given that he received or did not receive the money from his sister-in-law. A fairer question would have been how likely it was that neither Hans nor the prosecution would call the witness, given that Hans did not receive the money from his sister-in-law.
There are no significant differences in the threshold probability required for a conviction between those who convicted and those who acquitted the defendant. Interestingly, the mean unguided estimate of the threshold level required for a conviction in criminal matters is closer to the values of well above 90% stated in the German legal literature (e.g. Hoyer, 1993, p. 439) than to the subjects’ interpretation of the threshold level required by the German Federal Supreme Court. However, the subjects are inconsistent with their own standards: 50 subjects convicted Hans although their stated holistic posterior probability of guilt did not reach their own stated threshold probability for a conviction, and 23 convicted although their holistic posterior did not even meet their own understanding of the legal standard of proof (Table 4). Arguably, the comparison with the average legal standard is the most appropriate, as the legal standard of proof should not vary between decision makers. Comparing the computed posteriors with the average legal standard shows that drawing an adverse inference from the failure to call the witness leads, as expected, to a substantially larger proportion of convictions.
The average holistic posterior probability of the defendant’s guilt among those who convict (80.4%) is above the average legal standard required for a conviction as expressed by the subjects (74.4%). However, the average computed posterior probability of guilt, even for the convictors, only barely exceeds 50% (50.5%) if one does not draw an adverse inference from the missing witness, which, as argued above, is the correct approach in this case. Measured against the subjects’ average interpretation of the legal standard in criminal matters in Germany, Hans should not have been convicted on an item-by-item assessment of the evidence. A posterior probability of guilt of 50.5% also fails to meet any reasonable quantification of the ‘beyond reasonable doubt’ standard of proof in criminal matters under U.S. law. While there is considerable inter-individual variability in the expression of the ‘beyond reasonable doubt’ standard as a degree of probability (Hastie, 1993, p. 101 sq.), it is generally understood to require a much higher probability than the ‘preponderance of the evidence’ standard of just above 50% used in civil cases (Lillquist, 2002, p. 94). Whether quantification is desirable at all is the subject of an ongoing debate; while scholars have long advocated the use of a numerical definition of the standard of proof, courts have remained hostile to attempts at quantification (Tillers and Gottfried, 2007).
The second main hypothesis is not supported by the data: based on Schum and Martin (1982), it was hypothesized that an item-by-item assessment of the evidence would lead to a reduction in the variance of the posterior probability of guilt. This was evidently not the case. As the standard deviations of the likelihoods (Table 2) indicate, there was actually higher variance in the assessment of the conditional probabilities than in the assessment of the holistic posterior probability of guilt. This is because many subjects chose extreme values of 0% or 100% for the likelihoods. These extreme values for the evidence and intermediate variables carry over into the computed posteriors, which also show higher standard deviations than both holistic posteriors (Table 1). It has long been thought that posterior probabilities computed using Bayesian networks are robust to changes in the values of the evidence and intermediate variables (Pradhan et al., 1996); however, this is not generally true. Networks with extreme values for the evidence and intermediate variables (i.e. values close to the bounds of 0 and 1) and intermediate values on the query variable(s) are sensitive to changes in the parameters of the evidence variables (Chan and Darwiche, 2002). Intuitively, this can be explained by considering that a small absolute change in an extreme likelihood, say from 0.001 to 0.01, changes the likelihood ratio by an order of magnitude, while the same small change in an intermediate probability, say from 0.601 to 0.61, has almost no influence on the likelihood ratio (all else being equal).
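To spell out the arithmetic, the sketch below varies Pr(E|not-H) by the amounts mentioned in the text while holding an assumed Pr(E|H) of 0.9 and a 50% prior fixed; the order-of-magnitude jump in the likelihood ratio, and hence in the posterior, occurs only for the near-zero values.

```python
# Sketch: why extreme likelihoods make the computed posterior fragile. A small
# absolute change near the 0/1 bounds shifts the likelihood ratio by an order
# of magnitude; the same change in the mid-range barely matters. The fixed
# 0.9 numerator and the 50% prior are assumptions added only for display.

def posterior(prior: float, lr: float) -> float:
    odds = prior / (1 - prior) * lr
    return odds / (1 + odds)

numerator = 0.9                                    # Pr(E | H), held fixed
for denominator in (0.001, 0.01, 0.601, 0.61):     # Pr(E | not-H), varied slightly
    lr = numerator / denominator
    print(f"Pr(E | not-H) = {denominator:<5}  LR = {lr:7.1f}  "
          f"posterior from a 50% prior = {posterior(0.5, lr):.3f}")
```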
I can only speculate as to why the results from Schum and Martin (1982) could not be replicated. A plausible explanation is that the 20 subjects of Schum and Martin gave a total of 16 000 probability assessments (800 per subject) over the course of several days and assessed the same evidence repeatedly (Schum and Martin, 1982, p. 127 seq.). This may have induced learning and thereby higher consistency. The subjects in this study, on the other hand, were unfamiliar with the task of assigning numerical values to degrees of belief and had little opportunity for learning. This unfamiliarity with a task that is known to be difficult (see, e.g. Thagard, 2003, p. 370) may have led to the great observed variance.
7.4 Limitations and further research
As should have become evident from the discussion, the main limitation of this study stems from the old maxim that averages can be deceptive. The average values for the parameters of the network and the average computed posteriors support the main hypothesis; looking at the data for the individual subjects, however, reveals large inter-individual differences. Since the evaluation of evidence in a judicial context should be predictable, these inter-individual differences need to be reduced if Bayesian networks are to be a useful tool for the fact finder in cases where there are no relative frequencies that could inform the subjective probabilities. Further research should therefore explore whether more sophisticated techniques for eliciting the (conditional) probabilities lead to less inter-individual variability (for an overview of different elicitation techniques, see O'Hagan et al., 2006).
The second limitation of this study is that the structure of the network was designed by the experimenter and therefore the same for all subjects. When evaluating a mass of evidence using a Bayesian network, the impact of the evidence on the probability of the hypothesis of interest depends not only on the probative force of each individual item of evidence, but also on their interrelations. In other words, the assessment of the overall probative force of the evidence depends (also) on the structure of the network. As the direct dependencies which structure the model are based on the expert’s knowledge and assumptions about the workings of the world, different experts may structure the problem differently. Further research should explore whether the main effect, the large reduction in the difference in the posterior probability of guilt for the convictors and acquitters, remains if not only the parameters for the network, but also the network structure is elicited from the subjects. This requires teaching the participants how to construct a Bayesian network, which pushes the boundaries of the experimental paradigm.
8. Conclusion
This study is the first to empirically demonstrate an advantage of using a Bayesian network for the evaluation of evidence in a case where there are no relative frequencies that could form the basis for assessing the probative value of the evidence. The study shows that the large difference between the posterior subjective probability of guilt of judges who convict and judges who acquit largely disappears when the posterior probability of guilt is computed using a Bayesian network parameterised with the values obtained from the judges. This result is important because the posterior degree of belief in the guilt of the defendant is the relevant standard of proof in most legal systems. Forcing coherence in the sense of subjective probability theory on the partial beliefs of the judge using a Bayesian network suppresses the polarization of evidence observed in the holistic evaluation of evidence and reduces the resulting inflated confidence in having made the right choice. It makes transparent that certainty is often unattainable in legal fact finding, and that the subjective feeling of certainty is mostly an illusion.
Funding
Swiss National Science Foundation (SNF), grant no. PA00P1_131519.
1 This instruction was added based on an observation from a pre-test, in which a substantial number of subjects gave responses to the likelihood questions that always summed to 100%, which suggests that they (wrongly) believed this had to be the case.