In Atkins v. Virginia, the U.S. Supreme Court held that the Eighth Amendment ban on cruel and unusual punishment precludes capital punishment for intellectually disabled offenders. Death penalty states responded with laws defining intellectual disability in various ways. In Hall v. Florida, the Court narrowly struck down the use of a measured IQ of 70 to mark the upper limit of intellectual disability because it created ‘an unacceptable risk that persons with intellectual disability will be executed’. But the Court was unclear if not inconsistent in its description of an upper limit that would be acceptable. Four dissenting Justices accused the majority not only of misconstruing the Eighth Amendment, but also of misunderstanding elementary statistics and psychometrics. This article uses more complete statistical reasoning to explicate the Court’s concept of unacceptable risk. It describes better ways to control the risk of error than the Court’s confidence intervals, and it argues that, to the extent that the Eighth Amendment allows any quantitative cut-score in determining an offender’s intellectual disability, these more technically appropriate methods are constitutionally permissible.

1. Introduction

‘The record of how human beings have performed on IQ tests does more than measure us against one another for entry into Harvard law school.’

—James R. Flynn, Intelligence and Human Progress 3 (2013)

For some of us, a single IQ point can mark the difference between life and death. Such is the teaching of Hall v. Florida.1 Hall is the Supreme Court’s third, and most detailed, case to address just when a criminal who might be intellectually disabled can nevertheless be sentenced to death. In Hall, the Court held unconstitutional Florida’s use of a measured IQ score of 70 to mark the upper limit of intellectual disability. Four dissenting Justices were incensed. In an opinion written by Justice Alito,2 they interpreted Justice Kennedy’s majority opinion3 to mean that only IQ scores of 75 or more could serve as a cut-off.4 Using their own computations of ‘confidence intervals’ (CIs),5 they concluded that this rule ‘totally transforms the allocation and nature of the burden of proof’.6 The Court reached a rule ‘unhinged from legal logic’,7 Justice Alito lambasted, because it ‘misunderstands how the SEM [standard error of measurement] works’8 and made ‘factual mistakes that will surely confuse States attempting to comply with its opinion’.9

The opinions of this divided Court have escaped probing analysis of their treatment of the statistical issues in ascertaining intellectual disability.10 Impressed with potential doctrinal implications of the case for criminal jurisprudence more generally, commentators have seen it as supporting challenges to other aspects of capital-sentencing procedures.11 Even more broadly, one astute scholar of mental health law sensed in Hall the seeds of a general ‘scientization of the criminal law’.12 This commentary on the implications of Hall is intriguing and important, but it leaves undisturbed and unexamined the statistical science in Hall and its relationship to constitutional constraints and legal policies.

Therefore, this article focuses on the statistical issues in the case. It examines the more important statements in the Justices’ opinions on measurement error in IQ testing. These statements are of more than didactic and technical interest. The statistical properties of IQ scores are critical to understanding the capital-sentencing options retained by states after Hall. The Hall majority condemned Florida for ignoring ‘a statistical fact’13 that constituted ‘one of the most important concepts in measurement theory’.14 Without careful analysis of the relevant statistical principles, it is not possible to ascertain how much Hall should constrain states that wish to carve out a range of IQ scores that would preclude further proof of intellectual disability. Furthermore, because IQ scores are of major importance in the broader clinical assessments that courts must review, understanding their statistical properties and limitations is vital for all adjudication of intellectual disability claims.

Section 2 of this article sets the stage for the examination of the exchange among the Justices on technical concepts such as the standard error of measurement (SEM) and CIs. It describes the issue in Hall as it emerged in the Court’s ‘zig-zagging death penalty jurisprudence’15 and the general divide between the majority and the dissent. It reads the majority opinion as accepting the premise that a state may, without further inquiry, deem all individuals with true IQs above a fixed number as not intellectually disabled but that it must adopt some margin of error in applying this categorical rule to measured scores.

Section 3 explains why the state’s choice of an IQ cut-score, whether conceived of as a true score or framed as a measured one, is inherently arbitrary. It shows how the measured score of 70 selected by Florida as well as the not fully specified but higher numbers demanded by the majority affect the proportion of the population who could be found to be intellectually disabled.

Section 4 analyses SEM and resulting CIs that are central to the Court’s disposition of the case. It clarifies ambiguities in the Court’s opinion and criticizes the dissent’s effort to link a CI with the burden of persuasion. It notes that the statistical apparatus of CIs is not even necessary to establish a cut-off for observed IQ scores that accounts for the SEM. Finally, this section identifies a different statistic—the standard error of estimate (SEE)—that is more appropriate than the Court’s SEM in adjusting for measurement error.

Section 5 moves beyond the situation of a single IQ score and the confines of classical test theory. First, it considers the Justices’ discussion of combining IQ scores to achieve greater precision. It shows that the majority’s comments on the difficulty of aggregation should not bar the use of established methods. Similarly, it suggests that the exclusive reliance on classical test theory and CIs in the opinions should not preclude the use of Bayesian statistical procedures that would permit a more perspicacious treatment of the problem of measurement error in IQ scores than the frequentist one that the Court endorsed.

2. The intellectual disability trilogy

In the space of 25 years, the Supreme Court has moved from permitting the state to execute a man with an IQ in the 50–63 range16 to barring the state from executing a man with an IQ in the 71–80 range.17 This transformation began in 1989, in Penry v. Lynaugh.18 A welter of conflicting opinions in Penry established that before imposing a capital sentence, the sentencing judge or jury must have the opportunity to consider and give appropriate weight19 to what was then called mental retardation.20 At the same time, the Court declined to hold that the Eighth Amendment’s ban on cruel and unusual punishment insulated all mentally retarded individuals from death at the hands of the state. Writing for the majority, Justice O’Connor concluded that ‘there is insufficient evidence of a national consensus against executing mentally retarded people convicted of capital offenses for us to conclude that it is categorically prohibited by the Eighth Amendment’.21 Four dissenting Justices did not dispute this assessment of national opinion, but they insisted that capital punishment is unacceptable for ‘the mentally retarded as a class’22 because their limitations make the death sentence invariably disproportionate.23

A scant 13 years later, in Atkins v. Virginia,24 the Court found a ‘dramatic shift in the state legislative landscape’.25 The requisite ‘national consensus’26 that it is categorically wrong to execute ‘the mentally retarded’ had coalesced.27 Moreover, Justice Stevens’s majority opinion buttressed this conclusion with the arguments of the dissenters in Penry. The execution of the mentally retarded, he maintained, does not measurably advance the goals of retribution and deterrence.28 In addition, he linked the substantive conclusion that it is wrong to execute anyone who is intellectually disabled to the accuracy of the procedures for identifying death-worthy defendants:

The reduced capacity of mentally retarded offenders provides a second justification for a categorical rule making such offenders ineligible for the death penalty. The risk that the death penalty will be imposed in spite of factors which may call for a less severe penalty … is enhanced, not only by the possibility of false confessions, but also by the lesser ability of mentally retarded defendants to make a persuasive showing of mitigation in the face of prosecutorial evidence of one or more aggravating factors. Mentally retarded defendants may be less able to give meaningful assistance to their counsel and are typically poor witnesses, and their demeanor may create an unwarranted impression of lack of remorse for their crimes. As Penry demonstrated, moreover, reliance on mental retardation as a mitigating factor can be a two-edged sword that may enhance the likelihood that the aggravating factor of future dangerousness will be found by the jury. Mentally retarded defendants in the aggregate face a special risk of wrongful execution.29

But Atkins left unclear and unresolved the question of precisely who is ‘mentally retarded’ for Eighth Amendment purposes. Because Penry permitted capital punishment of unequivocally intellectually disabled offenders, it created little pressure to define intellectual disability carefully. By fashioning a blanket exemption from capital punishment for everyone ‘within the range of mentally retarded offenders about whom there is a national consensus’,30 however, Atkins brought the problem of line drawing to the foreground. Yet, the Court made no effort to define ‘the range of mentally retarded offenders’. Instead, it grandly announced that ‘in determining which offenders are in fact retarded … “we leave to the State(s) the task of developing appropriate ways to enforce the constitutional restriction …” ’.31

Still, Atkins dropped some breadcrumbs. A footnote implied that the ‘clinical definitions’ of the American Association on Mental Retardation (AAMR) and the American Psychiatric Association (APA) would do the trick.32 The AAMR’s definition referred to ‘significantly subaverage intellectual functioning, existing concurrently with related limitations in two or more … adaptive skill areas … and manifest[ing] before age 18’.33 The fourth edition of the APA’s Diagnostic and Statistical Manual of Mental Disorders (DSM-4) was ‘similar’ and added that ‘ “[m]ild” mental retardation is typically used to describe people with an IQ level of 50-55 to approximately 70’.34 In addition, in concluding that execution of the mentally retarded had become ‘truly unusual’,35 the Court observed that ‘only five [states] have executed offenders possessing a known IQ less than 70 since we decided Penry’.36

States responded to Atkins’ Olympian command or to earlier public discomfort with executions of the intellectually disabled37 in various ways. Several states enacted laws using a measured IQ score of 70 (or a corresponding number of standard deviations [SDs] below the population mean) as the upper limit for intellectual disability.38 Florida had enacted such a law even before Atkins. Its statute, as interpreted by the state’s Supreme Court, effectively ‘defined intellectual disability to require an IQ test score of 70 or less. If, from test scores, a prisoner is deemed to have an IQ above 70, all further exploration of intellectual disability is foreclosed’.39

Florida applied its statute to Freddie Lee Hall. Although Hall’s family and educational history provided distressing evidence of intellectual disability,40 the IQ scores deemed admissible at a post-Atkins hearing were too high to permit this compelling evidence to be considered. They ranged from 71 to 80.41 And so, 12 years after Atkins, the Court squarely confronted the issue of defining intellectual disability in Hall v. Florida. Descending from Olympus to Delphi, Justice Kennedy invoked the APA’s view as expressed in its Diagnostic and Statistical Manual (DSM) that because IQ scores are imprecise, a somewhat higher measured score than 70 could justify clinical inquiry into the individual’s adaptive functioning in school, in the family, and elsewhere.42 Thus, professional criteria that had appeared in Atkins to be merely sufficient for a finding of intellectual disability (because they were roughly congruent with legislative and popular views of what it means to be so disabled) turned out to be ‘a fundamental premise of Atkins’.43

It is easy to read this statement in Hall about the DSM as declaring that this branch of death penalty jurisprudence must track the diagnostic criteria of psychiatrists and psychologists.44 Adopting this reading of Hall, the dissent scoffed that the majority had equated ‘evolving standards of decency’—which is the traditional test for determining the scope of the Eighth Amendment45—with ‘the evolving standards of professional societies’.46 But the majority’s language is more restrained. Justice Kennedy cautioned that the professional standards informed but did not dictate the outcome.47 Even so, the Court gave great weight to the views of the psychiatrists and psychologists. In this case at least,48 their standards informed the Court not only that there was a unanimous expert consensus,49 but also that the consensus rested on a ‘statistical fact’—‘one of the most important concepts in measurement theory’50—that ‘[e]ach IQ test has a “standard error of measurement”, … often referred to by the abbreviation “SEM” ’.51 And, most other capital punishment states accepted the professionally prevalent practice of considering adaptive functioning for individuals with IQ scores as high as two SEMs above 70.52 From these facts alone, the Court inferred that Florida’s ‘rigid rule … creates an unacceptable risk that persons with intellectual disability will be executed and thus is unconstitutional’.53

Strictly speaking, this conclusion—that a state may not use a measured-score cut-off as low as 70—is as far as the Court’s holding goes. This holding is grounded upon the concept of a ‘true score’. Intuitively, the true score is the defendant’s actual IQ, which might be somewhat different from each and every measurement of it.54 The case establishes that (when it is consistent with the unanimous position of the mental health community)55 a state may conclusively presume that all defendants whose IQs ‘truly’ are above a specific number (such as 70) are not intellectually disabled—but it may not apply this presumption to defendants with ‘measured’ scores that are only somewhat higher than this targeted number. This holding invites a series of questions: why can any true score be the basis for a ‘conclusive’ presumption of non-disability? How low can this score be? How much higher than the lowest true score cut-off must the lowest ‘measured’ cut-score be? Section 3 responds to the first two questions. Sections 4 and 5 address the last question.

3. The need to allow states to use cut-scores and the meaning of ‘significantly subaverage’

The Hall Court did not reject all IQ cut-offs as unconstitutionally rigid.56 One could imagine an opinion striking down the Florida law by asserting that intellectual disability is a single, overarching category to be managed by psychologists or psychiatrists informed by multiple streams of information about the constructs of both intellectual and adaptive functioning, and holding that the Florida law was unconstitutional because it did not allow these experts to present all that information. This outcome would have been even more deferential to the expert community than was Justice Kennedy’s opinion in Hall. It would have resolved the matter by decisively committing every Atkins claim to warmer and fuzzier clinical judgements as opposed to erecting a cold and unyielding statistical barrier.57

But could the Eighth Amendment (or any other part of the Constitution) be held to prevent a state from using some cut-score for administrative convenience? Surely, at some point a high IQ score means, ipso facto, there is no significant risk of executing someone who cannot be said to deserve this punishment because of an intellectual incapacity. The ‘unacceptable risk’ approach—which implies that some level of risk is acceptable—was a less-radical response to implementing Atkins. The Hall majority did not dispute Justice Alito’s observation that ‘[a] defendant who does not display significantly subaverage intellectual functioning is [simply] not among the class of defendants we identified in Atkins.’58 ‘Significantly subaverage intellectual functioning’ as introduced in Atkins is integral to the conceptualization of intellectual disability in psychology, and a standardized test is a more objective and reliable instrument than is a clinician’s impressions of how far a defendant’s intellect is from the population norm. Medically, a substantial impairment in intellect is necessary to a diagnosis of intellectual disability,59 and low IQ scores currently are essential to an expert determination of a substantial impairment.60

IQ scores are so important because intellectual disability is not a condition that can be attributed to a single mechanism such as the loss of dopamine-producing brain cells that causes Parkinson’s.61 It is no less real,62 but its very definition is based in large part on its rarity in the population. The question of how rare is rare enough has plagued efforts to develop satisfactory diagnostic guidelines. Over the past 50 years, proposed IQ cut-off scores have varied from one to two SDs below the population mean.63 The popularity of the ‘more traditional’64 figure of two SDs seems to reside in a desire to keep the number of diagnoses low, partly to avoid stigmatizing large numbers of minorities in school settings.65

As the number of SDs for the cut-off grows, the proportion of people who are subject to the classification shrinks. IQ scores, like height, weight, and many other physical characteristics,66 tend to have a ‘normal’ or ‘Gaussian’ distribution in the population.67 Consequently, this relationship is strongly non-linear in the vicinity of –2 SD. A score change of a few points can dramatically change the proportion of people who are affected. Because this fact has a good deal to do with the professional preference for one cut-score over another, it is worth pausing to draw a picture of where the cut-scores discussed in Hall lie.

The Florida legislature defined ‘significantly subaverage general intellectual functioning’ as ‘performance that is two or more standard deviations from the mean score on a standardized intelligence test’.68 Justice Kennedy translated this into the scale on which IQ scores are reported as follows:

The concept of standard deviation describes how scores are dispersed in a population … . The standard deviation on an IQ test is approximately 15 points, and so two standard deviations is approximately 30 points. Thus a test taker who performs ‘two or more standard deviations from the mean’ will score approximately 30 points below the mean on an IQ test, i.e., a score of approximately 70 points.69

The ‘normal’ distribution is the bell-shaped one prominent in elementary statistics courses. The exact shape of all normal distributions can be determined from just two numbers—the mean and the SD. The mean states where the bell sits, and the SD determines how steeply its sides flow down from the top.

A few symbols will help immeasurably in keeping track of the different quantities that are important in Hall. We can use x to refer to all the IQ scores in a population, μx to denote their mean, and σx to represent their SD. As is conventional in statistics, z will denote the scores expressed in units of SDs.70 As shown in Fig. 1, for an IQ test with a population mean of μx = 100 and a population SD of σx = 15, 2.5% of the scores lie at or below 70 (z = –2, which is to say 2σx below the mean). More than twice as many, 5.1%, fall at or below 75 (z = –1.67), and nearly seven times as many, 16.7%, occur at or below 85 (z = –1).
Frequency distribution of IQ scores in a population with a mean of 100 and a SD of 15. Note that 2.5% of the scores lie below 71 (the Florida cut-off), 5.1% are below 76 (the Hall Court’s suggestion), and 16.7% are below 86 (an older cut-off) (IQ scores are integers, and the height of each bar is proportional to the relative number of people at each integer. The normal curve is a continuous line that fits these heights. In this figure and throughout this article, I use the area under the curve from x – 0.5 to x + 0.5 to compute the proportion of people with an integral IQ score x.).
Fig. 1.
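
The percentages quoted for the three cut-scores can be reproduced with a few lines of Python. This is only a sketch under the stated model (a normal population with mean 100 and SD 15) and the half-point convention described in the caption to Fig. 1; the function name `share_at_or_below` is a hypothetical helper introduced here for illustration.

```python
from statistics import NormalDist

# Population model stated in the text: mean 100, SD 15.
iq = NormalDist(mu=100, sigma=15)


def share_at_or_below(score: int) -> float:
    """Proportion of integer IQ scores at or below `score`.

    Integrates the normal curve up to score + 0.5, following the
    convention described in the caption to Fig. 1.
    """
    return iq.cdf(score + 0.5)


for cut in (70, 75, 85):
    z = (cut - iq.mean) / iq.stdev  # cut-score expressed in SD units
    print(f"cut-score {cut} (z = {z:+.2f}): {share_at_or_below(cut):.1%}")
```

Under these assumptions the loop reports 2.5%, 5.1%, and 16.7% for cut-scores of 70, 75, and 85, which shows how quickly the affected share of the population grows as the cut-score moves up from two SDs below the mean.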

It may be worth summarizing what we have learned about the law, psychology, and statistics of IQ cut-scores so far. Both opinions in Hall treat a major deficit in intellectual (as opposed to adaptive) functioning as a necessary but not sufficient condition for death penalty disqualification, and both regard IQ tests as a valid measure of the relevant intellectual functioning.71 For decades, behavioural scientists and clinicians, seeking to limit the percentage of the population that would be labelled intellectually disabled and appreciating the shape of the normal curve that describes the distribution of IQ in the general population,72 have deemed a departure of at least twice the population SD σx as indicative of the ‘significantly subaverage intellectual functioning’ that is essential to a diagnosis of disability.

The choice of any particular cut-score to mark ‘significantly subaverage’ obviously is arbitrary, for there is no precise point at which the quality ‘significantly subaverage’ springs into existence. There is little or no difference in intellectual functioning in people who are within an IQ point or two of one another. Thus, the choice of any particular number of SDs is largely a convention.73

The inherent room for debate regarding this convention and its accordion-like history74 did not trouble the Court in Hall. To the contrary, the majority took the use of two SDs as an unquestioned starting point in defining the kind of intellectual disability that precludes capital punishment. With that value in place, the case turned on a second convention—one pertaining to measurement error. As stated in Section 2, the majority held that a cut-score of z = –2 would be satisfactory for true scores, but was constitutionally deficient for real scores (those with measurement error). Measurement error prevented a state from circumventing the need for a more comprehensive (and softer) clinical evaluation of offenders with IQ scores in the range of z = –1.67 to z = –2.

In the next section, I try to clarify what statistical theory has to say about the constitutionally tolerable location of the threshold for a conclusive presumption that the offender is not intellectually disabled. That section considers, corrects, and builds on the Justices’ treatment of the following statistics and ideas: (1) the meaning of a test-taker’s ‘true score’; (2) estimates of test reliability; (3) estimates of the test’s SEM; and (4) the CI for the test-taker’s true score.

4. True scores and single-measurement error within classical test theory

4.1 First- and second-order questions

Why is z = 1.67 SDs below the mean, the figure that the Court seemed to accept, the highest that the Constitution permits? Why not z = 2.00 as Florida wanted? One possible answer is that the 2.47% of the population whose scores lie between these two values includes too many people who are actually intellectually disabled and hence death disqualified. Presumably, this is what the Court meant when it spoke of ‘an unacceptable risk that persons with intellectual disability will be executed’.75 But Justice Kennedy’s opinion does not quantify this risk or the trade-off with the risk of not executing an offender who, under existing law, otherwise deserves to die. The only reason the Court offered to suspect that there is a substantial fraction of intellectually disabled capital offenders in this IQ range is that psychologists and psychiatrists have guidelines that use z = –2 as a benchmark (because scores this low are relatively rare) but then blur the boundary by referring to the possibility of measurement error in individual cases. For example, the American Association on Intellectual and Developmental Disabilities (AAIDD) states that a ‘significant limitation[] in intellectual functioning’ exists when ‘[a]n IQ score … is approximately two standard deviations below the mean, considering the standard error of measurement for the specific assessment instruments used and the instruments’ strengths and limitations’.76
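
The 2.47% figure can be checked directly from the standard normal curve. The sketch below assumes, as the figure itself appears to, that the two boundaries are treated as points on a continuous z scale (without the half-point adjustment used for integer scores in Fig. 1).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Share of a normal population lying between 2 and 1.67 SDs below the mean.
share_between = z.cdf(-1.67) - z.cdf(-2.0)
print(f"{share_between:.2%}")  # prints 2.47%
```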

The majority ran with this idea. As indicated in Section 3, it adopted the conventional choice of z = –2 (corresponding to 70 points on the usual IQ score scale), for the line that separates the intellectually able from the intellectually disabled but then argued that the ‘inherent error’ in IQ tests means that to be reasonably sure of drawing the line at 70 (the figure selected because the number of people whose scores fall at or below 70 is not too large), it must be drawn at 75. Adjusting for measurement error in this manner represents imposing a second convention (relating to the uncertainty in individual measurements) on top of the first (establishing the departure from the average that suffices to permit a diagnosis of intellectual disability).

This two-part framework presupposes that the choice of z = –2 is at or near the edge of what is constitutionally permissible. If Florida could have drawn the first-order line substantially below this value, then z = –2 already contains a tolerable margin for error. The failure to explain how the AAIDD–APA choice of an IQ score designed to prevent stigmatizing too many people as intellectually disabled captures the reasons expressed in Atkins for regarding mental retardation as a death-disqualifying condition exposes Justice Kennedy’s opinion to the objection that ‘definitions of intellectual disability … are promulgated for use in making a variety of decisions that are quite different from the decision whether the imposition of a death sentence in a particular case would serve a valid penological end’.77

Of course, direct attention to the validity of IQ scores for the desired purpose could lead to a higher rather than a lower score for precluding further inquiry into the existence of the disability.78 My point is not that a state should be free to impose a stricter limit. It is simply to underscore that the Hall Court does not come to grips with the basic question in defining an IQ cut-score.79

Pretermitting that question, the Court pursued the second-order question of measurement error. Let us assume that μx – 2σx = 70 is as high as one can go in demarcating the range of IQ scores that are so aberrant as to permit the diagnosis of intellectual disability. How does it follow that 75 (or some similar number) is the lower bound for scores that make it unnecessary to consider the adaptive functioning prong of the diagnosis? The Court invoked the ‘ “standard error of measurement”, … often referred to by the abbreviation “SEM” ’.80 The majority opinion then quoted the DSM-5, which states, ‘Individuals with intellectual disability have scores of approximately two standard deviations or more below the population mean, including a margin for measurement error (generally ±5 points) … . [T]his involves a score of 65–75 (70 ± 5).’81

The dissent disputed the assertion that ±5 points is the ‘margin for measurement error’. An exasperated Justice Alito complained that ‘there is no reason to assume a SEM of 5 points’82 when ‘we know that the SEM for Hall's most recent IQ test was 2.16 — less than half of the Court's estimate of 5’.83 To evaluate this exchange, and to see whether Hall permits a state to specify a cut-off below 75, we must appreciate how the SEM is used to produce ‘a margin of error’ and what alternatives are available. Unless one believes that anything that the APA and AAIDD do must be followed (even if there are statistically superior approaches that would satisfy the concerns articulated in Atkins), it becomes necessary to consider (1) what figure to use for the SD of errors and (2) how many such standard errors are required to keep the risk of misclassification tolerable. We will see that Justice Kennedy could have avoided the ambiguities detected by Justice Alito by expressing the margin of error in terms of some fixed number of SEMs rather than the number 75. And looking more deeply, we will see that the Court’s SEM is not the best standard error to use for an individual’s score and that there are other reasonable ways to use the most appropriate statistic. A constitutional analysis ought to reflect these statistical facts as well as the popularity of the SEM.
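
A short calculation makes the disagreement concrete. The sketch below is illustrative only: the helper names are hypothetical, and the multiplier of two SEMs is an assumption borrowed from the practice the majority described, not a number that Hall fixes.

```python
def margin_in_sems(margin_points: float, sem: float) -> float:
    """Express a margin stated in IQ points as a multiple of the SEM."""
    return margin_points / sem


def measured_cutoff(true_cutoff: float, sem: float, k: float) -> float:
    """Measured-score cut-off set k SEMs above the true-score cut-off."""
    return true_cutoff + k * sem


SEM_HALL = 2.16  # SEM reported by the dissent for Hall's most recent test

# The DSM-5 margin of 5 points, restated in units of this test's SEM.
print(f"{margin_in_sems(5, SEM_HALL):.2f} SEMs")

# Cut-off implied by a two-SEM rule (assumed multiplier) for this test.
print(f"{measured_cutoff(70, SEM_HALL, k=2):.2f}")
```

With an SEM of 2.16, a 5-point margin amounts to roughly 2.3 SEMs, whereas a two-SEM rule would place the measured cut-off near 74.3 rather than 75. Stating the rule as a number of SEMs makes the cut-off track the precision of the particular test.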

4.2 True scores and measurement error

Individual IQ scores are squishy. They are not perfectly reproducible. Measured scores fluctuate around ‘true scores’, and to the extent that the observations erratically depart from the true value, their ‘reliability’ is less than 1. In psychometrics, classical true score theory defines an individual’s unknown ‘true score’ (which we can denote with the Greek letter tau, τ) as the average score that a given individual would achieve after taking the same test an infinite number of times (without learning anything from taking the test each time and without any other relevant changes). The difference between this test-taker’s score x on a given test administration and the true score τ is the error (e) in that measurement (x). In these symbols, x = τ + e. If the error is positive, the true score is less than the measured score. If it is negative, the true score is greater than the observed score. We know the observed score. We want to infer the true score.84

We can at least imagine many measurements of the same individual’s IQ. Some of the errors in these measurements will be large, some small. These errors e will have some distribution with a particular shape that gives rise to a mean and a SD. Classical true score theory posits that the errors e in the individual’s repeated scores are normally distributed with mean 0 and an unknown SD. The mean error (μe = 0) indicates that the errors are centred on the true score, and the SD (σe) tells how spread out they usually are. If we place true scores τ along the x-axis, the observed scores come from the normal distributions that could be drawn around each true score in Fig. 2.
Theoretical distribution of observed scores x shown for one individual with a true score of τ = 70 in a population with normally distributed true scores with mean μτ = 100 and SD στ = 15. The individual’s observed score distribution (indicated in the smaller histogram) is normal with mean μx = τ = 70 and SD σe = 2.16.
Fig. 2.
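
The model x = τ + e can be simulated to see what the smaller histogram in Fig. 2 represents. This is a minimal sketch using the values shown in the figure (τ = 70, σe = 2.16); the sample size and random seed are arbitrary choices, and the half-point boundary follows the integer-score convention of Fig. 1.

```python
from statistics import NormalDist, fmean, stdev

TAU = 70.0      # true score of the individual shown in Fig. 2
SIGMA_E = 2.16  # SD of that individual's measurement errors (Fig. 2)

# Draw hypothetical repeated administrations: x = tau + e, with e ~ N(0, sigma_e).
errors = NormalDist(mu=0.0, sigma=SIGMA_E).samples(100_000, seed=1)
observed = [TAU + e for e in errors]

mean_observed = fmean(observed)  # should sit close to tau
sd_observed = stdev(observed)    # should sit close to sigma_e

# Chance that a single administration yields an integer score of 71 or more,
# i.e. a measured score above 70 despite a true score of exactly 70.
p_above_70 = 1 - NormalDist(TAU, SIGMA_E).cdf(70.5)

print(f"mean of observed scores: {mean_observed:.2f}")
print(f"SD of observed scores:   {sd_observed:.2f}")
print(f"P(observed >= 71):       {p_above_70:.2f}")
```

Under these assumptions the simulated scores centre on the true score, and a sizeable fraction of single administrations land above 70 even though the true score is 70. That is the kind of misclassification that a margin for measurement error is meant to limit.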

These normal distributions attached to the true scores must not be confused with the normal distribution that applies to the true IQs in a population. The error distributions for each possible true score apply to individuals at every true score level. They could be different even for test-takers who have the same true scores. Only one such distribution, for one test-taker, appears in Fig. 2. Different individuals at each true score level would have distributions with larger or smaller widths, since some people would be more consistent in their scores on repeated tests. The mean of each individual’s error distribution is μe = 0,85 not μx = 100, and the SD σe of a given individual’s error distribution is much smaller than the SD of true scores across the entire population of heterogeneous people. Smaller, but not zero. To assess the precision of the estimated true score τ of a given individual, we need to estimate σe for that individual.
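The model x = τ + e can be made concrete with a short simulation. The sketch below is only an illustration: it assumes a hypothetical test-taker with τ = 70 and σe = 2.16 (the values shown in Fig. 2), treats the errors as normal with mean 0, and uses an arbitrary function name and random seed that are not part of the original analysis.

```python
import random
import statistics

def simulate_scores(tau, sigma_e, n, seed=0):
    """Draw n observed scores x = tau + e, with e ~ Normal(0, sigma_e).

    Hypothetical helper: it mimics the thought experiment of retesting
    one person many times with no learning and no change in true score.
    """
    rng = random.Random(seed)
    return [tau + rng.gauss(0.0, sigma_e) for _ in range(n)]

scores = simulate_scores(tau=70.0, sigma_e=2.16, n=100_000)
mean_score = statistics.fmean(scores)  # settles near the true score tau
sd_score = statistics.pstdev(scores)   # settles near sigma_e
```

With many imagined administrations, the average of the observed scores approaches the true score and their spread approaches σe, which is all that the definition of a true score asserts.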

4.3 Reliability and SEM

The Court fully recognized that IQ scores of different individuals are expected to fluctuate randomly around the true scores for these individuals. The majority put the point as follows:

Each IQ test has a ‘standard error of measurement’ … often referred to by the abbreviation ‘SEM’. A test's SEM is a statistical fact, a reflection of the inherent imprecision of the test itself … . An individual's IQ test score on any given exam may fluctuate for a variety of reasons. These include the test-taker's health; practice from earlier tests; the environment or location of the test; the examiner's demeanor; the subjective judgment involved in scoring certain questions on the exam; and simple lucky guessing.86

The SEM is an approximation of an average σe for all test-takers.87 Instead of testing a single person many times to estimate σe, which is hardly feasible, test developers could test many people twice. Assuming that the test and retest scores are obtained within a short enough time period that each individual’s true IQ itself has not changed88 and that no individual improves with practice, the correlation between the test and the retest scores would indicate the reliability of the test. If the two scores were always identical, one could predict the second score perfectly from the first—the correlation would be 1. If they were completely unrelated, the first score would be of no value in predicting the second, and the correlation would be zero. Denoting the correlation in the test–retest scores for an entire population by the Greek letter rho (ρ), it can be shown that
\[ \mathrm{SEM} = \sigma_x \sqrt{1 - \rho} \qquad (1) \]
If σx is 15, as the Court proposed, and if the reliability of the test is, say, ρ = 0.82,90 then the SEM is 15 √0.18 = 6.36.

In practice, test developers use less-direct methods to estimate reliability. Roughly speaking, they split the test in half and compare the score on one half with that on the other half.91 These two internal scores are analogous to test–retest scores. It is as if each person took two tests at once. If the internal reliability is ρ = 0.98,92 then the SEM is 15 √0.02 = 2.12.
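Both SEM figures can be checked in a few lines. The sketch below assumes the relationship that the arithmetic in the text implies, SEM = σx √(1 − ρ); the function name is hypothetical.

```python
import math

def sem_from_reliability(sd_observed, reliability):
    """Standard error of measurement: sigma_x * sqrt(1 - rho)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_observed * math.sqrt(1.0 - reliability)

# The two illustrations in the text, both with sigma_x = 15.
sem_retest = sem_from_reliability(15, 0.82)      # test-retest example
sem_split_half = sem_from_reliability(15, 0.98)  # internal-consistency example
```

Raising the reliability from 0.82 to 0.98 cuts the SEM to about one-third of its former size, which is why the choice of reliability estimate matters so much for any cut-score.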

As this example indicates, the Court’s understanding of the overall SEM as an inherent property of the test is not quite correct.93 The SEM can vary with the group used to estimate test–retest reliability; moreover, other measures of reliability are also employed (to account for measurement error induced by different forms of the same test or scoring by different raters).94 As Justice Alito wrote, ‘there is not a single, uniform SEM across IQ tests or even across test-takers. Rather, “the [SEM] varies by test, subgroup, and age group” ’.95

This complication, however, does not preclude estimates of σe for different test-takers. As is typical in applied statistics, there will be arguments about the application of the general concepts and group statistics to individuals,96 but let us assume that one has a good estimate of reliability for the kinds of random measurement error that are of concern for the test and the defendant who has taken it. If the reliability, and hence the overall SEM, accounts for all the important sources of departures from the individual’s true score, what can we say about the individual’s true score? The majority thought that the answer lies in using the overall SEM to form a CI around the observed score.97

4.4 CIs from the SEM (SEM-IS)

Justice Kennedy asserted that ‘[t]he SEM allows clinicians to calculate a range within which one may say an individual's true IQ score lies.’98 He added that ‘SEM is a unit of measurement: 1 SEM equates to a confidence of 68% that the measured score falls within a given score range, while 2 SEM provides a 95% confidence level that the measured score is within a broader range.’99 This exposition is not ideal. Clinicians do not use the SEM to decide whether ‘the measured score falls within a given score range’.100 The measured score is the IQ score x, and one can say with 100% confidence whether this score falls within any interval one likes. Moreover, as explained below, ‘confidence’ is a term of art—95% ‘confidence’ does not have the simple meaning that a lay reader would ascribe to it (and that the Court may have thought it does).

In an act of statistical one-upmanship, Justice Alito supplied examples of how to compute CIs with the SEM. According to the dissent,

If a test-taker scores a 72 on an IQ test with a SEM of 2, the 66% confidence interval is the range of 70 to 74 (72 ± 2). In this situation, there is approximately a 66% chance that the test-taker's ‘true’ IQ is between 70 and 74; roughly a 17% chance that it is above 74; and roughly a 17% chance that it is 70 or below. Thus, there is about an 83% chance that the score is above 70.

Similarly, using two SEMs, we can build a 95% confidence interval. [L]et us hypothesize a case in which the defendant's obtained score is 74. With the same SEM of 2 as in the prior example, there would be a 95% chance that the true score is between 70 and 78 (74 ± 4); roughly a 2.5% chance that the score is above 78; and about a 2.5% chance that the score is 70 or below. The probability of a true score above 70 would be roughly 97.5%.101

The arithmetic is mostly correct.102 The semantics are not. It is common to form a CI by adding and subtracting some number of SEMs to the observed score x.103 And to increase ‘confidence’, one need only add and subtract a larger number of SEMs.104 To this extent, the opinion is correct. But ‘confidence’ is an abstruse and technical term.105 It is not, as Justice Alito proposed, the probability or ‘chance that the test-taker's “true” IQ falls within this range’.106 In classical true score theory, the true score is a fixed number, not a random variable with probabilities distributed across some range of possible values. In Justice Alito’s second example, 74 ± 4 is one of many possible 95% CIs. If we retested the same person, the second score might be 76, giving rise to a new CI of 76 ± 4. A third test might yield a CI of 80 ± 4. As the number of intervals constructed from the formula CI = x ± 1.96 SEM grows infinitely large, 95% of them will include the true score τ, whether that score is between 70 and 78, whether it is between 72 and 80, whether it is between 76 and 84, or whether it is inside any other conceivable interval. Thus, the CI is useful for indicating the precision of an estimate,107 but ascertaining the probability that the true score is within a given region requires other methods.108
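The frequentist meaning of ‘95% confidence’ can be seen in a simulation. The sketch below is an illustration under stated assumptions: the true score is held fixed, the errors are normal with SD equal to a hypothetical SEM of 2, and the function name and seeds are arbitrary. It counts how often the interval x ± 1.96 SEM covers the fixed true score.

```python
import random

def coverage(tau, sem, n_tests, z=1.96, seed=1):
    """Share of intervals x +/- z*sem that contain the fixed true score tau."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        x = tau + rng.gauss(0.0, sem)        # one imagined test administration
        if x - z * sem <= tau <= x + z * sem:
            hits += 1
    return hits / n_tests

# Two different hypothetical true scores; the coverage rate is the same.
cov_72 = coverage(tau=72.0, sem=2.0, n_tests=200_000, seed=1)
cov_77 = coverage(tau=77.0, sem=2.0, n_tests=200_000, seed=2)
```

The coverage rate is close to 95% whatever the true score happens to be. The 95% figure describes the long-run performance of the procedure, not the probability that any one computed interval, such as 70 to 78, contains τ.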

The conflation of ‘confidence’ with ‘probability’ leads the dissent to accuse the majority of misallocating the burden of persuasion. Justice Alito wrote that

As Hall concedes, the Eighth Amendment permits States to assign to a defendant the burden of establishing intellectual disability by at least a preponderance of the evidence … . In other words, a defendant can be required to prove that the probability of a 70 or sub–70 IQ is greater than 50%. Under the Court's approach, by contrast, a defendant could prove significantly subaverage intellectual functioning by showing simply that the probability of a ‘true’ IQ of 70 or below is as little as 17% (under a one-SEM rule) or 2.5% (under a two-SEM rule). This totally transforms the allocation and nature of the burden of proof.

[I]t would be simple enough to devise a 51% confidence interval—or a 99% confidence interval for that matter. There is therefore no excuse for mechanically imposing standards that are unhinged from legal logic and that override valid state laws establishing burdens of proof. The appropriate confidence level is ultimately a judgment best left to legislatures, and their judgment has been that a defendant must establish that it is more likely than not that he is intellectually disabled. I would defer to that determination.109

To continue where Justice Alito left off in his construction of CIs, the 51% CI centred on an observed score extends ±0.69 SDs around that score. For the postulated σe = SEM = 2, a score of x = 72 produces a 51% CI with a lower bound of 71;110 a score of x = 71 yields a lower bound of 70.111 According to Justice Alito’s analysis, therefore, a state that defined ‘significantly subaverage intellectual functioning’ as a true IQ score of 70 could exclude all offenders with observed scores (on that IQ test) of 72 or more from further consideration on the theory that it is probably the case that these offenders’ true scores exceed 70.
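The 51% interval can be reproduced with the standard normal quantile function. The sketch below assumes normal errors and the postulated SEM of 2; the function name is hypothetical, and the bounds are rounded to whole IQ points as in the text.

```python
from statistics import NormalDist

def central_interval(x, sem, confidence):
    """Two-sided interval x +/- z*sem, where z leaves (1 - confidence)/2 in each tail."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return x - z * sem, x + z * sem, z

low_72, high_72, z_51 = central_interval(72, 2.0, 0.51)
low_71, high_71, _ = central_interval(71, 2.0, 0.51)
```

The multiplier is about 0.69, so the interval reaches only about 1.4 points on either side of the observed score. This is why the lower bounds round to 71 and 70.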

There are two problems with this reasoning. First, Justice Alito stops short of the logical conclusion to the argument. On the dissent’s understanding of ‘confidence’ and the burden of persuasion, there is no reason to compute a CI. After all, we are only concerned with errors in the lower tail of the curve. (By definition, a defendant whose true IQ score is higher than the interval estimate for any observed score of at least 70 is not intellectually disabled.) The normal curve is symmetric, so exactly 50% of the area under the curve lies to the left of the measured score. Look back at Fig. 2. When the distribution is centred on 70 itself, 50% of the error distribution dips below 70. When it is centred on 71, less than 50% of the error distribution is below 70. The same is true of the error distribution around an observed score of 72, 73, and so on. Consequently, no CI at all needs to be used. Every measured score greater than 70 has the majority of its measured error in the region above 70. No computation is required.

The second and more fundamental problem is that ‘confidence’ is not a probability statement about a fact subject to the burden of persuasion. As I show in a more detailed essay on ‘confidence’ and the burden of persuasion, depriving the state of the opportunity to fix the line at either 70 (considering only the lower tail of the error distribution) or 72 (considering both tails) does not necessarily conflict with the state’s general use of a 51% burden of persuasion; and it is not at all obvious that the more-probable-than-not standard should apply in this context.112 More defensible procedures for establishing a cut-score for measured IQ scores are available. The next section presents one of them.

4.5 SEM-adjusted maximum score (SEM-AM)

Justice Alito defended the 70-point cut-off as much simpler to apply than the majority’s incorporate-a-margin-of-error approach. He remarked that

it is unclear to me whether the Court concludes that a defendant is constitutionally entitled to introduce non-test evidence of intellectual disability (1) whenever his score is 75 or lower, on the mistaken understanding that the SEM for most tests is 5; (2) when the 66% confidence interval (using one SEM) includes a score of 70; or (3) when the 95% confidence interval (using two SEMs) includes a score of 70.113

Unfortunately, Justice Kennedy did not clarify which of these possibilities the Court had in mind, but it is doubtful that the references to the 70–75 range and to Hall’s concession that a state could choose 75 as an absolute cut-off are integral parts of its holding. The disability-testing guidelines now tend to be framed in terms of standard errors rather than specific scores.114 The Court simply ‘agree[d] with the medical experts that when a defendant’s IQ test score falls within the test’s acknowledged and inherent margin of error [around the statutorily prescribed cut-off for eligibility], the defendant must be able to present additional evidence of intellectual disability, including testimony regarding adaptive deficits’.115 It is not that hard to use a rule that sets the maximum IQ score for a finding of intellectual disability at the legislative maximum for the true score (such as 70) plus 1.96 SEMs for the test in question.116 For an IQ test with an SEM of 2.16, this maximum would be 70 + 1.96 × 2.16 = 74. A score on this test of 75 or above thus would make an offender ineligible for the Atkins exemption.

We can call such a rule an SEM-adjusted maximum score rule (SEM-AM, for short) because it takes the legislatively determined true score and boosts it some number of SEMs to fix the maximum IQ score from a single test that can place an offender (subject to other evidence) in the intellectually disabled range. In symbols, the adjusted maximum is xmax = τleg + k SEM, where τleg is the legislative choice for the largest true score consistent with intellectual disability and k is the number of SEMs for the desired ‘confidence’ (68% for k = 1, and 95% for k = 1.96). If SEM were equal to the standard error σe for every defendant, one could say that the SEM-AM rule for 95% confidence would keep only about 2.5% of offenders with true IQ scores at τleg from producing further evidence of their disability.117

In this way, a state can pick a single number xmax, as Justice Alito wanted. Moreover, this simple procedure does the same thing as the SEM-IS rule the dissent deprecated as unduly complicated. In statistical jargon, the SEM-AM rule is simply a null hypothesis test that tells us whether, at the 0.025 level for a false alarm, an observed score is larger than 70.118 The outcomes of this test for statistical significance are identical to the decisions that follow from using two-sided 95% CIs in the SEM-IS procedure. That is, if we decide that a true score is above 70 if and only if the 95% CI for the measured score x is entirely above 70, we will reject the claim of a true IQ of 70 or less in just those cases in which the measured IQ x exceeds 70 + 1.96 SEM.119 Whether we decide according to the SEM-IS or the SEM-AM rule makes no difference.120
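The equivalence of the two rules is easy to check numerically. The sketch below assumes τleg = 70, an SEM of 2.16 and k = 1.96, as in the running example; the function names are hypothetical. One function applies the single adjusted cut-score, and the other asks whether the two-sided interval around the observed score lies entirely above τleg.

```python
def sem_am_cutoff(tau_leg, sem, k=1.96):
    """SEM-adjusted maximum score: x_max = tau_leg + k * SEM."""
    return tau_leg + k * sem

def ineligible_sem_am(x, tau_leg, sem, k=1.96):
    """SEM-AM rule: the claim is rejected when x exceeds the adjusted maximum."""
    return x > sem_am_cutoff(tau_leg, sem, k)

def ineligible_sem_is(x, tau_leg, sem, k=1.96):
    """SEM-IS rule: the claim is rejected when the whole interval is above tau_leg."""
    return x - k * sem > tau_leg

x_max = sem_am_cutoff(70, 2.16)
decisions_agree = all(
    ineligible_sem_am(x, 70, 2.16) == ineligible_sem_is(x, 70, 2.16)
    for x in range(60, 91)  # every whole-number score from 60 to 90
)
```

Because x − k SEM > τleg is the same inequality as x > τleg + k SEM, the two functions necessarily return the same decision for every observed score.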

Two caveats on these rules are in order. First, we cannot specify the expected proportion of defendants who are falsely classified as ineligible, for that depends on how many convicted capital offenders have true scores of 70, 69, 68, 67, and so on. For example, if we were to assume that 30% have true IQs of τ = 70, that 15% have τ = 69, that 8% have τ = 68, and that 4% have τ = 67, then the expected error rate would be (30% × 0.025) + (15% × 0.0077) + (8% × 0.0020) + (4% × 0.0004) + … = 0.88%.121
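The arithmetic in this hypothetical can be reproduced as follows. The sketch assumes normal errors, an SEM of 2.16, and the 95% SEM-AM cut-score of 70 + 1.96 × 2.16; the function name is hypothetical, and only the four true-score levels listed in the text are included, so the result is a partial sum.

```python
from statistics import NormalDist

def false_exclusion_prob(tau, tau_leg=70.0, sem=2.16, k=1.96):
    """P(observed score > adjusted maximum | true score tau), normal errors assumed."""
    x_max = tau_leg + k * sem
    return 1.0 - NormalDist(mu=tau, sigma=sem).cdf(x_max)

# Assumed shares of offenders at each true score (the text's example).
shares = {70: 0.30, 69: 0.15, 68: 0.08, 67: 0.04}

expected_rate = sum(
    share * false_exclusion_prob(tau) for tau, share in shares.items()
)
```

The individual exclusion probabilities fall off quickly as the true score drops below 70, so the total is dominated by the offenders whose true scores sit exactly at the legislative maximum.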

Second, as noted earlier, the SEM is a kind of average error across all IQ scores. As an estimator of σe, it works best for estimating the variation in true scores that are near the mean for all of these test-takers. But the only offenders who can seek the exemption from capital punishment are in the vicinity of two or more SDs from the population mean (if that is the targeted true score τleg). The most appropriate SEM for an SEM-AM rule might come from reliability statistics for a group of low-scoring test-takers in the region of τleg.

4.6 CIs from the standard error of estimate (SEE-IS)

Let us return to the Court’s main idea of constructing an interval estimate based on the observed score and seeing whether it goes low enough to be considered in the substantially subaverage zone necessary for a diagnosis of disability. The SEM interval score rule (SEM-IS) that we previously discussed used intervals of x ± k SEM. But psychometricians have long recognized that a different standard error is more suitable for inferring true scores from observed ones.122 This ‘standard error of estimate’ or SEE123 indicates the variation in the true scores of different test-takers ‘who have the same measured score’. It can be used in place of the SEM to produce an SEE interval score rule (SEE-IS). The SEE is a conditional SEM. That is, many people will score the same as a given individual j. All of them have the one observed score xj, but not all of them have the same true score τj. Some have τ above xj, some have τ below xj, and some have τ = xj. Assuming normality and equal variance of the error term for all test-takers, the conditional distribution of τ for the xj-score cohort is normal with SD
\[ \mathrm{SEE} = \sigma_x \sqrt{\rho\,(1 - \rho)} = \mathrm{SEM}\,\sqrt{\rho} \qquad (2) \]
Applying this formula to the previous examples of SEMs of 6.36 and 2.12 gives SEEs of 5.76 and 2.10. Conditioning on the observed score has reduced the standard error.
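These two SEEs can be verified directly. The sketch below assumes the form SEE = σx √(ρ(1 − ρ)), which is the expression that reproduces the figures in the text; the function name is hypothetical.

```python
import math

def see_from_reliability(sd_observed, reliability):
    """Standard error of estimate: sigma_x * sqrt(rho * (1 - rho))."""
    return sd_observed * math.sqrt(reliability * (1.0 - reliability))

see_low_reliability = see_from_reliability(15, 0.82)   # SEM was about 6.36
see_high_reliability = see_from_reliability(15, 0.98)  # SEM was about 2.12
```

Since the SEE equals the SEM multiplied by √ρ, the reduction is noticeable when reliability is modest and nearly vanishes when reliability approaches 1.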

Conditioning has a second consequence. It shifts the point on which the interval sits towards the mean. Instead of using the observed score x as the point estimate, we use ρx + (1 – ρ)μx.125 This adjustment reflects the common, but often unrecognized phenomenon of regression to the mean.126 To the extent that chance affects test scores—the central concern of the Hall Court—some people will get a bonus over what they could earn based on their ability alone, and some will be penalized. Thus, chance will pull or push some people into the two extreme ends of the distribution of scores. On average, the truly high-ability people will stay high and the truly low-ability ones will stay low, but the moderate-ability people who happened to score very high or low on the test the first time around will drift back to the middle of the pack in the scenario of infinite retesting that produces a true score.127

As an example of applying these regression-based formulas for estimating the range of true scores that people with a given observed score have, consider an observed score of x = 75 on a test that has a reliability of ρ = 0.97. In the SEM-IS approach, the 95% CI would be 75 ± 1.96 × 15 √(1 – 0.97) = 70–80. A defendant could be said to fall into the potentially disabled zone of 70 or below. In the regression-based SEE-IS approach, the 95% CI is 75.75 ± 1.96 × 15 √[0.97(1 – 0.97)] = 71–81. The net effect of narrowing and raising the interval in this example is that offenders with single-measured IQ scores of 75 on this test no longer qualify for a more probing inquiry into their possible intellectual disability.
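The two intervals in this example can be computed side by side. The sketch below assumes a population mean of 100 and SD of 15, with the shrunken point estimate ρx + (1 − ρ)μx described above; the function names are hypothetical, and the endpoints are rounded to whole IQ points as in the text.

```python
import math

def sem_is_interval(x, rho, sd=15.0, z=1.96):
    """SEM-IS: interval centred on the observed score, half-width z * SEM."""
    half_width = z * sd * math.sqrt(1.0 - rho)
    return x - half_width, x + half_width

def see_is_interval(x, rho, mean=100.0, sd=15.0, z=1.96):
    """SEE-IS: interval centred on rho*x + (1 - rho)*mean, half-width z * SEE."""
    centre = rho * x + (1.0 - rho) * mean
    half_width = z * sd * math.sqrt(rho * (1.0 - rho))
    return centre - half_width, centre + half_width

sem_low, sem_high = sem_is_interval(75, 0.97)
see_low, see_high = see_is_interval(75, 0.97)
```

The SEM-IS interval just reaches 70 only after rounding (its lower end is slightly below 70), while the SEE-IS interval starts above 70 because its centre has moved up to 75.75 and its half-width is slightly smaller.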

This procedure for correcting IQ scores for regression to the mean is related to more general statistical ideas about shrinkage estimators.128 These estimators129 compromise the properties of more traditional estimators ‘(maximum likelihood, minimum variance unbiased, least squares, etc.)’ to achieve better precision or other desirable qualities.130 They have been the subject of considerable study131 and application in many fields.132

The SEE-IS approach of using a conditional standard error permits statements about the frequency with which the true score would fall into the CI. For example, one can legitimately report that ‘[f]or people who have a score like yours, we find that X% of them truly fall somewhere between A and B.’133 Following Justice Alito’s guide to CIs, a defendant with a score of 71 on our test with a reliability of 0.98 could argue that about 68% of observed scores of 71 correspond to true scores in the 70–74 range; hence, 16% of them have true scores of 70 or below. Then, shifting to the majority’s reasoning, the defendant could urge that depriving 16% of people who might otherwise be put to death of the opportunity to bring forth proof of an intellectual disability ‘creates an unacceptable risk that persons with intellectual disability will be executed, and thus is unconstitutional’.134

5. Other statistical issues in and outside of Hall

The discussion up to this point does not exhaust the set of procedures for constructing CIs that statisticians have devised.135 Hall should be read as directing experts to handle measurement error in IQ scores in the most professionally responsible manner rather than instantiating any one of the narrow alternatives listed in the dissenting opinion. This would leave room for combining multiple scores and for the consideration of Bayesian procedures that assign probabilities to all possible true scores. This section briefly discusses these two matters.

5.1 Multiple scores

Justice Alito complained that ‘the Court entirely ignores [the fact] that Florida … takes into account the inevitable risk of testing error by permitting defendants to introduce multiple scores’.136 Indeed, the trial court in Hall heard testimony about at least four IQ test scores for Hall—71, 72, 73, and 80137—and it appears that Hall had taken as many as seven IQ tests over a 40-year period.138

Although the majority did not ‘entirely ignore’ multiple scores, Justice Kennedy’s treatment of them is difficult to understand, statistically as well as legally. He wrote that ‘[e]ven when a person has taken multiple tests, each separate score must be assessed using the SEM’,139 and he seemed to be referring to the single-score SEM. If this is what ‘the SEM’ means, the dictum cannot be right. Uncertainty about the true score declines as more measurements are made.140 In technical terms, because the SEM for a single score is greater than the standard error of the average of several scores, using the single-score SEM as a measure of the probable error in the average score would be a mistake. An SEM-based score interval for the average can lie above a figure like 70 even though at least one—or even all—of an offender’s separate intervals dip below 70. Assuming that Hall’s IQ scores are independent, as the dissent implied,141 the 95% CI that applies to his average score is 73–75.142 Having accounted for the measurement error in the average of the IQ scores, the state should be able to enforce a 70-or-less rule for a true score, barring Hall from producing evidence of his clear deficits in adaptive functioning.

One might argue that the computation of the interval for Hall’s average is oversimplified, for it assumes fully independent scores. Indeed, the majority was worried that ‘the analysis of multiple IQ scores jointly is a complicated endeavor’.143 But this argument is a makeweight. Does the majority really believe that the Constitution requires the state to treat a defendant with 10 successive, carefully administered IQ scores of 71 the same as a defendant with but a single score of 71?144 Not only is this rule intuitively implausible, but analysis of multiple scores is hardly beyond the ken of statisticians.145 The chapter in a psychology handbook146 cited in Justice Kennedy’s majority opinion for the view that the problem is unduly complicated147 actually states that the solution ‘is not nearly so complicated as it might seem at first glance’.148 It shows how to combine scores with pencil and paper or a spreadsheet.149
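The contrast between one score of 71 and ten scores of 71 can be illustrated with the simplest pooling method, the standard error of a mean. The sketch below is not the handbook's procedure. It assumes fully independent scores, a common SEM of 2.16 chosen purely for illustration, and a hypothetical function name.

```python
import math
import statistics

def mean_score_interval(scores, sem, z=1.96):
    """Interval for the average of independent scores: mean +/- z * SEM / sqrt(n)."""
    n = len(scores)
    centre = statistics.fmean(scores)
    half_width = z * sem / math.sqrt(n)
    return centre - half_width, centre + half_width

single = mean_score_interval([71], sem=2.16)        # one score of 71
ten = mean_score_interval([71] * 10, sem=2.16)      # ten scores of 71

width_single = single[1] - single[0]
width_ten = ten[1] - ten[0]
```

Under these assumptions, the interval for ten scores is narrower than the single-score interval by a factor of √10, or roughly 3.2. Correlated errors across administrations would shrink that gain, which is one reason the independence assumption needs scrutiny in a real case.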

Finally, Justice Kennedy added that ‘because the test itself may be flawed, or administered in a consistently flawed manner, multiple examinations may result in repeated similar scores, so that even a consistent score is not conclusive evidence of intellectual functioning’.150 Certainly, all IQ tests are imperfect in many ways. In addition to being unable to eliminate random error, they may be culturally biased or ‘biased in favor of neurotypical individuals’.151 Or, a defendant may be aiming to understate his true score.152 But the only thing that CIs can protect against is random error.153 If the test is so flawed in other ways that multiple scores are not usable, then, a fortiori, single scores cannot be used. If the possibility of repeated mistakes in test administration introduces serious bias, then serious bias is at least as great a problem as in interpreting a less-precise single test score.

Given the flimsiness of the Court’s comments about multiple scores, the best interpretation of them is that they merely mean that the option of retesting in the Florida law is not sufficient to handle the problem of measurement error. Under Hall, multiple scores do not obviate the need to consider measurement error through an appropriate interval estimate, but a statistically sound interval estimate based on multiple scores should be admissible to ascertain whether the defendant’s true score is 70 or less.

5.2 Bayesian credible regions (BCRs)

A final issue in the construction of CIs or equivalent decision rules is whether the Constitution prevents a state from going beyond the classical test theory discussed in the Hall opinions and briefs.154 The SEM-IS, SEM-AM, and SEE-IS approaches to measurement error all embody the ‘frequentist’155 or ‘classical’156 theory of statistical inference. Within this framework, a test-taker’s true score is a parameter in a statistical model of random error. The ‘[p]arameters are fixed, unknown constants’, which means that ‘no useful probability statements can be made about parameters’.157 The best we can do is postulate values for the true score and see how often decisions about the true scores of many defendants, based strictly on the observed score (or multiple scores), would be right or wrong. Suppose, for example, that every capital offender has a true score of 70 or less. If the observed scores are normally distributed with the estimated standard error, then the 95% SEM-IS rule would keep the expected rate of false alarms (decisions wrongly preventing a defendant from demonstrating disability with evidence of deficits in adaptive functioning) to no more than 2.5%. And, in no case will we misclassify a non-disabled offender as potentially disabled—because there are no such offenders. But suppose that 50% of capitally convicted offenders have true scores of exactly 70 and that 50% have true scores of 71. No more than approximately 2.5% of the half with IQs of 70 will be misjudged, but about 41% of the half with IQs of 71 will be deemed disabled (if there is sufficient evidence of deficient adaptive functioning) even though not one of them truly qualifies for the exemption.158

Calculations of this sort allow us to gauge the performance of any decision rule—its operating characteristics—conditional on the assumed values for the true scores of the offenders. But they cannot tell us the probability that is of interest to the law—the chance that an offender has a true score of 70 or less—or even the proportion of those who are expected to have such low true scores given a defendant’s observed score. With information on the distribution of true IQ scores among convicted capital offenders, however, it is feasible to estimate the latter quantities. Suppose a jurisdiction has recorded the IQ scores of convicted capital defendants over the past decade and that these scores are approximately normally distributed with mean 70 and SD 8. This distribution also describes the distribution of true scores. Some defendants’ true scores will be above their measured ones, but others will be below. Such differences will wash out, and the overall picture of true scores will mirror that of the measured ones. It will be centred at 70 and spread out so as to have most scores between 65 and 75.159

If this historical pattern applies to the next defendant, we should be suspicious of a measured score x like 80 (or 60) that is in a tail of the prior distribution of true scores τ. If we knew nothing of the past, our best guess for the defendant’s true score would be the measured one,160 but we ought to factor in the reality that low and high true scores are the exception and hence are less likely to be the explanation for the extreme score. A better estimate for this defendant’s true score τ would be closer to the middle of the pack of true scores—not at the very middle, but somewhere between the observed score x and the average of the previously encountered scores. Moreover, because we are working with more information than just the one newly observed score x, we can have more confidence in the blended estimate than in either the prior mean or the new observation as an estimate of τ. Thus, the interval estimate for the true score can be narrower than the classical CI.

The mathematical recipe for blending the prior distribution of true scores with a newly observed score is known as Bayes’ rule.161 Applying it to a normal prior distribution and an observed score with measurement error yields a ‘Bayesian credible region’162 (BCR) that has the properties noted above—integrating the prior information shifts the estimate towards the historical mean and produces a narrower interval.163 BCRs could be used in the same way as CIs to determine whether the credible range of true scores covers a true score cut-off such as 70. Or, more directly, one can look at the posterior true score distribution and easily compute whether the portion below this cut-off is sufficiently small to reject the claim of a disability.164 By way of illustration, if the empirical prior distribution were normal with mean 70 and SD 8, then individuals with a measured score of x ≥ 73 would have no more than a 10% risk of having a true score τ ≤ 70.165 If 10% is an ‘unacceptable risk’, a higher cut-score for measured IQs would have to be selected. A state willing to tolerate only a 1% risk would use a cut-score of at least 76 in this particular illustration.166 With modern computational methods, Bayes’ rule can be applied to almost any prior distribution, not just the normal ones mentioned here. This form of inference is well established in statistics,167 psychometrics,168 and social science.169 The Supreme Court once characterized it as a ‘more precise method’ than classical hypothesis testing.170 The approach would enable a state to attain a probability of, say, 95% for a correct decision with respect to the true score cut-off of 70, as the majority in Hall seemed to desire or demand.171
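For a normal prior and normal measurement error, Bayes’ rule has a closed form, and the illustration can be sketched in a few lines. The prior (mean 70, SD 8) comes from the text; the SEM of 2.16 is an assumption carried over from the earlier examples and may not be the value behind the article’s own figures. The function names are hypothetical.

```python
from statistics import NormalDist

def posterior_true_score(x, sem, prior_mean, prior_sd):
    """Normal posterior for the true score, given one observed score x."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / sem ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * x)
    return NormalDist(mu=post_mean, sigma=post_var ** 0.5)

def risk_at_or_below(x, cutoff=70.0, sem=2.16, prior_mean=70.0, prior_sd=8.0):
    """Posterior probability that the true score is at or below the cutoff."""
    return posterior_true_score(x, sem, prior_mean, prior_sd).cdf(cutoff)

def smallest_cut_score(max_risk, candidates=range(60, 101)):
    """Lowest whole-number observed score whose posterior risk is within max_risk."""
    for x in candidates:  # the risk falls as x rises, so the first hit is the lowest
        if risk_at_or_below(x) <= max_risk:
            return x
    return None

cut_10_percent = smallest_cut_score(0.10)
cut_1_percent = smallest_cut_score(0.01)
```

With these assumed inputs, the sketch returns cut-scores of 73 and 76 for tolerated risks of 10% and 1%, in line with the illustration. The posterior SD is also smaller than the SEM, which reflects the narrowing of the interval that the prior information provides.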

6. Summary and conclusion

The Hall Court holds that (1) a state can categorically deny the intellectual disability exemption to individuals whose measured IQ scores are above some cut-score, and (2) this lowest permissible cut-score must be greater than 70 (on a test normed to have a mean of 100 and a SD of 15). Justice Kennedy’s majority opinion further suggests that in choosing a cut-score, a state must attend in some manner to SEM, but the opinion does little more than gesture to publications on psychometrics, IQ scores, and intellectual disability that emphasize the importance of recognizing imprecision in IQ scores. To provide more specific guidance, this article has presented four procedures for coping with the one aspect of ‘measurement error’172—namely, random errors in measuring IQ scores—that the Court invoked to invalidate Florida’s ‘rigid rule’.173 The four procedures flow from the fundamental distinction between a test-taker’s unknown true score and a measured score. They all seek to control the ‘unacceptable risk’174 that an individual with a measured score has a true score that is less than or equal to the maximum true score consistent with a diagnosis of disability. The first three procedures, derived from classical test theory, accomplish this goal indirectly, if at all. The fourth attends directly to the most pertinent probability.

The first procedure (SEM-IS) is the one described in the two opinions. It uses an overall SEM derived from the estimated ‘reliability’ for a specific test and population of test-takers to construct an approximate 95% (or other) CI. If the entire CI lies above the targeted true score, the state can deny the Atkins exemption without regard to deficits in adaptive functioning. As the dissent notes, the procedure can be applied with any desired confidence coefficient, although the desired level of ‘confidence’ does not have the meaning the dissent (and probably the majority) ascribed to it.

Secondly, I described an equivalent hypothesis-testing procedure that defines the state’s cut-score for the measured IQ as a legislatively targeted true IQ score (such as 70) plus some number of SEMs. Using this SEM-adjusted score (SEM-AS) to reject claims of disability, this procedure tests whether an observed score is statistically significantly larger than 70. It keeps the risk of categorically rejecting a disability claim from a defendant with a true IQ score of 70 or less to a predetermined maximum. This is exactly what the SEM-IS rule does. But without data on the distribution of true scores among defendants, it is impossible to say how well these procedures limit the risk that individuals with measured scores greater than 70 have true scores at or below 70.175
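The equivalence of the two rules can be checked numerically. The sketch below assumes normal measurement error, an SEM of 2.6 points, and a one-sided maximum risk of 2.5% (the level that corresponds to a two-sided 95% CI whose lower limit must exceed the target). All names and values are hypothetical.

```python
from statistics import NormalDist


def sem_adjusted_cut_score(target, sem, max_risk=0.025):
    """SEM-AS: targeted true score plus z SEMs, with z fixed by the tolerated risk."""
    z = NormalDist().inv_cdf(1.0 - max_risk)
    return target + z * sem


def rejects_by_sem_as(x, target, sem, max_risk=0.025):
    """Reject the claim when the measured score exceeds the adjusted cut-score."""
    return x > sem_adjusted_cut_score(target, sem, max_risk)


def rejects_by_sem_is(x, target, sem, max_risk=0.025):
    """Reject the claim when the lower limit of the interval exceeds the target."""
    z = NormalDist().inv_cdf(1.0 - max_risk)
    return x - z * sem > target


def rejection_probability(true_score, target, sem, max_risk=0.025):
    """Chance that a person with this true score produces a score above the cut-score."""
    cut = sem_adjusted_cut_score(target, sem, max_risk)
    return 1.0 - NormalDist(true_score, sem).cdf(cut)


SEM = 2.6  # assumed value for illustration
same = all(rejects_by_sem_as(x, 70, SEM) == rejects_by_sem_is(x, 70, SEM)
           for x in range(60, 91))
print(same)
print(rejection_probability(70, 70, SEM))   # the predetermined maximum, about 0.025
print(rejection_probability(65, 70, SEM))   # lower for true scores further below 70
```

The rejection probability is conditioned on the true score. It says nothing about the proportion of defendants with measured scores above the cut-score whose true scores are 70 or less, which depends on the distribution of true scores.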

The third procedure uses an estimator for true IQ scores based on the distribution of those scores for all individuals with a defendant’s test score. Instead of the Court’s SEM, it uses a SEE that is larger for scores that are far from the population mean. The SEE-IS procedure also shrinks the estimated true score towards the mean.176
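The shrinkage step can be illustrated with Kelley's regressed-score formula, a familiar classical estimator of this type. Treating it as the estimator meant here is an assumption, as are the reliability of 0.97 and the population mean of 100. The sketch does not attempt to compute the SEE.

```python
def regressed_true_score(x, population_mean=100.0, reliability=0.97):
    """Kelley-style estimate: pull the measured score x towards the population mean.

    The less reliable the test, the stronger the pull.
    """
    return population_mean + reliability * (x - population_mean)


for measured in (65, 71, 100, 130):
    print(measured, round(regressed_true_score(measured), 2))
```

Under these assumed values, a measured score of 71 corresponds to an estimated true score of about 71.9: scores below the mean are adjusted upward and scores above it downward.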

The fourth and final approach to handling measurement error goes beyond the classical test theory on which the first three were based. It would permit a state to use recent data on the IQ scores of convicted capital offenders together with test-reliability data to estimate the probability that a defendant with a measured IQ has a true score in any desired interval. The BCR for the highest posterior density could be inspected to see if it contains any true IQ scores that would permit the defendant’s claim of disability to proceed. The meaning of the region would be clear because, unlike the frequentist CI, the BCR does describe probabilities for true scores. Alternatively, the probability of a true score in the qualifying range could be computed directly from the posterior distribution of true scores. This quantity is what Justice Alito thought he was computing. Although it is the quantity of the most legal interest, it cannot be derived from the CI.
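For a normal posterior distribution, the highest-posterior-density region is simply the posterior mean plus or minus a multiple of the posterior SD. The sketch below reuses the illustrative normal prior (mean 70, SD 8) from the Bayesian discussion and assumes an SEM of 2.16 points. The SEM and the function names are assumptions made for illustration.

```python
from statistics import NormalDist


def posterior(x, prior_mean=70.0, prior_sd=8.0, sem=2.16):
    """Posterior for the true score: normal prior combined with normal measurement error."""
    prior_var, err_var = prior_sd ** 2, sem ** 2
    weight = prior_var / (prior_var + err_var)
    return NormalDist(prior_mean + weight * (x - prior_mean), (weight * err_var) ** 0.5)


def credible_region(x, level=0.95, **kw):
    """Highest-posterior-density region; symmetric because the posterior is normal."""
    post = posterior(x, **kw)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return post.mean - z * post.stdev, post.mean + z * post.stdev


def region_reaches_cutoff(x, cutoff=70.0, level=0.95, **kw):
    """True when the region still contains true scores at or below the cutoff."""
    lower, _ = credible_region(x, level, **kw)
    return lower <= cutoff


for measured in (73, 76):
    print(measured, credible_region(measured), region_reaches_cutoff(measured))
```

With these assumed inputs, the 95% region for a measured score of 73 still reaches below 70, while the region for a measured score of 76 does not. Unlike a CI, the region can be read as a statement about the probability of the true score, given the prior and the measurement model.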

In short, Hall requires legislatures and courts to take heed of the statistical principles and methods that should guide the interpretation of IQ scores by psychologists and psychiatrists. The essence of the case is the Court’s refusal to countenance ‘an unacceptable risk that persons with intellectual disability will be executed’.177 Understandably, the opinions do not strive to be comprehensive in their discussions of how to achieve this result. Unfortunately, to the extent that the Justices do articulate details of statistical procedures for quantifying the uncertainty of psychological measurements, the dicta are not always dependable.178 By offering some corrections and elaborations, and by demonstrating that there is more than one statistically acceptable way to try to control the risk of error, this article lays the groundwork for more informed use of the statistical properties of IQ scores in adjudicating claims of intellectual disability.

1

134 S.Ct. 1986 (2014).

2

Chief Justice Roberts and Justices Scalia and Thomas joined Justice Alito’s opinion. Id. at 2001.

3

Justices Breyer, Ginsburg, Sotomayor, and Kagan joined Justice Kennedy’s opinion.

4

Id. at 2003, 2010 (Alito, J., dissenting). But see id. at 2011 (‘it is unclear to me’ which rule the Court adopts).

5

Id. at 2010–11.

6

Id. at 2011.

7

Id.

8

Id. at 2009.

9

Id. at 2010.

10

Most commentary stops with the recognition that ‘standardized measures will always be imprecise’, Susan Unok Marks, Courts’ Elusive Search for the Meaning of Intellectual Disability for Evaluating Atkins Claims, 26 U. Fla. J.L. & Pub. Pol'y 347 (2015), or demands that ‘any IQ score be reported with an associated confidence interval’. Robert M. Sanger, IQ, Intelligence Tests, “Ethnic Adjustments” and Atkins, 65 Am. U. L. Rev. 87 (2015). But see David H. Kaye, ‘Unhinged from Legal Logic: Hall v. Florida’s Confidence Intervals and the Burden of Persuasion (15 August 2015) (unpublished manuscript).

11

See Timothy R. Saviello, The Appropriate Standard of Proof for Determining Intellectual Disability in Capital Cases: How High Is Too High?, 20 Berkeley J. Crim. L. 163 (2015) (arguing that Hall indicates that the state cannot impose a greater burden on the defendant than proving intellectual disability by a preponderance of the evidence); Eighth Amendment—Cruel and Unusual Punishments—Defendants with Intellectual Disability—Hall v. Florida, 128 Harv. L. Rev. 271, 280 (2014) (suggesting that Hall invites attacks on all death penalty procedures that are legislative outliers; on the significance of various types of legislative outliers in constitutional litigation more generally, see Justin Driver, Constitutional Outliers, 81 U. Chi. L. Rev. 929 (2014)); Lise E. Rahdert, Hall v. Florida and Ending the Death Penalty for Severely Mentally Ill Defendants, 124 Yale L.J. Forum 34 (2014) (proposing that the case can readily be extended to prohibit executions of every criminal suffering from a major mental illness).

12

Christopher Slobogin, Scientizing Culpability: The Implications of Hall v. Florida and the Possibility of a “Scientific Stare Decisis,” 23 Wm. & Mary Bill Rts. J. 415, 425 (2014).

13

Hall, 134 S.Ct. at 1995.

14

Id.

15

Scott E. Sundby, The True Legacy of Atkins and Roper: The Unreliability Principle, Mentally Ill Defendants, and the Death Penalty's Unraveling, 23 Wm. & Mary Bill Rts. J. 487, 493 (2014).

16

Penry v. Lynaugh, 492 U.S. 302, 307 (1989).

17

Hall v. Florida, 134 S.Ct. 1986, 1992 (2014). Earlier IQ scores as low as 60 were ruled inadmissible. Id.

18

492 U.S. 302 (1989), overruled in part, Atkins v. Virginia, 536 U.S. 304 (2002).

19

A majority of the Court held that the issues as framed in a special verdict for the jury did not allow proper consideration of Penry’s intellectual disability. Justice Scalia’s opinion, joined by Chief Justice Rehnquist and Justices White and Kennedy, dissented from this conclusion. Penry, 492 U.S. at 353–60 (dissenting and concurring opinion).

20

For many years, ‘mental retardation’ was the phrase that the legal, medical, and psychological communities used to denote certain deficits in cognitive functioning. See Am. Psychiatric Ass'n, Diagnostic and Statistical Manual of Mental Disorders 33 (5th ed. 2013) [hereinafter DSM-5]; Robert L. Schalock et al., The Renaming of Mental Retardation: Understanding the Change to the Term Intellectual Disability, 45 Intell. & Developmental Disabilities 116, 116 (2007). This essay generally uses what one court has called ‘the more politically correct phrase “intellectual disability” ’. Commonwealth v. Hackett, 99 A.3d 11, 14 n. 5 (Pa. 2014). In some instances where it is historically appropriate, however, I use the earlier terminology. Cf. Patrick McDonagh, Idiocy: A Cultural History 5 (2008) (‘[a]nyone wanting to understand the history of the idea of intellectual disability and its various genealogical precursors, such as idiocy, must contend with the slipperiness of the key terms’).

21

Penry, 492 U.S. at 335.

22

Id. at 343 (Brennan & Marshall, JJ., concurring and dissenting).

23

Justices Brennan and Marshall contended that the mentally retarded ‘inevitably lack the cognitive, volitional, and moral capacity to act with the degree of culpability associated with the death penalty’, id. at 343–44, and that ‘killing mentally retarded offenders does not measurably further the penal goals of either retribution or deterrence’. Id. at 348. Justices Stevens and Blackmun apparently agreed. See id. at 350. Justice O’Connor did not. Id. at 338. The remaining four Justices dismissed the entire inquiry as misguided. In their view, ‘if an objective examination of laws and jury determinations fails to demonstrate society's disapproval of it, the punishment is not unconstitutional even if out of accord with the theories of penology favored by the Justices of this Court’. Id. at 351.

A different criticism of Penry is that the Court failed to appreciate that the English and colonial common law protected not ‘only those who were “profoundly or severely retarded”, [but also many of] those who were moderately or mildly mentally retarded’. Michael Clemente, Note, A Reassessment of Common Law Protections for “Idiots”, 124 Yale L.J. 2746, 2751 (2015).

24

536 U.S. 304 (2002).

25

Id. at 310.

26

Id. at 316–17.

27

Id. at 316. A dissenting opinion of Justice Scalia, Chief Justice Rehnquist, and Justice Thomas found this conclusion ‘miraculously extract[ed]’ from ‘embarrassingly feeble evidence’. Id. at 342, 344.

28

Id. at 319–20. However, whereas the Penry dissenters offered this as a sufficient basis for a determination of cruel and unusual punishment, the Atkins majority proffered it only insofar as ‘independent evaluation of the issue reveals no reason to disagree with the judgment of the legislatures that have recently addressed the matter and concluded that death is not a suitable punishment for a mentally retarded criminal’. Id. at 320 (internal quotation marks omitted). Justice Scalia responded that the ‘discussion … does not bear analysis’. Id. at 350.

29

Id. at 320–21 (internal quotation marks and citation omitted). Justice Scalia scorned ‘this unsupported claim’ and its ‘pretty flabby language’. Id. at 352. For a sympathetic exploration of the radical implications of this ‘unreliability principle’ as a basis for holding capital sentences to be cruel and unusual, see Sundby, supra note 15.

30

Id. at 317.

31

Id. (quoting Ford v. Wainwright, 477 U.S. 399, 405, 416–17 (1986)).

32

Id. at 317 n. 22 (‘The statutory definitions of mental retardation are not identical, but generally conform to the clinical definitions …’.).

33

Id. at 308 n. 3 (quoting American Association on Mental Retardation, Mental Retardation: Definition, Classification, and Systems of Supports 5 (9th ed. 1992)).

34

Id. (quoting DSM-4, at 42–43).

35

Id. at 316.

36

Id.

37

See id. at 314.

38

Hall, 134 S.Ct. at 1996. Kentucky’s statute defined ‘significantly subaverage general intellectual functioning … as an intelligence quotient (I.Q.) of seventy (70) or below’. Ky. Rev. Stat. § 532.130(2) (upheld in Bowling v. Commonwealth, 163 S.W. 3d 361 (Ky. 2005)). Virginia required ‘performance on a standardized measure of intellectual functioning administered in conformity with accepted professional practice, that is at least two standard deviations below the mean’. Va. Code Ann. § 19.2–264.3:1.1.

39

Id. at 1990. Like Virginia, Florida defined ‘significantly subaverage general intellectual functioning’ as ‘performance that is two or more standard deviations from the mean score on a standardized intelligence test’. Fla. Stat. § 921.137(1) (2002) (interpreted in Cherry v. State, 959 So.2d 702, 712–713 (Fla. 2007), to denote a score of 70 on an IQ test with a mean of 100 and a SD of 15).

40

Id. at 1990–91; Hall v. Florida, 614 So.2d 473, 479, 479–80 (1993) (dissenting opinion).

41

Hall, 134 S.Ct. at 1992.

42

Id. at 1998–99.

43

Id. at 1999.

44

See id. at 2001 (dissenting opinion); see also Slobogin, supra note 12, at 416.

45

Trop v. Dulles, 356 U.S. 86, 101 (1958).

46

Hall, 134 S.Ct. at 2002 (Alito, J., dissenting).

47

Justice Kennedy explained that ‘[i]n addition to the views of the States and the Court's precedent, [our] determination is informed by the views of medical experts. These views do not dictate the Court's decision, yet the Court does not disregard these informed assessments.’ Id. at 2000.

48

Id. (‘the professional community's teachings are of particular help in this case, where no alternative definition of intellectual disability is presented and where this Court and the States have placed substantial reliance on the expertise of the medical profession.’) (emphasis added).

49

See id. at 2000 (relying on ‘the unanimous professional consensus’).

50

Id. at 1995.

51

Id.

52

Id. at 1996–97.

53

Id. at 1999.

54

Section 4 provides the technical definition.

55

See Hall, 134 S.Ct. at 2000 (‘By failing to take into account the SEM and setting a strict cutoff at 70, Florida goes against the unanimous professional consensus.’) (internal quotation marks omitted). Whether all mental health professionals agree that the cutoff should be higher than 70 is doubtful. See Bryan H. King et al., Mental Retardation, in 2 Kaplan and Sadock’s Comprehensive Textbook of Psychiatry 2587, 2591 (Benjamin J. Sadock & Virginia A. Sadock eds., 7th ed. 2000) (referring to extending ‘the I.Q. criterion from “70 and below” to “70 or 75 and below” ’ as having been ‘hotly debated’).

56

Some commentary conveys a different impression. For example, Sanger, supra note 10, at 93 (attributing to the Court the ‘conclusion that rigid reliance on IQ scores should not deprive people facing the death penalty of a chance to illustrate that their execution is unconstitutional’); Timothy R. Saviello, The Appropriate Standard of Proof for Determining Intellectual Disability in Capital Cases: How High Is Too High?, 20 Berkeley J. Crim. L. 163, 222 (2015) (reading Hall as resting on the premise that ‘having a fixed IQ cutoff makes the IQ score the single criteria for determining intellectual disability, and thus prevents consideration of other evidence that mental health professionals require prior to reaching a decision on intellectual disability’); Bryant Buechele, Note, Psychology's Role in Law: A Discussion of How the Supreme Court Views the Role of the DSM-V in Hall v. Florida, 68 SMU L. Rev. 275, 277 (2015) (because ‘IQ score and adaptive functioning, by themselves are not capable of determining whether an individual is intellectually disabled, … Hall found that a Florida statute that effectively did not account for factors beyond the somewhat faulty IQ test was unconstitutionally limited in scope’.).

57

On the relative performance of statistical methods and clinical judgement more generally, see R.M. Dawes et al., Clinical Versus Actuarial Judgment, 243 Science 1668 (1989) (‘Research comparing these two approaches shows the actuarial method to be superior.’); William M. Grove & Paul E. Meehl, Comparative Efficiency of Informal (Subjective, Impressionistic) and Formal (Mechanical, Algorithmic) Prediction Procedures: The Clinical–statistical Controversy, 2 Psych., Pub. Pol’y & L. 293 (1996) (‘Empirical comparisons of the accuracy of the two methods (136 studies over a wide range of predictands) show that the mechanical method is almost invariably equal to or superior to the clinical method.’); Konstantinos V. Katsikopoulos et al., From Meehl to Fast and Frugal Heuristics (and Back): New Insights into How to Bridge the Clinical—Actuarial Divide, 18 Theory & Psych. 443 (2008); Steven Schwartz & Timothy Griffin, Medical Thinking: The Psychology of Medical Judgment and Decision Making (2012).

58

Id. at 2009 (Alito, J., dissenting). The phrase appears in Atkins, 536 U.S. at 308 n. 3. The Atkins Court took it from APA and AAMR publications. Id.

59

Justice Alito asserted that in the latest edition of its Diagnostic and Statistical Manual (the DSM-5), ‘the APA discards “significantly subaverage intellectual functioning” as an element of the intellectual disability test. Elevating the APA's current views to constitutional significance therefore throws into question the basic approach that Atkins approved and that most of the States have followed.’ Examination of the references in an accompanying footnote does not support this assertion. Subaverage intellectual functioning, as measured by low IQ scores, remains the first diagnostic criterion. See David H. Kaye, Quarreling and Quibbling over Psychometrics in Hall v. Florida (part 3) (4 June 2014), Forensic Sci., Stat. & L. http://for-sci-law-now.blogspot.com/2014/06/quarreling-and-quibbling-over_4.html (concluding that Justice Alito’s ‘claim that the [DSM-V] “dramatically illustrate[s a] fundamental[] alter[ation in] … the longstanding … definition of intellectual disability” seems, well, melodramatic’); APA, DSM-5 Intellectual Disability Fact Sheet, at 2 (2013), available at http://www.dsm5.org/Documents/Intellectual%20Disability%20Fact%20Sheet.pdf (‘In DSM-5, intellectual disability is considered to be approximately two standard deviations or more below the population, which equals an IQ score of about 70 or below.’).

60

The APA’s DSM-5 demands ‘[d]eficits in intellectual functions … confirmed by both clinical assessment and individualized, standardized intelligence testing.’ DSM-5, supra note 20 (Diagnostic Criterion A) (emphasis added). Likewise, the American Association on Intellectual and Developmental Disabilities (AAIDD) defines intellectual disability in terms of ‘significant limitations both in intellectual functioning and adaptive behaviour’. AAIDD, Intellectual Disability: Definition, Classification, and Systems of Supports (11th ed. 2010) (as quoted in Chris Hatton, Intellectual Disabilities—Classification, Epidemiology and Causes, in Clinical Psychology and People with Intellectual Disabilities 1, 4 (Eric Emerson et al. eds., 2d ed. 2012)) (emphasis added). The organization then defines ‘intellectual functioning’ as ‘[a]n IQ score that is approximately two standard deviations below the mean, considering the standard error of measurement for the specific assessment instruments used and the instruments' strengths and limitations.’ Id.

61

The more extreme and debilitating levels of disability are the product of known organic causes and occur more frequently than would be predicted from the normal curve. The severe cases elevate and fatten the tails of the distribution of scores. The much larger fraction of cases (perhaps 90%) probably results from interactions between quantitative trait loci that influence intellectual development in combination with environmental conditions. These cases would be expected to be normally distributed in the population. King et al., supra note 55, at 2592.

62

Many of the diagnostic categories in the latest version of the Diagnostic and Statistical Manual of Mental Disorders have been questioned. The DSM-5 ‘was published in May 2013 amid a storm of controversy and bitter criticism’. News Analysis: Controversial Mental Health Guide DSM-5, NHS Choices (15 August 2013) http://www.nhs.uk/news/2013/08august/pages/controversy-mental-health-diagnosis-and-treatment-dsm5.aspx. In general, critics maintain that ‘D.S.M.’s diagnostic categories lacked validity, that they were not “based on any objective measures”, and that, “unlike our definitions of ischemic heart disease, lymphoma or AIDS”, which are grounded in biology, they were nothing more than constructs put together by committees of experts.’ Gary Greenberg, The Rats of N.I.M.H., New Yorker, 16 May 2013, available at http://www.newyorker.com/tech/elements/the-rats-of-n-i-m-h (quoting Thomas Insel, the director of the National Institute of Mental Health).

63

Daniel J. Reschly, Assessing Mild Intellectual Disability: Issues and Best Practice, in The Oxford Handbook of Child Psychological Assessment 683, 687 (Donald H. Saklofske et al. eds., 2013). For a synopsis of the history of definitions, see Committee on Disability Determination for Mental Retardation, Nat’l Research Council, Mental Retardation: Determining Eligibility for Social Security Benefits 22–24 (2002).

64

Reschly, supra note 63, at 687.

65

The more liberal criterion of one SD ‘markedly influenced ID criteria in schools and was the subject of much criticism in the courts and by researchers as being too inclusive and stigmatizing excessive numbers of persons’. Id. (citations omitted); NRC Committee, supra note 63, at 24 (noting that a significant factor in reducing ‘the upper criterion of scores on intelligence measures from 85 to 70’ in 1973 was ‘concern about the inappropriate overidentification of minority students as mentally retarded’); cf. King et al., supra note 55, at 2591 (noting the concern that raising the cut-score from 70 points (–2 SD) to 75 ‘will increase the size of the population with mental retardation, including increases in the overrepresentation of several minority groups’).

66

See Peter H. Westfall & Kevin S.S. Henning, Understanding Advanced Statistical Methods 272 (2013) (‘Natural processes everywhere are well modeled by the normal distribution. It’s not just a figment of the imagination of a few deranged statisticians.’).

67

On why this might be so, see Aidan Lyon, Why Are Normal Distributions Normal?, 65 British J. Phil. Sci. 621 (2014).

68

Cherry v. State, 959 So.2d 702, 712 (Fla. 2007); cf. Atkins v. Virginia, 536 U.S. 304, 308 (2002) (‘significantly subaverage intellectual functioning’).

69

Hall, 134 S. Ct. at 1994. It would have been more precise to state that the standard deviation indicates not ‘how’, but rather ‘how much’ scores are dispersed in either a population or a sample. A batch of numbers could be highly concentrated around a single value, with outliers on the flanks. Their distribution could be flat, with an equal fraction of the numbers spread out everywhere on the number line. The distribution might show clustering at several locations.

70

The z is obtained by subtracting the mean IQ score μx from the IQ score x and dividing this departure from the mean by the standard deviation of x. In symbols, z = (x − μx) / σx.

71

Establishing that IQ scores are valid in ascertaining who is eligible for capital punishment requires two steps. To begin with, one needs a list of the factors that make intellectual ability relevant to capital sentencing. These factors flow from the two sets of justifications for prohibiting capital punishment of individuals classified as ‘intellectually disabled’. The first set relates to the judgement that execution of the disabled is morally wrong. See Ronald Dworkin, Taking Rights Seriously (1977). In Hall, Justice Kennedy reiterated the Brennan-Marshall view that executing a ‘person with intellectual disability’ constitutes cruel and unusual punishment because it serves ‘[n]o legitimate penological purpose’ in that ‘those with intellectual disability are … likely unable to make … calculated judgments …’. Hall, 134 S. Ct. at 1992–93. The second set of justifications pertains to the ways in which intellectual disability degrades the accuracy of sentencers’ judgments of which offenders truly deserve to die. The Hall majority explained that ‘intellectually disabled … persons face “a special risk of wrongful execution” because they are more likely to give false confessions, are often poor witnesses, and are less able to give meaningful assistance to their counsel’. Id. at 1993. Thus, an ideal test for intellectual disability would measure abilities with regard to deliberation and communication, as well as the tendency to submit to authority.

Hall did not discuss the extent to which IQ scores (and the other parts of the current clinical nosology) match the constitutionally derived personal characteristics. It assumed that the measures of intellectual disability adopted for a variety of other purposes—including ‘education, access to social programs, and medical treatment plans’, Hall, 134 S.Ct. at 1993—also validly advance the penological and trial-process objectives identified in Atkins. The dissent thought that 70 was a constitutionally reasonable line of demarcation under Atkins—even if it no longer conformed precisely to clinical practice. Indeed, the dissent complained that the clinical definition was unstable, id. at 2006 (Alito, J., dissenting), and that the one-size-fits-all definition might not be as valid as one designed for capital punishment purposes. Id. at 2006–07. In the dissent’s view, ‘[p]ractical problems like these call for legislative judgments, not judicial resolution’ (id. at 2007) based on the Diagnostic and Statistical Manual of the day. However, the divergence between the majority and the dissent is not really over ‘whether’ to use IQ scores in ascertaining intellectual disability. It is over ‘how’ states may use them.

72

See King et al., supra note 55, at 2592 (‘Given the Gaussian distribution of I.Q., as many people have an I.Q. between 70 and 74 as have an I.Q. between 0 and 69.’).

73

Of course, at some point, a proposed definition would be unreasonable. Situations like this constantly arise with vague predicates. Dominic Hyde, Sorites Paradox, in The Stanford Encyclopedia of Philosophy (Edward N. Zalta ed. 2014), http://plato.stanford.edu/archives/win2014/entries/sorites-paradox/. They are woven into the very fabric of the law. See Leo Katz, Why the Law Is So Perverse 157 (2011).

74

See supra note 65.

75

Hall, 134 S. Ct. at 1990.

76

AAIDD (as quoted in Hatton, supra note 60, at 4).

77

Hall, 134 S.Ct. at 2006 (Alito, J., dissenting). Justice Alito elaborated that ‘[i]n a death-penalty case, intellectual functioning is important because of its correlation with the ability to understand the gravity of the crime and the purpose of the penalty, as well as the ability to resist a momentary impulse or the influence of others. By contrast, in determining eligibility for social services, adaptive functioning may be much more important.’ Id. at 2006–07 (citations omitted).

78

See supra note 72.

79

Compared to the first-order question of the extent of the deficit in cognitive functioning required for the death penalty exemption to serve the purposes identified in Atkins, the wobbliness in measured scores resembles a perturbation in the orbit of a planet. Because of Neptune’s pull, Uranus was not quite where it should have been (as predicted from its gravitational interaction with only the sun), but astronomers had no trouble locating it before they discovered Neptune. The discrepancies were significant for a different reason. Calculations based on the small perturbations led to the discovery of Neptune. A. Pannekoek, The Discovery of Neptune, 3 Centaurus 126 (1953).

80

Hall, 134 S. Ct. at 1995.

81

APA, Diagnostic and Statistical Manual of Mental Disorders 37 (5th ed. 2013).

82

Hall, 134 S. Ct. at 2010 (Alito, J., dissenting).

83

Id. at 2011.

84

The definition of ‘true score’ as a long-term average for a test-taker on a particular test does not imply that the unobserved (latent) true score actually measures the construct that the test is designed to measure. John C. Willse, Classical Test Theory, in 1 Encyclopedia of Research Design 149 (Neil J. Salkind ed. 2010). Validation of the test as a measure of ‘intelligence’ or any other construct is a separate exercise. Id. at 152.

85

The mean of 0 makes the observed score x an unbiased estimator of the true score—sometimes, it will underestimate the true score; other times, it will overestimate τ; in the long run, these errors will average out to zero. It also is a maximum likelihood estimator; no other value for τ has a higher probability of producing the measured value x. Lloyd Rosenberg, Statistical Reasoning 18–23 (1971).

86

Hall, 134 S.Ct. at 1995.

87

Won-Chan Lee et al., Interval Estimation for True Raw and Scale Scores under the Binomial Error Model, 31 J. Educ. & Behav. Stat. 261, 261 (2006) (‘The traditional definition of SEM (i.e., same for all examinees) is sometimes called the overall SEM in the sense that it is an average SEM for all examinees in the population.’).

88

A study finding major changes is Sue Ramsden et al., Verbal and Non-verbal Intelligence Changes in the Teenage Brain, 479 Nature 113 (2011).

89

See e.g. Michael Furr & Verne R. Bacharach, Psychometrics: An Introduction 119 (2d ed. 2014); David M. Lane, Measurement, in Online Statistics Education: A Multimedia Course of Study (David M. Lane ed.), http://onlinestatbook.com/2/research_design/measurement.html.

90

See Marley W. Watkins & Lourdes G. Smith, Long-Term Stability of the Wechsler Intelligence Scale for Children—Fourth Edition, 25 Psychol. Assessment 477, 480 (2013).

91

See I. C. McManus, The Misinterpretation of the Standard Error of Measurement in Medical Education: A Primer on the Problems, Pitfalls and Peculiarities of the Three Different Standard Errors of Measurement, 34 Med. Teacher 569, 571 (2012):

Conceptually, test reliability is typically considered as how a large group of candidates would perform if the identical assessments were taken on two separate occasions (the test–retest correlation). Most high-stakes assessments, though, are taken but once, and instead there are several ways in which reliability can be calculated from the internal structure of the test. The split-half correlation is analogous to test–retest correlation, comparing the performance of candidates on, say, odd-numbered items and even-numbered items. Cronbach’s alpha, which statisticians prefer, is a generalisation of the split-half correlation across all possible ways of dividing a test.

92

The WAIS-IV test has a reliability of 0.97–0.98 based on internal consistency, although ‘[t]hese should be considered best-case estimates because they do not consider other major sources of error such as long-term temporal stability, administration errors, or scoring errors.’ Gary L. Canivez, Review of Wechsler Adult Intelligence Scale—Fourth Edition, in The Eighteenth Mental Measurements Yearbook (Robert A. Spies et al. eds. 2010). By some internal measures, the reliability of full IQ scores on the Stanford–Binet test is between 0.91 and 0.98. Sherry K. Bain & Jessica D. Allin, Book Review: Stanford-Binet Intelligence Scales: Fifth Edition, 23 J. Psychoeducational Assessment 87, 90 (2005).

93

See e.g. Lane, supra note 89 (‘If a test were given in two populations for which the variance of the true scores differed, the reliability of the test would be higher in the population with the higher true-score variance. Therefore, reliability is not a property of a test per se … .’).

94

See e.g. McManus, supra note 91, at 571.

95

Hall, 134 S.Ct. at 2009 (Alito, J., dissenting) (quoting User’s Guide to Accompany AAIDD 11th ed.: Definition, Classification, and Systems of Supports 22 (2012)).

96

See generally David L. Faigman et al., Group to Individual (G2i) Inference in Scientific Expert Testimony, 81 U. Chi. L. Rev. 417 (2014); Nicholas Scurich & Richard S. John, A Bayesian Approach to the Group Versus Individual Prediction Controversy in Actuarial Risk Assessment, 36 Law & Hum. Behav. 237 (2012).

97

The dissent suggested that an SEM tailored to the individual should be used. Hall, 134 S.Ct. at 2010 (Alito, J., dissenting) (referring to ‘the SEM for a particular test and a particular test-taker’).

98

Hall, 134 S.Ct. at 1995. Going into greater detail, an amicus brief from the American Psychological Association and other professional organizations repeatedly cited by the Court clumsily stated that ‘[t]he SEM … is … used to calculate confidence intervals. Thus, a full scale IQ “score of 70 is most accurately understood not as a precise score but as a range of confidence with parameters of at least one standard error of measurement” ’. Brief for American Psychological Association et al. as Amici Curiae in Support of Petitioner, Hall v. Florida, No. 12-10882, at 23 (quoting AAIDD Manual at 24). This phrasing is statistically inept. Certainly, a score of 70 is not precise—the true score could be higher or lower—but how much higher or lower is not ‘a range of confidence’ accompanied by ‘parameters’. No ‘parameters’ of a statistical model accompany an interval estimate of ± k SEM around the measured score. The confidence coefficient is a single number that defines k. There is no fundamental reason to specify ‘at least one SEM’ (k ≥ 1) as the relevant range.

99

Id. (quoting Brief for American Psychological Association et al., supra note 98). Calling the SEM a ‘unit of measurement’ is confusing—an IQ test’s unit of measurement is an IQ point. Of course, the fact that IQ scores can be transformed into other units such as numbers of SEMs from the mean permits the transformed score to be used instead of the reported IQ scores. In that sense, any transformation that can be inverted could be said to define a new unit of measurement.

100

Id. (emphasis added).

101

Hall, 134 S.Ct. at 2010 (Alito, J., dissenting).

102

A trivial but obvious correction (to readers familiar with the normal distribution) is that an interval of ± 1 SD covers 68%, not 66%, of the area under the curve.

103

Furr & Bacharach, supra note 89, at 169–70.

104

Hall, 134 S.Ct. at 2010 (Alito, J., dissenting) (‘the greater the degree of confidence demanded, the greater the range of scores that will fall within the confidence interval’).

105

For example, Morris H. DeGroot & Mark J. Schervish, Probability and Statistics 412 (3d ed. 2002) (‘Because of the distinction between confidence and probability, the meaning and relevance of confidence intervals in statistical practice is a somewhat controversial topic.’); David H. Kaye & David A. Freedman, Reference Guide on Statistics, in Reference Manual on Scientific Evidence 247 (Federal Judicial Center & National Research Council Committee on the Development of the Third Edition of the Reference Manual on Scientific Evidence eds., 3d ed. 2011) (‘the confidence level does not give the probability that the unknown parameter lies within the confidence interval… . According to the frequentist theory of statistics, probability statements cannot be made about population characteristics: Probability statements apply to the behaviour of samples. That is why the different term “confidence” is used’.).

106

For example, DeGroot & Schervish, supra note 105, at 412 (‘It should be emphasized that it is not correct to state that θ [the parameter being estimated] lies in the interval (a, b) with probability γ [the confidence coefficient]’) (emphasis in original); Kaye et al., The New Wigmore on Evidence: Expert Evidence § 12.6.4, at 546–47 (2d ed. 2011) (‘confidence of 95 percent does not necessarily mean that the interval estimate has a 95 percent probability of being correct’.); Kaye & Freedman, supra note 105, at 247.

107

Kaye & Freedman, supra note 105, at 247–48.

108

See Kaye, supra note 10; infra Section 5.2.

109

Hall, 134 S.Ct. at 2011 (Alito, J., dissenting).

110

The exact number is 70.6.

111

The number on the continuous scale is 69.6.

112

Kaye, supra note 10.

113

Hall, 134 S.Ct. at 2011 (Alito, J., dissenting).

114

For example, the AAIDD refers to the requisite indication of a deficiency in ‘[i]ntellectual functioning’ as ‘[a]n IQ score that is approximately two standard deviations below the mean, considering the standard error of measurement for the specific assessment instruments used and the instruments' strengths and limitations.’ AAIDD, supra note 60 (as quoted in Hatton, supra note 60, at 4 (‘Individuals with intellectual disability have scores of approximately two standard deviations or more below the population mean, including a margin for measurement error (generally +5 points). On tests with a standard deviation of 15 and a mean of 100, this involves a score of 65–75 (70 ± 5).’); Committee on Disability Determination for Mental Retardation, supra note 63, at 115 (‘no matter how great the discrepancy between relevant subscales, individuals with total test scores greater than 75 should not be diagnosed as having mental retardation’.).

115

Hall, 134 S.Ct. at 2001.

116

Although it is not obvious how much of the theory of true score estimation the majority really understood, the dissent hardly seems justified in attributing to the Court ‘the mistaken understanding that the SEM for most tests is 5’. Justice Kennedy clearly stated that ‘the average SEM for the WAIS–IV is 2.16 IQ test points and the average SEM for the Stanford–Binet 5 is 2.30 IQ test points …’. Hall, 134 S.Ct. at 1995. Justice Alito seems convinced that using more than one SEM for the necessary ‘margin of error’, as the majority plainly did, somehow contravenes professional standards accepted in Atkins. But there is little basis for that view. According to the dissent:

[T]he Court misreads the authorities on which it relies to establish this cutoff IQ score of 75. It is true that certain professional organizations have advocated a cutoff of 75 and that Atkins cited those organizations' cutoff. See ante, at 12, 20. But the Court overlooks a critical fact: Those organizations endorsed a 75 IQ cutoff based on their express understanding that ‘one standard error of measurement [SEM]’ is ‘three to five points for well-standardized’ IQ tests. AAMR, Mental Retardation 37 (9th ed.1992) (hereinafter AAMR 9th ed.); Atkins, 536 U.S., 309, n. 5 (citing AAMR 9th ed.; 2 Kaplan & Sadock's 2592 (B. Sadock & V. Sadock eds., 7th ed. 2000)); see also AAMR 10th ed. 57; AAIDD 11th ed. 36. In other words, the number 75 was relevant only to the extent that a single SEM was “estimated” to be as high as 5 points. AAMR 9th ed. 37.

Id. at 2010–11. Given the long-established tradition of the 0.05 level of statistical significance and its cognate 95% CI in psychology, it would be surprising if these authorities favoured a ‘margin of error’ of 1 SEM in the 1990s.

117

The expected error rate of 2.5% applies to offenders with true IQ scores of exactly τleg. It ignores errors with respect to offenders with smaller true scores. Some proportion of offenders with true scores below τleg also will have a single IQ test score larger than xmax = τleg + k SEM. They too will be precluded from presenting evidence of adaptive functioning even though they meet the true score criterion for intellectual disability. For example, when the legislative choice of the true score is 70 and the SEM is 2.16 (yielding xmax = 74.23), 0.77% of defendants with true scores of 69 will have a single observed score greater than xmax. Likewise, 0.20% of defendants with true scores of 68 will be falsely deemed ineligible. For a true score of 67, we are down to a 0.04% conditional error rate. As these numbers suggest, we can say that 0.025 is an upper bound on the probability that the hypothesis test will misclassify any individual whose IQ is truly τleg or less as having a true score τ > τleg. The probability of misclassification given an observed score x > τleg is another story. See infra note 122.
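The conditional error rates quoted in this note can be checked with a short sketch. It assumes, as the note does, that a single observed score is normally distributed around the true score with a standard deviation equal to the SEM. The function names and default arguments are illustrative only and do not come from any cited source.

```python
from math import erfc, sqrt


def upper_tail(z):
    """Return P(Z > z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))


def misclassification_rate(true_score, tau_leg=70.0, sem=2.16, k=1.96):
    """Chance that one observed score exceeds x_max = tau_leg + k * SEM.

    Assumes the observed score is normal with mean true_score and SD sem.
    """
    x_max = tau_leg + k * sem
    return upper_tail((x_max - true_score) / sem)


# Conditional error rates (in percent) for true scores at or below tau_leg.
for tau in (70, 69, 68, 67):
    print(tau, f"{100 * misclassification_rate(tau):.2f}%")
```

Because the rate falls as the true score drops below the legislative cut-score, the value at a true score of exactly 70 serves as the upper bound described in the note.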

118

For definitions or explanations of hypothesis testing, see, e.g. Rosenberg, supra note 85, at 34–38; Kaye et al., supra note 106.

119

For a graphical explanation, see Kaye, supra note 10.

120

For a general proof that ‘a coefficient γ confidence set … can be thought of as a set of null hypotheses that would be accepted at significance level 1 – γ’, see DeGroot & Schervish, supra note 105, at 457. In this sense, significance tests underlie confidence intervals. D. R. Cox, Principles of Statistical Inference 40 (2006).

121

The factors of 0.025, 0.0077, 0.0020, and 0.0004 are the areas in the tail of the normal curve up to 70, 71, 72, and 73, respectively. Section 5.2 makes more use of the frequency distribution of true scores.

122

For example, Frank J. Dudek, The Continuing Misinterpretation of the Standard Error of Measurement, 86 Psychol. Bull. 335 (1979); J. C. Nunnally, Psychometric Theory 218 (1978); Elazar J. Pedhazur & Liora Pedhazur Schmelkin, Measurement, Design, and Analysis: An Integrated Approach 111–12 (1991).

123

Richard A. Charter & Leonard S. Feldt, Confidence Intervals for True Scores: Is There a Correct Approach?, 19 J. Psychoeducational Assessment 350, 354–55 (2001); Furr & Bacharach, supra note 89, at 171; McManus, supra note 91, at 572 (using the abbreviation ‘SEest’); Richard B. McHugh, The Interval Estimation of a True Score, 54 Psychol. Bull. 73, 73 n. 1 (1957) (not using an abbreviation for ‘standard error of estimate’).

124

See e.g. Charter & Feldt, supra note 123, at 354; Dudek, supra note 122, at 335.

125

Id.

126

McManus, supra note 91, at 573. See also David H. Kaye, The Disappearance that Wasn't? "Random Variation" in the Number of Women Supreme Court Clerks, 48 Jurimetrics J. 457 (2008) (noting the phenomenon as a possible explanation for a highly publicized shortfall in women hired as clerks one year at the Supreme Court).

127

See David M. Lane, Regression toward the Mean, Online Statistics Education: A Multimedia Course, http://onlinestatbook.com/2/regression/regression_toward_mean.html.

128

G. K. Robinson, That BLUP is a Good Thing: The Estimation of Random Effects, 6 Stat. Sci. 15 (1991), presents as ‘far from new’ the following example:

Suppose that true intelligence quotient (IQ) is normally distributed with mean 100 and standard deviation 15. Two tests are available. Both tests give scores that are normally distributed with mean the true IQ. The first test score has standard deviation 10 given true IQ, while the second test score has standard deviation 5. A person scoring 130 on the first test would be estimated to have a true IQ of 120.8 and a person scoring 130 on the second test would be estimated to have a true IQ of 127. Features of these estimates worth noting are as follows.

• They are shrunk towards the overall mean (100) from the data.

• The amount of shrinkage is greater when the data point is less informative.

• They are biased given true IQ. This is obvious since the raw scores are unbiased and the estimates are nontrivial linear functions of the raw scores.

• They have zero average bias when averaged over the distribution of possible true IQs.

• The expected value of true IQ given the data is equal to the BLUP [best linear unbiased prediction] estimate of IQ … .

Id. at 22.
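The two estimates in Robinson's example can be reproduced with a minimal sketch. It assumes the normal model stated in the quotation, under which the expected true IQ given the observed score is a weighted average of the population mean and the score. The function name and its parameters are hypothetical labels, not Robinson's notation.

```python
def shrunken_estimate(observed, test_sd, pop_mean=100.0, pop_sd=15.0):
    """Expected true IQ given one observed score.

    Assumes true IQ is normal(pop_mean, pop_sd) and the observed score is
    normal around the true IQ with standard deviation test_sd.
    """
    # Weight on the observed score: larger when the test is more precise.
    weight = pop_sd ** 2 / (pop_sd ** 2 + test_sd ** 2)
    return pop_mean + weight * (observed - pop_mean)


print(round(shrunken_estimate(130, test_sd=10), 1))  # less precise test
print(round(shrunken_estimate(130, test_sd=5), 1))   # more precise test
```

The weight on the observed score is 225/325 for the first test and 225/250 for the second, which is why the less informative score is pulled further towards 100.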

129

Different types of shrinkage estimators are ‘ordinary shrinkage, preliminary test (shrinkage), Stein-type, ridge regression, empirical Bayes. estimators, etc.’ Hermanus H. Lemmer, Shrinkage Estimators, in Encyclopedia of Statistical Science (Samuel Kotz & Campbell B. Read eds., 2d ed. 2006).

130

Id. (‘to minimize (maximize) some desirable criterion function (mean square error, quadratic risk, bias, etc.)’).

131

For example, R. W. Farebrother, A Class of Shrinkage Estimators, 40 J. Royal Stat. Soc’y. Series B 47 (1978); Dominique Fourdrinier & Martin T. Wells, On Improved Loss Estimation for Shrinkage Estimators, 27 Stat. Sci. 61 (2012).

132

For example, Frank Harrell, Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis 75 (2d ed. 2015); G. S. Maddala et al., A Comparative Study of Different Shrinkage Estimators for Panel Data Models, 2 Annals Econ. & Finance 1 (2001); Yi-Hau Chen et al., Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies, 104 J. Am. Stat. Ass’n 220 (2009).

133

Richard A. Charter & Leonard S. Feldt, The Importance of Reliability as it Relates to True Score Confidence Intervals, 35 Measurement & Evaluation in Counseling & Development 104, 107 (2002).

134

Hall, 134 S.Ct. at 1990.

135

See e.g. Lee et al., supra note 87.

136

Hall, 134 S.Ct. at 2007; see also id. at 2008 (‘We have been presented with no solid evidence that the longstanding reliance on multiple IQ test scores as a measure of intellectual functioning is so unreasonable or outside the ordinary as to be unconstitutional.’).

137

Id. at 2007 n.9. It declined to consider a score of 69 on another test because the psychologist who administered and scored the test was dead and Hall’s counsel had violated an order ‘to provide the State with the [underlying] testing materials and raw data’. Brief for Respondent 19 n. 11.

138

Amended Order, Florida v. Hall, No. 1978-CF-0052, at ¶¶ 16, 20 (Fla. Cir. Ct. June 8, 2010) (Joint App. at 105, 108).

139

Hall, 134 S.Ct. at 1995.

140

Only if successive scores were completely dependent on one another would the additional scores fail to provide information as to the true score.

141

Hall, 134 S.Ct. at 2007 n. 9 (Alito, J., dissenting) (‘Hall does not allege that any potential “practice effect” skewed his scores.’).

142

The mean of Hall’s IQ scores of 71, 72, 73, and 80 is 74. The Hall Court’s 95% confidence SEM-IS for a single score of 74 for an IQ test with SEM 2.16 goes from 70 to 78. Because it does not lie entirely above 70, Hall is eligible for the disability exemption—the chance that he is disabled (assuming, as the record strongly indicates, that he has deficits in adaptive functioning) is too large under the 95% SEM-IS rule to permit the state to kill him. However, if the four scores are independent and if each test has the same SEM of 2.16, then the standard error ‘of the mean’ is less than 2.16. It is 1/4 of the square root of the sum of the squared SEMs for the four tests, which equals 2.16/2 = 1.08. The 95% confidence interval for the mean is, therefore, 74 ± 1.96 × 1.08, which goes from about 72 to 76. This interval estimate puts Hall outside the IQ range that permits the diagnosis of intellectual disability.
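As a rough check on this arithmetic, the sketch below evaluates the stated formula for the standard error of the mean. It relies on the note's own assumptions that the four scores are independent and share one SEM, in which case the formula reduces to the SEM divided by the square root of the number of tests. The variable names are illustrative.

```python
from math import sqrt

scores = [71, 72, 73, 80]   # Hall's four IQ scores, as listed in the note
sem = 2.16                  # assumed to be the same for every test
z = 1.96                    # multiplier for a two-sided 95% interval

mean_score = sum(scores) / len(scores)

# One quarter of the square root of the sum of the squared SEMs.
se_mean = sqrt(sum(sem ** 2 for _ in scores)) / len(scores)

# Interval that treats the mean as if it were a single test score.
single_score_interval = (mean_score - z * sem, mean_score + z * sem)

# Interval that uses the smaller standard error of the mean.
mean_interval = (mean_score - z * se_mean, mean_score + z * se_mean)

print("mean:", mean_score)
print("standard error of the mean:", round(se_mean, 2))
print("single-score interval:", [round(x, 1) for x in single_score_interval])
print("interval for the mean:", [round(x, 1) for x in mean_interval])
```

The comparison of interest is whether each lower bound falls at or below 70: the single-score interval reaches 70, while the narrower interval for the mean lies entirely above it.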

143

Hall, 134 S.Ct. at 1995.

144

In remanding the case, the Court emphasized that ‘Freddie Lee Hall may or may not be intellectually disabled, but the law requires that he have the opportunity to present evidence of his intellectual disability, including deficits in adaptive functioning over his lifetime.’ Hall, 134 S. Ct. at 2001. It would have been more appropriate to remand for a determination of (1) whether the measurement error associated with all of Hall’s measured scores (taken together) was such that his true score very probably exceeded 70, and if not, (2) whether the scores, combined with the evidence on adaptive functioning, established an intellectual disability.

145

Courts that are not advised of the appropriate statistical analysis may be tempted to reject a claim of intellectual disability if any score is two SEMs above 70. For example, Williams v. Stephens, 761 F.3d 561, 573 (5th Cir. 2014) (‘even with the recognized, five-point standard error of measurement, Williams scored over 70 on two of these [six] tests’.). That approach is no better than accepting the claim if any score interval includes 70.

146

W. Joel Schneider, Principles of Assessment of Aptitude and Achievement, in The Oxford Handbook of Child Psychological Assessment 286 (Donald H. Saklofske et al. eds. 2013).

147

Hall, 134 S.Ct. at 1997.

148

Schneider, supra note 146, at 290.

149

Id. (describing a procedure for ‘treat[ing] each IQ test as a subtest of a much larger “Mega-IQ Test”’.). Another complication, however, is whether later IQ scores should be reduced to account for a population trend towards higher scores over time (the ‘Flynn effect’). See e.g., Ex parte Cathey, 451 S.W.3d 1, 5–6 (Tex. Crim. App. 2014); Nancy Haydt et al., Advantages of DSM-5 in the Diagnosis of Intellectual Disability: Reduced Reliance on IQ Ceilings in Atkins (Death Penalty) Cases, 82 UMKC L. Rev. 359 (2014).

150

Hall, 134 S.Ct. at 1995–96.

151

Emily Young, Intelligence Testing: Accurate or Extremely Biased?, The Neuroethics Blog (Sept. 24, 2013), http://www.theneuroethicsblog.com/2013/09/intelligence-testing-accurate-or.html. The DSM-5 specifies that ‘[i]nstruments must be normed for the individual’s sociocultural background and native language.’ DSM-5, supra note 20 (Diagnostic Features). But see Sanger, supra note 10 (criticizing post-Hall efforts by some experts to infer higher true IQ scores).

152

Hall, 134 S.Ct. at 2011 n. 13 (Alito, J., dissenting).

153

Kaye et al., supra note 106, § 12.6.4; Kaye & Freedman, supra note 105.

154

An alternative to classical test theory for the construction of mental tests is item response theory, which posits a functional relationship between a test-taker’s latent ability and the probability of answering a question correctly. Frank B. Baker, The Basics of Item Response Theory (2d ed. 2001). This article deals solely with the classical test theory that the Hall Court invoked.

155

Larry Wasserman, All of Statistics 175 (2004).

156

Vic Barnett, Comparative Statistical Inference 123 (3d ed. 1999).

157

Wasserman, supra note 155, at 175.

158

The area under the normal curve with mean 71 and SD 2.16 in the region below 70.5 is 0.41.

159

Almost 51% of the area under the normal curve with mean 70 and SD 8 falls into the region from 64.5 to 75.49.
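The normal-curve areas reported in this note and the preceding one can be verified with a brief sketch that uses only the error function. The helper name `normal_cdf` is an illustrative choice, and the means, standard deviations, and limits are simply those stated in the two notes.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """Area under the normal(mean, sd) curve to the left of x."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2))))


# Area below 70.5 for a normal curve with mean 71 and SD 2.16.
area_below = normal_cdf(70.5, mean=71, sd=2.16)

# Area between 64.5 and 75.49 for a normal curve with mean 70 and SD 8.
area_between = normal_cdf(75.49, mean=70, sd=8) - normal_cdf(64.5, mean=70, sd=8)

print(round(area_below, 2))
print(round(area_between, 3))
```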

160

See supra note 86. A frequentist shrinkage adjustment is appropriate when using the SEE rather than the SEM. See supra Section 4.6.

161

See e.g. Barnett, supra note 156, at 201–50; Jeff Gill, Bayesian Methods: A Social and Behavioral Sciences Approach 15–18 (3d ed. 2015). Discussions of Bayes’ rule in the legal literature began with Michael O. Finkelstein & William B. Fairley, A Bayesian Approach to Identification Evidence, 83 Harv. L. Rev. 489 (1970). For a discussion of its judicial acceptance, see Kaye et al., supra note 106, § 12.8.5.

162

Paul H. Garthwaite et al., Statistical Inference 154 (2d ed. 2002). It also is called a ‘highest posterior density interval’. G. A. Young & R. L. Smith, Essentials of Statistical Inference 30 (2005).

163

See e.g. Andrew Gelman et al., Bayesian Data Analysis (2d ed. 2004). For example, with IQ scores, see Kaye, supra note 10.

164

Kaye, supra note 10.

165

Id.

166

Id.

167

For a sampling of the latest generation of textbooks devoted entirely to the subject, see Gelman et al., supra note 163; Gill, supra note 161; Simon Jackman, Bayesian Analysis for the Social Sciences (2009); John K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan 22 (2d ed. 2015). A popular history is Sharon Bertsch McGrayne, The Theory That Would Not Die (2011).

168

See e.g., Melvin R. Novick, Bayesian Methods in Psychological Testing, 1969 ETS Research Bulletin Series 1; Hariharan Swaminathan & Janice A. Gifford, Bayesian Estimation in the Rasch Model, 7 J. Educ. Stat. 175 (1982).

169

See e.g., Gill, supra note 161; Jackman, supra note 167; Simon Jackman, Bayesian Analysis for Political Research, 7 Ann. Rev. Pol. Sci. 483 (2004).

170

Hazelwood Sch. Dist. v. United States, 433 U.S. 299, 312 n. 17 (1977).

171

Id.

172

Hall, 134 S.Ct. at 1994.

173

Id. at 1990, 2001.

174

Id. at 1990.

175

The professional guidelines to which Justice Kennedy deferred use the 95% CI, which is equivalent to the 0.05 two-sided significance level, although the choice is largely conventional.

176

The Court apparently was not aware of the virtues of the SEE, perhaps because its reading of the psychometrics literature was limited to rather basic texts.

177

The opinion opens with two sentences describing the Florida law and its background and the announcement that ‘[t]his rigid rule, the Court now holds, creates an unacceptable risk that persons with intellectual disability will be executed, and thus is unconstitutional.’ 134 S.Ct. at 1990.

178

Hall is not the only case in which the Justices’ efforts to describe statistical inference or to rely on statistical reasoning display deficiencies. See Kaye et al., supra note 106, § 7.3.2(c)(1), at 323–24 (criticizing opinions in Barefoot v. Estelle, 463 U.S. 880 (1983)); David H. Kaye, Trapped in the Matrixx: The U.S. Supreme Court and the Need for Statistical Significance, 39 Prod. Safety & Liab. Rep. (BNA) 1007 (2011) (identifying potentially problematic dicta in Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309 (2011)); David H. Kaye, And Then There Were Twelve: The Supreme Court, Statistical Reasoning, and the Size of the Jury, 68 Cal. L. Rev. 401 (1980) (criticizing Justice Blackmun’s opinion in Ballew v. Georgia, 435 U.S. 223 (1978)); Richard O. Lempert, Uncovering "Nondiscernible" Differences: Empirical Research and the Jury-Size Cases, 73 Mich. L. Rev. 643 (1975) (criticizing the majority opinion in Williams v. Florida, 399 U.S. 78 (1970)).

Acknowledgements

I am grateful to Johannes Fredderke, Jay Kadane, Mae Quinn, and an anonymous referee for comments on a draft and to Jim Greiner, Jay Kadane, Jay Koehler, and Hal Stern for comments on a related paper.

Author notes

Preliminary versions of this article were presented at the Seventh International Conference on Inference and Forensic Statistics, Leiden, August 2014, a Penn State Law Faculty Colloquium, January 2015, and the Joint Statistical Meetings, Seattle, August 2015.