Abstract

When students answer test questions incorrectly, we often assume they don't understand the content; instead, they may struggle with certain cognitive skills or with how questions are asked. Our goal was to look beyond content to understand what makes assessment questions most challenging. On the basis of more than 76,000 answers to multiple-choice questions in a large, introductory biology course, we examined three question components—cognitive skills, procedural knowledge, and question forms—and their interactions. We found that the most challenging questions require the students to organize information and make meaning from it—skills that are essential in science. For example, some of the most challenging questions are presented as unstructured word problems and require interpretation; to answer correctly, the students must identify and extract the important information and construct their understanding from it. Our results highlight the importance of teaching students to organize and make meaning from the content we teach.

Critically reflective teachers can use student assessments to refine their teaching but only if they correctly interpret what the results indicate about the students’ knowledge and understanding. In large, introductory science courses—where multiple-choice assessments are a common way to manage the significant grading workload—it can be particularly difficult to interpret student performance. When the students incorrectly answer a question on a test or quiz, we often assume that they don't understand the content (i.e., they lack conceptual knowledge). Instead, their struggle may be due to a lack of proficiency with cognitive skills needed to answer successfully, procedural knowledge required to solve problems, how questions are asked, or some combination of these. As a result, assessments of learning may not align with conceptual understanding and, consequently, may not provide accurate feedback to the instructor about what is challenging for the students and why. Therefore, in addition to our need to understand what content is challenging for the students, we also need to understand how that difficulty is moderated by assessment methods.

There is a large body of research that investigates why assessment questions may be challenging, independent of content. One obvious reason is that the students use different cognitive skills for different types of problem solving; these cognitive skills are commonly categorized using Bloom's taxonomy (Crowe et al. 2008). Bloom formulated six learning objectives in the cognitive domain: knowledge plus five skills that depend on the understanding and use of knowledge—namely, comprehension, application, analysis, synthesis, and evaluation (Bloom 1956). Bloom's taxonomy of cognitive skills assumes a hierarchical relationship—that is, that mastery of lower-order critical thinking skills (i.e., knowledge and comprehension) is a necessary prerequisite for mastery of higher-order critical thinking skills such as analysis, application, evaluation, and synthesis (Anderson et al. 2001, Krathwohl 2002, Willingham 2008, Green 2010, Lemons and Lemons 2013). Bloom's taxonomy does not always capture the full picture, however, with studies finding that the level of difficulty according to Bloom's taxonomy is only weakly correlated with difficulty as reflected in student performance (e.g., Momsen et al. 2013). One reason for this weak correlation is that the skills students engage in often defy the hierarchical assumption of Bloom's taxonomy. For example, a question involving a calculation might be categorized as an application problem if the students have seen a similar question before but with different values; however, this type of question does not necessarily require comprehension of the underlying concepts and is often not challenging for the students who can successfully insert values and perform calculations.

Beyond the hierarchical assumption of Bloom's taxonomy, a more fundamental problem is that learning objectives typically consist of two parts: the subject matter content and what is to be done with or to that content (Krathwohl 2002). In Bloom's original taxonomy, the knowledge category embodied both parts. Anderson and colleagues (2001) transformed the original one-dimensional classification into two dimensions: a knowledge dimension and a cognitive dimension. The revised classification's cognitive dimension retained Bloom's original six cognitive skills (with some renaming and reordering). The knowledge dimension, however, spans all levels of the cognitive dimension and is divided into four subcategories: factual knowledge, conceptual knowledge, procedural knowledge, and metacognitive knowledge (Anderson et al. 2001, Krathwohl 2002). Procedural knowledge is of particular interest in our present discussion because, in addition to understanding content, the students need to know the procedures involved in solving scientific problems, such as specific methods of inquiry, as well as criteria for using various skills, algorithms, techniques, and methods. Process skills of science are tools for gathering information, generating and testing new ideas, building new knowledge, learning scientific concepts, and constructing scientific explanations of the world; examples include observing, hypothesizing, interpreting, and predicting (Turiman et al. 2012, Abell et al. 2018).

Finally, another reason assessment questions can be challenging is that the students will be at a disadvantage if they lack experience answering certain types of questions (Lemons and Lemons 2013). A question's form (both the structure of the question and how material is presented) can range from unstructured word problems to categorized lists, tables, figures, or pictures. For example, if the students are asked to analyze data, the question's difficulty is, in part, a function of how the data are presented, because the question's form influences which cognitive skills and procedural knowledge the students must implement to answer correctly.

In the present article, we describe an approach to identify which question components are most challenging for the students. Specifically, we asked, independent of content, which question components (i.e., cognitive skills, procedural knowledge, and question forms) and combinations of question components were most challenging. Our approach presumes that all assessment questions are asked in a particular question form and about specific content and that answering the questions requires cognitive skills and procedural knowledge (figure 1). Our goal is to delineate the main and interaction effects of these three components, which could subsequently be examined as a function of content.

Figure 1. Four components of assessment questions. Our approach presumes that every question asks about some content in a particular question form and that answering correctly requires procedural knowledge and cognitive skills. These four question components result in some questions that are more challenging than others. The source of this challenge could be a single component (e.g., content) or an interaction between components. Content and content-specific interactions (dark gray) may be course specific, but the remaining question components and interactions between them (light gray) are likely to be generalizable.

Methods

Course context

Introduction to Genetics and Evolution (Biology 202 L) is one of two introductory biology courses at an R1: Doctoral University (i.e., very high research activity; the Carnegie Classification of Institutions of Higher Education, http://carnegieclassifications.iu.edu) and is a requirement for biology majors and recommended for the prehealth students (see the supplemental file S1 for course syllabus). The course includes three 1.25-hour weekly meetings with in-class, active learning exercises (representing 2% of the total course grade) plus a 2.5-hour lab each week (25% of the course grade). In addition to in-class activities, the students are assigned weekly problem sets (for 9% of their grade), frequent individual and group quizzes (for 16% of their grade), and three noncumulative exams (16% of the grade each). For this study, we collected data on every assessment question (i.e., eight individual quizzes and three course exams) from the spring semester in 2018. All assessments were composed exclusively of multiple-choice, multiple-response questions with four answer choices per question (i.e., “select all that apply”). The answer choices were removed from the study if credit had been given back to the students because of unclear wording that became apparent from the students’ feedback. Therefore, our data include 514 answer choices to 133 assessment questions.

Participant data

Of the 397 students enrolled in the course, 227 consented to participate in this study; 148 of these students completed all questions in our study, yielding a data set of 76,072 student answers to the assessment questions. There were more female-identifying students (66%) than male-identifying students (34%) in our sample. Students from populations that have been historically underrepresented in STEM (i.e., Black, African American, Indigenous, and people of color) made up 26% of the sample. Finally, 57% of the students were not biological science majors. IRB approval for this research was obtained through Duke University (protocol no. 2017–0455).

Categorizing answer choices to assessment questions

Each of the 514 answer choices was categorized in three dimensions: cognitive skill, procedural knowledge, and question form (table 1). The cognitive skill required by each answer choice was determined using the Blooming Biology Tool (Crowe et al. 2008). The answer choices were categorized into one of six levels: knowledge, comprehension, application, analysis, synthesis, or evaluation; however, none of the answer choices in our sample involved synthesis or evaluation.

Table 1.

List of question components used in our models.

Cognitive skill: knowledge, comprehension, application, analysis.
Procedural knowledge of science cognitive processes: connect, calculate, interpret, predict, create.
Question form: data, graph, visual, structured word problem, unstructured word problem.

Our procedural knowledge of science cognitive processes dimension (which we call procedural knowledge for conciseness) elaborates on Anderson and colleagues' (2001) category, which encompasses the methods of inquiry, scientific reasoning, and meaning making required to solve problems correctly. Using a yes–no rubric (table 2), we characterized five procedures that the students would have to implement to answer our assessment questions correctly: connect, calculate, interpret, predict, and create. Each answer choice was assigned a maximum of two procedural knowledge categories.

Table 2.

Procedural knowledge of science cognitive processes.

Connect: Does correctly responding to the answer choice require students to identify relationships between concepts? Example: The assessment question presents a phylogeny; an answer choice that requires connection asks whether this phylogeny is consistent with the principle of parsimony.

Calculate: Does correctly responding to the answer choice require students to manipulate numbers to derive a mathematical answer? Example: The assessment question presents allele frequencies; an answer choice that requires calculation asks what the expected equilibrium genotype frequency is.

Interpret: Does correctly responding to the answer choice require students to explain the meaning of a specific value, set of values, data, graph, or depiction? Example: The assessment question asks students to calculate the expected equilibrium genotype frequency and compare it with an actual genotype frequency; an answer choice that requires interpretation asks whether this population is most likely to be randomly mating.

Predict: Does correctly responding to the answer choice require students to predict the implications of altering the original scenario? Example: The assessment question presents genotype frequencies and fitness values; an answer choice that requires prediction asks what would happen to the allele frequency if gene flow began from a different population.

Create: Does correctly responding to the answer choice require students to create an intermediate interpretation based on the information provided? Example: The assessment question presents the disease statuses in a family and asks the students the probability that certain individuals' grandchildren are affected by the disease; an answer choice that requires creation requires students to create the family pedigree to determine the correct answer.

Multicollinearity tests between cognitive skill and procedural knowledge showed no significant correlations, which makes sense, given that procedural knowledge elements could be found at both low and high cognitive skill levels. For example, a calculation question may only require comprehension (a low-level cognitive skill) if the students have seen the same type of problem before and if they are provided the equation and all values needed to solve the problem. In contrast, a calculation question may require higher-order cognitive skills if the students must apply what they know to a novel scenario.
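As an illustration of this kind of check, the sketch below computes pairwise phi coefficients (Pearson correlations of 0/1 indicators) between cognitive-skill and procedural-knowledge codes. This is not the authors' SAS analysis; the data frame, indicator values, and column names are invented for illustration.

```python
# Illustrative collinearity check between two coding dimensions.
# The DataFrame and its 0/1 indicator columns are hypothetical.
import pandas as pd
from scipy.stats import pearsonr

codes = pd.DataFrame({
    "application": [1, 0, 0, 1, 0, 1, 0, 1],   # cognitive-skill indicators
    "analysis":    [0, 1, 0, 0, 1, 0, 1, 0],
    "calculate":   [0, 1, 1, 0, 0, 1, 0, 1],   # procedural-knowledge indicators
    "interpret":   [1, 0, 1, 1, 0, 0, 1, 0],
})

for skill in ["application", "analysis"]:
    for procedure in ["calculate", "interpret"]:
        phi, p = pearsonr(codes[skill], codes[procedure])  # phi coefficient for binary data
        print(f"{skill} x {procedure}: phi = {phi:.2f}, p = {p:.3f}")
```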

The question form dimension describes how information in a question is presented. Our sample had five question forms: data, graphs, visuals, structured word problems, and unstructured word problems (table 1). Each question in our study was categorized as a single question form. Data questions present the students with a table of quantitative or sequence data; graph questions present the students with a visual representation of quantitative data; visual questions present concepts or processes as a picture or illustration (e.g., diagrams, flow charts); structured word problems present important information organized in a way that correlates with how the question should be answered; and, finally, unstructured word problems present the students with a cohesive paragraph of information without the scaffolding of a structured word problem. Examples of each question form are available in supplemental file S2.

In addition to categorizing each answer choice in the three dimensions of question components, we also accounted for two dimensions that could, potentially, influence the students’ performance: time and assessment stakes. We accounted for time by coding for early semester (i.e., questions from the first four quizzes and first exam) or late semester (i.e., questions from the last four quizzes and two exams). To account for possible differences in student motivation or stress levels, the questions on the exams were coded as high stakes, and the questions on the quizzes were coded as low stakes.

Interrater reliability of coding

To determine the accuracy of coding for cognitive skills and procedural knowledge, two independent raters coded a sample of 100 answer choices (approximately 20% of the sample). Interrater reliability for cognitive skills was high (Cohen's kappa, κ = .77), indicating substantial agreement between the raters on the cognitive skill variable. Each item in procedural knowledge also had a high interrater reliability coefficient: κ = .82 for connect, κ = .90 for interpret, κ = .96 for calculate, κ = 1.00 for predict, and κ = .90 for create. An example of our coding is available in supplemental file S3.
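For readers who want to run a similar reliability check, Cohen's kappa can be computed from two raters' category labels as in the sketch below; the labels are invented for illustration, and this is not the authors' code.

```python
# Sketch: Cohen's kappa for two raters coding the same answer choices.
# The category labels below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["knowledge", "analysis", "application", "comprehension", "analysis", "application"]
rater_2 = ["knowledge", "analysis", "comprehension", "comprehension", "analysis", "application"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement; 0 = chance-level agreement
```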

Difficulty and discrimination

We examined how challenging a question is by looking at two dimensions of each answer choice: difficulty and discrimination. The difficulty score is the proportion of the students who answered correctly (Kelley 1939). The discrimination score is a measure of how well the answer choice discriminated between high-performing and low-performing students (Kelley 1939); discrimination scores for each answer choice were calculated as the difference between the number of correct responses by the high-scoring students (i.e., the students with scores in the top 27% on the assessment) and the number of correct responses by the low-scoring students (i.e., the students with scores in the bottom 27% on the assessment), divided by the number of students that compose 27% of the respondents (Kelley 1939, Kalas et al. 2013). We removed answer choices from our analysis if they had a difficulty score between .3 and .7 but a discrimination score below .3 (n = 12); we did this because a difficult question with low discrimination may indicate a problem with the question, such as confusing wording (Davis 2009).
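The sketch below shows one way to compute these two indices from item-level responses, following the upper–lower 27% procedure described above; the simulated data and the function and array names are our own illustration, not the authors' code.

```python
# Sketch: difficulty and upper-lower (27%) discrimination for one answer choice.
# `item_correct` holds 1/0 responses; `total_scores` holds each student's
# overall assessment score. All data here are simulated for illustration.
import numpy as np

def difficulty_score(item_correct):
    """Proportion of students who answered the item correctly (Kelley 1939)."""
    return float(np.mean(item_correct))

def discrimination_score(item_correct, total_scores, frac=0.27):
    """(Correct answers in the top 27%) minus (correct answers in the bottom 27%),
    divided by the size of a 27% group (Kelley 1939, Kalas et al. 2013)."""
    n = int(round(frac * len(total_scores)))
    order = np.argsort(total_scores)             # students sorted by overall score
    low = np.asarray(item_correct)[order[:n]]    # bottom 27% of students
    high = np.asarray(item_correct)[order[-n:]]  # top 27% of students
    return float((high.sum() - low.sum()) / n)

rng = np.random.default_rng(0)
total_scores = rng.normal(75, 10, size=200)            # simulated assessment scores
item_correct = (rng.random(200) < 0.85).astype(int)    # simulated responses to one answer choice
print(difficulty_score(item_correct), discrimination_score(item_correct, total_scores))
```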

Data analysis plan

We ran two series of linear regression models, one with the dependent variable difficulty and the other with the dependent variable discrimination. When examining which question components are challenging in our course, we built our models in three steps. The independent variables of model 1 included all cognitive skills, procedural knowledge, and question forms, plus time and stakes (each as a dummy variable). Model 2 included all independent variables from model 1, as well as all possible interactions between cognitive skill, procedural knowledge, and question form. Model 3 refined model 2 through Akaike information criterion (AIC) selection, allowing us to investigate whether interactions explained more of the variation in the dependent variables than each independent variable alone.
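The published analysis was run in SAS 9.4; the sketch below mirrors the three-step strategy in Python with statsmodels on simulated data, using made-up column names and only a subset of the components and interactions, so it illustrates the approach rather than reproducing the actual models.

```python
# Sketch of the three-step modeling strategy on simulated data
# (the published analysis was run in SAS 9.4; names here are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500  # answer choices
components = ["analysis", "interpret", "calculate", "unstructured_wp",
              "early_semester", "low_stakes"]          # abbreviated component list
df = pd.DataFrame({c: rng.integers(0, 2, n) for c in components})
df["difficulty"] = rng.uniform(0.5, 1.0, n)            # proportion correct per answer choice

# Model 1: main effects of question components plus time and stakes.
m1 = smf.ols("difficulty ~ " + " + ".join(components), data=df).fit()

# Model 2: model 1 plus interactions among skills, procedures, and forms (subset shown).
interactions = ["analysis:unstructured_wp", "interpret:calculate", "interpret:unstructured_wp"]
m2 = smf.ols("difficulty ~ " + " + ".join(components + interactions), data=df).fit()

# Model 3 (stand-in for AIC selection): retain the specification with the lower AIC.
print("AIC model 1:", round(m1.aic, 1), "| AIC model 2:", round(m2.aic, 1))
```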

Significant interactions were further probed using least square means (lsmeans). The lsmean difficulty score is the estimated mean difficulty (i.e., percentage correct); lower lsmean scores for difficulty indicate that a lower percentage of the students answered the question correctly (i.e., the question was more difficult). The lsmean discrimination score is the estimated mean difference between the scores of the high- versus those of the low-performing students; higher lsmean scores indicate increased discrimination, i.e., greater differences between the scores of the high- versus the low-performing students. For all analyses, we used SAS version 9.4 software. Raw data and model SAS code are available from the corresponding author on emailed request.
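SAS's LSMEANS statement has no single Python equivalent; one rough approximation, continuing the hypothetical model from the previous sketch, is to average model predictions over a balanced grid of the coded factors, as below.

```python
# Sketch: approximate lsmean difficulty for the interpret x unstructured word
# problem cells by averaging predictions from the hypothetical model m2 over a
# balanced grid of the coded factors (continues the previous sketch).
import itertools
import pandas as pd

grid = pd.DataFrame(list(itertools.product([0, 1], repeat=len(components))),
                    columns=components)
for interp in (0, 1):
    for uwp in (0, 1):
        cell = grid[(grid["interpret"] == interp) & (grid["unstructured_wp"] == uwp)]
        estimate = m2.predict(cell).mean()   # estimated mean proportion correct for this cell
        print(f"interpret={interp}, unstructured_wp={uwp}: lsmean difficulty ≈ {estimate:.2f}")
```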

Challenging Question Components

We first asked whether individual question components are inherently difficult (table 3). Although no cognitive skill, procedural knowledge, or question form alone was inherently difficult for our students, several interactions between question components posed significant challenges. For example, the interaction between the cognitive skill analysis and the question form unstructured word problem is associated with increased difficulty (lsmean difficulty score = .80).

Table 3.

Linear regression models describing the effect of question components on answer choice difficulty.

Independent variables (question components) | Model 1 (β, t, p) | Model 2 (β, t, p) | Model 3 (β, t, p)
Early semester | −.015, −1.30, .1939 | −.015, −1.30, .1952 | −.021, −1.90, .0585
Low stakes | −.016, −1.39, .1640 | −.016, −1.34, .1807 | —
Knowledge | −.093, −2.57, .0105 | −.026, −.61, .5392 | —
Comprehension | −.057, −2.17, .0304 | −.089, −2.13, .0334 | —
Application | −.051, −1.69, .0908 | −.114, −2.62, .0090 | —
Analysis | −.125, −3.89, .0001 | −.173, −3.84, .0001 | —
Connect | −.037, −1.55, .1210 | .023, .73, .4662 | —
Calculate | −.033, −1.47, .1434 | .124, 1.66, .0983 | —
Interpret | −.047, −2.25, .0247 | .050, 1.45, .1489 | —
Predict | −.042, −1.45, .1487 | −.097, −1.55, .1211 | —
Create | .006, .21, .8322 | .033, 1.01, .3116 | —
Data | −.045, −1.68, .0934 | .085, 1.34, .1798 | —
Visual | −.019, −.94, .3478 | −.011, −.54, .5906 | —
Structured word problem | .002, .09, .9299 | .029, 1.13, .2601 | .031, 1.90, .0577
Unstructured word problem | −.038, −1.88, .0612 | −.052, −1.21, .2279 | —
Analysis × not an unstructured word problem | | .070, 1.12, .2623 | −.048, −1.41, .1605
Analysis × unstructured word problem | | | −.063, −3.29, .0011
Interpret × calculate | | −.106, −2.34, .0198 | −.071, −2.64, .0085
Interpret × unstructured word problem | | −.089, −2.51, .0123 | −.066, −2.26, .0244

Note: No interactions were included in model 1; therefore, those cells are blank. Cells containing dashes (—) are variables that were included but not selected by Akaike information criterion (AIC). Interactions are only presented here if they were selected in the AIC model, although all possible interactions were tested in model 2 and, therefore, influence the R2 value for model 2. Higher difficulty values (β) mean that a higher percentage of the students answered those questions correctly, so negative β values indicate more difficult questions. The missing question form, "graph," in models 1 and 2 is the reference category for the dummy variable (i.e., the category with which all other question forms are compared). Model 1: R2 = .096, F(15) = 3.51, p ≤ .001. Model 2: R2 = .146, F(26) = 3.19, p ≤ .001. Model 3: R2 = .112, F(9) = 7.06, p ≤ .001.

In contrast, some question components are inherently discriminating (table 4). Questions requiring higher-order cognitive skills (i.e., application and analysis) were significantly more discriminating than questions not requiring these skills. The procedural knowledge predict was significantly discriminating, as was the question form visual. Finally, low-stakes assessments (quizzes) were significantly more discriminating than high-stakes assessments (exams).

Table 4.

Linear regression models describing the effects of question components on answer choice discrimination.

Independent variables (question components) | Model 1 (β, t, p) | Model 2 (β, t, p) | Model 3 (β, t, p)
Early semester | .018, 1.24, .2147 | .019, 1.36, .1731 | —
Low stakes | .043, 3.04, .0025 | .041, 2.87, .0042 | .043, 3.36, .0008
Knowledge | .116, 2.66, .0080 | .069, 1.36, .1743 | —
Comprehension | .047, 1.50, .1355 | .086, 1.69, .0910 | .031, 1.42, .1558
Application | .084, 2.33, .0203 | .131, 2.48, .0134 | .080, 2.86, .0045
Analysis | .142, 3.64, .0003 | .173, 3.16, .0017 | .133, 4.31, <.0001
Connect | .090, 3.11, .0020 | .050, 1.26, .2076 | —
Calculate | .092, 3.38, .008 | −.010, −1.07, .2830 | —
Interpret | .108, 4.23, <.0001 | .011, .25, .7993 | —
Predict | .097, 2.73, .0065 | .138, 1.81, .0710 | .073, 2.16, .0314
Create | .008, 0.22, .8258 | −.0154, −.40, .6924 | —
Data | .052, 1.60, .1097 | −.086, −1.13, .2604 | —
Visual | .060, 2.41, .0164 | .057, 2.22, .0272 | .051, 2.71, .0070
Interpret × unstructured word problem | | .116, 2.69, .0073 | .126, 4.23, <.0001
Interpret × calculate | | .075, 1.36, .1733 | .114, 3.23, .0013

Note: No interactions were included in model 1; therefore, those cells are blank. Cells containing dashes (—) are variables that were included but not selected by Akaike information criterion (AIC). Interactions are only presented here if they were selected in the AIC model, although all possible interactions were tested in model 2 and, therefore, influence the R2 value for model 2. Higher discrimination values (β) are more discriminating, so a positive β value indicates that the question was harder for the low-performing students. The missing question form, "graph," in models 1 and 2 is the reference category for the dummy variable (i.e., the category with which all other question forms are compared). Model 1: R2 = .158, F(15) = 6.23, p ≤ .001. Model 2: R2 = .195, F(26) = 4.54, p ≤ .001. Model 3: R2 = .175, F(11) = 9.68, p ≤ .001.

Finally, some interactions between question components are both more difficult and more discriminating (tables 3 and 4). Although the procedural knowledge interpret was neither inherently difficult nor discriminating for our students, answer choices became significantly more difficult and more discriminating when questions required the students to interpret an unstructured word problem (lsmean difficulty score = .81, lsmean discrimination score = .32). In contrast, questions were less difficult and less discriminating if they required interpretation but were not unstructured word problems (lsmean difficulty score = .90, lsmean discrimination score = .18), if they were unstructured word problems that did not require interpretation (lsmean difficulty score = .90, lsmean discrimination score = .15), or if they required neither (lsmean difficulty score = .91, lsmean discrimination score = .15).

Another difficult and discriminating combination is questions that require the students to interpret a calculation (lsmean difficulty score = .79, lsmean discrimination score = .33). Questions were easier and less discriminating if they required only interpretation (lsmean difficulty score = .90, lsmean discrimination score = .17), required only calculation (lsmean difficulty score = .92, lsmean discrimination score = .17), or required neither (lsmean difficulty score = .89, lsmean discrimination score = .15).

Teaching Students To Think like a Scientist: Make Meaning from Unstructured Information

Our results reinforce the idea that the most difficult types of assessment questions are those that require students to organize information and make meaning from it. This is important because organizing and making meaning of information is at the heart of what scientists do and, therefore, should be one of the main learning goals in science courses. Our results also suggest that instructors should scaffold this learning goal, especially in introductory science courses where not all students will have had the same opportunities to develop these necessary skills.

We identified two challenging cognitive skills (application and analysis), one procedural knowledge skill (predict), and one question form (visual) that were significantly more discriminating but not more difficult. Although it is not news that higher-order cognitive skills are challenging for students (see, e.g., Freeman et al. 2011, Momsen et al. 2013, Barral et al. 2018), the fact that these question components discriminate between high- and low-performing students reminds us that instructors who want to equalize opportunities for success among their students should offer students opportunities to practice these skills.

Low-stakes assessments such as quizzes are excellent opportunities to build skills. Although we found that quiz questions were more challenging for low-performing students than for high-performing students, this difference disappeared for exams. One interpretation is that the quizzes gave our struggling students the practice that they needed to perform well on exams. Additional evidence that scaffolding supports skill development is that students seemed to struggle more on assessments that occurred early in the semester than on those that occurred late in the semester (p = .0585); in other words, assessment questions seem to become easier for all students with practice. Finally, the fact that structured word problems trended toward being significantly less difficult than other question forms (p = .0577) bolsters claims that additional scaffolding supports students.

For those of us who write test and quiz questions, it is important to understand which interactions between question components are particularly challenging for our students. The most striking interaction we found was unstructured word problems that required interpreting values or underlying meanings from the information presented. The fact that this interaction is not only difficult but also discriminating suggests that the challenge is not a product of unclear wording. We hypothesize that this combination is so challenging because it requires students to create meaning from unstructured information. This explanation aligns with previous research in mathematics showing that unstructured word problems are challenging for all ages, especially when the problem describes a "numerical relationship between two variables" (Hegarty et al. 1995). Researchers believe much of the challenge of these types of problems is related to how students determine what information is useful and how they organize it to find a solution (Edens and Potter 2010). Students who are more successful problem solvers tend to create a mental model of the situation before adding terms and values; in contrast, less experienced students tend to refer to key terms and values first (Hegarty et al. 1995). Given this, it also makes intuitive sense that unstructured word problems that require analysis are significantly more difficult.

In addition, although our content-specific results are not likely to be generalizable (and are therefore not presented in the present article), it is worth noting that some content was only challenging when combined with specific question components, particularly unstructured word problems and those requiring interpretation. This similarly implies that the students' struggles are likely related to organizing information to make meaning of content.

Another significantly discriminating interaction is questions that combine two procedural knowledge skills: calculating and interpreting. This suggests that although students may easily perform a calculation, they are often challenged by understanding why the calculation works and what it means. Students struggling with interpreting a calculation may lack a conceptual framework for the equations they are implementing. Similar results exist in physics, where problem solving often involves calculations (Byun and Lee 2014), and in mathematics, where novice learners tend to begin solving problems by choosing equations with relevant variables, whereas experts begin by developing a conceptual framework for the problem (Kuo et al. 2013).

Inherent in studies of existing courses are limitations of the data. We acknowledge that our sample had more questions with some components than others. Although this is not ideal, we have confidence in our results because they are based on a large sample (more than 76,000 student answers). Another limitation of our data is that our assessment questions represent only five types of procedural knowledge necessary to solve problems. This list is certainly not exhaustive; rather, this work serves as a proof of concept that procedural knowledge of science cognitive processes is distinct from the cognitive skills identified in Bloom's taxonomy.

We also recognize that our results may be limited because of sampling from a diverse student population. Our regression models had low R2 values, indicating that there is considerable variance in the data that we have not explained. This makes sense because this course has a tremendously heterogeneous student population, which our models don't account for. Similarly, our student population may not be representative of students at other colleges and universities. Future studies could create more comprehensive models by adding variables such as demographics and the number and type of previous biology courses taken. Nonetheless, our broadest result—that it is challenging for introductory students to make meaning of complex scientific information—is very likely generalizable.

Our approach could readily be applied both by instructors who want to understand their students better and by researchers who want to build on our models. For instructors, our results suggest specific ways to support low-performing students by intentionally helping them to develop mental models of situations before teaching them the tools to solve the problems. Some course management software (such as Sakai) calculates the discrimination index, making it even easier for instructors to identify which questions are most challenging for their lowest-performing students. For researchers, our work delineating the main and interaction effects of question components sets the stage for subsequent studies that could investigate them as a function of content. Using this model to investigate additional aspects of procedural knowledge of science cognitive processes would also be a fruitful avenue for future research.

Acknowledgments

This study was supported by National Science Foundation grant no. DUE-1525602. We would like to thank Zoë Isabella for invaluable participation with interrater reliability testing and feedback on the project.

Disclosure statement

There is no conflict of interest to disclose.

Author Biographical

Sarah B. Marion ([email protected]) is a graduate student, co-first author Julie A. Reynolds is an associate professor of the practice, and John H. Willis is a professor in the Biology Department at Duke University in Durham, North Carolina, USA. Lorrie Schmid is the lead of data management and analysis at the Social Science Research Institute at Duke University, and Robert J. Thompson Jr. is a professor emeritus in the Department of Psychology and Neuroscience at Duke University. B. Elijah Carter is an assistant professor in the Department of Biology at Ithaca College in Ithaca, New York, USA. Laurie Mauger is a university-level curriculum and assessment specialist for the 21st Century Partnership for STEM Education in Conshohocken, Pennsylvania, USA.

References

Abell ML, Linda B, Doug E, Lewis L, Hortensia S. 2018. MAA Instructional Practices Guide. Mathematical Association of America Press.

Anderson LW, Krathwohl D, Airasian P, Cruikshank KA, Mayer R, Pintrich P, Raths J, Wittrock M. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman.

Barral AM, Ardi-Pastores VC, Simmons RE. 2018. Student learning in an accelerated introductory biology course is significantly enhanced by a flipped-learning environment. CBE—Life Sciences Education 17: 38.

Bloom BS. 1956. Taxonomy of Educational Objectives, Handbook 1: Cognitive Domain. Longman.

Byun T, Lee G. 2014. Why students still can't solve physics problems after solving over 2000 problems. American Journal of Physics 82: 906–913.

Crowe AJ, Dirks C, Wenderoth MP. 2008. Biology in Bloom: Implementing Bloom's taxonomy to enhance student learning in biology. CBE—Life Sciences Education 7: 368–381.

Davis BG. 2009. Tools for Teaching, 2nd ed. Wiley.

Edens K, Potter E. 2010. How students "unpack" the structure of a word problem: Graphic representations and problem solving. School Science and Mathematics 108: 184–196.

Freeman S, Haak D, Wenderoth MP. 2011. Increased course structure improves performance in introductory biology. CBE—Life Sciences Education 10: 175–186.

Green KH. 2010. Matching functions and graphs at multiple levels of Bloom's revised taxonomy. Primus 20: 204–216.

Hegarty M, Mayer RE, Monk CA. 1995. Comprehension of arithmetic word problems: A comparison of successful and unsuccessful problem solvers. Journal of Educational Psychology 87: 18–32.

Kalas P, O'Neill A, Pollock C, Birol G. 2013. Development of a meiosis concept inventory. CBE—Life Sciences Education 12: 655–664.

Kelley TL. 1939. The selection of upper and lower groups for the validation of test items. Journal of Educational Psychology 30: 17–24.

Krathwohl DR. 2002. A revision of Bloom's taxonomy: An overview. Theory Into Practice 41: 212–218.

Kuo E, Hull MM, Gupta A, Elby A. 2013. How students blend conceptual and formal mathematical reasoning in solving physics problems. Science Education 97: 32–57.

Lemons PP, Lemons JD. 2013. Questions for assessing higher-order cognitive skills: It's not just Bloom's. CBE—Life Sciences Education 12: 47–58.

Momsen J, Offerdahl E, Kryjevskaia M, Montplaisir L, Anderson E, Grosz N. 2013. Using assessments to investigate and compare the nature of learning in undergraduate science courses. CBE—Life Sciences Education 12: 239–249.

Turiman P, Omar J, Daud AM, Osman K. 2012. Fostering the 21st century skills through scientific literacy and science process skills. Procedia—Social and Behavioral Sciences 59: 110–116.

Willingham DT. 2008. Critical thinking: Why is it so hard to teach? Arts Education Policy Review 109: 21–32.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
