Bartosz Ptasznik, Sascha Wolfer, Robert Lew, A Learners’ Dictionary Versus ChatGPT in Receptive and Productive Lexical Tasks, International Journal of Lexicography, Volume 37, Issue 3, September 2024, Pages 322–336, https://doi.org/10.1093/ijl/ecae011
Abstract
This study assesses the effectiveness of ChatGPT versus the Longman Dictionary of Contemporary English (LDOCE) in supporting English language learners in challenging receptive and productive lexical tasks. With a sample of 223 university students at B2 to C1 proficiency levels, this research investigates whether a leading AI-driven chatbot or a high-quality learners’ dictionary better assists learners in accurately understanding and producing English. The results reveal ChatGPT’s superior performance in both task types. Efficiency, in terms of consultation speed, also favoured ChatGPT, though only in the production task. This study advocates an integrated approach that leverages both AI, with its interactive and immediate feedback, and more traditional lexicographic tools that may foster learner autonomy and linguistic proficiency.
1. Introduction
1.1. Traditional dictionaries
The trusted traditional lexical tool in the English classroom is the dictionary, especially one made for learners (Lew & Adamska-Sałaciak, 2015). Dictionaries have long remained the reference of choice of both teachers and learners (Levy & Steel, 2015), even as they have evolved from print books to digital tools (see below). For centuries, dictionaries have been respected sources of lexical information, and in recent decades they have increasingly drawn on extensive textual evidence found in digital corpora. Their usefulness in the context of language pedagogy, and particularly in English, is fairly well established (Dziemianko, 2018).
What has definitely evolved is the medium of the dictionary. Printed paper had ruled for centuries before digital formats gradually took over. Among the digital formats, bigger computer screens seem to have been favoured around Europe as late as July 2017 (Kosem et al., 2019), with only about 11 percent (1059 of 9379) of participants from across Europe naming the smartphone as the device of choice for dictionary consultation. Inevitably, the preference is shifting in favour of the smartphone: three years later, in late 2020, more than a quarter (43 out of 168) of respondents claimed to prefer smartphone-based dictionaries (Ptasznik, 2022). A quick computation reveals this preference trend to be significant (OR = 2.27, CI95 = [1.59 – 3.19]). Extrapolating from this tendency, the smartphone is likely to dominate as of this writing (2024). In addition, the preference for small screens has been stronger in Asia, with its tradition of handheld digital dictionaries (Chen, 2010), and in many African nations, where the phone arrived as the first digital device in general use.
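For transparency, here is a minimal R sketch of how such a 'quick computation' on the two survey proportions could be run. The counts come from the studies cited above, but the exact estimator behind the reported figures is not spelled out here, so the output of this sketch need not match them precisely.

```r
# Sketch: comparing smartphone preference in 2017 (1059 of 9379) with
# 2020 (43 of 168) via a 2x2 contingency table. The precise estimator
# behind the reported OR = 2.27 is an assumption; fisher.test() is one
# common choice and may yield somewhat different values.
tab <- matrix(c(43,   168 - 43,      # 2020: smartphone vs. other devices
                1059, 9379 - 1059),  # 2017: smartphone vs. other devices
              nrow = 2, byrow = TRUE)
fisher.test(tab)  # reports an odds ratio with a 95% confidence interval
```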
The dictionary has largely survived these technological changes as a trusted (Kosem et al., 2019) source of language reference. And yet, dictionaries are not ideal, because—as a rule—they organize lexical information around individual orthographic words, which is not a particularly faithful reflection of how language works: language is not a bag of words (Bolinger, 1985). To find information on word combinations, learners as dictionary users need to scan the complete entry in search of the relevant section (assuming it is in the dictionary at all: it may not be); at least that was the case until very recently, when some leading digital dictionaries made it possible to search for multi-word expressions by typing in at least two component words, in any order. Despite their quirks and deficiencies, dictionaries have retained a central role in literate societies. In late 2022, however, a new player emerged on the scene, making a dramatic entrance that could well upset the long-standing balance: this was ChatGPT, along with other generative transformer-based systems. In the following sections, we introduce ChatGPT before moving on to discuss the evidence of the effectiveness of dictionaries and Generative AI systems in lexically oriented tasks. This discussion sets the stage for our original research, in which we compare the performance of a leading learners’ dictionary with ChatGPT in reception and production tasks.
1.2. The birth and rise of ChatGPT
In late November 2022, the launch of ChatGPT marked a significant milestone in the field of artificial intelligence and language technology. Developed by OpenAI, this large language model was made publicly available, provoking a spectrum of reactions from various communities, including those involved in English Language Teaching (ELT). The release of ChatGPT was not just a technological event; it represented a paradigm shift in the way digital tools interact with human language, especially English, which made up the bulk of GPT’s training data.
The core of the mixed reactions stemmed from ChatGPT’s advanced capabilities in processing and generating natural language. Unlike any of its predecessors, ChatGPT demonstrated an unprecedented fluency in English. It exhibited an unexpected ability to produce responses that were not just grammatical, but natural-sounding in the choice of words, collocations, and structures, as well as (for the most part) logically coherent and contextually appropriate. For some teachers of English, this meant a tool that could potentially revolutionize language learning. On the other hand, the novelty and complexity of the technology raised questions about its pedagogical implications and the appropriateness of its integration into traditional language learning frameworks.
1.3. Dictionaries in reception and production: literature review
It is generally assumed that dictionary use can support text reception and production in an additional language. However, not all early studies confirmed this assumption. A case in point was Bensoussan et al. (1984), which showed no positive effect: a finding the authors themselves characterized as surprising. The reasons behind this somewhat counterintuitive finding were later convincingly explained by Nesi (2000, Chapter 2). More recent studies found clear positive effects of dictionary use in reception and production alike. Below we offer an overview of some studies that were characterized by robust design and assessed the effect of dictionary use for English as an additional language.
Laufer (2011) explored the impact of dictionary use on the production and retention of collocations in English as a second language using two student cohorts: (1) intermediate college students with Hebrew or Arabic as their first language and English as their second; and (2) pre-intermediate middle school students with Hebrew as their first language and English as their second. The research was structured into three stages: an initial pre-test to assess the students’ knowledge of English collocations without the aid of dictionaries, a subsequent exercise that examined the contribution of dictionary use to the production of collocations, and a final retention test to evaluate the learning effect. The results revealed a significant improvement in the ability to produce correct collocations from the pre-test (no dictionary) to the exercise (with dictionary) phase, with mean scores (which we convert here to relative percent scores for the readers’ convenience) increasing approximately twofold: from 18% to 40% in the college student group and from 18% up to 35% for the middle school students.
Chan (2013) tested advanced learners of English with Chinese as their first language in the context of using dictionaries for meaning determination and sentence construction. The research focused on the use of two dictionaries, LDOCE5 and COBUILD6, in meaning recognition and productive sentence construction. The study employed a stimulated recall task immediately following each item within two lexical-search tasks: a Meaning Determination Task and a Sentence Construction Task. The findings indicated a significant advantage of dictionaries: for reception tasks, scores without dictionaries averaged 77.8%, as against 86.7% with dictionaries. For production tasks, the mean score was 24.1% without dictionaries compared to 70.4% with dictionaries.
Li and Xu (2015) examined the effectiveness of the Macmillan English Dictionary Online (MEDO) in aiding Chinese EFL learners at the college intermediate level in identifying the meanings of English verb phrases. The research focused on a receptive task designed for meaning recognition, specifically targeting verb phrase chunks. Utilizing a meaning determination task for verb phrases, the study aimed to assess the accuracy rate of learners in understanding these phrases with and without the aid of MEDO. The findings revealed a significant difference in comprehension accuracy, with the rate nearly doubling from 26% without the use of the dictionary to 50% when MEDO was employed.
Zou (2016) looked into the efficacy of dictionary-induced vocabulary learning compared to inferencing within a reading context among intermediate-level college students who were Chinese speakers learning English. The focus was on the acquisition of single words as target linguistic structures through tasks aimed at meaning recall and productive assessment. Two specific tasks were administered: the first involved reading comprehension coupled with dictionary consultation using LDOCE5, and the second entailed reading comprehension with inferencing. The outcomes demonstrated a marked improvement in vocabulary acquisition through dictionary use, with mean scores for Task 1 showing a significant leap from a pretest score of 0.16 to an immediate posttest score of 12.11, and a delayed posttest score of 8.98. In contrast, Task 2 results were lower, with mean scores rising from a pretest score of 0.18 to an immediate posttest score of 7.91, and a delayed posttest score of 5.45. The conclusion drawn from these findings was that dictionary-induced vocabulary learning substantially outperforms inferencing in the context of reading, highlighting the important role dictionaries play in facilitating effective comprehension and vocabulary acquisition for second language learners.
1.4. Generative AI systems
Generative AI systems are a recent innovation, so it is not surprising that the body of empirical studies available is much smaller than for dictionaries. In fact, the first wave of studies looked more into the option of harnessing Generative AI for the creation of dictionary content. De Schryver (2023) offers a comprehensive overview as well as an original study; others are Lew (2023) and Rees & Lew (2024). At this time, it appears that the technology is quite effective when it comes to creating definitions and examples, but is more challenged by less verbal tasks such as sense division and clustering, or identifying syntactic patterns.
Yet chatbots could—rather than assisting in the creation of traditional learning materials—be used by language learners directly. A very recent and competent review of the use of Generative AI technology in language learning is Łodzikowski et al. (2024). There appear to be quite a few position papers on the role of chatbots in (English) language teaching (e.g. Kostka & Toncelli, 2023), but we could not yet find published reliable empirical evidence of their effectiveness in this context. One recent small-scale study (Bašić et al., 2023) found no positive effect of the use of ChatGPT in academic essay composition, but in this study participants were writing in their strongest (native) language, Croatian, a language in which ChatGPT cannot be expected to perform nearly as well as in English. We are more interested in how the technology could assist learners of English, and our focus is on local lexical problems rather than essay composition, where structure and argument are at least as important as lexical and collocational choices.
We have not been able to trace any systematic studies of the effectiveness of chatbots for lexically oriented problems. Presumably, the option is too recent for experimental studies to run their course: a process where cutting corners is not a good idea. To help answer this question, we decided to compare ChatGPT, the most popular such chatbot, against the gold standard of LDOCE, a leading dictionary for learners of English.
2. The study
2.1. Aim
The main aim of the study was to compare the effectiveness of ChatGPT-3.5 (free version) and a leading English monolingual learner’s dictionary (the Longman Dictionary of Contemporary English (LDOCE 2023) consulted on each participant’s own smartphone via a mobile browser at https://www.ldoceonline.com), in lexically oriented receptive and productive tasks. A secondary aim was to compare the efficiency (speed) of consultation. These aims may be enumerated as the following research questions:
RQ 1. Does ChatGPT-3.5 contribute to higher success than LDOCE in a lexically oriented reception task?
RQ 2. Does ChatGPT contribute to higher success than LDOCE in a lexically oriented production task?
RQ 3. Does ChatGPT contribute to faster consultation time than LDOCE in a lexically oriented reception task?
RQ 4. Does ChatGPT contribute to faster consultation time than LDOCE in a lexically oriented production task?
2.2. Participants
In order to compare dictionary use with ChatGPT, we tested the two tools with 223 students at a large Polish university enrolled in a BA programme in English, aged between 19 and 22, with some overrepresentation of female students (about 65 percent) typical of this study programme in Poland (and many other countries). Their English proficiency was regularly monitored using standardized internal testing, and it ranged between B2 and C1 levels according to the CEFR, depending on the student group. The study used existing student groups from years 1 to 3 of the programme, and reasonable care was taken to distribute groups equally between the dictionary and ChatGPT conditions with respect to proficiency level, although there was no random assignment of individual participants to conditions: it was not logistically possible to break up existing student groups. The testing was done anonymously, no personal information of any kind was collected, and informed consent to participate was obtained. Participants were given the option of discontinuing their participation at any stage of the test. Personal information was protected through data anonymization. The study was conducted in accordance with the provisions of Polish law.
2.3. Materials
Approximately half of our participants (N = 113) used an online (mobile web app) version of the Longman Dictionary of Contemporary English (LDOCE online). The other half (N = 110) used ChatGPT 3.5 in its non-subscription (i.e., free) version, as available in November 2023, on desktop computers provided by the university. In a classroom setting, participants were given two challenging lexically oriented tasks on paper: a reception test and a production test (see Supplementary Data). Twenty test items each were used in the production and reception tests, so in total there were forty different test items. Items in the reception task were low-frequency nouns not listed in the Longman Communication 9000. Items selected for the production task were highly polysemous verbs appearing in complex verb complementation patterns, mostly but not exclusively in phrasal verb uses. The items were meant to present a challenge and make participants consult the tools assigned to them (that is, either LDOCE or ChatGPT) for the task at hand.
The target sentences in the production test, as well as the context (English sentences) provided for the participants in the reception test, were mostly adapted from the Cambridge Advanced Learner’s Dictionary (CALD 2023) (available as part of aggregated content at https://dictionary.cambridge.org/dictionary/english/), the Collins COBUILD Advanced Learner’s Dictionary (COBUILD 2023) (available as part of aggregated content at https://www.collinsdictionary.com/dictionary/english), and the Diki.pl dictionary (Diki.pl 2023) (https://www.diki.pl). The remaining sources are available online as part of the Supplementary Data to this article. One sentence was lightly modified (original: The profits are not high, but the company turns over more than $3.5 million every year; modified to: The profits may not be high, but the company turns over more than a million dollars every year). Three example sentences modelled on existing example sentences were written by the first author, as none of the original examples were fully suitable.
2.4. Procedure
The study involved two paper-based tests: a production test and a reception test (see Supplementary Data), each in two versions with different but randomly generated item order. One of these versions would be assigned to each participant at random. In the production test, participants translated twenty Polish sentences into English, with the English verbs presented in boldface. They were permitted to alter the verb forms as necessary. For the reception test, participants were required to understand the context of twenty English sentences, focusing on the underlined words (test items), and then translate these specific items into Polish.
Prior to the tests, participants received 15 minutes of instructions. They were assured of the confidentiality of their data through anonymization techniques. Participants were told that assistance would come either from ChatGPT (for the ChatGPT group) or from an online version of LDOCE, a monolingual English pedagogical dictionary, consulted via the participants’ own phones (for the dictionary group); the study’s experimental design involving two distinct groups was not revealed. A 90-minute time limit was set for completing both the production and reception tests. The participants, already familiar with using monolingual dictionaries and ChatGPT from prior discussions and practical sessions at the university, were required to record the time spent on each test item. Those using smartphones utilized their device’s stopwatch function, whereas participants in the ChatGPT group used a free online timer and stopwatch available on the computers.
Instructions on how to approach the tasks were clearly outlined. The experimenter verbally communicated the guidelines, followed by a demonstration using a sample task for both tests. Participants were instructed to attempt every question, noting the time required for each. The ChatGPT group was encouraged to interact with ChatGPT in any suitable manner to find the correct answers, while the Dictionary group was directed to use only LDOCE and refrain from consulting any other dictionaries or corpus query tools. Every session was monitored by the experimenter (the first author), who took special care to ensure that each participant would only use the tool assigned for the task. Although both experimental groups were in the same classroom, the seating arrangement was such that it would be very difficult for a participant to see content displayed on someone else’s screen.
Success was recorded for a correct answer on each test item, while an incorrect response was recorded as failure (lack of success). All grading was done strictly by the key, which had been finalized before the actual test. For reception, the key listed all accepted translation equivalents, taken from two well-known online dictionaries: Diki.pl and Bab.la, and success was determined based on the match between the response and the key. For production, the key listed the expected target structures for each item, and again success was recorded if there was a match between the response and the target structure, ignoring minor grammatical mistakes such as in article use.
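To illustrate the scoring protocol, a toy R sketch of key-based matching is given below; the helper function and the key entry are purely hypothetical and do not reproduce the actual grading key (see Supplementary Data).

```r
# Hypothetical sketch of key-based scoring: a response counts as a success
# if it matches any accepted equivalent listed in the key for that item.
normalize <- function(x) trimws(tolower(x))

score_item <- function(response, accepted) {
  normalize(response) %in% normalize(accepted)
}

# Made-up key entry for illustration only:
key <- list(backlash = c("sprzeciw", "ostra reakcja"))
score_item("Sprzeciw", key$backlash)  # TRUE
```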
2.5. Data analysis
All data analysis was conducted in the R environment for statistical computing (R Core Team, 2023) (see Supplementary Data). To estimate success on the reception and production tasks, mixed binary logistic regression models were fitted on the success (versus failure) data at the basic item by participant level, using the lme4::glmer function (Bates et al., 2015) on a total of 8903 non-missing observations: 4451 for reception and 4452 for production. Only nine items overall (across all participants) were left blank in the reception task, and eight in the production task.
Model selection was guided by theoretical conceptualization, research design, and information criteria, as far as allowed by model convergence and dispersion (Burnham & Anderson, 2004; Meteyard & Davies, 2020). Model dispersion was computed using the blmeco::dispersion_glmer function (Korner-Nievergelt et al., 2015). For effect estimation, we also drew on effects (Fox & Weisberg, 2019) and ggeffects (Lüdecke, 2018). Since the use of dictionaries for lexical reference is sanctioned by long tradition, the dictionary condition was set as the reference level against which the newer chatbot tool was gauged. Details of the specific models are given in the Results section.
Time-on-task, as is typical of reaction or completion times, exhibited a right-skewed distribution; therefore all time measurements were logged, which normalized the distribution. Participants failed to record a small number of item times: 12 for reception and 29 for production, which corresponds to 0.3 percent and 0.7 percent missing data, respectively. For logged time-on-task, we fitted mixed regression models with lme4::lmer and used model selection criteria as above, except for dispersion.
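A minimal sketch of this pipeline is given below, assuming a data frame d with columns Success (0/1), Tool, Item, Participant, and Time in seconds; the column names are illustrative and need not match those in the released analysis scripts.

```r
library(lme4)

# Mixed binary logistic regression for success (reception-task formula):
m_succ <- glmer(Success ~ Tool + (1 + Tool | Item) + (1 | Participant),
                data = d, family = binomial)

# Dispersion check; values near 1 indicate no overdispersion:
blmeco::dispersion_glmer(m_succ)

# Completion times are logged before fitting a linear mixed model:
d$logTime <- log(d$Time)
m_time <- lmer(logTime ~ Tool + (1 + Tool | Item) + (1 | Participant),
               data = d)
```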
2.6. Results
2.6.1 Success: reception task
In the reception task, the mean observed success rate was 73 percent in the LDOCE Dictionary group against 87 percent in the ChatGPT group. For the reception task data, a binary logistic regression model was fitted with reception task success (that is, success or failure on each individual combination of item and participant) as outcome and tool (Dictionary or ChatGPT) as a fixed-effect predictor, and item and participant as random effects, as per the following model formula: Success ~ Tool + (1 + Tool | Item) + (1 | Participant). This means random intercepts were estimated for both item (accounting for the fact that some items might be more difficult than others overall) and participant (some participants might be more proficient than others overall), together with a by-item random slope for tool, thus allowing that a given item might be easy to do with the dictionary but difficult for ChatGPT, or the other way round. Parameter estimates are given in Table 1. Model dispersion was 0.93, showing no evidence of overdispersion.
Table 1. Model parameters for success in the reception task. Dictionary is the reference level. Model formula: Success ~ Tool + (1 + Tool | Item) + (1 | Participant).

| Predictors (Reception) | Odds Ratios [95% CI] | p-level |
|---|---|---|
| Intercept (= Dictionary) | 3.046 [2.263 – 4.099] | < 0.001 |
| ChatGPT | 3.462 [2.483 – 4.826] | < 0.001 |
| Random Effects | | |
| σ² | 3.290 | |
| τ00 Participant | 0.144 | |
| τ00 Item | 0.379 | |
| τ11 Item.Tool[ChatGPT] | 0.327 | |
| ρ01 Item | 0.508 | |
| ICC | 0.208 | |
| Observations | 4451 | |
| Marginal R² / Conditional R² | 0.085 / 0.275 | |
ChatGPT enjoyed significantly greater success rates in the reception task than LDOCE (OR = 3.462 [2.483 – 4.826], p < 0.001). Figure 1 shows predicted success rates for the Dictionary and ChatGPT conditions with their confidence intervals against the distribution of observed by-participant scores rendered as violin plots. Notably, nearly all participants using ChatGPT scored above the mean score achieved with the help of LDOCE.
Figure 1. Predicted probability of success (black dots) and 95% CI range (black lines) in the reception task using the LDOCE dictionary and ChatGPT, with the distribution of observed by-participant means represented as the violins.
In Figure 2, we plot model-predicted probabilities of success for the individual lexical items in the reception task, with the two tools. The model predicts higher success with ChatGPT for every single item except backlash, for which it predicts the same level of success for both tools. This suggests a consistent advantage of ChatGPT over LDOCE as a tool assisting reception: participants from the ChatGPT group are considerably more likely to succeed in the reception task than participants from the dictionary group, even after controlling for inter-participant and inter-item variation.
Figure 2. Predicted probabilities of success with their 95% CI ranges for the individual items in the reception task using the LDOCE dictionary and ChatGPT.
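Per-item predictions of this kind can be extracted from a fitted model with ggeffects; the sketch below (continuing the hypothetical m_succ object from Section 2.5) shows one plausible way, though the exact call used for the published figures is not specified here.

```r
library(ggeffects)

# Predicted probabilities per item and tool, conditioning on the by-item
# random effects (type = "random"); plotting choices are assumptions.
pred_items <- ggpredict(m_succ, terms = c("Item", "Tool"), type = "random")
plot(pred_items)
```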
2.6.2 Success: production task
In the production task, the mean observed success rate was 53 percent in the LDOCE Dictionary group compared to 81 percent in the ChatGPT group: lower than for reception—as one would expect—though not that much lower with ChatGPT. For the production task data, a binary logistic regression model was fitted with production task success as outcome and tool (Dictionary or ChatGPT) as a fixed-effect predictor, and item and participant as random effects, as per the following model formula: Success ~ Tool + (1 + Tool | Item) + (1 | Version/Participant). This means random intercepts were estimated for both item and participant, together with a by-item random slope for tool, thus allowing that a particular item might be easy to do with the dictionary but difficult for ChatGPT, or vice versa (just as in the model for the reception task). In addition, participant was modelled as nested within test version, as per the actual design (see Section 2.4). Computed parameter estimates are given in Table 2. Model dispersion was 0.99, with no evidence of overdispersion.
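In lme4 syntax, the nesting of participants within test versions is expressed as follows (a sketch assuming the same hypothetical column naming as in Section 2.5):

```r
# Production model: participants nested within test version.
# (1 | Version/Participant) expands to
# (1 | Version) + (1 | Version:Participant).
m_prod <- glmer(Success ~ Tool + (1 + Tool | Item) + (1 | Version/Participant),
                data = d, family = binomial)
```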
Table 2. Model parameters for success on the production task. Dictionary is the reference level. Model formula: Success ~ Tool + (1 + Tool | Item) + (1 | Version/Participant).

| Predictors (Production) | Odds Ratios [95% CI] | p-level |
|---|---|---|
| Intercept (= Dictionary) | 1.198 [0.737 – 1.949] | 0.466 |
| ChatGPT | 5.380 [3.545 – 8.164] | < 0.001 |
| Random Effects | | |
| σ² | 3.290 | |
| τ00 Participant:Version | 0.217 | |
| τ00 Item | 0.560 | |
| τ00 Version | 0.059 | |
| τ11 Item.Tool[ChatGPT] | 0.680 | |
| ρ01 Item | 0.232 | |
| ICC | 0.285 | |
| Observations | 4452 | |
| Marginal R² / Conditional R² | 0.133 / 0.381 | |
Note that the intercept expressed as an Odds Ratio is close to one: this just reflects the finding that the success rate on the production task with the dictionary was close to 50%, thus the odds of getting an item right versus failing were approximately one-to-one. However, the odds of getting an item right using ChatGPT were over five times better than with a dictionary, which is a very strong effect in favour of the chatbot tool (OR = 5.380 [3.545 – 8.164], p < 0.001).
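The back-transformation from odds to probabilities can be verified in a line of R; the values below are computed directly from the odds ratios in Table 2.

```r
# Odds ratios from Table 2 converted to predicted probabilities
# (conditional on average random effects):
plogis(log(1.198))          # ~0.545: close to a coin flip with LDOCE
plogis(log(1.198 * 5.380))  # ~0.866: predicted success with ChatGPT
```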
Figure 3 plots predicted production success rates for Dictionary and ChatGPT conditions with their confidence intervals against the distribution of observed by-participant scores rendered as violins. The range of success rates is greater than for reception, but the advantage of ChatGPT is even more pronounced in the production task. In other words, participants in the ChatGPT group were considerably more likely to succeed in the production task than participants from the dictionary group. This still holds if we control for inter-participant and inter-item variation via the statistical models.
Figure 3. Predicted probability of success (black dots) and 95% CI range (black lines) in the production task using the LDOCE dictionary and ChatGPT, with the distribution of observed by-participant means represented as the violins.
Figure 4 displays model-predicted probabilities of success for the individual lexical items in the production task, using either the LDOCE dictionary or ChatGPT. The model predicts higher success (usually much higher) of ChatGPT for nearly all items. Two exceptions are items requiring the production of sentences with the phrasal verbs see out the storm and run up against (an opponent), respectively. These two items turned out to be very hard with either tool. Other than that, the per-item results suggest a consistent and non-trivial advantage of ChatGPT over LDOCE in assisting production.
Figure 4. Predicted probabilities of success with their 95% CI ranges for the individual items in the production task using the dictionary and ChatGPT.
2.7. Time: reception task
Median time-on-task per item in the reception task was 20 seconds for both ChatGPT and dictionary conditions. A mixed regression model was fit on logTime, using the model formula logTime ~ Tool + (1 + Tool | Item) + (1 | Participant). Model parameters are given in Table 3.
Table 3. Model parameters for logTime on the reception task. Dictionary is the reference level. Model formula: logTime ~ Tool + (1 + Tool | Item) + (1 | Participant).

| Predictors (Reception) | logTime Estimates [95% CI] | p-level |
|---|---|---|
| Intercept (= Dictionary) | 3.070 [2.943 – 3.197] | < 0.001 |
| ChatGPT | -0.066 [-0.224 – 0.092] | 0.416 |
| Random Effects | | |
| σ² | 0.188 | |
| τ00 Participant | 0.271 | |
| τ00 Item | 0.034 | |
| τ11 Item.Tool[ChatGPT] | 0.029 | |
| ρ01 Item | -0.933 | |
| ICC | 0.607 | |
| Observations | 4448 | |
| Marginal R² / Conditional R² | 0.002 / 0.608 | |
There is no evidence that the reception task was completed faster with either tool. Model-predicted times by tool are shown in Figure 5, with the time scale back-transformed to seconds. The two violins represent the distributions of observed per-participant times. Note that the distributions are fairly similar, though the dictionary condition appears to exhibit somewhat greater variation. Overall, this means that the higher probability of success in the reception task for participants in the ChatGPT group is not linked to any clear advantage in the time taken to complete the task.
Figure 5. Predicted time per item (black dots) in the reception task (in seconds), with 95% confidence intervals (black lines) and the distribution of observed by-participant means (violins).
2.8. Time: production task
Median time per item in the production task was 52 seconds for the dictionary condition, compared to 34 seconds for ChatGPT. A mixed regression model was fit on logTime, with model formula logTime ~ Tool + (1 + Tool | Item) + (1 | Participant). Model parameters are given in Table 4.
Table 4. Model parameters for logTime on the production task. Dictionary is the reference level. Model formula: logTime ~ Tool + (1 + Tool | Item) + (1 | Participant).

| Predictors (Production) | logTime Estimates [95% CI] | p-level |
|---|---|---|
| Intercept (= Dictionary) | 3.920 [3.737 – 4.102] | < 0.001 |
| ChatGPT | -0.405 [-0.593 – -0.216] | < 0.001 |
| Random Effects | | |
| σ² | 0.277 | |
| τ00 Participant | 0.124 | |
| τ00 Item | 0.149 | |
| τ11 Item.Tool[ChatGPT] | 0.136 | |
| ρ01 Item | -0.903 | |
| ICC | 0.435 | |
| Observations | 4431 | |
| Marginal R² / Conditional R² | 0.077 / 0.479 | |
There is evidence that the production task was completed in significantly shorter time using ChatGPT than with the dictionary (p < 0.001). Predicted times with 95% confidence intervals are shown in Figure 6, with the time scale back-transformed to seconds. The violins plot the distributions of mean per-participant times: the thickness of the violin corresponds to how many participants needed a given time range (in seconds) to complete a typical item. The high negative correlation (ρ01 Item) between the intercepts and slopes suggests that ChatGPT might be able to speed up consultations of those items that are otherwise particularly slow. This means that although no time advantage can be observed for ChatGPT in the reception task, it can, indeed, be observed for the production task.
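Back-transforming the logTime estimates in Table 4 confirms the correspondence with the observed medians:

```r
# logTime estimates from Table 4 exponentiated back to seconds:
exp(3.920)          # ~50 s per item with the dictionary
exp(3.920 - 0.405)  # ~34 s per item with ChatGPT
exp(-0.405)         # ~0.67: ChatGPT times about two-thirds of dictionary times
```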
Figure 6. Predicted time per item (black dots) in the production task (in seconds), with 95% confidence intervals (black lines) and the distribution of observed by-participant means (violins).
3. Discussion, limitations, and suggestions for further research
ChatGPT is significantly more effective than the LDOCE dictionary (consulted on mobile phones) in both reception (Research Question 1) and production (Research Question 2). It is also markedly faster to use in production (Research Question 4), though not in reception (Research Question 3).
The present study set out to explore some practical advantages and potential drawbacks of using ChatGPT in an educational setting with advanced students of English. Our results suggest that ChatGPT-3.5 is able to provide immediate, context-relevant lexical assistance. We expect that this is thanks to its ability to engage learners through interactive and adaptive dialogue. An intelligent chatbot largely relieves the learner seeking lexical information of the error-prone and time-consuming task of locating relevant information in a ‘generic’ dictionary entry. Instead, they can get immediately relevant feedback without the accompanying noise. Thanks to this, students are able to resolve practical lexical problems in both reception and production quite efficiently. This may also lead to enhanced retention of lexical information, an aspect that calls for further study. In addition, chatbots with more advanced underlying LLMs than the freely offered ChatGPT-3.5 may be expected to deliver even better performance. Also, most of the data that GPT models have been trained on are in English, which means that the performance of ChatGPT on those parts of the tasks that involved Polish might improve in the near future, as more balanced multilingual language models become more generally available.
Results for time-on-task suggest that ChatGPT is a more efficient tool in terms of the time needed for participants to answer questions in the production task, but not in reception. This suggests that the instantaneity and interactive dialogue offered by ChatGPT enable learners to process and use lexical items more readily than with conventional dictionary searches, thereby streamlining the consultation process when producing written language. However, no analogous advantage was found for the reception task. Part of the reason may be that reception in general is faster (compare the time values in Figure 5 and Figure 6). The relatively low responsiveness of the free ChatGPT-3.5 version may also have slowed students down.
Like any study, the present study is not free from limitations. The two conditions (Dictionary and ChatGPT) differed in the devices used and, in consequence, in input methods and screen size. This was a difficult design decision: avoiding the confound would have been hard without jeopardising either ecological validity, i.e., statements about the real-life use of these resources, or the participants’ privacy. At the time of data collection, ChatGPT could only be accessed on the phone via an app, and we did not want to coerce participants into installing apps on their phones; instead, they were invited to use institution-provided desktop computers. By contrast, students usually consulted dictionaries on their phones, and LDOCE works very well indeed as a well-designed, responsive mobile web application, requiring no special apps, just a mobile browser, which all students obviously had. It would have been unnatural (in terms of ecological validity) to force students to use LDOCE on the computer, which they normally did not do. Having said that, our design decision meant that those using ChatGPT worked with bigger screens and physical keyboards.
The study measured reception and production success using partial translation from/to the students’ first language. However, the tasks that we used do not exhaust the possible range of tasks that students might need help with in real life. A greater variety of tasks might be tested in the future. Another design limitation was the use of self-reported measurements of time.
The present study is strictly quantitative in the sense that we did not undertake any qualitatively oriented individual observations for specific items or individual research processes. In this context, however, it must be emphasized that the statistical models control for inter-item variation in our data. We also captured this variation in Figures 2 and 4. Nonetheless, it might be interesting in future studies to examine individual test items that stand out and possibly also individual research processes that could explain such differences.
The study measured real-time performance rather than lexical retention or any long-term learning effect. This is not necessarily a problem if dictionaries and chatbots are seen as tools for assisting with tasks as they happen. Nevertheless, apart from the efficiency of the tools when used in real time, their potential for promoting learning is also of interest. A follow-up study that also tests retention would be a natural next step, and we are in fact currently working on such a study.
4. Conclusion
ChatGPT’s superior performance in both reception and production tasks underscores the importance of considering AI-driven resources as serious competitors to traditional dictionaries in language-learning contexts. However, we would still like to stress the need for a balanced approach, incorporating AI while maintaining essential strategies for independent language learning and critical thinking. We should be aware of limitations such as the risk of overreliance on AI for language input, reduced memorization effort, and the challenge of ensuring accurate and pedagogically appropriate responses from the chatbot. Our study advocates further exploration into the optimized integration of AI-based tools into the language-learning context, ensuring they complement rather than replace learner autonomy and traditional pedagogical resources such as dictionaries.
Data availability
Supplementary Data are available at https://osf.io/wqc6t; these include raw data, analysis scripts, and research instruments.