Beyond the surface: stylometric analysis of GPT-4o’s capacity for literary style imitation

Table 1.

Key topic words associated with common themes in Shelley’s “Frankenstein” and Hemingway’s “The Old Man and the Sea.”

Common themes	Shelley	Hemingway
Nature’s Dual Role as Friend and Foe	sea, ocean, snow, whale, wild	shark, fish, wind, turtle, breeze
Isolation	unvisited, fearful, secret, silent, far	shack, lost, fear, gone, deep
The Price of Ambition	glory, prize, success, enterprise, perseverance	yankees, league, dimaggio, team, big
Man Versus Nature	voyage, discovery, pole, dangerous, kill	harpoon, caught, boat, skiff, fishermen
The Search for Identity	mind, know, brother, sister, understand	boy, man, remember, i’m, thought
Love and Loss	love, loved, dear, friend, heart	wife, sad, remembered, home, share
The Weight of Guilt	evil, brutality, woeful, soul, dreadfully	bad, truly, scars, took, black
Mortality	life, fate, days, winter, men	old, winter, eighty, time, passed
Knowledge as Power and Peril	study, read, explore, modern, projects	think, newspaper, faith, means, league

These nine themes represent the common thematic ground that both Hemingway and Shelley engage with, enabling the creation of prompts and generation tasks that do not favor one author’s narrative concerns over the other’s.

3.4 Prompt design and text generation

Three prompts were developed to elicit stylistically imitative texts from the GPT-4o model and are described below:

Zero-shot: We asked GPT-4o to produce short novels on the nine topics outlined above in its native style, which provides a baseline of the model’s own stylistic signature. The generated texts were assigned the prefix “GPT” in their filenames. The prompt used is the following:
- Write a 2,000-word short novel on the topic “[Topic Name]”.
Zero-shot (imitation): In this prompt, we explicitly asked the model to imitate each author’s style. Here the imitation relies on the model’s general knowledge of these authors and of the stylistic characteristics of their works. The generated texts were assigned the prefixes “GPT_Hemingway” and “GPT_Shelley,” respectively, in their filenames. The prompts used are the following:
- Write a 2,000-word short novel on the topic “[Topic Name],” imitating the literary style of Ernest Hemingway.
- Write a 2,000-word short novel on the topic “[Topic Name],” imitating the literary style of Mary Shelley.
In-context learning: For this prompt, we used a 15,000-word excerpt from each of Hemingway’s “The Old Man and the Sea” and Shelley’s “Frankenstein.” Each excerpt was taken from the middle of its source book and served as a reference for the original author’s style. The generated texts were assigned the prefixes “EnhGPT_Hemingway” and “EnhGPT_Shelley,” respectively, in their filenames. The prompt used is the following:
You are tasked with reading a given text, imitating its style, and then writing a 2,000-word short novel using the learned style with the topic “[Topic Name].” Follow these instructions carefully:
- First, carefully read the following text: <text_to_imitate> </text_to_imitate>
- Analyze the text thoroughly, paying close attention to:
  - Sentence structure and length
  - Vocabulary choices
  - Tone and mood
  - Narrative voice (first person, third person, etc)
  - Use of literary devices (metaphors, similes, alliteration, etc)
  - Pacing and rhythm
  - Dialogue style (if present)
  - Descriptive techniques
- Now, write a 2,000-word short novel that imitates the style of the given text. Your novel should:
  - Have a clear beginning, middle, and end
  - Include well-developed characters
  - Have a coherent plot or central theme
  - Maintain the identified stylistic elements throughout

Remember, your goal is to create a new, original story that feels as if it could have been written by the author of the text you are imitating. Focus on capturing the essence of their style rather than copying specific plot elements or characters.
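
As an illustration of how the two zero-shot conditions expand over the nine themes, the sketch below fills the quoted prompt wording with each topic. The names `TOPICS`, `CONDITIONS`, and `build_prompts` are hypothetical and introduced only for this example; the in-context prompt is omitted because it additionally embeds the 15,000-word excerpt.

```python
# Sketch only: expands the zero-shot prompt wording quoted above over the nine themes.
# All identifiers here are illustrative assumptions, not the study's own code.
TOPICS = [
    "Nature’s Dual Role as Friend and Foe",
    "Isolation",
    "The Price of Ambition",
    "Man Versus Nature",
    "The Search for Identity",
    "Love and Loss",
    "The Weight of Guilt",
    "Mortality",
    "Knowledge as Power and Peril",
]

BASE = "Write a 2,000-word short novel on the topic “{topic}”."
IMITATION = (
    "Write a 2,000-word short novel on the topic “{topic},” "
    "imitating the literary style of {author}."
)

# Filename prefix -> author to imitate (None means the model's native style).
CONDITIONS = {
    "GPT": None,
    "GPT_Hemingway": "Ernest Hemingway",
    "GPT_Shelley": "Mary Shelley",
}


def build_prompts():
    """Return a dict mapping (prefix, topic) to the prompt text."""
    prompts = {}
    for prefix, author in CONDITIONS.items():
        for topic in TOPICS:
            if author is None:
                prompts[(prefix, topic)] = BASE.format(topic=topic)
            else:
                prompts[(prefix, topic)] = IMITATION.format(topic=topic, author=author)
    return prompts
```

Three conditions over nine topics yield 27 zero-shot prompts; the two in-context conditions would add 18 more, matching the 45 generated texts.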

The linguistic measurements reported in Table 2 represent corpus-level calculations for each author category rather than averages of individual texts. To provide additional insight into the variation within each category, Table 2 presents the mean and standard deviation of the following key linguistic measures, calculated at the individual text level:

Table 2.

Mean and standard deviation of linguistic measures for individual texts by author category.

Author	N	Tokens	Types	stTTR	Mean word length (in characters)	Mean sentence length (in words)
Overall mean	47	3,710.17 ± 393.83	756.91 ± 979.03	43.77 ± 3.42	4.53 ± 0.24	15.83 ± 3.29
EnhGPT.Hemingway	9	1,587.78 ± 393.83	537.78 ± 74.93	42.70 ± 0.48	4.29 ± 0.17	12.95 ± 2.29
EnhGPT.Shelley	9	1,993.78 ± 651.67	662.89 ± 147.07	38.57 ± 3.48	4.53 ± 0.17	19.75 ± 2.85
GPT.Hemingway	9	1,499 ± 219.27	541.56 ± 57.30	47.79 ± 0.95	4.49 ± 0.18	13.47 ± 1.44
GPT.Shelley	9	1,577.22 ± 163.46	599.56 ± 68.53	45.57 ± 0.53	4.71 ± 0.13	17.70 ± 1.94
GPT	9	1,399.22 ± 115.71	550.22 ± 50.76	44.01 ± 0.29	4.72 ± 0.17	15.01 ± 1.07
Hemingway_The Old Man and The Sea	1	26,663	2,539	34.62	3.84	14.15
Shelley_Frankenstein	1	75,143	7,008	44.33	4.42	41.51

Tokens: The total count of individual word occurrences in the text, including repeated words.
Types: The count of unique words in the text, with each distinct word counted once regardless of how often it occurs.
stTTR (standardized Type-Token Ratio): A measure of lexical diversity calculated as (Types ÷ Tokens) × 100, standardized across fixed-length text segments (1,000 words) to control for text length effects. Higher values indicate greater lexical diversity.
Mean Word Length: The average number of characters per word, which can indicate complexity of vocabulary or writing style.
Mean Sentence Length: The average number of words per sentence, which often correlates with syntactic complexity and writing style. Shorter sentences are typically associated with Hemingway’s style, while longer, more complex sentences align with Shelley’s gothic prose.
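
The stTTR definition above can be made concrete with a short sketch. It assumes a simple regex tokenizer and assumes that a trailing segment shorter than the window is discarded; the source does not specify either detail, and the function names are illustrative.

```python
import re


def tokenize(text):
    """Lowercase the text and return word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def standardized_ttr(tokens, segment_size=1000):
    """Mean of (types / tokens) x 100 over consecutive full segments.

    Assumption: an incomplete final segment is dropped, so every ratio is
    computed over exactly `segment_size` tokens.
    """
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size * 100)
    if not ratios:
        raise ValueError("text is shorter than one segment")
    return sum(ratios) / len(ratios)
```

Because every ratio is computed over a window of the same size, a 75,000-token novel and a 1,500-token generated story can be compared without the raw type-token ratio’s bias against long texts.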

The linguistic measurements in Table 2 are reported per author category and are derived from the average of the nine texts in that category. To show the variation within each category, the standard deviation is also included.

It is important to note that although we requested 2,000-word texts in our prompts to GPT-4o, the actual output varied. The nine texts in each generated category collectively range from 12,593 words (GPT) to 17,917 words (EnhGPT.Shelley), averaging approximately 1,399 to 1,991 words per text—slightly below the requested 2,000 words per text. This variation occurred because GPT-4o occasionally terminated text generation before reaching the exact word count specified in the prompt, particularly when reaching narrative closure points. We chose to preserve these natural termination points rather than forcing additional content, as this approach better maintained the narrative integrity of the generated texts.

Moreover, Table 2 reveals distinctive linguistic patterns across author categories. EnhGPT.Shelley texts show the highest token count (1,993.78 ± 651.67), approaching our 2,000-word target, while generic GPT outputs contain the fewest (1,399.22 ± 115.71). Stylistic patterns align with expectations: Hemingway imitations feature shorter sentences (12.95-13.47 words) and words (4.29-4.49 characters) compared to Shelley imitations (17.70-19.75 words, 4.53-4.71 characters). However, lexical diversity results are unexpected, with GPT.Hemingway showing the highest diversity (47.79 ± 0.95) and EnhGPT.Shelley the lowest (38.57 ± 3.48), contrary to the original authors’ patterns (Hemingway: 34.62, Shelley: 44.33). This suggests GPT-4o successfully captures sentence-level stylistic traits but struggles with vocabulary richness imitation.

4. Stylometric analysis

4.1 Profile-based authorship attribution

Our analysis aims to systematically examine whether LLMs (specifically GPT-4o) can accurately replicate an author’s stylometric profile. More specifically, we seek to determine whether it is possible to distinguish between a text written by GPT-4o imitating an author and one authored by the original writer. To achieve this, we treated each generated subcorpus as a distinct author, resulting in five artificial authors (GPT, GPT_Hemingway, GPT_Shelley, EnhGPT_Hemingway, EnhGPT_Shelley) alongside two human authors (Hemingway and Shelley).

In our study, we evaluate imitation success using authorship attribution results in the above-mentioned corpus. By assigning each imitation sample a virtual author label, we assess the model’s ability to mimic distinct stylistic features by comparing the classification accuracy of the imitation samples against those of the original human authors.

In our stylometric analysis, we employed the Cosine Delta distance measure using the 1,000 most frequent words as features. This approach offers substantial robustness against text length variations, as it measures the angular similarity between feature vectors rather than their magnitudes. The frequency vectors are normalized during calculation, which further mitigates potential biases from unequal text lengths (Evert et al. 2017). While dividing texts into equal-sized chunks can be valuable in certain stylometric contexts, our use of normalized frequency vectors with Cosine Delta distance provides reliable comparative analysis despite the modest variations in text length across our corpus. This methodological decision is further supported by previous research demonstrating the Cosine Delta distance’s effectiveness with frequency-based stylometric features in comparative analyses of texts with varying lengths (Eder, Rybicki, and Kestemont 2016; Evert et al. 2017).
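
A minimal sketch of the Cosine Delta computation follows, assuming the usual formulation: relative frequencies of the most frequent words are z-scored across documents, and the cosine distance is taken between the resulting vectors. The helper names are hypothetical, and the zero-vector guard is a choice made for this example rather than part of the published method.

```python
import math
from collections import Counter


def most_frequent_words(documents, n):
    """Return the n most frequent words over all documents (dict: name -> tokens)."""
    totals = Counter()
    for tokens in documents.values():
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word, normalizing for text length."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(freq_table):
    """Standardize each word's frequency across documents (sample std)."""
    names = list(freq_table)
    n_words = len(freq_table[names[0]])
    scored = {name: [] for name in names}
    for j in range(n_words):
        column = [freq_table[name][j] for name in names]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name in names:
            value = (freq_table[name][j] - mean) / std if std > 0 else 0.0
            scored[name].append(value)
    return scored


def cosine_distance(u, v):
    """1 - cosine similarity; depends on the angle between u and v, not their length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # assumption for this sketch: undefined angle treated as orthogonal
    return 1.0 - dot / norm


def cosine_delta(documents, n_mfw=1000):
    """Pairwise Cosine Delta between documents given as name -> token list."""
    vocabulary = most_frequent_words(documents, n_mfw)
    freqs = {name: relative_frequencies(toks, vocabulary) for name, toks in documents.items()}
    z = z_scores(freqs)
    names = list(documents)
    return {(a, b): cosine_distance(z[a], z[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

The resulting pairwise distances are what a hierarchical clustering or ordination step would take as input.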

For visualization, we employed hierarchical cluster analysis (HCA) and principal component analysis (PCA) based on matrix correlations. These methods allow us to represent the cosine distances visually: texts with similar styles are positioned close together under the same branch, and those with distinct styles lie farther apart. The results of both analyses can be seen in Fig. 1.

Figure 1

HCA and PCA graphs of the 1,000 most frequent words using the cosine distance.

HCA reveals that while GPT-4o can approximate an author’s stylistic profile, it does not fully replicate the nuanced stylometric signatures of human-authored texts. The original works by Hemingway and Shelley form a distinct cluster, indicating that their unique styles are preserved. Generated texts imitating Hemingway (GPT_Hemingway and EnhGPT_Hemingway) and Shelley (GPT_Shelley and EnhGPT_Shelley) show consistency within their respective groups but remain separate from the originals, suggesting partial replication of their styles. Generic GPT outputs (GPT) are the most distinct, lacking the stylistic features of both the original and imitated texts. Enhanced imitations (EnhGPT) show some improvement in stylistic alignment but do not bridge the gap to the original authors. Overall, these results demonstrate the LLM’s capability to imitate stylistic elements but highlight its limitations in fully capturing the complexity of human-authored texts.

The PCA analysis replicates the HCA findings. Hemingway’s The Old Man and the Sea and Shelley’s Frankenstein occupy distinct positions on the plot, highlighting their unique stylistic signatures. The imitations of Shelley (GPT_Shelley and EnhGPT_Shelley) cluster closer to Frankenstein, indicating some degree of alignment with Shelley’s style, though differences remain evident. Similarly, the imitations of Hemingway (GPT_Hemingway and EnhGPT_Hemingway) group together but are positioned away from The Old Man and the Sea, suggesting partial but incomplete stylistic replication. The generic GPT output (GPT) is located between the imitations, demonstrating its lack of alignment with either author’s stylistic features. The PCA plot further underscores the limitations of LLMs in fully capturing the detailed stylometric profiles of human-authored texts while showing some capability for stylistic approximation.

4.2 Instance-based authorship attribution

We repeated the authorship attribution experiment using an instance-based approach, training a Random Forest classifier on the same set of authors as in the previous profile-based analysis. To enlarge the dataset and improve the robustness of the analysis, the original texts were segmented into 500-word chunks, resulting in a total sample of 372 texts. Each task employed a distinct group of stylometric features to capture the broadest possible range of stylistic variation in the texts. The specific feature groups used in the analysis are as follows:

TCR features: We calculated seventy-four stylometric features organized in the following wider thematic areas:
- Word and Sentence Lengths and standard deviations (measured in characters, words, and syllables)
- Word Frequency distribution indices such as hapax legomena and dis legomena
- Quantitative aspects of text, such as entropy, perplexity, and the corresponding standard deviations
- Lexical diversity indices:
  - Text size neutralized TTR measures: log-transformed TTR, root-transformed TTR, Mass TTR, Mean Segmental TTR, Moving Average TTR
  - Measure of Textual Lexical Diversity (MTLD)
  - Hypergeometric Distribution D measure
  - Functional diversity (the ratio of content to functional words)
- Readability indices:
  - Flesch-Kincaid reading ease and grade level
  - SMOG
  - Automated Readability Index
  - Dale-Chall
  - Linsear Write formula
  - Gunning-Fog score
  - Coleman-Liau index
- Metrics related to coherence, text quality, syntactic complexity, and Part-of-Speech (PoS) tags frequencies
Author Multilevel N-gram Profiles (AMNP): Mikros and Perifanos (2013) introduced the AMNP approach, which has demonstrated significant effectiveness in authorship attribution and author profiling research (Mikros 2013a, 2013b; Mikros and Perifanos 2015; Mikros 2018). AMNP provides a comprehensive document representation framework that incorporates progressively larger n-gram units at both character and word levels.
This dual-dimension approach ensures coverage across multiple linguistic strata. The methodology draws its theoretical foundation from the Prague School of Linguistics and specifically the concept of “double articulation” (Nöth 1995: 238). According to this principle, language operates on two distinct structural layers: the first containing meaningful units (morphemes) that form grammatical patterns, and the second comprising minimal functional units (phonemes) that lack independent meaning. The integration of elements from both layers produces grammatically sound linguistic output.
Stylistic analysis can be approached through a similar framework. To effectively capture the multidimensional nature of stylistic features, researchers must identify and systematically integrate characteristics across diverse linguistic levels to accurately represent an author’s distinctive style.
For our current research, we constructed a feature vector of 900 elements by extracting the 300 most frequent items of each of three types: character bigrams (n = 2), character trigrams (n = 3), and word unigrams. This combined vector constitutes the Author’s Multilevel N-gram Profile (AMNP), a document representation that captures character and word sequence patterns simultaneously. To eliminate potential text-length biases in our subsequent analyses, we normalized the frequency calculations across all features.
Word Embeddings (WE): Word embeddings are a widely used representation of a text’s vocabulary. Words or phrases from the lexicon are mapped to vectors of real numbers that can subsequently be used in language modeling and feature-learning approaches. These vectors place text in an n-dimensional space in which words with similar meanings receive similar representations, so that two comparable words lie close together. The objective of an embedding space is therefore to encode some relationship between words, whether based on meaning, morphology, context, or another type of link. Our study used OpenAI’s pre-trained text-embedding-3-small model to generate word embeddings for each text. By averaging these embeddings, we represented each document with a single vector, the mean of its word embeddings.
Linguistic Inquiry and Word Count (LIWC) features: LIWC is a text analysis tool that quantifies linguistic and psychological attributes by categorizing words into predefined groups. It assesses dimensions such as emotional tone, cognitive processes, social concerns, and perceptual processes. In this study, we utilized LIWC-22 (Boyd et al. 2022) to extract 115 features related to affective processes (e.g., positive and negative emotions), cognitive processes (e.g., insight, causation), social processes (e.g., family, friends), and perceptual processes (e.g., see, hear, feel). These features provide insights into the psychological and emotional aspects of the texts, complementing the other stylometric analyses.
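
The AMNP construction described above (the 300 most frequent character bigrams, character trigrams, and word unigrams, with length-normalized frequencies) can be sketched as follows. This is an illustrative reading of the description, not the authors’ implementation: lowercasing, whitespace tokenization, and normalizing each count by the number of n-grams of its own type are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_amnp_vocabulary(texts, k=300):
    """Pick the k most frequent bigrams, trigrams, and words over the whole corpus."""
    bigrams, trigrams, words = Counter(), Counter(), Counter()
    for text in texts:
        lowered = text.lower()
        bigrams.update(char_ngrams(lowered, 2))
        trigrams.update(char_ngrams(lowered, 3))
        words.update(lowered.split())
    return tuple([item for item, _ in counter.most_common(k)]
                 for counter in (bigrams, trigrams, words))


def amnp_vector(text, vocabulary):
    """Concatenate normalized frequencies for the three feature levels (3 * k values)."""
    lowered = text.lower()
    levels = (char_ngrams(lowered, 2), char_ngrams(lowered, 3), lowered.split())
    vector = []
    for items, vocab in zip(levels, vocabulary):
        counts = Counter(items)
        total = len(items) or 1  # avoid division by zero on empty input
        vector.extend(counts[v] / total for v in vocab)
    return vector
```

With `k=300` each document maps to a 900-element vector, so every text in the corpus shares the same feature space regardless of its length.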

For our random forest classification evaluation, we employed a rigorous fivefold cross-validation approach rather than relying solely on training performance metrics. The corpus was randomly partitioned into five equal subsets, with four subsets (80 per cent of data) used for training and one subset (20 per cent) reserved for testing in each fold. This process was repeated five times, with each subset serving once as the test data. The performance metrics reported in Table 3 represent the averages across all five folds, thus providing a more reliable estimate of the model’s generalization capabilities. This cross-validation strategy helps mitigate concerns about overfitting and ensures that our evaluation reflects the model’s ability to accurately classify unseen texts. All classifications were performed independently for each feature group (TCR, AMNP, WE, and LIWC) to assess their individual discriminative power. The results of the authorship attribution tasks can be found in Table 3.
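
The fold logic described above can be sketched with the standard library alone. The `fit` and `predict` callables stand in for any classifier and are hypothetical placeholders; the random seed and the use of accuracy as the per-fold score are likewise assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, labels, fit, predict, k=5, seed=0):
    """Average test accuracy over k folds for user-supplied fit/predict functions."""
    scores = []
    for train, test in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        predictions = predict(model, [samples[i] for i in test])
        correct = sum(p == labels[i] for p, i in zip(predictions, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

For 372 samples the folds cannot be exactly equal, so this sketch produces two folds of 75 and three of 74, each used once as the held-out 20 per cent.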

Table 3.

Classification metrics in authorship attribution using RF.

	Accuracy	Recall	Precision	F1
TCR	0.7984	0.7984	0.8037	0.7896
AMNP	0.8329	0.8329	0.8447	0.8213
WE	0.8364	0.8364	0.8637	0.8201
LIWC	0.8403	0.8403	0.8474	0.8282
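
The Accuracy and Recall columns of Table 3 coincide in every row. That pattern is what one would expect if the multi-class scores are support-weighted averages of the per-class values, although the source does not state the averaging scheme, so this is an assumption. The sketch below computes the weighted metrics on toy labels to show the identity; `weighted_metrics` is a hypothetical helper.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 over all classes."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Weighting each class’s recall by its share of the true labels sums the correctly classified items over all classes and divides by the total, which is the definition of accuracy.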

To thoroughly examine the classification results for each category, we generated a detailed classification report for each class (author) across the different feature groups. The results are presented in Fig. 2.

Figure 2

Detailed classification report for each class in each feature group. Class labels: EnhGPTHemingway: 0, EnhGPTShelley: 1, GPT: 2, GPTHemingway: 3, GPTShelley: 4, Hemingway: 5, Shelley: 6.

From the detailed analysis of the classification metrics for each author, we can explore whether these imitations successfully capture the unique stylistic characteristics of the original authors. More specifically, the original works by Shelley and Hemingway are consistently well-identified across all feature groups, with high precision and perfect or near-perfect recall. However, GPT-generated imitations, especially those using in-context learning, show varied performance. In-context learning imitations of Hemingway (EnhGPTHemingway) are particularly difficult to classify, with low recall in AMNP and WE features, indicating significant overlap with both original texts and other imitative categories. Shelley’s in-context learning imitations (EnhGPTShelley) perform slightly better but still demonstrate inconsistencies in recall and precision, suggesting only partial replication of her stylistic features.

Generic GPT outputs perform moderately, particularly in TCR and WE, where they show noticeable overlap with other categories, indicating that they lack clear stylistic alignment with either original or imitative profiles. LIWC features capture some psychological and emotional details but are less effective for distinguishing complex stylistic patterns in imitative texts.

Overall, the analysis reveals that GPT-generated texts approximate the original authors’ styles to some degree but cannot fully replicate their unique stylometric profiles. This highlights the current limitations of GPT in accurately imitating the depth and complexity of human-authored stylistic patterns, particularly for enhanced imitations. While GPT can mimic surface-level stylistic elements, it struggles to capture the full richness of an author’s stylometric signature, making it distinguishable from the originals.

4.3 t-SNE analysis

As a final step, we visualized the author discrimination results by applying dimensionality reduction using the t-SNE (t-distributed Stochastic Neighbor Embedding) method for each feature group analyzed. t-SNE is a non-linear dimensionality reduction technique particularly effective for visualizing high-dimensional data by preserving the local structure of the data points in a lower-dimensional space, making it well-suited for identifying patterns and clusters within the stylometric features of the texts (Van der Maaten and Hinton 2008). The results of the analysis can be found in Fig. 3.

Figure 3

t-SNE plots for each feature group.

The results reveal that GPT-generated imitations capture some stylistic elements of the original texts, but their fidelity varies significantly depending on the feature group analyzed. AMNP features were the most effective at differentiating between original and imitation texts. Hemingway’s and Shelley’s original texts formed distinct and well-defined clusters, indicating that their multilevel stylistic traits, including n-grams and word patterns, were difficult for GPT-generated texts to replicate fully. Enhanced in-context learning models (EnhGPTHemingway and EnhGPTShelley) showed moderate success in aligning with the original styles but still exhibited noticeable overlap with other categories, particularly for Hemingway.

The LIWC features demonstrated that GPT imitations could capture some emotional and cognitive processes but struggled with more intricate stylistic nuances. Shelley’s original texts showed greater dispersion, and the enhanced GPT-generated texts only partially overlapped with her cluster. Hemingway’s texts, on the other hand, were more distinct but still showed some alignment with enhanced GPT outputs, suggesting that certain psychological aspects of his style are easier for GPT to emulate.

The TCR features highlighted the complexity and readability aspects of the texts. Hemingway’s texts were more distinctly clustered, reflecting his characteristic readability and sentence-level simplicity, which the GPT models approximated to a moderate degree. In contrast, Shelley’s stylistic complexity resulted in more dispersed clustering, with her texts overlapping with both generic GPT and in-context learning imitations, suggesting challenges in replicating her unique structural and lexical richness.

The WE features offered insights into the semantic alignment of the texts. Hemingway’s cluster was tightly grouped and minimally overlapped with GPT outputs, indicating his lexical and semantic style was challenging for GPT to replicate fully. Shelley’s texts exhibited slightly more dispersion, with enhanced imitative outputs clustering closer to her original works than generic GPT outputs. However, overlap across categories persisted, particularly with generic GPT outputs, reflecting limitations in achieving high semantic fidelity.

Overall, the effectiveness of GPT-generated texts in producing stylometrically similar outputs to the originals was moderate but incomplete. Across all feature groups, the original texts by Hemingway and Shelley consistently formed distinct clusters, confirming the uniqueness and coherence of their stylometric profiles. In-context learning outputs demonstrated closer alignment to the originals compared to generic GPT outputs, highlighting the value of contextual guidance in improving stylistic fidelity. However, significant dispersion and overlap across categories underscore the ongoing limitations of GPT models in fully replicating the intricacies of human-authored styles.

GPT-generated imitations better approximated Hemingway’s style, characterized by simplicity and structural consistency. Shelley’s more intricate and complex style posed more significant challenges, as evidenced by more dispersed clustering and higher overlap with GPT-generated texts. This suggests that the effectiveness of GPT-generated imitations varies depending on the stylistic complexity and distinctiveness of the original author.

The findings demonstrate that while GPT-generated texts can approximate surface-level stylistic features, they struggle to replicate the depth, detail, and coherence of human-authored writing. In-context learning improves stylistic alignment but cannot fully emulate the unique stylometric profiles of authors like Shelley and Hemingway. These limitations are particularly evident in the more complex and semantically rich features, such as AMNP and WE, which require capturing subtler patterns of linguistic expression.

5. Conclusions

This study examined the extent to which GPT-4o can produce texts that are stylometrically similar to the original works of Ernest Hemingway and Mary Shelley. Using three prompting strategies (zero-shot generation, zero-shot imitation, and in-context learning), we generated stylistic imitations and analyzed them through multiple stylometric methodologies, including profile-based and instance-based authorship attribution and t-SNE visualization across four feature groups. The findings reveal that while GPT-4o demonstrates the ability to approximate surface-level stylistic features, it falls short of fully replicating the unique stylometric profiles of the human authors. Across all analyses, original texts by Hemingway and Shelley consistently formed distinct and well-defined clusters, emphasizing the coherence and distinctiveness of their stylistic signatures. In contrast, GPT-generated imitations, particularly those using in-context learning, showed moderate improvement in stylistic alignment but still exhibited significant overlap with generic GPT outputs, underscoring their limitations.

Hemingway’s concise and structurally consistent style proved somewhat easier for GPT to emulate, as reflected in the tighter clustering and greater alignment of imitative outputs in the TCR and WE analyses. Shelley’s complex, richly descriptive style posed greater challenges: imitations of her style were more dispersed and overlapped more with generic GPT outputs. These findings suggest that the effectiveness of GPT-generated imitations depends heavily on the stylistic complexity and distinctiveness of the original author.

The t-SNE analyses demonstrate that the fidelity of GPT’s stylistic imitation is closely tied to the discriminative power of the feature group. AMNP and WE features, which were the most effective at distinguishing between original texts and imitations, also captured the most complex and multilevel stylistic traits, making them the hardest for GPT-generated texts to replicate. In contrast, LIWC and TCR features, while offering complementary insights, were less effective in differentiating original and imitative texts, suggesting that the stylistic elements they capture are more easily mimicked by GPT.
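The notion of a feature group’s discriminative power can also be expressed numerically. The sketch below is a hypothetical illustration, not a method used in this study: it assumes each text has already been reduced to a numeric feature vector and computes a simple ratio of the distance between the two group centroids to the average within-group spread. The function names and the toy coordinates are assumptions.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(axis) for axis in zip(*points)]


def separation_ratio(originals, imitations):
    """Distance between group centroids divided by the mean distance to own centroid.

    Larger values indicate that the feature group separates the two sets more clearly.
    """
    c_orig, c_imit = centroid(originals), centroid(imitations)
    between = dist(c_orig, c_imit)
    within = mean(
        [dist(p, c_orig) for p in originals] + [dist(p, c_imit) for p in imitations]
    )
    return between / within if within else float("inf")


# Toy vectors (assumed): one well-separated feature group, one overlapping group.
separated = separation_ratio([(0, 0), (0, 2)], [(10, 0), (10, 2)])
overlapping = separation_ratio([(0, 0), (2, 0)], [(1, 0), (3, 0)])
```

Under this toy measure, a feature group that behaves like AMNP or WE in the analyses would yield a higher ratio than one whose original and imitative texts intermingle.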

Overall, GPT-4o demonstrates a promising but incomplete ability to imitate human literary styles. Enhanced prompting strategies, such as in-context learning, show potential for improving stylistic fidelity but remain insufficient for fully capturing the depth, coherence, and innovation of human-authored writing. These limitations highlight the ongoing challenges in developing LLMs capable of high-fidelity stylistic replication.

This research underscores the need for further exploration into advanced modeling techniques, such as integrating richer contextual frameworks, improving feature representations, and leveraging more sophisticated learning paradigms. Additionally, the ethical implications of stylistic imitation, including concerns over intellectual property, authenticity, and trust, warrant careful consideration as LLMs continue to advance. Future studies should aim to refine both the technical and ethical dimensions of AI-driven stylistic imitation, bridging the gap between human creativity and machine-generated text.

This research has several limitations. First, the study focused on only two authors with highly distinct styles, limiting the generalizability of the findings to other authors or genres with more subtle stylistic differences. Second, the thematic control imposed on the text generation process may have influenced the stylistic outputs, potentially constraining the model’s natural stylistic adaptation. Third, the study employed a single LLM (GPT-4o) and specific analytical methods, which may not fully capture the potential of other LLMs or alternative approaches to stylistic imitation.

Future research should explore broader authorial datasets, incorporating a wider variety of styles and genres to evaluate the generalizability of these findings. Additionally, future studies could examine the effect of thematic constraints on stylistic outputs and explore the potential of advanced fine-tuning or retrieval-augmented techniques for improving imitation fidelity. Finally, conducting systematic user evaluations to assess human perceptions of GPT-generated stylistic imitations could provide valuable insights into the authenticity and acceptability of these texts.

In sum, while GPT-4o can approximate certain stylistic traits, this study highlights how difficult it remains to replicate the full depth and richness of human writing, and it points to directions for future research on stylistic imitation.

Acknowledgments

Open Access funding provided by the Qatar National Library.

Author contributions

George Mikros (Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing—original draft, Writing—review and editing).

Data availability

The data underlying this article will be shared on reasonable request to the corresponding author.
