Knowledge-guided generative artificial intelligence for automated taxonomy learning from drug labels

Table 1.

24 high-level categories of the drug indication taxonomy (DISTILL) and related summary statistics.

Index	Category	Number of indication terms	Number of drugs^a
0	Cardiovascular diseases	234	189
1	Respiratory diseases	209	219
2	Digestive system diseases	311	267
3	Nervous system diseases	368	303
4	Musculoskeletal diseases	135	137
5	Endocrine system diseases	269	175
6	Immune system diseases	353	296
7	Infectious diseases	521	312
8	Mental disorders	85	101
9	Neoplasms (cancer)	532	192
10	Skin diseases	265	257
11	Eye diseases	101	80
12	Ear, nose, and throat diseases	107	150
13	Genitourinary system diseases	314	260
14	Blood diseases	311	225
15	Congenital, hereditary, and neonatal diseases	267	142
16	Nutritional and metabolic diseases	206	156
17	Pregnancy complications	154	221
18	Substance-related disorders	42	36
19	Injuries, wounds, and traumas	82	101
20	Poisoning, toxicity, and environmental exposure	56	26
21	Rare diseases	880	350
22	Aging-related diseases	228	355
23	Others	250	252

^a Statistics on drugs are based on the 1177 available distinct RxNorm terms that are linked to the indications in the high-level category of interest.

The topological characteristics of the DISTILL sub-taxonomies for 3 high-level categories (ie, “Cardiovascular diseases,” “Endocrine system diseases,” and “Genitourinary system diseases”) are presented in Table 2. The depth of a sub-taxonomy refers to the number of levels from the root node to the deepest node. All 3 categories have a depth of greater than or equal to 10. The width of a sub-taxonomy refers to the number of child nodes of a node in the structure. All 3 categories had a median width of 3 and the maximum widths were all greater than 10. Our primary goal was to create a taxonomy specifically for drug indication, not a comprehensive condition taxonomy. Therefore, nodes lacking associated indication terms were removed from the taxonomy. Out of the 3 categories, “Genitourinary system diseases” has the most complex hierarchical structure with the largest number of nodes (n = 516) and leaf nodes (n = 349), followed by “Endocrine system diseases.” The median number of indication terms per node was consistently 1 across the 3 categories. Figure 3 displays the sub-taxonomy for “Cardiovascular diseases.” The entire sub-taxonomies for the 3 categories are in Table S1.
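The depth and width statistics described above can be illustrated with a minimal sketch. It assumes the taxonomy is stored as a parent-to-children mapping, that depth counts edges on the longest root-to-leaf path, and that width is computed only for nodes with at least one child. The concept fragment, the constant `CHILDREN`, and all function names are hypothetical and are not taken from the DISTILL implementation.

```python
from functools import lru_cache
from statistics import median

# Hypothetical toy fragment: parent concept -> child concepts.
# "hypertensive heart disease" has two parents, mirroring the polyhierarchy
# example given in the Figure 3 caption; the other links are illustrative.
CHILDREN = {
    "cardiovascular diseases": ["heart diseases", "hypertensive diseases"],
    "heart diseases": ["heart failure", "ischemic heart diseases"],
    "heart failure": ["non-ischemic heart failure", "acute heart failure"],
    "hypertensive diseases": ["hypertensive heart disease", "essential hypertension"],
    "non-ischemic heart failure": ["hypertensive heart disease"],
}


def all_nodes(children):
    """Return the set of unique nodes (parents and children)."""
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)
    return nodes


def leaf_nodes(children):
    """Return the unique nodes that have no children."""
    return {n for n in all_nodes(children) if not children.get(n)}


def depth(root, children):
    """Longest root-to-leaf path length in edges (assumes an acyclic taxonomy)."""
    @lru_cache(maxsize=None)
    def longest(node):
        kids = children.get(node, [])
        return 0 if not kids else 1 + max(longest(k) for k in kids)
    return longest(root)


def widths(children):
    """Number of children for every node that has at least one child."""
    return [len(kids) for kids in children.values() if kids]


w = widths(CHILDREN)
print("nodes:", len(all_nodes(CHILDREN)), "leaves:", len(leaf_nodes(CHILDREN)))
print("depth:", depth("cardiovascular diseases", CHILDREN))
print("width median/min/max:", median(w), min(w), max(w))
```

Because a node can be reached through several parents, nodes and leaves are counted as sets, which matches the "unique nodes" wording in Table 2.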

Figure 3.

DISTILL sub-taxonomy for the high-level category: Cardiovascular diseases. Identifiers beginning with the letter “C” represent the concept IDs. Names of the first 40 concepts are displayed in the upper right corner of the figure. Numerical values prefixed with the symbol “#” denote the number of indication terms that are linked to the concept nodes. A concept may have multiple parent concepts in the taxonomy. For example, C144 hypertensive heart disease has 2 parent concepts C7 hypertensive diseases and C77 non-ischemic heart failure.

Table 2.

Topological characteristics of the structure of the drug indication taxonomy (DISTILL).

Index			0	5	13
High-level category name			Cardiovascular diseases	Endocrine system diseases	Genitourinary system diseases
Depth			11	14	10
Width	Median [Q1, Q3]		3 [2, 5]	3 [2, 5]	3 [2, 5]
	Min		2	2	2
	Max		12	14	16
Statistics of all unique nodes	Count		242	339	516
	Indication terms per node	Median [Q1, Q3]	1 [1, 3]	1 [1, 2]	1 [1, 2]
		Min	0	0	0
		Max	23	78	36
	Number of nodes with 1+ RxNorm drugs		199	218	411
	RxNorm drug per node	Median [Q1, Q3]	2 [1, 7]	2 [1, 4]	3 [1, 7]
		Min	1	1	1
		Max	62	94	50
Statistics of unique leaf nodes	Count		170	235	349
	Indication terms per node	Median [Q1, Q3]	1 [1, 3]	1 [1, 2]	1 [1, 2]
		Min	1	1	1
		Max	17	78	33
	Number of nodes with 1+ RxNorm drugs		154	169	325
	RxNorm drug per node	Median [Q1, Q3]	2 [1, 6.75]	2 [1, 4]	2 [1, 5]
		Min	1	1	1
		Max	62	94	50

Comparison between DISTILL and SNOMED-CT

Cardiovascular diseases

In SNOMED-CT, “Disorder of cardiovascular system” has 7585 SNOMED-CT standard descendant concepts, 5917 of which had at least 1 record observed (direct or mapped via source vocabulary) in a network of 15 real-world databases. This includes 53 direct children, among which the most prevalently occurring conditions in real-world data are (with number of descendants and highlighted concepts in descendants): heart disease (n = 2292, heart failure, atrial fibrillation, myocardial infarction), hypertensive disorder (n = 133, essential hypertension, pregnancy-induced hypertension), vascular disorder (n = 3404, including atherosclerosis, migraine, embolism), acute disease of cardiovascular system (n = 347, acute myocardial infarction, paroxysmal atrial fibrillation, transient cerebral ischemia), chronic disease of cardiovascular system (n = 248, chronic heart failure, rheumatic heart disease, chronic atrial fibrillation), cardiovascular injury (n = 791, myocardial infarction, angina pectoris, cerebrovascular accident), peripheral vascular disease (n = 211, peripheral arterial disease, arteriosclerosis of artery of extremity), cerebrovascular disease (n = 556, transient cerebral ischemia, carotid artery obstruction, thrombotic stroke), thrombosis (n = 498, deep venous thrombosis), and low blood pressure (n = 28, orthostatic hypotension). The sub-taxonomy of “Disorder of cardiovascular system” is large, with some concepts having 9 levels of separation, but most concepts (3753/5917) are within 4 levels.

In comparison, the DISTILL sub-taxonomy of “Cardiovascular diseases” is much smaller, having 242 descendant concepts, including 11 direct children (Figure 2): heart diseases (n = 139, endocarditis, heart failure, atrial fibrillation, coronary artery disease, myocardial infarction), hypertensive diseases (n = 42, essential hypertension, pulmonary arterial hypertension, hypertensive heart renal disease), vascular diseases (n = 79, peripheral venous disease, cerebrovascular disease, stroke, myocardial infarction, hypertension, deep vein thrombosis, pulmonary embolism), cardiac dysrhythmia (n = 21, atrial fibrillation), rheumatic heart diseases (n = 3), cerebrovascular diseases (n = 7, stroke), ischemic heart diseases (n = 21, angina, myocardial infarction), diseases of pulmonary circulation (n = 17, pulmonary embolism, pulmonary artery hypertension), heart disease in pregnancy (n = 7), cardiac arrest (n = 3), and ill-defined descriptions and complications of heart disease (n = 70, including heart failure, atrial fibrillation, chest pain, myocardial infarction). While SNOMED-CT is more granular, DISTILL has greater maximum depth than SNOMED-CT, with 1 concept “systemic disease-associated cardiomyopathy” having 10 levels of separation; most concepts (155/242) are within 3 levels.

There is high alignment between the categories in the 2 classifications, with strong overlap in clinical concepts in heart disease, hypertensive disorders, vascular disorder, and cerebrovascular disease. SNOMED-CT splits out peripheral vascular disease as a separate child, while DISTILL organizes the related peripheral venous diseases under vascular diseases. SNOMED-CT also creates an orthogonal classification of acute and chronic conditions, many of which belong to other ancestors within the “Disorder of cardiovascular system” hierarchy. One noteworthy clinical example of classification inconsistency is “ischemic stroke”: SNOMED-CT classifies “cerebral infarction” as a “Disorder of nervous system,” while DISTILL mapped the term “cerebral infarction” to the concept “stroke,” which was classified under both “vascular diseases” and “cerebrovascular diseases” within the “Cardiovascular diseases” sub-taxonomy.

Endocrine system diseases

In SNOMED-CT, “Disorder of endocrine system” subsumes “Diabetes mellitus,” “Disorder of thyroid gland,” “Disorder of ovary,” “Disorder of endocrine gland,” “Neoplasm of endocrine system,” “Disease of pancreas,” “Disorder of pituitary gland,” and “Disorder of adrenal gland” amongst its 47 direct children, collectively covering 2121 concepts. While DISTILL has comparable coverage of diabetes, thyroid/pituitary/adrenal gland-related conditions, and endocrine cancers under its sub-taxonomy of “Endocrine system diseases,” it does not cover hypoglycemia or all other pancreatic or ovary-related disorders (such as acute pancreatitis or ovarian cyst) amongst its 339 concepts. It does, however, contain “hyperlipidemia” and “hypercholesterolemia” under both “lipid disorders” and “metabolic disorders,” which SNOMED-CT classifies under “Disorder of lipoprotein and/or lipid metabolism,” outside of “Disorder of endocrine system.”

Genitourinary system diseases

SNOMED-CT’s “Disorder of the genitourinary system” concept subsumes 5671 concepts, and has 20 direct children, including “Disorder of urinary system,” “Disorder of reproductive system,” “Chronic disease of genitourinary system,” “Infectious disease of genitourinary system,” “Malignant neoplasm of genitourinary system,” “Inflammatory disorder of genitourinary system,” and “Acute genitourinary disorder.” In contrast, DISTILL’s organization beneath its “Genitourinary system diseases” concept was more anatomically aligned, with direct children including “Urinary tract infections,” “Kidney diseases,” “Bladder diseases,” “Prostate diseases,” “Testicular diseases,” “Penile diseases,” “Urethral diseases,” and “Gynecological diseases,” amongst its 516 concepts. Both contain highly prevalent clinical findings, including “End-stage renal disease,” “urinary tract infectious disease”/“urinary tract infections,” “Acute renal failure syndrome”/“acute kidney injury,” “Primary malignant neoplasm of prostate”/“Prostate cancer,” “Chronic kidney disease,” “Kidney stone,” “testicular hypofunction,” “benign prostatic hyperplasia,” “acute cystitis,” and “dysmenorrhea.” Clinical findings in SNOMED-CT not covered in DISTILL included “hematuria syndrome,” “cyst of ovary,” and “Chronic kidney disease due to type 2 diabetes mellitus.”

Performance of DISTILL

Based on Table 3, for concept-to-concept subsumption relations, the accuracies for each of the 3 high-level categories, as assessed by any of the evaluators, are higher than 0.7. The inter-rater reliability scores exceed 0.7, indicating “good to very good” reliability.²² However, for concept-to-term subsumption relations, the accuracies range from 0.329 to 0.819. The inter-rater reliability scores are lower than 0.5, indicating “fair to moderate” reliability.²²
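As a worked illustration of the two quantities reported here, the sketch below computes per-rater accuracy and Gwet's AC1 for several raters. It uses the standard multi-rater AC1 definition (observed pairwise agreement corrected by a chance term based on average category prevalence). The function names and the toy ratings are made up for illustration, each label means "this rater judged the relation correct (1) or incorrect (0)", and the confidence interval is not computed.

```python
def rater_accuracy(ratings, rater):
    """Share of test cases that one rater judged correct (label 1)."""
    return sum(item[rater] for item in ratings) / len(ratings)


def gwet_ac1(ratings, categories=(0, 1)):
    """Gwet's AC1 for a list of items, each a list with one label per rater.

    Assumes every item was rated by at least 2 raters.
    """
    n = len(ratings)
    q = len(categories)
    observed = 0.0
    prevalence = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        counts = {c: item.count(c) for c in categories}
        # Fraction of rater pairs that agree on this item.
        observed += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r
    observed /= n
    prevalence = {c: v / n for c, v in prevalence.items()}
    # Chance agreement under Gwet's model.
    chance = sum(p * (1 - p) for p in prevalence.values()) / (q - 1)
    return (observed - chance) / (1 - chance)


# Hypothetical ratings: 4 test cases, 3 raters each.
toy = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]]
print("rater 3 accuracy:", rater_accuracy(toy, 2))
print("AC1:", round(gwet_ac1(toy), 3))
```

Unlike kappa-type coefficients, the chance term here stays small when one label dominates, which is one reason AC1 is often preferred when most test cases are judged correct.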

Table 3.

Evaluation results of drug indication concept-to-concept and concept-to-term subsumption relations for 3 high-level categories from 3 evaluators and their inter-rater reliability.

Relationship type	High-level category	Number of test cases	Accuracy (Rater 1)	Accuracy (Rater 2)	Accuracy (Rater 3)	Gwet’s AC1 coefficient (95% CI)
Concept-to-concept	Cardiovascular diseases	49	0.857	1	0.898	0.84 (0.733, 0.947)
	Endocrine system diseases	73	0.863	0.986	0.712	0.72 (0.596, 0.844)
	Genitourinary system diseases	75	0.84	0.907	0.787	0.857 (0.784, 0.93)
Concept-to-term	Cardiovascular diseases	64	0.562	0.703	0.672	0.462 (0.285, 0.64)
	Endocrine system diseases	85	0.6	0.8	0.329	0.234 (0.086, 0.381)
	Genitourinary system diseases	105	0.81	0.819	0.467	0.43 (0.293, 0.566)

We further computed the self-checking consistency score for the prompt used to check the concept-to-term subsumption relation, as well as its question-level consistency score using semantically equivalent questions.²³ For the self-checking consistency score, the temperature was set to 1,²⁴ and 10 stochastic samples were generated. For the question-level consistency score, the temperature was set to 0 and the following 3 semantically equivalent questions were used: (1) “Is “term” subsumed by “concept”? Return yes or no only”; (2) “Does “term” fall under “concept”? Return yes or no only”; (3) “Is “term” a type of “concept”? Return yes or no only”. According to Table 4, concept-term pairs for which all evaluators concurred that GPT-4 accurately identified the subsumption relation had greater consistencies. Those on which evaluators did not reach a consensus on GPT-4’s judgment had relatively lower consistencies. Those for which all evaluators concurred that GPT-4’s judgment was inaccurate had the lowest consistencies in most cases.
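The following sketch shows one way such scores could be computed. It rests on a loudly stated assumption that is not confirmed by the text: each score is taken to be the fraction of additional answers that disagree with a reference answer, so that 0 means fully consistent. The three question templates are the ones quoted above; the function names and the example answers are hypothetical, and no model is called.

```python
# The three semantically equivalent questions quoted in the text.
QUESTION_TEMPLATES = [
    'Is "{term}" subsumed by "{concept}"? Return yes or no only',
    'Does "{term}" fall under "{concept}"? Return yes or no only',
    'Is "{term}" a type of "{concept}"? Return yes or no only',
]


def build_questions(term, concept):
    """Fill the templates for one concept-term pair."""
    return [t.format(term=term, concept=concept) for t in QUESTION_TEMPLATES]


def inconsistency(reference, answers):
    """ASSUMED score: share of answers that differ from the reference answer."""
    ref = reference.strip().lower()
    return sum(a.strip().lower() != ref for a in answers) / len(answers)


# Hypothetical outputs for one pair: a reference answer, 10 samples drawn at
# temperature 1, and answers to the paraphrased questions at temperature 0.
reference = "yes"
samples = ["yes"] * 8 + ["no"] * 2
paraphrase_answers = ["yes", "no"]

print(build_questions("acute cystitis", "bladder diseases")[0])
print("self-checking score:", inconsistency(reference, samples))
print("question-level score:", inconsistency(reference, paraphrase_answers))
```

Under this assumed definition, a pair whose answer flips under resampling or rephrasing receives a higher score, which is the direction of the pattern reported in Table 4.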

Table 4.

Average self-checking consistency score and average question-level consistency score for the 3 high-level categories.

		Cardiovascular diseases	Endocrine system diseases	Genitourinary system diseases
All cases	N (%)	64 (100%)	85 (100%)	105 (100%)
	Avg. self-checking consistency score (SD)	0.095 (0.254)	0.167 (0.322)	0.111 (0.27)
	Avg. question-level consistency score (SD)	0.182 (0.244)	0.176 (0.255)	0.137 (0.215)
Fully agreed correct cases	N (%)	27 (42.2%)	22 (25.9%)	43 (41%)
	Avg. self-checking consistency score (SD)	0.011 (0.042)	0.077 (0.243)	0.035 (0.149)
	Avg. question-level consistency score (SD)	0.099 (0.155)	0.045 (0.117)	0.039 (0.108)
Discrepant cases	N (%)	28 (43.8%)	50 (58.8%)	52 (49.5%)
	Avg. self-checking consistency score (SD)	0.089 (0.218)	0.142 (0.294)	0.179 (0.326)
	Avg. question-level consistency score (SD)	0.19 (0.211)	0.2 (0.269)	0.192 (0.25)
Fully agreed incorrect cases	N (%)	9 (14.1%)	13 (15.3%)	10 (9.5%)
	Avg. self-checking consistency score (SD)	0.367 (0.485)	0.415 (0.428)	0.09 (0.285)
	Avg. question-level consistency score (SD)	0.407 (0.401)	0.308 (0.287)	0.267 (0.211)

Concept-term pairs within the test set were further categorized into fully agreed correct cases (ie, all 3 evaluators concurred that GPT-4 accurately identified their subsumption relation), discrepant cases (ie, evaluators did not reach a consensus on GPT-4’s judgment), and fully agreed incorrect cases (ie, all evaluators concurred that GPT-4 inaccurately identified their subsumption relation). The closer the score is to 0, the greater the consistency of GPT-4’s output.

Discussion

We developed an automatic pipeline that uses LLMs and RWE for drug indication taxonomy learning, going from free-text drug labels to a structured hierarchical classification of indication concepts. LLMs play several critical roles in the pipeline, each contributing to a specific subtask. They function as a text parser to extract indication terms and as a similarity checker to detect semantic equivalence between terms. They also serve as a content generator to create a more granular classification of indications and as a relation checker to determine subsumption relations between 2 clinical entities. RWE, on the other hand, acts as a quality controller that keeps the taxonomy at an appropriate granularity: not so shallow that it misses essential details and fails to distinguish distinct drug indication terms, and not so deep that it becomes overly complex and creates unnecessary subdivisions for semantically equivalent terms. Additionally, we defined rules to maintain consistency and acyclicity of the taxonomy.

We settled on 3 levels of hierarchy for each round of sub-category generation after iterative and extensive trial-and-error experimentation. We initially restricted generation to 2 levels but observed that GPT-4 tended to collapse the hierarchical structure of concepts. This was undesirable because it leads to a large flat list and goes against the ontology design principles listed in “Ontology development 101” by Noy and McGuinness.²⁵ Our later experiments with 4 levels resulted in GPT-4 overemphasizing the depth of the structure, which also fell short of the goal of generating as many accurate child concepts as possible. Therefore, after weighing the tradeoffs between going deeper and shallower in these experiments, we chose a depth of 3, which provided the most desirable balance for our framework.

Cimino²⁶ proposed widely adopted desiderata for controlled medical taxonomies. Among them, various levels of granularity are both preferred and necessary. The 3 high-level categories in the evaluation have depths comparable to their counterparts in SNOMED-CT. Before arriving at the proposed approach, we attempted to derive a complete taxonomy for a root concept using a single prompt; however, the output was unsatisfactory because it lacked the granularity needed to differentiate indication terms effectively. In this study, we instead implemented a crawling procedure, which enabled the construction of a fine-grained taxonomy.

A key desideratum is facilitating the expansion of content. The continuous emergence of novel products targeting newly identified diseases and subtypes underscores this necessity, and our framework inherently accommodates such expansions. When a new drug indication term is identified, our method systematically determines its most appropriate position within the taxonomy by checking its subsumption relation to the concepts from top to bottom and finding its closest ancestor concepts. Additional potential descendant concepts are then generated by LLMs, and their inclusion depends on their information gain as determined by RWE. Multiple previous studies have also investigated automatic text-mining-based methods for ontology extension, adding new sibling or child concepts to the existing ones in a biomedical ontology.²⁷,²⁸ In contrast to those methods, ours takes advantage of LLMs, including their language generation capability, adaptability, and scalability.
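The top-down placement step described above can be sketched as a traversal that keeps descending while some child still subsumes the new term. This is only an illustration under stated assumptions: the function name `closest_ancestors` is hypothetical, the `subsumes` callback stands in for the LLM relation check and is stubbed here with a fixed lookup, the root is assumed to subsume the term, and the small genitourinary fragment is a toy structure.

```python
def closest_ancestors(term, root, children, subsumes):
    """Return the deepest concepts that subsume `term`, searching top-down.

    `subsumes(concept, term)` is a stand-in for the relation check; the root
    is assumed to subsume the term.
    """
    result, seen, stack = [], set(), [root]
    while stack:
        concept = stack.pop()
        if concept in seen:  # a concept can be reached through several parents
            continue
        seen.add(concept)
        matching = [c for c in children.get(concept, []) if subsumes(c, term)]
        if matching:
            stack.extend(matching)  # keep descending
        else:
            result.append(concept)  # no child subsumes the term: closest ancestor
    return sorted(result)


# Hypothetical toy fragment (parent -> children).
CHILDREN = {
    "genitourinary system diseases": ["kidney diseases", "urinary tract infections"],
    "kidney diseases": ["kidney infections", "kidney stones"],
    "urinary tract infections": ["urinary tract infections with pyelonephritis", "cystitis"],
}

# Stub oracle: concepts assumed to subsume "acute pyelonephritis".
TRUE_ANCESTORS = {
    "genitourinary system diseases",
    "kidney diseases",
    "kidney infections",
    "urinary tract infections",
    "urinary tract infections with pyelonephritis",
}


def stub_subsumes(concept, term):
    return term == "acute pyelonephritis" and concept in TRUE_ANCESTORS


print(closest_ancestors("acute pyelonephritis", "genitourinary system diseases",
                        CHILDREN, stub_subsumes))
```

Returning a list rather than a single concept allows a new term to be attached under several parents, in line with the polyhierarchy of the taxonomy.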

Another desirable characteristic is polyhierarchy, where a concept can have multiple parent concepts. It facilitates a variety of strategies for navigating the taxonomy and integrating data.²⁹ Our framework supports this. For example, the concept “acute pyelonephritis” belongs to both “kidney infections” and “urinary tract infections with pyelonephritis” in a parent-child relationship. Our approach also meets the desideratum of avoiding “not elsewhere classified” concepts, whose definitions can only be determined by the rest of the concepts; miscellaneous concepts were excluded.

We have conducted evaluations for indication concept-to-concept and concept-to-term subsumption relations. The former achieved higher accuracies and “good to very good” inter-rater reliability, suggesting that LLMs are good at constructing their own hierarchies for drug indications. However, the latter achieved lower accuracies and “fair to moderate” inter-rater reliability, indicating the complexity of the concept-to-term subsumption relation determination.

After revisiting the wrong concept-term pairs, 2 primary types of errors were identified. The first type of error is the shared children issue. Both the indication concept and the indication term function as parents. They share children, but at least 1 of them has its own unique offspring. For example, the concept “arrhythmias in pregnancy” and the term “repetitive paroxysmal supraventricular tachycardia” could potentially share the child “repetitive paroxysmal supraventricular tachycardia in pregnancy,” but “arrhythmias in pregnancy” could have unique children such as “atrial fibrillation in pregnancy.” The prompt was designed to determine an “is-a” relationship between a term and a concept. However, lacking knowledge of any of their potential children led to wrong decisions. A possible solution is to implement a 2-step verification process. For a concept $C$ and a term $T$, we can first check whether a patient can have $C$ without $T$, and then check whether a patient can have $T$ without $C$. The idea behind this verification process is to identify the existence of unshared children unique to $T$ or $C$. If the output is YY (ie, yes-yes), $C$ and $T$ have both shared and unshared children. If the output is NN (ie, no-no), $C$ and $T$ are synonyms. If the output is YN, $C$ subsumes $T$; if it is NY, $T$ subsumes $C$. The determination of concept-to-concept relations was not subject to the same issue, owing to its purely top-down derivation process.
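The decision table of the proposed 2-step verification can be written down directly. In this minimal sketch, the two boolean inputs stand for the yes/no answers to “can a patient have $C$ without $T$?” and “can a patient have $T$ without $C$?”; the function name and the returned labels are hypothetical, and obtaining the answers from an LLM is outside the sketch.

```python
def classify_relation(c_without_t: bool, t_without_c: bool) -> str:
    """Map the two yes/no answers to a relation between concept C and term T."""
    if c_without_t and t_without_c:
        return "overlap"        # YY: shared and unshared children
    if not c_without_t and not t_without_c:
        return "synonyms"       # NN: C and T denote the same condition
    if c_without_t:
        return "C subsumes T"   # YN: T never occurs without C
    return "T subsumes C"       # NY: C never occurs without T


# Example from the text: "arrhythmias in pregnancy" (C) versus
# "repetitive paroxysmal supraventricular tachycardia" (T).
# Either can occur without the other, so the expected answers are yes-yes.
print(classify_relation(True, True))
```

Only the YN outcome supports attaching $T$ beneath $C$; the YY outcome flags the shared children issue, so the term would not be placed under the concept.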

The second type of error pertains to the inversion of concept-term subsumption relations. Specifically, the term falling under the concept unexpectedly subsumes that concept. For example, the concept “st-segment elevation myocardial infarction (stemi)” is subsumed by, but does not subsume, the term “acute myocardial infarction coronary occlusion.” A follow-up experiment was conducted where we directed queries to LLMs that inverted the concept-term subsumption relation, asking “Is “concept” a “term”? Return yes or no only”. The responses were affirmative in most of the cases with this type of error, suggesting that they were treated as synonyms.

Additionally, concept-term pairs on which evaluators reached no consensus, or unanimously judged GPT-4 to be incorrect, tend to exhibit low consistency in their prompt outputs for subsumption relation checking. A lower consistency suggests that GPT-4 may find this task challenging, leading to an increased chance of mistakes. LLMs still struggle to classify indication terms, expressed in the unregulated language of product labels, into a structured vocabulary. There remains substantial room for improvement in using LLMs for concept normalization.

Our study has several limitations. First, the RWE available for information gain computation is limited: 24% of indication terms had no linked RxNorm drug. This may introduce imprecision when the taxonomy is expanded based on information gain, which could affect its granularity. Second, we used only GPT-4 as the underlying model. It is important to acknowledge the diversity of general LLMs (eg, LLaMA,³⁰ PaLM 2³¹) and domain-specific LLMs (eg, PMC-LLaMA,³² Med-PaLM 2³³). No single LLM consistently outperforms all others across every scenario,³⁴ so comparative studies across LLMs may be valuable. Third, the prompts might not be fully optimized for each subtask. Sivarajkumar et al³⁵ investigated the optimal prompt types for different clinical NLP tasks, which provides guidance for refining prompts. Finally, regarding the generated high-level categories, many indication terms fell under the category of rare diseases because its definition is vague and diverse. We recommend providing a specific definition in the prompt when checking the concept-to-term subsumption relation for the concept “rare diseases.”

Our study suggests several directions for future research. First, we can further take advantage of the inherent non-determinism of LLMs. Beyond controlling the depth of the taxonomy, RWE can also guide the selection of the optimal stochastic output of subcategories for a concept of interest. Second, it would be worthwhile to explore to what extent a domain-specific LLM helps compared with a general LLM. One could first fine-tune an LLM on drug labels and existing biomedical ontologies such as SNOMED and Disease Ontology³⁶ and then use it for drug indication taxonomy construction. Third, future work could analyze drug indications in more detail, linking them not only to an active moiety but also to its dosage. Drug-indication relationships could be further categorized into prevention, diagnosis, treatment, and symptomatic alleviation. It would be interesting to see how our method can be adapted into a pipeline for constructing an ontology with various attributes and relationships. Finally, although we endeavored to develop as comprehensive a taxonomy as possible, we acknowledge that it may remain incomplete. Despite an iterative and recursive concept searching strategy and prompts specifically designed to encourage GPT-4 to identify all true child concepts of the concept of interest, achieving complete coverage of all child concepts is inherently challenging. As part of our future work, we plan to assess the completeness of the taxonomy and dedicate efforts toward its refinement. We also plan to enrich our taxonomy by incorporating extracted terms that are subsumed by a concept $C$ but not by any of its child concepts and that are not synonyms of $C$. These will be added as additional child concepts of $C$, thereby constructing a more comprehensive taxonomy.
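The enrichment rule in the last two sentences is a set difference, which the following sketch makes explicit. It assumes the subsumption and synonym decisions have already been made and are available as plain collections; the function name, argument names, and example terms are hypothetical and are not drawn from the taxonomy.

```python
from typing import Dict, Iterable, List, Set


def extra_child_candidates(
    subsumed_by_concept: Iterable[str],
    synonyms_of_concept: Iterable[str],
    subsumed_by_child: Dict[str, Set[str]],
) -> List[str]:
    """Terms under C that no child concept covers and that are not synonyms of C."""
    # Union of all terms already covered by some child concept of C.
    covered: Set[str] = set().union(*subsumed_by_child.values())
    remaining = set(subsumed_by_concept) - covered - set(synonyms_of_concept)
    return sorted(remaining)


# Illustrative inputs only.
terms_under_c = ["term a", "term b", "term c", "term d"]
synonyms = ["term d"]
children = {"child 1": {"term a"}, "child 2": set()}

# "term b" and "term c" would be proposed as additional child concepts of C.
print(extra_child_candidates(terms_under_c, synonyms, children))
```

Returning a sorted list keeps the output deterministic, which makes repeated runs over the same taxonomy easy to compare.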

Beyond the detailed pipeline we developed for drug indication taxonomy construction, it is important to recognize the broader implications of our approach. Our core principles are to break the task down into manageable subtasks and to incorporate RWE to steer the process. The methodologies and strategies we have implemented hold great potential for adaptation to various tasks involving information acquisition, normalization, and classification of medical knowledge.

Conclusions

Generative AI can be used to support various taxonomy development activities. However, it does not fully support an end-to-end process, such as directly mapping terms to concepts in existing vocabularies or directly generating a complete taxonomy for a concept. We proposed an automatic pipeline that integrates LLMs and RWE to generate a taxonomy optimized to distinguish between indications and to further organize the drugs. Further evaluation is needed to assess its support for downstream tasks, such as enabling large-scale phenotyping based on drug-indication relationships and the indication taxonomy. Overall, we provide a general framework for developing a taxonomy that can be applied beyond the context of drug indications.

Acknowledgments

The authors would like to thank Gowtham Rao and Dmitry Dymshyts for their clinical expertise and independent validation of the concept-to-concept and concept-to-term subsumption relations.

Author contributions

The authors contributed to the study as follows: Yilu Fang: Idea initialization, conceptualization, methodology, data curation, formal analysis, investigation, validation, visualization, and writing—original draft. Patrick Ryan: Idea initialization, project administration, conceptualization, methodology, formal analysis, investigation, validation, research supervision, and writing—reviewing and editing. Chunhua Weng: Project administration, investigation, funding acquisition, research supervision, and writing—reviewing and editing.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

The research reported in this publication was supported by the National Center for Advancing Translational Sciences grant OT2TR003434 and National Library of Medicine grants 1R01LM014344 and R01LM009886. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflicts of interest

Y.F. and C.W. declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. P.R. is an employee of Janssen Research and Development and a shareholder of Johnson & Johnson.

Data availability

The data underlying this article are available in the article and in its online supplementary material. The source code is available upon reasonable request from the corresponding author.

References

1. Wang, Rastegar-Mojarad, et al. Clinical information extraction applications: A literature review. J Biomed Inform. 2018.

2. Hahn UDO, Oleynik. Medical information extraction in the age of deep learning. Yearb Med Inform. 2020:208-220.

3. Indications and Usage Section of Labeling for Human Prescription Drug and Biological Products–Content and Format Guidance for Industry. https://www.fda.gov/files/drugs/published/Indications-and-Usage-Section-of-Labeling-for-Human-Prescription-Drug-and-Biological-Products-%E2%80%94-Content-and-Format-Guidance-for-Industry.pdf

4. Bhatt, Roberts, Chen, et al. DICE: A drug indication classification and encyclopedia for AI-based indication extraction. Front Artif Intell. 2021:711467.

5. Fung, Jao, Demner-Fushman. Extracting drug indication information from structured product labels using natural language processing. J Am Med Inform Assoc. 2013:482-488.

6. Khare, Wei C-H. Automatic Extraction of Drug Indications from FDA Drug Labels. American Medical Informatics Association; 2014:787.

7. Ursu, Holmes, Knockel, et al. DrugCentral: online drug compendium. Nucleic Acids Res. 2017:D932-D939.

8. Shi, Ren, Zhang, Gong, Liang. Information extraction from FDA drug Labeling to enhance product-specific guidance assessment using natural language processing. Front Res Metr Anal. 2021:670006.

9. Aronson AR. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. American Medical Informatics Association; 2001.

10. OHDSI. Usagi. Accessed November 1, 2023. https://github.com/OHDSI/Usagi

11. Hoxha, Jiang, Weng. Automated learning of domain taxonomies from text using background knowledge. J Biomed Inform. 2016:295-306.

12. Singhal, Azizi, TAO, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180.

13. Agrawal, Hegselmann, Lang, Kim, Sontag. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022:1998-2022.

14. Yang, Deng, Wang, Song MIN, Shen SI. A generative drug–drug interaction triplets extraction framework based on large language models. Proc Assoc Inf Sci Technol. 2023:980-982.

15. Kartchner, Ramalingam, Al-Hussaini, Kronick, Mitchell. Zero-Shot Information Extraction for Clinical Meta-Analysis using Large Language Models. In: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. 2023:396-405.

16. Wang, Gao. Exploring the in-context learning ability of large language model for biomedical concept linking. arXiv, arXiv:2307.01137, 2023, preprint: not peer reviewed.

17. Cohen, Geva, Berant, Globerson. Crawling the Internal Knowledge-Base of Language Models. In: Findings of the Association for Computational Linguistics: EACL 2023. 2023.

18. Funk, Hosemann, Jung, Lutz. 2023. Towards ontology construction with language models. arXiv, arXiv:2309.09898, preprint: not peer reviewed.

19. OpenAI. 2023. GPT-4 Technical Report. Accessed January 3, 2024. https://cdn.openai.com/papers/gpt-4.pdf

20. Bohn, Gilbert, Knoll, Kern, Ryan PB. 2023. Large-scale empirical identification of candidate comparators for pharmacoepidemiological studies. medRxiv:2023.02.14.23285755, preprint: not peer reviewed. https://doi.org/10.1101/2023.02.14.23285755

21. Gwet KL. Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement among Raters. Advanced Analytics, LLC; 2014.

22. Wongpakaran, Wedding, Gwet KL. A comparison of Cohen's Kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013.

23. Zhang, Das, Malin, Kumar. SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency. In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.

24. Manakul, Liusie, Gales MJ. SelfcheckGPT: zero-resource black-box hallucination detection for generative large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

25. Noy, McGuinness DL. Ontology development 101: a guide to creating your first ontology. 2012. https://corais.org/sites/default/files/ontology_development_101_aguide_to_creating_your_first_ontology.pdf

26. Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med. 1998;(04/05):394-403.

27. Fabian, Wächter, Schroeder. Extending ontologies by finding siblings using set expansion techniques. Bioinformatics. 2012:i292-i300.

28. Althubaiti, Kafkas, Abdelhakim, Hoehndorf. Combining lexical and context features for automatic ontology extension. J Biomed Semantics. 2020.

29. Richesson, Fung, Krischer JP. Heterogeneous but “standard” coding systems for adverse events: Issues in achieving interoperability between apples and oranges. Contemporary Clinical Trials. 2008:635-645.

30. Touvron, Lavril, Izacard, et al. 2023. LLaMA: open and efficient foundation language models. arXiv, arXiv:2302.13971, preprint: not peer reviewed.

31. Anil, Dai, Firat, et al. 2023. PaLM 2 technical report. arXiv, arXiv:2305.10403, preprint: not peer reviewed.

32. Zhang, Wang, Xie. 2023. PMC-LLaMA: further finetuning LLaMA on medical papers. arXiv, arXiv:2304.14454, preprint: not peer reviewed.

33. Singhal, Gottweis, et al. 2023. Towards expert-level medical question answering with large language models. arXiv, arXiv:2305.09617, preprint: not peer reviewed.

34. Jahan, Laskar MTR, Peng, Huang JX. A comprehensive evaluation of large Language models on benchmark biomedical text processing tasks. Comput Biol Med. 2024;171:108189.

35. Sivarajkumar, Kelley, Samolyk-Mazzanti, Visweswaran, Wang. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: algorithm development and validation study. JMIR Med Inform. 2024:e55318.

36. Schriml, Arze, Nadendla, et al. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Res. 2012;(Database issue):D940-D946.