Table 3.

Performance of the agreement models compared to random and majority baselines on the manually annotated test set for the task of predicting concept presence in a patient note. Bold text marks the best-performing approach.

| Approach | Concept accuracy | Concept precision | Concept recall | Concept F1 |
|---|---|---|---|---|
| Random baseline | 0.31 | 0.45 | 0.31 | 0.35 |
| Majority baseline | 0.65 | 0.42 | 0.65 | 0.51 |
| Llama2-7B | 0.43 | 0.62 | 0.43 | 0.38 |
| Llama2-13B | 0.47 | 0.68 | 0.48 | 0.50 |
| GPT-4 (prompt does not contain concept definition) | 0.85 | 0.86 | 0.82 | 0.83 |
| GPT-4 (prompt contains concept definition) | **0.86** | **0.87** | **0.86** | **0.84** |
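
For reference, a minimal sketch of how accuracy, precision, recall, and F1 for the concept-presence task could be computed with scikit-learn. This is not the authors' evaluation code; the weighted averaging scheme, the binary present/absent label encoding, and the example arrays are assumptions for illustration.

```python
# Hypothetical sketch: scoring concept-presence predictions against manual
# annotations. Averaging scheme and label encoding are assumptions, not
# taken from the paper.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical example data: 1 = concept present in the note, 0 = absent.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # manual (gold) annotations
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```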