Abstract

Recent developments in deep learning have demonstrated its feasibility in liver malignancy diagnosis using ultrasound (US) images. However, most of these methods require manual selection and annotation of US images by radiologists, which limits their practical application. On the other hand, US videos provide more comprehensive morphological information about liver masses and their relationships with surrounding structures than US images, potentially leading to a more accurate diagnosis. Here, we developed a fully automated artificial intelligence (AI) pipeline that imitates the workflow of radiologists for detecting liver masses and diagnosing liver malignancy. In this pipeline, we designed an automated mass-guided strategy that uses segmentation information to direct the diagnostic models to focus on liver masses, thus increasing diagnostic accuracy. The diagnostic models for US videos utilize bi-directional convolutional long short-term memory modules with an attention-boosted module to learn and fuse spatiotemporal information from consecutive video frames. Using a large-scale dataset of 50 063 US images and video frames from 11 468 patients, we developed and tested the AI pipeline and investigated its applications. A dataset of annotated US images is available at https://doi.org/10.5281/zenodo.7272660.

Introduction

Liver malignancy is the fourth leading cause of cancer-related death worldwide and ranks sixth in terms of incident cases [1]. China alone accounts for more than half of all liver malignancy-related deaths worldwide [2]. Early detection and diagnosis of liver malignancy, as well as timely treatment, are crucial for improving patient survival. Ultrasound (US) is a flexible, safe, low-cost and real-time examination tool that employs the pulse-echo principle to produce an anatomical tomogram and detect abnormalities such as liver masses [3]. It is commonly used as the first-line liver imaging method for the monitoring, screening and diagnosis of liver malignancy [4]. Despite its extensive usage, the accuracy of US-based detection of liver malignancy varies widely [5, 6], owing to the fact that US is highly dependent on the expertise, experience and attention to detail of radiologists [2]. Therefore, it is vital to develop computer-aided diagnosis systems to help radiologists improve diagnostic accuracy.

Recent developments of deep learning models in artificial intelligence (AI) [7] have demonstrated their feasibility in medical imaging [8, 9, 10], including diagnosing liver malignancy using US images [11]. However, a fully automated AI pipeline that is robust to varying US image conditions and meets the standards of actual clinical application is lacking. Many previous studies used small datasets [12–14], lacked external testing to evaluate their models’ robustness and reliability, or required radiologists to manually select images from US videos and annotate mass regions [15–17], hence limiting their clinical applications. A distinct challenge is the heterogeneity of US images: liver images exhibit varying levels of brightness and noise, diverse liver shapes and various sizes and locations of liver masses. This heterogeneity makes it difficult to develop a robust AI model for the detection and diagnosis of liver malignancy. Another challenge is the opacity of the decision-making process of deep learning models, which hinders the translation of AI systems to clinical applications.

Meanwhile, although studies have reported AI models with considerable accuracy for detecting liver malignancy, the majority of these models were developed on static US images that were manually selected by radiologists [14, 15, 18–20]. In static US images, however, the overall morphology of liver masses and their relationships with surrounding structures, which are critical for radiologists in making a diagnosis, are most likely lost [21]. US videos, on the other hand, can provide more comprehensive information on the morphology and texture of liver masses, which AI models can employ to achieve a more accurate diagnosis of liver malignancy [22]. However, not all video frames are suitable for diagnostic analysis; radiologists always search for frames that clearly display liver masses. Therefore, a comprehensive model for US videos should be designed to maximize their advantages while limiting their disadvantages.

In this study, we aim to develop a comprehensive AI pipeline that imitates the workflow of radiologists for liver malignancy diagnosis using US images and video frames. This pipeline is a hierarchical, fully automated system that integrates deep learning segmentation and classification methods to perform scan-liver segmentation, mass detection, mass segmentation and diagnostic analysis (Figure 1A). In this pipeline, we designed an automated mass-guided strategy to incorporate mass segmentation information into the diagnostic network so that the diagnostic models can focus on the mass regions in the US images, thus increasing prediction accuracy and making the results more interpretable. On top of this, we took advantage of the bi-directional convolutional long short-term memory (BConvLSTM) network [23], which is capable of extracting spatiotemporal information from US videos, and developed Attention-Boosted BConvLSTM-based diagnostic models. Not only can the models learn morphological information about liver masses and their relationships with surrounding structures from consecutive frames, but they also pay particular attention to key frames that clearly show liver masses.

Figure 1. The overview of the AI pipeline. (A) The AI pipeline for liver malignancy diagnosis. First, a segmentation framework was applied to US images to hierarchically segment scan regions and livers. Then, a CNN detection model and a segmentation model were applied to the segmented liver images to detect and segment liver masses, respectively. Finally, the diagnostic models used a combination of US images and clinical factors for (i) liver malignancy diagnosis and subtype predictions, (ii) comparison with the performance of radiologists and (iii) development of new models for US videos. (B) A three-stage segmentation framework. This framework was trained on manually annotated images to segment scan regions from images, livers from other organs and masses from livers. The trained networks were integrated into the AI pipeline in (A). See also Supplementary Figure S2a for a more detailed model description. (C) An Attention-Boosted BConvLSTM-based classification model for US videos. For every t consecutive US frames, the CNN models’ outputs from the AI pipeline were input into the BConvLSTM layers, and then the Attention-Boosted module learned and fused spatiotemporal information for liver malignancy diagnosis. See also Supplementary Figure S2b for a more detailed model description.

The AI pipeline was trained, validated and internally tested on a large-scale cohort of 43 746 US images covering a variety of US equipment, examination settings and histological subtypes. It was externally tested on two datasets with a total of 6317 images, which demonstrated its robustness and efficiency across a variety of US imaging conditions and liver mass types. Our model outperformed junior radiologists and was comparable to mid-level radiologists in terms of accuracy on an independent cohort. To investigate the potential clinical applications of our AI pipeline, we simulated a scenario using consensus from both radiologists and the AI pipeline in decision-making. Moreover, experimental results showed that our video-based diagnostic models provide more accurate predictions than image-based models for diagnosing liver malignancy, increasing the area under the receiver operating characteristic curve (AUC) from 0.967 to 0.983 (with clinical factors) and from 0.943 to 0.966 (without clinical factors) at the patient level.

Materials and methods

Data collection

We constructed a large US dataset by combining data from three geographical regions in China: Guangzhou, a city in Guangdong province (the Guangzhou cohort); Foshan, a city also in Guangdong province (the Foshan cohort) and Yichang, a city in Hubei province (the Yichang cohort). A total of 50 063 US images from 11 468 patients in these three cohorts were obtained. We also collected serological examination results for patients with liver masses in both the Guangzhou and Foshan cohorts (Supplementary Tables S1 and S2). The US devices and definitions of liver masses are summarized in Supplementary Appendices 1.1 and 1.2, respectively. Ethics committee approvals were obtained at all participating institutions, and all participants signed a consent form.

To collect the US images and video clips, the examining physicians performed two-dimensional US scans of the liver according to the routine procedure. Each video was clipped while the mass appeared in the visual field. Meanwhile, the images with the major sections of the masses were kept if intrahepatic masses were found and clearly displayed. All US images and clinical factors were first de-identified to remove any patient-related information. A subset of 735 US images was annotated for segmentation model development, including 435 images of malignant masses, 200 images of benign masses and 100 images of normal livers. Two radiologists with >10 years of experience annotated and verified the data, respectively. In the event of a disagreement, they discussed the case and reached a consensus.

Implementation of the AI pipeline

As shown in Figure 1A, the AI pipeline consisted of four components: scan-liver segmentation, mass detection, mass segmentation and diagnostic analysis of liver masses. First, the scan-liver segmentation model received US images as input and produced normalized liver images. Second, the model for liver mass detection predicted whether or not the liver images contained masses. Then, the liver images containing masses were processed for mass segmentation. Finally, the mass-guided deep learning models integrated liver images with patients’ clinical factors to conduct diagnostic analysis, including malignant versus benign liver mass classification and histological subtype prediction. This design reduced several kinds of noise in the original images, such as US background noise, human operation biases and device-dependent variations, while also providing greater generalization. The following sections detail the development of the three-stage liver-mass segmentation models (Figure 1B and Supplementary Figure S2a) for the first and third components, as well as the implementation of the classification models for the second and fourth components.
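
To make the control flow concrete, the sketch below chains the four components in the order just described, including the fallback to the clinical-factor-only C-model introduced under ‘Development of diagnostic models’ when no mass is detected. All names are hypothetical stand-ins for the trained models rather than the authors’ implementation.

```python
import numpy as np

def run_pipeline(us_image, clinical, seg_scan, seg_liver, seg_mass,
                 detect_mass, lmc_net, c_model):
    """Hypothetical orchestration of the four pipeline components; every
    model argument is a callable placeholder for a trained network."""
    # Component 1: scan-liver segmentation (scan region first, then liver)
    liver = seg_liver(seg_scan(us_image))

    # Component 2: mass detection on the normalized liver image
    if not detect_mass(liver):
        # No mass detected: fall back to the clinical-factor-only C-model
        return c_model(clinical) if clinical is not None else 0.0

    # Component 3: mass segmentation; component 4: mass-guided diagnosis,
    # assuming liver is (C, H, W) and the mass map is stacked as an
    # additional input channel
    mass = seg_mass(liver)
    return lmc_net(np.concatenate([liver, mass[None]], axis=0), clinical)
```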

Development of three-stage segmentation framework

A raw US image contains a background area outside the US scan region and the echogenicity of other unrelated organs and tissues, which may interfere with the classification models for liver malignancy diagnosis. We propose an automated three-stage segmentation framework to solve this problem and ensure device compatibility and performance. As shown in Figure 1B and Supplementary Figure S2a, the framework includes three stages: segmenting US scan regions from images (Stage 1), livers from the scan regions (Stage 2) and masses from livers (Stage 3). Together, they decomposed a multi-class segmentation problem into a sequence of three binary segmentation problems according to sub-region hierarchy.

At the first stage, we down-sampled the raw US images to a low resolution of |$128\times 128$| and segmented the scan regions using the scan-region segmentation models. We then calculated the bounding boxes with the segmentation results and cropped them from the original images, thereby removing the areas outside the scan regions. At the second stage, we segmented the liver regions from the cropped images using the liver segmentation models, thereby removing other organs, background and noise. At this stage, the inputs (scan regions) were normalized to |$256\times 256$| to balance computational cost and accuracy. The segmented liver regions were then cropped from the original images and normalized to |$512\times 512$|. At the third stage, the mass segmentation models took the normalized liver images to segment liver masses for diagnostic analysis.
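
The following sketch illustrates this coarse-to-fine cascade under the stated resolutions; the segmentation models are placeholder callables returning binary masks, and the helper names are ours, not the authors’.

```python
import cv2
import numpy as np

def crop_to_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop an image to the bounding box of a (possibly lower-resolution,
    assumed non-empty) predicted mask."""
    mask_full = cv2.resize(mask, image.shape[:2][::-1],
                           interpolation=cv2.INTER_NEAREST)
    ys, xs = np.nonzero(mask_full)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def three_stage_segment(raw: np.ndarray, scan_model, liver_model, mass_model):
    """Coarse-to-fine cascade: scan region (128 x 128), liver (256 x 256),
    then mass segmentation on the normalized liver image (512 x 512)."""
    # Stage 1: locate the US scan region at low resolution
    scan = crop_to_mask(raw, scan_model(cv2.resize(raw, (128, 128))))

    # Stage 2: segment the liver within the cropped scan region
    liver = crop_to_mask(scan, liver_model(cv2.resize(scan, (256, 256))))

    # Stage 3: segment masses on the normalized 512 x 512 liver image
    liver = cv2.resize(liver, (512, 512))
    return liver, mass_model(liver)
```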

We assessed three widely used deep learning semantic segmentation models as the backbone for the framework: Fully Convolutional Network (FCN) [24], U-Net [25] and DeepLabV3 [26]. FCN consists of multiple convolutional and max-pooling layers, followed by up-sampling layers to identify pixel-wise labels and predict segmentation masks. Compared with FCN, U-Net adds horizontal concatenation operations that combine high-resolution features in the contracting path with the up-sampled output. In this way, the successive convolution layers can learn to assemble more precise outputs and increase localization accuracy. DeepLabV3 utilizes the Atrous Spatial Pyramid Pooling module to probe convolutional features at multiple scales, thus boosting the model’s segmentation performance at multiple scales. To improve their generalizability, we pretrained the models on the Microsoft Common Objects in Context (COCO) dataset [27], which is a large-scale semantic segmentation dataset with 2.5 million labeled instances in 328 000 images. Moreover, we used data augmentation that included rotation, brightness adjustment, horizontal/vertical flips and elastic deformations during the training stage to allow the models to learn slight variations of images. The training settings are summarized in Supplementary Appendix 1.3.
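
As a minimal sketch of this setup, the snippet below loads a COCO-pretrained DeepLabV3 from torchvision (with a ResNet-50 backbone, an assumption, since the exact checkpoint is not stated here) and composes augmentations of the kinds listed above with illustrative parameters; in practice the geometric transforms must be applied jointly to each image and its mask.

```python
import torch.nn as nn
from torchvision import transforms
from torchvision.models.segmentation import (
    DeepLabV3_ResNet50_Weights, deeplabv3_resnet50)

# COCO-pretrained DeepLabV3; the final classifier layer is replaced to
# predict two classes (background versus target region).
model = deeplabv3_resnet50(
    weights=DeepLabV3_ResNet50_Weights.COCO_WITH_VOC_LABELS_V1)
model.classifier[4] = nn.Conv2d(256, 2, kernel_size=1)

# Rotation, brightness adjustment, flips and elastic deformation, as in
# the training recipe above (parameter values are illustrative).
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ElasticTransform(alpha=30.0),
])
```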

Development of diagnostic models

To simulate the clinical pathways of US examination, we developed a hierarchical diagnostic system that integrated segmentation networks into classification networks. The classification networks included two sequential diagnostic procedures to address common clinical scenarios. First, a ‘mass detection’ network is called to differentiate between US images with masses (abnormal) and those without (normal). Second, a ‘diagnostic analysis’ network is called to classify liver masses as malignant or benign and perform histological subtype prediction.

For the ‘mass detection’ network, segmented liver images from Stage 2 of the segmentation framework were first resized to |$512\times 512$| and then analyzed by a DenseNet121-based convolutional neural network (CNN) model to differentiate between normal and abnormal livers. For comparison, we also developed a CNN model using the original US images, without liver segmentation, as inputs. We trained, validated and internally tested the classification models on the US image data from the Guangzhou cohort, with a random patient-level split of 70%:10%:20%.

Various combinations of clinical information were used to develop the ‘diagnostic analysis’ models to classify liver masses as malignant or benign. Specifically, using the segmented masses from Stage 3 of the segmentation framework, we proposed an automated mass-guided strategy that incorporated the mass segmentation information into the diagnostic network, which classified liver masses without (LM-Net) or with clinical factors (LMC-Net). In detail, we added the mass segmentation regions on liver images as an additional input channel to the diagnostic network, thereby enhancing the network’s ability to detect significant mass features and utilize them for diagnosis. For LMC-Net, we incorporated clinical factors into this architecture with a fully connected layer. Features of clinical factors were then concatenated with the output features of the liver and mass branches and analyzed by two fully connected layers. After the fully connected layers, a softmax computation layer produced probabilities for the classification tasks. In this study, we intended to evaluate the diagnostic benefit of using both the images and the clinical factors. For this purpose, we developed two additional models: a liver image-only diagnostic model based on the CNN model (L-Net) and a machine learning (ML) classifier (C-model) using gradient-boosted decision trees on clinical factors. An additional benefit is that the clinical factor-only model (C-model) can make a diagnosis when no mass is detected.
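
One plausible reading of this architecture is sketched below, with the mass map entering as a fourth input channel of a single DenseNet121 backbone and the clinical factors joining through a fully connected branch; since the text also mentions separate liver and mass branches, the layer sizes and wiring here are illustrative assumptions, not the exact design.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

class LMCNetSketch(nn.Module):
    """Illustrative mass-guided fusion: liver image (3 channels) plus mass
    segmentation map (1 channel), concatenated with clinical features."""

    def __init__(self, n_clinical: int, n_classes: int = 2):
        super().__init__()
        backbone = densenet121(weights=None)
        # Accept 4 input channels: RGB liver image + mass segmentation map
        backbone.features.conv0 = nn.Conv2d(4, 64, kernel_size=7,
                                            stride=2, padding=3, bias=False)
        backbone.classifier = nn.Identity()     # expose 1024-d features
        self.backbone = backbone
        self.clinical_fc = nn.Linear(n_clinical, 64)
        self.head = nn.Sequential(              # two FC layers, then softmax
            nn.Linear(1024 + 64, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, liver_and_mask, clinical):
        img_feat = self.backbone(liver_and_mask)
        clin_feat = torch.relu(self.clinical_fc(clinical))
        logits = self.head(torch.cat([img_feat, clin_feat], dim=1))
        return torch.softmax(logits, dim=1)
```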

Explanation of decision-making

We used Gradient-weighted Class Activation Mapping (Grad-CAM) to discover how much each liver region in the US images contributed to the classification of malignant versus benign masses, as performed by the deep learning models. We applied Grad-CAM to the final convolutional layer of the CNN architectures to highlight the regions important for prediction.
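
A minimal Grad-CAM sketch is shown below: it hooks the chosen convolutional layer, weights its activations by the spatial mean of their gradients with respect to the target class score and rectifies the result. The model and layer arguments are placeholders for any CNN in the pipeline.

```python
import torch

def grad_cam(model, layer, x, target_class):
    """Return a normalized class activation map for one input batch x."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model(x)[0, target_class].backward()               # gradient of the class score
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = torch.relu((weights * acts[0]).sum(dim=1))   # weighted feature maps
    return cam / cam.max().clamp(min=1e-8)             # scale to [0, 1]
```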

We also adopted the Shapley additive explanations (SHAP) method to illustrate the effect of clinical features on the C-model. SHAP is an effective method that provides explainability of the model, with the advantage of both local and global interpretability [28, 29].
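
The snippet below reproduces this analysis on toy data with the shap library, assuming the C-model is a scikit-learn gradient-boosted tree classifier; the feature names and data are synthetic placeholders.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the clinical-factor table (AFP, serum enzymes, ...)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["AFP", "GGT", "AST", "ALP"])
y = (X["AFP"] + 0.5 * X["GGT"] > 0).astype(int)

c_model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(c_model)
shap_values = explainer.shap_values(X)

# Instance-level view (as in Figure 4A/B) and global view (as in Figure 4C)
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0], matplotlib=True)
shap.summary_plot(shap_values, X)
```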

Comparison with radiologists and performance enhancement by AI

Twelve radiologists were selected and separated into three groups based on their work experience: junior-level (fewer than 8 years), mid-level (8–12 years) and senior-level (>12 years), with four radiologists in each group. They independently made diagnoses for patients in the Foshan cohort according to the examined US images and clinical factors. In the same cohort, we employed the LMC-Net model to predict liver-malignancy probabilities.

We then investigated the potential clinical applications of LMC-Net in the diagnosis of liver malignancy. We simulated a scenario using consensus derived from both radiologists and our AI model, in which the AI model was deployed as a ‘second reader’ of the diagnostic decisions of radiologists [30]. We randomly divided the four senior radiologists and the four junior radiologists into four groups, each consisting of one senior and one junior radiologist. In each group, the junior radiologist served as the first reader. When the AI model agreed with the junior radiologist’s decision, the decision was considered final. In the event of a disagreement, the senior radiologist’s opinion was sought.
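
The decision rule of this simulation is simple enough to state directly; the function below mirrors it, with boolean malignancy calls as placeholders for the readers’ actual diagnoses.

```python
def consensus_diagnosis(junior: bool, ai: bool, senior: bool) -> tuple[bool, bool]:
    """Second-reader protocol: the junior radiologist reads first; if the
    AI model agrees, that decision is final, otherwise the senior
    radiologist arbitrates. Returns (decision, senior_consulted)."""
    if junior == ai:
        return junior, False   # consensus reached, no senior workload
    return senior, True        # disagreement escalated to the senior reader
```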

Histological subtype prediction

To assess the AI model’s performance in distinguishing more detailed histological subtypes, we selected two subsets from the Guangzhou cohort, one with malignant liver masses and the other with benign liver masses. The malignant subset included 915 images from 412 patients with hepatocellular carcinoma (HCC), 78 images from 36 patients with intrahepatic cholangiocarcinoma (ICC) and 457 images from 188 patients with metastases. The benign subset included 4123 images from 1527 patients with hemangioma, 5005 images from 1832 patients with liver cysts and 250 images from 113 patients with focal nodular hyperplasia (FNH). Malignant masses and FNHs were confirmed by biopsy or post-surgery pathology, whereas hemangiomas and cysts were confirmed by enhanced imaging. We employed a 5-fold cross-validation strategy to train and validate the LMC-Net model, with a random patient-level split of 80%:20% for training and validation in each fold.
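
A patient-level split like this can be implemented with grouped cross-validation, as in the sketch below with toy arrays; the use of scikit-learn’s GroupKFold is our illustration, not necessarily the authors’ tooling.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 images from 6 patients; grouping by patient ID keeps all
# images of a patient in a single fold, preventing leakage across folds.
patient_ids = np.array([0, 0, 1, 2, 2, 3, 3, 4, 5, 5])
images = np.arange(10).reshape(-1, 1)
labels = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1])

for fold, (train_idx, val_idx) in enumerate(
        GroupKFold(n_splits=5).split(images, labels, groups=patient_ids)):
    print(f"fold {fold}: validation patients {sorted(set(patient_ids[val_idx]))}")
```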

Attention-boosted BConvLSTM-based models for US videos

In practice, radiologists examine and analyze multiple consecutive US video frames for morphological and texture information of liver masses, as well as other information such as mass sizes, to diagnose liver malignancy. To imitate radiologists’ decision-making process, we proposed Attention-Boosted BConvLSTM-based diagnostic models for US videos (Figure 1C and Supplementary Figure S2b), where the BConvLSTM network captures spatiotemporal information from US videos [31]. The BConvLSTM used two ConvLSTMs [32], recurrent layers designed for spatiotemporal data, to process the input video frames along forward and backward paths. Since the frames in the BConvLSTM-based model should not be regarded as equally important, we proposed an attention-boosted module to weight the frames by mass-attention values such that critical frames in US videos receive more attention. Given a sequence of t frames |${x}_1,\dots, {x}_i,\dots, {x}_t$| from a video, we utilized the softmax computation of the sizes of the mass regions within each frame to calculate the mass-attention values |${\alpha}_i$| as follows:

|$${\alpha}_i=\frac{\exp \left({s}_i+\varepsilon \right)}{\sum_{j=1}^t\exp \left({s}_j+\varepsilon \right)},$$|

where |${s}_i$| represents the proportion of the segmented mass region in the whole liver image, and |$\varepsilon$| was set to 0.01 to alleviate the adverse impact of segmentation errors. The attention-boosted module was added after the BConvLSTM layers. Let |${H}_1,\dots, {H}_t$| denote the hidden state tensors of the BConvLSTM layers. The output of the attention-boosted module |${Y}_t$| was calculated as the weighted sum of the hidden states and mass-attention values as follows:

|$${Y}_t=\sum_{i=1}^t{\alpha}_i{H}_i.$$|

In this study, we developed two Attention-Boosted BConvLSTM-based diagnostic models for US videos, one using only US videos (LM-VNet) and the other using a combination of US videos and clinical factors (LMC-VNet). We constructed the BConvLSTM-based diagnostic network using the backbone of LM-Net followed by two BConvLSTM layers, each with a kernel size of |$3\times 3$|. The LM-Net was pretrained on the US image dataset to address the problem of small-scale video data. Considering the strong similarity of adjacent frames, we sub-sampled one-third of the input video frames from the original 15 frames-per-second stream. For each frame, the LM-Net backbone extracted convolutional features and fed them into the BConvLSTM layers. For every 16 consecutive frames, the BConvLSTM layers extracted the spatiotemporal features as 16 corresponding hidden state tensors, and then the attention-boosted module and an average pooling layer compressed them into one output tensor. The output tensor was the input of the fully connected (FC) layer in the LM-VNet model, whereas in the LMC-VNet model, the output tensor concatenated with the features extracted from clinical factors was the input of the FC layer. A softmax activation function following the FC layer predicted the malignant probabilities. We calculated video-level malignant probabilities by taking a weighted sum of the predicted malignant probabilities from all frames, using the frame-level mass-attention values as weights. In comparison, we also evaluated the LM-Net and LMC-Net models by treating US video frames as individual images. We averaged the probabilities for each patient when there was more than one image/video per patient. The training settings are summarized in Supplementary Appendix 1.4.
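
The attention-boosted fusion itself reduces to a few lines, sketched below for the two equations above; the BConvLSTM layers are omitted (standard PyTorch has no built-in ConvLSTM), and the hidden states are assumed to be stacked along the first axis.

```python
import torch

def mass_attention(sizes: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    """Mass-attention values alpha_i: softmax over per-frame mass-region
    proportions s_i, with epsilon = 0.01 as in the text."""
    return torch.softmax(sizes + eps, dim=0)

def attention_boosted_pool(hidden: torch.Tensor, sizes: torch.Tensor) -> torch.Tensor:
    """Fuse hidden states H_1..H_t of shape (t, C, H, W) into a single
    output tensor Y_t as their attention-weighted sum."""
    alpha = mass_attention(sizes)                         # (t,)
    return (alpha.view(-1, 1, 1, 1) * hidden).sum(dim=0)  # (C, H, W)
```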

Evaluating the models

We used Intersection over Union (IoU) as the metric to evaluate segmentation performance (Supplementary Table S3). The IoU is the area of overlap between the predicted segmentation region and the ground truth divided by the area of their union. We used a variety of metrics to assess classification performance, including sensitivity, specificity, precision, accuracy and AUC. Sensitivity, specificity, precision and accuracy were determined at the operating point maximizing the Youden index. The confidence intervals for the difference between two values were calculated by the bootstrap method [33] with 1000 repeats. A two-sided permutation test with 10 000 trials was used to generate P-values for the difference [34], and a P-value < 0.05 was considered statistically significant. We drew smoothed receiver operating characteristic (ROC) curves using the pROC package [35].
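
Both evaluation recipes are standard; a compact sketch follows, with the ROC arrays assumed to come from a routine such as scikit-learn’s roc_curve.

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter / union) if union else 1.0

def youden_threshold(fpr: np.ndarray, tpr: np.ndarray, thresholds: np.ndarray):
    """ROC operating point maximizing the Youden index
    J = sensitivity + specificity - 1."""
    return thresholds[np.argmax(tpr - fpr)]
```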

Results

Clinical characteristics

The Guangzhou cohort served as the training, validation and internal-testing dataset for our AI pipeline. It consisted of a US image dataset and a US video dataset, totaling 43 746 US images and video frames from 10 997 patients. The US image dataset comprised 25 087 images from 10 831 patients, including images of normal livers, benign masses and malignant masses. The US video dataset comprised 205 video clips containing 18 659 US frames captured from 166 patients during the appearance of a mass.

The Foshan and Yichang cohorts served as two external test datasets to assess the generalizability and applicability of our AI pipeline. The Foshan cohort also served as a prospective study, with 673 US images from 370 patients. The Yichang cohort was a US video dataset, which included 5644 frames from 101 patients captured during the appearance of a mass. Meanwhile, for each video clip, we also collected the US image that displayed the clearest mass. In this way, the video datasets of the Guangzhou and Yichang cohorts could be used to evaluate the performance of the AI pipeline on US videos as well as US images. Details are summarized in Table 1 and Supplementary Figure S1.

Table 1. Patient demographic statistics in the developmental/internal-testing dataset (Guangzhou) and external test datasets (Foshan and Yichang)

| | Image dataset (Guangzhou) | Video dataset (Guangzhou) | External test set 1 (Foshan) | External test set 2 (Yichang) |
| --- | --- | --- | --- | --- |
| Number of US images | 25 087 | 18 659 (205 clips) | 673 | 5644 (101 clips) |
| Number of patients | 10 831 | 166 | 370 | 101 |
| Age mean, years (std) | 46.46 (14.39) | 52.83 (13.97) | 48.63 (14.78) | 49.25 (16.22) |
| Male (%) | 6408 (59.16%) | 118 (71.08%) | 199 (53.78%) | 55 (54.46%) |
| Normal images | 10 241 | – | 245 | – |
| Benign images | 9549 | 8475 | 218 | 1708 |
| Malignant images | 5297 | 10 184 | 164 | 3936 |
| Normal patients (%) | 4284 (39.55%) | – | 73 (19.73%) | – |
| Benign patients (%) | 4569 (42.18%) | 103 (62.05%) | 218 (58.92%) | 45 (44.55%) |
| Malignant patients (%) | 1978 (18.26%) | 63 (37.95%) | 79 (21.35%) | 56 (55.45%) |

Performance of segmentation models

The results of the segmentation models for our three-stage segmentation framework are summarized in Supplementary Table S3. All models had at least 0.97 IoU for scan-region segmentation, 0.92 IoU for liver segmentation and 0.71 IoU for liver mass segmentation. DeepLabV3 performed the best with 0.988, 0.940 and 0.758 IoU in the US scan region, liver and mass segmentations, respectively. As a result of its superior overall performance, we chose DeepLabV3 as the backbone of our segmentation framework.

Figure 2 shows three examples of DeepLabV3 segmentation results for US scan regions, livers and liver masses, as well as comparisons to manual annotations. The AI framework’s ability to perform precise US image segmentations was clearly demonstrated in the nearly perfect agreements between human annotations and segmentations. As a result, the segmentation system alone might be used as a visualization tool for radiologists to highlight lesion areas.

Figure 2. Three examples illustrating the segmentation framework’s results. The first column displays the original US images. The second column displays the manually segmented US images. The third column displays the segmented US images generated by the AI segmentation framework. The fourth and fifth columns display saliency maps (using Grad-CAM) of the diagnostic models with (LM-Net) and without (L-Net) the mass-guided strategy, respectively. The colors indicate the regions that the AI models prioritize when performing malignancy diagnosis, with red indicating a greater contribution to the prediction results, white a moderate contribution and blue a minor contribution.

Performance of diagnostic models for US images

As shown in Figure 3A and Supplementary Table S4, our approach was able to detect abnormal livers from normal livers using segmented liver images with an AUC of 0.990 [95% confidence interval (95% CI): 0.986–0.992], a sensitivity of 95.1%, a specificity of 96.1%, a precision of 97.4% and an accuracy of 95.5%. In comparison, the CNN model using original US images without segmented livers as inputs had worse performance, with an AUC of 0.963 (95% CI: 0.960–0.966), a sensitivity of 87.9%, a specificity of 94.2%, a precision of 96.2% and an accuracy of 91.9%.

Figure 3. The performance of the diagnostic system on the Guangzhou cohort. (A) A comparison of the ROC curves for CNN models that detect liver masses using original US images and liver images. (B) A comparison of the ROC curves for four AI models using different combinations of clinical factors and US images for classifying benign versus malignant masses. C-model: an ML model using only clinical factors; L-Net: a deep learning diagnostic model using liver images; LM-Net: a deep learning diagnostic model using a combination of liver images and mass segmentation information; LMC-Net: a deep learning diagnostic model using a combination of liver images, mass segmentation information and clinical factors. (C) AUC values of the CNN model trained with different numbers of liver images for liver mass detection and (D) AUC values of the LMC-Net model trained with different numbers of liver images for liver malignancy diagnosis.

We evaluated the performance of the classification models for classifying liver masses as malignant or benign on the internal test dataset containing 20% of the patients with liver masses from the Guangzhou cohort; the results are shown in Figure 3B and Supplementary Table S5. The LM-Net model without clinical factors achieved an accuracy of 89.3% and an AUC of 0.940 (95% CI: 0.927–0.954), and the LMC-Net model with clinical factors achieved an accuracy of 91.5% and an AUC of 0.968 (95% CI: 0.960–0.975). In comparison, the clinical-factor-only C-model had an accuracy of 81.0% and an AUC of 0.885 (95% CI: 0.880–0.889), whereas the image-only L-Net had an accuracy of 85.2% and an AUC of 0.916 (95% CI: 0.907–0.924). These results demonstrated that the mass-guided strategy on liver images improved the diagnostic accuracy by 4.1% and that adding clinical factors improved the image-based diagnostic accuracy by 2.2%.

To evaluate the performance gains of our proposed models with increasing data size, we randomly sampled 500, 1000, 2000, 4000, 8000 and 16 000 images from our dataset for training a liver mass detection model, and 500, 1000, 2000, 4000 and 8000 images for training a liver malignancy diagnosis model. After training, we evaluated model performance on the internal-testing dataset. Each experiment was repeated five times, and the mean and standard deviation of the AUC values were reported. As shown in Figure 3C and D, performance improved as the number of training samples increased. Notably, the models achieved an AUC of 0.978 for liver mass detection and 0.955 for liver malignancy diagnosis once the training data exceeded 4000 images.

Explanation of decision-making

To examine the basis of the decision-making process of the AI models, we first applied a visual explanation algorithm called Grad-CAM [36] in conjunction with the CNN architecture model to profile and compare the attention regions of the liver images with (LM-Net) and without (L-Net) the mass-guided strategy. As shown in Figure 2, the saliency maps of mass-guided model (LM-Net) enabled a greater focus on the liver masses, which were the critical regions for diagnostic analysis, and thus produced a better classification accuracy.

Using the SHAP explainer on the C-model, we examined the significance of the clinical factors and displayed the results in Figure 4. Figure 4A and B shows the instance-level interpretations for patients with malignant and benign liver masses, respectively. Figure 4C and D illustrates the global feature attributions over the whole dataset. The alpha-fetoprotein (AFP) level was identified as the most significant factor in liver malignancy diagnosis. The serum enzyme levels, including γ-glutamyl transpeptidase, aspartate aminotransferase, alanine aminotransferase and alkaline phosphatase, also contributed substantially to the diagnosis. These findings are consistent with current knowledge, indicating that the model appropriately accounts for clinical factors.

Figure 4. The effects of clinical features on the ML model (C-model) as determined by SHAP. (A) Clinical feature contributions for a patient diagnosed with a malignant liver mass. The horizontal axis represents the prediction probability. Features contributing to an increase of the probability are highlighted in red, whereas those contributing to a decrease are highlighted in blue. (B) Clinical feature contributions for a patient diagnosed with a benign liver mass. (C) Distribution of the effects of each clinical feature on the global-level output. The colors represent the features’ values, with red as high and blue as low. Features to the left of the bar contribute negatively to the malignancy prediction, whereas features to the right contribute positively. (D) Average effect of each clinical feature. AFP: alpha-fetoprotein (ng/ml); ALB: albumin (g/L); GGT: gamma-glutamyl transferase (U/L); AST: aspartate aminotransferase (U/L); HBsAg: hepatitis B surface antigen; ALP: alkaline phosphatase (U/L); ALT: alanine aminotransferase (U/L); CEA: carcinoembryonic antigen (μg/L); TBIL: total bilirubin (μmol/L); DBIL: direct bilirubin (μmol/L); Sex: 1 (female), 2 (male).

Independent external testing of the AI models

We evaluated the AI models using two external datasets from geographically distinct regions (Foshan and Yichang).

Using the Foshan cohort, we first applied the CNN model to all 673 liver images from 370 patients to detect the existence of liver masses. As shown in Figure 5A, the model had a sensitivity of 86.7%, a specificity of 88.0% and an AUC of 0.945 (95% CI: 0.933–0.955). For 297 patients (382 images) with liver masses, we then applied the LMC-Net to classify these images as malignant or benign. As shown in Figure 5B, the model had a sensitivity of 82.7%, a specificity of 92.7% and an AUC of 0.928 (95% CI: 0.902–0.950). These results demonstrated robust performance of the AI models on external datasets.

Figure 5. The performance of the diagnostic system on the external test datasets and its comparison to radiologists. (A) The ROC curve for the CNN model using liver images to detect liver masses on the external test dataset (Foshan). (B) The ROC curve for the LMC-Net model for classifying benign versus malignant masses on the external test dataset (Foshan). The results include the mean diagnostic accuracies of junior, mid-level and senior radiologists and the consensus decision reached by radiologists and the AI model. (C) The ROC curve for the LM-Net for classifying benign versus malignant masses using the images from the Yichang cohort.

Using the Yichang cohort, we applied the LM-Net to classify 101 US images from 101 patients as malignant or benign. As shown in Figure 5C, the LM-Net model had a sensitivity of 85.4%, a specificity of 77.8% and an AUC of 0.885 (95% CI: 0.828–0.922). These results validated the generalizability of the AI model.

Comparison with radiologists and performance enhancement by AI

Comparison of LMC-Net with the judgement of 12 US radiologists on liver malignancy diagnosis using the Foshan cohort is shown in Figure 5B and Supplementary Table S8. Specifically, the sensitivity of the deep learning model was comparable to that of the mid-level radiologists (82.7% versus 82.6%, P > 0.05) at a respectable specificity of 92.7%, but significantly higher than that of junior radiologists (82.7% versus 75.6%, P < 0.001).

In the simulation study, the combination of human and AI resulted in overall performance that was better than that of senior radiologists alone (Accuracy: 91.3% versus 89.5%), while saving 79.6% of senior radiologists’ labor (Figure 5B, Supplementary Table S9). These results demonstrated that the AI model could improve the performance of junior radiologists and reduce the workload of senior radiologists.

Histological subtype prediction

The LMC-Net model was able to differentiate HCC from the other subtypes with an AUC of 0.796 (95% CI: 0.763–0.828), ICC from the other subtypes with an AUC of 0.692 (95% CI: 0.609–0.775) and metastases from the other subtypes with an AUC of 0.779 (95% CI: 0.741–0.812) (Figure 6A). For the benign subset, the LMC-Net model was able to differentiate FNH from the other subtypes with an AUC of 0.881 (95% CI: 0.848–0.912), liver cyst from the other subtypes with an AUC of 0.930 (95% CI: 0.923–0.937) and hemangioma from the other subtypes with an AUC of 0.903 (95% CI: 0.895–0.911) (Figure 6B). These results demonstrated the utility of AI models in histological subtype prediction.

Figure 6. The performance of the LMC-Net models for malignant and benign subtype classifications. (A) ROC curves for the malignant subtype classification. (B) ROC curves for the benign subtype classification.

AI performance for US videos

As shown in Figure 7A and Supplementary Table S6, the LM-VNet model without clinical factors had a sensitivity of 87.3%, a specificity of 91.0%, a precision of 87.9% and an AUC of 0.966 (95% CI: 0.955–0.977), whereas the LMC-VNet model with clinical factors had better performance with a sensitivity of 90.9%, a specificity of 93.5%, a precision of 92.5% and an AUC of 0.983 (95% CI: 0.972–0.991). In comparison, if we replaced the US videos with individual images and applied the LM-Net model and the LMC-Net model to these images, performance decreased. The LM-Net model had a sensitivity of 84.7%, a specificity of 88.8%, a precision of 87.1% and an AUC of 0.943 (95% CI: 0.924–0.958), whereas the LMC-Net model had a sensitivity of 88.1%, a specificity of 91.9%, a precision of 89.6% and an AUC of 0.967 (95% CI: 0.955–0.979). These results demonstrated the importance of spatiotemporal information between consecutive US frames, which should be incorporated into the AI models for more accurate diagnosis.

We validated the LM-VNet on the video dataset of the Yichang cohort. As shown in Figure 7B, the LM-VNet had a sensitivity of 86.0%, a specificity of 84.3%, a precision of 85.6% and an AUC of 0.901 (95% CI: 0.873–0.921), which demonstrated the robustness of the video-based model.

Figure 7. The performance comparison of the diagnostic system using US image and video data. (A) The ROC curves for four diagnostic models in classifying benign versus malignant masses using the developmental dataset (Guangzhou). LM-VNet: a video model using a combination of liver images and mass segmentation information. LMC-VNet: a video model using a combination of liver images, mass segmentation information and clinical factors. (B) The ROC curves for the LM-VNet model for classifying benign versus malignant masses in the external test dataset (Yichang).

We investigated the contributions of various modules in our model through an ablation study. For the clinical factors, the diagnostic models integrated with clinical factors (LMC-) outperformed the models without clinical factors (LM-) by roughly 2% in AUC, demonstrating that the clinical factors do contribute to the diagnosis. For the BConvLSTM module, the video diagnostic models based solely on the BConvLSTM module, without the Attention-Boosted module (w/o AB), outperformed the LM-Net and LMC-Net models that treated video frames as distinct images by roughly 1% in AUC. This result indicates that including spatiotemporal information in the diagnosis is advantageous. For the Attention-Boosted module, the LM-VNet and LMC-VNet models, both of which had the BConvLSTM and Attention-Boosted modules, outperformed the BConvLSTM (w/o AB) models by nearly 1% in AUC, indicating the importance of paying attention to the critical frames.

We compared our proposed methods with conventional ML solutions, TextureRF, TextureSVM and TextureANN [7], and with other deep learning methods, ModelLB and ModelLBC [15]. As shown in Supplementary Table S6, all deep learning methods outperformed the conventional ML methods because of their superior feature extraction ability, and our proposed models outperformed the previous deep learning models.

Figure 8 shows a video case of a malignant mass. When the liver masses were clearly visible in the frames such as Frames 16 and 27, the segmentation model was able to provide precise mass contours, and the diagnostic models predicted high malignant probabilities. However, when the liver masses were not clearly visible such as Frame 1, the clinical factors could make more contributions to diagnostic analysis. Overall, the video-based diagnostic models outperformed the image-based diagnostic models. Additional cases are shown in Supplementary Figure S3.

Figure 8. An example of a video case with HCC. (A) Malignant probabilities on all frames by four models. The image-based models (LM-Net* and LMC-Net*) produced malignant probabilities for each frame. The video-based models (LM-VNet and LMC-VNet) produced malignant probabilities based on information from both the current and previous frames. (B) Four samples of the video frames and their corresponding mass segmentation results by the mass segmentation model.

Conclusion and discussion

In this study, we developed an AI pipeline for fully automated liver malignancy screening and diagnosis using large-scale US datasets. The pipeline followed the clinical practice of US examination, including detecting and segmenting liver masses, classifying them as either malignant or benign and subsequently making histological subtype predictions. In this process, an automated mass-guided strategy was designed to incorporate segmentation information into the diagnostic networks. Serological examinations and other clinical factors were integrated with US images to make a comprehensive diagnosis. Moreover, we proposed attention-boosted BConvLSTM-based diagnostic models that improve diagnostic accuracy by imitating how radiologists examine US videos in the real world. The pipeline was evaluated on multiple cohorts and demonstrated high accuracy for detecting liver masses and differentiating between benign and malignant masses.

Independently, we developed a three-stage framework that segments target regions in the order of US scan regions, livers and masses, from larger to smaller (Figure 1B and Supplementary Figure S2a). Similar challenges occur in other medical image analyses, including CT scans, for which a multi-stage segmentation framework was proposed to solve this problem and ensure device compatibility and performance [37]. This design addressed major challenges in US image analysis to build a robust and generalizable AI system. Specifically, the three original US images in Figure 2 displayed varying levels of brightness, noise outside scan regions, diverse liver shapes and various mass sizes. The proposed framework was able to reduce variations in scan regions and the interference of extraneous noise by segmenting US scan regions, to remove irrelevant parts (organs) of images by segmenting livers and to provide localized liver masses for downstream diagnosis.

According to the diagnostic models’ classification results and saliency maps shown in the last two columns of Figure 2, the mass-guided strategy can direct the diagnostic model to focus on liver masses, boundaries and adjacent areas to produce the most accurate diagnoses. This is similar to how radiologists use information such as mass sizes, mass features and boundary features, to make diagnoses, thereby elevating confidence in the AI model’s predictions.

In clinical practice, it is critical to integrate various clinical data to make correct diagnoses [15]. The serological examination is an important reference point for radiologists when determining whether a mass is benign or malignant [38]. For example, patients with chronic liver disease who have elevated AFP levels are suggested to have an increased risk of HCC [39, 40]. Inspired by this, we developed multiple AI diagnostic models based on various sources of clinical data, as shown in Figure 3B. The combined information significantly improved diagnostic accuracy, increasing the AUC from 0.940 without serological examinations (LM-Net) to 0.968 with serological examinations (LMC-Net). This increase demonstrated that US images and serological examinations complemented one another in malignancy diagnosis. The external test on the Foshan cohort confirmed that the diagnostic accuracy (an AUC of 0.928 by LMC-Net in Figure 5B) was comparable to that of mid-level radiologists.

In real-world situations, radiologists identify liver masses by viewing US videos instead of a few static US images. US videos can provide more comprehensive morphological and texture information on liver masses and their relationships with surrounding structures [21]. In this study, we proposed Attention-Boosted BConvLSTM models based on US videos, which offer a greater potential for integration into existing US diagnostic systems and may provide better diagnostic accuracy in a real clinical setting. The attention modules assessed the quality of the frames and calculated their weights using the mass segmentation results. As shown in Figure 8, the segmentation model provided precise mass contours for frames showing clear masses, such as Frames 16 and 27, but struggled to segment the mass region when the liver mass was not clearly displayed, such as in Frames 1 and 20. Our AI models could integrate weighted features from current and previous frames to reduce the adverse impact of low-quality frames and provide a more accurate diagnosis.

In previous studies, Virmani et al. [41] adopted an SVM-based method on 56 US images from 56 patients to differentiate between HCC and normal cases, achieving 88.8% accuracy, with 90.0% sensitivity for detecting normal cases and 86.6% sensitivity for HCC cases. Xi et al. [12] developed deep learning models to differentiate benign from malignant focal liver lesions using 911 images from 596 patients, achieving an accuracy of 84%. Brehar et al. [7] compared deep learning models and conventional ML models on 1331 annotated images from 268 patients; the deep learning models achieved 0.95 AUC and 91% accuracy, outperforming the 0.72 AUC and 66% accuracy of the ML methods. Shen et al. [14] established a prediction model using a logistic regression algorithm on 266 patients to discriminate between malignant and benign liver lesions, achieving 0.942 AUC and 90.6% accuracy. In our study, we developed our models on a large-scale US image dataset of 25 087 images, and they outperformed these studies with 0.990 AUC and 95.5% accuracy for liver mass detection, and 0.968 AUC and 91.5% accuracy for liver malignancy diagnosis. Moreover, the robustness of our methods was validated on external datasets.

Several limitations of our study warrant additional investigation. First, owing to the limited number of patients with biopsy or post-surgery pathology results, we only investigated the models’ performance on three subtype classifications for the malignant and benign masses, respectively. More effort should be made in the future to classify patients into more comprehensive subtypes. Second, of all the proposed AI models, LMC-VNet provided the most accurate diagnosis. However, owing to the lack of serological examination results in the Yichang cohort, we did not test the LMC-VNet model on this external dataset. In the future, we hope to perform extensive tests on the model. Third, this study was conducted on US data from China, where hepatitis B-related liver malignancy accounts for the majority of liver malignancy cases. Our model may therefore be biased toward this type of malignancy. As more data are gathered from various geographical regions around the world [2], we anticipate that better models will be developed.

In summary, we developed an application of deep learning models to automate liver mass detection and classification. AI-assisted liver malignancy screening models have the potential to reduce medical costs, while improving screening efficiency and accuracy at all levels of health care, especially primary care. We have shown that our AI models can increase radiologists’ diagnostic accuracy, especially for less experienced radiologists, and may aid in the prognosis and treatment of patients with liver malignancy.

Data availability

The de-identified data are available at https://doi.org/10.5281/zenodo.7272660. The dataset, despite being open to public access, is subject to copyright. Any use of the data contained within this dataset must receive appropriate acknowledgement and credit.

Code availability

We provided the Python source code, which is available at https://github.com/AndlierXu/AI-liver-ultrasound/.

Key points
  • We proposed a fully automated AI pipeline that imitates the workflow of radiologists for detecting liver masses and diagnosing liver malignancy using a large-scale dataset of US images and videos.

  • We developed video-based diagnostic models that could naturally be integrated into existing US diagnostic systems. We demonstrated that they provided a higher diagnostic accuracy than image-based models in the clinical setting.

  • Our AI models can increase radiologists’ diagnostic accuracy, especially for less experienced radiologists, and may aid in the prognosis and treatment of patients with liver malignancy.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable suggestions.

Funding

This work was supported by the National Key R&D Program of China (2021YFF1201303 and 2019YFB1404804), the National Natural Science Foundation of China (grants 61872218 and 61906105), the Guoqiang Institute of Tsinghua University, the Tsinghua University Initiative Scientific Research Program, the Beijing National Research Center for Information Science and Technology (BNRist) and the Tsinghua-Qingdao Institute of Data Science.

Yiming Xu is a PhD candidate in the Department of Computer Science and Technology at Tsinghua University. His research interests include clinical/medical informatics.

Bowen Zheng is a physician at The Third Affiliated Hospital of Sun Yat-Sen University. Her research interests include clinical/medical informatics.

Xiaohong Liu holds a PhD from the Department of Computer Science and Technology at Tsinghua University. His research interests include clinical/medical informatics.

Tao Wu is a physician at The Third Affiliated Hospital of Sun Yat-Sen University, whose research interests include clinical/medical informatics.

Jinxiu Ju holds a PhD from The Third Affiliated Hospital of Sun Yat-Sen University, whose research interests include clinical/medical informatics.

Shijie Wang holds a PhD from The Third Affiliated Hospital of Sun Yat-Sen University. His research interests include clinical/medical informatics.

Yufan Lian is a physician at The Third Affiliated Hospital of Sun Yat-Sen University, whose research interests include clinical/medical informatics.

Hongjun Zhang is a physician at The Third Affiliated Hospital of Sun Yat-Sen University, whose research interests include clinical/medical informatics.

Tong Liang is a physician at the Foshan Traditional Chinese Medicine Hospital, whose research interests include clinical/medical informatics.

Ye Sang is a physician at China Three Gorges University and Yichang Central People’s Hospital, whose research interests include clinical/medical informatics.

Rui Jiang is an Associate Professor in the Department of Automation and BNRist at Tsinghua University. His research interests include clinical/medical informatics and bioinformatics.

Guangyu Wang is a Professor in the School of Information and Communication Engineering at Beijing University of Posts and Telecommunications. Her research interests include clinical/medical informatics.

Jie Ren is a Professor at The Third Affiliated Hospital of Sun Yat-Sen University. Her research interests include clinical/medical informatics.

Ting Chen is a Professor in the Department of Computer Science and Technology, the Institute of Artificial Intelligence and BNRist at Tsinghua University. His research interests include clinical/medical informatics and bioinformatics.

References

1. Bray F, Ferlay J, Soerjomataram I, et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68(6):394–424.

2. Akinyemiju T, Abera S, Ahmed M, et al. The burden of primary liver cancer and underlying etiologies from 1990 to 2015 at the global, regional, and national level: results from the global burden of disease study 2015. JAMA Oncol 2017;3(12):1683–91.

3. Wilkinson R. Principles of real-time two-dimensional B-scan ultrasonic imaging. J Med Eng Technol 1981;5(1):21–9.

4. Tchelepi H, Ralls PW. Ultrasound of focal liver masses. Ultrasound Q 2004;20(4):155–69.

5. Bolondi L. Screening for hepatocellular carcinoma in cirrhosis. J Hepatol 2003;39(6):1076–84.

6. Samoylova ML, Mehta N, Roberts JP, et al. Predictors of ultrasound failure to detect hepatocellular carcinoma. Liver Transpl 2018;24(9):1171–7.

7. Brehar R, Mitrea DA, Vancea F, et al. Comparison of deep-learning and conventional machine-learning methods for the automatic recognition of the hepatocellular carcinoma areas from ultrasound images. Sensors 2020;20(11):3085.

8. Yasaka K, Akai H, Abe O, et al. Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study. Radiology 2018;286(3):887–96.

9. Hu HT, Wang W, Chen LD, et al. Artificial intelligence assists identifying malignant versus benign liver lesions using contrast-enhanced ultrasound. J Gastroenterol Hepatol 2021;36(10):2875–83.

10. Marya NB, Powers PD, Fujii-Lau L, et al. Application of artificial intelligence using a novel EUS-based convolutional neural network model to identify and distinguish benign and malignant hepatic masses. Gastrointest Endosc 2021;93(5):1121–1130.e1.

11. Nishida N, Yamakawa M, Shiina T, et al. Current status and perspectives for computer-aided ultrasonic diagnosis of liver lesions using deep learning technology. Hepatol Int 2019;13(4):416–21.

12. Xi IL, Wu J, Guan J, et al. Deep learning for differentiation of benign and malignant solid liver lesions on ultrasonography. Abdominal Radiol 2021;46(2):534–43.

13. Hassan TM, Elmogy M, Sallam E-S. Diagnosis of focal liver diseases based on deep learning technique for ultrasound images. Arabian J Sci Eng 2017;42(8):3127–40.

14. Shen H, Lv G, Lin H, et al. Development of an ultrasound prediction model to discriminate between malignant and benign liver lesions. Ultrasound Med Biol 2020;46(4):952–8.

15. Yang Q, Wei J, Hao X, et al. Improving B-mode ultrasound diagnostic performance for focal liver lesions using deep learning: a multicentre study. EBioMedicine 2020;56:102777.

16. Xu SS-D, Chang CC, Su CT, et al. Classification of hepatocellular carcinoma and liver abscess by applying neural network to ultrasound images. Sensors Mater 2020;32(8):2659–753.

17. Nishida N, Yamakawa M, Shiina T, et al. Artificial intelligence (AI) models for the ultrasonographic diagnosis of liver tumors and comparison of diagnostic accuracies between AI and human experts. J Gastroenterol 2022;57(4):309–21.

18. Yamada A. Deep learning promotes B-mode ultrasound screening for focal liver lesions. EBioMedicine 2020;56:102814.

19. Hwang YN, Lee JH, Kim GY, et al. Classification of focal liver lesions on ultrasound images by extracting hybrid textural features and using an artificial neural network. Biomed Mater Eng 2015;26(s1):S1599–611.

20. Schmauch B, Herent P, Jehanno P, et al. Diagnosis of focal liver lesions from ultrasound using deep learning. Diagn Interv Imaging 2019;100(4):227–33.

21. Chen C, Wang Y, Niu J, et al. Domain knowledge powered deep learning for breast cancer diagnosis based on contrast-enhanced ultrasound videos. IEEE Trans Med Imaging 2021;40(9):2439–51.

22. Tesanic DM, Merz E. Artifacts in 3D prenatal sonography. Ultraschall in der Medizin–Eur J Ultrasound 2020;41(3):286–91.

23. Song H, Wang W, Zhao S, et al. Pyramid dilated deeper ConvLSTM for video salient object detection. In: Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany: Springer, Cham, 2018;715–31.

24. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015;3431–40.

25. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015;234–41.

26. Chen L-C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

27. Lin T-Y, et al. Microsoft COCO: common objects in context. In: European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014;740–55.

28. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2(1):56–67.

29. Lundberg S, Lee S-I. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.

30. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94.

31. Xingjian S, Chen Z, Wang H, Yeung DY. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems. Montreal, Quebec, Canada: Curran Associates, Inc., 2015;802–10.

32. Rahman SA, Adjeroh DA. Deep learning using convolutional LSTM estimates biological age from physical activity. Sci Rep 2019;9(1):1–15.

33. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL, USA: Chapman and Hall/CRC, 1994.

34. Chihara LM, Hesterberg TC. Mathematical Statistics with Resampling and R. Hoboken, NJ, USA: John Wiley & Sons, 2018.

35. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform 2011;12(1):1–8.

36. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017;618–26.

37. Zhang K, Liu X, Shen J, et al. Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell 2020;181(6):1423–1433.e11.

38. Schwartz JM, et al. Clinical Features and Diagnosis of Hepatocellular Carcinoma. Waltham: UpToDate, 2019.

39. Tsukuma H, Hiyama T, Tanaka S, et al. Risk factors for hepatocellular carcinoma among patients with chronic liver disease. N Engl J Med 1993;328(25):1797–801.

40. Tzartzeva K, Singal AG. Testing for AFP in combination with ultrasound improves early liver cancer detection. Expert Rev Gastroenterol Hepatol 2018;12(10):947–9.

41. Virmani J, Kumar V, Kalra N, et al. SVM-based characterization of liver ultrasound images using wavelet packet texture descriptors. J Digit Imaging 2013;26(3):530–43.

Author notes

Yiming Xu, Bowen Zheng, Xiaohong Liu contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]