Abstract

The low incidence of failures and high randomness in high-speed train wheelset bearings pose significant challenges in identifying bearing defects under few-shot sample conditions. An inception transformer (IFormer)-based weighted prototype network is proposed for few-shot recognition of wheelset bearing defect images. To capture subtle differences in few-shot samples, an IFormer network integrating the strengths of convolutional neural networks (CNNs) and transformers is adopted in the prototype representation space. A multi-path fusion attention mechanism (MPAM)-based weighting prototype block is introduced to assign weights to features of same-class samples, thus enhancing the representation of target class prototypes. By integrating the modified cost function (MCF), the proposed model can more accurately evaluate the similarity between query samples and class prototypes. Extensive experiments on a public steel plate surface defect data set and the self-constructed train wheelset bearing defect (TWBD) data set demonstrate the robustness of the proposed model compared to other state-of-the-art few-shot learning models. Furthermore, the effectiveness of the proposed model has been validated through a series of ablation experiments and visualization analyses. The proposed approach shows potential as a tool to facilitate intelligent recognition of train wheelset bearing images under few-shot sample conditions.

Highlights
  • An inception transformer (IFormer)-based weighted prototype network is presented.

  • The IFormer network is successfully used to capture information in the feature embedding space.

  • A weighting prototype block is devised to assign weights to same-class samples and assess their varying contributions to the respective class prototype.

  • A modified cost function is devised to enhance the robustness of the model's classification process.

1. Introduction

Railway transport boasts numerous advantages, including high capacity, cost-effectiveness, compactness, and environmental friendliness. It plays a pivotal and indispensable role in fostering the development of the national economy. The wheelset bearing serves as the crucial component of the bogie in high-speed trains, whose health status is imperative for ensuring the safe operation of trains. Given their prolonged service under rigorous operating conditions, the bearings are susceptible to scoring, spalling, pitting, and numerous other forms of damage (Liu et al., 2021b). Consequently, investigating the health operation and maintenance of wheelset bearings in high-speed trains holds significant academic merit and is pivotal in ensuring the uninterrupted and secure operation of trains.

Currently, techniques such as temperature monitoring (Glowacz et al., 2018), wayside acoustic analysis (He et al., 2013), and vibration monitoring (Huang et al., 2019) are extensively utilized for online fault detection in train wheelset bearings. The system triggers an alarm if the measured temperature of the wheelset surpasses a predefined threshold during train operation. However, despite its simplicity, the effectiveness of temperature monitoring is questionable due to its limited sensitivity to early bearing failures and the frequent occurrence of false alarms. The wayside acoustic diagnostic system (WADS) enables non-contact detection, utilizing multiple microphone array sensors positioned along both tracks. This system effectively diagnoses bearing faults by extracting fault characteristic information from acoustic signals (Ding et al., 2022). Although the WADS has been implemented in various engineering applications, the acoustic signals collected often exhibit weak fault signatures and encounter challenges associated with the Doppler effect.

Vibration monitoring serves as the pre-eminent diagnostic technique for bearings, as it encapsulates extensive information regarding the operating state within the vibration signals, thus facilitating both offline and online inspections. Recently, numerous vibration signal analysis methods, such as spectral analysis (Wang et al., 2016), time–frequency decomposition (Li et al., 2022a, 2022b), morphology filter (Li et al., 2018), and machine learning (Oh et al., 2022; Kim et al., 2023), have garnered significant attention. It can be concluded after extensive investigation that on-board vibration monitoring for wheelset bearing diagnostics has not yet gained widespread adoption, primarily owing to constraints in sensor installation space and power supply.

Oil monitoring (Wu et al., 2013) is a commonly employed offline monitoring technique for wheelset bearings. The primary components of oil monitoring include ferrography analysis, pollution analysis, infrared spectrum analysis, and spectral analysis. Through the analysis of train axlebox lubricant samples, oil monitoring provides insights into the operating condition, fault mode, and wear degree of wheelset bearings (Zhu et al., 2013). A major limitation of oil monitoring lies in its high cost and intricate operational prerequisites, necessitating a considerable investment in personnel and time.

Accurate detection of defects on wheelset bearing surfaces is challenging without dismantling the axlebox, where the bearings are mounted. To enhance bearing fault detection accuracy, disassembly and repair of wheelset bearings are scheduled based on train mileage during routine maintenance (Liu et al., 2021b). Manual visual inspection remains the primary method for detecting surface defects in wheelset bearings. However, the quality of manual visual inspection is subject to the experience of skilled inspectors, rendering it unreliable at times. Furthermore, prolonged exposure to bright lights during manual inspection poses health risks to inspectors. Figure 1 shows the numerous deficiencies in the visual inspection of wheelset bearings on a real repair line.

Figure 1: Visual inspection of wheelset bearings.

With the rapid progress of computer technology, machine vision monitoring has emerged and increasingly replaced manual inspection. However, the practical application of machine vision-based research for detecting surface defects in wheelset bearings is still relatively limited. Most image recognition techniques for detecting bearing defects utilize conventional image processing methods, such as thresholding, edge detection, filtering, and image clustering, to analyse and extract relevant colour, edge, and texture features from images (Lu et al., 2016). Lei (2004) proposed a machine vision system for bearing diameter inspection, employing dual filtering and four detection operators to enhance the extraction of bearing image edges. Shen et al. (2013) introduced a machine vision system for the bearing lubrication process, leveraging image processing techniques to efficiently and precisely locate bearing cages in digital images. Shen et al. (2012) developed a machine vision system for bearing defect detection, capable of identifying diverse defects, including deformation, corrosion, and scratches on bearing caps. The aforementioned methods frequently necessitate intricate image processing techniques, which hinder their applicability for intelligent image recognition and fail to fulfill practical requirements regarding detection efficiency and user-friendliness.

In recent years, deep learning has exhibited significant potential in image processing, recognition, and detection due to its superior feature extraction and adaptive capabilities. Consequently, there has been a surge in research in various image recognition studies, encompassing convolutional neural networks (CNNs) (Zhang et al., 2022b), autoencoders (Zhao et al., 2017), graph neural networks (Hong et al., 2021), and transformer-based networks (He et al., 2021). The crucial advantage of deep learning lies in its capability to autonomously learn feature representations from images, requiring minimal expert experience and human intervention. Notably, various network models based on CNNs, such as U-Net (Ronneberger et al., 2015), Faster R-CNN (Zeng et al., 2021), and the you only look once (YOLO) series (Hnewa & Radha, 2021; Li et al., 2021), have emerged and established a prominent role in the realm of image classification and recognition. The transformer model is a deep learning architecture based entirely on a self-attention mechanism, which can effectively capture relationships across all input elements, making it superior for long-range dependency modelling (Alexey, 2020). Therefore, various transformer-based variant models, such as vision and swin transformers, have emerged and have been widely used in research areas of machine translation and natural language modelling (Wu et al., 2024; He et al., 2025). However, the transformer suffers from high computational complexity and poor local feature acquisition capability. To the best of our knowledge, there is no publicly available open-source data set specifically designed for defect detection on train wheelset bearing surfaces. Additionally, deep learning methods require a considerable amount of labelled data to effectively train neural network models. However, the scarcity of sufficient data samples often leads to overfitting in these models, thereby compromising their generalization capabilities and recognition accuracy.

In practical engineering applications, most wheelset bearings extracted from trains remain in good condition, thus resulting in a notably low occurrence of surface defects. Obtaining sample images of a specified size for analysis therefore constitutes a substantial challenge, and it is crucial to develop a methodology capable of efficiently detecting and identifying surface defects on wheelset bearings, given the limited availability of training data and the need for high generalization capabilities. Currently, two primary techniques are employed when dealing with limited samples. First, data-driven methods such as generative adversarial networks (Han et al., 2019) and variational auto-encoders (Yan et al., 2021) are utilized to generate high-quality synthetic samples based on genuine data distribution. Yao et al. (2024) designed a semi-supervised adversarial method using an adversarial training strategy to improve model performance. Secondly, model-based approaches such as transfer learning (Han et al., 2020), meta-learning (He et al., 2024), and regularization techniques (Zhong et al., 2024) are applied to enhance the feature extraction process within network models.

Recent advancements in meta-learning-based few-shot learning techniques have emerged as promising candidates for fault diagnosis, especially in scenarios where a limited number of samples are available. In the realm of bearing fault diagnosis, meta-learning methods are increasingly being utilized for classification tasks that centre on the analysis of vibration signals, whereas applications to image-based diagnosis remain limited. Ma et al. (2023) presented a multi-order graph embedding few-shot model. Wang et al. (2024) developed an attention-centric rotating machinery few-shot fault diagnosis model by integrating internal and external attention mechanisms. Metric-based meta-learning, a representative few-shot learning approach, integrates similarity learning with meta-learning derived from analogous training experiences. This integration facilitates the customization of personalized learners for diverse tasks, thereby expediting the execution of novel tasks. Snell et al. (2017) initially introduced a metric-based prototype network for few-shot image classification. This network maps both labelled and unlabelled samples into a shared space, utilizing the mean of samples within a class as the representative class prototype. Classification occurs through the measurement of Euclidean (EU) distances between unlabelled samples and each respective class prototype. Feng et al. (2022) conducted a comprehensive review of deep meta-learning for fault diagnosis in mechanical equipment, highlighting its benefits in applications with limited sample sizes. Li et al. (2024) developed an adaptive class-augmented prototype network that integrates instance- and representation-level augmentation mechanisms, incorporating adaptive debiased contrastive learning during model training. Guo et al. (2024) proposed a prototype network for diagnostic model construction, utilizing two feature extractors for extracting prototypical and query features separately. Zhou & Yu (2023) introduced a novel weighted prototype network (WPN) for few-shot learning, encompassing feature extraction and prototype modification. This network utilizes a graph neural network to explore the contribution of each sample to its respective class prototype.

The aforementioned research methods offer a viable approach for fault diagnosis of wheelset bearing defect images, utilizing limited samples and prototype networks. Nonetheless, recognizing wheelset bearing defect images in practical settings poses several challenges. First, the presence of dust, grease, rust, and other contaminants generates significant background noise in the images of wheelset bearing defects. Secondly, the irregular shapes of defects result in substantial variations in texture, edges, and other characteristics, introducing inconsistencies within the same type of defects.

Given the aforementioned challenges, along with the impact of limited samples, there is a pressing need to enhance the accuracy and generalization capabilities of existing prototype network models. (1) CNNs are widely employed as the embedding backbone for traditional prototype network (TPN) architectures. While they excel at extracting local features, CNNs struggle with encoding long-range dependencies in images and lack global modelling capabilities, thereby compromising the accuracy of prototype representations. (2) TPN typically employs the mean of the embeddings from samples within a class as the class prototype. However, samples within the same class may contain diverse information and vary in their contributions to the class prototype, rendering the direct averaging of samples an unreasonable approach for deriving the class prototype. (3) The EU distance function is utilized to assess the similarity between query samples and class prototypes in TPN. However, this single-metric approach poses challenges in flexibly and efficiently assessing the distance, which affects the generalization ability of the model.

Based on the preceding discussion, an inception transformer (IFormer)-based WPN is proposed for few-shot recognition of wheelset bearing defect images. First, the IFormer is employed as a feature extractor to extract feature embeddings for samples of different categories. Then, a weighting prototype block is implemented, leveraging an innovative attention mechanism, to develop a weight space that takes into account the contribution of different samples in the same class. Finally, a modified cost function (MCF) is introduced to accurately reflect the correlations between query samples and class prototypes. In summary, the primary contributions of this research are as follows:

  1. A prototype representation feature extractor utilizing the IFormer network is introduced, aiming to capture subtle differences in features among few-shot samples belonging to various categories. The IFormer integrates the strengths of CNNs and self-attention, enabling it to efficiently capture both high-frequency details and low-frequency global information in the feature embedding space.

  2. A weighting prototype block employing the multi-path fusion attention mechanism (MPAM) is devised to assign weights to features of same-class samples and assess their varying contributions to the respective class prototype. The MPAM is complementarily utilized to enhance the representation of class prototypes of interest, achieved through the integration of spatial direction and channel position information.

  3. An MCF, incorporating distance fusion and metric scaling, is devised to improve the robustness of the model's classification process. The MCF can flexibly and efficiently assess the similarity between the query samples and the class prototypes, resulting in more reliable metric scores.

The remainder of this paper is organized as follows. Section 2 reviews related works, and Section 3 presents the proposed model in detail. Section 4 outlines materials and experiment setups. Section 5 shows the experimental validation and discussion of the results in depth. Section 6 gives the conclusion.

2. Related Work

2.1 Meta-learning

Meta-learning encompasses the network's capability to learn from previous tasks' shared knowledge and experiences, utilizing this to facilitate the learning of new tasks. Distinct from traditional deep learning algorithms, the fundamental unit of meta-learning during both training and testing is the episode, with each epoch comprising multiple episodes. An episode comprises a support set and a query set. Specifically, the support set consists of N categories, with each category encompassing K samples. Correspondingly, the query set contains an identical number of categories as the support set. This approach, referred to as ‘N-way, K-shot’ in meta-learning, is utilized by meta-learning classification networks during both the training and testing phases.
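As a concrete illustration of this episodic setup, the following PyTorch-style sketch shows how an 'N-way, K-shot' episode (support set plus query set) could be drawn from a labelled image pool. The function name, tensor layout, and the n_query parameter are illustrative assumptions rather than the authors' implementation.

```python
import torch

def sample_episode(images, labels, n_way=4, k_shot=5, n_query=11):
    """Draw one N-way, K-shot episode (support + query) from a labelled pool.
    images: (M, C, H, W) tensor; labels: (M,) integer class ids in [0, num_classes).
    Returns support/query images and episode-level labels in [0, n_way)."""
    classes = torch.randperm(int(labels.max()) + 1)[:n_way]   # pick N classes at random
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_id, c in enumerate(classes):
        idx = torch.nonzero(labels == c, as_tuple=True)[0]
        idx = idx[torch.randperm(idx.numel())][:k_shot + n_query]
        support_x.append(images[idx[:k_shot]])    # K support samples of this class
        query_x.append(images[idx[k_shot:]])      # n_query query samples of this class
        support_y += [new_id] * k_shot
        query_y += [new_id] * n_query
    return (torch.cat(support_x), torch.tensor(support_y),
            torch.cat(query_x), torch.tensor(query_y))
```

Each training epoch then simply iterates over a number of such episodes.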

2.2 Prototype network

The prototype network, a classical meta-learning classification approach rooted in distance metrics, learns the similarity between samples by computing a distance matrix. The prototype network utilizes a learnable feature extractor to embed samples into a metric space, where the average of extracted features for a given category serves as its prototype representation. The prototype network leverages prior knowledge to attain remarkable classification performance in novel tasks with limited examples.

The prototype network utilizes a data set D = {S,Q}, which comprises a support set S and a query set Q. The embedding module |${f_\phi }$| maps images from the support and query sets into feature embeddings, denoted as |${f_\phi }( x )$|. The learnable parameters of the shared network encompass |$\phi $|, as well as the model's structure, initial parameters, learning rate, and additional configuration settings. Each class of the support set contains n images, and the average of their embeddings serves as the prototype representation, where |${c_k}$| denotes the kth class prototype:

|${c_k} = \frac{1}{n}\sum_{( {x_i^s,y_k^s} ) \in {S_k}} {f_\phi }( {x_i^s} )$| (1)

where |${S_k}{\rm{ = }}\{ {( {x_1^s,y_k^s} ),( {x_2^s,y_k^s} ) \cdots ,( {x_n^s,y_k^s} )} \}$| represents the set of samples belonging to class k in the support set, |${y_k}$| denotes the genuine label associated with class k, and n represents the number of samples in |${S_k}$|.

The query set image |${x^q}$| is processed through the embedding module |${f_\phi }$|⁠, resulting in its feature representation |${f_\phi }( {{x^q}} )$|⁠. Simultaneously, the support set image undergoes the embedding module, yielding the class prototype in accordance with Equation 1. Subsequently, the metric function d is utilized to compute the distance between |${x^q}$| and |${c_k}$|⁠. Ultimately, a softmax distribution of labels specific to the query set image is generated across the support categories. The predicted probability of the query set image |${x^q}$| belonging to class k is represented as:

|${p_\phi }( {y = k|{x^q}} ) = \frac{{\exp ( { - d( {{f_\phi }( {{x^q}} ),{c_k}} )} )}}{{\sum\nolimits_{k'} {\exp ( { - d( {{f_\phi }( {{x^q}} ),{c_{k'}}} )} )} }}$| (2)
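To make Equations 1 and 2 concrete, the following sketch computes class prototypes by averaging support embeddings and converts negative squared EU distances into a softmax distribution over classes. The embedding network producing the feature vectors is left abstract, and the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def class_prototypes(support_feat, support_y, n_way):
    """Equation 1: each class prototype is the mean of its support embeddings.
    support_feat: (N*K, D) embeddings, support_y: (N*K,) labels in [0, n_way)."""
    return torch.stack([support_feat[support_y == k].mean(dim=0)
                        for k in range(n_way)])            # (n_way, D)

def query_probabilities(query_feat, prototypes):
    """Equation 2: softmax over negative (squared) EU distances to the prototypes.
    query_feat: (Q, D), prototypes: (n_way, D) -> (Q, n_way) probabilities."""
    dists = torch.cdist(query_feat, prototypes).pow(2)     # pairwise distances
    return F.softmax(-dists, dim=1)
```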

3. Proposed Methodology

3.1 Inception transformer

The recognition of wheelset bearing images poses significant challenges, including strong noise, low contrast, and blurred boundaries. Additionally, these images contain substantial background information, and the target objects vary in size, ranging from large spalling faults to small pitting faults. It is evident that feature extraction necessitates a consideration of both localized details and global image features. Traditional CNNs in TPN struggle to accurately capture features in complex and few-shot samples. To address this, the IFormer network (Si et al., 2022) is employed as the prototype representation feature extractor to enhance category prototype representation accuracy.

The IFormer network utilizes a hierarchical design with four stages, each incorporating a patch embedding operation and multiple IFormer blocks. A 2D image feature map undergoes a patch embedding operation, resulting in 1D patch embeddings. Each IFormer block consists of two serially connected residual structures, with the inception token mixer (ITM) forming the core of the first residual structure, as illustrated in Figure 2. By integrating CNNs and transformer functionalities, the ITM captures both high- and low-frequency information.

Figure 2: Structure of the IFormer block.

Given an input feature |$X \in {R^{N \times C}}$|, it is separated into |${X_h} \in {R^{N \times {C_h}}}$| and |${X_l} \in {R^{N \times {C_l}}}$| along the channel dimension, where |${C_h} + {C_l} = C$|. Subsequently, |${X_h}$| and |${X_l}$| are assigned to the high- and low-frequency mixers, respectively. |${X_h}$| is partitioned into |${X_{h1}} \in {R^{N \times \frac{{{C_h}}}{2}}}$| and |${X_{h2}} \in {R^{N \times \frac{{{C_h}}}{2}}}$|. |${X_{h1}}$| undergoes a max pooling (MaxPool) layer and a subsequent linear layer, whereas |${X_{h2}}$| undergoes a linear layer and a subsequent depthwise convolution (DWConv) layer, which are shown as:

|${Y_{h1}} = \mathrm{Linear}(\mathrm{MaxPool}({X_{h1}}))$| (3)
|${Y_{h2}} = \mathrm{DWConv}(\mathrm{Linear}({X_{h2}}))$| (4)

where |${Y_{h1}}$| and |${Y_{h2}}$| represent the outputs generated by the high-frequency mixer.

In the low-frequency mixer, an average pooling (AvgPool) layer is employed to reduce the spatial dimensions of |${X_l}$| prior to the application of multi-head self-attention. This design reduces computational complexity while retaining global information, as defined:

|${Y_l} = \mathrm{Upsample}(\mathrm{MHSA}(\mathrm{AvgPool}({X_l})))$| (5)

where |${Y_l}$| represents the output of the low-frequency mixer. Ultimately, the outputs of the low- and high-frequency mixers are concatenated along the channel dimension:

|${Y_c} = \mathrm{Concat}({Y_{h1}},{Y_{h2}},{Y_l})$| (6)

The IFormer block also incorporates a feed-forward network (FFN), akin to conventional transformers, and applies LayerNorm (LN) before both ITM and FFN layers. The formal definition of the IFormer block is:

|$Y = X + \mathrm{ITM}(\mathrm{LN}(X))$| (7)
|$H = Y + \mathrm{FFN}(\mathrm{LN}(Y))$| (8)
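The following PyTorch sketch assembles the ITM described by Equations 3 to 6. The paper defines the mixer on token sequences X ∈ R^{N×C}; for readability the sketch operates on a channel-last 2D feature map, and the kernel sizes, pooling stride, and the upsampling step that restores the low-frequency branch to full resolution are assumptions following the original IFormer design (Si et al., 2022) rather than the exact configuration used here.

```python
import torch
import torch.nn as nn


class InceptionTokenMixer(nn.Module):
    """Sketch of the inception token mixer (Equations 3-6) on a channel-last
    feature map (B, H, W, C); H and W are assumed divisible by `pool` and the
    low-frequency channel count divisible by `num_heads`."""

    def __init__(self, dim, high_ratio=0.5, num_heads=4, pool=2):
        super().__init__()
        ch = int(dim * high_ratio)        # high-frequency channels C_h
        self.ch, self.cl = ch, dim - ch   # low-frequency channels  C_l
        half = ch // 2
        # Equation 3: MaxPool followed by a linear layer
        self.max_pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.fc_h1 = nn.Linear(half, half)
        # Equation 4: linear layer followed by a depthwise convolution
        self.fc_h2 = nn.Linear(half, half)
        self.dwconv = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # Equation 5: AvgPool -> multi-head self-attention -> upsample
        self.avg_pool = nn.AvgPool2d(pool)
        self.attn = nn.MultiheadAttention(self.cl, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool, mode='nearest')

    def forward(self, x):                              # x: (B, H, W, C)
        xh, xl = x[..., :self.ch], x[..., self.ch:]
        xh1, xh2 = xh.chunk(2, dim=-1)
        # high-frequency mixer (Equations 3 and 4)
        yh1 = self.fc_h1(self.max_pool(xh1.permute(0, 3, 1, 2)).permute(0, 2, 3, 1))
        yh2 = self.dwconv(self.fc_h2(xh2).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # low-frequency mixer (Equation 5): pooled global self-attention
        b, h, w, _ = xl.shape
        z = self.avg_pool(xl.permute(0, 3, 1, 2))      # (B, C_l, h/p, w/p)
        hp, wp = z.shape[2:]
        z = z.flatten(2).transpose(1, 2)               # (B, h/p*w/p, C_l)
        z, _ = self.attn(z, z, z)
        z = z.transpose(1, 2).reshape(b, self.cl, hp, wp)
        yl = self.up(z).permute(0, 2, 3, 1)            # back to (B, H, W, C_l)
        # Equation 6: concatenate both mixers along the channel dimension
        return torch.cat([yh1, yh2, yl], dim=-1)
```

A complete IFormer block would wrap this mixer and an FFN in the two LayerNorm-preceded residual branches of Equations 7 and 8.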

3.2 Weighting prototype block

TPN employs Equation 1 to calculate the class prototype, assigning equal weights to each sample in the support set. However, a crucial concern arises from the variation in informational content among samples within the same class, resulting in differing contributions to the class prototype. This can lead to discrepancies between the obtained and the ideal prototypes, thus affecting the accuracy and generalizability of the model. Consequently, an MPAM-based weighting prototype block is designed to evaluate the varying contributions of different feature maps from same-class samples to their respective class prototypes, assigning appropriate weights to each sample in the support set. The detailed structure is illustrated in Figure 3.

Figure 3: The detail of the MPAM-based weighting prototype block.

The traditional squeeze excitation (SE, Hu et al., 2018b) attention mechanism focuses only on channel relationships and re-evaluates the importance of each channel but ignores the spatial information of the features. The convolutional block attention module (CBAM, Woo et al., 2018) calculates spatial and channel-level weights for feature maps independently, resulting in high computational complexity and a lack of interaction between spatial and channel information. Therefore, a novel MPAM is proposed to quantify the varying contributions of same-class samples to their class prototype and assign weights to their feature maps.

First, the weighting prototype block concatenates the sample feature maps of each category and utilizes MPAM to extract the key features of each sample and assess the importance of each sample feature to the category. Secondly, the results from MPAM are used to generate weights via a Sigmoid function and combined with the corresponding sample features to obtain the weighted sample features of the category. Finally, the weighted category sample features are passed through a Flatten layer and averaged to obtain a weighted prototype representation of the category. Notably, MPAM incorporates class-sample features from both horizontal (X) and vertical (Y) directions, achieving the integration of the spatial direction and channel position information. In each direction, MPAM utilizes a convolution layer and an AvgPool layer to aggregate different forms of features. The convolution layer captures fine-grained details, while the AvgPool layer extracts overarching features. This operation enhances MPAM's ability to learn more information in both directions. Additionally, it captures richer non-linear relationships through the use of the novel Mish activation function.

In the support set, suppose that one category contains k samples and each sample's feature map is |$f \in {R^{C \times H \times W}}$|, where C denotes the number of channels, while H and W represent the height and width of the feature map, respectively. Initially, the sample feature maps of the same class are concatenated:

|$F = \mathrm{Concat}({f_1},{f_2}, \cdots ,{f_k})$| (9)

where |$F \in {R^{k \times C \times H \times W}}$|. Subsequently, F is fed into the MPAM. The MPAM initially employs a convolution layer and an AvgPool layer separately to aggregate features in the horizontal direction. The convolution layer captures fine-grained details, while the AvgPool layer extracts overarching features. Consequently, the output for the horizontal component can be formulated as:

|$z_{\textrm{conv}}^X = \mathrm{Conv}^X(F),\quad z_{\textrm{pool}}^X = \mathrm{AvgPool}^X(F)$| (10)

where |$z_{\textrm{conv}}^X \in {R^{k \times C \times H \times 1}}$| represents the output of the convolution layer, while |$z_{\textrm{pool}}^X \in {R^{k \times C \times H \times 1}}$| denotes the output of the AvgPool layer. Subsequently, |$z_{\textrm{conv}}^X$| and |$z_{\textrm{pool}}^X$| are concatenated to integrate the information along the horizontal axis. The proposed MPAM can capture more comprehensive remote dependencies in the horizontal direction while maintaining positional information in the vertical direction. Then, a convolution layer with a 1 × 1 convolution kernel is applied as follows:

|${z^X} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(z_{\textrm{conv}}^X,z_{\textrm{pool}}^X))$| (11)

where |${z^X} \in {R^{k \times C/r \times H \times 2}}$| is the result of the horizontal fusion of features. Here, r represents the reduction ratio (r = 4 is taken in this study). A batch normalization (BN) layer and a non-linear activation layer using the Mish activation function are applied to |${z^X}$|⁠:

|${u^X} = \mathrm{Mish}(\mathrm{BN}({z^X}))$| (12)

where |${u^X} \in {R^{k \times C/r \times H \times 2}}$| represents the intermediate feature map, encoding spatial information along the horizontal axis. The Mish function can be expressed as follows (Misra, 2019):

|$\mathrm{Mish}(x) = x \cdot \tanh (\ln (1 + {e^x}))$| (13)

Compared to traditional activation functions such as ReLU and Swish, Mish provides richer feature expressiveness while maintaining computational efficiency. In addition, Mish has a self-regularization property, which helps to reduce overfitting and improve the generalization of the module. Next, a 1 × 1 convolution layer and an AvgPool layer are applied in series, which can be expressed as:

|${g^X} = \mathrm{AvgPool}(\mathrm{Conv}_{1 \times 1}({u^X}))$| (14)

where |${g^X} \in {R^{k \times C \times H \times 1}}$| represents the horizontal features. The vertical features are processed with the same operations described above, and the processing result is represented as |${g^Y} \in {R^{k \times C \times 1 \times W}}$|. Then, a broadcast addition mechanism is applied to fuse |${g^X}$| and |${g^Y}$|, and the output is subsequently passed through a Sigmoid function:

|$g = \mathrm{Sigmoid}({g^X} \oplus {g^Y})$| (15)

where ⊕ denotes the operation of the broadcast addition mechanism, and |$g \in {R^{k \times C \times H \times W}}$| serves as the attention weights for the input feature map F. Ultimately, the output H of the proposed MPAM block is expressed as:

|$H = g \odot F$| (16)

where ⊙ denotes element-wise multiplication. After MPAM, H is fed into a flatten layer and averaged to obtain the final weighted prototype representation of the class P:

|$P = \frac{1}{k}\sum_{i = 1}^k \mathrm{Flatten}({H_i})$| (17)

where |$P \in {R^{C * H * W}}$|⁠. The detailed procedure of the weighting prototype block is given in Algorithm 1.

Algorithm 1: Pipeline for computing weighting prototype block.
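A minimal PyTorch sketch of this pipeline is given below. The k same-class support feature maps are stacked along the batch dimension, and since the exact kernel size of the directional convolution is not specified in the text, a 3 × 3 depthwise kernel is assumed here.

```python
import torch
import torch.nn as nn


class MPAMWeightingPrototype(nn.Module):
    """Sketch of the MPAM-based weighting prototype block (Equations 9-17)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        c_r = max(channels // reduction, 1)

        def branch():                                   # one direction (X or Y)
            return nn.ModuleDict({
                "conv": nn.Conv2d(channels, channels, 3, padding=1,
                                  groups=channels),     # fine-grained details
                "squeeze": nn.Conv2d(channels, c_r, 1),    # Equation 11
                "bn": nn.BatchNorm2d(c_r),
                "act": nn.Mish(),                          # Equation 13
                "expand": nn.Conv2d(c_r, channels, 1),     # Equation 14
            })

        self.bx, self.by = branch(), branch()
        self.pool_x = nn.AdaptiveAvgPool2d((None, 1))   # collapse width
        self.pool_y = nn.AdaptiveAvgPool2d((1, None))   # collapse height

    def _direction(self, f, b, pool, cat_dim):
        z = torch.cat([pool(b["conv"](f)), pool(f)], dim=cat_dim)  # Equation 10
        u = b["act"](b["bn"](b["squeeze"](z)))                     # Equations 11-12
        return pool(b["expand"](u))                                # Equation 14

    def forward(self, f):
        """f: (k, C, H, W) stacked same-class support features (Equation 9)."""
        gx = self._direction(f, self.bx, self.pool_x, 3)   # (k, C, H, 1)
        gy = self._direction(f, self.by, self.pool_y, 2)   # (k, C, 1, W)
        g = torch.sigmoid(gx + gy)                          # Equation 15 (broadcast add)
        h = f * g                                           # Equation 16
        return h.flatten(1).mean(dim=0)                     # Equation 17: class prototype
```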

3.3 Modified cost function

The EU distance is commonly used to assess the correlation between prototypes in the TPN. However, relying on a single EU distance to determine the prototype representation distance is prone to overfitting under few-shot sample conditions, which affects the generalization ability of the model. In addition, the EU distance between two features is more sensitive to noise and outliers in the high-dimensional space of the data, making it difficult to accurately determine the distance between the query point and the class prototype and reducing the robustness of the model's classification process. Therefore, an MCF incorporating distance fusion and metric scaling is developed to assess the similarity between query samples and class prototypes in a flexible and efficient manner, thereby yielding more reliable metric scores.

The introduction of the L1 distance can improve the tolerance of the EU distance to feature outliers in high-dimensional space. Furthermore, given the similarities in properties between the EU and L1 distances, a sample that is close to its corresponding category prototype yields short distances under both measures, resulting in a smaller fused distance and hence a higher similarity score. Consequently, this facilitates the accurate determination of the sample's categorical affiliation. The distance fusion |${d_{\rm {MCF}}}$| is calculated on the basis of the EU distance |${d_E}$|, with the L1 distance |${d_{L1}}$| introduced as a regularization term, which can be expressed as:

|${d_{\rm {MCF}}}(A,B) = {d_E}(A,B) + \lambda {d_{L1}}(A,B) = \sqrt {\sum_{i = 1}^n {{( {{a_i} - {b_i}} )}^2}} + \lambda \sum_{i = 1}^n | {{a_i} - {b_i}} |$| (18)

where |$A = ( {{a_1},{a_2}, \cdots ,{a_n}} )$| and |$B = ( {{b_1},{b_2}, \cdots ,{b_n}} )$| represent the two features used to calculate the distance fusion |${d_{\rm {MCF}}}$|, and λ denotes a modification parameter for balancing the two distances, which is learned during model training. Utilizing the L1 distance as a regularization term for the EU distance, the distance fusion |${d_{\rm {MCF}}}$| in the MCF can capture the exact similarity between features while improving robustness.

In addition, to enhance the interaction between the metric distance function and the softmax function, a metric scaling factor is further introduced into the similarity metric process of the prototype network. The metric scaling factor has been demonstrated to scale the distance metric and improve the effect of feature clustering and few-shot image classification (Oreshkin et al., 2018). Finally, the proposed MCF is defined according to Equation 2:

|$p( {y = k|{x^q}} ) = \frac{{\exp ( { - \alpha \cdot {d_{\rm {MCF}}}( {{f_\phi }( {{x^q}} ),{c_k}} )} )}}{{\sum\nolimits_{k'} {\exp ( { - \alpha \cdot {d_{\rm {MCF}}}( {{f_\phi }( {{x^q}} ),{c_{k'}}} )} )} }}$| (19)

where |${d_{\rm {MCF}}}$| denotes the distance fusion and α is the scaling factor, which is set to 5 in the experiment by referring to the literature (Kang et al., 2021; Zhang et al., 2022c).
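The complete similarity-scoring step of Equations 18 and 19 can be sketched as follows; λ is treated as a learnable parameter and α as a fixed constant, as described above, while the class name and module structure are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModifiedCostFunction(nn.Module):
    """Fused EU + L1 distance (Equation 18) with metric scaling, turned into
    class probabilities via a softmax (Equation 19)."""

    def __init__(self, lam_init=0.01, alpha=5.0):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(lam_init))   # learnable λ
        self.alpha = alpha                                 # fixed scaling factor α

    def forward(self, query_feat, prototypes):
        """query_feat: (Q, D), prototypes: (K, D) -> (Q, K) probabilities."""
        diff = query_feat.unsqueeze(1) - prototypes.unsqueeze(0)   # (Q, K, D)
        d_eu = diff.pow(2).sum(dim=-1).sqrt()                      # EU distance
        d_l1 = diff.abs().sum(dim=-1)                              # L1 regularizer
        d_mcf = d_eu + self.lam * d_l1                             # Equation 18
        return F.softmax(-self.alpha * d_mcf, dim=1)               # Equation 19
```

During training, the negative log of the probability assigned to the true class would serve as the classification loss, as in the TPN.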

The overall architecture of the proposed IFormer-based WPN is given in Figure 4. Initially, metric learning is utilized to segregate the data set into distinct support and query sets. The IFormer network is responsible for extracting image feature embeddings from both the support set and the query set. Notably, to simplify the network structure, the number of IFormer blocks in the four distinct stages of the IFormer network is set to 1, 1, 3, and 1. Subsequently, the MPAM-based weighting prototype block is utilized to derive the prototype representation of class prototypes from the support set feature maps. Concurrently, the MPAM is employed to refine the embedding of the query points. Following this, the MCF is utilized to evaluate the similarity between the query points and the class prototypes. The predicted labels are then derived by comparing the similarity scores with the corresponding label information. Ultimately, the few-shot classification task is achieved through iterative optimization and parameter updates during the training process.

Figure 4: Overall architecture of the proposed IFormer-based WPN.

4. Materials and Experiment Setups

4.1 Materials and data set

The proposed approach is evaluated using both the publicly available NEU-CLS data set (Bao et al., 2021) and a self-constructed train wheelset bearing defect (TWBD) data set.

  1. NEU-CLS data set. The NEU-CLS data set is a publicly accessible repository of images used for the classification of surface defects in steel plates, released by Northeastern University in Shenyang, China. The data set encompasses six distinct defect types that are commonly observed in hot-rolled steel plates, including crazing (CR), inclusion (IN), patches (PA), pitting (PI), rolled-in scale (RS), and scratches (SC). Each defect category consists of 300 grey-scale images, all with a resolution of 200 × 200 pixels. In the NEU-CLS data set, the defective regions are clearly visible in the sample images, which exhibit high sharpness and low noise. Furthermore, the intra-class defects demonstrate comparable characteristics, while the inter-class defects display significant variations in appearance.

  2. TWBD data set. The TWBD data set is derived from a maintenance workshop dedicated to train wheelset bearings. These bearing defects were attributed to prolonged train operation. Bearings under overhaul are dismantled and arranged on the assembly line. Images of the inner ring, outer ring, and rollers are captured with a handheld camera in a well-lit environment. The process of collecting the TWBD data set is illustrated in Figure 5. We collected 1200 images of 800 × 800 pixels belonging to four different classes, with 300 images per class. The data set includes images of normal bearings (NO) alongside three defect categories: spalling (SP), pitting (PI), and scratches (SC). Figure 6 shows a small number of images of the different types.

Figure 5: The process of collecting the TWBD data set.

Figure 6: Different types of train bearing images: (a) normal, (b) scratches, (c) spalling, (d) pitting.

Both the NEU-CLS and the TWBD data sets come from industrial scenarios, and both involve metal surface damage. The two data sets share some defect categories, such as PI and SC. However, owing to the harsh service environment of wheelset bearings, defects within the same category in the TWBD data set differ in both size and regularity, leading to significant differences in their textures, edges, and other features and thus to inconsistencies among defects of the same type. For instance, the SP category contains many defect images with different sizes, shapes, and manifestations. In addition, the TWBD data set contains substantial environmental noise unrelated to the defects, making the image quality worse than that of the NEU-CLS data set.

For the two mentioned data sets, we randomly split each class of images into meta-training, meta-validation, and meta-testing sets with a ratio of 2:1:1. The input images for both data sets are resized to 224 × 224 pixels. We employ the Adam optimizer (Reddi et al., 2019) with a learning rate of 0.001. The initial modification parameter λ (introduced in Equation 18) is set to 0.01.

We utilize the six-way K-shot modes in the support set to experiment on the NEU-CLS data set and the four-way K-shot modes on the TWBD data set. During the training phase, we set 50 epochs, with each epoch comprising 30 randomly sampled meta-train tasks from the meta-training set, amounting to a total of 1500 meta-train tasks. Each meta-train task involves 11 images per class in the query set, resulting in a total of K + 11 images per class. After each epoch, we perform 30 meta-validation tasks randomly selected from the meta-validation set to calculate the average validation accuracy, ensuring that the number of images in the query set remains consistent with that of meta-training. During the testing phase, we maintain consistency in the number of images in the query set and perform 300 randomly sampled meta-test tasks from the meta-testing set to evaluate the generalization accuracy of the proposed model.
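For clarity, the episode bookkeeping described above reduces to a few lines of arithmetic; the four-way 10-shot values below are used purely as an example.

```python
# Episode bookkeeping for the settings above (four-way 10-shot on TWBD as an example).
n_way, k_shot, n_query = 4, 10, 11           # 11 query images per class, as stated
epochs, tasks_per_epoch = 50, 30

meta_train_tasks = epochs * tasks_per_epoch  # 50 * 30 = 1500 meta-train tasks
images_per_task = n_way * (k_shot + n_query) # 4 * (10 + 11) = 84 images per task
print(meta_train_tasks, images_per_task)     # -> 1500 84
```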

4.2 Implementation details and evaluation metrics

The experiments are conducted on a computer equipped with a 13th Generation Intel(R) Core(TM) i9-13900K CPU and an NVIDIA GeForce RTX 4090 GPU with 24 GB memory. PyTorch 2.1.0 with Python 3.11.5 is utilized as the deep learning framework for our experiments. To evaluate the performance of the model, accuracy (ACC), precision (PRE), recall (RE), and the F1 score are selected as the metrics for analysing the results. The detailed calculation procedure for these evaluation metrics can be found in the literature (Jiang et al., 2024). To reduce the randomness of the evaluation analysis, all methods below are tested 10 times.
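The four metrics can be computed, for example, with scikit-learn as sketched below; macro averaging over classes is an assumption here, since the exact averaging scheme follows the cited literature.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return ACC, PRE, RE, and F1 for multi-class predictions.
    Macro averaging over classes is assumed here."""
    acc = accuracy_score(y_true, y_pred)
    pre, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return acc, pre, rec, f1
```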

5. Experiments

5.1 Experiment results

In this section, the proposed model is initially evaluated on the NEU-CLS data set for few-shot classification experiments. The confusion matrices corresponding to the six-way test results with 1-, 5-, and 10-shot settings are presented in Figure 7(a)–(c), respectively. For the 1-, 5-, and 10-shot settings, the test classification accuracies are 82.42%, 97.05%, and 99.47%, respectively. It can be found that classification on the NEU-CLS data set is more accurate for the CR, IN, PA, and SC categories, whereas the results for PI and RS contain more errors. The test results on the NEU-CLS data set demonstrate that the proposed model can achieve high classification accuracy.

Figure 7: Confusion matrices of classification results of the NEU-CLS and the TWBD data sets: (a) NEU-CLS six-way 1-shot, (b) NEU-CLS six-way 5-shot, (c) NEU-CLS six-way 10-shot, (d) TWBD four-way 1-shot, (e) TWBD four-way 5-shot, and (f) TWBD four-way 10-shot.

Experimental analyses are further conducted on the self-constructed TWBD data set in order to validate the generality of our proposed model. The confusion matrices for the four-way test results using 1-, 5-, and 10-shot learning are depicted in Figure 7(d)–(f), respectively. Test classification accuracy is 64.32% for the 1-shot setting, 90.12% for the 5-shot setting, and 98.64% for the 10-shot setting. For the TWBD data set, the classification results are accurate for SP and NO but more erroneous for PI and SC. The main reason is that the image features of the SP and NO categories are more distinctive and easier to identify than those of PI and SC. Moreover, PI and SC images contain a lot of similar information, which can interfere with the model's judgement and lead to incorrect classification. The test results on the TWBD data set demonstrate that the proposed model achieves high accuracy in classifying four types of few-shot samples of train wheelset bearings.

5.2 Effect of the proposed IFormer-based feature extractor

To further demonstrate the advantages of the proposed IFormer-based feature extractor, four commonly used state-of-the-art backbones, namely VGG16 (Yang et al., 2021), GoogleNet (Yang et al., 2023), ResNet18 (Liu et al., 2021a), and VIT-T (Han et al., 2022), are selected for a comparative analysis. To ensure a fair comparison, only the IFormer network in the proposed model is substituted by VGG16, GoogleNet, ResNet18, and VIT-T successively, while the other modules and hyperparameters remain the same. Figure 8 presents the test results for different feature extractors on the NEU-CLS and TWBD data sets, evaluated under the 10-shot setting.

Figure 8: Effect of different feature extractors: (a) NEU-CLS, and (b) TWBD.

The evaluation metrics of VGG16 and GoogleNet, both utilizing traditional convolutional operations, exhibit comparable performance and surpass VIT-T. ResNet18 yields better test results than both VGG16 and GoogleNet, attributed to its residual construction that effectively alleviates gradient vanishing and enhances feature extraction capabilities. VIT-T solely relies on the self-attention mechanism and necessitates a substantial amount of training samples to attain optimal network performance. Consequently, VIT-T exhibits the poorest performance among all methods under few-shot testing conditions. The proposed method utilizes the IFormer network, which integrates the advantages of CNNs and self-attention mechanisms to enhance performance and attain the best test results. Furthermore, the ROC curves for comparative analyses of the TWBD data set utilizing various feature extractors are generated; the outcomes are depicted in Figure 9. Notably, the IFormer network exhibits superior performance in terms of reconciling true positive rates with false positive rates. The preceding analyses emphasize the capacity of the IFormer-based feature extractor to efficiently discern subtle variations among few-shot samples.

Figure 9: ROC curves on each category comparison under different feature extractors for the TWBD data set: (a) spalling, (b) pitting, (c) scratches, and (d) normal.

5.3 Comparison of few-shot classification models

In order to further validate the superiority of the proposed method, six state-of-the-art few-shot learning methods, including match network (MatchNet; Zhang et al., 2021), relation network (RelateNet; Hu et al., 2018a), TPN (Snell et al., 2017), DeepEMD (Zhang et al., 2022a), DeepBDC (Xie et al., 2022), and WPN (Zhou & Yu, 2023), are selected to compare the few-shot classification performances of different meta-learning models with that of our proposed approach. Figure 10a and b presents the test results for the six-class NEU-CLS data set with 5- and 10-shot learning configurations, while Figure 10c and d illustrates the test results for the four-class TWBD data set.

Figure 10: Performance comparison of different few-shot learning methods: (a) NEU-CLS six-way 5-shot, (b) NEU-CLS six-way 10-shot, (c) TWBD four-way 5-shot, and (d) TWBD four-way 10-shot.

It can be seen that the proposed model achieves the best results in evaluation metrics ACC, PRE, RE, and F1 compared to other few-shot learning methods in the comparative analyses under different setting conditions in Figure 10. Specifically, the proposed model improves the accuracy by 3.11% for the six-way 5-shot, 3.86% for the six-way 10-shot, 9.89% for the four-way 5-shot, and 10.94% for the four-way 10-shot, respectively, compared with the baseline model TPN. Compared to the second-best DeepBDC model, the proposed model improves the accuracy by 1.22% for the six-way 5-shot, 0.53% for the six-way 10-shot, 3.62% for the four-way 5-shot, and 3.96% for the four-way 10-shot, respectively. The proposed model comprehensively employs the IFormer backbone network, MPAM-based weighting prototype block, and the MCF, which work synergistically to significantly improve the classification performance.

5.4 Comparison of different attention mechanisms in the weighting prototype block

The MPAM is a crucial component of the proposed weighting prototype block. Hence, it is necessary to verify the superiority of the MPAM further. Two state-of-the-art attention mechanisms, namely CBAM (Woo et al., 2018) and coordinate attention (CA; Hou et al., 2021), are incorporated into the weighting prototype block, leading to the development of a CBAM-based weighting prototype block and a CA-based weighting prototype block in our approach. Figure 11a and b presents the classification results for the NEU-CLS data set under the 5- and 10-shot settings, while Figure 11c and d shows the results for the TWBD data set.

Figure 11: The classification results of different attention mechanisms: (a) NEU-CLS six-way 5-shot, (b) NEU-CLS six-way 10-shot, (c) TWBD four-way 5-shot, and (d) TWBD four-way 10-shot.

Compared to the CBAM- and CA-based weighting prototype blocks, the MPAM-based weighting prototype block exhibits superior performance in terms of ACC, PRE, RE, and F1 under both 5- and 10-shot settings. The experimental results demonstrate that the MPAM-based weighting prototype block can achieve the integration of spatial direction and channel position information, thus significantly enhancing the representation of class prototypes of interest.

5.5 Comparison of different distances

In order to verify the advantage of the proposed MCF method, a set of comparison experiments containing the L1 distance, the EU distance, and the proposed MCF is designed. Test results on the two data sets with 5- and 10-shot learning configurations are shown in Figure 12. It can be seen that the classification accuracy of the model using the EU distance is higher than that using the L1 distance. However, due to the poor robustness of the EU distance in high-dimensional space, its stability is lower than that of the L1 distance. The MCF introduces the L1 distance as a regularization term for the EU distance, together with a metric scaling factor, which improves robustness while capturing the exact similarity between features, and thus achieves better results than the L1 and EU distances in terms of both accuracy and stability.

Figure 12: Performance comparison of different distances.

5.6 Ablation study

This section presents an in-depth ablation study, examining the contributions of the IFormer-based feature extractor, the weighting prototype block, and the MCF within the proposed network architecture. A comprehensive set of ablation experiments on the TWBD data set is conducted, adhering to a setting of 10-shot learning. The ablation study comprises seven experiments, each based on a distinct network framework. The first experiment is labelled M1, a baseline network based on TPN (Snell et al., 2017), which employs four convolution layers as the feature extractor. M2 represents the baseline network incorporating the IFormer-based feature extractor. M3 is the baseline network with the weighting prototype block. M4 represents the baseline network with the MCF. M5 is the baseline network that integrates both the IFormer and the weighting prototype block. M6 is the baseline network incorporating both the IFormer and the MCF. M7 represents the proposed network model. The results of the ablation study are depicted in Figure 13. M1 exhibits the lowest values across all four evaluation metrics. The inclusion of the weighting prototype block and the MCF results in a 4.91% and 1.13% improvement in classification accuracy for M5 and M6, respectively, compared to M2. Similarly, the PRE, RE, and F1 scores illustrate a consistent trend. The results for M3 and M4 show that introducing the weighting prototype block or the MCF alone also significantly improves the performance of the baseline model. It is noted that the weighting prototype block contributes significantly to enhancing the model's performance. The proposed method M7 achieves the best performance across all four evaluation metrics, underlining the essential role of the IFormer network, weighting prototype block, and MCF in the proposed model.

Figure 13: The results of ablation experiments: (a) ACC, (b) PRE, (c) RE, and (d) F1.

5.7 Visualization analysis

Since the deep neural network is a black box, it may yield correct classification outcomes despite inaccurate decision-making processes. Consequently, to delve deeper into the model's structural properties, we employ the t-distributed stochastic neighbour embedding (t-SNE) technique for visual analysis, showcasing various stages of image feature learning and clustering within the model. This visualization encompasses four key stages of our model: model input, after IFormer, after weighting prototype block, and after MCF. The feature distribution diagrams for NEU-CLS and TWBD data sets under the 10-shot setting are presented in Figures 14 and 15, respectively. Diverse colours and shapes of dots are utilized to signify distinct image categories within both data sets.

Figure 14: Visualization results of the proposed model in NEU-CLS: (a) input, (b) after IFormer, (c) after weighting prototype block, and (d) after MCF.

Figure 15: Visualization results of the proposed model in TWBD: (a) input, (b) after IFormer, (c) after weighting prototype block, and (d) after MCF.

It can be observed that numerous samples belonging to distinct classes overlap during the model input stage in Figure 14. Following the IFormer network, various image features are progressively segregated. Following the weighting prototype block, despite a minor degree of feature overlap, the distinct category features are markedly concentrated within a narrow range, notably for the CR, RS, PI, and SC categories. This validates that the proposed MPAM-based weighting prototype block possesses excellent category recognition capabilities within a limited feature space of few-shot samples. A similar clustering pattern of category features is evident in Figure 15. The proposed MPAM can not only obtain more representative weighted class prototypes but also yield query sample features carrying more effective information, thus achieving better clustering results in the visualization analysis. After the application of the weighting prototype block, the spacing between features within the same category is markedly diminished, while clustering results among different categories become more distinct. Ultimately, following the implementation of the MCF, image features belonging to distinct categories from both NEU-CLS and TWBD exhibit enhanced clustering and separation.

6. Conclusions

Current maintenance procedures for high-speed train wheelset bearings primarily rely on manual visual inspection and empirical assessment. However, the low incidence and high randomness of bearing failures make acquiring large-scale image samples challenging. Consequently, deep learning-based machine vision methods face reduced accuracy and limited generalization under few-shot sample conditions. To address this, an IFormer-based WPN is proposed, leveraging the advantages of the IFormer network, an MPAM-based weighting prototype block, and the MCF framework for few-shot classification of wheelset bearing defect images. Specifically, the IFormer acts as a feature extractor in the prototype representation space. An MPAM-based weighting prototype block is proposed to reweight the features of same-class samples during prototype computation, while the MCF efficiently evaluates the similarity between query samples and class prototypes.

Experimental results demonstrate that the proposed approach outperforms state-of-the-art few-shot learning models across various performance metrics. The model achieves significant improvements in classification accuracy under few-shot sample conditions. Ablation experiments and visualization analyses validate the model's effectiveness, showcasing exceptional classification and generalization capabilities in few-shot learning tasks. Despite the promising results, the proposed model has limitations, particularly in terms of the scope of the data sets used and the model's ability to generalize across different bearing types. Further investigation is needed to explore how well the approach can adapt to more diverse data sets and real-time monitoring systems. Future research should focus on expanding the model's applicability to different bearing types and fault scenarios, as well as improving its performance on real-time data.

Conflicts of Interest

The authors declare no conflict of interest.

Author Contributions

Feiyue Deng (Conceptualization, Formal analysis, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing), Zeheng Huang (Data curation), Rujiang Hao (Supervision), Xiaohui Gu (Funding acquisition), and Shaopu Yang (Supervision)

Funding

This work is supported by the National Natural Science Foundation of China (Grants 12272243 and 12372056).

Data Availability

Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.

Acknowledgments

Thanks to all co-authors for their hard work.

References

Dosovitskiy A. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Bao Y., Song K., Liu J., Wang Y., Yan Y., Yu H., Li X. (2021). Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Transactions on Instrumentation and Measurement, 70, 1–11.

Ding X., Li Y., Xiao J., He Q., Yang X., Shao Y. (2022). Parametric Doppler correction analysis for wayside acoustic bearing fault diagnosis. Mechanical Systems and Signal Processing, 166, 108375.

Feng Y., Chen J., Xie J., Zhang T., Lv H., Pan T. (2022). Meta-learning as a promising approach for few-shot cross-domain fault diagnosis: Algorithms, applications, and prospects. Knowledge-Based Systems, 235, 107646.

Glowacz A., Glowacz W., Glowacz Z., Kozik J. (2018). Early fault diagnosis of bearing and stator faults of the single-phase induction motor using acoustic signals. Measurement, 113, 1–9.

Guo Z., Ao S., Ao B. (2024). Few-shot learning based oral cancer diagnosis using a dual feature extractor prototypical network. Journal of Biomedical Informatics, 150, 104584.

Han K., Wang Y., Chen H., Chen X., Guo J., Liu Z., Tang Y., Xiao A., Xu C., Xu Y., Yang Z., Zhang Y., Tao D. (2022). A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 87–110.

Han T., Liu C., Yang W., Jiang D. (2019). A novel adversarial learning framework in deep convolutional neural network for intelligent diagnosis of mechanical faults. Knowledge-Based Systems, 165, 474–487.

Han T., Liu C., Yang W., Jiang D. (2020). Deep transfer network with joint distribution adaptation: A new intelligent fault diagnosis framework for industry application. ISA Transactions, 97, 269–281.

He D., Zhang Z., Jin Z., Zhang F., Yi C., Liao S. (2025). RTSMFFDE-HKRR: A fault diagnosis method for train bearing in noise environment. Measurement, 239, 115417.

He Q., Wang J., Hu F., Kong F. (2013). Wayside acoustic diagnosis of defective train bearings based on signal resampling and information enhancement. Journal of Sound and Vibration, 332, 5635–5649.

He X., Chen Y., Lin Z. (2021). Spatial-spectral transformer for hyperspectral image classification. Remote Sensing, 13, 498.

He Y., He D., Lao Z., Jin Z., Miao J., Lai Z., Chen Y. (2024). Few-shot fault diagnosis of turnout switch machine based on flexible semi-supervised meta-learning network. Knowledge-Based Systems, 294, 111746.

Hnewa M., Radha H. (2021). Multiscale domain adaptive YOLO for cross-domain object detection. In 2021 IEEE International Conference on Image Processing (ICIP) (pp. 3323–3327). IEEE.

Hong C., Chen L., Liang Y., Zeng Z. (2021). Stacked capsule graph autoencoders for geometry-aware 3D head pose estimation. Computer Vision and Image Understanding, 208, 103224.

Hou Q., Zhou D., Feng J. (2021). Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13713–13722).

Hu H., Gu J., Zhang Z., Dai J., Wei Y. (2018a). Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3588–3597).

Hu J., Shen L., Sun G. (2018b). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141).

Huang W., Sun H., Luo J., Wang W. (2019). Periodic feature oriented adapted dictionary free OMP for rolling element bearing incipient fault diagnosis. Mechanical Systems and Signal Processing, 126, 137–160.

Jiang P., Liu J., Feng J., Chen H., Chen Y., Li C., Cao D. (2024). Interpretable detector for cervical cytology using self-attention and cell origin group guidance. Engineering Applications of Artificial Intelligence, 134, 108661.

Kang D., Kwon H., Min J., Cho M. (2021). Relational embedding for few-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8822–8833).

Kim H., Park C. H., Suh C., Chae M., Yoon H., Youn B. D. (2023). MPARN: Multi-scale path attention residual network for fault diagnosis of rotating machines. Journal of Computational Design and Engineering, 10, 860–872.

Lei L. (2004, June). A machine vision system for inspecting bearing-diameter. In Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No. 04EX788) (Vol. 5, pp. 3904–3906). IEEE.

Li C., Liu Y., Liao Y., Wang J. (2022a). A VME method based on the convergent tendency of VMD and its application in multi-fault diagnosis of rolling bearings. Measurement, 198, 111360.

Li X., Shao H., Lu S., Xiang J., Cai B. (2022b). Highly efficient fault diagnosis of rotating machinery under time-varying speeds using LSISMM and small infrared thermal images. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 52, 7328–7340.

Li H., Li C., Li G., Chen L. (2021). A real-time table grape detection method based on improved YOLOv4-tiny network in complex background. Biosystems Engineering, 212, 347–359.

Li R., Zhong J., Hu W., Dai Q., Wang C., Wang W., Li X. (2024). Adaptive class augmented prototype network for few-shot relation extraction. Neural Networks, 169, 134–142.

Li Y., Zuo M. J., Chen Y., Feng K. (2018). An enhanced morphology gradient product filter for bearing fault detection. Mechanical Systems and Signal Processing, 109, 166–184.

Liu Y., She G. R., Chen S. X. (2021a). Magnetic resonance image diagnosis of femoral head necrosis based on ResNet18 network. Computer Methods and Programs in Biomedicine, 208, 106254.

Liu Z., Yang S., Liu Y., Lin J., Gu X. (2021b). Adaptive correlated Kurtogram and its applications in wheelset-bearing system fault diagnosis. Mechanical Systems and Signal Processing, 154, 107511.

Lu C., Wang Y., Ragulskis M., Cheng Y. (2016). Fault diagnosis for rotating machinery: A method based on image processing. PLoS ONE, 11, e0164111.

Ma W., Liu R., Guo J., Wang Z., Ma L. (2023). A collaborative central domain adaptation approach with multi-order graph embedding for bearing fault diagnosis under few-shot samples. Applied Soft Computing, 140, 110243.

Misra D. (2019). Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681.

Oh H., Lee Y., Lee J., Joo C., Lee C. (2022). Feature selection algorithm based on density and distance for fault diagnosis applied to a roll-to-roll manufacturing system. Journal of Computational Design and Engineering, 9, 805–825.

Oreshkin B., Rodríguez López P., Lacoste A. (2018). TADAM: Task dependent adaptive metric for improved few-shot learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (pp. 719–729).

Reddi S. J., Kale S., Kumar S. (2019). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.

Ronneberger O., Fischer P., Brox T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference Proceedings, Part III 18 (pp. 234–241). Springer International Publishing.

Shen H., Li S., Gu D., Chang H. (2012). Bearing defect inspection based on machine vision. Measurement, 45, 719–733.

Shen H., Zhu C., Li S., Chang H. (2013). A machine vision system for bearing greasing procedure. In Proceedings of 2013 Chinese Intelligent Automation Conference: Intelligent Automation & Intelligent Technology and Systems (pp. 309–316). Springer.

Si C., Yu W., Zhou P., Zhou Y., Wang X., Yan S. (2022). Inception transformer. Advances in Neural Information Processing Systems, 35, 23495–23509.

Snell J., Swersky K., Zemel R. (2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, 30, 4077–4087.

Wang Y., Xiang J., Markert R., Liang M. (2016). Spectral kurtosis for fault detection, diagnosis and prognostics of rotating machines: A review with applications. Mechanical Systems and Signal Processing, 66, 679–698.

Wang Z., Ding Y., Han T., Xu Q., Yan H., Xie M. (2024). Adaptive attention-driven few-shot learning for robust fault diagnosis. IEEE Sensors Journal, 24, 26034–26043.

Woo S., Park J., Lee J. Y., Kweon I. S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3–19).

Wu J., He D., Li J., Miao J., Li X., Li H., Shan S. (2024). Temporal multi-resolution hypergraph attention network for remaining useful life prediction of rolling bearings. Reliability Engineering & System Safety, 247, 110143.

Wu T., Wu H., Du Y., Peng Z. (2013). Progress and trend of sensor technology for on-line oil monitoring. Science China Technological Sciences, 56, 2914–2926.

Xie J., Long F., Lv J., Wang Q., Li P. (2022). Joint distribution matters: Deep Brownian distance covariance for few-shot classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7972–7981).

Yan X., She D., Xu Y., Jia M. (2021). Deep regularized variational autoencoder for intelligent fault diagnosis of rotor–bearing system within entire life-cycle process. Knowledge-Based Systems, 226, 107142.

Yang H., Ni J., Gao J., Han Z., Luan T. (2021). A novel method for peanut variety identification and classification by improved VGG16. Scientific Reports, 11, 15756.

Yang L., Yu X., Zhang S., Long H., Zhang H., Xu S., Liao Y. (2023). GoogLeNet based on residual network and attention mechanism identification of rice leaf diseases. Computers and Electronics in Agriculture, 204, 107543.

Yao J., Chang Z., Han T., Tian J. (2024). Semi-supervised adversarial deep learning for capacity estimation of battery energy storage systems. Energy, 294, 130882.

Zeng L., Sun B., Zhu D. (2021). Underwater target detection based on faster R-CNN and adversarial occlusion network. Engineering Applications of Artificial Intelligence, 100, 104190.

Zhang C., Cai Y., Lin G., Shen C. (2022a). DeepEMD: Differentiable earth mover's distance for few-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 5632–5648.

Zhang D., Hao X., Liang L., Liu W., Qin C. (2022b). A novel deep convolutional neural network algorithm for surface defect detection. Journal of Computational Design and Engineering, 9, 1616–1632.

Zhang Y., Li W., Zhang M., Wang S., Tao R., Du Q. (2022c). Graph information aggregation cross-domain few-shot learning for hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems, 35, 1912–1925.

Zhang Z., Liu Y., Wang X., Li B., Hu W. (2021). Learn to match: Automatic matching network design for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13339–13348).

Zhao C., Wan X., Zhao G., Yan Y. (2017). Spectral–spatial classification of hyperspectral images using trilateral filter and stacked sparse autoencoder. Journal of Applied Remote Sensing, 11, 016033.

Zhong J., Yang Y., Mao H., Qin A., Li X., Tang W. (2024). Contrastive regularization guided label refurbishment for fault diagnosis under label noise. Advanced Engineering Informatics, 61, 102478.

Zhou Y., Yu L. (2023). Few-shot learning via weighted prototypes from graph structure. Pattern Recognition Letters, 176, 230–235.

Zhu J., He D., Bechhoefer E. (2013). Survey of lubrication oil condition monitoring, diagnostics, and prognostics techniques and systems. Journal of Chemical Science and Technology, 2, 100–115.
