Bevel gear transmissions are widely employed in mechanical systems where intersecting shafts are required, valued for their high efficiency, smooth operation, compact design, and substantial torque-bearing capacity. However, operating within enclosed spaces under heavy loads, bevel gears are highly susceptible to faults, which can lead to significant property damage and even catastrophic accidents. The meshing process of bevel gears, which changes the axis of rotation, inherently exhibits characteristics of non-stationarity, strong nonlinearity, and high background noise. Traditional fault diagnosis methods rely heavily on expert domain knowledge and manual feature extraction, which are often time-consuming and labor-intensive in practical applications.
Since the formal conceptualization of deep learning in 2006, its research and application have flourished globally. Among various architectures, the Convolutional Neural Network (CNN), with its powerful capability for automatic, hierarchical nonlinear feature extraction and end-to-end parameter learning, has alleviated key limitations of traditional methods, namely the high labor cost, limited precision, and poor adaptability that stem from experience-dependent analysis. Researchers have successfully applied CNNs to gear and bearing fault diagnosis. For instance, some have combined Variational Mode Decomposition (VMD) with CNNs to tackle the difficulty of extracting weak fault features in planetary gearboxes. Others have converted one-dimensional vibration signals into two-dimensional grayscale images for CNN processing to improve model efficiency and recognition rates. Further studies have explored CNN-based fault diagnosis for wind turbine rolling bearings.
While promising, most existing CNN-based fault diagnosis methods operate under the assumption of having sufficient, labeled training samples under specific conditions. In real-world industrial scenarios, especially during the initial operational phase of new equipment or for rare failure modes, obtaining a large number of fault samples is challenging, leading to a “few-sample” problem. When training data is scarce, deep learning models are prone to overfitting, resulting in poor generalization performance for unseen data. Therefore, the issue of data imbalance and insufficiency is a critical challenge in mechanical fault diagnostics.
To address this gap, this research focuses on the fault diagnosis of bevel gears under variable speed conditions with limited samples. I propose a diagnostic framework that combines data augmentation strategies with a deep 1D Convolutional Neural Network (1D-CNN) and transfer learning. The core idea is to leverage the temporal and periodic nature of vibration signals to artificially expand the dataset, construct a deep network capable of learning robust features from this enhanced data, and then transfer the learned knowledge to diagnose faults under new, unseen operating speeds with minimal target-domain samples.
Methodology
1. Theoretical Foundations
1.1 Convolutional Neural Network (CNN)
A CNN is a specialized deep neural network designed for processing data with a grid-like topology, such as time-series signals or images. Its key feature is the ability to learn spatial/temporal hierarchies of features through convolutional layers and down-sampling operations. The primary components include:
Convolutional Layer: This layer applies a set of learnable filters (kernels) to the input data. Each filter slides (convolves) across the input, computing dot products to produce a feature map, highlighting specific patterns or features. The operations exhibit sparse connectivity and weight sharing, dramatically reducing parameters. The output of the $i$-th convolutional layer can be expressed as:
$$ \mathbf{X}_i = f(\mathbf{a}_i * \mathbf{X}_{i-1} + \mathbf{b}_i) $$
where $\mathbf{X}_{i-1}$ is the input matrix, $*$ denotes the convolution operation, $\mathbf{a}_i$ and $\mathbf{b}_i$ are the learnable weight matrix and bias vector for the layer, and $f(\cdot)$ is a nonlinear activation function.
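As a concrete illustration, the layer computation above reduces, for a single filter, to a cross-correlation plus a bias followed by the activation. The following is a minimal NumPy sketch (one filter only; the numbers are illustrative, not taken from the experiments):

```python
import numpy as np

def relu(x):
    """The nonlinear activation f(.) in the formula above."""
    return np.maximum(0.0, x)

def conv1d_single_filter(x, w, b):
    """One filter of a 1D convolutional layer: f(a * X + b).
    Note: deep-learning 'convolution' is implemented as cross-correlation."""
    z = np.correlate(x, w, mode="valid") + b
    return relu(z)

x = np.array([0.0, 1.0, 0.0, 2.0, 1.0])  # input X_{i-1}
w = np.array([1.0, -1.0, 0.0])           # learnable kernel a_i
out = conv1d_single_filter(x, w, b=0.0)
print(out)  # feature map: [0. 1. 0.]
```

The same weights slide over every window of the input (weight sharing), which is why the parameter count is independent of the signal length.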
Pooling Layer: Following convolutional layers, pooling layers perform down-sampling to reduce dimensionality, control overfitting, and introduce translational invariance. Max pooling, which outputs the maximum value from a set of adjacent neurons, is commonly used and is defined for the $k$-th feature map at layer $i$ as:
$$ P_i^k(j) = \max_{(j-1)S + 1 \leq t \leq jS} \{ q_i^k(t) \}, \quad j=1,2,\ldots,n $$
Here, $P_i^k(j)$ is the $j$-th pooled output, $q_i^k(t)$ is the $t$-th value in the pooling region, and $S$ is the width of the pooling region (stride).
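The pooling definition above can be checked with a short NumPy sketch (non-overlapping windows of width $S$, values illustrative):

```python
import numpy as np

def max_pool1d(q, S):
    """Non-overlapping max pooling with region width S, matching
    P(j) = max over the j-th window {q((j-1)S+1), ..., q(jS)}."""
    n = len(q) // S                      # number of pooled outputs
    return q[: n * S].reshape(n, S).max(axis=1)

q = np.array([0.0, 1.0, 0.0, 2.0, 1.0, 3.0])  # feature map q_i^k
print(max_pool1d(q, S=2))                     # [1. 2. 3.]
```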
Fully Connected (FC) Layer & Output Layer: After several convolutional and pooling layers, the high-level abstract features are flattened and fed into one or more FC layers, which integrate these features for the final classification task. The output layer typically uses a Softmax function to convert the FC layer’s outputs into a probability distribution over the predefined fault classes.
Activation Function: The Rectified Linear Unit (ReLU), defined as $f(x) = \max(0, x)$, is widely used as the activation function in CNNs due to its efficiency in accelerating convergence and mitigating the vanishing gradient problem.
1.2 Transfer Learning
Transfer learning is a machine learning paradigm that aims to improve learning in a target domain/task by transferring knowledge from a related but different source domain/task. This is particularly valuable when the target domain has limited labeled data. Formally, let a domain $D$ consist of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$ where $X \in \mathcal{X}$. A task $T$ consists of a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$ learned from the data. Given a source domain $D_s$ with a corresponding task $T_s$, and a target domain $D_t$ with task $T_t$, transfer learning seeks to help improve the learning of the target predictive function $f_t(\cdot)$ in $D_t$ using knowledge gained from $D_s$ and $T_s$, where $D_s \neq D_t$ or $T_s \neq T_t$.
In the context of fault diagnosis for bevel gears, the source domain could be vibration data collected at one set of operating speeds (with sufficient labels), and the target domain could be data from a different set of speeds (with few labels). A model pre-trained on the source data can have its high-level parameters fine-tuned on the small target dataset, enabling effective diagnosis for the new condition.
2. Proposed 1D-CNN Model and Transfer Strategy for Bevel Gears
I propose a 1D-CNN architecture specifically designed for processing the raw one-dimensional vibration signals of bevel gears. The model is built with the PyTorch deep learning framework and adopts a deep yet efficient structure containing 13 trainable layers.
2.1 Network Architecture
The proposed network, termed 1CNN, consists of five convolution–pooling modules followed by fully connected layers. This design enables progressive feature extraction from the raw time-domain signal. A key design choice is the use of a large kernel in the first convolutional layer to capture broad initial patterns, followed by smaller kernels in the deeper layers to refine feature extraction. The detailed structure is organized into six modules (M1 to M6), as shown in Table 1.
| Module | Layer | Type | Kernel Size | # Filters | Stride | Padding |
|---|---|---|---|---|---|---|
| M1 | 1 | Conv1D | 64×1 | 16 | 16 | Yes |
| | 2 | BatchNorm | – | – | – | – |
| | 3 | MaxPool1D | 2×1 | – | 2 | – |
| | 4 | Dropout (0.1) | – | – | – | – |
| M2 | 5 | Conv1D | 3×1 | 32 | 1 | Yes |
| | 6 | BatchNorm | – | – | – | – |
| | 7 | MaxPool1D | 2×1 | – | 2 | – |
| | 8 | Dropout (0.2) | – | – | – | – |
| M3 | 9 | Conv1D | 3×1 | 64 | 1 | Yes |
| | 10 | BatchNorm | – | – | – | – |
| | 11 | MaxPool1D | 2×1 | – | 2 | – |
| | 12 | Dropout (0.2) | – | – | – | – |
| M4 | 13 | Conv1D | 3×1 | 64 | 1 | Yes |
| | 14 | BatchNorm | – | – | – | – |
| | 15 | MaxPool1D | 2×1 | – | 2 | – |
| | 16 | Dropout (0.2) | – | – | – | – |
| M5 | 17 | Conv1D | 3×1 | 64 | 1 | Yes |
| | 18 | BatchNorm | – | – | – | – |
| | 19 | MaxPool1D | 2×1 | – | 2 | – |
| | 20 | Dropout (0.2) | – | – | – | – |
| M6 | 21 | Fully Connected | 1000 neurons | – | – | – |
| | 22 | BatchNorm | – | – | – | – |
| | 23 | Dropout (0.5) | – | – | – | – |
| | 24 | Fully Connected | 100 neurons | – | – | – |
| | 25 | BatchNorm | – | – | – | – |
| | 26 | Dropout (0.5) | – | – | – | – |
| | 27 | Softmax Output | # fault classes | – | – | – |
Batch Normalization (BatchNorm) is used after convolutional and FC layers to stabilize and accelerate training. Dropout layers are strategically placed to prevent overfitting by randomly dropping a fraction of neuron connections during training.
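The Table 1 architecture can be sketched in PyTorch roughly as follows. This is a sketch, not the exact implementation used in the experiments: "Yes" padding is read as 'same'-style padding, and the first-layer padding value (24) is an assumption chosen to keep the sequence length divisible through the pooling stages.

```python
import torch
import torch.nn as nn

class OneCNN(nn.Module):
    """Sketch of the 1CNN in Table 1 (padding values are assumptions)."""
    def __init__(self, n_classes=3):
        super().__init__()
        def block(c_in, c_out, k, s, p, drop):
            # Conv -> BatchNorm -> ReLU -> MaxPool(2) -> Dropout, one table module
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=k, stride=s, padding=p),
                nn.BatchNorm1d(c_out), nn.ReLU(),
                nn.MaxPool1d(2), nn.Dropout(drop))
        self.features = nn.Sequential(
            block(1, 16, 64, 16, 24, 0.1),   # M1: wide first kernel, stride 16
            block(16, 32, 3, 1, 1, 0.2),     # M2
            block(32, 64, 3, 1, 1, 0.2),     # M3
            block(64, 64, 3, 1, 1, 0.2),     # M4
            block(64, 64, 3, 1, 1, 0.2))     # M5
        self.classifier = nn.Sequential(     # M6
            nn.Flatten(),
            nn.LazyLinear(1000), nn.BatchNorm1d(1000), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1000, 100), nn.BatchNorm1d(100), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(100, n_classes))       # Softmax applied inside the loss

    def forward(self, x):
        return self.classifier(self.features(x))

model = OneCNN(n_classes=3)
logits = model(torch.randn(8, 1, 2048))  # a batch of 8 raw signals of length 2048
print(logits.shape)  # torch.Size([8, 3])
```

With these settings a 2048-point input shrinks to 128 points after M1 (stride 16, then pool 2) and to 4 points after M5, so the flattened feature vector entering M6 has 64 × 4 = 256 elements.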
2.2 Model Transfer Strategy
The transfer learning strategy for fault diagnosis in bevel gears is implemented as follows. First, the 1CNN model is pre-trained on a source domain dataset (e.g., data from one or multiple known speed conditions with ample samples). After pre-training, the model parameters, particularly those in the lower and middle layers (Modules M1 to M5), have learned to extract general, low-level features from vibration signals. These parameters are considered “frozen” (i.e., made non-trainable) during the target domain training phase. Subsequently, only the parameters in the high-level, task-specific layers (Module M6, the fully connected and classification layers) are fine-tuned using the small labeled dataset from the target domain (e.g., a new speed condition). This process, performed with a small learning rate using the Adam optimizer, allows the model to adapt its high-level reasoning to the new operating condition while retaining its foundational feature extraction capabilities.
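The freeze-and-fine-tune step can be sketched in PyTorch as follows. The tiny stand-in network here is illustrative only (a real run would load the pre-trained 1CNN weights); `features` plays the role of modules M1–M5 and `classifier` of module M6:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pre-trained network (not the full 1CNN).
features = nn.Sequential(nn.Conv1d(1, 8, 64, stride=16, padding=24), nn.ReLU(),
                         nn.AdaptiveMaxPool1d(4), nn.Flatten())
classifier = nn.Linear(8 * 4, 3)
model = nn.Sequential(features, classifier)

# Freeze the feature-extraction layers (M1-M5): make them non-trainable.
for p in features.parameters():
    p.requires_grad = False

# Fine-tune only the task-specific layers (M6) with a small learning rate.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

x, y = torch.randn(4, 1, 2048), torch.tensor([0, 1, 2, 0])  # few target samples
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
optimizer.step()

frozen = features[0].weight.grad is None  # frozen layers receive no gradients
print(frozen)  # True
```

Because gradients never reach the frozen layers, each fine-tuning step updates only the classifier, which is what makes training feasible with very few target-domain samples.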
2.3 Data Augmentation via Sliding Window Overlap Sampling
To mitigate the few-sample problem during the source domain pre-training phase, I employ a data augmentation technique based on the temporal continuity of vibration signals. Instead of segmenting a long signal into non-overlapping samples, a sliding window with overlap is used. This method generates a larger number of training samples from the same raw data, increasing the effective iterations during training and providing the deep network with more varied data instances to learn from.
The process is illustrated conceptually: a sliding window of fixed length $L$ (e.g., 2048 data points) moves along the time-series signal with a step size $K$. Each window position yields one sample. When $K < L$, consecutive samples overlap. The total number of samples $N$ generated from a signal segment of length $S$ is given by:
$$ N = \left\lfloor \frac{S - L}{K} \right\rfloor + 1 $$
By adjusting the step size $K$, one can control the degree of overlap and thus the number of generated samples. A smaller $K$ creates more samples with higher overlap, which can enhance the model’s ability to learn robust features from limited original data, albeit at increased computational cost.
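A short NumPy sketch of the overlap sampling and the sample-count formula (the counts below are computed directly from the formula for the 1-minute, 20 kHz record, not quoted from the experiments):

```python
import numpy as np

def overlap_sample(signal, L, K):
    """Segment a 1D signal into windows of length L with step K;
    when K < L, consecutive samples overlap."""
    N = (len(signal) - L) // K + 1       # N = floor((S - L) / K) + 1
    return np.stack([signal[i * K : i * K + L] for i in range(N)])

# Sample counts for one 1-minute record: S = 1,200,000 points at 20 kHz.
S, L = 1_200_000, 2048
for K in (2048, 1600, 1000, 500, 200, 100):
    print(K, (S - L) // K + 1)
# K = 2048 (no overlap) yields 585 samples; K = 100 yields 11,980.

# Materializing a small example:
samples = overlap_sample(np.arange(10.0), L=4, K=2)
print(samples.shape)  # (4, 4)
```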
3. Experimental Validation and Analysis
3.1 Experimental Setup and Data Acquisition
To validate the proposed method, I conducted experiments using an MFS Mechanical Fault Comprehensive Simulation Test Bench. The system consists of a drive motor, a bevel gearbox, a magnetic brake for loading, and a data acquisition system. The test subject was a pair of straight bevel gears with an 18-tooth pinion and a 27-tooth gear. Three states of the pinion gear were tested: healthy, missing tooth (simulated by EDM), and broken tooth (simulated by EDM).
Vibration signals were collected using IEPE piezoelectric accelerometers mounted on the gearbox housing. A DH5922N dynamic signal analyzer recorded the data at a sampling frequency of 20 kHz. The motor speed was controlled by an inverter, and experiments were conducted under four different frequency/speed conditions: A (10 Hz), B (20 Hz), C (30 Hz), and D (40 Hz), with a constant load applied via the magnetic brake. For each of the 12 scenarios (4 speeds × 3 states), 1 minute of vibration data was recorded, resulting in 1,200,000 data points per channel per run.
3.2 Sample Generation and Dataset Construction
Raw vibration signals were segmented into samples. Considering the rotational period of the bevel gears, a sample length $L$ of 2048 points was chosen, representing roughly two revolutions of the gear. To study the impact of sample quantity, I created datasets using different sampling strategies. The baseline was sequential non-overlapping sampling ($K = L = 2048$). Then, overlapping sampling was applied with varying step sizes $K$: 1600, 1000, 500, 200, and 100. Each strategy yielded a different number of total samples $N$ from the 1-minute recording. For each state under each speed condition, the generated samples were randomly split into training and testing sets with a 5:1 ratio. The detailed data configuration is summarized in Table 2.
| Condition (Freq.) | Gear State | Training Samples | Testing Samples | Label |
|---|---|---|---|---|
| A (10 Hz) | Healthy | 400 | 80 | 0 |
| | Missing Tooth | 400 | 80 | 1 |
| | Broken Tooth | 400 | 80 | 2 |
| B (20 Hz) | Healthy | 400 | 80 | 3 |
| | Missing Tooth | 400 | 80 | 4 |
| | Broken Tooth | 400 | 80 | 5 |
| C (30 Hz) | Healthy | 400 | 80 | 6 |
| | Missing Tooth | 400 | 80 | 7 |
| | Broken Tooth | 400 | 80 | 8 |
| D (40 Hz) | Healthy | 400 | 80 | 9 |
| | Missing Tooth | 400 | 80 | 10 |
| | Broken Tooth | 400 | 80 | 11 |
3.3 Diagnosis Procedure and Analysis of Results
The proposed diagnostic workflow was executed as follows. First, the 1CNN model was pre-trained using a dataset from one speed condition (e.g., Condition B) which was augmented via a specific overlapping sampling strategy. After pre-training, the model’s lower layers (M1-M5) were frozen. The model was then transferred and fine-tuned on a very small subset of labeled data from a different target speed condition (e.g., Condition A). Finally, the performance of the transferred model was evaluated on the independent test set from the target condition. This process was repeated for various source-target pairs and sampling strategies.
The core experiment evaluated how the overlapping sampling strategy during pre-training affected the final diagnosis accuracy in the target domain under the few-sample fine-tuning scenario. The results, focusing on the transfer from Condition B (source) to Condition A (target), are synthesized and illustrated. The key finding is that regardless of the sampling method, the diagnosis accuracy generally improves as the number of training samples increases, plateauing after a certain point. Critically, overlapping sampling consistently led to higher peak accuracy compared to non-overlapping sequential sampling. For instance, with an optimal sliding step $K=100$, the overlapped sampling method achieved a diagnostic accuracy approximately 12.85% higher than that achieved with sequential sampling on the same raw data. While smaller step sizes generate more samples and can improve accuracy, they also increase computational load and memory requirements. Therefore, selecting an appropriate step size is a trade-off between performance gain and resource efficiency.
3.4 Visualization of Model Performance
To intuitively understand the model’s feature learning and classification capability, I visualized the high-dimensional features extracted by the model’s last fully connected layer using t-SNE, a nonlinear dimensionality reduction technique. The visualization for a case using overlapping sampling ($K=100$) shows that the test samples from both the source and target domains are clustered effectively according to their fault classes in the reduced 2D space, with clear separability between the healthy, missing tooth, and broken tooth states for bevel gears. This indicates that the 1CNN model, empowered by overlap sampling and transfer learning, has learned robust and transferable feature representations.
Furthermore, a confusion matrix analysis for the target domain test set confirmed the high performance. The model achieved an overall accuracy of 97.73% for the three-class classification task under the new speed condition. Specifically, all samples for the “missing tooth” fault were correctly identified, while only a small number of “broken tooth” samples were misclassified as “healthy.”
Conclusion
In this research, I addressed the challenging problem of fault diagnosis for bevel gears under variable speed conditions with limited training samples. The proposed solution integrates a data augmentation strategy based on sliding window overlap sampling, a specifically designed deep 1D Convolutional Neural Network (1CNN) model, and a transfer learning paradigm.
The findings lead to the following conclusions:
- The developed 1CNN model, with its hierarchical structure of alternating convolutional and pooling layers, is capable of automatically extracting deep, discriminative features directly from the raw one-dimensional vibration signals of bevel gears. When combined with transfer learning, it effectively enables accurate fault diagnosis across variable operating conditions with minimal target-domain data.
- Under few-sample constraints, the sliding window overlap sampling technique serves as a simple yet powerful data augmentation method. By increasing the number and variety of training instances from the same raw data record, it significantly enhances the model’s feature learning capacity during pre-training, leading to a substantial improvement (up to 12.85% in our experiment) in the final cross-condition fault diagnosis accuracy, without altering the model’s fundamental parameters.
- The choice of the sliding step $K$ is crucial. While a smaller step generates more overlapping samples and generally leads to better performance, it also increases computational overhead. An excessively small step may offer diminishing returns on accuracy while disproportionately raising resource costs. Therefore, selecting an appropriate step size is necessary to balance diagnostic performance and operational efficiency.
- The overall framework demonstrates strong practicality. It allows for building a robust diagnostic model using data from accessible or historical operating conditions (source domain) and then efficiently adapting it to new, unseen conditions (target domain) with very few labeled samples. This is particularly valuable for real-world industrial applications where faults are rare and collecting massive fault datasets under every possible condition is infeasible.
This work provides a viable reference for intelligent fault diagnosis of mechanical components like bevel gears in data-scarce environments. Future work could explore adaptive methods for determining the optimal overlap ratio and investigate the framework’s performance under more extreme domain shifts, such as varying load conditions or different types of bevel gears.

