Fault Diagnosis for Bevel Gears Under Variable Speed Conditions with Limited Data: A Deep Transfer Learning Approach

Traditional fault diagnosis methodologies for rotating machinery, such as those applied to bevel gear systems, typically rely on the availability of large volumes of labeled data for training robust models. This requirement poses a significant challenge in real-world industrial settings, where acquiring sufficient fault samples under specific operating conditions is often impractical, especially for critical components like the bevel gear. The problem is further exacerbated under variable speed operations, where the dynamic characteristics of vibration signals change, making fault signatures non-stationary and more difficult to isolate from background noise. This work addresses the critical issue of performing accurate fault diagnosis for bevel gears when only a few labeled samples are available from the target operating condition (the domain of interest). We propose a novel framework that combines intelligent data augmentation, a specialized one-dimensional Convolutional Neural Network (1D CNN) architecture, and deep transfer learning to effectively diagnose bevel gear faults under variable speed scenarios with limited data.

The core of our methodology lies in three synergistic components. First, to mitigate the scarcity of training samples, we exploit the temporal and periodic nature of one-dimensional vibration signals. Instead of using non-overlapping segments, we sample the raw signal with an overlapping sliding window. This data augmentation strategy creates an expanded dataset from the originally limited data, increasing the effective number of training samples and helping the model learn more generalized features. The relationship between the total data points, the number of samples generated, and the sliding step is formalized as follows:

$$ S = (N - 1)K + L $$

where \( S \) represents the total number of data points used, \( N \) is the number of samples generated, \( K \) is the sliding step length (in data points), and \( L \) is the length of each sample (e.g., 2048 points). By adjusting \( K \), we control the degree of overlap and the total number of generated samples \( N \) from a fixed data stream \( S \).
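As a concrete illustration, the overlapping sampling scheme can be sketched in a few lines of NumPy (a minimal sketch; the function name and defaults are ours, with \( L = 2048 \) as in the text):

```python
import numpy as np

def sliding_window_samples(signal, L=2048, K=100):
    """Segment a 1-D vibration signal into overlapping samples.

    L: length of each sample (data points).
    K: sliding step; K < L gives overlapping samples, K = L gives none.
    Generates N = (S - L) // K + 1 samples from S data points, consistent
    with S = (N - 1) * K + L when the stream is consumed exactly.
    """
    S = len(signal)
    N = (S - L) // K + 1
    return np.stack([signal[i * K : i * K + L] for i in range(N)])

# Example: a 20480-point stream yields 10 samples without overlap (K = L),
# but 185 samples with a sliding step of K = 100.
stream = np.random.randn(20480)
print(sliding_window_samples(stream, K=2048).shape)  # (10, 2048)
print(sliding_window_samples(stream, K=100).shape)   # (185, 2048)
```

The same windowing can also be done without copies via `numpy.lib.stride_tricks.sliding_window_view`; the explicit loop above is kept for clarity.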

Second, we design a deep 1D CNN model, termed 1CNN, specifically tailored for processing raw temporal vibration signals from a bevel gear. The architecture is constructed with multiple alternating blocks of convolutional and pooling layers to perform automatic, hierarchical feature extraction directly from the input signal. The fundamental operation in a convolutional layer for the \(i\)-th layer is given by:

$$ X_i = f(W_i \ast X_{i-1} + b_i) $$

where \( X_{i-1} \) is the input feature map from the previous layer, \( W_i \) is the kernel weight matrix for the \(i\)-th layer, \( \ast \) denotes the convolution operation, \( b_i \) is the bias vector, and \( f(\cdot) \) is the ReLU activation function, chosen for its efficiency in mitigating the vanishing gradient problem. Following convolutional layers, max-pooling layers are applied to down-sample the feature maps, enhancing translational invariance and reducing computational complexity. The operation for the \(k\)-th feature map in the \(i\)-th pooling layer is:

$$ P_i^k(j) = \max_{(j-1)w+1 \leq t \leq jw} \{ q_i^k(t) \} $$

where \( P_i^k(j) \) is the \(j\)-th output of the pooling operation, \( q_i^k(t) \) are the values within the pooling region, and \( w \) is the pooling width (denoted \( w \) here to avoid a clash with \( S \), the total number of data points in the sampling equation). The deep structure allows the model to learn both low-level and high-level abstractions from the bevel gear vibration data without manual feature engineering.
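To make the two layer operations concrete, here is a minimal NumPy sketch of a single 1-D convolution (stride 1, valid padding) followed by ReLU and non-overlapping max-pooling; variable names mirror the equations, and the kernel values are arbitrary illustrations, not the trained weights:

```python
import numpy as np

def conv1d_relu(x, w, b):
    """X_i = f(W_i * X_{i-1} + b_i) with f = ReLU, stride 1, valid padding."""
    k = len(w)
    out = np.array([np.dot(x[t : t + k], w) + b for t in range(len(x) - k + 1)])
    return np.maximum(out, 0.0)  # ReLU activation

def max_pool1d(q, w=2):
    """P(j) = max over the j-th non-overlapping window of width w."""
    n = len(q) // w
    return q[: n * w].reshape(n, w).max(axis=1)

x = np.array([0.0, 1.0, -2.0, 3.0, 1.0, -1.0, 2.0, 0.5])
feat = conv1d_relu(x, w=np.array([1.0, -1.0]), b=0.0)  # simple difference filter
pooled = max_pool1d(feat, w=2)
```

Stacking such blocks, as in modules M1-M5 of the proposed model, progressively shortens the sequence while deepening the learned representation.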

Third, we integrate deep transfer learning to bridge the gap between a source domain (with ample labeled data from one operating speed) and a target domain (with limited labeled data from a different speed). The process involves two main stages: pre-training and fine-tuning. Initially, the 1CNN model is pre-trained on the abundant source domain data until convergence, allowing it to learn general features related to bevel gear faults. Subsequently, this pre-trained model is transferred to the target task. To adapt to the new condition while preventing overfitting on the small target dataset, we employ a parameter-freezing strategy. The lower and middle layers (M1-M5), which capture fundamental signal patterns, are frozen. Only the parameters in the higher, more task-specific layers (the fully connected layers in module M6) are fine-tuned using the limited target domain samples with a small learning rate. The final classification is performed by a Softmax layer, which outputs a probability distribution over fault classes. The probability for class \(c\) is computed as:

$$ P(y=c | x) = \frac{\exp(z_c)}{\sum_{j=1}^{C} \exp(z_j)} $$

where \(z_c\) is the input to the Softmax function for class \(c\), and \(C\) is the total number of fault classes.
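The freeze-and-fine-tune logic, together with the Softmax head, can be sketched framework-agnostically in NumPy (a toy illustration: module names mirror the text, array shapes are arbitrary, and in a real deep learning framework the same effect is obtained by disabling gradient updates on M1-M5):

```python
import numpy as np

def softmax(z):
    """P(y=c|x) = exp(z_c) / sum_j exp(z_j), shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy model: M1-M5 act as the frozen feature extractor, M6 as the tunable head.
rng = np.random.default_rng(0)
params = {"M1-M5": rng.standard_normal(4), "M6": rng.standard_normal(3)}
frozen = {"M1-M5"}

def sgd_step(params, grads, lr=1e-3):
    """Update only the trainable (non-frozen) modules, as in fine-tuning."""
    for name, g in grads.items():
        if name not in frozen:
            params[name] -= lr * g

before = {k: v.copy() for k, v in params.items()}
sgd_step(params, {k: np.ones_like(v) for k, v in params.items()})
# After the step, M1-M5 is unchanged while M6 has moved.
```

Using a small learning rate for the unfrozen head, as the text prescribes, keeps the transferred features intact while adapting the classifier to the target speed condition.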

To validate the proposed framework, we conducted experiments using a mechanical fault simulator (MFS) test rig. The setup involved a bevel gearbox with a pinion and a ring gear. Faults were introduced into the pinion gear via electro-discharge machining to simulate different damage modes. Vibration data were collected using IEPE accelerometers mounted on the gearbox housing under four distinct motor speed conditions (corresponding to drive frequencies of 10 Hz, 20 Hz, 30 Hz, and 40 Hz) and a constant load. Three health states were considered: healthy, missing tooth, and broken tooth. For each state and speed condition, vibration signals were recorded at a 20 kHz sampling rate.

The raw vibration data was segmented into samples of 2048 data points each. The dataset was constructed for each speed condition as follows:

| Operating Condition (Frequency) | Health State | Training Samples | Testing Samples | Label |
|---|---|---|---|---|
| A (10 Hz) | Healthy / Missing Tooth / Broken Tooth | 400 / 400 / 400 | 80 / 80 / 80 | 0 / 1 / 2 |
| B (20 Hz) | Healthy / Missing Tooth / Broken Tooth | 400 / 400 / 400 | 80 / 80 / 80 | 3 / 4 / 5 |
| C (30 Hz) | Healthy / Missing Tooth / Broken Tooth | 400 / 400 / 400 | 80 / 80 / 80 | 6 / 7 / 8 |
| D (40 Hz) | Healthy / Missing Tooth / Broken Tooth | 400 / 400 / 400 | 80 / 80 / 80 | 9 / 10 / 11 |

To investigate the few-sample scenario, we treated one speed condition as the source domain (with 400 training samples per class) and a different speed condition as the target domain. In the target domain, we simulated data scarcity by using only a small subset of its original training data for fine-tuning, while its full test set was used for evaluation. The proposed overlapping sampling method was applied to the limited target training data to generate an augmented dataset. We evaluated the diagnostic performance by transferring the model pre-trained on condition B (20 Hz) to condition A (10 Hz). The key parameters of our proposed 1CNN architecture are detailed below:

| Module | Layer # | Type | Kernel Size | Filters/Neurons | Stride | Padding |
|---|---|---|---|---|---|---|
| M1 | 1 | Conv1D | 64×1 | 16 | 16 | Yes |
|  | 2 | Batch Norm |  |  |  |  |
|  | 3 | MaxPool1D | 2×1 | 16 | 2 |  |
|  | 4 | Dropout (0.2) |  |  |  |  |
| M2 | 5 | Conv1D | 3×1 | 32 | 1 | Yes |
|  | 6 | Batch Norm |  |  |  |  |
|  | 7 | MaxPool1D | 2×1 | 32 | 2 |  |
|  | 8 | Dropout (0.2) |  |  |  |  |
| M3-M4 | 9-16 | Same Conv1D / Batch Norm / MaxPool1D / Dropout pattern |  |  |  |  |
| M5 | 17 | Conv1D | 3×1 | 64 | 1 | Yes |
|  | 18 | Batch Norm |  |  |  |  |
|  | 19 | MaxPool1D | 2×1 | 64 | 2 |  |
|  | 20 | Dropout (0.2) |  |  |  |  |
| M6 | 21 | Fully Connected |  | 1000 |  |  |
|  | 22 | Batch Norm |  |  |  |  |
|  | 23 | Dropout (0.5) |  |  |  |  |
|  | 24 | Fully Connected |  | 100 |  |  |
|  | 25 | Batch Norm |  |  |  |  |
|  | 26 | Dropout (0.5) |  |  |  |  |
|  | 27 | Softmax |  | # of classes |  |  |

The central findings of our experimental analysis focus on the impact of overlapping sampling for data augmentation in the few-sample target domain. We compared the fault diagnosis accuracy on the target domain’s test set using models fine-tuned with datasets created using different sliding step lengths \(K\). A smaller \(K\) results in higher overlap and generates more training samples from the same original data pool. The results clearly demonstrate that overlapping sampling significantly boosts diagnostic performance when target domain labels are scarce. For instance, when fine-tuning with a very limited target dataset, using a sliding step of \(K=100\) (high overlap) improved the average fault recognition accuracy by 12.85% compared to using non-overlapping sampling (\(K=2048\)). This confirms that our augmentation method effectively enhances the model’s exposure to the target domain’s feature distribution, leading to better generalization.

The effectiveness of the deep transfer learning process itself can be visualized. By applying t-SNE to the features extracted from the fully connected layer of the 1CNN model, we observed that the test samples from both the source and target domains formed compact, well-separated clusters according to their fault classes (healthy, missing tooth, broken tooth) after fine-tuning. This indicates that the model successfully learned transferable features for bevel gear faults that are robust to speed variations. The confusion matrix for the target domain test set showed a high overall accuracy, with the majority of misclassifications occurring only between the healthy and broken tooth states, while the missing tooth fault was identified with perfect precision in our tests.

Further analysis revealed a key trade-off governed by the sliding step \(K\). While smaller steps (more overlap) generate more samples and generally yield higher accuracy, they also increase computational cost and memory usage during training. Beyond a certain point, excessively reducing \(K\) provides diminishing returns in accuracy while disproportionately increasing training time. Conversely, a very large \(K\) (minimal overlap) fails to provide sufficient data augmentation, leading to lower accuracy. Therefore, selecting an appropriate sliding step is crucial for optimizing the balance between diagnostic performance and computational efficiency for bevel gear fault diagnosis.
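This trade-off can be quantified directly from the sampling relation: solving \( S = (N - 1)K + L \) for \( N \) shows how the sample count scales inversely with the step (a small worked example; the stream length \( S = 204800 \), roughly 10 s of signal at 20 kHz, is illustrative rather than taken from the experiments):

```python
def num_samples(S, L=2048, K=2048):
    """Number of samples N from S data points, per S = (N - 1) * K + L."""
    return (S - L) // K + 1

S = 204800  # illustrative: ~10 s of signal at a 20 kHz sampling rate
for K in (2048, 1024, 512, 100):
    print(K, num_samples(S, K=K))
```

With these numbers, shrinking the step from \( K = 2048 \) (no overlap) to \( K = 100 \) multiplies the training set roughly twentyfold from the same raw data, at a proportional cost in training time and memory.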

In conclusion, this research presents a practical and effective solution for diagnosing faults in bevel gears under variable operating speeds with limited training data. The integration of overlapping sampling-based data augmentation, a purpose-built deep 1D CNN model, and a strategic deep transfer learning approach overcomes the challenges posed by data scarcity in the target domain. The framework enables the transfer of knowledge learned from a data-rich source speed condition to a different target speed condition with few labeled samples. This capability is particularly valuable for real-world applications where fault data for specific operating regimes of a bevel gear system is difficult or expensive to obtain. The proposed method demonstrates a significant improvement in fault recognition rates, offering a robust reference for intelligent fault diagnosis in mechanical systems under few-sample conditions. Future work will explore the integration of this approach with domain adaptation techniques to further reduce the distribution discrepancy between source and target domains for even more challenging cross-condition diagnosis scenarios involving bevel gears and other complex gear systems.
