The reliable operation of mechanical transmission systems is paramount across industries such as aerospace, heavy machinery, and transportation. Among the key components, the spiral bevel gear is highly valued for its high load-bearing capacity, smooth meshing action, and compact design, enabling efficient power transmission between non-parallel, intersecting shafts. However, the complex curvilinear geometry of its tooth flanks makes it sensitive to manufacturing inaccuracies, assembly errors, and operational stresses. When operating under demanding high-speed and heavy-load conditions, spiral bevel gears are prone to developing faults like cracks, pitting, spalling, and tooth breakage. A failure in such a critical component can lead to catastrophic system breakdowns, significant economic losses, and severe safety hazards. Therefore, developing robust and accurate fault diagnosis methodologies for spiral bevel gearboxes is an essential area of research to enable predictive maintenance and ensure operational safety.
Traditional vibration-based fault diagnosis for gears often relies on signal processing techniques to extract handcrafted features from time-domain or frequency-domain signals. These features, such as statistical parameters (root mean square, kurtosis), spectral kurtosis, or envelope demodulation spectra, are then fed into classifiers like Support Vector Machines (SVM) or Artificial Neural Networks (ANN). While effective in controlled settings, these methods have significant limitations. Their performance heavily depends on expert knowledge for feature selection and engineering. More critically, the vibration signals from spiral bevel gearboxes operating in real industrial environments are often contaminated with strong background noise from other rotating elements, electromagnetic interference, and structural resonances. This noise can easily mask the subtle characteristic signatures of incipient faults, making traditional feature extraction unreliable.

Deep learning, a subset of machine learning, has emerged as a powerful tool to overcome these limitations. By using multi-layered neural networks, deep learning models can automatically learn hierarchical and discriminative features directly from raw or minimally processed data. Convolutional Neural Networks (CNNs), in particular, have shown exceptional prowess in processing data with a grid-like topology, such as images. This capability can be leveraged for fault diagnosis by converting one-dimensional vibration signals into two-dimensional time-frequency representations (TFRs). TFRs, such as those generated by the Continuous Wavelet Transform (CWT), provide a joint view of how the signal’s frequency content evolves over time, effectively revealing transient patterns associated with different fault types. Treating these TFRs as “images” of the machine’s health state allows CNNs to learn spatial patterns indicative of specific spiral bevel gear faults.
However, applying standard CNNs to this task faces challenges. First, the useful fault signatures in the time-frequency image might be localized and occupy only small regions, while the rest of the image contains less informative content or noise. Standard convolutional operations process all regions equally, lacking a mechanism to focus on these critical areas. Second, while deeper networks can learn more complex features, they often suffer from degradation problems like vanishing gradients, making them difficult to train effectively. To address these issues simultaneously, this work proposes a novel intelligent fault diagnosis framework for spiral bevel gearboxes that integrates Continuous Wavelet Transform for signal imaging and a dedicated deep network architecture called Coordinate Attention Residual Network (CooAtten-Resnet). This model combines the representational power of Residual Networks (ResNet) with the feature refinement capability of the Coordinate Attention mechanism, enabling it to precisely locate and amplify fault-related features in the time-frequency maps while maintaining trainability in deep layers. The performance and robustness of this approach are rigorously validated under various signal-to-noise ratio conditions.
Methodology: From Raw Vibration to Intelligent Diagnosis
Data Preprocessing and Augmentation
Training a deep CNN model with strong generalization ability requires a substantial amount of labeled data. In many industrial scenarios, obtaining vast datasets for every possible fault condition is impractical. To mitigate data scarcity and enhance model robustness, data augmentation techniques are essential. For one-dimensional time-series vibration data, overlapping sampling is an effective and logical augmentation strategy. Unlike random cropping in images, overlapping sampling respects the temporal continuity of the signal. The procedure involves sliding a window of a fixed length across the full-length vibration signal with a step size smaller than the window length. This generates multiple, slightly different samples from a single recording, increasing the dataset size and providing the model with more varied perspectives of the same fault condition.
The key parameters for overlapping sampling are the sample length (\(X_w\)) and the sliding step (\(L_s\)). The sample length should be sufficient to capture at least one complete revolution of the gear to include its meshing pattern. It can be determined by:
$$ X_w \geq \frac{60}{n} \cdot F_s $$
where \(n\) is the rotational speed in revolutions per minute (RPM) and \(F_s\) is the sampling frequency in Hertz (Hz). The number of samples (\(N\)) generated from a signal of total length \(L_t\) is given by:
$$ N = \left\lfloor \frac{L_t - X_w}{L_s} \right\rfloor + 1 $$
Here, \(\lfloor \cdot \rfloor\) denotes the floor function. Choosing \(L_s < X_w\) results in the desired overlap between consecutive samples. This process is applied to vibration signals from all channels and under all health conditions to build a comprehensive sample library.
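As a concrete illustration, the overlapping sampling procedure can be sketched in a few lines of NumPy. The helper name `overlap_sample` is ours; the 900 RPM speed, 5000 Hz sampling rate, 883-point window, and 310-point step match the experimental setup used later in this article.

```python
import numpy as np

def overlap_sample(signal, window, step):
    """Slide a fixed-length window with step < window to create overlapping samples."""
    n = (len(signal) - window) // step + 1        # N = floor((L_t - X_w) / L_s) + 1
    return np.stack([signal[i * step : i * step + window] for i in range(n)])

fs, rpm = 5000, 900                               # sampling frequency (Hz), shaft speed
min_window = int(np.ceil(60 / rpm * fs))          # X_w >= (60 / n) * F_s  ->  334 points
signal = np.random.default_rng(0).standard_normal(10 * fs)   # one 10 s recording
samples = overlap_sample(signal, window=883, step=310)
print(samples.shape)                              # (159, 883): 159 samples per channel
```

With a 310-point step, consecutive samples share 573 points, so each fault transient appears in several samples at different positions within the window.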
Signal Imaging via Continuous Wavelet Transform
To leverage the power of CNNs, the one-dimensional vibration samples are converted into two-dimensional time-frequency images. The Continuous Wavelet Transform (CWT) is chosen for this task due to its ability to provide multi-resolution analysis, offering good time resolution for high-frequency components and good frequency resolution for low-frequency components—a property ideal for analyzing non-stationary vibration signals.
The CWT of a signal \(x(t)\) is defined as the inner product of the signal with a family of scaled and translated versions of a mother wavelet \(\psi(t)\):
$$ CWT_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} \psi^*\left(\frac{t-b}{a}\right) x(t) dt $$
where \(a\) (\(a > 0\)) is the scale parameter (inversely related to frequency), \(b\) is the translation parameter (related to time), and \(\psi^*\) denotes the complex conjugate of the mother wavelet. The choice of mother wavelet is crucial. For fault diagnosis in noisy environments, the Bump wavelet is often preferred due to its excellent frequency localization properties and inherent noise robustness. The Bump wavelet is defined in the frequency domain as:
$$ \hat{\psi}_{\text{Bump}}(\omega) =
\begin{cases}
\exp\left(1 - \frac{1}{1 - (\omega - \mu)^2 / \sigma^2}\right), & \omega \in (\mu - \sigma, \mu + \sigma) \\
0, & \text{otherwise}
\end{cases} $$
with parameters \(\sigma > 0\) and \(\mu > \sigma\), so that the support \((\mu - \sigma, \mu + \sigma)\) lies entirely on the positive frequency axis. Applying the CWT with the Bump wavelet to a vibration sample yields a 2D matrix of coefficients, representing the signal’s energy distribution across time and scale (frequency). This matrix is then converted into an RGB image by mapping the coefficient magnitudes to a color map (e.g., jet), creating a visual “fingerprint” for that specific spiral bevel gear condition. The resulting time-frequency image makes the transient impact patterns caused by a cracked tooth, or the distributed energy changes from wear, clearly visible as distinct spatial patterns.
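The Bump-wavelet CWT can be computed directly from this frequency-domain definition: for each scale \(a\), the coefficients along \(b\) are an inverse FFT of \(\hat{x}(\omega)\,\sqrt{a}\,\hat{\psi}^{*}(a\omega)\). The NumPy sketch below is a minimal illustration under assumed parameter values (\(\mu = 5\), \(\sigma = 0.6\), and the scale grid); a production pipeline would typically rely on a library implementation.

```python
import numpy as np

def bump_hat(w, mu=5.0, sigma=0.6):
    """Bump wavelet in the frequency domain; nonzero only for |w - mu| < sigma."""
    out = np.zeros_like(w)
    inside = np.abs(w - mu) < sigma
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - ((w[inside] - mu) / sigma) ** 2))
    return out

def bump_cwt(x, scales, fs):
    """CWT row for scale a: inverse FFT of X(w) * sqrt(a) * conj(psi_hat(a * w))."""
    n = len(x)
    w = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)     # angular frequency grid
    X = np.fft.fft(x)
    coeffs = np.empty((len(scales), n), dtype=complex)
    for k, a in enumerate(scales):
        coeffs[k] = np.fft.ifft(X * np.sqrt(a) * np.conj(bump_hat(a * w)))
    return coeffs

fs = 5000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 300 * t)                       # a pure 300 Hz tone
freqs = np.arange(50, 600, 10, dtype=float)           # target centre frequencies (Hz)
scales = 5.0 / (2 * np.pi * freqs)                    # a = mu / (2*pi*f) centres psi_hat at f
C = np.abs(bump_cwt(x, scales, fs))                   # magnitude scalogram
peak = freqs[C.mean(axis=1).argmax()]                 # strongest row lies near 300 Hz
```

Mapping the magnitude matrix through a color map (e.g., jet) and resizing to 280×280 pixels then yields the RGB images fed to the network.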
Core Network Architecture: CooAtten-Resnet
The designed network architecture is a fusion of two powerful concepts: Residual Learning and Coordinate Attention. The base structure is built upon Residual Networks (ResNet), which tackle the degradation problem in deep networks by using skip connections or “shortcuts.” A fundamental building block is the Residual Block. Let the input to a block be \(x\). The block aims to learn a residual function \(\mathcal{F}(x)\), and the original input is added back to its output. The final output \(y\) of the block is:
$$ y = \mathcal{F}(x, \{W_i\}) + x $$
where \(\mathcal{F}\) represents the stacked convolutional, batch normalization, and ReLU layers, and \(\{W_i\}\) are their weights. This identity mapping via the shortcut allows gradients to flow directly through the network, enabling the training of very deep architectures (e.g., ResNet-18, ResNet-34) effectively. Our custom network uses multiple such residual blocks with increasing feature map channels.
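The identity-shortcut property can be demonstrated with a toy NumPy version of a residual block, in which two dense layers stand in for the convolution/BN stack (a deliberate simplification of the block described above, not its exact implementation): when the residual branch's weights are zero, the block reduces exactly to the identity, so adding blocks cannot make a deep stack worse than a shallower one.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x, {W1, W2}) + x: two weight layers with a ReLU, plus the identity shortcut."""
    return np.maximum(x @ W1, 0.0) @ W2 + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))                  # batch of 4 feature vectors, 64 channels

# With zero weights, F vanishes and the block is an exact identity mapping.
y = residual_block(x, np.zeros((64, 64)), np.zeros((64, 64)))
assert np.allclose(y, x)
```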
While ResNet provides depth, it lacks an explicit mechanism to focus on the most informative parts of the feature maps. This is where the Coordinate Attention (CA) mechanism is integrated. The CA module enhances the representational power of the network by embedding positional information into channel-wise attention, allowing the model to capture long-range spatial dependencies with minimal computational overhead. It operates in two distinct steps:
Step 1: Coordinate Information Embedding. Instead of using global average pooling which collapses spatial information, CA decomposes the pooling into two parallel, one-dimensional operations: one along the horizontal direction and one along the vertical direction. For an input feature map \(X\) with \(C\) channels, the context information for each channel is encoded along the spatial axes:
$$ z_c^h(h) = \frac{1}{W} \sum_{0 \leq j < W} x_c(h, j) $$
$$ z_c^w(w) = \frac{1}{H} \sum_{0 \leq i < H} x_c(i, w) $$
Here, \(z_c^h(h)\) is the output of the \(c\)-th channel at height \(h\) from horizontal pooling (width average), and \(z_c^w(w)\) is the output at width \(w\) from vertical pooling (height average). These transformations aggregate features along one spatial direction while preserving precise positional information along the other.
Step 2: Coordinate Attention Generation. The concatenated horizontal and vertical encodings are then passed through a shared \(1 \times 1\) convolution \(F_1\) to produce an intermediate feature map \(f\):
$$ f = \delta(F_1([\mathbf{z}^h, \mathbf{z}^w])) $$
where \([\cdot,\cdot]\) denotes concatenation, \(F_1\) is the \(1 \times 1\) convolution, and \(\delta\) is a non-linear activation (e.g., h-swish). The resulting feature map \(f\) is split back into separate horizontal and vertical components, \(f^h\) and \(f^w\). Another pair of \(1 \times 1\) convolutions followed by sigmoid activations (\(\sigma\)) generate the final attention weights:
$$ g^h = \sigma(F_h(f^h)) $$
$$ g^w = \sigma(F_w(f^w)) $$
The final output \(Y\) of the CA module is obtained by recalibrating the input feature map \(X\) with the generated attention weights:
$$ y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) $$
This process allows the network to selectively emphasize informative regions (e.g., the time-frequency region where a fault impact occurs) and suppress less relevant ones, making it particularly effective for analyzing the structured patterns in spiral bevel gear time-frequency images.
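The two steps above can be condensed into a small NumPy sketch. Dense matrices stand in for the \(1 \times 1\) convolutions, ReLU replaces h-swish for brevity, and the shapes are illustrative rather than those of the actual network.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def coordinate_attention(X, W1, Wh, Ww):
    """Coordinate Attention over a feature map X of shape (C, H, W).

    W1 plays the shared 1x1 transform F1; Wh and Ww play the per-direction
    1x1 convolutions F_h and F_w."""
    C, H, W = X.shape
    zh = X.mean(axis=2)                           # (C, H): pool over width
    zw = X.mean(axis=1)                           # (C, W): pool over height
    f = np.maximum(W1 @ np.concatenate([zh, zw], axis=1), 0.0)   # shared transform
    fh, fw = f[:, :H], f[:, H:]                   # split back into the two directions
    gh = sigmoid(Wh @ fh)                         # (C, H) attention weights along height
    gw = sigmoid(Ww @ fw)                         # (C, W) attention weights along width
    return X * gh[:, :, None] * gw[:, None, :]    # y_c(i,j) = x_c(i,j) * gh_c(i) * gw_c(j)

rng = np.random.default_rng(1)
C_, H, W, Cr = 8, 6, 5, 4                         # Cr: reduced channel count
X = rng.standard_normal((C_, H, W))
Y = coordinate_attention(X,
                         rng.standard_normal((Cr, C_)),
                         rng.standard_normal((C_, Cr)),
                         rng.standard_normal((C_, Cr)))
```

Because both attention maps lie in \((0, 1)\), the module can only rescale activations, emphasizing some time-frequency positions and suppressing others without altering the feature map's shape.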
The proposed CooAtten-Resnet architecture strategically places these Coordinate Attention modules after groups of residual blocks. The typical structure is outlined below:
| Stage | Layer / Block Type | Output Size | Details |
|---|---|---|---|
| Input | Time-Frequency Image | 280×280×3 | RGB Image from CWT |
| Stem | Conv 7×7, BN, ReLU, MaxPool | 70×70×64 | Initial feature extraction and downsampling |
| Stage 1 | Residual Block × 2 | 70×70×64 | Basic feature learning |
| Stage 2 | Residual Block × 2 | 35×35×128 | Downsample, increase channels |
| Stage 3 | Residual Block × 2 | 18×18×256 | Downsample, increase channels |
| Stage 4 | Residual Block × 3 + CA Module | 9×9×512 | Deep feature learning with spatial attention |
| Output | Global Avg Pool, Fully Connected, Softmax | 1×1×N_classes | Classification into fault types |
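The spatial sizes in the table can be checked with the standard output-size formula \(n_{\text{out}} = \lfloor (n + 2p - k)/s \rfloor + 1\). The kernel, stride, and padding values below are the usual ResNet-style choices, assumed here because they reproduce the table; the article does not state them explicitly.

```python
def conv_out(n, k, s, p):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1               # floor((n + 2p - k) / s) + 1

n = conv_out(280, k=7, s=2, p=3)                  # stem conv 7x7, stride 2 -> 140
n = conv_out(n, k=3, s=2, p=1)                    # stem max-pool 3x3, stride 2 -> 70
sizes = [n]
for _ in range(3):                                # stages 2-4 each downsample by stride 2
    n = conv_out(n, k=3, s=2, p=1)
    sizes.append(n)
print(sizes)                                      # [70, 35, 18, 9], matching the table
```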
Overall Diagnostic Framework
The complete workflow for the spiral bevel gearbox fault diagnosis is as follows:
- Data Acquisition & Sampling: Collect raw vibration acceleration signals from the gearbox under various health states. Apply the overlapping sampling technique described above to generate a large set of 1D signal samples.
- Signal Imaging: Transform each 1D signal sample into a 2D time-frequency image (e.g., 280×280 pixels) using the Continuous Wavelet Transform with the Bump wavelet. Assemble these images into a labeled dataset.
- Dataset Construction & Noise Injection: Split the dataset into training and testing subsets. To evaluate robustness, create multiple dataset versions by artificially adding different levels of Gaussian white noise to portions of the samples, simulating challenging industrial environments.
- Model Training: Train the proposed CooAtten-Resnet model on the training set. The model learns to map the input time-frequency images to their corresponding fault class labels using backpropagation and an optimizer like Adam.
- Fault Diagnosis: Use the trained model to classify unseen time-frequency images from the test set. The model outputs a probability distribution over all possible fault classes, and the class with the highest probability is the diagnosed condition of the spiral bevel gear.
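The noise-injection step can be sketched as follows: to reach a target SNR in dB, the added Gaussian noise is given variance \(P_{\text{signal}} / 10^{\text{SNR}/10}\). The helper below is an illustrative assumption, not code from the experimental pipeline.

```python
import numpy as np

def add_noise(x, snr_db, rng):
    """Add Gaussian white noise so that 10*log10(P_signal / P_noise) equals snr_db."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 50 * np.arange(5000) / 5000)      # clean 1 s test signal
noisy = add_noise(x, snr_db=10, rng=rng)
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((noisy - x) ** 2))
```

Applying this to 10% of the samples at 20 dB and 10% at 10 dB yields a mixed-noise dataset like Dataset A described below, while applying one level to all samples yields the uniform-noise datasets.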
Experimental Validation and Analysis
Experimental Setup and Dataset Description
The proposed methodology was validated using vibration data collected from a spiral bevel gearbox test rig. The rig consisted of a drive motor, a torque/speed sensor, the test spiral bevel gearbox, a loading system (e.g., a magnetic powder brake), and a data acquisition system. Vibration signals were measured using ICP accelerometers mounted on the housing in multiple directions (vertical, horizontal, axial).
Seven distinct health conditions of the spiral bevel gear pair were simulated:
- Healthy (Normal) Condition
- Large Gear: Tooth Root Crack
- Large Gear: Tooth Surface Wear
- Large Gear: Missing Tooth
- Small Gear: Tooth Root Crack
- Small Gear: Tooth Surface Wear
- Small Gear: Missing Tooth
Data was collected at a constant motor speed of 900 RPM with a sampling frequency of 5000 Hz. For each condition, a 10-second time series was recorded. Following the overlapping sampling procedure (\(X_w = 883\) points, \(L_s = 310\)), 640 samples were generated per health condition from the multi-channel data, resulting in a base dataset of 4480 samples. Each sample was then converted into a 280×280 RGB time-frequency image using CWT.
To rigorously test the model’s noise immunity, four distinct datasets were constructed by adding Gaussian white noise to the original signals, as summarized in the table below:
| Dataset | Description | SNR 20dB Noise | SNR 10dB Noise | Clean Samples | Total Samples |
|---|---|---|---|---|---|
| Dataset A | Mixed Noise (10% each) | 10% | 10% | 80% | 4480 |
| Dataset B | Uniform High Noise | 100% | 0% | 0% | 4480 |
| Dataset C | Uniform Severe Noise | 0% | 100% | 0% | 4480 |
| Dataset D | Clean (No Added Noise) | 0% | 0% | 100% | 4480 |
Each dataset was split into a training set (70%) and an independent test set (30%).
Results and Comparative Analysis
The proposed CooAtten-Resnet model was trained and tested on all four datasets. Its performance was benchmarked against several well-established deep learning models: AlexNet, ResNet-18, and ResNet-34. The primary metric for comparison was the classification accuracy on the unseen test set. The results are compiled in the following table:
| Diagnosis Model | Dataset A Accuracy | Dataset B Accuracy | Dataset C Accuracy | Dataset D Accuracy |
|---|---|---|---|---|
| AlexNet | 94.79% | 93.08% | 91.65% | 99.70% |
| ResNet-18 | 95.76% | 94.57% | 90.31% | 100% |
| ResNet-34 | 96.28% | 95.01% | 91.80% | 100% |
| Proposed CooAtten-Resnet | 96.43% | 95.54% | 93.44% | 100% |
Key Observations:
- Superior Accuracy: The proposed CooAtten-Resnet model achieved the highest test accuracy across all four datasets. This demonstrates the effectiveness of integrating the Coordinate Attention mechanism to refine feature learning specifically for spiral bevel gear fault patterns.
- Robustness to Noise: As expected, diagnostic accuracy decreased for all models as the noise level increased (from Dataset D to C). However, the performance advantage of CooAtten-Resnet over other models became more pronounced in noisier conditions. The accuracy gap between CooAtten-Resnet and the next best model (ResNet-34) widened from 0.15% in Dataset A to 1.64% in the severely noisy Dataset C. This indicates that the attention mechanism is particularly effective in helping the network “look past” noise and focus on the genuine fault signatures in the time-frequency map.
- Convergence and Stability: Analysis of the training curves showed that the CooAtten-Resnet model converged faster and more stably than the compared models. The loss decreased more smoothly, and the validation accuracy plateaued at a higher level with less fluctuation. This is attributed to the residual connections facilitating gradient flow and the attention mechanism providing clearer learning signals by highlighting relevant features.
A deeper look at the confusion matrices for the CooAtten-Resnet model revealed that misclassifications under noise primarily occurred between the “Healthy” condition and “Wear” faults, and occasionally between the same fault type on the large versus the small gear (e.g., large-gear crack vs. small-gear crack). This is logical because the diffuse signature of wear and the subtle signature of incipient cracks are the first to be obscured by strong noise. Nevertheless, the model maintained a high overall recognition rate.
Feature Visualization and Interpretation
To gain insight into what the network learns, Gradient-weighted Class Activation Mapping (Grad-CAM) was used to generate visual explanations. Grad-CAM produces a heatmap highlighting the regions in the input time-frequency image that were most influential for the model’s prediction. When applied to the CooAtten-Resnet model, the heatmaps revealed a critical finding:
- In standard residual blocks, the highlighted regions (high activation) often corresponded to the shape and contour of energy concentrations in the time-frequency map—the “blobs” representing impact events.
- In layers following the Coordinate Attention module, the highlighted regions were more tightly concentrated on the temporal and spectral locations of these energy concentrations. The network learned not just what the fault pattern looks like, but also where in the time-frequency plane it typically occurs relative to gear rotation.
This demonstrates that the attention mechanism successfully guides the network to prioritize spatial-contextual information, which is crucial for distinguishing faults in spiral bevel gears where the timing of impacts relative to the mesh cycle is a key discriminant.
Conclusion
This work presented a novel, intelligent fault diagnosis framework for spiral bevel gearboxes that addresses the challenges of automatic feature extraction and noise robustness. The core innovation lies in the synergistic combination of Continuous Wavelet Transform-based signal imaging and a custom deep neural network, the Coordinate Attention Residual Network (CooAtten-Resnet). The CWT provides a rich and informative 2D representation of the non-stationary vibration signals, transforming the fault diagnosis problem into an image classification task well-suited for CNNs. The CooAtten-Resnet architecture then excels at this task by leveraging residual learning for stable training depth and, more importantly, by employing coordinate attention to dynamically focus on the most discriminative spatial regions within the time-frequency images.
Experimental results on a spiral bevel gearbox test rig under various noise conditions conclusively demonstrate the advantages of this approach:
- It achieves very high diagnostic accuracy, reaching 100% on clean data and maintaining over 93% accuracy even under strong noise contamination without any pre-processing denoising.
- It outperforms other standard deep learning models (AlexNet, ResNet-18, ResNet-34) consistently, with a growing performance margin as environmental noise increases.
- The model converges faster and more stably during training, making it efficient to deploy.
- The integration of attention provides a degree of interpretability, showing that the model learns to prioritize the location and shape of fault-induced energy concentrations in the time-frequency domain.
The proposed method provides a powerful, end-to-end solution for the condition monitoring of spiral bevel gears. It reduces reliance on expert knowledge for manual feature design and shows remarkable resilience to noisy industrial environments. Future work will focus on extending this framework to handle variable-speed operating conditions, validating it on larger and more diverse industrial datasets, and exploring its application to other critical rotating machinery components like bearings and planetary gear sets.
