A Deep Learning Approach for Robust Fault Diagnosis in Rotary Vector Reducers Under Severe Noise Interference

The reliable operation of industrial robots is heavily dependent on the health of their core components, among which the rotary vector reducer plays a pivotal role due to its superior load capacity, high rigidity, and compact design. Accurate fault diagnosis of the rotary vector reducer is therefore critical for predictive maintenance and avoiding costly downtime. In practical operating environments, however, the vibration signals collected for diagnosis are invariably contaminated by random noise from various sources, which can severely mask the characteristic fault signatures and degrade the performance of diagnostic models. This challenge necessitates the development of robust diagnostic algorithms capable of maintaining high accuracy under noisy conditions.

Traditional machine learning approaches for fault diagnosis often rely on manually designed features and shallow models, which may lack the representational power and robustness needed for complex signal patterns corrupted by noise. Deep learning, particularly Convolutional Neural Networks (CNNs), has emerged as a powerful alternative, capable of automatically learning hierarchical features directly from raw data. Yet, standard CNN architectures, while effective on clean data, often exhibit significant performance drops when confronted with unseen, noise-corrupted signals, as they may overfit to the specific patterns present in the training set and fail to generalize to the perturbed features caused by noise.

To address this critical gap, I propose a novel anti-noise convolutional neural network (ANNet) specifically designed for the fault diagnosis of rotary vector reducers operating in noisy environments. The core philosophy of ANNet is to explicitly train the model to be invariant to random corruptions in the input signal, thereby forcing it to learn more robust and generalizable features. This is achieved through two synergistic innovations: a unique Input Signal Dropout technique that simulates noise during training, and a Multi-Scale Convolution Kernel Module that extracts and fuses complementary features from the corrupted input.

Methodology of the ANNet Model

The proposed ANNet framework processes the one-dimensional vibration signal through a sequence of transformations and learning stages to achieve noise-robust classification.

1. Signal Stacking for 2D Representation

The raw one-dimensional vibration time-series signal is first converted into a two-dimensional grayscale image to provide a structured input suitable for 2D convolutional operations. This transformation, known as signal stacking, reorganizes the sequential data into a spatial format. A one-dimensional signal sequence is segmented into $ n $ consecutive sub-sequences, each of length $ m $. These sub-sequences are then stacked as the columns of a 2D matrix of size $ m \times n $, which is treated as a single-channel image. In my implementation, I set $ m = n = 32 $, resulting in a compact $ 32 \times 32 $ grayscale image. Formally, let the original signal be $ S = [s_1, s_2, \ldots, s_{m \times n}] $. The element of the resulting 2D image $ I $ at row $ i $ and column $ j $ is given by:

$$ I(i, j) = s_{(j-1) \cdot m + i} $$
for $ i = 1, 2, \ldots, m $ and $ j = 1, 2, \ldots, n $.
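The index formula above corresponds to a column-major reshape of the signal. The following is a minimal NumPy sketch under that reading; the function name `stack_signal` and the placeholder input are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def stack_signal(signal, m=32, n=32):
    """Return the m x n image with I[i, j] = s[j * m + i] (0-based indices)."""
    s = np.asarray(signal, dtype=float)
    if s.size != m * n:
        raise ValueError(f"expected {m * n} samples, got {s.size}")
    # Fortran (column-major) order fills one column per length-m sub-sequence,
    # which matches I(i, j) = s_{(j-1)m + i} in the 1-based notation of the text.
    return s.reshape((m, n), order="F")

# Placeholder samples s_1 ... s_1024 so that each pixel shows its source index.
signal = np.arange(1, 32 * 32 + 1)
image = stack_signal(signal)
print(image.shape)
```

With this placeholder input, moving down a column advances the sample index by 1 and moving right along a row advances it by $ m $.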

2. Input Signal Dropout for Explicit Noise Simulation

The key to fostering noise robustness lies in the training phase. Instead of, or in addition to, adding Gaussian white noise, I employ a Dropout operation directly on the input layer. Standard Dropout is typically applied to hidden neurons in fully connected layers to prevent co-adaptation. Here, it is repurposed to randomly corrupt the input image itself, simulating the random loss or corruption of signal points, a phenomenon analogous to impulse noise or severe signal attenuation in real sensors. For an input image $ X $, the corrupted image $ X' $ is generated by:

$$ X' = R \odot X $$
$$ r_{i,j} \sim \text{Bernoulli}(p), \quad R = [r_{i,j}] $$
$$ p \sim \text{Uniform}(l, 0.9) $$
$$ l = 0.1 + (0.9 - 0.1) \cdot \frac{s}{S} $$

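Here, $ \odot $ denotes element-wise multiplication. $ R $ is a binary mask matrix whose entries are independently drawn from a Bernoulli distribution with probability $ p $ of being 1, so $ p $ is the probability that a signal point is kept and $ 1 - p $ is the corruption rate. Crucially, the lower bound $ l $ of the uniform distribution for $ p $ increases linearly with the training iteration $ s $, from 0.1 at the start to 0.9 after a total of $ S $ iterations (here $ S $ denotes the iteration count, not the signal of the previous section). This curriculum learning strategy therefore begins with a wide range of corruption rates, including very heavy corruption ($ p $ as low as 0.1, i.e., up to 90% of the points zeroed), and progressively narrows the range toward light corruption ($ p = 0.9 $). The model thus first learns robust features from heavily corrupted data before fine-tuning on less corrupted inputs.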
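The masking rule and its schedule can be sketched in a few lines of NumPy. This is an illustrative reading of the equations above, not the original training code: the function names, the random seed, and the choice of 1,000 total iterations are assumptions. No rescaling of the surviving entries is applied, because the formula $ X' = R \odot X $ contains none.

```python
import numpy as np

def keep_prob_lower_bound(step, total_steps, lo=0.1, hi=0.9):
    """l = 0.1 + (0.9 - 0.1) * s / S, clamped so that l never exceeds 0.9."""
    return min(hi, lo + (hi - lo) * step / total_steps)

def input_dropout(x, step, total_steps, rng):
    """Zero each entry of x independently with probability 1 - p."""
    l = keep_prob_lower_bound(step, total_steps)
    p = rng.uniform(l, 0.9)            # p ~ Uniform(l, 0.9)
    mask = rng.random(x.shape) < p     # r_ij ~ Bernoulli(p); True means "keep"
    return x * mask, p

rng = np.random.default_rng(0)         # fixed seed, for reproducibility only
x = np.ones((32, 32))                  # placeholder image
x_early, p_early = input_dropout(x, step=0, total_steps=1000, rng=rng)
x_late, p_late = input_dropout(x, step=1000, total_steps=1000, rng=rng)
print(f"early: p={p_early:.2f}, kept={x_early.mean():.2f}")
print(f"late:  p={p_late:.2f}, kept={x_late.mean():.2f}")
```

At the first iteration $ p $ can fall anywhere in $[0.1, 0.9]$, whereas at the last iteration the interval has collapsed to $ p = 0.9 $, so roughly 10% of the points are zeroed.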

3. Multi-Scale Convolutional Feature Extraction and Fusion

To effectively learn from the variably corrupted inputs, the model must capture features at multiple scales. A single kernel size may be optimal for certain fault signatures but inadequate for others, especially when parts of the signal are missing. Therefore, I design a core building block that employs convolutional kernels of three different sizes simultaneously. The architecture of this Multi-Scale Block is as follows:

The corrupted input (or feature map from the previous layer) is fed in parallel into three separate 2D convolutional paths.
Each path uses a different kernel size: $ 15 \times 15 $, $ 7 \times 7 $, and $ 3 \times 3 $. All convolutions use ‘same’ padding and a stride of 1 to preserve spatial dimensions.
The output of each convolution undergoes Batch Normalization (BN) and a ReLU activation function.
The resulting three feature maps, $ \text{Re}_1, \text{Re}_2, \text{Re}_3 $, are then concatenated along the channel dimension to form the block’s output $ \text{Re} $:

$$ \text{Re} = \text{concat}(\text{Re}_1, \text{Re}_2, \text{Re}_3) $$

This design allows the network to concurrently capture long-range, medium-range, and short-range dependencies within the 2D signal representation, making the feature representation more comprehensive and resilient to localized corruption.
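The data flow of one Multi-Scale Block can be illustrated with a small NumPy forward pass. This is a sketch under simplifying assumptions: Batch Normalization is omitted, the weights are random placeholders rather than trained values, and the helper names `conv2d_same` and `multi_scale_block` are hypothetical. Its purpose is only to show that the three paths preserve the spatial size and that concatenation triples the channel count.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, w):
    """Stride-1 'same' cross-correlation, as used in CNN layers.

    x has shape (H, W, C_in); w has shape (k, k, C_in, C_out) with odd k.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))       # zero padding
    # windows has shape (H, W, C_in, k, k): one k x k patch per output pixel.
    windows = sliding_window_view(xp, (k, k), axis=(0, 1))
    return np.einsum("hwcij,ijco->hwo", windows, w)

def multi_scale_block(x, weights):
    """Run parallel conv paths, apply ReLU, and concatenate along channels."""
    paths = [np.maximum(conv2d_same(x, w), 0.0) for w in weights]
    return np.concatenate(paths, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 1))                       # placeholder input
# Block 1 of Table 1: 16 filters for each of the three kernel sizes.
weights = [0.01 * rng.standard_normal((k, k, 1, 16)) for k in (15, 7, 3)]
out = multi_scale_block(x, weights)
print(out.shape)
```

A framework layer with 'same' padding would replace `conv2d_same` in practice; the explicit version is used here only to keep the example self-contained.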

4. Overall Network Architecture

The complete ANNet model is constructed by stacking five such Multi-Scale Blocks. The number of filters in each convolutional path is increased progressively in deeper blocks to learn more complex features. The detailed parameters are summarized in Table 1.

Table 1: Detailed Architecture Parameters of the Proposed ANNet Model
Network Layer	Output Tensor Size	Key Parameters
Input Signal	32×32×1	–
Input Dropout	32×32×1	p ~ U(l, 0.9)
Multi-Scale Block 1	32×32×48	Kernels: 15×15, 7×7, 3×3 (16 each)
Multi-Scale Block 2	32×32×96	Kernels: 15×15, 7×7, 3×3 (32 each)
Multi-Scale Block 3	32×32×192	Kernels: 15×15, 7×7, 3×3 (64 each)
Multi-Scale Block 4	32×32×384	Kernels: 15×15, 7×7, 3×3 (128 each)
Multi-Scale Block 5	32×32×768	Kernels: 15×15, 7×7, 3×3 (256 each)
Global Average Pooling	1×1×768	Pool size: 32×32
Dropout	1×1×768	Rate = 0.5
Fully Connected + Softmax	5	Number of fault classes

The final block’s output is fed into a Global Average Pooling layer, which reduces each feature map to a single value, creating a 768-dimensional feature vector. This is followed by a standard Dropout layer (rate=0.5) and a final fully connected layer with a softmax activation for classification into the five health states. The model is trained using the Adam optimizer with an initial learning rate of 0.001, which decays linearly, and a batch size of 16.
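The channel counts in Table 1 follow from concatenating three equally wide paths in every block. The short script below recomputes them and also estimates the number of convolution weights per block. The weight counts are an assumption-laden estimate: they exclude bias terms and Batch Normalization parameters, since the text does not state how these are configured.

```python
KERNEL_SIZES = (15, 7, 3)
FILTERS_PER_PATH = (16, 32, 64, 128, 256)   # Blocks 1-5, taken from Table 1

def block_summary(in_channels=1):
    """Return (input channels, output channels, conv weights) for each block."""
    rows = []
    c_in = in_channels
    for filters in FILTERS_PER_PATH:
        c_out = len(KERNEL_SIZES) * filters              # three paths concatenated
        conv_weights = sum(k * k * c_in * filters for k in KERNEL_SIZES)
        rows.append((c_in, c_out, conv_weights))
        c_in = c_out                                     # feeds the next block
    return rows

summary = block_summary()
for idx, (c_in, c_out, n_weights) in enumerate(summary, start=1):
    print(f"Block {idx}: {c_in:>3} -> {c_out:>3} channels, {n_weights:,} conv weights")

# Global Average Pooling keeps one value per channel of the last block.
feature_dim = summary[-1][1]
print("feature vector length:", feature_dim)
```

Under these assumptions, most of the weights sit in the last two blocks, because both the input channel count and the per-path filter count double from block to block.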

Experimental Validation and Comparative Analysis

Experimental Setup and Data Description

The performance of ANNet was validated using vibration data collected from a dedicated rotary vector reducer test rig. Vibration signals were acquired from the axial direction under a constant motor speed of 400 RPM and a load of 40 N·m. Five distinct health states were investigated: one normal condition and four fault conditions, including single-component faults (planet gear, cycloid gear) and composite faults (planet gear + pin, cycloid gear + pin). For each state, 5,000 image samples of size 32×32 were generated using the signal stacking method and split into 4,500 training and 500 test samples. Copies of the test set were then corrupted with Gaussian white noise at different Signal-to-Noise Ratio (SNR) levels to simulate increasingly harsh environments. The dataset composition is shown in Table 2.

Table 2: Composition of the Rotary Vector Reducer Fault Dataset
Dataset Purpose	SNR Level	Samples per Class	Total Samples
Training	None (Clean)	4,500	22,500
Testing	None (Clean)	500	2,500
	15 dB	500	2,500
	12 dB	500	2,500
	9 dB	500	2,500
	6 dB	500	2,500
	3 dB	500	2,500
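The noisy test sets rely on scaling Gaussian white noise to a target SNR. The exact injection procedure is not given here, so the following sketch assumes the standard definition $ \text{SNR}_{\text{dB}} = 10 \log_{10}(P_{\text{signal}} / P_{\text{noise}}) $, with power measured as the mean square of the samples. The function name `add_awgn` and the sinusoidal test signal are placeholders.

```python
import numpy as np

def add_awgn(signal, snr_db, rng):
    """Add zero-mean Gaussian white noise so that the result has the given SNR."""
    x = np.asarray(signal, dtype=float)
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
t = np.arange(100_000) / 10_000.0
clean = np.sin(2 * np.pi * 50 * t)          # placeholder for a vibration record
noisy = add_awgn(clean, snr_db=3, rng=rng)

# Sanity check: recover the SNR from the clean signal and the added noise.
measured_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"measured SNR: {measured_db:.2f} dB")
```

At 3 dB the noise power is about half the signal power, which is why this level is treated as the severe case in the experiments.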

Comparison with State-of-the-Art Methods

To objectively evaluate the anti-noise capability of ANNet, I compared it against several well-established deep learning models adapted for fault diagnosis: a standard CNN (inspired by LeNet), a deep Residual Network (ResNet), and a Training Interference CNN (TICNN), which applies Dropout to its first convolutional layer. All models were trained on the same clean data and evaluated on the identical noise-corrupted test sets. The average test accuracy over 20 independent runs for each noise level is presented in Table 3 and discussed below.

Table 3: Average Test Accuracy (%) of Different Models Under Various Noise Levels
SNR	Standard CNN	ResNet	TICNN	Proposed ANNet
None (Clean)	98.7	99.2	99.0	98.9
15 dB	95.1	97.8	98.1	98.5
12 dB	90.3	95.4	96.9	97.8
9 dB	82.5	89.7	93.5	96.1
6 dB	71.2	75.3	85.4	91.7
3 dB (Severe Noise)	58.6	53.1	70.8	81.5

Analysis of Results: All models exceed 98% accuracy on the clean test data, where ANNet (98.9%) is marginally below ResNet (99.2%) and TICNN (99.0%). As noise intensifies, the accuracy of all models declines, but the rate of degradation varies significantly. The standard CNN and ResNet show considerable sensitivity to noise, with ResNet’s performance dropping particularly sharply under severe noise (53.1% at 3 dB), likely due to overfitting to clean data features that are easily disrupted. The TICNN model demonstrates better robustness, attributable to its first-layer Dropout, which introduces some regularization during training. However, the proposed ANNet outperforms all others in every noisy condition, and the gap widens as the noise becomes more severe: its margin over the best baseline (TICNN) grows from 0.4 percentage points at 15 dB to 10.7 at 3 dB. Under the extremely challenging 3 dB SNR condition, ANNet maintains an accuracy above 80%, which is roughly 11 to 28 percentage points higher than the other models. These results indicate a clear advantage in anti-noise generalization for the ANNet framework when diagnosing faults in a rotary vector reducer.
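The widening gap can be checked directly from Table 3. The script below copies the table values and computes, for each noise level, the margin of ANNet over the best-performing baseline; the variable and function names are illustrative.

```python
SNR_LEVELS = ["clean", "15 dB", "12 dB", "9 dB", "6 dB", "3 dB"]

# Average test accuracy (%) from Table 3, in the order of SNR_LEVELS.
ACCURACY = {
    "Standard CNN": [98.7, 95.1, 90.3, 82.5, 71.2, 58.6],
    "ResNet":       [99.2, 97.8, 95.4, 89.7, 75.3, 53.1],
    "TICNN":        [99.0, 98.1, 96.9, 93.5, 85.4, 70.8],
    "ANNet":        [98.9, 98.5, 97.8, 96.1, 91.7, 81.5],
}

def margin_over_best_baseline(table, proposed="ANNet"):
    """Percentage-point margin of the proposed model over the best baseline."""
    baselines = [acc for name, acc in table.items() if name != proposed]
    best = [max(column) for column in zip(*baselines)]
    return [round(a - b, 1) for a, b in zip(table[proposed], best)]

margins = margin_over_best_baseline(ACCURACY)
for snr, margin in zip(SNR_LEVELS, margins):
    print(f"{snr:>6}: {margin:+.1f} points")
```

The margin is slightly negative on clean data (-0.3 points) and then increases monotonically with the noise level, reaching +10.7 points at 3 dB.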

Interpretation and Discussion of Model Robustness

The exceptional performance of ANNet under noise interference can be attributed to the deliberate and synergistic design of its two core components.

1. The Role of Input Signal Dropout

Applying Dropout directly to the input layer serves three critical functions for enhancing robustness in rotary vector reducer fault diagnosis:

Explicit Noise Augmentation: It acts as a powerful and computationally efficient data augmentation technique that simulates a wide spectrum of random signal corruptions (similar to salt-and-pepper noise). By training on these countless corrupted variants of the original samples, the model is forced to learn feature representations that are invariant to such random perturbations, directly improving its generalization to real, unseen noise.

Feature Learning Guidance: By randomly erasing parts of the input signal, it breaks low-level, short-range correlations and local structures that might be specific to the clean training set but fragile under noise. This compels the network to rely on more distributed, global, and therefore more robust patterns across the 2D representation of the rotary vector reducer vibration signal for making decisions.

Dynamic Training Curriculum: The linearly increasing lower bound $ l $ on the keep probability $ p $ implements a curriculum learning strategy. Early in training, heavily corrupted inputs are frequent, which forces the network to learn the most fundamental and robust discriminative features. As training progresses and corruption decreases, the network can refine these features with more detailed information from less corrupted inputs, leading to a more precise and stable model.

2. The Advantage of Multi-Scale Kernels

The multi-scale convolutional block is well suited to complement the input Dropout strategy for diagnosing a complex system like the rotary vector reducer.

Comprehensive Feature Coverage: Faults in different components (gears, bearings) of a rotary vector reducer manifest as vibration signatures with characteristic frequency components and temporal spans. The simultaneous use of large ($15\times15$), medium ($7\times7$), and small ($3\times3$) kernels allows the network to capture features corresponding to low-frequency (long-period) trends, mid-frequency modulations, and high-frequency (short-duration) impulses within a single block, without committing to one manually selected kernel size.

Robustness to Corrupted Data: When the input is corrupted by Dropout (simulating noise), different types of information may be lost at different scales. A single-scale kernel might fail if its specific receptive field is heavily affected. The multi-scale architecture provides redundancy; if local details are corrupted, the larger kernels can still capture the overall shape or envelope of the signal, and vice-versa. The subsequent concatenation (feature fusion) allows the classifier to weigh these complementary pieces of evidence optimally.

To isolate the contribution of the multi-scale design, an ablation study was conducted by training variants of ANNet in which all kernels in the blocks were set to a single size. The results indicated that, although the $3\times3$ and $15\times15$ single-scale variants each retained some robustness under heavy noise, neither matched the performance of the multi-scale fusion across the entire noise spectrum. This synergy is the cornerstone of ANNet’s effectiveness for the fault diagnosis of rotary vector reducers in noisy industrial settings.

Conclusion

This work addresses the critical and practical challenge of fault diagnosis for rotary vector reducers under strong noise interference. The proposed ANNet model introduces a novel paradigm that integrates aggressive input corruption during training with a multi-scale feature learning architecture. The Input Signal Dropout mechanism explicitly trains the model on a vast array of randomly corrupted signal patterns, fostering the learning of noise-invariant features. Concurrently, the Multi-Scale Convolutional Blocks ensure that diverse and complementary signal characteristics—from broad trends to fine impulses—are extracted and fused, providing a robust descriptive basis for classification even when parts of the signal are unreliable.

Extensive experimental comparisons on a rotary vector reducer dataset under varying noise levels demonstrate that ANNet significantly outperforms several state-of-the-art deep learning models, especially in low Signal-to-Noise Ratio (SNR) scenarios. The performance advantage, which exceeds 20 percentage points in accuracy under severe noise (3 dB) relative to the standard CNN and ResNet, underscores its practical value for real-world industrial applications where signal quality is often poor. The interpretative analysis attributes the robustness to the synergistic effect of its two core design innovations. This approach provides a promising and generalizable framework for developing reliable health monitoring systems for critical mechanical components like the rotary vector reducer operating in noisy environments.