VSB Power Line Fault Detection

Suhas Aithal
28 min read · Mar 1, 2021

This blog post aims to provide one of the approaches used to solve the Kaggle problem of detecting the Partial Discharge (PD) pattern in the medium voltage power line signals.

The blog is structured in the following way:

  1. Business Problem
  2. Constraints
  3. Machine Learning problem statement
  4. Data overview
  5. Performance metric
  6. Exploratory Data Analysis
  7. Data pre-processing
  8. Featurization
  9. Final model
  10. Results
  11. Conclusion
  12. Future Works
  13. Profile

Business Problem

Overhead power lines run for hundreds of kilometers, transferring power from one region to another. These distances make it difficult and expensive to manually inspect the lines for any damage, which could be caused by a tree branch hitting the line or a flaw in the insulator. Such damage leads to a phenomenon called Partial Discharge (PD) in the insulators of the power line.

Partial Discharge (PD) is a localized dielectric breakdown (DB) (which does not completely bridge the space between the two conductors) of a small portion of a solid or fluid electrical insulation (EI) system under high voltage (HV) stress.

- Wikipedia

Basically, PD is an electrical discharge that does not completely bridge the electrodes of the insulation system. If left undetected and unrepaired in time, PD will slowly damage the power line, leading to power outages or even fires.

Figure 1: An illustration of what PD-affected insulation might look like. Left: a power line affected by PD. Right: a power line not affected by PD. Image Source

Therefore, it is imperative to detect the PD pattern present in the medium-voltage overhead power line signal which is the main objective of the Kaggle problem.

Timely detection and remediation:

  • Reduces the maintenance cost
  • Prevents power outages
  • Prevents fire from starting

Constraints

The cost of missing a PD pattern in a signal is very high, as this would lead to a power outage, affecting people's livelihoods, or could even start a fire. This implies False Negatives should be very low.

Detecting a PD pattern in a healthy signal should also be avoided, as this would increase the maintenance cost: resources would be put into repairing a healthy power line. This implies False Positives should also be low.

There is no strict latency requirement. The model can take some time (up to a few seconds) to predict the PD pattern. This is because PD does not affect the power line immediately rather it slowly deteriorates the health of the power line.

There is no strict interpretability requirement.

Machine Learning problem statement

The objective of the problem is to check if a PD pattern is present in the given signal (x(t)) or not, as shown in Figure 2.

Figure 2

Therefore, this is a binary classification problem.

Data overview

The data used to detect the presence of the PD pattern is provided by the VŠB. This data is acquired directly from the overhead power lines with the new meter designed at ENET Centre. Therefore, this is real-life data, and providing a viable solution to this problem would directly benefit the stakeholders at VŠB.

VŠB has provided the following two types of data for the Kaggle problem:

Signal data

  • The signal data is measured directly from the new meter designed at the ENET Centre.
  • Each signal consists of 800,000 floating-point data points.
  • There are 8,712 signals in the training data and 20,337 signals in the test data.
  • The signal data is provided in the parquet file format, where each column represents a signal. Therefore, the size of the train signal data is 800000x8712.

Metadata

Metadata contains the following information about the signals:

  • Signal id — A unique integer used to identify each signal. A signal id of ‘0’ corresponds to column ‘0’ in the signal data.
  • Phase id — 3 conductors are used to transfer the power from one region to another. Each conductor carries a signal. The phase of the signal in each conductor is different. The phase id mainly refers to the conductor in which the signal is being carried. Each signal id is associated with a unique phase id of 0, 1, or 2. Please refer to this for more information.
  • Measurement id — Since the signals are measured using a meter, each signal is associated with a measurement id. Each measurement id is associated with 3 signals corresponding to the 3 phases of the signal.
  • Target — This field provides information on whether a given signal has a PD pattern in it or not. A value of 0 corresponds to the PD pattern not present and a value of 1 corresponds to the PD pattern present for the respective signal id as shown in Figure 3. This field is present only for the training data and has to be determined for the test data.
Figure 3: ‘target’ field in metadata_train.csv

Figure 4 shows what the sample metadata for the training data looks like:

Figure 4: Metadata of training data

Performance metric

The key performance metric used by Kaggle to evaluate the model performance is the Matthews Correlation Coefficient (MCC). The formula to calculate MCC is shown in Figure 5.

Figure 5: MCC

Following are the important properties of MCC:

  • MCC gives information about how well the predicted class is correlating with the actual class in binary classification.
  • MCC considers all 4 values of the confusion matrix in its calculation. Therefore, it gives a complete picture of how a model is performing, unlike say F1 score which does not consider TN in its calculation.
  • MCC has a range between -1 and +1. Figure 6 shows what the extreme values indicate in MCC.
Figure 6: MCC values
  • MCC is perfectly symmetric implying swapping the classes only changes the sign of the correlation coefficient while keeping the value the same. This is not the case for the F1 score as swapping the classes changes the F1 score value. Please refer to this for more information on MCC.

Exploratory Data Analysis

Presence of Null or NaN values

The data needs to be checked for the presence of Null or NaN values. This will give an idea of how ‘clean’ the data is and if it needs to be brought to a workable format. The code shown below calculates the Null or NaN values in the train and test data. The result of the same is shown in Figure 7 and Figure 8.
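
A minimal sketch of this check, assuming the Kaggle file names metadata_train.csv, metadata_test.csv, and train.parquet, could look like the following:

```python
import pandas as pd

# Metadata files are small enough to load fully.
meta_train = pd.read_csv('metadata_train.csv')
meta_test = pd.read_csv('metadata_test.csv')
print('NaN/Null values in train metadata:\n', meta_train.isna().sum())
print('NaN/Null values in test metadata:\n', meta_test.isna().sum())

# The parquet signal files are large, so they can be checked in column chunks.
chunk = pd.read_parquet('train.parquet', columns=[str(i) for i in range(100)])
print('NaN/Null values in the first 100 train signals:', chunk.isna().sum().sum())
```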

Figure 7: Null and NaN values in train data.
Figure 8: Null and NaN values in test data.

Number of unique values

As discussed in the data overview part of the blog, the training metadata contains 4 fields namely: signal_id, measurement_id, phase, and target. The number of unique values in each field is shown below:
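
A one-liner along these lines (reusing the meta_train dataframe loaded earlier) produces these counts:

```python
# Unique values per metadata field (signal id, measurement id, phase, target).
print(meta_train.nunique())
```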

Figure 9: Number of unique values in train metadata.

Understanding the target distribution

The following code shows the distribution of the target data, i.e., the number of signals that have PD present in them and the number of signals that do not have PD.
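
A sketch of this step, assuming seaborn and matplotlib are available and meta_train has been loaded as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# How many signals have a PD pattern (target = 1) vs. not (target = 0)?
print(meta_train['target'].value_counts())

sns.countplot(x='target', data=meta_train)
plt.title('Distribution of the target in the training metadata')
plt.show()
```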

Figure 9: Distribution of signals with PD pattern (target=1) and without PD pattern (target=0)

From Figure 9, it is clear that the data is highly imbalanced. The number of signals that do not have a PD pattern overwhelms the number of signals that do, in a ratio of approximately 16:1.

Understanding target distribution — phase-wise

The following code shows the distribution of the target data, phase-wise, i.e., how many signals with a specific phase value have the PD pattern in them.
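
A minimal sketch of this breakdown (same assumptions as above):

```python
# Target counts broken down by phase.
print(meta_train.groupby(['phase', 'target']).size().unstack())

sns.countplot(x='phase', hue='target', data=meta_train)
plt.title('Phase-wise target distribution')
plt.show()
```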

Figure 10: target distribution

The distribution of the target data in each phase looks similar indicating that there is no skewness of the PD pattern towards a specific phase. This can be further verified by calculating the imbalance ratio as shown below:

Figure 11: Imbalance Ratio
Figure 12
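
A quick way to compute this ratio per phase (a sketch, reusing the per-phase counts from the previous step):

```python
# Imbalance ratio per phase: signals without PD divided by signals with PD.
counts = meta_train.groupby(['phase', 'target']).size().unstack()
print(counts[0] / counts[1])
```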

Phase 2 has a relatively higher number of signals with a PD pattern present, followed by Phase 1 and Phase 0 respectively.

Raw signal — outlook

The following code and Figure 13 show raw signals with and without a PD pattern for a specific measurement_id.
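
A sketch of how such a plot can be produced, assuming the first measurement id covers signal ids 0, 1, and 2 (one per phase) in train.parquet:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Signals 0, 1 and 2 belong to the same measurement id, one per phase.
cols = ['0', '1', '2']
signals = pd.read_parquet('train.parquet', columns=cols)

fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)
for ax, col in zip(axes, cols):
    ax.plot(signals[col].values)
    ax.set_title(f'signal_id {col}')
axes[-1].set_xlabel('sample index')
plt.tight_layout()
plt.show()
```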

Figure 13

The following observations can be made from the above raw signal data:

  • Each signal of a measurement id is of a different phase
  • Signals are extremely noisy
  • Visually identifying the PD pattern in the signal is not possible

The following summarizes the raw signal data:

  • Number of samples per signal = 800,000
  • Frequency of the signal = 50Hz
  • Time period of the signal = 1/50 = 20ms
  • Sampling interval = 20 ms / 800,000 = 25 ns
  • Sampling frequency = 800,000/20ms = 40MHz

Data pre-processing

The signal data cannot be used as-is, as it is extremely noisy. In order to extract any useful feature from the signal, the noise has to be removed from it. In fact, removing noise is one of the most important aspects of the problem.

Tomas Vantuch, a domain expert and data scientist from ENET at VŠB, who is also the competition host, has the following to say about denoising the signal:

To use any kind of wavelet transformation is very reasonable, butterworth filter was helpful for me to suppress the sine shape, DWT to obtain its close approximation — sometimes it is disrupted, and denoising with feature extractions are the alchemy of this competition. — Tomas Vantuch, Data Scientist from ENET at VŠB.

The discussion can be found here. The same is emphasized in this paper and in this thesis by Tomas. Hence, denoising is considered an important pre-processing step for all the signals.

The following denoising procedure is applied to each signal:

  1. Apply DWT on the signal using ‘db4’ wavelet
  2. Apply High Pass Filter

Discrete Wavelet Transform (DWT)

The power line signals are not smooth signals. They exhibit transients/abrupt changes as shown in Figure 14.

Figure 14: Image Source

The transients are typically signals of very high frequency and exist for a short duration. They usually contain a lot of information about the power signal, hence they need to be processed and studied.

Wavelet transforms are used to accurately analyze signals that have abrupt changes in them. Wavelets are rapidly decaying wave-like oscillations. The following Figure 15 shows examples of various types of wavelets.

Figure 15: Image Source

The wavelet used in the problem to denoise the signal is Daubechies 4 (db4) as per the discussion in this paper.

Once the wavelet is selected, the DWT is performed on the signal in the following way:

Step 1: Decomposition

The signals are first decomposed to get the Detail and the Approximation coefficients. They are obtained by recursively passing the signal through the High Pass Filter (HPF) and the Low Pass Filter (LPF) as shown in Figure 16.

Figure 16: Fast DWT computation process.

The HPF and the LPF contain the respective coefficient values of the db4 wavelet as shown in Figure 17.

Figure 17: Left to Right: db4 wavelet, Decomposition HPF coefficients, Decomposition LPF coefficients. Image Source

After passing through the HPF and LPF, the Detail and Approximation coefficients are obtained by performing a correlation operation between the signal and the respective coefficient values. For the current problem, only the 1st level Detail (D1) and Approximation (A1) coefficients are obtained. The signal is downsampled by 2 at each level, so the number of coefficients is also halved at each level.

The following code shows the decomposition of the signal into Detail and Approximation coefficients. Python's PyWavelets (pywt) library is used for all the DWT operations.
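
A minimal sketch of a single-level decomposition using pywt.dwt:

```python
import pywt

def dwt_decompose(signal, wavelet='db4'):
    """Level-1 DWT: returns the Approximation (A1) and Detail (D1) coefficients."""
    cA1, cD1 = pywt.dwt(signal, wavelet)
    return cA1, cD1
```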

Step 2: Thresholding

The D1 contains the information regarding the 1st level high frequencies of the signal while the A1 contains the information regarding the 1st level low frequencies of the signal.

Most of the high-frequency content contains both noise and information. To retain the information present in the abrupt changes while removing the noise, thresholding has to be performed.

The thresholding is performed on the 1st level detail coefficient values that are obtained upon the correlation of the signal with the 1st level HPF. If a coefficient value is less than the threshold, it is set to 0; if it is greater than the threshold, it is kept unchanged. This type of thresholding is called hard-thresholding. The same is illustrated in Figure 18.

Figure 18: Hard-thresholding. Reference

The formula shown in Figure 19, taken from this thesis, is used to compute the threshold applied to the detail coefficient values:

Figure 19: Threshold value formula. Reference
Figure 20: Left: Signal before hard-thresholding. Right: Signal after hard-thresholding. Image Source

The following code shows the hard-thresholding operation performed on the 1st level Detail coefficients of the signal.
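
A sketch of this operation; the common "universal" threshold (sigma times the square root of 2 ln N, with sigma estimated from the median absolute deviation) is used here as a stand-in for the exact formula of Figure 19:

```python
import numpy as np

def hard_threshold(cD1):
    """Hard-threshold the level-1 detail coefficients (universal threshold
    used as a stand-in for the formula in Figure 19)."""
    sigma = np.median(np.abs(cD1)) / 0.6745
    threshold = sigma * np.sqrt(2 * np.log(len(cD1)))
    cD1_thr = cD1.copy()
    cD1_thr[np.abs(cD1_thr) < threshold] = 0.0   # coefficients below the threshold -> 0
    return cD1_thr
```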

Step 3: Reconstruction

After thresholding the detail coefficients, the signal is reconstructed to get the denoised version of the signal.

The process of reconstruction is similar to the decomposition of the signal.

  • The DWT coefficients are first upsampled by 2 by placing zeros in between the samples, thereby doubling the length at each level.
  • The coefficients are then correlated with the respective reconstruction HPF and LPF coefficients.
  • The process is repeated recursively to get the denoised signal.

The HPF and LPF reconstruction coefficients are shown in Figure 21.

Figure 21: Left to Right: db4 wavelet, Reconstruction HPF coefficients, Reconstruction LPF coefficients. Image Source

The following code shows the reconstruction done using the wavelet coefficients.
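
A minimal sketch using the inverse DWT from PyWavelets:

```python
import pywt

def dwt_reconstruct(cA1, cD1_thr, wavelet='db4'):
    """Inverse DWT: rebuild the denoised signal from the unchanged approximation
    coefficients and the hard-thresholded detail coefficients."""
    return pywt.idwt(cA1, cD1_thr, wavelet)
```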

High Pass Filter

After performing DWT on the signal, it is then passed through an HPF with a 10kHz cut-off frequency. This is because the PD pattern is usually observed at frequencies above 10kHz as per this paper.

This is realized using a Butterworth HPF with a sharp cut-off at 10 kHz. The following code shows the implementation using SciPy's signal.butter function.
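
A sketch of this filter, using the 40 MHz sampling frequency derived earlier; the second-order-sections form is a design choice that keeps the high-order filter numerically stable:

```python
from scipy import signal

def butter_highpass(x, cutoff_hz=10_000, fs=40_000_000, order=10):
    """10th-order Butterworth high-pass filter with a 10 kHz cut-off."""
    sos = signal.butter(order, cutoff_hz, btype='highpass', fs=fs, output='sos')
    return signal.sosfiltfilt(sos, x)
```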

A 10th-order Butterworth filter is used to provide a sharp cutoff. Figure 22 shows how the sharpness of the cutoff increases with the order of the filter.

Figure 22: Transfer function ( |Hn(jw)| Vs. angular frequency (w) ) . Image Source

The following Figure 23 shows what the signal looks like after performing DWT denoising and passing it through the HPF.

Figure 23: Plots of a randomly selected signal without and with a PD pattern before and after denoising.

Featurization

The following features are extracted from the signal:

  1. Statistical features
  2. Peak features
  3. Phase-resolved features
  4. Cluster features
  5. Fractal features
  6. Entropy features

Statistical features

The entire signal comprising 800,000 data points is divided into 160 windows each of size 5000 data points. The following 19 statistical features are calculated for each window of the signal:

  • Mean
  • Standard deviation
  • Standard deviation — top, bottom
  • Percentile values — 0, 1, 25, 50, 75, 99, 100
  • Max range — 0th percentile value subtracted from 100th percentile value
  • Relative percentiles — mean value subtracted from all the percentile values

This will result in 160x19 feature values per signal, where each row represents the window being analyzed while each column represents the feature that is being calculated.

Each measurement id has 3 signals associated with it each of a distinct phase (Phase 0, 1, 2). Hence, the calculated statistical features of all the 3 signals of a measurement id are concatenated column-wise to get a 160x57 set of features per measurement id. The following Figure 24 illustrates the same:

Figure 24: m_id refers to the measurement id value
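
A sketch of these windowed statistics; "standard deviation — top, bottom" is interpreted here as mean plus/minus the standard deviation, which is an assumption since the exact definition is not spelled out above:

```python
import numpy as np

def window_stats(signal, n_windows=160, window_size=5000):
    """160 x 19 statistical features for one signal."""
    feats = []
    for i in range(n_windows):
        w = signal[i * window_size:(i + 1) * window_size]
        mean, std = w.mean(), w.std()
        std_top, std_bottom = mean + std, mean - std          # assumed definition
        pcts = np.percentile(w, [0, 1, 25, 50, 75, 99, 100])
        max_range = pcts[-1] - pcts[0]
        rel_pcts = pcts - mean
        feats.append(np.concatenate([[mean, std, std_top, std_bottom],
                                     pcts, [max_range], rel_pcts]))
    return np.array(feats)                                     # shape (160, 19)

# Per measurement id, the three phases are concatenated column-wise:
# np.concatenate([window_stats(s0), window_stats(s1), window_stats(s2)], axis=1)  # (160, 57)
```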

Peak features

As seen in Figure 23, the DWT-denoised signal still has a lot of peaks in it. These peaks are of 2 types as per this paper:

  1. Corona discharge peak/ false hit peak
  2. PD peak

The corona discharge peaks are the result of Random Pulse Interference (RPI), which is one of the major sources of noise in medium-voltage overhead power lines. RPI creates a false-hit peak which can easily be mistaken for a peak caused by PD. Therefore, identifying the corona discharge peaks (or false peaks) in the signal becomes an important aspect of identifying the PD pattern in the signal.

Identification of a corona discharge peak involves detecting a symmetric peak pair, as per this paper. A pair of peaks is said to be a symmetric peak pair if it meets the following criteria:

  1. The distance between the current peak and the next peak is less than the maxDistance units.
  2. The signs of the pair of peaks should be opposite to each other.
  3. The ratio of the heights of the pair of peaks should be greater than the maxHeightRatio units.

Once a symmetric peak pair is identified, all the peaks that are present within the maxTicksRemoval range around it also have to be considered corona discharge peaks. Figure 25 below illustrates what a corona discharge peak looks like.

Figure 25: Corona Discharge Peak Pattern. Image Source

The parameter values that are to be used to identify the symmetric peak pair in the signal, as per this paper are shown in Figure 26.

Figure 26: Parameter values. Image Source

The implementation of the corona discharge peak identification is shown in the code below, which returns the following:

  • true_peaks — The non-corona discharge peak indices in the signal
  • false_peaks — The corona discharge peak indices in the signal
  • sym_peak_pair — The symmetric peak pair indices in the signal
  • p_peaks — The positive peak indices of true_peaks in the signal
  • n_peaks — The negative peak indices of true_peaks in the signal
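
A sketch of this detection using SciPy's find_peaks; the default parameter values are placeholders and should be replaced by the values in Figure 26:

```python
import numpy as np
from scipy.signal import find_peaks

def corona_peak_analysis(x, height=(5, 100), max_distance=10,
                         max_height_ratio=0.25, max_ticks_removal=500):
    """Symmetric-peak-pair detection sketch (placeholder parameter defaults)."""
    pos, _ = find_peaks(x, height=height)          # positive peaks
    neg, _ = find_peaks(-x, height=height)         # negative peaks
    peaks = np.sort(np.concatenate([pos, neg]))

    sym_peak_pair, false_set = [], set()
    for a, b in zip(peaks[:-1], peaks[1:]):
        close = (b - a) < max_distance                                   # criterion 1
        opposite = np.sign(x[a]) != np.sign(x[b])                        # criterion 2
        ratio = min(abs(x[a]), abs(x[b])) / max(abs(x[a]), abs(x[b]))
        similar = ratio > max_height_ratio                               # criterion 3
        if close and opposite and similar:
            sym_peak_pair.append((a, b))
            # every peak near the symmetric pair is treated as a corona (false) peak
            false_set.update(peaks[(peaks >= a - max_ticks_removal) &
                                   (peaks <= b + max_ticks_removal)])

    false_peaks = np.array(sorted(false_set), dtype=int)
    true_peaks = np.array([p for p in peaks if p not in false_set], dtype=int)
    p_peaks = true_peaks[x[true_peaks] > 0]
    n_peaks = true_peaks[x[true_peaks] < 0]
    return true_peaks, false_peaks, sym_peak_pair, p_peaks, n_peaks
```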

The following Figure 27 and Figure 28 show the corona discharge peaks identified in an undamaged (no PD present) signal and in a damaged signal, respectively.

Figure 27: Corona Discharge Peaks in Undamaged signal
Figure 28: Corona Discharge Peaks in Damaged signal

Peak threshold selection

From Figure 26, we can take the upper limit of a peak, whether it is a corona discharge peak or not, as 100 units. This means that peaks whose value is greater than 100 units are not considered.

However, neither Figure 26 nor this paper mentions the lower limit of the peak. Therefore, the lower limit has to be determined. This is done in the following way:

  1. Get all the peak features for the peak threshold values: [1, 3, 5, 7, 10]
  2. Fit an XGBoost model on the peak features selected for the respective threshold.
  3. Perform 5-fold cross-validation and get the average MCC value for each threshold.
  4. Select the peak threshold with the highest average MCC.

The code performing the above-mentioned steps is shown below.
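
A sketch of the cross-validation loop; build_peak_features is a hypothetical helper standing in for the peak-feature extraction described above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef
from xgboost import XGBClassifier

def avg_cv_mcc(X, y, n_splits=5):
    """Average MCC over a 5-fold CV of an XGBoost model."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for tr, va in skf.split(X, y):
        model = XGBClassifier(n_estimators=200)
        model.fit(X[tr], y[tr])
        scores.append(matthews_corrcoef(y[va], model.predict(X[va])))
    return np.mean(scores)

# for thr in [1, 3, 5, 7, 10]:
#     X_thr = build_peak_features(denoised_signals, lower_limit=thr)  # hypothetical helper
#     print(thr, avg_cv_mcc(X_thr, y_train))
```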

The average MCC score for each threshold is shown in Figure 29.

Figure 29: Best average MCC scores for each threshold.

From Figure 29 it is clear that the average MCC score for the peak threshold value of 5 is the highest. Therefore, we can set the minimum limit of 5 for the peak to be considered for further processing.

Final peak features

After setting the lower limit as 5 and the upper limit as 100, the following peak features are calculated for each signal:

  1. The number of true positive peaks, true negative peaks, false peaks, and symmetric peak pairs.
  2. The mean, maximum, minimum, and median height values of all the true peaks.
  3. The mean, maximum, minimum, and median width values of all the true peaks.
  4. The mean, maximum, minimum, and median prominence values of all the positive peaks.
  5. The mean, maximum, minimum, and median prominence values of all the negative peaks.

This will come up to a total of 20 peak features for each signal.
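
A sketch of how these 20 features could be assembled from the outputs of the peak-detection function above, using SciPy's width and prominence helpers:

```python
import numpy as np
from scipy.signal import peak_prominences, peak_widths

def peak_features(x, true_peaks, p_peaks, n_peaks, false_peaks, sym_peak_pair):
    """The 20 per-signal peak features described above (sketch)."""
    stats = lambda v: [np.mean(v), np.max(v), np.min(v), np.median(v)] if len(v) else [0, 0, 0, 0]
    counts = [len(p_peaks), len(n_peaks), len(false_peaks), len(sym_peak_pair)]
    heights = np.abs(x[true_peaks])
    widths = np.concatenate([peak_widths(x, p_peaks)[0], peak_widths(-x, n_peaks)[0]])
    prom_pos = peak_prominences(x, p_peaks)[0]
    prom_neg = peak_prominences(-x, n_peaks)[0]
    return counts + stats(heights) + stats(widths) + stats(prom_pos) + stats(prom_neg)
```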

As with the statistical features, the peak features are also obtained with respect to the measurement id. This is done by taking the mean of each feature across the three phases of each measurement id.

Phase-resolved features

As discussed above in the ‘Metadata’ section of the blog, each signal of the measurement id is shifted in time (phase-shifted) by a certain degree. This is to ensure that constant power is made available to the load at all times. The illustration of the phase-shifted sine wave is shown in Figure 30 and the code to generate the same is shown below.
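
A minimal sketch of such a plot, assuming the standard 120-degree spacing between the three phases:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 0.02, 800, endpoint=False)   # one 20 ms cycle of a 50 Hz wave
for shift_deg, label in [(0, 'Phase 0'), (120, 'Phase 1'), (240, 'Phase 2')]:
    plt.plot(t, np.sin(2 * np.pi * 50 * t + np.deg2rad(shift_deg)), label=label)
plt.legend()
plt.xlabel('time (s)')
plt.title('Phase-shifted sine waves')
plt.show()
```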

Figure 30: Phase-shifted sine waves.

Therefore, the peaks of the signals observed in the previous sections are in different quadrants with respect to the 0-degree phase of the sinusoidal signal of the same frequency (50 Hz). Hence, the phases of each signal have to be resolved so that the quadrant to which each peak belongs can be identified with respect to the 0-degree phase of the sinusoidal signal.

Figure 31 shows what the signal looks like after resolving the phase, along with the number of peaks present in each quadrant after phase-resolution.

Figure 31: Peaks in the phase-shifted signal (A), Peaks in the same signal after correcting the phase (D). Image Source

The phase resolution involves the application of the Discrete Fourier Transform (DFT), which is given in Figure 32.

Figure 32: Discrete Fourier Transform

DFT is used to:

  • Identify the phase of the signal
  • Identify the quadrant in which the peak is observed

Identifying the phase angle of the signal

Referring to the equation of the DFT in Figure 32, X(k) gives the weighted sum of a vector rotating in a complex plane such that it covers N samples in k cycles.

The value of k refers to the number of cycles that are present in the given signal for N samples. The current signals have 1 cycle for N samples. Therefore the value of k is 1.

This implies that X(1) gives the complex vector representation of the given signal and finding the angle of X(1) gives the phase angle of the signal.
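
A one-line sketch of this idea with NumPy's FFT:

```python
import numpy as np

def signal_phase_angle(x):
    """Phase of the 50 Hz fundamental: the angle of the k = 1 DFT coefficient."""
    return np.angle(np.fft.fft(x)[1])
```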

Identifying the quadrant in which the peak is observed

The equation in Figure 33 gives the complex vector representation of nᵗʰ sample of a 0-degree phase-shifted signal.

Figure 33: Complex vector of the nth sample of the signal

The xₙ gives the amplitude of the complex vector while (-2πn/N) gives the phase angle of the 0-degree phase-shifted signal at nᵗʰ sample.

Replacing the value of n in Figure 33 with the index location where the peak is observed gives the respective amplitude and the phase angle of the peak in a 0-degree phase-shifted signal.

To obtain the phase angle of the peak in a signal whose phase is shifted by a certain degree, the calculated phase angle of the peak has to be added to the phase angle of the signal where the peak was observed, obtained in the previous section.

Finally, after getting the phase angles of all the peaks in a given signal, they are binned in the following way:

  • 0ᵒ — 90ᵒ: 1st quadrant
  • 90ᵒ — 180ᵒ: 2nd quadrant
  • 180ᵒ — 270ᵒ: 3rd quadrant
  • 270ᵒ — 360ᵒ: 4th quadrant

Final phase-resolved features

The following phase-resolved features are obtained after following the above-mentioned steps:

  1. Number of true peaks in quadrants I, II, III, and IV
  2. Number of false peaks in quadrants I, II, III, and IV
  3. Number of positive peaks in quadrants I, II, III, and IV
  4. Number of negative peaks in quadrants I, II, III, and IV

This will come up to a total of 16 phase-resolved features for each signal. The code performing the above-mentioned steps is shown below:
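
A sketch of the quadrant binning; the sign conventions follow the DFT equations above and should be treated as an assumption:

```python
import numpy as np

def quadrant_counts(peak_indices, phase_angle, n=800_000):
    """Count peaks per quadrant after resolving the signal's own phase."""
    # angle of each peak in a 0-degree reference signal, shifted by the signal's phase
    angles = np.rad2deg(-2 * np.pi * peak_indices / n + phase_angle) % 360
    quadrant = np.digitize(angles, [90, 180, 270])   # 0..3 -> quadrants I..IV
    return [int(np.sum(quadrant == q)) for q in range(4)]

# 16 features per signal: quadrant_counts(...) for true, false, positive and negative peaks.
```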

Cluster features

The cluster feature refers to the clustering of peaks in each signal. It provides the following information about the peaks in the signal:

  1. Number of clusters (groups) of true peaks present in the signal (total_clusters)
  2. Number of clusters (groups) that have more than 2 true peaks in the signal (num_cluster_groups)

The following process is followed to identify the peak clusters in the signal:

  • Get the indices of all the true peaks in the signal
  • Get the difference between the consecutive true peak indices of the signal
  • 2 peak indices are considered to be part of the same cluster if the difference between them is less than a threshold value

The code implementing the above-mentioned process is shown below:
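
A minimal sketch of the clustering by gap size between consecutive true-peak indices:

```python
import numpy as np

def cluster_features(true_peaks, threshold):
    """total_clusters and num_cluster_groups for one signal."""
    if len(true_peaks) == 0:
        return 0, 0
    cluster_sizes, size = [], 1
    for gap in np.diff(true_peaks):
        if gap < threshold:          # same cluster
            size += 1
        else:                        # a large gap starts a new cluster
            cluster_sizes.append(size)
            size = 1
    cluster_sizes.append(size)
    total_clusters = len(cluster_sizes)
    num_cluster_groups = sum(1 for s in cluster_sizes if s > 2)
    return total_clusters, num_cluster_groups
```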

The threshold value here is the hyperparameter which is tuned based on the correlation of the final cluster features with the target. The following threshold values were considered for hyperparameter tuning: (10, 25, 50, 75) where 10 refers to the 10th percentile value of the difference between the consecutive true peak indices and so on.

The code shown below performs the hyperparameter tuning of the threshold value. Hyperparameter tuning is done in the following way:

  1. Get the total_clusters and num_cluster_groups features from all the signals in the training data for a given threshold value.
  2. Perform 5-fold Cross-Validation (CV) by fitting an XGBoost model. Use MCC as the performance metric which is to be maximized.
  3. Select the best among the 5 CV scores as the representation for the threshold value.
  4. Repeat the process for all the threshold values.
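
This search largely mirrors the earlier peak-threshold search; a compact sketch, reusing cluster_features from above, where train_true_peaks and y_train are hypothetical variables holding the per-signal true-peak indices and the targets:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef
from xgboost import XGBClassifier

def best_fold_mcc(X, y, n_splits=5):
    """Best fold MCC (rather than the average) for one candidate threshold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for tr, va in skf.split(X, y):
        model = XGBClassifier(n_estimators=200)
        model.fit(X[tr], y[tr])
        scores.append(matthews_corrcoef(y[va], model.predict(X[va])))
    return max(scores)

# all_gaps = np.concatenate([np.diff(p) for p in train_true_peaks])   # hypothetical inputs
# for pct in [10, 25, 50, 75]:
#     thr = np.percentile(all_gaps, pct)
#     X_cluster = np.array([cluster_features(p, thr) for p in train_true_peaks])
#     print(pct, best_fold_mcc(X_cluster, y_train))
```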

The following Figure 34 shows the results of all the threshold values:

Figure 34: Results of the threshold selection for the cluster features

From Figure 34, it can be seen that the threshold value of 50, i.e., the 50th percentile of the difference between the consecutive true peak indices of all the signals, gives the best correlation between the cluster features calculated using this threshold and the target value.

Fractal features

In mathematics, a fractal is a self-similar subset of Euclidean space whose fractal dimension strictly exceeds its topological dimension. Fractals appear the same at different levels, as illustrated in successive magnifications of the Mandelbrot set. Fractals exhibit similar patterns at increasingly small scales called self-similarity, also known as expanding symmetry or unfolding symmetry; if this replication is exactly the same at every scale, as in the Menger sponge, it is called affine self-similar. — Wikipedia

Fractals are used to measure the self-similarity of a given object. In signal processing, fractals roughly help us measure how smooth/rough the signal is. Please refer to this for more information about fractals.

The following illustration in Figure 35 shows the application of fractals on various types of signals.

Figure 35: Fractal application on various type of signals

The signals a, b, and c in Figure 35 are smooth signals that are phase-shifted by a certain degree, while signal d is a rough signal. The various fractal dimensions, like Petrosian, Katz, and Detrended Fluctuation Analysis (DFA), calculated on the above signals indicate that:

  • The fractal values of the signals a, b, and c differ little from one another.
  • The fractal values of the signals a, b, and c differ considerably from that of signal d.

A similar kind of analysis is performed on the current medium voltage power line signals to measure the ‘roughness’ of the signal with and without the presence of the PD pattern.

As mentioned above, the following fractal dimension values are obtained for each signal:

  1. Petrosian Fractal Dimension
  2. Katz Fractal Dimension
  3. Detrended Fluctuation Analysis (DFA)
  4. Higuchi Fractal Dimension

These fractal values are calculated on the DWT denoised signal.

Petrosian Fractal Dimension

Petrosian Fractal Dimension (FD) is used to provide a fast computation of FD of a signal by translating the signal into a binary sequence.

Here a binary sequence is created based on the difference between consecutive samples of the signal. If the difference between consecutive samples is greater than the standard deviation of the signal, the respective sequence value is set to 1; otherwise it is set to 0.

Once the binary sequence is created, the Petrosian FD is calculated using the formula shown in Figure 36.

Figure 36: Petrosian FD Formula

The code shown below performs the above-mentioned operation. Code reference
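
A sketch of this computation; the binary sequence follows the description above, and the standard Petrosian formula is assumed for Figure 36:

```python
import numpy as np

def petrosian_fd(x):
    """Petrosian fractal dimension (sketch)."""
    n = len(x)
    # 1 where the jump between consecutive samples exceeds the signal's std, else 0
    binary = (np.abs(np.diff(x)) > x.std()).astype(int)
    # N_delta: number of changes in the binary sequence
    n_delta = np.count_nonzero(np.diff(binary))
    return np.log10(n) / (np.log10(n) + np.log10(n / (n + 0.4 * n_delta)))
```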

Katz Fractal Dimension

Katz FD is computed directly from the waveform thereby eliminating the preprocessing step of creating a binary sequence as in the case of Petrosian FD.

The formula to calculate the Katz FD is shown in Figure 37.

Figure 37: Katz FD Formula

The code shown below calculates the Katz FD for a given signal. Code reference
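
A sketch of the standard Katz computation, assumed to match the formula in Figure 37:

```python
import numpy as np

def katz_fd(x):
    """Katz fractal dimension, computed directly from the waveform."""
    dists = np.abs(np.diff(x))
    L = dists.sum()                      # total curve length
    a = dists.mean()                     # average step size
    d = np.max(np.abs(x - x[0]))         # maximum distance from the first sample
    n = L / a                            # number of steps
    return np.log10(n) / (np.log10(n) + np.log10(d / L))
```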

Detrended Fluctuation Analysis (DFA)

The fractal dimension of the signal using DFA is calculated in the following way:

  1. Divide the signal into windows of size w.
  2. Fit a polynomial curve of degree 2 in each window such that the squared error in each window is minimized.
  3. Calculate the average error over the entire signal for that window size.
  4. The error is directly proportional to the window size, as shown in Figure 38.
Figure 38: Detrended Fluctuation Analysis

5. Solving the equation in Figure 38 gives the Fractal Dimension (D) as shown in Figure 39.

Figure 39: Fractal Dimension calculation using DFA

Here the slope of the plot of ln(E) vs. ln(w) gives the Fractal Dimension.

The code shown below calculates FD using DFA. Code reference
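
A sketch that follows the steps above literally (degree-2 detrending per window, slope of ln(error) vs. ln(window size)); the window sizes are illustrative assumptions:

```python
import numpy as np

def dfa_fd(x, window_sizes=(128, 256, 512, 1024, 2048), degree=2):
    """Detrended Fluctuation Analysis sketch, as described above."""
    errors = []
    for w in window_sizes:
        errs = []
        for i in range(len(x) // w):
            seg = x[i * w:(i + 1) * w]
            t = np.arange(w)
            fit = np.polyval(np.polyfit(t, seg, degree), t)
            errs.append(np.sqrt(np.mean((seg - fit) ** 2)))   # RMS error per window
        errors.append(np.mean(errs))
    # slope of ln(E) vs. ln(w) gives the fractal-dimension feature
    slope = np.polyfit(np.log(window_sizes), np.log(errors), 1)[0]
    return slope
```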

Higuchi Fractal Dimension

The Higuchi FD is calculated as shown in Figure 40.

Figure 40: Image Source

The code shown below calculates the Higuchi FD. Code reference
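
A sketch of the standard Higuchi algorithm, assumed to correspond to the procedure in Figure 40:

```python
import numpy as np

def higuchi_fd(x, k_max=10):
    """Higuchi fractal dimension (standard algorithm, sketched)."""
    N = len(x)
    lk = []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):
            idx = np.arange(m, N, k)
            if len(idx) < 2:
                continue
            # normalised curve length of the sub-series starting at m with step k
            n_intervals = len(idx) - 1
            lengths.append(np.sum(np.abs(np.diff(x[idx]))) * (N - 1) / (n_intervals * k * k))
        lk.append(np.mean(lengths))
    # the fractal dimension is the slope of ln(L(k)) vs. ln(1/k)
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, k_max + 1)), np.log(lk), 1)
    return slope
```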

Entropy features

In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes. — Wikipedia

It is also known as Shannon’s Entropy and is defined as shown in Figure 41.

Figure 41: Shannon’s Entropy

This paper discusses some of the drawbacks of Shannon’s Entropy, which are mentioned below:

  • It neglects the temporal relationships between the values of the time series.
  • It requires some prior knowledge about the system. The PDF of the time series under analysis should be provided beforehand.
  • It is best designed to deal with linear systems and poorly describes non-linear, chaotic systems.

The following entropy measures were formulated to overcome the limitations of Shannon's entropy for time series data.

  1. Permutation entropy
  2. SVD entropy

Permutation entropy

The permutation entropy of a time series data is calculated using the formula shown in Figure 42.

Figure 42: Permutation Entropy

The permutation entropy is calculated by creating a set of sub-series (vectors) of data from the given time series data (population). The sub-series are obtained by sampling the time series data based on the following two parameters:

  1. Order — Defines the length (number of samples) of the sub-series.
  2. Delay — Defines the number of values in the time series data (population) to skip while sampling for the sub-series. A delay of 1 refers to a sample of consecutive values in the time series data.

Upon creating a set of sub-series, the values of each sub-series of data are sorted and their respective original index values are obtained, i.e., ‘numpy.argsort’ is performed on each sub-series of the data. This is referred to as π as per the equation shown in Figure 42.

The illustration of the calculation of the permutation entropy is shown in Figure 43.

Figure 43: Permutation Entropy calculation example

The code to calculate the permutation entropy for a given time series data is shown below. Code reference.
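
A sketch of this computation (embed the series, argsort each embedded vector, take the Shannon entropy of the pattern distribution):

```python
import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1, normalize=True):
    """Permutation entropy sketch, following the steps described above."""
    n = len(x) - (order - 1) * delay
    # embedded sub-series matrix, shape (n, order)
    emb = np.array([x[i:i + (order - 1) * delay + 1:delay] for i in range(n)])
    patterns = np.argsort(emb, axis=1)            # the 'pi' permutation patterns
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    pe = -np.sum(p * np.log2(p))
    if normalize:
        pe /= np.log2(factorial(order))
    return pe
```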

SVD entropy

Here SVD refers to Singular Value Decomposition. The entropy of the time series data is calculated using the singular values of the matrix formed from the sub-series of the data.

The SVD entropy is calculated using the formula shown in Figure 44.

Figure 44: SVD Entropy

Similar to permutation entropy, a set of sub-series vectors are formed from the time-series data using the order and delay parameters. The created sub-series vectors are then arranged in a matrix form of size (number of sub-series vector) x (order). The singular values of this matrix are obtained using SVD. This is referred to as σ in the equation shown in Figure 44.

The illustration of the calculation of SVD entropy on time series data is shown in Figure 45.

Figure 45: SVD Entropy calculation example

The code to calculate the SVD entropy for a given time series data is shown below. Code reference.
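
A sketch of this computation (embedding matrix, singular values normalised to sum to 1, Shannon entropy of those values):

```python
import numpy as np

def svd_entropy(x, order=3, delay=1, normalize=True):
    """SVD entropy sketch, following the steps described above."""
    n = len(x) - (order - 1) * delay
    emb = np.array([x[i:i + (order - 1) * delay + 1:delay] for i in range(n)])
    s = np.linalg.svd(emb, compute_uv=False)      # the 'sigma' singular values
    s /= s.sum()
    ent = -np.sum(s * np.log2(s))
    if normalize:
        ent /= np.log2(order)
    return ent
```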

Final Model

The final binary classification is performed using a Deep Learning (DL) model comprising:

  • Bi-directional LSTM
  • Bi-directional GRU
  • Attention layer
  • Dense layer

The architecture of the DL model is as shown in Figure 46.

Figure 46: Final Deep Learning Model

Bi-directional LSTM

Long Short Term Memory (LSTM) networks are used when the input is a sequence, mainly to capture the relationship between a given element of the sequence and its past. A uni-directional LSTM captures the relationship of the current element with the past elements, while a bi-directional LSTM captures its relationship with the future elements as well.

Figure 47 illustrates what a cell of a uni-directional LSTM looks like. Kindly refer to this blog to understand more about LSTMs.

Figure 47: LSTM cell. Image Source

The statistical features discussed earlier, which are of dimension (160x57), are fed as input to the Bi-directional LSTM. Here,

  • 160 is the length of the input sequence
  • 57 is the dimensionality of each input sequence

A total of 128 Bi-directional LSTM units are used in the first layer. The illustration of the same is shown in Figure 48.

Figure 48: Left — Sequence fed in from left to right. Right — Sequence fed in from right to left.

The code snippet to implement the same is shown below.
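
A sketch of this layer, assuming the tf.keras functional API:

```python
from tensorflow.keras.layers import Input, Bidirectional, LSTM

# 160 windows per measurement id, 57 statistical features per window
inp = Input(shape=(160, 57))
x = Bidirectional(LSTM(128, return_sequences=True), merge_mode='concat')(inp)
```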

In the above code:

  • ‘return_sequences=True’ makes the LSTM return its output at every timestep of the input sequence (rather than only the final output), which is required because the next layer also consumes a sequence.
  • ‘merge_mode=`concat`’ concatenates the outputs of the forward and the backward LSTM units to which xᵢ,ⱼ is given as input. The illustration of this is shown in Figure 49.
Figure 49: Forward and Backward LSTM unit output concatenation.

Bi-directional GRU

Gated Recurrent Unit (GRU) is a slight variant of the LSTM. A GRU combines the forget and input gates into a single update gate, along with a few other minor modifications. It has fewer parameters than an LSTM and is therefore relatively faster. Just like a Bi-directional LSTM, a Bi-directional GRU also helps in capturing the relationship of the current input sequence element with both its previous and next elements.

The illustration of GRU is shown in Figure 50.

Figure 50: GRU cell. Image Source

The concatenated forward and backward LSTM outputs at each timestep are fed as input to the forward and backward GRU units at the same timestep. The illustration of the same is shown in Figure 51.

Figure 51: GRU input example.

A total of 64 GRU units are used in both forward and backward directions in this layer. The illustration of the same is shown in Figure 52.

Figure 52: Left — Output of LSTM fed in from left to right. Right — Output of LSTM fed in from right to left.

The code snippet to generate the above-mentioned GRU units is shown below.
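
A sketch of this layer, continuing the functional-API snippet above:

```python
from tensorflow.keras.layers import Bidirectional, GRU

# the 256-dimensional concatenated LSTM outputs feed the 64-unit bidirectional GRU
x = Bidirectional(GRU(64, return_sequences=True), merge_mode='concat')(x)
```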

Just like in Bi-directional LSTM, the Bi-directional GRU also has the following:

  • ‘return_sequences=True’, which makes the GRU return its output at every timestep so that the whole sequence can be passed on to the Attention layer.
  • ‘merge_mode=`concat`’, which concatenates the outputs of the forward and the backward GRU units to which x′ᵢ,ⱼ is given as input. The illustration of the same is shown in Figure 53.
Figure 53: Forward and Backward GRU unit output concatenation.

Attention layer

LSTMs/GRUs are designed for and are good at handling sequential inputs whose sequence length is short, typically in the range of 10 to 20. As the sequence length grows beyond these values, the performance of LSTMs/GRUs drops. This can be seen in Figure 54, which shows the translation accuracy (measured using the BLEU score) of LSTMs (RNNs in general) as the sequence length increases. Here, the accuracy of various RNNs in translating a sequence of words from one language to another is considered.

Figure 54: BLEU score vs. Sentence length. Image Source

RNNsearch-50 and RNNsearch-30 are the models with the Attention layer while RNNenc-50 and RNNenc-30 are the ones without the Attention layer. The Attention layer used here is from this paper which was proposed by Bahdanau.

Although the Attention layer proposed by Bahdanau performs well on longer sequences than models without Attention, its performance also drops when the sequence length goes beyond 100. To handle even longer sequences, Raffel et al. proposed a slight modification to Bahdanau's Attention layer in this paper.

The modification is mainly in the calculation of the context vector as shown in Figure 55.

Figure 55: Left — Bahdanau’s Attention. Right — Raffel’s Attention. Image Source

The schematic of Raffel’s Attention is shown in Figure 56.

Figure 56: Schematic of Raffel’s Attention. Image Source

The code snippet of the implementation of Raffel's Attention layer is shown below.
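
A sketch of a feed-forward attention layer in the spirit of Raffel et al. (a learned score per timestep, softmax over time, weighted sum of the timestep outputs), written as a custom tf.keras layer; this is an illustrative implementation, not the exact one used in the original experiments:

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer

class Attention(Layer):
    """Feed-forward attention: one scalar score per timestep, softmax over time."""
    def build(self, input_shape):
        self.w = self.add_weight(name='att_w', shape=(input_shape[-1], 1),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='att_b', shape=(input_shape[1], 1),
                                 initializer='zeros', trainable=True)
        super().build(input_shape)

    def call(self, x):
        # e_t = tanh(x_t W + b), one score per timestep
        e = tf.tanh(tf.tensordot(x, self.w, axes=1) + self.b)   # (batch, T, 1)
        a = tf.nn.softmax(e, axis=1)                            # attention weights over time
        return tf.reduce_sum(x * a, axis=1)                     # context vector (batch, features)
```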

The dimension of the output of the Attention layer is the same as the dimension of the bidirectional GRU output, i.e., 128. This 128-dimensional vector is further concatenated with the other features (peak features, phase-resolved features, cluster features, fractal features, and entropy features), which add up to 98 features. Therefore, a 226-dimensional vector (128 + 98) is fed as input to the dense layer.

Dense layers

The concatenated output is fed as input to a Dense layer consisting of 64 units, each with ReLU activation. The output of this dense layer is fed as input to another dense layer with a single unit having Sigmoid as the activation function.

The entire Neural Network (NN) minimizes the binary cross-entropy loss function using ‘Adam’ as the optimizer during the backpropagation. Matthews Correlation Coefficient (MCC) is the evaluation metric.

The code snippet of the implementation of the entire NN is shown below.
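
A sketch of the whole network in tf.keras, reusing the Attention layer defined above; the input names are illustrative:

```python
from tensorflow.keras.layers import Input, Bidirectional, LSTM, GRU, Dense, concatenate
from tensorflow.keras.models import Model

seq_input = Input(shape=(160, 57), name='statistical_features')   # per measurement id
feat_input = Input(shape=(98,), name='handcrafted_features')      # peak/phase/cluster/fractal/entropy

x = Bidirectional(LSTM(128, return_sequences=True), merge_mode='concat')(seq_input)
x = Bidirectional(GRU(64, return_sequences=True), merge_mode='concat')(x)
x = Attention()(x)                            # 128-dim context vector (layer defined above)
x = concatenate([x, feat_input])              # 128 + 98 = 226-dim vector
x = Dense(64, activation='relu')(x)
out = Dense(1, activation='sigmoid')(x)

model = Model(inputs=[seq_input, feat_input], outputs=out)
# MCC has no built-in Keras metric, so it is computed on validation predictions separately.
model.compile(loss='binary_crossentropy', optimizer='adam')
```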

Threshold selection

The output of the final single-unit dense layer is a floating-point value between 0 and 1, indicating the probability of a PD pattern being present in the given signal. This floating-point output has to be converted to an integer value of 0 or 1 to tell whether the PD pattern is present in the signal or not. This is done by selecting a suitable threshold value such that it maximizes the MCC on the training data.

The code snippet of the implementation of the threshold selection is shown below:
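
A minimal sketch of this search; the candidate grid is an illustrative assumption:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_threshold(y_true, y_prob, candidates=np.arange(0.1, 0.91, 0.01)):
    """Probability cut-off that maximises MCC on the training data."""
    scores = [matthews_corrcoef(y_true, (y_prob > t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))], max(scores)

# threshold, train_mcc = best_threshold(y_train, model.predict([X_seq, X_feat]).ravel())
```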

The obtained best threshold value is used to convert the floating-point outputs on the training and test data to an integer value of 0 or 1.

Since the signals of each phase are combined for a given measurement id (by concatenation in the case of statistical features and by averaging over the three phases in the case of the other features), the final output is with respect to the measurement id and not with respect to the individual signal. To get an output for each signal, the predicted value for a measurement id is replicated for all the signals in that measurement id. This ensures that if a particular measurement id has a PD pattern present, then all three conductors (phases) corresponding to that measurement id are considered faulty and proper maintenance is administered to them.

Results

Training data

The model achieved an MCC score of 0.7105 on the training data and an AU-ROC value of 0.8217 as shown in Figure 57.

Figure 57: Area Under the ROC curve

The confusion matrix of the training data is shown in Figure 58.

Figure 58: Confusion Matrix

Test data

Private and Public scores of 0.66657 and 0.67378, respectively, were obtained on the test data. This resulted in a rank of 40 (top 3 percent) on the private leaderboard of the competition. The screenshot of the same is shown in Figure 59.

Figure 59: Public and Private Leaderboard Scores

Conclusion

In conclusion, the following aspects in this approach were found to be more useful than the others in classifying the PD pattern in the signal:

  • Signal de-noising using DWT
  • Domain-specific features (peak features, corona discharge pattern)
  • Bi-directional LSTM based Deep Neural Network

Future Works

Based on the discussion in the Kaggle forums, the following approaches could lead to a better MCC score:

  • Use better hyper-parameter tuning of the existing approach
  • Use better domain-specific features
  • Use an entirely different architecture based on CNNs

Profile
