Effectiveness of Partition and Graph Theoretic Clustering Algorithms for Multiple Source Partial Discharge Pattern Classification Using Probabilistic Neural Network and Its Adaptive Version : A Critique Based on Experimental Studies

Partial discharge (PD) is a major cause of failure of power apparatus, and hence its measurement and analysis have emerged as a vital field in assessing the condition of the insulation system. Several efforts have been undertaken by researchers to classify PD pulses utilizing artificial intelligence techniques. Recently, the focus has shifted to the identification of multiple sources of PD, since these are often encountered in real-time measurements. Studies have indicated that the classification of multi-source PD becomes difficult with increasing degree of overlap and that several techniques, such as mixed Weibull functions, neural networks, and wavelet transformation, have been attempted with limited success. Since digital PD acquisition systems record data over substantial periods, the database becomes large, posing considerable difficulties during classification. This research work aims firstly at analyzing aspects concerning classification capability during the discrimination of multisource PD patterns. Secondly, it extends the authors' previous work on the novel approach of utilizing probabilistic neural network versions, from classifying moderate sets of PD sources to large sets. The third focus is on comparing the ability of partition-based algorithms, namely, the labelled (learning vector quantization) and unlabelled (K-means) versions, with that of a novel hypergraph-based clustering method in providing parsimonious sets of centers during classification.


Introduction
Among various techniques for insulation diagnosis, partial discharge (PD) measurement is considered a vital tool since it is inherently a nondestructive testing technique. PD is an electrical breakdown confined to a localized region of the insulating system of a power apparatus. PD, which may result in physical deterioration due to chemical degradation of the insulation system, may occur as internal discharges in cavities, voids, blow-holes, and gaps at interfaces, or as external discharges at surface imperfections, sharp points, and protrusions (corona discharges). It is of major practical relevance for researchers and utility operators to be able to discriminate the sources of PD, their geometry, and their location, since such measurements are intimately related to the condition monitoring and diagnosis of the insulation system of such equipment. A few pertinent attributes [1] of PD pulses are their magnitude, rise time, recurrence rate, phase of occurrence, time interval between successive pulses, and discharge inception and extinction voltages. Owing to advances in digital hardware, increases in the computational speed of processors and coprocessors, and improvements in associated data acquisition systems, there has been a renewed focus among researchers on PD analysis [2]. Moreover, in recent years the trend has shifted to the recognition of patterns due to multiple sources of PD, since these are often encountered during on-site, real-time measurements wherein distinguishing the various sources of PD becomes increasingly challenging.
Three major facets have been taken up for detailed study and analysis during the classification of multi-source PD patterns. The first pertains to ascertaining the ability of the PNN versions without clustering algorithms to handle ill-conditioned and large training datasets. The second is to assess the role of partition-based clustering algorithms (labelled: versions of LVQ; unlabelled: versions of K-means), as compared to a novel graph-theoretic clustering technique (hypergraph), in providing frugal sets of representative centers during the training phase. The third is an analysis of the role played by preprocessing/feature extraction techniques in addressing the curse of dimensionality and facilitating the classification task. In addition, a well-established estimation method that utilizes the inequality condition pertaining to various statistical measures of mean has been implemented as part of the feature extraction stage to ascertain the capability of the proposed NNs in classifying the patterns. Further, exhaustive analysis is carried out to determine the role played by the free (variance) parameter in distinguishing the classes, the number of iterations and its impact on computational cost during the training phase of the NNs which utilize the clustering algorithms, and the choice of the number of clusters/codebook vectors in classifying the patterns.

Preprocessing, Feature Extraction, and Neural Networks for Partial Discharge Pattern Classification: A Review
2.1. Preprocessing and Feature Extraction. A wide range of preprocessing and feature extraction approaches have been utilized by researchers worldwide for the task of PD pattern classification. Researchers involved in studies related to the identification and discrimination of PD sources have usually resorted to the phase-resolved PD (PRPD) approach, wherein statistical operators have been widely utilized, including measures based on moments (skewness and kurtosis) [25-28], measures of dispersion (range, standard deviation, variance, quartile deviation, etc.), measures of central tendency (arithmetic mean, median, moving average, etc.), cross-correlation, and discharge asymmetry. In studies related to time-resolved PD analysis, pulse characteristic tools which include parameters such as pulse rise time, decay time, pulse width, repetition rate, quadratic rate, and peak discharge magnitude have also been attempted. Where signal-processing tools are utilized, the feature vectors consist of average values of the spectral components in the frequency domain.

2.2. Neural Networks for Pattern Recognition. The prelude to PD pattern recognition studies can be traced to [29], wherein a multilayer perceptron (MLP) based feedforward neural network (FFNN), trained with the back propagation algorithm (BPA), was a remarkable success. Though the initial study was noteworthy and opened exciting avenues, further analysis on exhaustive data indicated that the basic version was computationally expensive due to long training epochs.
Further studies with radial basis function (RBF) neural networks, as reported in [30], showed improved performance and convergence during the supervised training phase, with better discrimination of the decision surface of the feature vectors. However, the tradeoff between unreasonably long training epochs and improved classification rate continued to present challenges. Subsequently, unsupervised learning neural networks such as the self-organizing map (SOM), counter propagation NN (CPNN) [31], and adaptive resonance theory (ART) [32] have been utilized for the classification of single-source PD signatures with a considerable level of satisfaction. However, several aspects clearly substantiate the need for a renewed focus on realizing a comprehensive yet simple NN scheme for the classification task: complications related to the inherently non-Markovian nature of the pulses, further aggravated by varying applied voltages during normal operation; the predictable incidence of ill-conditioned data from modern digital PD measurement and acquisition systems, which presents considerable hurdles during large-dataset training; and the complexity of discriminating fully overlapped multisource PD signatures in practical insulation systems.
Incidentally, the initial studies taken up earlier by the authors in classifying small-dataset PD patterns using the PNN and its adaptive version [33, 34] offer interesting solutions to the difficulties of large-dataset training and classification, in addition to providing a straightforward yet reliable tool, since the PNN stems from a sound theoretical background in statistics and probability. The standard version of the PNN (OPNN) and its adaptive version (APNN) combine a nonparametric density estimator (Parzen window), for obtaining the probability density estimates, with a Bayesian classifier for decision making, wherein the conditional density estimates are utilized for obtaining the class separability among the categories of the decision layer. It is pertinent to note that the only tunable part of the NN is the variance (smoothing) parameter, making the topology of the NN plain yet robust. The motivation for this research is hence to ascertain the capability of the basic PNN versions (without and with clustering algorithms) in classifying multiple sources of PD at varying applied voltages. The effectiveness of these algorithms in tackling large and ill-conditioned datasets acquired from the digital PD measurement and acquisition system, which may lead to overfitting during the training phase, is also studied.

Probabilistic Neural Network and Its Adaptive Version
The PNN [35-37] is a classifier based on multivariate probability density estimation. It is a model which utilizes a competitive learning strategy: a "winner-takes-all" attitude. The original (OPNN) and adaptive (APNN) versions of the PNN do not have feedback paths. The PNN combines the Bayesian technique for decision making with a nonparametric estimator (Parzen window) for obtaining the probability density function (PDF). The PNN, as described in Figure 1, consists of an input layer, two hidden layers (the exemplar and class layers), and an output layer. The merits of the PNN [38] include training several orders of magnitude faster than the multilayer feedforward NN, providing mathematically credible confidence levels during decision making, and an inherent robustness to outliers. One distinct disadvantage is the need for a large memory for fast classification. However, this aspect has been successfully circumvented in recent times by versions implemented with appropriate modifications. Recently, the authors have also successfully utilized a few such variants for multi-source PD pattern classification [39, 40].
Each exemplar node produces the dot product of its weight vector and the input sample, the weights entering the node being those of a particular training sample. The product passes through the nonlinear activation function exp[(x^T w_ki − 1)/σ²]. The second hidden layer contains one summation unit for each class: each summation (class) node receives the outputs of the pattern nodes associated with its class, Σ_{i=1}^{N_k} exp[(x^T w_ki − 1)/σ²]. The output layer has as many neurons as the number of categories (classes) considered in the study. The output nodes are binary neurons that produce the classification decision based on the condition Σ_{i=1}^{N_k} exp[(x^T w_ki − 1)/σ²] > Σ_{i=1}^{N_j} exp[(x^T w_ji − 1)/σ²].
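As a minimal sketch of these layer computations (in Python rather than the MATLAB used in this work, and with hypothetical two-dimensional unit vectors standing in for the PD feature vectors):

```python
import numpy as np

def pnn_classify(x, exemplars, labels, sigma=0.3):
    """Basic PNN forward pass.

    exemplars: (N, d) array of unit-normalized training vectors.
    labels:    length-N array of integer class labels.
    Each pattern node computes exp[(x.w - 1)/sigma^2]; each class
    node sums its pattern nodes; the largest class sum wins."""
    x = x / np.linalg.norm(x)                         # enforce unit length
    acts = np.exp((exemplars @ x - 1.0) / sigma**2)   # pattern (exemplar) layer
    classes = np.unique(labels)
    sums = np.array([acts[labels == c].sum() for c in classes])  # class layer
    return classes[np.argmax(sums)]                   # binary decision layer

# Hypothetical data: two classes of unit vectors near (1,0) and (0,1)
ex = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ex = ex / np.linalg.norm(ex, axis=1, keepdims=True)
lab = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.95, 0.05]), ex, lab))  # near class 0
```

The only tunable quantity is the smoothing parameter `sigma`, consistent with the topology described above.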

Normalization Procedure in Modelling the Pattern Unit. The pattern unit in Figure 1 requires normalization of the input and exemplar vectors to unit length. A variety of norms, such as the Euclidean, Minkowski (city block), and Mahalanobis, may be utilized during the NN implementation, the most popular being the Euclidean and city block norms. The scheme of Figure 2 can be made independent of the requirement of unit normalization by supplying the lengths of both vectors as additional inputs to the pattern unit.
A basic variant of the PNN, the adaptive PNN (APNN) [41, 42], offers a viable mechanism to vary the free (variance) parameter σ, or smoothing parameter, within a particular category (class node). While the OPNN utilizes a common value of σ for all classes, the APNN employs a different value for each class, computed as σ = g · d_ave from the average of the Euclidean distances among the feature vectors of that class, where g is a constant which necessitates adjustment. An additional aspect of this approach is that a simplified formula for the probability density function (PDF) is used, which obviates the necessity for normalization and hence reduces a considerable amount of computation.
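The per-class smoothing parameter σ = g · d_ave can be sketched as follows; the data and the value of g are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def class_sigmas(X, labels, g=0.5):
    """Per-class smoothing parameter for the adaptive PNN (APNN):
    sigma_k = g * d_ave, where d_ave is the mean pairwise Euclidean
    distance among the feature vectors of class k, and g is a
    constant that must be tuned experimentally."""
    sigmas = {}
    for c in np.unique(labels):
        V = X[labels == c]
        diffs = V[:, None, :] - V[None, :, :]
        d = np.sqrt((diffs ** 2).sum(-1))    # full pairwise distance matrix
        n = len(V)
        d_ave = d.sum() / (n * (n - 1))      # mean, excluding the zero diagonal
        sigmas[c] = g * d_ave
    return sigmas

# Hypothetical feature vectors: class 1 is more spread out than class 0
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [7.0, 5.0]])
lab = np.array([0, 0, 1, 1])
print(class_sigmas(X, lab))   # sigma grows with the within-class spread
```

A class with widely scattered exemplars thus receives a wider Parzen window than a tight class, which is the mechanism behind the sharper class-wise decision boundaries discussed later.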

Partitioning and Graph Theoretic Clustering Algorithms: An Overview
Clustering deals with segregating a set of data points into nonoverlapping groups, or clusters, wherein the points in a group are "more similar" to one another than to points in other groups [43]. The term "more similar," when applied to clustered points, usually refers to closeness by some credible quantification of proximity. When a dataset is clustered, each point is allocated to a particular cluster, and every cluster can be characterized by a single reference point, usually an average of the points in the cluster. A wide range of clustering algorithms has been utilized by researchers in diverse engineering applications; these fall under eight major categories [44], based on similarity and sequence-similarity measures, hierarchy, square-error measures, mixture density estimation, combinatorial search, kernels, and graph theory.
While hierarchical clustering groups the data through a sequence of partitions, from solitary clusters to a single cluster including all points, partition clustering divides the data objects into a prefixed number of clusters without the hierarchical composition. Partition-based clustering methods include square-error methods and density-estimate methods such as vector quantization, K-means, and expectation maximization (EM) with maximum likelihood (ML). Any specific segregation of all points in a dataset into clusters is called a "partitioning". Data reduction is accomplished by replacing the coordinates of each point in a cluster with the coordinates of the appropriate reference point. The effectiveness of a particular clustering method depends on how closely the reference points represent the data as well as on how fast the algorithm proceeds. If the data points are tightly clustered around the centroid, the centroid will be representative of all the points in that cluster. The standard measure of the spread of a group of points about its mean is the variance, the sum of the squared distances between each point and the mean. If the data points are close to the mean, the variance will be small. The level of error E indicates the overall spread of the data points about their reference points; to achieve a representative clustering, E should be as small as possible. When clustering is done for the purpose of data reduction, the goal is not to find the best partitioning but rather a reasonable consolidation of "N" data points into "k"

clusters and, if possible, some efficient means to improve the quality of the initial partitioning. In this respect a family of iterative partitioning algorithms, in labelled and unlabelled versions, has been developed. Over the years several clustering algorithms have been proposed, including hierarchical clustering (agglomerative, stepwise optimal), online clustering (leader-follower clustering), and graph-theoretic clustering. Though the graph-theoretic representation of data may also provide avenues for clustering, its limitation from the viewpoint of complex applications stems from the fact that it utilizes binary relations, which may not comprehensively represent the structural properties of temporal data, the nature of the association being binary neighbourhood. In this context it is worth noting that only recently have hypergraph (HG) theory and its relevant properties been exploited for designing computationally compact algorithms for preprocessing data in various engineering applications such as image processing and bioinformatics [45], owing to the inherent strength of the HG in representing data based on both topological and geometrical aspects, while most other algorithms are topology based only. A hypergraph deals with finite combinatorial sets and has the ability to capture both the topological and the geometrical relationships among the data.
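The error measure E discussed above, the overall spread of the data points about their reference points, reduces to a few lines of code; the points and the partition below are hypothetical:

```python
import numpy as np

def partition_error(X, assign, centers):
    """Overall spread E: the sum, over clusters, of squared distances
    between each point and its reference point (cluster centroid).
    A smaller E indicates a more representative partitioning."""
    E = 0.0
    for k, z in enumerate(centers):
        pts = X[assign == k]
        E += ((pts - z) ** 2).sum()
    return E

# Hypothetical partition: two tight pairs of points
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
assign = np.array([0, 0, 1, 1])
centers = np.array([X[assign == 0].mean(0), X[assign == 1].mean(0)])
print(partition_error(X, assign, centers))  # each point is 1 away from its centroid
```

Comparing E before and after each repartitioning step is the usual way of checking that an iterative-partitioning algorithm is actually improving the initial clustering.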
Hence, it is apparent from this discussion that the choice of an appropriate clustering technique plays a vital role in handling the classification of large PD datasets.

Labelled Partition-Based Clustering: Learning Vector Quantization Versions. Kohonen's [46] learning vector quantization (LVQ) is basically a supervised pattern-classification learning scheme wherein each output neuron represents a particular class/category. The weight vector of an output neuron is usually called the reference (codebook) vector of the class that the unit signifies. During training, the output units are placed by adjusting the weight vectors to approximate the decision hypersurface of the Bayesian classifier. During testing of the PNN and its adaptive version using the LVQ clustering technique [47], the LVQ classifies an input vector by assigning it to the same class as the output unit whose weight vector is closest to it.

LVQ1.
This simple algorithm updates the weight vector toward the new input vector x_i if the input and the weight vector belong to the same class, and away from the input if they belong to different classes (the winning unit being the one with minimum distance ‖x_i − w_j‖).
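A sketch of one LVQ1 update step under this rule, with hypothetical codebook vectors and learning rate:

```python
import numpy as np

def lvq1_step(x, x_label, W, w_labels, alpha=0.1):
    """One LVQ1 update: find the codebook vector nearest to x
    (minimum ||x - w_j||), then move it toward x if the labels
    match, or away from x if they differ."""
    j = np.argmin(np.linalg.norm(W - x, axis=1))
    sign = 1.0 if w_labels[j] == x_label else -1.0
    W[j] = W[j] + sign * alpha * (x - W[j])
    return W

# Hypothetical codebook: one vector per class
W = np.array([[0.0, 0.0], [10.0, 10.0]])
w_labels = np.array([0, 1])
# A class-0 sample near codebook 0: codebook 0 is pulled toward it
W = lvq1_step(np.array([1.0, 1.0]), 0, W, w_labels)
print(W[0])   # moved from (0, 0) toward (1, 1)
```

Cycling such steps over the training set (with a decaying `alpha`) yields the parsimonious set of codebook vectors that is later handed to the PNN as training centers.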

LVQ2. The modification in this version is to also update the weights of the runner-up unit, subject to a window constraint on the ratios of the runner-up (d_r) and closest (d_c) distances, that is, d_c/d_r > (1 − ε) and d_r/d_c < (1 + ε) (ε is the window describing the error in the variance), in addition to the restrictions that the closest and runner-up codebook vectors belong to two different classes and that x_i belongs to the class of the runner-up codebook. When neither the closest nor the next closest codebook carries the target output, the updating of d_r and d_c is swapped. When the target is the nearest codebook, the weight update for that exemplar is not carried out.

LVQ3. Additional enhancements on the previous versions enable the learning of the two closest codebook vectors which satisfy the window condition min(d_c1/d_c2, d_c2/d_c1) > (1 − ε)(1 + ε). In such a case the weights are updated as y_c(t + 1) = y_c(t) + β(t)[x(t) − y_c(t)] for both y_c1 and y_c2. The learning rate β(t) is a multiple of the learning rate α(t), and its typical value ranges between 0.1 and 0.5, with smaller values corresponding to a narrower window.
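The window conditions of LVQ2 and LVQ3 can be expressed as simple predicates. The value ε = 0.2 below is hypothetical, and the LVQ3 predicate assumes the standard min-ratio form of Kohonen's window:

```python
def in_lvq2_window(d_c, d_r, eps=0.2):
    """LVQ2 window test on the closest (d_c) and runner-up (d_r)
    distances: the sample must fall near the midplane between the
    two codebook vectors before any update is made."""
    return d_c / d_r > (1 - eps) and d_r / d_c < (1 + eps)

def in_lvq3_window(d_c1, d_c2, eps=0.2):
    """LVQ3 window (assumed min-ratio form):
    min(d_c1/d_c2, d_c2/d_c1) > (1 - eps)(1 + eps)."""
    return min(d_c1 / d_c2, d_c2 / d_c1) > (1 - eps) * (1 + eps)

print(in_lvq2_window(0.9, 1.0))   # near the midplane -> update allowed
print(in_lvq2_window(0.2, 1.0))   # clearly inside one class -> no update
```

Restricting updates to this window is what keeps the codebook vectors from drifting when a sample is unambiguously inside one class region.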

Unlabelled Partition-Based Clustering: K-Means Algorithm Versions. The K-means algorithm [48] locates and obtains the c mean (cluster center) vectors (μ_1, μ_2, μ_3, ..., μ_c). This rudimentary unlabelled clustering algorithm is commonly referred to as Lloyd's (or Forgy's) K-means. To provide a better choice of the initial seed vectors, and consequently better sets of cluster representatives, various variants have been developed, including MacQueen's K-means, standard K-means, continuous K-means, and fuzzy K-means.

Forgy's K-Means.
The algorithm describing this method is illustrated in Figure 3.

Standard K-Means.
The key distinction from Forgy's K-means is its more appropriate use of the data at each step. The basic process of both algorithms is similar in the choice of the reference points, in the allocation of all data points to clusters, and in then using the cluster centroids as reference points in the subsequent partitioning; the difference lies in how the centroids are adjusted both during and after each partitioning. For a data point x in cluster i, if the centroid z_i is the nearest reference point, no adjustment is carried out and the algorithm proceeds to the next sample. On the other hand, if the centroid z_j of cluster j is the reference point closest to x, then x is reassigned to cluster j, the centroids of the "losing" cluster i (minus point x) and the "gaining" cluster j (plus point x) are recomputed, and the reference points z_i and z_j are moved to the fresh centroids.
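A batch (Forgy/Lloyd) sketch of the procedure illustrated in Figure 3, with hypothetical data; the standard variant described above would instead move the losing and gaining centroids immediately on each reassignment:

```python
import numpy as np

def forgy_kmeans(X, k, iters=20, seed=0):
    """Forgy/Lloyd K-means sketch: pick k seed vectors from the data,
    assign every point to its nearest reference point, recompute the
    cluster centroids, and repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # initial seed vectors
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                       # nearest-prototype rule
        centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return centers, assign

# Hypothetical data: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 10.0]])
centers, assign = forgy_kmeans(X, 2)
print(sorted(centers[:, 0]))   # one centroid near x=0, the other near x=9
```

In the PD study the resulting centroids (rather than the full training set) would serve as the unlabelled cluster centers supplied to the PNN.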

Graph Theoretic Clustering Algorithm: Hypergraph.
A HG [49] H is a pair (X, ξ) consisting of a nonempty vertex set X together with a family ξ = (E_i), i ∈ I = {1, 2, ..., n}, n ∈ N, of nonempty subsets (hyperedges) of X such that ∪_{i∈I} E_i = X. Figure 4 shows a generic HG representation.
An important structure that can be studied in a HG is the notion of an intersecting family. An intersecting family of hyperedges of a HG H is a family of edges of H which have pairwise nonempty intersections. There are two types of intersecting families: (1) intersecting families with an empty overall intersection and (2) intersecting families with a nonempty overall intersection. A HG has the Helly property if each family of pairwise intersecting hyperedges has a nonempty common intersection (i.e., they belong to a star). Figure 5 represents the two types of intersecting hyperedges. Several researchers in allied fields of engineering [50, 51] have utilized a variety of properties of the HG, such as the Helly, transversal, mosaic, and conformal properties, for obtaining clustering algorithms for a diverse set of applications. The neighbourhood HG representation utilizes the Helly property, which plays a vital role in identifying homogeneous regions in the data and serves as the main aspect in developing segmentation and clustering algorithms.
In the case of the studies based on HG clustering and classification, the preprocessed data obtained as discussed in Section 6 are represented as V_i = (ϕ_i, q_i, n_i), i = 1, 2, 3, ..., m, where m is the number of vertices of the data per cycle. The data are grouped in terms of feature vectors which act as the best representatives of the entire database. Hence, if pairwise intersecting edges are created from the entire database, the Helly property of the HG can be invoked to find the common intersection, which in turn provides the feature vectors that represent the centers of a particular set of data pertaining to a source of PD. A minimum-distance (Euclidean) metric scheme is developed to obtain the nearest among the various intersections of the intracluster and intercluster datasets, so as to obtain the optimal set of common intersection vectors that serve as the centers representing the dataset. These feature vectors are taken as the training vectors of the PNN.
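The pairwise-intersection test and the common (Helly) intersection used to pick representative centers can be illustrated on a toy hyperedge family; the vertex indices below are hypothetical, standing in for preprocessed feature vectors:

```python
def pairwise_intersecting(edges):
    """True if every pair of hyperedges has a nonempty intersection
    (i.e., the family is an intersecting family)."""
    return all(e1 & e2 for e1 in edges for e2 in edges)

def common_intersection(edges):
    """Vertices shared by ALL hyperedges of the family.  Under the
    Helly property an intersecting family has a nonempty common
    intersection (a star), and its members serve as candidate
    representative centers for the cluster."""
    return set.intersection(*edges)

# Hypothetical hyperedges over feature-vector indices
star = [{1, 2, 3}, {2, 3, 4}, {2, 5}]
print(pairwise_intersecting(star))   # True: every pair overlaps
print(common_intersection(star))     # vertex 2 is shared by all edges
```

In the scheme described above, the vertices surviving the common intersection would then be screened by the minimum Euclidean distance criterion to yield the final training centers.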

PD Laboratory Test Setup.
Comprehensive studies pertaining to single- and multi-source PD pattern recognition have been carried out using a W.S. Test Systems make (model no. DTM-D) digital PD measurement system suitable for measuring PD in the range 2-5000 pC, with a built-in Tektronix oscilloscope (TDS 2002B) provided with a tunable filter insert (model DFT-1) with a selectable center frequency in the range 600 kHz-2400 kHz at a bandwidth of 9 kHz. The PD pulses acquired from the analogue output terminal are displayed on the built-in oscilloscope. The measured partial discharge intensity is displayed in picocoulombs (pC). PDGold software developed by HV Solution UK is interfaced with the PD measurement system to acquire the PD patterns. A window gating facility is provided by the PD acquisition system to suppress background noise.
The test setup and the various stipulations of the test procedure comply with IEC 60270 [52]. Further, in order to improve the transfer characteristics of the test system, a 1 nF coupling capacitor is integrated into the test setup. An electronic reference calibrator (model PDG) ensures appropriate resolution of the pulses during measurement and data acquisition. The straight detection and measurement test setup recommended in the IEC standard is utilized in carrying out the tests. Figures 6, 7, and 8 show the test arrangement for the PD measurement and acquisition system.

Artificially Simulated Laboratory Benchmark Models for PD Pattern Classification. Five categories of laboratory benchmark models have been fabricated to simulate distinct classes of single and multiple PD sources, namely, electrode bounded cavity, surface discharge, air corona, oil corona, and electrode bounded cavity with air corona, which in turn serve as a validation technique to replicate the reference patterns recommended in [53]. Internal discharges are simulated by an electrode bounded cavity of 1 mm diameter and 1.5 mm depth in 12 mm thick polymethyl methacrylate (PMMA) of 80 mm diameter, as shown in Figure 9. One category of external discharge (surface discharge) is simulated with 12 mm thick Perspex of 80 mm diameter, as indicated in Figure 10. A second category of external discharge, air corona discharge, is replicated by an electrode of apex angle 85° attached to the high voltage terminal, as shown in Figure 11. Corona discharge in oil is produced with a similar arrangement immersed in transformer oil, as shown in Figure 12. Electrode bounded cavity with air corona is produced by inserting a needle configuration (2 mm) from the HV terminal in addition to a 2 mm bounded cavity in Perspex at the high voltage electrode, as replicated in Figure 13.

PD Signature and Pattern Acquisition System. PD Gold is a data acquisition software package which provides a system to acquire high-resolution PD signals at a high sampling rate (one sample per 2.5 nanoseconds). The system detects PD on a 50 Hz power-cycle base, enabling the display of PD pulses in sinusoidal or elliptical form in either auto or manual mode, which in turn enables the user to observe the shape of the detected PD pulses and to represent the PRPD patterns in real time. In the manual approach, the user has the facility to record the data for a considerable duration (in this study 5-15 minutes), acquired from a minimum of 240 to a maximum of 750 waveforms per channel.
Incidentally, to carry out PD testing that ensures credible acquisition of data, it is essential to acquire fingerprints of the PD signals under well-defined conditions. Hence, before testing, the test specimens are preconditioned in line with the requirements of the relevant technical committee. Since the methods of cleaning and conditioning the test specimens play a vital role during acquisition of the test data, the preconditioning procedures indicated in [54] are adopted.
It is observed during exhaustive studies that for discharge sources listed in Tables 1 and 2, a time period of 5 minutes is usually sufficient to capture the inherent characteristics of PD.Figures 14 and 15 show typical PD pulses acquired during the testing, measurement, and acquisition process.

Preprocessing and Feature Extraction
For carrying out extensive training and testing of the PNN versions, the raw data are preprocessed in order to ensure compactness without compromising the unique details of the characteristic input feature vector. A wide variety of preprocessing methods is utilized so that the performance of the proposed NNs can be ascertained for each of them. Among these, the inequality condition relating various statistical measures of mean, utilized successfully by a few researchers in the field of target recognition, serves as an effective yet simple technique for reducing the dimensionality of the input feature vector space. Hence, it has also been adopted in this research work to ascertain its effectiveness in providing a compact set of extracted features.
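As an illustration of the inequality condition among statistical means (HM ≤ GM ≤ AM) on which this feature extraction relies; the pulse magnitudes below are hypothetical, and this sketch is not the paper's exact formulation:

```python
import numpy as np

def mean_features(pulse_magnitudes):
    """Reduce a window of PD pulse magnitudes to three statistical
    means.  The inequality HM <= GM <= AM always holds, and the
    spread between the three values is one compact indicator of
    pulse-height dispersion (a hypothetical feature set, used here
    only to illustrate the idea)."""
    x = np.asarray(pulse_magnitudes, dtype=float)
    am = x.mean()                            # arithmetic mean
    gm = np.exp(np.log(x).mean())            # geometric mean
    hm = len(x) / (1.0 / x).sum()            # harmonic mean
    return am, gm, hm

am, gm, hm = mean_features([10.0, 40.0, 160.0])   # hypothetical pC values
print(am, gm, hm)   # ordered AM >= GM >= HM
```

Replacing a long window of raw pulse magnitudes by a handful of such mean-based quantities is one simple way the dimensionality of the input feature vector space can be reduced before training.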
The acquisition of the raw PD dataset was carried out as described in Section 5, preliminarily for a moderate set of multiple-source PD patterns and subsequently for large datasets of single and multiple PD sources. The first studies were conducted on a dataset comprising two training databases of 20 and 25 sets. A total of 56 PD fingerprint samples were collected from 6 samples of the benchmark models described in Section 5, of which 10 patterns are due to internal discharge (electrode bounded cavity), 10 pertain to oil corona, 10 correspond to surface discharge, 6 belong to air corona, and 10 belong to electrode bounded cavity with air corona (multisource PD). The database obtained is indicated in Table 1.
The second analysis pertains to PD signatures for large-dataset patterns acquired from the laboratory testing of 4 models simulating sources of PD. The database comprises ninety patterns of each type of defect, with thirty samples pertaining to each of the applied voltages. It is to be noted that these patterns were acquired online, wherein the statistical variation in the pulse patterns from cycle to cycle of the sinusoidal voltage exhibits the inherent non-Markovian nature, making the classification task more difficult. The task becomes even more demanding due to the different applied voltages, which make the classification of the pulse patterns complex. Rigorous study and analysis of the classification capability of the proposed NNs is carried out for only one applied voltage for each category of PD; however, the limitations and complexities in classifying large datasets due to varying applied voltages are also summarized. Table 2 shows the patterns acquired for the large dataset from the various sources of PD. It is pertinent to note from Table 2 that only eighteen sets (20% of the training dataset) pertaining to each source of PD (referred to as prototype/codebook vectors in the case of labelled clustering, or random cluster centers in the case of unlabelled clustering) were taken up for finding the centers, since it was observed in our study that these representative fingerprints were sufficient for obtaining a considerable number of centers, which led to reasonable classification capability of the PNN versions. This is notwithstanding the fact that the NN literature indicates the usual practice of using at least 50% of the samples for the training phase, two-thirds being the ideal basis for training the NNs. Further studies were taken up by the authors with 40% of the codebook vectors for obtaining the centers, and enhanced classification capability was evident.

Neural Network Verification
The most prevalent verification benchmarks, namely, alphabet character recognition and Fisher's Iris Plant database [56], are used for training and testing of the PNN versions to ascertain their performance. The code for the PNN versions is developed using MATLAB 6.1, Release 12. The ability of the clustering algorithms, and hence of the codebook/reference vectors or centroids appropriate to the type of clustering formed, has also been studied and found to be reasonably precise in classifying the divergent input vectors.

Analysis of the Performance of OPNN
(1) Since the basic version of the PNN is an unsupervised scheme (without feedback for learning), the exemplar nodes are themselves the weight vectors and hence are not updated during the training phase (a training phase is not part of the rudimentary scheme). Hence, it is obvious that, for effective learning, a higher number of exemplar nodes representative of the category of PD source would enhance the classification capability. The classification capability is summarized in Table 3.
(2) The detailed study also makes evident that overfitting is an important concern when training on large non-Markovian PD datasets, and that this algorithm suffers from the drawback of requiring large memory during the training phase.

Analysis of the Performance of APNN
(1) It is evinced from the detailed study that, since the adaptive version provides a mechanism for an independent variance parameter for each unique class label, this version learnt well during the training phase in almost all cases (though this network structure also does not include supervised learning). This feature follows from the modifications made in the structure of the APNN (separate values of the variance parameter pertain to the decision boundaries of each class). Table 3 and Figure 16 substantiate this aspect.
(2) Nevertheless, since the basic variant of the PNN involves neither training nor supervision during learning, considerable numbers of misclassifications are noticed, particularly for fully overlapped multi-source (electrode-bounded cavity with surface discharge) PD signatures. The difficulty of classifying such overlapped signatures is evident from the nature of the hyperboundary separation of the Φ-qmax-n and Φ-qmin-n feature distributions, for which the values of the smoothing parameter indicated in Table 5 are observed for the fully overlapped PD source considered in this study. Results of the comprehensive set of studies are shown in Table 4 and Figure 17.

Case Study 2
(2) It is also of considerable importance to note from Table 5 that the decision hyperboundaries separating the various categories of PD sources are found to be very sharp (small values of the variance parameter). This clearly indicates the complexity of classifying multi-source PD signatures, in addition to plausible inconsistencies during data acquisition for subsequent training and testing by the PNN variants.
(3) Another prominent feature evident from Table 5 is the similarity in the range of values of the variance parameter across the various categories of PD sources. Incidentally, the values of the variance parameter in the case of the APNN are found to be almost similar, signifying the similar nature of both Bayesian-based strategies in creating hypersurface boundaries. The performance of the PNN versions which utilize the variants of the LVQ algorithm is summarized in Figure 17.
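The per-class variance parameter discussed above can be sketched as follows. This is a hedged illustration of the adaptive idea only: the function name, the toy data, and the specific sigma values are assumptions, not the authors' APNN code; the point is that each class label carries its own smoothing parameter and hence its own decision hyperboundary.

```python
import numpy as np

def apnn_classify(x, exemplars, labels, sigmas):
    """Adaptive-PNN sketch: unlike the original PNN's single smoothing
    parameter, each class c uses its own variance parameter sigmas[c],
    which reshapes that class's decision hyperboundary."""
    classes = np.unique(labels)
    scores = []
    for c in classes:
        pts = exemplars[labels == c]
        sq = np.sum((pts - x) ** 2, axis=1)
        scores.append(np.exp(-sq / (2.0 * sigmas[c] ** 2)).mean())
    return classes[int(np.argmax(scores))]

train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([0, 0, 1, 1])
# Sharp boundary (small sigma) for class 0, broader one for class 1
sigmas = {0: 0.05, 1: 0.4}
pred_a = apnn_classify(np.array([0.1, 0.0]), train, y, sigmas)
pred_b = apnn_classify(np.array([1.1, 0.95]), train, y, sigmas)
```

Small sigma values correspond to the sharp hyperboundaries noted in Table 5; a larger sigma yields a broader, smoother class region.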

Case Study 3: Role of the Trainable Part in Unsupervised and Supervised PNN Versions
(1) It is pertinent to note from Table 5 that, for all the LVQ clustering-based PNNs (LVQ1, 2, and 3), the range of the variance parameter describing the void-defect feature is between 0.01 and 0.05. Similarly, the value of σ4, that is, for the void-corona overlapped pattern, is reasonably similar except for one specific case with LVQ3. This corroborates what researchers have already reported on identifying and classifying overlapped void-corona patterns. In addition, from the viewpoint of the decision hyperplane, considerable clarity in the separation of class boundaries is noticed.
(2) However, in the case of void-surface overlapped patterns, the value of the variance parameter diverges considerably among the various versions of LVQ. This is vividly observed for the input feature vector using measures based on the minimum and maximum values of the number of pulses.
Case Study 4: Performance of OPNN and APNN for Large Dataset with Labelled (LVQ Versions) Clustering Algorithms
(1) In this context, it is to be emphasised that these codebook vectors become the weight vectors (centers/centroids), which now represent the samples. Table 6 summarizes the classification capability of the LVQ-PNN variants.
(2) Table 6 also makes evident the superiority of the LVQ2 version as a clustering algorithm for large-dataset training compared to the other types. This characteristic, noticed by the authors in the course of this study, has also been concurred with by researchers in other allied areas of engineering [57].
(3) When the study was extended to doubling the number of reference vectors during training, an improved classification rate (about 90-95%) is noticed for almost all categories and for preprocessing schemes of varying levels of compactness.
(4) A perceptible difference has been observed in the classification capability for patterns from the feature extraction scheme that utilizes the inequality relation based on measures related to the types of mean values (with both 30° and 10° phase-window input features).
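The labelled clustering underlying these case studies can be sketched with the LVQ1 update rule; LVQ2 and LVQ3 refine it with window conditions on the two nearest codebooks and are not shown. The data, learning rate, and epoch count below are illustrative assumptions, not values from the study.

```python
import numpy as np

def lvq1_train(codebooks, cb_labels, data, data_labels,
               alpha=0.1, epochs=20):
    """LVQ1: the winning (nearest) codebook vector is pulled toward a
    training sample of the same class and pushed away otherwise."""
    cb = codebooks.copy()
    for _ in range(epochs):
        for x, y in zip(data, data_labels):
            j = int(np.argmin(np.sum((cb - x) ** 2, axis=1)))  # winner
            sign = 1.0 if cb_labels[j] == y else -1.0
            cb[j] += sign * alpha * (x - cb[j])
    return cb

data = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.0]])
data_labels = np.array([0, 0, 1, 1])
cb = lvq1_train(np.array([[0.3, 0.3], [0.7, 0.7]]),
                np.array([0, 1]), data, data_labels)
```

After training, each codebook vector has migrated toward the samples of its own class, which is how the compact prototype sets used by the LVQ-PNN variants are obtained.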

Case Study 5: Performance of OPNN and APNN for Large Dataset with Traditional Statistical Operators and Inequality Measures of Mean with Unlabelled (K-Means Versions) Algorithms
(1) It is obvious that the classification rate is quite inferior compared to that of the labelled clustering algorithms, notably for the Φ-qmax-n (10°) and Φ-qmin-n (30°) input features. However, this has been achieved at the cost of a higher number of centers, as observed in Table 8.
(5) Tables 8 and 9 clearly enunciate the fact that the number of centers that essentially describe the source of PD depends on the dimensionality of the HG centers. It is evident that the classification capability is enhanced with the number of representative centers, while a slightly inferior classification rate is obtained for a larger dimensionality (tuple), though with a substantially larger number of centers. Though the "curse of dimensionality" is a vital aspect in designing computationally effective clustering algorithms, the nature of the centers obtained provides a much broader value of the smoothing parameter, thus circumventing the aspect previously discussed.
Since the guidelines in [20] also provide substantial direction in the appropriate selection of the order and level of the wavelet, it is found relevant to use a higher-order and lower-level (scale) wavelet representation for pattern recognition tasks. Hence, in this study the Daubechies wavelet with order 7 and level 3 was taken up for obtaining the approximation and detail coefficients, and postprocessing and further studies were carried out on the coefficients obtained. It is obvious from Table 10 that the number of feature extraction bins (during the extraction of wavelet coefficients based on statistically processed measures) plays a vital role in the classification capability of the WT-PNN. It is pertinent to observe that increased dimensionality of the extracted features does not enhance the capability and is, in fact, detrimental to classification. This aspect clearly exemplifies the need for appropriate center selection strategies (such as HG-based clustering).
Further, it is evident from the detailed analysis, and from the case study shown in Table 10, that good classification capability of the wavelet PNN is obtained only with a considerably larger number of tuples of extracted features, compared to the much lower-dimensional features obtained from simple statistical measures via the HG methodology. Thus, much more parsimonious sets of centers are obtained, with more compact feature representatives, using the HG-based center selection and clustering technique, though with slightly inferior classification capability. However, it is worth mentioning in this context that this limitation may be attributed to the exploitation of only one preliminary property of the HG, namely the Helly property, while several other powerful properties of HGs, such as transversal, mosaic, and conformal, have not been taken up in this research. Such properties are expected to provide enhanced results.
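The multilevel wavelet decomposition described above can be sketched as follows. The study used the Daubechies wavelet of order 7 at level 3 (typically obtained from a wavelet toolbox); to keep this sketch dependency-free it substitutes the simplest wavelet, Haar, so the filter coefficients are not those of db7, but the decomposition structure (one approximation band plus one detail band per level) is the same. The synthetic signal stands in for a phase-resolved PD quantity and is not real PD data.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT -> (approximation, detail)."""
    s = signal[: signal.size - signal.size % 2]
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)
    return approx, detail

def wavedec(signal, level=3):
    """Multilevel decomposition -> [cA_level, cD_level, ..., cD1],
    mirroring the approximation/detail coefficient sets used for the
    WT-PNN features (here with Haar instead of db7)."""
    coeffs, approx = [], signal
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        coeffs.insert(0, detail)
    coeffs.insert(0, approx)
    return coeffs

rng = np.random.default_rng(0)
signal = rng.normal(size=512)          # stand-in for an acquired PD record
coeffs = wavedec(signal, level=3)      # [cA3, cD3, cD2, cD1]
# Illustrative postprocessing: simple statistical measures per band
features = [(float(c.mean()), float(c.std())) for c in coeffs]
```

Statistical measures of each coefficient band (as in Table 10's range/standard deviation/mean/skewness/kurtosis processing) then form the WT-PNN input features.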

Conclusions
The roles played by both partition- and graph theory-based clustering algorithms in discriminating multi-source PD patterns, utilizing the two basic variants of the PNN, are summarized as follows.
(1) During the training phase, the labelled versions of LVQ clustering augur well as a learning scheme and are able to handle ill-conditioned datasets and overlapped multiple PD sources considerably well. It is also evident that this method may be appropriate for offline studies, wherein, under controlled testing conditions, appropriate training of prototype vectors pertaining to a particular class would ensure a compact and reasonable codebook vector for further classification by the PNNs.
(2) The unlabelled clustering algorithm offers fresh insight into possible schemes for cluster validation, which may consequently present a likely methodology for recognizing unknown classes of PD sources during real-time studies. Though this scheme may appear more associated with its counterpart (a weak learning strategy), it is essential to note that, since PD source discrimination is fundamental to successful insulation diagnosis, it may be reasonable to classify the sources of PD signatures from the viewpoint of a strong learning strategy. The authors are presently engaged in developing a cluster-validation-based scheme.
(3) It is evident from the studies that the HG-based center selection/clustering algorithm provides an exciting and viable option for obtaining reasonably parsimonious sets of centers that describe the class of PD. Though the properties of the HG algorithm were utilized only to cluster and classify the PD patterns in this research, this scheme also provides an exciting opportunity to correlate the relationship/association of PD pulses in geometric terms; this research aspect is presently ongoing. Since much larger sets of representative centers were observed during this study, further properties of HGs, such as transversal, conformal, and mosaic, can be attempted to further validate the approach.

Figure 2 :
Figure 2: Normalization in a pattern unit: original PNN.

Figure 3 :
Figure 3: Flow chart of K-means clustering algorithm.

Figure 6 :
Figure 6: Typical laboratory test setup for PD pattern recognition studies.

Figure 7 :
Figure 7: Laboratory experimental test setup indicating the direct detection PD measurement methodology.

Figure 10 :
Figure 10: Model simulating surface discharge with electrode bounded cavity.

Figure 11 :
Figure 11: Laboratory model simulating air corona discharge with a point configuration as the high-voltage electrode at an 85° apex angle.

Figure 12 :
Figure 12: Laboratory model simulating oil corona discharge with a point configuration as the high-voltage electrode at an 85° apex angle.

Figure 13 :
Figure 13: Laboratory model replicating electrode bounded cavity discharges overlapped on air corona discharges (multiple source discharges).

Figure 14 :
Figure 14: Typical waveform representation on the oscilloscope depicting air corona discharges on sinusoidal base.

8.1.
Case Study 1: Discrimination Capability of OPNN and APNN without Clustering Algorithm for Moderate PD Datasets. Based on the training and testing of the PNN and its adaptive version with two sets of training data, which include overlapped and single PD source patterns comprising 4 sets (3 single PD sources and 1 void-corona overlapped) and 5 sets (3 single PD sources and 2 overlapped: void-corona and void-surface discharge), extensive observations and analysis are summarized.

Figure 15 :
Figure 15: Typical sample of laboratory model testing of electrode bounded cavity with air corona PD acquired from the PD measurement and acquisition system.
Classification capability of OPNN-and APNN-multiple PD sources

Figure 16 :
Figure 16: Classification capability of OPNN and APNN with five types of feature inputs and 4- and 5-type overlapped patterns, without clustering algorithm (in the histogram, dotted and chequered blocks refer to 4- and 5-type inputs to the PNN; striped and brick blocks refer to 4- and 5-type inputs to the APNN).

Figure 17 :
Figure 17: Classification capability of OPNN and APNN with six types of feature inputs and 4- and 5-type overlapped patterns, with LVQ clustering algorithms (in the histogram, dotted and chequered blocks refer to 4- and 5-type inputs to the PNN; striped and brick blocks refer to 4- and 5-type inputs to the APNN).

Table 1 :
Moderate dataset of PD laboratory models.

Table 2 :
Large dataset PD database of laboratory models with varying applied voltages.

Table 3 :
Classification capability of PNN and APNN for moderate database-without clustering algorithm.

Table 5 :
Comparison on the role of variance parameter in classifying multiple PD sources.

Table 6 :
Comparison of classification capability of OPNN and APNN with LVQ versions' clustering algorithms.

Table 7 :
Comparison of classification capability of OPNN and APNN with versions of K-means clustering algorithms.

Table 9 :
Classification capability of HGPNN for multiple source PD patterns.

Table 10 :
Capability of the wavelet transform-PNN in classifying multiple-source PD signatures, using statistical measures (range, standard deviation, mean, skewness, and kurtosis) for phase windows of 30° and 10°. Table 10 summarizes the analysis carried out utilizing the wavelet transform.