
One of the most important aspects of semisupervised learning is the creation of a training set from a limited amount of labeled data in such a way as to maximize the representational capability and efficacy of the learning framework. In this paper, we scrutinize the effectiveness of different labeled sample selection approaches for training set creation, to be used in semisupervised learning approaches for complex visual pattern recognition problems. We propose and explore a variety of combinatory sampling approaches based on sparse modeling representative selection (SMRS), the OPTICS algorithm, k-means clustering, and random selection. These approaches are explored in the context of four semisupervised learning techniques, i.e., graph-based approaches (harmonic functions and anchor graph), low-density separation, and smoothness-based multiple regressors, and are evaluated in two challenging real-world computer vision applications: image-based concrete defect recognition on tunnel surfaces and video-based activity recognition for industrial workflow monitoring.

The proliferation of data generated in today’s industry and economy raises expectations for solving data-driven problems with state-of-the-art machine learning and data science techniques. One of the obstacles in this direction, especially apparent in complex real-world applications, is the insufficient availability of ground truth, which is necessary for training and fine-tuning supervised machine learning (including deep learning) models. In this context, semisupervised learning (SSL) emerges as an interesting and effective paradigm. Semisupervised learning approaches make use of both labeled and unlabeled data to create a suitable learning model for a specific problem (usually a classification problem) and its related constraints. For most learning problems, the acquisition of labeled data often requires a skilled human agent (e.g., to annotate background in an image or to segment and label video sequences for action recognition) or a physical experiment (e.g., determining the 3D structure of a protein). The cost associated with the labeling process may thus render a fully labeled training set infeasible, whereas the acquisition of unlabeled data is relatively inexpensive. In such situations, SSL can be of great practical value.

One major advantage is the ease of integration with existing techniques; SSL can be directly or indirectly incorporated into any machine-learning task. Semisupervised SVM approaches are a classical example of directly incorporating SSL assumptions into the minimization objective [

SSL has been tested in several real-world fields, provided that sufficient data are available. The work of [

Regarding the limitations and requirements pertaining to the selection of labeled data in SSL, there is a set of desirable properties that the utilized data should have. Firstly, representative samples are needed: the labeled samples should describe (or reproduce) the original data set as faithfully as possible. Secondly, at least one sample per classification category is required, so that the model can adjust to the properties of each class. Finally, the existence of outliers should be considered, given that most data sets contain outliers, which can lead to poor performance, especially when they make up the labeled data all by themselves.

In this paper, we provide deeper insight into the effectiveness of different data sampling approaches for labeled dataset creation to be used in SSL. The explored data sampling approaches are based on sampling techniques including the KenStone algorithm [

The typical data selection approach in several SSL techniques, including the aforementioned ones, is, to our knowledge, random selection of the training set. Usually, a small portion of the data, i.e., less than 40%, is selected (and considered labeled); as the amount of available data increases, the fraction of required labeled instances decreases [

The remainder of this paper is structured as follows: In Section

Given a set of feature values for a data sample, a two-step process is adopted in the analysis conducted in this study. The first step involves data sampling, i.e., the selection of the most descriptive representatives in the available data set. The second step employs popular data mining algorithms; i.e., predictive models are trained over the descriptive subsets of the previous step.

The main purpose of data sampling is the selection of appropriate representative samples to provide a good training set and, thus, improve the classification performance of predictive models. In this section, we present seven (7) data sampling approaches, which are based on the combination or adaptation of four (4) well-known sampling techniques [

The most important factor in data selection is the definition of the distance function. For any two given data points

Most of the proposed approaches are based on the Euclidean distance (i.e.,

Ordering Points to Identify the Clustering Structure (OPTICS) is an algorithm for finding density-based clusters in spatial data [

OPTICS requires two parameters: the maximum distance (radius) to consider (
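To make the role of these two parameters concrete, the following is a minimal, illustrative sketch of density-based clustering with OPTICS using scikit-learn (the paper's experiments used MATLAB; parameter names here follow sklearn's API and the specific values are assumptions for the toy data):

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two dense blobs plus a few sparse noise points.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(10, 2)),
])

# min_samples sets the minimum density of a cluster; max_eps is the
# largest radius examined when computing reachability distances.
clustering = OPTICS(min_samples=5, max_eps=2.0,
                    cluster_method="dbscan", eps=0.5).fit(X)
labels = clustering.labels_   # -1 marks low-density (noise) points
```

The dense blobs are recovered as clusters, while isolated points are flagged as noise, which is precisely the property exploited later when outliers are filtered out of candidate training sets.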

Sparse modeling representative selection (SMRS) focuses on the identification of representative objects through the solution of the following optimization problem [
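For illustration, a simplified proximal-gradient sketch of row-sparse representative selection follows. This is hypothetical code, not the authors' solver, and it omits the affine constraint of the full SMRS formulation for brevity:

```python
import numpy as np

def smrs_sketch(X, lam=0.05, n_iter=500):
    """Simplified SMRS-style selection: minimize
        0.5 * ||X - X C||_F^2 + lam * sum_i ||c_i||_2
    over the coefficient matrix C via proximal gradient descent.
    Columns of X are data points; rows of C with large norms mark
    representatives. The affine constraint of the original method
    is omitted here for brevity."""
    n = X.shape[1]
    C = np.zeros((n, n))
    eta = 1.0 / np.linalg.norm(X.T @ X, 2)      # gradient step size
    for _ in range(n_iter):
        C -= eta * (X.T @ (X @ C - X))          # gradient step
        # proximal step: row-wise soft thresholding (L1/L2 norm)
        norms = np.linalg.norm(C, axis=1, keepdims=True)
        C *= np.maximum(0.0, 1.0 - lam * eta / np.maximum(norms, 1e-12))
    return C

# Four 2-D points (as columns) forming two tight pairs.
X = np.array([[0.0, 0.1, 5.0, 5.1],
              [5.0, 5.1, 0.0, 0.1]])
C = smrs_sketch(X)
ranking = np.argsort(-np.linalg.norm(C, axis=1))   # representatives first
```

The row norms of C induce a ranking of the samples; the top-ranked columns of X act as the selected representatives.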

Using the classic KenStone (Kennard-Stone) algorithm, we can cover the experimental area in a uniform way, since it provides a flat data distribution. The algorithm’s main idea is that, to select the next sample, it opts for the one whose distance to the previously chosen samples (called calibration samples) is the greatest.

Therefore, among all possible points, the algorithm selects the point which is furthest from those already selected and adds it to the set of calibration points. To this end, the distance is calculated between each candidate point
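The greedy selection just described can be sketched as follows (a hypothetical implementation, not the authors' code; the max-distance-to-mean initialization is one common convention):

```python
import numpy as np

def kenstone(X, k):
    """Greedy Kennard-Stone-style selection sketch: repeatedly pick
    the point farthest from the already selected calibration set."""
    # Start from the point farthest from the data mean.
    d_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    selected = [int(np.argmax(d_mean))]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))          # farthest remaining point
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Toy usage: five 2-D points; the three most spread-out points are chosen.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [10., 0.], [10., 1.]])
picked = kenstone(X, 3)
```

Because each new point maximizes the minimum distance to the calibration set, the selected subset spreads uniformly over the occupied feature space.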

The primary goal of sampling approaches is the removal of redundant and uninformative data. Using the algorithms described earlier in Section

All of the proposed approaches are applied over all available data, labeled or not. As such, many of the selected training samples may be unlabeled. In that case, an expert would be called upon to annotate the selected data, as in any annotation effort. However, the annotation effort here is considerably smaller than in traditional supervised approaches, which use a significantly higher percentage of the available data for training purposes.

In this work, four of the most popular types of SSL techniques will be considered: two graph-based approaches, along with low-density separation, and multiple smoothness assumption-related regressors.

Graph-based semisupervised methods define a graph over the entire data set,

The nodes represent the labeled and unlabeled examples in the dataset; edges reflect the similarity among examples. In order to quantify the edges (i.e., assign a similarity value), an adjacency matrix

Practically, each node is only connected to its

Graph methods are nonparametric, discriminative, and transductive in nature. Intuitively, in a graph where data points are connected, the greater the similarity between two points, the greater the probability that they share the same label. Thus, label information propagates from the labeled points to the unlabeled ones. These methods usually assume label smoothness over the graph: if two instances are connected by a strong edge, their labels tend to be the same.
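The kNN similarity graph described above can be sketched as follows. The Gaussian kernel and symmetrization via max(W, Wᵀ) are common conventions assumed for illustration:

```python
import numpy as np

def knn_affinity(X, k=3, sigma=1.0):
    """Sketch of a Gaussian-weighted kNN adjacency matrix W.
    Conventions here (Gaussian kernel, symmetrization via
    max(W, W.T)) are common choices, assumed for illustration."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.exp(-D**2 / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    for i in range(n):
        weak = np.argsort(W[i])[:-k]       # all but the k strongest edges
        W[i, weak] = 0.0
    return np.maximum(W, W.T)              # symmetrize
```

The resulting sparse, symmetric W is the adjacency matrix over which the label propagation methods below operate.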

An indicative paradigm of graph-based SSL is the harmonic function approach [

The problem has an explicit closed-form solution, which provides a soft label estimate for all unlabeled nodes of the graph, i.e., the investigated cases.
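A compact sketch of that closed-form solution, under the standard L = D − W graph Laplacian formulation (hypothetical code, not the authors' MATLAB implementation):

```python
import numpy as np

def harmonic_labels(W, y_l, labeled_idx):
    """Harmonic-function SSL sketch: with graph Laplacian L = D - W,
    the soft labels of the unlabeled nodes solve
        L_uu f_u = W_ul y_l.
    Hypothetical helper; index conventions are assumptions."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)    # soft labels in [0, 1]
    return unlabeled_idx, f_u

# Path graph 0-1-2-3 with unit weights; nodes 0 and 3 labeled 0 and 1.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
idx_u, f_u = harmonic_labels(W, np.array([0., 1.]), np.array([0, 3]))
```

On the path graph, the solution interpolates linearly between the two labeled endpoints, illustrating the smoothness assumption.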

Anchor graph estimates a labeling prediction function

The design of matrix

Nevertheless, the creation of matrix

The Laplacian matrix

Each sample label is, then, given by
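A hedged sketch of the anchor-graph pipeline in the spirit of Liu et al. follows: anchors from k-means, Gaussian data-to-anchor weights Z (a simplification of the original local anchor embedding weights), and a reduced Laplacian regularizer. All names and parameter values are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph_labels(X, Y_l, labeled_idx, m=10, sigma=1.0, gamma=0.01):
    """Anchor-graph SSL sketch: predict soft labels Z @ A, where A
    solves a regularized least-squares problem over the anchors."""
    anchors = KMeans(n_clusters=m, n_init=10,
                     random_state=0).fit(X).cluster_centers_
    D = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=-1)
    Z = np.exp(-D**2 / (2.0 * sigma**2))   # simplified data-to-anchor weights
    Z /= Z.sum(axis=1, keepdims=True)      # rows sum to one
    ZZ = Z.T @ Z
    Lam_inv = np.diag(1.0 / Z.sum(axis=0))
    L_red = ZZ - ZZ @ Lam_inv @ ZZ         # reduced Laplacian
    Z_l = Z[labeled_idx]
    A = np.linalg.solve(Z_l.T @ Z_l + gamma * L_red + 1e-8 * np.eye(m),
                        Z_l.T @ Y_l)       # small jitter for stability
    return Z @ A                           # soft labels for all samples
```

Since all computations involve only the m anchors rather than the full n × n graph, this construction scales to the large unlabeled pools that motivate anchor graphs in the first place.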

The low-density separation assumption pushes the decision boundary towards regions with few data points (labeled or unlabeled). The most common approach to achieving this goal is a maximum margin algorithm such as support vector machines. The method of maximizing the margin for unlabeled as well as labeled points is called the transductive SVM (TSVM). However, the corresponding problem is nonconvex and thus difficult to solve [

Low-density separation (LDS) is a combination of TSVMs [

The problem can be stated in the following form, which allows for a standard gradient-based approach:

Such a formulation allows the use of a nonlinear kernel, calculated over a fully connected matrix,

The final step towards the kernel’s creation involves multidimensional scaling [
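As a hedged illustration of this step, metric MDS can embed a precomputed distance matrix into a Euclidean space on which a standard kernel can then be computed (scikit-learn is used for the sketch; the toy distance matrix is an assumption):

```python
import numpy as np
from sklearn.manifold import MDS

# Toy precomputed distance matrix for three collinear points at
# positions 0, 1, and 4 on a line (so it is exactly Euclidean).
D = np.array([[0., 1., 4.],
              [1., 0., 3.],
              [4., 3., 0.]])
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
emb = mds.fit_transform(D)   # Euclidean coordinates approximating D
```

The embedded coordinates approximately preserve the given dissimilarities, so nearby points in D remain nearby in the embedding.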

The safe semisupervised regression (SAFER) approach [

The problem lies in the solution of the following equation:

We will hereby examine the applicability and effectiveness of each of the above-described data selection techniques for the SSL approaches presented. SSL is particularly useful in cases where there is limited availability of labeled data and/or the creation of appropriately sized labeled data sets requires a prohibitive amount of resources, as is the case in real-world visual classification problems. Two prominent examples of such applications are (a) automated image-based detection and classification of defects on concrete surfaces in the context of visual inspection of tunnels [

MATLAB software has been used for the implementation of the proposed approaches. The code for the SSL approaches, i.e., harmonic functions, anchor graph, LDS, and SAFER, was provided by the corresponding authors of [

The tunnel defect recognition dataset (henceforth referred to in this paper as the

Examples of cracked areas from the Tunnel dataset.

To represent each pixel, we use the same low-level feature extraction techniques as in [

Illustration of the extracted low-level features in the Tunnel dataset: (a) original image, (b) edges, (c) frequency, (d) entropy, (e) texture, and (f) HOG.

We, hereby, briefly describe the features used to form vector

A typical K-fold validation approach is adopted, resulting in eight (approximately) equal partitions, i.e., disjoint subsets, of the
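The partitioning protocol can be sketched as follows (illustrative scikit-learn code; the experiments themselves were run in MATLAB, and the sample count here is a stand-in):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)   # stand-in feature matrix
kf = KFold(n_splits=8, shuffle=True, random_state=0)
# Eight disjoint, (approximately) equal test partitions.
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
```

Each sample appears in exactly one test partition, so the eight subsets are disjoint and their sizes differ by at most one.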

Action or activity recognition from video is a very popular computer vision application. A significant application domain is automatic video surveillance, e.g., for safety, security, and quality assurance reasons. In this experiment, we will make use of real-world video sequences from the surveillance camera of a major automobile manufacturer (NISSAN) [

The production cycle on the industrial line included tasks of picking several parts from racks and placing them on a designated cell some meters away, where welding took place. Each of the above tasks was regarded as a class of behavioral patterns that had to be recognized. The activities (tasks) we were aiming to model in the examined application are briefly the following:

One worker picks part #1 from rack #1 and places it on the welding cell

Two workers pick part #2a from rack #2 and place it on the welding cell

Two workers pick part #2b from rack #3 and place it on the welding cell

One worker picks up parts #3a and #3b from rack #4 and places them on the welding cell

One worker picks up part #4 from rack #1 and places it on the welding cell

Two workers pick up part #5 from rack #5 and place it on the welding cell

Workers were idle or absent (null task)

The WR dataset includes twenty full cycles, each containing occurrences of the above tasks. Figure

Indicative example of key-frames corresponding to the execution of a task (Task 2).

In all video segments, holistic features such as Pixel Change History (PCH) are used. These features remedy the drawbacks of local features, while requiring a far less tedious computational procedure for their extraction [

Each of the seven data sampling approaches described in Section

Illustration of the training set data size per sampling approach (averages over all tests).

Dataset | KenStone | kmeansRandom | kmeansSMRS | OPTICS extrema | OPTICS-SMRS | Random | SMRS | Entire set
---|---|---|---|---|---|---|---|---
WR | 156 | 181.25 | 422.37 | 289.75 | 532.39 | 156 | 23.62 | 5199
Tunnel | 36.37 | 38 | 37.75 | 55 | 141.76 | 36.37 | 14.12 | 1200

The classification results in terms of averaged accuracy and F-measure for each combination are depicted in Figure

Performance scores for all data selection and SSL combinations (Tunnel dataset).

Performance scores for all data selection and SSL combinations (WR dataset).

Confusion matrices for OPTICS-SMRS sampling in (a) Tunnel dataset, using anchor graph and (b) WR dataset, using harmonic functions.

Figure

Figure

In order to derive further conclusions regarding the results and the relative performance of the technique combinations explored, we performed an analysis of variance (ANOVA) on the F1 scores for the test samples. ANOVA permits the statistical evaluation of the effects of the two main design factors of this analysis (i.e., the sampling schemes and the SSL techniques). As shown in Table

ANOVA results.

Source | Sum sq. | d.f. | Mean sq. | F | Prob > F
---|---|---|---|---|---
Sampling | 3.5488 | 6 | 0.5915 | 167.0981 | 0
Classifier | 3.1569 | 5 | 0.6314 | 178.2768 | 0
Number of classes | 0.2687 | 1 | 0.2687 | 75.9157 | 0
Sampling × Classifier | 0.3766 | 30 | 0.0126 | 3.5469 | 0
Sampling × Number of classes | 0.7855 | 6 | 0.1309 | 36.9865 | 0
Classifier × Number of classes | 0.4715 | 5 | 0.0943 | 26.6411 | 0
Error | 2.1769 | 615 | 0.0035 | |
Total | 10.7920 | 668 | | |

Apart from the basic ANOVA results above, we apply the Tukey honest significant difference (HSD) post hoc test to identify the best performing approaches, taking into account the statistical significance of the observed differences in the reported metrics. Figures

F1 scores by classification method.

F1 scores by data selection method.

As far as SSL techniques are concerned, harmonic functions and anchor graph appear to have a statistically significant superiority over all alternatives. This verifies the outcomes of the previous analysis (see Figures

Finally, as regards data selection techniques, we observe that the OPTICS-based approach combined with SMRS creates training sets that clearly lead to the highest performance among all examined techniques, including the traditionally used random sampling. Furthermore, cluster-based samplers in general yield results at least as good as random sampling. On the other hand, SMRS alone provides results significantly worse than all competing schemes.

The creation of a training set of labeled data is of great importance for semisupervised learning methods. In this work, we explored the effectiveness of different data sampling approaches for labeled data generation to be used in SSL models in the context of complex real-world computer vision applications. We compared seven sampling approaches, some of which we proposed in this paper, all based on the OPTICS, k-means, SMRS, and KenStone algorithms. The proposed data selection approaches were used to create labeled data sets in the context of four SSL techniques, i.e., anchor graph, harmonic functions, low-density separation, and semisupervised regression. Extensive experiments were carried out in two different and very challenging real-world visual recognition scenarios: image-based concrete defect recognition on tunnel surfaces and video-based activity recognition for industrial workflow monitoring. The results indicate that SSL data selection schemes using density-based clustering prior to sampling, such as the combination of the OPTICS and SMRS algorithms, provide better performance than traditional sampling approaches, such as random selection. Finally, as regards the SSL techniques studied, graph-based approaches (harmonic functions and anchor graph) appeared to have a statistically significant superiority for the two visual recognition problems examined.

The WR dataset is publicly available as described in [

Part of the work presented in this paper has been included in the doctoral thesis of Dr. Eftychios Protopapadakis titled “Decision Making via Semi-Supervised Machine Learning Techniques.”

The authors declare that there is no conflict of interest regarding the publication of this paper.

The research leading to these results has received funding from the European Commission’s H2020 Research and Innovation Programme under Grant Agreement no. 740610 (STOP-IT project).