Image Retrieval Using Low Level and Local Features Contents: A Comprehensive Review

Billions of multimedia data files are created and shared on the web, mainly on social media websites. The explosive increase in multimedia data, especially images and videos, has made it difficult to search and retrieve relevant data from archive collections. In the last few decades, the complexity of image data has increased exponentially. Text-based image retrieval techniques do not meet users' needs because of the difference between image contents and the text annotations associated with an image. Various methods have been proposed in recent years to tackle the problem of the semantic gap and retrieve images similar to the query specified by the user. Image retrieval based on image contents has attracted many researchers, as it uses the visual content of the image, such as color, texture, and shape features. Low-level image features represent the image contents as feature vectors. The feature vector of the query image is compared with the feature vectors of the dataset images to retrieve similar images. The main aim of this article is to appraise the image retrieval methods based on low-level and local feature extraction, description, and matching that have been presented in the last 10–15 years and to propose promising future research directions for researchers.


Introduction
Humans have been using images for communication since pre-Roman times. Ancestors living in caves used to paint and carve pictures and maps on walls for communication. In the last two decades, exponential advances are visible in digital image processing technologies, network facilities, data repository technologies, smartphones, and cameras.
This has resulted in videos and multimedia data being generated, uploaded on the Internet, and shared through social media websites, leading to an explosion in the amount and complexity of digital data being generated, stored, transmitted, analyzed, and accessed [1]. Access to a desired image from the repository involves searching for images portraying specific types of objects or scenes, identifying a particular mood, or simply searching for an exact pattern or texture. The process of finding the desired image in a large and diverse collection is becoming a vital issue. The challenges in the field of image retrieval are becoming widely recognized, and the search for a solution is turning into a sought-after area of research. In the traditional approach, images are annotated with text, i.e., described using one or more keywords. This approach lacks an automatic and useful description of the image [2]. As an alternative to text-based retrieval, content-based image retrieval (CBIR) has been widely used in recent decades. Image retrieval using the content is considered one of the most successful and efficient ways of accessing visual data. This method is based on image content such as shape, color, and texture instead of the annotated text. The fundamental difficulties in image retrieval are the intention gap and the semantic gap. The difficulty of accurately conveying the expected visual content using the query at hand, such as a sample image or a sketch map, is called the intention gap. The difficulty of depicting high-level semantic concepts using low-level visual features is called the semantic gap [3]. Extensive efforts have been made by researchers to reduce these gaps.
There are three steps in image retrieval using contents: feature extraction, feature description, and image similarity measurement (Figure 1). An image is converted to some form of feature space for ease of comparison. The features must be represented in a descriptive and discriminative form to differentiate related and unrelated images. The extracted features should be unaffected by variations such as differences in illumination, resizing, rotation, and translation.
For image retrieval using contents, the user forms a query to search for in the dataset. The query can be given as an example image that works as a reference. Text can be specified to search the dataset for images containing the object or scene described. The query can be given in the form of a sketch or clipart, which works as a source to search for related images in the dataset, for example, a sketch of a human face, boat, or ball. A query can also be formed by specifying the color layout or concept present in the image. This article focuses on image retrieval systems in which the query is specified with an example image.
The article comprises the following sections: Section 2 discusses the various techniques based on color features. Section 3 describes the various techniques based on texture features. Sections 4 and 5 focus on shape features and local feature extraction techniques, respectively. Section 6 reviews the various feature fusion-based image retrieval techniques. Sections 7, 8, and 9 describe the commonly available datasets, similarity measures, and performance measurement criteria used to evaluate the retrieval techniques, respectively. Section 10 presents the conclusion and future directions in image retrieval methods based on image contents.

Color Features Used in Image Retrieval
Color features are stable and robust compared to other features. Most color features are invariant to scale, translation, and rotation changes. Various methods based on color features have been tested and suggested in the literature, such as "color averaging," "color histogram," "color coherence," and "block truncation coding (BTC)" and its variants such as "Thepade's Sorted Block Truncation Coding (TSBTC)." The techniques based on color histograms offer high effectiveness, simplicity, and low storage requirements.
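Color moments are low-level image features that can be used to measure the similarity between two images. The central color moments are the mean, standard deviation, and skewness. These color moments describe the distribution of colors in an image. Each image has red (R), green (G), and blue (B) color planes. Therefore, there are nine color moment features. The color moments can be defined as in the following equations:

$$\mathrm{Mean}_i = \frac{1}{mn}\sum_{j=1}^{m}\sum_{k=1}^{n} I(j, k, i)$$

$$\mathrm{SD}_i = \sqrt{\frac{1}{mn}\sum_{j=1}^{m}\sum_{k=1}^{n}\bigl(I(j, k, i) - \mathrm{Mean}_i\bigr)^{2}}$$

$$\mathrm{Skewness}_i = \sqrt[3]{\frac{1}{mn}\sum_{j=1}^{m}\sum_{k=1}^{n}\bigl(I(j, k, i) - \mathrm{Mean}_i\bigr)^{3}}$$

where i indicates the R, G, and B planes, and I(j, k, i) indicates the pixel intensities of the corresponding plane of size m × n.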
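As an illustration of the nine color moment features, the following minimal sketch computes the mean, standard deviation, and skewness of each plane for an image stored as nested lists of (R, G, B) tuples. The function name `color_moments`, the toy image, and the use of the signed cube root of the third central moment for skewness are assumptions made for this example, not details taken from the reviewed methods.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B) intensities


def color_moments(image: Sequence[Sequence[Pixel]]) -> List[float]:
    """Return [mean, std, skewness] for the R, G, and B planes (9 values)."""
    pixels = [px for row in image for px in row]
    count = len(pixels)  # m * n
    features: List[float] = []
    for plane in range(3):
        values = [px[plane] for px in pixels]
        mean = sum(values) / count
        second = sum((v - mean) ** 2 for v in values) / count
        third = sum((v - mean) ** 3 for v in values) / count
        std = second ** 0.5
        # Signed cube root, so a negative third moment gives negative skewness.
        skew = (abs(third) ** (1.0 / 3.0)) * (1.0 if third >= 0 else -1.0)
        features.extend([mean, std, skew])
    return features


# Hypothetical 2 x 2 image used only to exercise the function.
toy_image = [[(0, 10, 255), (0, 20, 255)],
             [(0, 30, 255), (8, 40, 255)]]
moments = color_moments(toy_image)
```

Two images can then be compared by any vector distance between their nine-element moment vectors.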
The color histogram of the image is formed for the R, G, and B planes. The histogram gives the probability distribution of the color intensities present in an image. The global histogram is calculated by considering the complete image as a whole. A local color histogram is calculated by dividing the image into parts and computing the histogram of each part. Color histograms are easy to compute and are less sensitive to small changes in viewpoint. However, they cannot provide spatial information and are sensitive to changes in illumination. There are various extensions of color histogram techniques such as the fuzzy histogram [4] and the MPEG-7 dominant color descriptor [5]. In [5], the RGB color image is converted to HSV color space and quantized to 72 levels to reduce the feature vector space (H: 8 levels, S: 3 levels, and V: 3 levels). The histogram of the quantized image is used as the feature vector. Histogram intersection is used to find the images similar to the query image. The color coherence vector technique [6] classifies each pixel as coherent if it is part of a large group of pixels having the same color; otherwise, it is considered an incoherent pixel. The pixel group regions are connected components formed by checking the colors of neighboring pixels. If the pixel count of a connected component is greater than a threshold, it is considered a coherent region. The query image is compared with dataset images using the number of coherent and incoherent pixels of each color. The color correlogram technique represents the local spatial correlation of colors together with the global distribution of this correlation [7]. A color correlogram represents the image as a table indexed by color pairs (p, q), where the r-th entry in cell (p, q) indicates the probability of finding a pixel of color q at a distance r from a pixel of color p. The color autocorrelogram technique represents the spatial correlation between identical colors only.
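The 72-level HSV histogram and histogram intersection described above can be sketched as follows. This is a minimal illustration that assumes uniform quantization of each channel (8 × 3 × 3 bins) and a normalized histogram; the actual bin boundaries used in [5] are not given here, and the function names are hypothetical.

```python
import colorsys
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B) in 0..255


def quantize_hsv(pixel: Pixel) -> int:
    """Map an RGB pixel to one of 72 bins (8 hue x 3 saturation x 3 value)."""
    r, g, b = (channel / 255.0 for channel in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # each component lies in [0, 1]
    h_bin = min(int(h * 8), 7)
    s_bin = min(int(s * 3), 2)
    v_bin = min(int(v * 3), 2)
    return h_bin * 9 + s_bin * 3 + v_bin


def hsv_histogram(image: Sequence[Sequence[Pixel]]) -> List[float]:
    """Normalized 72-bin histogram of the quantized image."""
    pixels = [px for row in image for px in row]
    hist = [0.0] * 72
    for px in pixels:
        hist[quantize_hsv(px)] += 1.0
    return [count / len(pixels) for count in hist]


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity in [0, 1] for normalized histograms; 1 means identical."""
    return sum(min(x, y) for x, y in zip(a, b))


red = [[(255, 0, 0), (255, 0, 0)], [(255, 0, 0), (255, 0, 0)]]
blue = [[(0, 0, 255), (0, 0, 255)], [(0, 0, 255), (0, 0, 255)]]
mixed = [[(255, 0, 0), (0, 0, 255)]]
similarity = histogram_intersection(hsv_histogram(mixed), hsv_histogram(red))
```

Dataset images would be ranked by decreasing intersection with the query histogram.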
In image retrieval using block truncation coding (BTC) [8,9], the mean value is calculated for each plane of an image. The upper average is derived from the pixels having a value above the mean. The lower average is derived from the pixels having a value less than the mean. Pixels having a value less than the mean are assigned the lower average, and pixels having a value above the mean are assigned the upper average. This process can be applied iteratively by dividing the image into two blocks: the first block containing pixels with a value less than the mean and the second block containing pixels with a value greater than the mean. The upper and lower averages of the R, G, and B planes represent the features of an image. The same process is applied to the query image, and matching images are retrieved from the dataset.
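A single level of the BTC feature extraction described above can be sketched as below. The sketch assumes that pixels equal to the mean are grouped with the upper block and that a constant plane uses the mean as its lower average; both choices, and the function names, are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B) intensities


def btc_plane_features(values: Sequence[float]) -> Tuple[float, float]:
    """Return (upper average, lower average) of one color plane."""
    mean = sum(values) / len(values)
    upper = [v for v in values if v >= mean]  # assumption: ties go to the upper block
    lower = [v for v in values if v < mean]
    upper_avg = sum(upper) / len(upper)
    lower_avg = sum(lower) / len(lower) if lower else mean  # constant plane
    return upper_avg, lower_avg


def btc_features(image: Sequence[Sequence[Pixel]]) -> List[float]:
    """Six features: upper and lower averages of the R, G, and B planes."""
    features: List[float] = []
    for plane in range(3):
        values = [px[plane] for row in image for px in row]
        features.extend(btc_plane_features(values))
    return features


toy_image = [[(10, 0, 5), (20, 0, 5)],
             [(30, 0, 5), (40, 0, 5)]]
btc_vector = btc_features(toy_image)
```

Applying `btc_plane_features` again to the upper and lower blocks would give the next level of the iterative variant.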
In image retrieval using TSBTC [15][16][17], the pixel values of an image are sorted in ascending order, and the median value is calculated. Then, the lower mean is calculated using the pixel values below the median, and the upper mean is computed using the pixel values above the median. This process is repeated iteratively on each block. Blocks are created by dividing the image into two parts, with the pixels below the median as one block and the pixels above the median as the other. The upper and lower means of each color plane represent the features of the image, which are used for image matching.
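The sorting-based split can be illustrated with the sketch below, which sorts the values of one plane, cuts the sorted sequence into equal parts at the median (or at further quantiles for deeper levels), and returns the mean of each part. The `parts` parameter and the way ties and odd lengths are handled are assumptions of this example rather than details from [15][16][17].

```python
from typing import List, Sequence


def tsbtc_plane_features(values: Sequence[float], parts: int = 2) -> List[float]:
    """Means of `parts` nearly equal chunks of the sorted plane values.

    parts=2 gives the lower and upper means around the median;
    parts=4 corresponds to one further level of splitting.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < parts:
        raise ValueError("need at least one value per part")
    features: List[float] = []
    for k in range(parts):
        chunk = ordered[k * n // parts:(k + 1) * n // parts]
        features.append(sum(chunk) / len(chunk))
    return features


two_way = tsbtc_plane_features([40, 10, 30, 20])
four_way = tsbtc_plane_features([40, 10, 30, 20], parts=4)
```

Because the split is made on rank rather than on the mean, a single extreme pixel value moves only one of the block means.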
In [18], the color planes of an image are binarized using the Niblack threshold selection method. These thresholds are used for calculating the upper and lower means of all the planes. These means and the standard deviation of each plane are stored as the image feature vector. The query image is compared using the city block distance to identify its class. Relevant images are then retrieved only from that class. This method works well only if the query is assigned to the correct class.
In [19], the RGB image is transformed to a nonuniformly quantized HSV color space, and 72 color features are extracted from it through quantization. The color histogram is calculated from these features to find the dominant color features. The similarity between the query image and a dataset image is measured using the dominance granule structure similarity method. Table 1 shows a summary of various color-based feature techniques. Color-based techniques are sensitive to illumination but invariant to rotation and translation. The computational cost of color moment-based techniques is low, but accuracy is very low. The computational cost of histogram-based techniques is high, but accuracy is also high. The computational cost of BTC-based techniques is low, and accuracy is better.

Texture Features Used in Image Retrieval
Texture is another significant feature in retrieval techniques based on image contents. Image texture represents the variation in local illumination in a small region. It represents the spatial layout of the gray intensities of pixels in a region. If the change of brightness is high in a small region, the image is called a coarse-textured image; otherwise, it is called a fine-textured image. Texture-based algorithms can be classified into two categories: statistical methods and structural methods. Structural methods identify the basic structures and their locations in the image. These methods are useful for images containing very regular textures and work for images of human-made objects that have regular patterns. Statistical methods are simple and widely used methods that employ quantitative measurements of intensity arrangements in a region. Examples of such methods are the gray level histogram, edge histogram, "local binary pattern (LBP)" [20], "local ternary patterns (LTP)" [21], "Local Tetra Patterns (LTrP)" [22], "gray level co-occurrence matrix (GLCM)" [23], "wavelet coefficients" [24], "ridgelets and curvelets" [25], "Tamura features" [26], and "Gabor wavelet filter" [27].
LBP is a thresholding-based technique in which the center pixel is compared with its neighborhood pixels within radius r. If the intensity value of a neighborhood pixel is greater than or equal to that of the center pixel, the code bit 0 is assigned to it; otherwise, code bit 1 is assigned [20]. The binary codeword for the center pixel is generated by concatenating these neighborhood code bits. This codeword is then converted into a decimal number. The histogram of the decimal codewords is calculated for the image and can be used as the feature vector. The total number of bins in the histogram is 2^N, where N is the number of neighborhood pixels considered to generate the codeword. Thus, the local binary pattern descriptor for the pixel (x_c, y_c) can be defined as in the following equations:

$$\mathrm{LBP}(x_c, y_c) = \sum_{n=0}^{N-1} 2^{n}\, f(i_n, i_c)$$

$$f(i_n, i_c) = \begin{cases} 0, & i_c \le i_n \\ 1, & \text{otherwise} \end{cases}$$

where i_c is the value of the center pixel, and i_n is the value of the n-th neighborhood pixel.
LTP [21] is an extension of LBP that is resistant to monotonic gray level transformations, in which a threshold k is used to generate the code for the center pixel. If the value of the neighborhood pixel is greater than or equal to the sum of the center pixel value and the threshold, the code is +1. If the neighborhood pixel intensity is less than or equal to the center pixel intensity minus the threshold, the code is −1. Otherwise, the code is 0. Thus, the local ternary pattern code for the pixel (x_c, y_c) can be defined as in the following equation:

$$\mathrm{LTP}(p_n, p_c, k) = \begin{cases} +1, & i_n \ge i_c + k \\ 0, & |i_n - i_c| < k \\ -1, & i_n \le i_c - k \end{cases}$$

where p_n is the neighborhood pixel; p_c is the center pixel; i_c and i_n are the intensities of the center pixel and neighborhood pixel, respectively. The ternary codeword is then converted into two LBP codewords, and uniform descriptors are designed by concatenating the histograms of these LBPs. Many extensions of LBP that use local information in various directions have been proposed in the literature, such as "Local Tetra Pattern (LTrP)" [22], "Local Binary Extrema Pattern (LBEP)" [28], "Local Derivative Pattern" [29], and multiscale LBP [30].
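The LBP and LTP codes can be sketched for a radius-1 neighborhood of eight pixels as follows. The code follows the bit convention stated above (bit 0 when the neighbor is at least as bright as the center); many implementations use the complementary convention, which only relabels the histogram bins. For LTP, the sketch assumes the usual three-way thresholding (+1 when the neighbor is at least center + k, −1 when it is at most center − k, 0 otherwise) and returns the two binary patterns built from the +1 and −1 positions. The neighbor ordering and function names are assumptions.

```python
from typing import List, Sequence, Tuple

# Radius-1 neighborhood, clockwise from the top-left pixel (ordering is an assumption).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(gray: Sequence[Sequence[int]], row: int, col: int) -> int:
    """Decimal LBP codeword of an interior pixel."""
    center = gray[row][col]
    code = 0
    for position, (dr, dc) in enumerate(OFFSETS):
        bit = 0 if center <= gray[row + dr][col + dc] else 1
        code |= bit << position
    return code


def lbp_histogram(gray: Sequence[Sequence[int]]) -> List[int]:
    """Histogram with 2^8 bins over all interior pixels."""
    hist = [0] * (2 ** len(OFFSETS))
    for row in range(1, len(gray) - 1):
        for col in range(1, len(gray[0]) - 1):
            hist[lbp_code(gray, row, col)] += 1
    return hist


def ltp_codes(gray: Sequence[Sequence[int]], row: int, col: int, k: int) -> Tuple[int, int]:
    """Return (upper, lower) binary patterns for the +1 and -1 ternary codes."""
    center = gray[row][col]
    upper = lower = 0
    for position, (dr, dc) in enumerate(OFFSETS):
        neighbor = gray[row + dr][col + dc]
        if neighbor >= center + k:
            upper |= 1 << position
        elif neighbor <= center - k:
            lower |= 1 << position
    return upper, lower


patch = [[9, 9, 9], [0, 5, 0], [0, 0, 0]]
patch_code = lbp_code(patch, 1, 1)
```

An LTP descriptor would concatenate the histograms of the upper and lower patterns, giving twice as many bins as LBP.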
Tamura et al. [26] have defined six texture features based on human perception, which can be used to describe the image. These features are coarseness, contrast, directionality, line-likeness, regularity, and roughness. The gray level variations and the bias in the distribution of gray levels are measured using the contrast feature. The contrast, roughness, and regularity are defined as follows:

$$\mathrm{Contrast} = \frac{\sigma}{(\alpha_4)^{1/4}}, \qquad \alpha_4 = \frac{\mu_4}{\sigma^{4}}$$

where μ_4 is the fourth moment about the mean, α_4 is the kurtosis, and σ^2 is the variance.
In an image, coarseness is the measure of granularity; directionality gives the direction and quality of the edges. It is calculated by convolving the image with Prewitt's horizontal and vertical edge detectors. Line-likeness describes the shape of the texture elements. Roughness is defined as

$$\mathrm{Roughness} = \mathrm{contrast} + \mathrm{coarseness}$$

Regularity represents the repetitiveness of patterns. It is defined as

$$\mathrm{Regularity} = 1 - r\,(\sigma_{\mathrm{coarseness}} + \sigma_{\mathrm{contrast}} + \sigma_{\mathrm{directionality}} + \sigma_{\mathrm{line\text{-}likeness}})$$

where r is the normalizing factor, and σ_x is the standard deviation (σ_x^2 the variance) of the respective feature x.
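As a worked example of the contrast feature, the sketch below assumes the commonly cited Tamura definition, contrast = σ / (α_4)^(1/4) with α_4 = μ_4 / σ^4, applied to a grayscale image given as nested lists. Returning 0 for a constant image is a convention chosen for this example.

```python
from typing import Sequence


def tamura_contrast(gray: Sequence[Sequence[float]]) -> float:
    """Tamura contrast: standard deviation divided by the fourth root of kurtosis."""
    values = [v for row in gray for v in row]
    count = len(values)
    mean = sum(values) / count
    variance = sum((v - mean) ** 2 for v in values) / count
    if variance == 0:
        return 0.0  # constant image: no contrast (convention for this sketch)
    mu4 = sum((v - mean) ** 4 for v in values) / count  # fourth central moment
    alpha4 = mu4 / variance ** 2                        # kurtosis
    return variance ** 0.5 / alpha4 ** 0.25


low = tamura_contrast([[0, 10], [0, 10]])
high = tamura_contrast([[0, 100], [0, 100]])
```

For a two-level image the kurtosis is 1, so the contrast reduces to the standard deviation, which is half the gap between the two gray levels.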
In [31], discrete cosine transform (DCT) coefficients are used to represent image texture features, as the DCT has excellent energy compaction capability. The image is divided into subblocks for space localization. DCT is applied to each subblock. The feature vector is calculated using the DC coefficients and some AC coefficients containing the direction-related information. Nine features are extracted from each subblock of size 64. In [32], the Gaussian pyramid is applied to extract multiresolution images of the R, G, and B planes. DCT is applied on all multiresolution images. The feature vector is generated by concatenating all the DC coefficients and statistical parameters of significant AC coefficients selected across all multiresolution planes. In [33], the color image is converted into a grayscale image. The image is divided into nonoverlapping blocks, and DCT is applied on each block. The histogram is formed using the DC coefficients and the first three AC coefficients. Six statistical features are calculated using quantization bins for all the blocks.
In [24], the Haar wavelet is used to extract a fixed number of salient points from the image. Gabor texture features and color moments are extracted using the neighborhood pixels of the salient points. As the features are extracted only for a fixed number of salient points, the computational complexity is lower than when the whole image is considered.
In [34], the discrete wavelet transform is applied on the image up to three scales, and LTrP is used to describe the features of each subband. An artificial neural network is used for image matching and retrieval.
In [25], curvelet transforms are used to find the low-order statistical features of an image. The image and the curvelet are transformed to the Fourier domain. The image is then convolved with the curvelet. The curvelet coefficients are calculated by applying the inverse Fourier transform. The standard deviation and the mean of the curvelet coefficients represent the image features. Thus, the image is represented by a feature vector of size 2n, where n indicates the number of curvelets used.
In [35], the ranklet transform is used to generate three images of different orientations, vertical, horizontal, and diagonal, as a preprocessing step for each plane. Ranklet transform is applied on each R, G, and B plane, resulting in the generation of nine images. Each image's standard deviation, mean, and histogram color moments are determined.
Thus, a feature vector of size 27 is generated by concatenating all the moments of an image. The K-means clustering algorithm is used to cluster images into categories, and the centroid of each category is computed. The query image feature vector is compared with the centroid of each category to find the category at the smallest distance. All the images that belong to that category are then compared with the query image for image retrieval. Table 2 shows a summary of image retrieval techniques based on texture features. Structure-based texture techniques are not suitable for generic image retrieval, as the images do not have regular patterns or structures. Statistical texture-based techniques are widely used in generic image retrieval, as these techniques are illumination invariant, but their feature vectors are larger than those of other techniques.

Shape Features Used in Image Retrieval
Besides texture and color features, shape features are also used for searching for similar images, as humans perceive objects based on their shape [36][37][38]. Detailed reviews of the various shape-based feature extraction and description techniques are presented in [39][40][41]. Figure 2 shows the various shape-based feature extraction and description techniques. Shape-based feature extraction and description techniques can be broadly classified into region-based and contour-based techniques. Contour- or boundary-based techniques describe the boundary of the objects, whereas region-based techniques use all the pixel values of the object. Contour-based methods are categorized as complete object shape-based if the boundary is represented as a whole shape, or as primitive/structure-based if the boundary is segmented into parts that are described individually. Region-based methods are classified as spatial domain-based and transform domain-based. Spatial domain-based techniques are again divided into two types, complete object-based and primitive-based, depending upon the part of the object described.
Shape-based features are not widely used for image retrieval because they require segmented objects, which are challenging to obtain in heterogeneous dataset images. Generally, shape-based features are combined with other low-level image features and local features to represent the image for generic applications of image retrieval. Shape-based features are generally used for object retrieval [42,43].

Local Feature Extraction Techniques
Image retrieval techniques can be categorized into local and global techniques. Global image retrieval techniques consider the whole image for feature extraction and description. Global feature extraction techniques are suitable for retrieving duplicate images and can be used for detecting natural scenes. Local feature extraction techniques are useful for detecting human-made objects. In an image, local techniques identify salient regions called interest points or keypoints and describe the neighborhood patch of these keypoints to represent the image.
The keypoints that are detected must be highly repeatable so that they can be detected under various transformations such as rotation and illumination changes. Local features should have properties such as distinctiveness (high variations), locality (should be small enough to avoid occlusion under different viewing angles), quantity (a sufficiently large number of features should be detected for matching purposes), accuracy (features should be accurately identified at various scales), efficiency (fast to detect), invariance to large deformations, and robustness to small deformations [44]. The commonly used local feature detection techniques are the Harris corner detector [45], Harris-Laplace detector [46], Hessian-Affine detector [47], SURF [48], Shi and Tomasi corner detector [49], difference of Gaussian [50], FAST [51], SUSAN [52], and MSER [53]. The Harris detector is a combined edge and corner detector based on the autocorrelation function, i.e., a second-moment matrix for local texture description [45]. It finds the intensity difference for a shift of (u, v) in all directions using the following equation:

$$E(u, v) = \sum_{p, q} w(p, q)\,\bigl[I(p + u, q + v) - I(p, q)\bigr]^{2}$$

where w(p, q) is the window function. For small shifts (u, v), E(u, v) can be expressed as

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$$

and M is given as

$$M = \sum_{p, q} w(p, q) \begin{bmatrix} I_p^{2} & I_p I_q \\ I_p I_q & I_q^{2} \end{bmatrix}$$

where I_p and I_q are the gradients in the p and q directions, respectively. If α and β are the eigenvalues of matrix M, then the trace of M is α + β, and the determinant of M is αβ. The corner response is

$$CR = \det(M) - k\,(\operatorname{trace}(M))^{2} = \alpha\beta - k\,(\alpha + \beta)^{2}$$

where k is an empirical constant. The region is a corner region if CR has a positive value.
The Shi and Tomasi corner detector works faster than the Harris corner detector. It works in a similar manner to the Harris corner detector with a slight change in the condition used to detect a corner: CR is taken as the smaller eigenvalue, min(α, β), and if the CR value is greater than a threshold, the region is detected as a corner [49].
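The corner test can be illustrated with a small sketch that builds the second-moment matrix M of a grayscale patch and evaluates the commonly used Harris response CR = det(M) − k·trace(M)^2, together with the smaller eigenvalue used by Shi and Tomasi. The uniform window, central-difference gradients, k = 0.04, and the toy patches are assumptions of this example.

```python
from typing import Sequence, Tuple

Patch = Sequence[Sequence[float]]


def second_moment_matrix(patch: Patch) -> Tuple[float, float, float]:
    """Return (sum Ip^2, sum Ip*Iq, sum Iq^2) over the patch interior (uniform window)."""
    m_pp = m_pq = m_qq = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            grad_p = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # horizontal gradient
            grad_q = (patch[r + 1][c] - patch[r - 1][c]) / 2.0  # vertical gradient
            m_pp += grad_p * grad_p
            m_pq += grad_p * grad_q
            m_qq += grad_q * grad_q
    return m_pp, m_pq, m_qq


def harris_response(patch: Patch, k: float = 0.04) -> float:
    """CR = det(M) - k * trace(M)^2; positive values indicate a corner."""
    m_pp, m_pq, m_qq = second_moment_matrix(patch)
    det = m_pp * m_qq - m_pq * m_pq
    trace = m_pp + m_qq
    return det - k * trace * trace


def shi_tomasi_response(patch: Patch) -> float:
    """Smaller eigenvalue of M, to be compared against a threshold."""
    m_pp, m_pq, m_qq = second_moment_matrix(patch)
    half_trace = (m_pp + m_qq) / 2.0
    return half_trace - (((m_pp - m_qq) / 2.0) ** 2 + m_pq * m_pq) ** 0.5


flat = [[5] * 5 for _ in range(5)]
edge = [[0, 0, 10, 10, 10] for _ in range(5)]
corner = [[0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 10, 10, 10],
          [0, 0, 10, 10, 10],
          [0, 0, 10, 10, 10]]
```

On these toy patches the response is zero for the flat patch, negative for the straight edge, and positive for the corner, which matches the sign rule described above.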
The Harris and Shi-Tomasi corner detectors are not very useful for searching similar images of various sizes and scales, as they are unstable under scale changes. In [52], SUSAN, a low-level feature detector based on a circular mask, is presented. SUSAN is fast and accurate in detecting lines, edges, and corners, with noise reduction. In [54], the Laplacian of Gaussian (LoG) is approximated with the difference of Gaussian. The image is smoothed by convolving it with a Gaussian filter of width σ1. The original image is also smoothed by convolving it with a Gaussian filter of width σ2. The difference between these two Gaussian-filtered images is taken to find the local features of the image. In [48], Fast Hessian is presented. It uses the Hessian matrix and approximates LoG with a box filter using integral images. It can be applied at multiple scales simultaneously. In [46], Mikolajczyk and Schmid have presented the Harris-Laplace and Hessian-Laplace methods of detecting local interest points at various scales using the Laplacian of Gaussian with the Harris corner detector. The points having maximal Laplacian over scales are selected.
The selected interest points are invariant to rotation, scale, and translation.
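The difference-of-Gaussian step described above can be sketched in plain Python as two separable Gaussian blurs followed by a subtraction. The kernel truncation at about three standard deviations and the replicated-border handling are assumptions chosen to keep the example short.

```python
import math
from typing import List, Sequence

Image = List[List[float]]


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1D Gaussian truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image: Sequence[Sequence[float]], sigma: float) -> Image:
    """Separable Gaussian smoothing with replicated borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])

    def clamp(index: int, size: int) -> int:
        return max(0, min(size - 1, index))

    # Horizontal pass, then vertical pass.
    rows = [[sum(kernel[k + radius] * image[r][clamp(c + k, width)]
                 for k in range(-radius, radius + 1)) for c in range(width)]
            for r in range(height)]
    return [[sum(kernel[k + radius] * rows[clamp(r + k, height)][c]
                 for k in range(-radius, radius + 1)) for c in range(width)]
            for r in range(height)]


def difference_of_gaussian(image: Sequence[Sequence[float]], sigma1: float, sigma2: float) -> Image:
    """Image smoothed with sigma1 minus the image smoothed with sigma2."""
    first, second = blur(image, sigma1), blur(image, sigma2)
    return [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(first, second)]


blob = [[0.0] * 9 for _ in range(9)]
blob[4][4] = 100.0  # a single bright spot
dog = difference_of_gaussian(blob, 1.0, 2.0)
```

With σ1 < σ2, the response is positive at the center of the bright spot and negative in the surrounding ring, which is the blob-like behavior that LoG-style detectors look for.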
In [51], FAST, a high-speed machine learning-based corner detection technique, is presented. It uses the segment test criterion, which considers the sixteen pixels on a circle around a corner candidate keypoint p. It classifies p as a corner if n contiguous pixels on the circle are all brighter than Ip + t or all darker than Ip − t, where Ip is the intensity of the candidate keypoint p and t is the threshold. This technique is very fast compared to other keypoint detection techniques, but it is not robust to noise and orientation and depends on the threshold value. A local feature (interest point) can be described with floating-point or binary descriptors.
The detected interest point should be described in a highly distinctive way so that it can be identified and matched if it exists in another image. Local feature detection and description techniques need to be improved for faster retrieval.
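The segment test can be sketched as below for a grayscale image stored as nested lists. The sketch checks for a contiguous arc of n circle pixels (with wrap-around) that are all brighter than Ip + t or all darker than Ip − t; the circle offsets of radius 3, the default n = 12, and the function name are assumptions, and the machine-learned decision tree that makes FAST quick in practice is not reproduced.

```python
from typing import Sequence

# Sixteen (row, col) offsets on a circle of radius 3, listed clockwise from the top.
CIRCLE = [(-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
          (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1)]


def is_fast_corner(image: Sequence[Sequence[int]], row: int, col: int,
                   t: int, n: int = 12) -> bool:
    """Segment test: n contiguous circle pixels all brighter or all darker than the center."""
    center = image[row][col]
    states = []
    for dr, dc in CIRCLE:
        value = image[row + dr][col + dc]
        if value > center + t:
            states.append(1)    # brighter
        elif value < center - t:
            states.append(-1)   # darker
        else:
            states.append(0)    # similar
    for target in (1, -1):
        run = 0
        for state in states + states:  # doubled list handles arcs that wrap around
            run = run + 1 if state == target else 0
            if run >= n:
                return True
    return False


spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 100  # bright center surrounded by dark pixels
spot_is_corner = is_fast_corner(spot, 3, 3, t=20)
```

The candidate pixel must lie at least three pixels away from the image border so that the whole circle fits inside the image.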
One of the most popular floating-point local feature descriptors is the scale invariant feature transform (SIFT), which shows excellent results [50]. The SIFT algorithm can be divided into four major phases: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint description. SIFT identifies repeatable features in an image that can be found at various scales and views through a scale-space function, using the difference of Gaussian between two images separated by nearby scales. Local minima and maxima are detected to find candidate keypoints. A detailed fit to the nearby data for location, scale, and the ratio of principal curvatures, based on the derivatives and the Hessian at each candidate keypoint, is performed to reject noise-sensitive keypoints. The Hessian matrix is used to reject keypoints localized along an edge. A 36-bin orientation histogram is created using the Gaussian-weighted gradient orientations of the keypoint neighborhood pixels at the scale at which the keypoint is detected. The dominant direction of the local gradients forms the highest peak in the orientation histogram, which is used to assign the orientation of the keypoint. These three steps make the keypoints invariant to location, scale, and orientation changes. The keypoint descriptor is formed using the gradient orientations and magnitudes of the keypoint neighborhood pixels at the scale and location at which the keypoint was detected. The gradient orientations are rotated relative to the keypoint orientation to make the descriptor rotation invariant. SIFT provides a floating-point keypoint descriptor with 128 elements. Keypoints between two images are matched by identifying their nearest neighbors with the minimum Euclidean distance between descriptors. SIFT performs well under rotation and scale changes. It has an excellent performance on images with a simple background, which it represents without noise.
SIFT is good; however, it is not fast enough. To overcome this issue, many variants of SIFT have been proposed such as root-SIFT [55], affine-SIFT [56], color-SIFT [57], edge-SIFT [58], CSIFT [59], NSIFT [60], and PCA-SIFT [61].
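The nearest-neighbor matching step used with floating-point descriptors can be sketched as a brute-force search with Euclidean distance. The short two-element vectors below stand in for 128-element SIFT descriptors, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def match_descriptors(query: Sequence[Descriptor],
                      dataset: Sequence[Descriptor]) -> List[Tuple[int, int, float]]:
    """For each query descriptor, return (query index, nearest dataset index, distance)."""
    matches: List[Tuple[int, int, float]] = []
    for q_index, q_desc in enumerate(query):
        distances = [math.dist(q_desc, d_desc) for d_desc in dataset]
        best = min(range(len(dataset)), key=distances.__getitem__)
        matches.append((q_index, best, distances[best]))
    return matches


# Toy descriptors (assumption: real SIFT vectors have 128 elements).
query_descriptors = [(0.0, 0.0), (10.0, 10.0)]
dataset_descriptors = [(9.0, 9.0), (1.0, 0.0), (5.0, 5.0)]
matches = match_descriptors(query_descriptors, dataset_descriptors)
```

The brute-force search costs one distance computation per descriptor pair, which is one reason faster variants and binary descriptors are attractive for large datasets.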
In [48], the "Speeded Up Robust Features" (SURF) algorithm is introduced. This keypoint detector and descriptor algorithm is scale and rotation invariant. It is based on the Hessian matrix, approximates LoG with a box filter using integral images, and can be applied at multiple scales simultaneously. It uses the determinant of the Hessian matrix for calculating the keypoint location and scale. It employs Haar wavelet responses for orientation assignment and description. The Haar wavelet responses are represented as a vector with a total of 64 dimensions. SURF is not affine invariant.
In [62], BRIEF, a binary descriptor, is presented. The image block descriptor is calculated by taking simple intensity differences between pixels of an image block. The BRIEF keypoint feature descriptor is suitable for real-time applications due to its speed. However, it has a low tolerance for transformations such as rotation, scale changes, and image distortions. In [63], BRISK, a binary keypoint detection, description, and matching technique, is presented. In BRISK, keypoints are detected with a scale-space pyramid consisting of octaves and intraoctaves. It uses the 9-16 mask of the FAST feature detection technique, which requires at least 9 consecutive pixels in the 16-pixel circle to be brighter or darker than the center pixel, and the FAST 5-8 mask on octave c0 to obtain the FAST scores. The detected keypoint is described with a binary string of length 512 built from the results of brightness comparison tests, taking the direction of the keypoint into account to make the descriptor rotation invariant. The BRISK keypoint descriptors are fast and easy to match, since only the Hamming distance is calculated between them. The Hamming distance gives the dissimilarity between matched keypoints. In [64], the "Oriented FAST and Rotated BRIEF" (ORB) binary feature descriptor is presented. It is quicker to compute than SURF and SIFT but has limitations in descriptive power and scale invariance in some situations. In [65], "Fast Retina Keypoint (FREAK)," a binary feature descriptor inspired by the behavior of the human eye, is presented. The human eye uses the difference of Gaussian to extract features from an image at various sizes and encodes them. FREAK uses a retinal sampling grid, which is circular with a higher density of points near the center. The sample points are smoothed with kernels of different sizes to remove noise. The receptive fields overlap to capture more information and improve the discriminative power and performance.
The descriptor is created by the one-bit encoding of the difference between pairs of receptive fields, each smoothed with its corresponding Gaussian kernel. The receptive fields that are selected should be uncorrelated or weakly correlated and highly discriminative. Thus, the differences of Gaussian are selected in coarse-to-fine order. Initially, the first 16 bytes of the FREAK descriptor are compared for matching the keypoints. If the distance between the first 16 bytes is less than the threshold, the next bytes are compared. Thus, searching is also performed from the coarse to the fine level. Matching the first 16 bytes first increases the speed of matching. The rotation of the keypoint is calculated using the sum of local gradients. The local feature detectors can be compared on the basis of image contents and structures such as corners, blobs, or regions, and of their discriminative power with respect to various invariances [66]. Table 3 shows a comparison of various local feature detection techniques. The selection of a local feature detector depends entirely on the type of images in the dataset. The local feature detection techniques are still not highly robust to scale and affine transformations and have limited repeatability and robustness. Table 4 shows the average retrieval accuracy for the augmented Wang dataset using various feature extraction and description techniques. The augmented Wang dataset contains 1100 images of 11 different categories. The existing techniques are reimplemented and tested on the datasets to normalize the test environment. Mean square error is used as the distance measure. BRISK, ORB, FAST, MSER, and SURF are used for interest point detection. Feature descriptors such as BRISK, FREAK, SURF, and ORB are used. The average retrieval accuracy is 1.11% when BRISK is used for feature extraction and description. When features are extracted with BRISK and described using FREAK, the retrieval accuracy is 6.08%.
The average retrieval accuracy is 3.62% when ORB is used. With MSER as the feature extractor and SURF as the feature descriptor, the retrieval accuracy is 15.65%. When FAST is used for feature detection and FREAK is used for description, the average retrieval accuracy is 12.61%. When SURF is used as the feature detector and FREAK is used as the feature descriptor, the average retrieval accuracy is 12.16%. The highest average retrieval accuracy, 22.10%, is obtained when SURF is used as both the feature detector and the descriptor. Floating-point descriptors such as SURF have high retrieval accuracy but have high memory requirements and are not suitable for real-time applications. Binary descriptors offer fast matching, fast computation, and low memory requirements but suffer from low descriptive power, robustness, and generality.
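The Hamming-distance matching of binary descriptors, together with the coarse-to-fine check on the first 16 bytes described for FREAK, can be sketched as follows. The 64-byte toy descriptors, the threshold value, and the function names are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def cascade_match(query: bytes, candidates: Sequence[bytes],
                  coarse_threshold: int, prefix: int = 16
                  ) -> Tuple[Optional[int], Optional[int]]:
    """Return (index, distance) of the best candidate that passes the coarse test."""
    best_index: Optional[int] = None
    best_distance: Optional[int] = None
    for index, candidate in enumerate(candidates):
        # Coarse stage: compare only the first `prefix` bytes.
        if hamming(query[:prefix], candidate[:prefix]) >= coarse_threshold:
            continue  # rejected without looking at the remaining bytes
        distance = hamming(query, candidate)  # fine stage: full descriptor
        if best_distance is None or distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


query = bytes(64)                                   # all-zero 512-bit descriptor
candidates = [
    bytes([0xFF] * 16) + bytes(48),                 # differs heavily in the coarse part
    bytes([0x01]) + bytes(63),                      # one differing bit
    bytes(16) + bytes([0xFF]) + bytes(47),          # eight differing bits in the fine part
]
best = cascade_match(query, candidates, coarse_threshold=20)
```

Most non-matching candidates are discarded after comparing only a quarter of the descriptor, which is where the speed gain of the cascade comes from.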

Feature Fusion-Based Techniques Used in Image Retrieval
The image datasets contain images that are highly diverse and nonhomogeneous in nature. It is very difficult to retrieve the images by using simple and individual low-level image features. Therefore, in the literature, the performance of image retrieval systems has been enhanced by combining the low-level features (color, shape, and texture), global features, and local features for representing the feature vectors.
In [67], the image shape and color features are merged to generate hybrid image features. The color moments (mean, standard deviation, and skewness) are extracted as color features, and seven invariant moments are extracted as shape features from the second and third moments to represent an image. These features of the query image are compared with those of the dataset images using the L2 similarity measure. For performance evaluation, precision and recall parameters are used.
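The color part of such a hybrid feature can be illustrated with a short sketch. It assumes that each color channel is available as a flat list of pixel values and uses the signed cube root of the third central moment as the skewness, which is one common convention and not necessarily the one used in [67]. The seven invariant shape moments are omitted, and all function names are hypothetical.

```python
import math


def color_moments(channel):
    """Mean, standard deviation, and skewness of one color channel (a flat list)."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    third = sum((v - mean) ** 3 for v in channel) / n
    std = math.sqrt(var)
    # Signed cube root keeps the skewness in the same units as the pixel values.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return [mean, std, skew]


def color_feature(r, g, b):
    """Nine-dimensional color feature: three moments for each of the three channels."""
    return color_moments(r) + color_moments(g) + color_moments(b)


def l2_distance(q, d):
    """Euclidean (L2) distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))
```

Dataset images would then be ranked by `l2_distance` between their feature vector and that of the query image, smallest first.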
In [68], image features are generated using texture and color features. HSV color moments, i.e., mean, skewness, and standard deviation, are calculated. Texture features are generated using a 2D Gabor filter by varying the scale and rotation. Euclidean distance is used as a similarity measure. Precision is used as a performance metric.
In [69], scale and illumination robust feature vectors are generated by the fusion of texture and color features. Here, multilevel Haar wavelet features are combined with a color histogram to increase retrieval accuracy.
In [70], color and edge features are combined to generate a robust color volume histogram-based feature vector. The HSV color space image is generated from an RGB color image. The H, S, and V components are uniformly quantized into 72 bins. The Sobel edge detection operator is applied to the V component to generate a quantized edge map of the image containing 32 bins. L1 distance is used as the similarity measurement criterion.
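A uniform HSV quantization of this kind can be sketched as below. The text only states that the result has 72 bins; the split into 8 hue, 3 saturation, and 3 value levels is an assumption made for this illustration, as are the function names. The Sobel edge map is not shown.

```python
import colorsys


def hsv_bin(r, g, b, h_bins=8, s_bins=3, v_bins=3):
    """Map an 8-bit RGB pixel to one of h_bins * s_bins * v_bins color bins.

    The 8 x 3 x 3 split is assumed for this sketch; the source gives only the total of 72.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # min(...) keeps the boundary value 1.0 inside the last bin.
    hq = min(int(h * h_bins), h_bins - 1)
    sq = min(int(s * s_bins), s_bins - 1)
    vq = min(int(v * v_bins), v_bins - 1)
    return (hq * s_bins + sq) * v_bins + vq


def hsv_histogram(pixels, h_bins=8, s_bins=3, v_bins=3):
    """Normalized color histogram of an iterable of (r, g, b) pixels."""
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        hist[hsv_bin(r, g, b, h_bins, s_bins, v_bins)] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist
```

Two such histograms can be compared with the L1 distance, as in the method described above.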
In [71], the local feature descriptor technique is combined with a bag of words. SURF- and SIFT-based local features are generated. K-means clustering is used to generate visual words from these extracted features. Images that match the query image are retrieved from the dataset using the SVM classifier. For performance evaluation, precision and recall parameters are used.
In [72], the color contents, shape, and color texture are used for generating the features of the image. The color contents are extracted by calculating the summation of median and variance from the histogram of each of the R, G, and B planes. Features are made independent of rotation and illumination by extracting shape features. The RGB image is first converted to a grayscale image. Salt and pepper noise is removed by applying the median filter. An image feature vector based on the shape is generated by applying the neutrosophic clustering algorithm and the Canny edge detection algorithm. The standard deviation, mean, contrast, energy, and homogeneity of the texture and color features are calculated in the horizontal, vertical, and diagonal directions using GLCM, after applying the Gaussian filter and dividing the image into 4 × 4 blocks. All these features are stored in the database as feature vectors. For similarity measurement and retrieval of matching images, a memetic algorithm based on the genetic and great deluge algorithms is used.
In [73], chromaticity moments, co-occurrence, and color moments features are fused to generate a feature vector. Shape and distribution chromaticity moments are calculated using CIE xyY color space. For each color plane, standard deviation, mean, and skewness color moments are determined using RGB color space. Contrast, energy, homogeneity, correlation, and entropy color co-occurrence statistical features are calculated using RGB color space. Inverse variance weighted Euclidean distance is used as a similarity measure to improve accuracy.
In [74], local feature extraction techniques are combined with a bag of visual words (BoVW). For each image in the training dataset, the SURF and FREAK features are calculated. The K-means++ clustering algorithm is used to reduce the feature vector space, and clusters are generated. Each visual word represents the center of a cluster, and it is used to generate the codebook or vocabulary. The visual words of FREAK and SURF are fused together by concatenation. A histogram is constructed for each image of the dataset and given to the support vector machine (SVM) as the input. The Euclidean similarity measure is used to calculate the similarity score between the query image and the dataset image collection.
In [75], local and global features are fused together by considering the SURF and histogram of gradient (HoG) features of the image. The SURF and HoG features are obtained from the image. A visual word vocabulary is generated from the training image features using the K-means algorithm. The histogram of visual words is used as input to train the SVM with the Hellinger kernel function. Euclidean distance is utilized to retrieve the images from the image dataset collection.
In [76], the "Weighted Average of Triangular Histograms (WATH)" of visual words is considered to add spatial content information of the image. This helps in reducing two problems: first, the interpretation gap between low-level image features and high-level image semantics and, second, the overfitting problem due to a large visual dictionary.
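The histogram construction in a BoVW pipeline can be illustrated with a small sketch. It assumes that the two vocabularies (cluster centers) have already been learned, and it reads "fusion by concatenation" as concatenating the per-descriptor-type histograms, which is one possible interpretation. Euclidean assignment is used for both descriptor types for simplicity, although binary descriptors such as FREAK are normally compared with the Hamming distance. All names are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to a descriptor."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))


def bovw_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one image."""
    hist = [0.0] * len(vocabulary)
    for desc in descriptors:
        hist[nearest_word(desc, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def fused_histogram(desc_a, vocab_a, desc_b, vocab_b):
    """Concatenate the histograms built from two descriptor types (assumed fusion rule)."""
    return bovw_histogram(desc_a, vocab_a) + bovw_histogram(desc_b, vocab_b)
```

The fused histogram has a fixed length regardless of how many keypoints an image contains, which is what makes it usable as classifier input.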
In [77], multiregion-based image retrieval has been presented using the curvelet transform and color features of significant regions. There are three major steps involved: identification of important regions from RGB images, representation of regions using several features, and retrieval of relevant images using regions from the query and target images of the dataset. The regions which engage users' attention are called important regions. An image can have multiple important regions. Important regions are extracted using a saliency map, location, size, and region homogeneity. The hue component is used to find the significant regions of the image. Each significant region is represented using histogram-based color descriptors. The RGB image is converted into HSV color space. The hue component is divided into 16 bins. The S and V components are divided into four bins each. Twenty-four features are extracted from each significant region. The texture feature descriptors of each significant region are computed using the curvelet transform. The histogram intersection technique is applied to measure the color closeness between images. The texture closeness is computed using Euclidean distance. The total distance is the sum of the color feature distance and the texture feature distance between the query image and the dataset image. The system is evaluated using precision, recall, and F-measure.
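The combination of color and texture closeness for one pair of regions can be sketched as follows. Converting the histogram intersection (a similarity) into a distance by subtracting it from 1 is an assumption of this sketch that holds for histograms normalized to sum to 1; the exact conversion used in [77] is not stated in the text. Function names are hypothetical.

```python
import math


def histogram_intersection_distance(h1, h2):
    """1 - sum(min(h1_i, h2_i)), assuming both histograms sum to 1."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))


def euclidean(t1, t2):
    """Euclidean distance between two texture feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))


def region_distance(color_q, texture_q, color_d, texture_d):
    """Total distance = color distance + texture distance, as described in the text."""
    return histogram_intersection_distance(color_q, color_d) + euclidean(texture_q, texture_d)
```

In practice the two terms may need rescaling so that neither dominates the sum; the unweighted sum here simply mirrors the description above.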
In [78], the authors have presented Sphere/Rectangle Tree indexing and locality sensitive hashing techniques with a bag of visual words. SURF is used to describe the image features. The visual vocabulary of the images is created using a bag of visual words (BoVW). Locality sensitive hashing, Sphere/Rectangle Tree, L1 norm (Manhattan distance), and L2 norm (Euclidean distance) are used to find the nearest visual words of the query image's vector.
In [79], the authors have presented a technique to combine the texture, edge, and color features of the image. It uses modified color difference histogram features in Lab color space to extract texture and color features. The edge orientation features are calculated in Lab color space using the Sobel operator. Query execution complexity is reduced by stagewise execution. Initially, similar images are selected based on color features. From this selected set of images, texture features are compared, and the result is given as an input for edge feature matching; finally, similar images are retrieved. Precision, recall, and bull's eye performance measures are used to evaluate the method.
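The stagewise execution described above can be illustrated with a generic filtering loop. This is a sketch under stated assumptions: each stage is represented by a hypothetical distance function, and the number of candidates kept after each stage is a free parameter that the text does not specify.

```python
def stagewise_retrieve(query, dataset, distance_fns, keep):
    """Filter candidates stage by stage and return the surviving dataset indices.

    distance_fns: functions f(query, item) -> float, ordered as color, texture, edge.
    keep: number of candidates that survive each stage (assumed values).
    """
    candidates = list(range(len(dataset)))
    for fn, k in zip(distance_fns, keep):
        # Rank only the current survivors with this stage's feature distance.
        candidates.sort(key=lambda i: fn(query, dataset[i]))
        candidates = candidates[:k]
    return candidates
```

Because later stages only see the survivors of earlier ones, the more expensive feature comparisons are evaluated on a shrinking candidate set.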
In [80], the authors have presented a technique to fuse the texture and color features. The color features are generated using a histogram of the quantized HSV color space image. Texture features are generated using GLCM, LBP, and normalized moment of inertia (NMI). NMI is calculated using the particle swarm optimization-based pulse-coupled neural network (PCNN) and the 2D Otsu image segmentation method. The technique fuses these features based on the weight assigned to each feature.
In [1], the authors have presented a method to fuse the texture features and color features. The RGB color space intensity values of the pixels are combined with quantized HSV color space values. The V component values are used to find the quantized edge and intensity information. An extended weighted L1 distance is used for similarity matching.
In [81], an image retrieval technique based on the fusion of color, texture, and shape features is presented. The color moments (average, standard deviation, skewness, and kurtosis) are calculated for each plane of the color image. The gray image is used to find the texture feature by dividing the image into 8 × 8 blocks and applying the DCT to each block. The feature vector is created using the DC components and specific AC components. For generating the shape features, the image is first segmented into salient regions by applying the c-means algorithm using color and texture feature vectors. Then, the principal axis of each region is calculated. The shape feature vector is generated using the endpoints of the principal axis. Matching images are retrieved from the dataset by using SVM.

Table 5 shows a summary of the various feature fusion-based techniques reviewed. Significant work has been carried out in fusing color features with texture features or local features. More focus is given to color features in RGB or HSV color space combined with SURF-based local features.
Local feature extraction techniques generate discriminative features based on corners, edges, and blobs. These techniques work well for images containing objects, monuments, and artifacts. Features are also detected from images of occluded objects. These features are invariant to scaling, translation, and illumination conditions. However, these techniques do not work well for scenic images such as images of forests, mountains, beaches, and sky, as the regions contain a large number of corners and edges. These methods also do not work well for objects that are smooth in texture and do not have many edges or corners. Global or low-level feature extraction techniques work well for such images but are not suitable for occluded objects and illumination variation conditions. Thus, fusion techniques that combine the local features with low-level features based on color and statistical features give better performance than the other techniques studied.

Datasets for Image Retrieval Used in Experimentation
Various datasets are available in the literature. These datasets contain images of human-made objects, natural objects, buildings, landmarks, animals, and natural scenes of beaches, mountains, and water. The images are taken with variations in conditions of illumination, rotation, scaling, and occlusion. The commonly used datasets for image retrieval are Flickr Logos 27, FlickrLogos-32, Flickr1M, Amsterdam Library of Object Images (ALOI), UKBench, INSTRE, ZuBuD, Corel-1000, COIL, Caltech 101, and Caltech-256. Figure 3 shows some of the images of these datasets.
ALOI dataset [83] contains 110250 color images in the PNG format of 1000 small objects, with more than 100 images per object. Each object is captured by changing the viewing angle in 72 directions, the illumination direction with 24 configurations, and the illumination color with 12 configurations. Each image is of size 768 × 576 pixels.
COIL dataset [84] contains 7200 images of 100 objects with 72 images per class. These images were captured by placing the object on a turntable with a black background. The images are in the PNG format, with a size of 128 × 128 pixels.
UKBench [85] dataset contains 10200 images of 2550 classes, with four images per class. The images are blurred and rotated. Each image is of size 640 × 480 pixels. All the images can be used as query images.
Stanford Mobile Visual Search [86] dataset contains images captured using various camera phones. The images are of size 640 × 480 pixels with varying distortions and illumination conditions of objects such as text documents, landmarks, CD covers, books, and paintings. The images are categorized into 1200 categories with 3300 query images.
Holiday [87] dataset contains 1491 images of 500 classes, with scenic holiday images of nature, water, fire, and human-made objects with changes in rotation, viewpoint, illumination, and blurring. The first image of each class is the query image.
Oxford-5K [88] dataset contains 5062 images of 11 Oxford building landmarks with 55 query images and some distracter images. The images are of size 1024 × 768 pixels. These images are downloaded from Flickr and manually annotated.
Paris dataset [89] contains 6412 images of 12 Paris landmarks such as the Eiffel Tower, Hotel des Invalides, Moulin Rouge, La Defense, Louvre, Notre Dame, Musee d'Orsay, Pantheon, Pompidou, Sacre Coeur, and Arc de Triomphe collected from Flickr with 500 query images. The images are of size 1024 × 768 pixels.
ZuBuD dataset [90] contains 1005 images of 201 Zurich buildings with five viewpoints. Each image is of size 320 × 240 pixels. There are 115 query images with varying viewpoints and illumination conditions.
Flickr Logos 27 dataset [91] is a labeled dataset created using logo images of brands such as Adidas, BMW, Coca-Cola, Pepsi, Vodafone, FedEx, DHL, Intel, Google, Nike, and Puma downloaded from Flickr. There are a total of 27 classes and 30 images per class in the training dataset. There are a total of 270 images in the query dataset, with five images per class and 135 images that do not belong to any class.
FlickrLogos-32 [92] dataset contains images of 32 logo brands of size 1024 × 768 pixels. These images were downloaded from Flickr. The images are partitioned into training, validation, and query sets. Out of the 8240 images present in the dataset, 6000 are distracter images.
Flickr1M dataset [93] consists of 1197398 images downloaded from Flickr. The image categories are broadly classified as objects (such as bicycle, birds, chairs, cats, tables, and trees), landmarks (such as Golden Gate Bridge, Tower Bridge, and Colosseum), scenes (such as the beach, city, people, sunset, and desert), and activities (such as baseball, sailing, sailboat, Christmas, and wedding).
INSTRE [94] is an object dataset. The dataset is divided into three datasets: INSTRE-S1 of 100 classes for single object case with 11011 images, INSTRE-S2 of 100 classes for single object case with 12059 images, and INSTRE-M for

Similarity Measures Used in Image Retrieval
There are many distance metrics or similarity measures defined in the literature to compare the query image with the images in the dataset, such as Manhattan distance, Euclidean distance, Chebychev distance, Minkowski distance, cosine distance, square chord distance, fidelity distance, Sorensen distance, Canberra distance, squared chi-squared distance, and Mahalanobis distance [15, 99-102]. A distance metric uses a distance function to compare the images. It is selected depending upon the features that are used for representing the image. The distance metrics discussed here use numerical or continuous values.
Minkowski distance (Lp distance) between two feature vectors q and d is the pth root of the sum of the pth powers of the absolute differences between each vector dimension pair, as given in the following equation:

$$D_{L_p}(q, d) = \left( \sum_{i=1}^{n} |q_i - d_i|^{p} \right)^{1/p},$$

where n is the number of dimensions of the feature vectors. If p = 1, it is called the Manhattan distance (L1/city block/taxicab distance), which is the sum of the absolute differences between each vector dimension pair, as given in the following equation:

$$D_{L_1}(q, d) = \sum_{i=1}^{n} |q_i - d_i|.$$

It is used to calculate the distance along a grid-like path between two feature vectors. This metric is robust to outliers, but it is sensitive to variations in the background such as color, illumination, light direction, and size.
Extended L1 distance [1] is a weighted L1 distance between the query image and dataset image features.
Euclidean distance (L2 distance) is the square root of the sum of the squared differences between each vector dimension pair of q and d, as given in the following equation:

$$D_{L_2}(q, d) = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^{2}}.$$

It can be used to calculate the distance between two data points in a plane. This distance metric is the most commonly used for similarity measurement in image retrieval because of its efficiency and effectiveness.
Mean square error is the mean of the squared differences between each vector dimension pair, as given in the following equation:

$$D_{MSE}(q, d) = \frac{1}{n} \sum_{i=1}^{n} (q_i - d_i)^{2},$$

where n is the number of dimensions of the feature vectors. The computational cost of the mean square error is higher than that of the sum of absolute differences because the squares of the differences are calculated. The distance is always non-negative. It can be used for both spatial and transform domain images.
Chebychev distance, also called L∞ distance, is given by the following equation:

$$D_{L_\infty}(q, d) = \max_{i} |q_i - d_i|.$$

The distance between image feature vectors is calculated as the maximum absolute difference between the pairs of features of the images.
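The Minkowski family of distances and the mean square error follow directly from the verbal definitions above and can be written in a few lines. The sketch below is a plain-Python illustration for feature vectors given as equal-length sequences of numbers; the function names are chosen for this example only.

```python
import math


def minkowski(q, d, p):
    """Lp distance: p-th root of the sum of p-th powers of absolute differences."""
    return sum(abs(a - b) ** p for a, b in zip(q, d)) ** (1.0 / p)


def manhattan(q, d):
    """L1 (city block) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(q, d))


def euclidean(q, d):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))


def mean_square_error(q, d):
    """Mean of the squared differences over all dimensions."""
    return sum((a - b) ** 2 for a, b in zip(q, d)) / len(q)


def chebyshev(q, d):
    """L-infinity distance: largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(q, d))
```

For q = (0, 0) and d = (3, 4), the L1 distance is 7, the L2 distance is 5, and the L∞ distance is 4, which shows how a larger p gives more weight to the largest coordinate difference.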
Square chord distance is defined as the sum of the squared differences between the square roots of each feature vector dimension pair, as given in the following equation:

$$D_{SC}(q, d) = \sum_{i=1}^{n} \left( \sqrt{q_i} - \sqrt{d_i} \right)^{2}.$$

Fidelity distance is the sum of the square roots of the products of each dimension pair of the feature vectors q and d, as given in the following equation:

$$D_{F}(q, d) = \sum_{i=1}^{n} \sqrt{q_i d_i}.$$

Sorensen distance is the sum of the absolute differences divided by the sum of the absolute values of the sums of each feature vector dimension pair, as given in the following equation:

$$D_{S}(q, d) = \frac{\sum_{i=1}^{n} |q_i - d_i|}{\sum_{i=1}^{n} |q_i + d_i|}.$$

Canberra distance is the sum, over all dimensions, of the absolute difference between the feature vector dimension pair divided by the sum of the absolute values of the pair, as given in the following equation:

$$D_{C}(q, d) = \sum_{i=1}^{n} \frac{|q_i - d_i|}{|q_i| + |d_i|}.$$

This method is useful for data spread about the origin. It is similar to the city block distance metric. City block distance gives larger values between dissimilar images. Canberra distance normalizes this by dividing each term by the sum of the feature pair.
Modified Canberra is the sum of the absolute differences between each feature vector dimension pair divided by the sum of the feature vector dimension pair, as given in the following equation.
Squared chi-squared distance is the sum of the squared differences between each feature vector dimension pair divided by the absolute value of their sum, as given in the following equation:

$$D_{\chi^2}(q, d) = \sum_{i=1}^{n} \frac{(q_i - d_i)^{2}}{|q_i + d_i|}.$$
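The histogram-oriented measures above can be sketched in the same way. The code assumes non-negative feature values (as in normalized histograms), since square chord and fidelity take square roots, and it skips dimensions whose denominator is zero, which is a convention chosen for this illustration and not one stated in the text. Function names are hypothetical.

```python
import math


def square_chord(q, d):
    """Sum of squared differences between square roots (non-negative inputs assumed)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(q, d))


def fidelity(q, d):
    """Sum of square roots of element-wise products (non-negative inputs assumed)."""
    return sum(math.sqrt(a * b) for a, b in zip(q, d))


def sorensen(q, d):
    """Sum of absolute differences divided by the sum of absolute sums."""
    den = sum(abs(a + b) for a, b in zip(q, d))
    return sum(abs(a - b) for a, b in zip(q, d)) / den if den else 0.0


def canberra(q, d):
    """Per-dimension normalized absolute difference; zero-denominator terms are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(q, d) if abs(a) + abs(b) > 0)


def squared_chi_squared(q, d):
    """Squared difference over absolute sum; zero-denominator terms are skipped."""
    return sum((a - b) ** 2 / abs(a + b) for a, b in zip(q, d) if a + b != 0)
```

Note that fidelity grows with similarity (it equals 1 for two identical normalized histograms), whereas the other four measures are 0 for identical vectors.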
Mahalanobis distance is a quadratic form distance metric, where Σ⁻¹ is the inverse of the covariance matrix of the feature vectors q and d, as given in equation (30):

$$D_{M}(q, d) = \sqrt{(q - d)^{T} \, \Sigma^{-1} \, (q - d)}. \tag{30}$$

It measures the similarity between two feature vectors by taking covariance into consideration. It is used for calculating distance in multivariate space.

Cosine distance provides the angular difference between the feature vectors as given in the following equation. This metric is generally used when the orientation of the feature vectors is important and the magnitude does not matter.
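A cosine-based distance can be sketched as below. The form used here, one minus the cosine of the angle between the vectors, is a common convention and is an assumption of this example; the text does not state which exact form is intended. The function name is hypothetical.

```python
import math


def cosine_distance(q, d):
    """1 - cos(angle between q and d); assumed convention for this sketch."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_q * norm_d)
```

Scaling either vector by a positive constant leaves the result unchanged, which is why this measure suits features where only the orientation matters.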

Performance Evaluation of Image Retrieval
The quality of image retrieval methods can be evaluated by the accuracy of the method based on the rank assigned to the retrieved images. The retrieved images can be classified as relevant or irrelevant. In the literature, recall and precision have been widely used to evaluate the quality of image retrieval methods. The precision for a query image is defined as in equation (32). The recall for a query image is defined as in equation (33).
$$Pr = \frac{\text{number of relevant images retrieved}}{\text{total number of images retrieved}}, \tag{32}$$

$$Rc = \frac{\text{number of relevant images retrieved}}{\text{total number of relevant images in the dataset}}. \tag{33}$$
Precision and recall are combined into the F-score (F-measure) to compute the performance of the retrieval method. The F-score is defined as in equation (34).
$$F_{score} = \frac{w \cdot Pr \cdot Rc}{w^{2} \cdot Pr + Rc}, \tag{34}$$

where w is the parameter used to give weightage to precision over recall.
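The evaluation measures can be computed from a ranked result list and a ground-truth set as sketched below. Precision and recall follow equations (32) and (33). The combined score uses the conventional F-beta weighting, (1 + β²)·Pr·Rc / (β²·Pr + Rc), which is an assumption of this sketch and may differ from the expression in equation (34). Function and parameter names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: image identifiers returned by the system.
    relevant: identifiers of all relevant images in the dataset.
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for img in retrieved if img in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Conventional F-beta score (assumed form); beta = 1 gives the harmonic mean."""
    den = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / den if den else 0.0
```

For example, if 4 images are retrieved, 2 of them are relevant, and the dataset holds 5 relevant images in total, the precision is 0.5 and the recall is 0.4.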

Conclusion and Possible Future Research Directions in Image Retrieval
In this article, a detailed review of the CBIR techniques proposed in the last 10-15 years, using various feature extraction and description techniques, has been presented. Low-level features are used to represent images with texture features, shape features, and color features. The images in the standard datasets are diverse and complex in nature due to rotation, translation, scale, and affine variations. Therefore, one type of low-level feature cannot represent the image with high discriminative power. The fusion of multiple low-level feature representations can enhance the performance of the retrieval system. The global feature extraction techniques work well with scenic images of nature but perform worse in the case of images containing human-made structures and objects. In the case of local feature extraction techniques, features are extracted from the image regions located near the interest points instead of the complete image, and thus these techniques work well for partially visible objects. These techniques are not suitable for scenic images of nature, as too many local features are extracted. The fusion of low-level image features with local features can improve the performance of the system. Blob-based SURF variants and region-based MSER variants can be fused with texture features and color features to improve the accuracy of the system.

Local feature extraction techniques generate large feature descriptors of varying sizes. The feature descriptor needs to be reduced to an optimal length so that the speed of query execution can be improved. The fusion of machine learning algorithms with local image features, low-level image features, and statistical features might improve the performance of the system. Image retrieval using deep neural network-based algorithms gives better results than the traditional local and global feature description techniques but requires high computing power and fine-tuning of the network. The low-level features and local features require less computing power. The fusion of these two methods is a possible research area.

The performance of image retrieval techniques can be improved by combining other clues such as image annotations, web search history, text in the web pages, and speech present in the videos. The standard datasets that are currently used for image retrieval have been designed mainly for image classification. There is a need for datasets specifically developed for image retrieval, with a large number of images and categories.

Data Availability
The data used to support this study are available in the form of earlier published literature.