Is Vehicle Plate Corner Prediction by Vision Transformer Better than CNNs?

The license plate recognition performance can be improved by converting a license plate photographed from the side to the front view. To perform this transformation, the four corner positions of the license plate are required. Existing deep learning methods to find these corner positions use a convolutional neural network (CNN). In this study, we propose a model that uses a vision transformer (ViT), a nonconvolutional method that has recently been attracting attention, as a backbone, and compare its performance with existing CNN models. An ablation study is conducted by diversifying the ViT structure. The results show that ViT has strengths in model size reduction but is similar or inferior in performance to CNN, and that ViT training is more difficult than CNN training.


Introduction
License plate recognition (LPR) generally consists of two steps. In step 1, the license plate area is found within the image, and in step 2, character recognition is performed on that area. If the license plate in the area is not photographed from the front, as shown in Figure 1, character recognition is negatively affected. Therefore, to improve recognition performance, it is necessary to rectify the tilted license plate into the form viewed from the front. This can be done through a perspective transform, which is a warp operation that maps the coordinates of the four corners of the license plate to the corners of a rectangle.
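The perspective transform described above can be illustrated with a minimal sketch. It assumes NumPy, fixes the last entry of the 3 × 3 matrix to 1, and uses hypothetical corner coordinates and a hypothetical 200 × 60 target rectangle; the function names are illustrative and are not taken from the study.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 matrix H that maps each src corner to its dst corner."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations, with H[2, 2] fixed to 1.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_points(H, pts):
    """Apply H to an (N, 2) array of points using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Hypothetical corners of a tilted plate: top-left, top-right, bottom-left, bottom-right.
src = [(120.0, 80.0), (300.0, 110.0), (115.0, 150.0), (298.0, 190.0)]
# Hypothetical target rectangle for the rectified plate (200 x 60 pixels).
dst = [(0.0, 0.0), (200.0, 0.0), (0.0, 60.0), (200.0, 60.0)]

H = perspective_matrix(src, dst)
rectified = warp_points(H, src)  # the four corners after rectification
```

In practice the same matrix would be used to resample every pixel of the plate region, which is considerably cheaper than generating a rectified image with a separate network.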
The rectification not only improves recognition performance but also increases the efficiency of the labeling work that creates training data. For example, it is easier to mark upright letters than slanted letters with rectangular bounding boxes. It also facilitates the training of deep models for LPR: with rectified plate images, training depends less on augmentation to account for tilting. For the same reason, rectification is a useful technique in the field of document character recognition, where unwarping inclined or nonplanar document images can increase accuracy.
However, the rectification task is challenging due to the real-time requirements of LPR and the limitations of available computing resources. When performing LPR, computing resources are scarce because the operation is performed on a low-profile device directly connected to the camera. In addition, rectification is a secondary task that must be performed with a minimum of resources so that sufficient time and resources remain for locating the license plate area and for character recognition.
In this study, we develop a model to detect corner positions for vehicle plate rectification using a vision transformer (ViT), which has recently been attracting attention as a nonconvolutional model, as a backbone, and compare its performance with existing CNN-based methods. ViT was first applied to image classification [1], and its application has been expanding to detection and segmentation [2]. ViT is also versatile in its composition and is used in hybrid form combined with other well-known models [3]. However, recent studies question the claim that ViT performs better and is more robust than CNN [4]. To broaden our understanding of the pros and cons of ViT, we create 600 corner prediction models through combinations of structural parameters and analyze the effect of the parameters through performance measurement. The contributions of this study can be summarized as follows:
(i) We developed a ViT-based corner prediction model that can be used for the rectification of tilted license plates in images.
(ii) Through an ablation study on 600 models, the effect of structural parameters on model size and performance was analyzed.
(iii) By comparing the performance of ViT-based models with CNN-based ResNet and MobileNet, the advantages and disadvantages of ViT were analyzed.
This article is organized as follows: Section 2 examines studies related to license plate rectification and vision tasks using ViT, and Section 3 proposes a ViT-based plate corner prediction model. In Section 4, performance evaluation is carried out, and Section 5 concludes the study.

Related Works
Finding the four corner positions for the rectification of a tilted license plate image can be thought of as a kind of detection problem. If the outline of the license plate is clear, edge detection can be used. However, since there are many images with unclear outlines, its general applicability is limited. Another option is to find the outermost line that encloses all the letters, but this is not efficient because the letters have to be detected first.
Methods [5,6] using CNN for corner prediction use features extracted from convolutional layers or predict corner coordinates from a latent representation created by an autoencoder. While these methods find the coordinates required for rectification through warping, a method for directly creating a rectified image [7] has also been proposed. For example, there is a method using U-Net [8]. However, this method is difficult to apply to the license plate recognition task with real-time requirements. This is because denoising is required as a preliminary step to increase the quality of the rectified image, and the image creation process takes more time than warping. Also, there are cases where blurring occurs in the resulting image, which leads to a decrease in recognition performance [9]. A domain adaptation technique was tried in [10], where a prediction model trained for the plates of one country is adapted to the same task for the plates of a different country.
ViT, which is expected to replace CNN in various vision tasks, is based on the representative transformer model [11] used in the field of natural language processing. ViT converts the input image into a sequence of image patches and extracts context information through a multihead self-attention structure. This information is evenly distributed between layers and is known to be more advantageous for maintaining spatial information [12]. In addition, ViT is more robust to adversarial perturbations as well as occlusion and domain shift compared to existing convolution-based structures [13,14]. As an example of ViT being used for rectification, it was used to unwarp document images with geometric distortion [15]. There, the transformer corrects the distortion through pixel-wise displacement.
The self-attention mechanism, a core element of ViT, is limited to low-resolution image input due to its computational complexity, and there have been questions about whether it will be effective for tasks that require high input resolution, such as detection or segmentation. However, for the detection task, a model has been proposed that uses ViT as a backbone combined with a common detection task head [16]. In the segmentation field, ViT shows better performance than CNN when combined with the existing U-Net structure [17]. Image generation using generative adversarial networks is known to be the most difficult of vision tasks. Even in this field, competitive results were obtained using only ViT without the help of CNN [18]. Recently, it has been believed that ViT will completely replace the convolution operator, which was considered essential for vision tasks. Contrary to this belief, it has been claimed that convolution-based networks can have as much adversarial robustness as transformers if they follow a training methodology similar to that of transformers [4].
We considered ViT instead of CNN as a backbone to test whether a ViT backbone can show competitive performance while being smaller than a CNN. As described above, since rectification is a secondary step in the license plate recognition process, it should use as few resources as possible, and it should be possible to omit it if necessary. This is important because most license plate recognition processes are performed on edge devices with limited computing resources. In such a low-profile environment, a new model that is smaller than the convolution-based model but can perform comparably to it is needed. We therefore want to judge the potential of ViT, which has recently been receiving attention, as a backbone.

Plate Corner Prediction by Vision Transformer
The proposed model for plate corner prediction consists of a ViT backbone and a corner head, as shown in Figure 2. ViT takes the input image as a sequence of patches and generates encoded patches through a self-attention mechanism. The corner head consists of a fully connected layer that predicts 8 values corresponding to 4 corner coordinates (x, y).
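As a rough illustration of the corner head, the sketch below maps a set of encoded patches to 8 output values with a single fully connected layer. It is written with NumPy only and rests on several assumptions not stated in the text: the encoded patches are flattened before the layer, a sigmoid keeps the outputs between 0 and 1, and the sizes (16 patches, embedding dimension 32) and random weights are placeholders.

```python
import numpy as np

def corner_head(tokens, weight, bias):
    """Map encoded patches of shape (N, E) to 8 values in the open interval (0, 1)."""
    flat = tokens.reshape(-1)               # assumption: all tokens are flattened and used
    logits = weight @ flat + bias           # one fully connected layer with 8 outputs
    return 1.0 / (1.0 + np.exp(-logits))    # assumption: sigmoid bounds the outputs

rng = np.random.default_rng(0)
num_patches, embed_dim = 16, 32             # hypothetical sizes
tokens = rng.standard_normal((num_patches, embed_dim))   # stand-in for backbone output
weight = rng.standard_normal((8, num_patches * embed_dim)) * 0.01
bias = np.zeros(8)

pred = corner_head(tokens, weight, bias)    # 8 values: (x, y) for each of 4 corners
```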
Padding is applied to the lower part of the input image, as shown in Figure 1, so that the entire license plate area is included within a square. As a result, in most input images, the actual license plate occupies the top portion of the square image. The license plates in the training and test images are of two types: those for cars and those for motorbikes. The license plate for automobiles in Figure 1(a) is rectangular, and the license plate for motorbikes in Figure 1(b) has a complex shape with the upper part narrower than the lower part.

Scientific Programming
The input image is divided into patches of size $P_{sz} \times P_{sz}$, as in the existing ViT. These patches are arranged into a sequence in row-major order and are converted into tokens through the patch embedding process. Position encoding is added to each token, and the result is provided as input to a transformer encoder where the self-attention operation is performed. ViT for the classification task uses a separate class token, but it is not used in our proposed model. This is because corner prediction is closer to a detection task than to a classification task, so all tokens, which carry spatial information about the image, must be used.
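The row-major patch sequence can be sketched as follows. This is only an illustration with NumPy: the function name is hypothetical, the 416 × 416 × 3 input and patch size 104 are example values, and the subsequent patch embedding and position encoding are omitted.

```python
import numpy as np

def to_patch_sequence(image, p_sz):
    """Split an (H, W, C) image into a row-major sequence of flattened patches."""
    h, w, c = image.shape
    if h % p_sz != 0 or w % p_sz != 0:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = h // p_sz, w // p_sz
    # Separate the patch grid axes from the within-patch axes, then bring the
    # grid axes to the front so patches are ordered row by row.
    patches = image.reshape(rows, p_sz, cols, p_sz, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, p_sz * p_sz * c)

# Example input filled with distinct values so patch order can be checked.
image = np.arange(416 * 416 * 3, dtype=np.float32).reshape(416, 416, 3)
seq = to_patch_sequence(image, 104)   # 16 patches, each of length 104 * 104 * 3
```

Because no class token is prepended, every row of `seq` corresponds to exactly one spatial location in the image.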
Figure 3 shows the positions in the license plate image of the four coordinates that the corner head should predict. The eight values predicted by the model are x and y coordinate values between 0 and 1, normalized according to the image size. As shown in Figure 3(a), each number is associated with one of the four corner positions: top-left, top-right, bottom-left, and bottom-right. This matching relationship applies equally to motorbike plates, as shown in Figure 3(b). However, since a motorbike plate is not rectangular, the upper corner positions are chosen so that rectification works better when the perspective transformation is performed. Specifically, the top corner positions are not located at the topmost part of the license plate but rather where the rectangular shape starts.
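To make the output format concrete, the following sketch converts 8 normalized values into pixel coordinates. The interleaved (x, y) layout and the corner order top-left, top-right, bottom-left, bottom-right are assumptions based on the description above, and the prediction values are invented for illustration.

```python
import numpy as np

# Assumed corner order, following the description of Figure 3.
CORNER_NAMES = ("top-left", "top-right", "bottom-left", "bottom-right")

def decode_corners(pred, width, height):
    """Turn 8 normalized values (x1, y1, ..., x4, y4) into a (4, 2) pixel array."""
    pts = np.asarray(pred, dtype=float).reshape(4, 2)
    return pts * np.array([width, height], dtype=float)

# Hypothetical model output for one 416 x 416 input image.
pred = [0.10, 0.05, 0.90, 0.08, 0.12, 0.40, 0.88, 0.45]
corners = decode_corners(pred, 416, 416)
named = dict(zip(CORNER_NAMES, corners.tolist()))
```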
As summarized in Table 1, the proposed ViT model can have different configurations by changing the combination of four parameters: patch size $P_{sz}$, depth $D_{attn}$, number of heads $H_{num}$, and embedding dimension $E_{dm}$. Regarding $P_{sz}$, the input 2D image $I \in \mathbb{R}^{H \times W \times C}$ is split into a sequence of 2D patches $I_p \in \mathbb{R}^{N \times P_{sz} \times P_{sz} \times C}$, where $(H, W)$ is the resolution of the original image, $C$ is the number of channels, $(P_{sz}, P_{sz})$ is the resolution of each image patch, and $N = HW / P_{sz}^2$. $D_{attn}$ is the number of self-attention layers stacked vertically inside the transformer encoder. $H_{num}$ is the number of heads in a multiheaded self-attention module. $E_{dm}$ is the constant latent vector size: image patches $I_p$ are mapped to $E_{dm}$ dimensions through a trainable linear projection. This vector size is also called the token length.
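As a worked example of the sequence length, assume the 416 × 416 input resolution given in the caption of Table 1 and the three patch sizes that appear in the model names of Section 4 (13, 26, and 104). The number of patches then falls quickly as the patch size grows:

```latex
N = \frac{HW}{P_{sz}^{2}}: \qquad
\frac{416 \cdot 416}{13^{2}} = 1024, \qquad
\frac{416 \cdot 416}{26^{2}} = 256, \qquad
\frac{416 \cdot 416}{104^{2}} = 16 .
```

Since self-attention compares every token with every other token, the smallest patch size yields by far the longest sequence for the encoder to process.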
The value of $E_{dm}$ is constrained by $H_{num}$ because the embedded vector size $E_{dm}$ must be divisible by $H_{num}$, satisfying the following equation:
$$E_{dm} = k \cdot H_{num}, \qquad (1)$$
where $k > 1$ is a positive integer. To keep computation and the number of parameters evenly distributed over the self-attention modules when changing $H_{num}$, $E_{dm}$ is typically set to a multiple of $H_{num}$.
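The divisibility constraint can be applied as a filter when enumerating configurations, as in the sketch below. The candidate values here are hypothetical placeholders chosen for illustration; the actual value ranges are those listed in Table 1, so the count produced here is not the 600 used in the study.

```python
from itertools import product

# Hypothetical candidate values; the real grid is defined in Table 1.
IMAGE_SIZE = 416
PATCH_SIZES = (13, 26, 104)
DEPTHS = (16, 64)
HEAD_COUNTS = (2, 4, 8)
EMBED_DIMS = (4, 16, 32, 64)

def valid_configs():
    """Enumerate parameter combinations that satisfy E_dm = k * H_num with k > 1."""
    configs = []
    for p_sz, d_attn, h_num, e_dm in product(PATCH_SIZES, DEPTHS, HEAD_COUNTS, EMBED_DIMS):
        k, remainder = divmod(e_dm, h_num)
        # The patch size must tile the image, and E_dm must be a multiple of H_num.
        if IMAGE_SIZE % p_sz == 0 and remainder == 0 and k > 1:
            configs.append({"p_sz": p_sz, "d_attn": d_attn, "h_num": h_num, "e_dm": e_dm})
    return configs

configs = valid_configs()
```

Filtering in this way also explains why larger embedding dimensions admit more head counts: each additional divisor of the embedding dimension adds a valid pairing.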
To understand the effect of these structural parameters on the size of the proposed ViT model, the parameters' correlations with the number of model parameters were analyzed for all 600 models. As shown in Figure 4(a), $E_{dm}$ has the strongest positive correlation with the number of model parameters, followed by $D_{attn}$ and $H_{num}$, and $P_{sz}$ has the weakest correlation. Because we flatten image patches of size $P_{sz} \times P_{sz}$ and map them to $E_{dm}$ dimensions through a convolutional projection, the number of parameters directly related to $P_{sz}$ is not significant. In contrast, because the embedded tokens of length $E_{dm}$ are passed through all layers, the number of related parameters increases proportionally as $E_{dm}$ grows. $D_{attn}$ represents the number of vertically stacked self-attention layers, thus contributing to the increase in model parameters. The same explanation applies to $H_{num}$. We also analyzed the correlations among the four parameters and observed a positive correlation only between $E_{dm}$ and $H_{num}$. Due to the structural characteristics of the ViT model, as we increase $E_{dm}$, the possible value range of $H_{num}$ widens, as in (1), resulting in increased model size.
The performance of the 600 ViT models corresponding to these combinations was measured using the same data and the same training conditions, and the results are described in detail in Section 4. Figure 5 shows the loss distribution on the validation dataset after training all 600 ViT models with the same training data for 50 epochs.
Figure 5(a) shows the distribution of loss values according to patch size $P_{sz}$. For the smallest patch size, $P_{sz} = 13$, the loss distribution range is the widest, and the loss values are generally larger than those of other patch sizes. When the patch size is small, it is difficult to achieve good performance because a patch cannot contain enough of the spatial context of the image. Figure 5(b) shows the loss distribution according to the depth $D_{attn}$. As $D_{attn}$ increases, the average loss decreases. This is for the same reason that performance generally increases as the number of layers in a deep model increases. Figure 5(c) shows the loss distribution according to the embedding dimension $E_{dm}$. For the smallest dimension, $E_{dm} = 4$, the overall loss value is the largest and the distribution is the widest. This is because, as in the case of $P_{sz}$, the embedding vector is too short to contain enough information. It should be noted that the loss decreases as the dimension increases up to $E_{dm} = 32$ and then stagnates beyond that, which means that the performance improvement from increasing the dimension is limited. In Figure 5(d), it can be seen that the loss does not change significantly with the number of heads $H_{num}$, which means that $H_{num}$ does not directly affect the performance.

Performance Analysis and Ablation Study
The results in Figure 5 help explain the characteristics of the ViT model in light of the discussion in Section 3. Unlike prior works using self-attention in computer vision, we do not introduce task-specific inductive biases into the proposed model architecture apart from the initial image patch extraction step. Instead, we interpret the ViT model as a general and scalable structure that can be configured by adjusting the four key structural parameters. Such structural exploration through combinations of the parameters allows the ViT model to integrate critical regression information across the entire image even in relatively early stages, with the help of self-attention. Specifically, we observe that the most structurally profound impact on performance comes from the depth $D_{attn}$, which most exploits the advantages of self-attention. Also, $H_{num}$, the number of self-attention heads, determines the attention distance, which is analogous to the receptive field size in CNNs, resulting in linearly increasing accuracy as $H_{num}$ increases.
From the validation loss results, we selected the best-performing model for each $P_{sz}$, for a total of 5 models. Table 2 shows the configuration parameter values and the number of model parameters of these models. For easy identification, the models are named according to $P_{sz}$. The smallest model is ViT-Patch-13, which has fewer than 1 million parameters. The largest model is ViT-Patch-104, with 26.8 million parameters. All models were configured to be sufficiently deep, with $D_{attn} \geq 64$.
ResNet [19] and MobileNet [20,21], which show excellent results in various vision tasks, are used as the backbones of CNN-based models for performance comparison with the proposed ViT model. We chose ResNet for comparison because the original work [1] that proposed the vision transformer used ResNet as the baseline CNN to compare the feature extraction capability of backbones. We consider MobileNet for a similar reason. In the original work, a variant of EfficientNet [22] was selected as an example of a state-of-the-art CNN. Since MobileNet shares similar structural characteristics with EfficientNet but is relatively small, we consider it a suitable example for comparing the proposed model in an experimental setup similar to that of the original work. In addition, the size differences between MobileNet and the proposed model provide another point of comparison.
ResNet shows high performance in most vision-related tasks by using residual connections and can produce models of various sizes by adjusting the depth. MobileNet is a small model that is the basis of EfficientNet, which is often used as the backbone for vision tasks and shows good performance thanks to an inverted residual connection with a linear bottleneck. For performance measurement, CNN-based corner prediction models using ResNet 18, 34, 50, 101, and 152 and MobileNet V2, V3-Small, and V3-Large as backbones were constructed. Features extracted from these backbones are given as inputs to the same corner head used in the ViT model. That is, the features extracted from the backbone are converted into 8 values normalized between 0 and 1 through the corner head.
Figure 6 shows the number of parameters of these CNN-based models and the proposed ViT-based models; these values are directly related to the size of the model. ResNet-based models are relatively large and MobileNet-based models are generally small, whereas ViT models have various sizes depending on the configuration. These models were trained for 50 epochs with the same data, using an Adam optimizer with a learning rate of 1e-4 and a weight decay of 5e-5. The trained models were then tested separately on images of Philippines license plates and Korean license plates.

Figure 7 shows the average error distance (in pixels) between the predicted corner location and the ground truth for each model on the validation data. The rows of the figure, from top to bottom, correspond to the results of ViT, ResNet, and MobileNet; the left column shows the results for Philippines license plates, and the right column shows the results for Korean license plates. Each mean error is displayed separately for the top-left, top-right, bottom-left, and bottom-right corner positions.
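The per-corner error metric can be sketched as follows, assuming it is the Euclidean distance in pixels averaged over images. The function name, the interleaved (x, y) layout, and the toy inputs are illustrative assumptions rather than details taken from the experiments.

```python
import numpy as np

def mean_corner_errors(pred, truth, width, height):
    """Return the mean pixel distance for each of the 4 corners.

    pred and truth hold normalized coordinates with shape (num_images, 8).
    """
    scale = np.array([width, height], dtype=float)
    p = np.asarray(pred, dtype=float).reshape(-1, 4, 2) * scale
    t = np.asarray(truth, dtype=float).reshape(-1, 4, 2) * scale
    # Euclidean distance per image and corner, then average over images.
    return np.linalg.norm(p - t, axis=2).mean(axis=0)

# Toy example: two 100 x 100 images, one prediction off by (3, 4) pixels at top-left.
truth = np.zeros((2, 8))
pred = np.zeros((2, 8))
pred[0, 0], pred[0, 1] = 0.03, 0.04
errors = mean_corner_errors(pred, truth, 100, 100)
```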
In Figure 7(a), ViT-Patch-13, the smallest of the ViT models, has an error about 3 to 8 pixels larger than the other ViT models. It is noteworthy that the second-smallest model, ViT-Patch-26, had an error below 5 pixels, which is superior to that of the other, larger models. Figure 7(b), the test result for the Korean license plates that were not used for training, shows generally larger errors than Figure 7(a), the result for the Philippines license plates, but the error trends for each model are similar. It should be noted that the largest model, ViT-Patch-104, had a larger error than the other models at the bottom-right corner for Korean license plates. Since this model showed adequate performance at the bottom-right corner of the Philippines license plates, it is presumed that generalization failed because the model overfit to the Philippines license plates.
In Figure 7(c), ResNet shows an error of about 8 to 12 pixels in most of the 5 models for the Philippines license plates. For Korean license plates, the overall error increased compared to the Philippines case, but the larger models showed better performance, and the error was the smallest when compared with the ViT- and MobileNet-based models. Figures 7(e) and 7(f) show the results of MobileNet, whose performance is intermediate: lower than the accuracy of the ResNet-based models and higher than that of the ViT-based models. For the Philippines license plates, the average error is about 10 pixels, and for the Korean license plates, the error distribution differs for each location. For example, errors at the lower corners are greater than errors at the upper corners. What is unusual is that, unlike in ViT and ResNet, where performance is proportional to model size, the error of MobileNetV3-Large, which has the largest model size, is lower than or similar to that of the smaller MobileNet-based models.
These results are evidence for two facts. One is that the feature representation ability of CNN is still effective in the corner prediction task. The other is that it is difficult to achieve excellent performance by simply replacing the CNN block with ViT. As in our experiment, when the backbone is changed from CNN to ViT and undergoes the same training process, ViT shows inferior performance to CNN, suggesting that ViT is more difficult to use than CNN. However, it is noteworthy that the small ViT-Patch-26 model, with 1.9 million parameters, showed smaller errors for Korean license plates than the CNN-based models. This means that ViT has potential in terms of generalization considering the model size.

Conclusions
As reports of ViT surpassing existing CNNs in various vision tasks increase, expectations for ViT are also growing. The motivation of this study was the question of whether such dominance also holds in the license plate corner prediction task. We developed a corner prediction model using ViT as a backbone and compared its performance with existing models using ResNet and MobileNet. As a result, ViT was notable in terms of performance relative to size, but ResNet was dominant in absolute performance. These results were obtained through the performance analysis of 600 ViT backbone models created through combinations of four ViT structural parameters.
These results show that performance improvement cannot be achieved by simply replacing a CNN backbone with ViT. In conclusion, ViT is a more difficult model to handle than CNN in that the input image has to be handled more carefully and the training process is more complicated.

Figure 3: Corner positions in plates and the matching relationship from predicted values. (a) Four corner positions in a Philippines plate; (b) corresponding four corner positions in a Philippines motorbike plate.

Figure 4: (a) Correlation analysis of the ViT model size with the four structural parameters and (b) the number of model parameters for varying embedding dimension $E_{dm}$ and depth $D_{attn}$.

Figure 5: The validation losses of the ViT models according to different combinations of the four structural parameters. (a) Validation loss according to patch size, (b) loss according to depth, (c) loss according to embedding dimension, and (d) loss according to the number of heads of the proposed architecture.

Figure 7: Average prediction errors according to corner positions of the ViT, ResNet, and MobileNet models. (a) ViT error distances for Philippines plates. (b) ViT error distances for Korean plates. (c) ResNet error distances for Philippines plates. (d) ResNet error distances for Korean plates. (e) MobileNet error distances for Philippines plates. (f) MobileNet error distances for Korean plates.

Table 1: Different configurations of the proposed ViT model, given an input image resolution of 416 × 416, according to combinations of four parameters: patch size $P_{sz}$, depth $D_{attn}$, number of heads $H_{num}$, and embedding dimension $E_{dm}$. In our experiments, a total of 600 possible configurations can be generated under the configuration constraints.

Table 2: The number of model parameters of the selected ViT backbone models along with the corresponding structural parameters.