RWYI: Reading What You Are Interested in with a Learning-Based Text Interactive System

As computer vision and human-computer interaction technology mature, vision-based auxiliary text reading has become the mainstream method to optimize the learning and reading experience. Most existing auxiliary text reading methods combine scene text recognition with human gesture recognition to complete the task in multiple stages. However, these methods cannot accurately and effectively extract the textual information that readers are interested in from complex and varied reading scenarios. To improve the text reading experience, we propose a human-centered fast auxiliary text reading method. It utilizes a hand-text hybrid object detection (HTD) model to instantly locate text of interest to readers, a font-consistent prior text image superresolution network (FCSRN) to recover low-resolution text images and thereby enhance the accuracy of text recognition, and a convolutional recurrent neural network (CRNN) text recognition operator to obtain the content of the text that is interesting to readers. To verify the effectiveness of the proposed method, we tested the performance of the text localization module on a homemade HTD dataset and the performance of the FCSRN on the public text image superresolution dataset TextZoom. Quantitative experiments on the overall performance of the fast auxiliary reading system, called reading what you are interested in (RWYI), were also designed. The experiments indicate that the proposed method can meet the needs of human-computer interactive auxiliary reading in text reading scenarios and optimize the reading experience.


Introduction
The popularization of multimedia vision sensors has led to the development of various human-centered computer vision technologies, which have been gradually integrated into and changed our lives. Vision-based human-computer interaction tasks are mostly used in text reading comprehension [1], gesture interaction [2], human action recognition [3,4], face detection [5,6], and other fields. However, unfamiliar or forgotten words make the reading and learning experience negative for both children and adults.
To optimize the reading experience, applications that assist readers in reading are slowly becoming available to the public. The early reading aids based on optical character encoding work only on printed books containing two-dimensional optical encoding, which directly led to the failure of their popularization [7]. With the maturity of computer vision and human-computer interaction research, vision-based auxiliary text reading has become the mainstream method to optimize the learning and reading experience. In general, vision-based auxiliary text reading consists of five modules: scene image input, image preprocessing, text-of-interest localization, text-of-interest recognition, and feedback output. To accurately locate the text of interest, the finger distribution is usually detected first from the perspective of object detection, and then the text of interest is located. Multistep detection is utilized, and engineering techniques are added to achieve a compromise result. The image quality of the located area also affects content recognition in the scene text image [8].
This method, which consists of multiple stages of text detection and object detection, needs an accurate position alignment strategy to achieve accurate text region localization, but a fixed alignment strategy cannot fully cope with complex and changeable text reading scenarios. The auxiliary text reading task has strict requirements on processing speed and accuracy, and the text reading scene itself is considerably complex: it involves the processing of low-quality images and the detection of small objects. These problems are all important to the auxiliary text reading task and present great challenges to this research.
In this study, a fast auxiliary text reading method is proposed to improve the efficiency of visual auxiliary text reading tasks from two perspectives: the rapid localization of text of interest and the effective enhancement of text image quality. On the one hand, in the stage of locating the text of interest, a one-stage object detection algorithm [9] is used to directly locate the hand-text hybrid object, instead of the traditional multimodel combination of hand key point or finger distribution detection with local text search.
The proposed method significantly reduces the processing time. On the other hand, the low-quality text image obtained by localization is first enhanced with a text superresolution technique [10], and the resulting high-resolution image is then used for text recognition [11]. Compared with using low-resolution images for text recognition directly, this improves the accuracy of text recognition. The main contributions of this paper are as follows: (i) We propose a systematic fast auxiliary text reading method that can more quickly and accurately identify the text content of interest to readers. (ii) A new method for locating text of interest using hand-text hybrid object detection (HTD) efficiently locates the text of interest to readers. (iii) An HTD dataset containing 12400 hand-text hybrid object images and annotations is used for training an HTD model. (iv) A new superresolution network architecture for text images is proposed to improve the quality of text images and thereby the accuracy of text recognition. The remainder of the paper is organized as follows. In Section 2, we survey recent works regarding vision-based human-computer interaction and auxiliary text reading tasks. Each component of the proposed system is described in Section 3. In Section 4, we report and discuss our experimental results, which lead to the conclusions in Section 5.

Related Work
Vision-Based Human-Computer Interaction. Human-machine interaction (HMI [12]) or human-computer interaction (HCI) is the convergence of computer science, behavioral science, artificial intelligence, design, and other applied disciplines and involves the in-depth study of the scientific implications and practices of the interface between humans and computers.
There are two lines of related research. At a superficial level, related research includes the research and design of new technologies to make computers more convenient tools for human life. At a deeper level, it includes the study of intelligent technologies that use the natural interaction between humans and computers, thereby enabling computers to become more harmonious human partners. With the rapid development of fields such as artificial intelligence and deep learning, human-computer interaction technology has made great progress. An increasing number of human-oriented human-computer interaction applications are now appearing in our lives, which has promoted the formation of smart cities. Based on the design concept of HCI, giving machines perception capabilities such as vision [1] and hearing [13] to complete specific tasks has become a research hotspot. Table 1 shows the existing human-computer interaction technologies. In particular, in the scenario of human-computer nonverbal communication, vision-based human-computer interaction tasks require the establishment of communication channels that infer intentions from human behaviors, including facial expressions, human poses, and gestures [2]. Notably, current implementations of these vision-based human-computer interaction tasks usually follow the process of image preprocessing, detection, and recognition, but the details differ between tasks, and they depend to varying degrees on datasets produced for specific functions. In fact, in the field of human-computer interaction, interaction methods based only on vision or hearing cannot fully meet the needs of human beings to disseminate and obtain information. Therefore, the multimodality of interactive information between humans and machines will be the trend of future research.

Auxiliary Text Reading.
The earliest machine-auxiliary reading method used a reading pen to select a predetermined part of a supporting publication so that the optical tip of the reading pen recognized the two-dimensional code printed in the publication, and then the matching voice package was played through the body circuit [7]. The technical areas that deep learning-based auxiliary text reading may cover are shown in Table 2. With the proposal of earlier two-stage object detection methods [22-24] and the birth of feature extraction backbone networks, such as feature pyramid networks (FPNs) [37] and PAN [38], later one-stage object detection methods [9, 25-28] have also been proposed in succession and have provided a variety of possibilities for localizing the text of interest to a reader in the auxiliary text reading task. Even so, the recognized problem of small object detection in object detection tasks has not been effectively solved. In addition, due to the particularity of text objects in scene text detection and recognition tasks, a series of scene text detection methods [29-31] and text recognition methods [11, 32-36] have been proposed on the basis of object detection methods. Although these methods cannot directly solve the problem of efficiently and accurately extracting textual information of interest to readers, they make it possible for the auxiliary text reading task to localize the text of interest to the reader effectively, which enables the task to progress steadily. Lighten AI's vision-based artificial intelligence method realized finger-point reading. It locates the text of interest by combining multiple object detection models with local search approaches and then uses a text recognition operator to identify the located text. Finally, the built-in voice package enables the model to read the text aloud and explain its meaning [8].
However, this method is very demanding on the image acquisition equipment, as it achieves good results only with extremely high-quality scene images. It is easy to overlook that image quality improvements [14-16] can also have beneficial effects on vision-based assisted text reading tasks. In particular, image superresolution techniques [17-21] can restore low-resolution images to high-resolution images to enhance image data quality and improve the effect of downstream tasks. TSRN [10] was proposed together with TextZoom, the first real-world text-image superresolution dataset, as a baseline for text-image superresolution that enables the reconstruction of low-quality text images. Because the text-image superresolution method is an intermediate task between text detection and recognition, it should be lightweight. When building a text-image superresolution network, it is necessary to balance the image quality improvement against the actual resource overhead.

Methodology
From a human-centered point of view and to improve a reader's text reading experience, we propose a fast auxiliary text reading method, which aims to obtain the text content of interest to a reader from the images acquired by a visual sensor in the text reading scene. Figure 1 illustrates the overall framework. In the first step, we use a single-stage hybrid object detection method to locate the target text area pointed at by a finger in the input image and obtain the text image, which has a lower quality than the original input image. In the second step, we perform superresolution processing on the detected low-quality text image to obtain an enhanced high-resolution text image. In the third step, we use the high-quality text image for text recognition and use a conventional text recognition operator to recognize the content of the text of interest pointed at by a finger. The recognized text content can then be read aloud and interpreted using devices such as speech, translation, and word-finding devices and used as feedback, which also takes full advantage of the interaction of visual and auditory information between humans and machines to help readers learn and understand the current text vocabulary.

Hybrid Object Detection.
In the auxiliary reading task, we let the reader's behavior have a positive effect on the task; the reader is not just the receiver of the auxiliary reading task but can be thought of as the leader in the process of human-computer interaction. Using the a priori knowledge that readers have a high probability of pointing at unfamiliar text with their fingers in auxiliary reading scenarios, we define the reader's finger together with the text it points at as a hybrid object category called "hand-text." The definition of this mixed class alleviates the difficult problem of small object detection. We aim to make the machine learn the contextual feature information contained in the behavior of a finger pointing at text in an image, considering not only the mixed features of the fingertip and the text area but also a wider range of gesture features. Furthermore, we take the features of this "hand-text" hybrid object as the basis for locating the text region of interest (ROI) in the image. Therefore, our proposed hybrid object detection method still belongs to the category of object classification and localization tasks but has a different starting point from traditional object detection methods, which only focus on the feature information of a single object category. The difference between the proposed HTD method and the traditional methods for locating the text of interest is shown in Figure 2. The proposed method does not involve multiple stages that must be aligned, such as text detection and object detection, which simplifies the localization of textual information of interest to readers.

Since the proposed method is an object localization task, it naturally follows the principles of object localization. We utilize a convolutional neural network to extract the complex features of the hybrid object "hand-text" in the image and use the localization of this hybrid object as the output of the prediction head of the hybrid object detection task. The key point of this task is whether the machine can learn the shallow texture features, shape features, and deep context features of the hybrid object category "hand-text" from images of a reading scene through a single-stage or end-to-end object detection method so as to generate correct localization predictions. In short, our proposed fast auxiliary text reading method uses hybrid object detection to directly locate the text ROI instead of using multiple object detection steps to narrow the detection range to obtain text regions.

Table 1: Existing human-computer interaction technologies.

| Technical fields | Features | Category | Ref. |
| --- | --- | --- | --- |
| Vision-based HCI | Look at human actions and analyze human intentions | Gesture interaction | [2] |
| | | Human action recognition | [3,4] |
| | | Face detection | [5,6] |
| Hearing-based HCI | Listen to human language and analyze human intent | Speech recognition | [13] |

Table 2: Technical fields that may be covered by the auxiliary text reading task.

| Technical fields | Features | Category | Ref. |
| --- | --- | --- | --- |
| Image preprocessing | Enhance images, improve image quality, and help downstream tasks achieve good results | Image warping | [14] |
| | | Image fusion | [15,16] |
| | | Image superresolution | [10, 17-21] |
| Object detection | Locate target areas of text that are of interest to readers | Two-stage object detection | [22-24] |
| | | One-stage object detection | [9, 25-28] |
| Scene text detection | Text region localization for complex scene images in real life | Text detection | [29-31] |
| Text recognition | Get detected text content of interest | Text recognition | [11, 32-36] |

Hand-Text Hybrid Object Detection Dataset.
To better accomplish the HTD task in the fast auxiliary text reading method, we prepared an HTD dataset from text reading scenes. It contains nearly 4,000 marked "hand-text" objects. A region is considered "background" when a finger is not pointing at the text or when the text is not being pointed at by a finger, and only instances in which a finger is pointing at the exact text area are marked as "foreground." The fingers pointing at the text in these images are those of different readers, and the pointed text is obtained from books with different fonts and font sizes. The lighting and background of the reading scenes are varied to fully ensure the diversity of the dataset. The images in the HTD dataset also vary in size so that the model learns the "hand-text" features in the reading scenes.
Data augmentation is very common in object detection tasks. Its ultimate purpose is to enable the object detection model to learn more generalized representations from more diverse data and to accurately classify and locate objects in more complex environments. To give our proposed HTD method a better detection effect, we use Gaussian perturbations, brightness changes, small-angle rotations, scaling, up-down flips, and so on, to augment the HTD dataset used to train the hybrid object detection network.
The image data augmentation method processes the original images and labels, resulting in an HTD dataset with four times the number of original images. Figure 3 shows some examples of HTD images.
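As an illustration of how such augmentations must transform the labels together with the images, the following is a minimal NumPy sketch of three of the operations named above (Gaussian perturbation, brightness change, and up-down flip). It is not the authors' pipeline: the function name `augment`, the noise standard deviation, the brightness range, and the `[x1, y1, x2, y2]` box layout are assumptions chosen for the example, and rotation and scaling are omitted for brevity.

```python
import numpy as np

def augment(image, boxes, rng):
    """Return three augmented copies of (image, boxes).

    image: H x W x 3 uint8 array.
    boxes: N x 4 float array of [x1, y1, x2, y2] in pixel coordinates.
    """
    h = image.shape[0]
    out = []

    # Gaussian perturbation: add zero-mean noise; boxes are unchanged.
    noise = rng.normal(0.0, 8.0, image.shape)  # sigma = 8 is an assumed value
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    out.append((noisy, boxes.copy()))

    # Brightness change: scale all intensities; boxes are unchanged.
    gain = rng.uniform(0.7, 1.3)  # assumed range
    bright = np.clip(image.astype(np.float64) * gain, 0, 255).astype(np.uint8)
    out.append((bright, boxes.copy()))

    # Up-down flip: reverse the rows and mirror the y coordinates of each box.
    flipped = image[::-1].copy()
    flipped_boxes = boxes.copy()
    flipped_boxes[:, 1] = h - boxes[:, 3]
    flipped_boxes[:, 3] = h - boxes[:, 1]
    out.append((flipped, flipped_boxes))
    return out
```

Photometric changes leave the annotations untouched, whereas geometric changes such as the flip must update the box coordinates, which is why the images and labels are processed together.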

Hybrid Object Detection Network.
YOLO is one of the most representative models for object detection tasks. After continuous iterative development in recent years, YOLO-V5 [9] achieves a good balance between detection accuracy and real-time performance. We use YOLO-V5 as the basic model for hybrid object detection to discriminate and locate finger-pointing actions, that is, to detect the "hand-text" hybrid object category in text reading scene images. In essence, "hand-text" hybrid object detection is an object localization task with a single predicted category in a special scene. Therefore, YOLO-V5 can obtain the ROI of the text pointed at by a finger with relatively high accuracy, a relatively small number of iterations, and reasonable computational complexity.
The YOLO-V5 model predicts both the object class probability and its bounding box in an end-to-end manner. Therefore, we use visual sensors in the text reading scene to obtain a video stream of a reader while reading, use the image frames of the video stream as the input of the hybrid object detection model, and use the YOLO-V5 model to locate "hand-text" hybrid objects and obtain their bounding boxes. Since the YOLO-V5 model contains multiple manually set anchor boxes, a nonmaximum suppression (NMS) strategy is essential. Because the "hand-text" hybrid object detection task has a single category, we can perform weighted NMS directly on the predicted bounding boxes without checking whether the class labels of the initially predicted bounding boxes are the same. Compared with traditional NMS, weighted NMS changes the bounding box culling step: boxes whose intersection over union (IOU) with the current bounding box is greater than the threshold and that are in the same category are not culled directly but are weighted according to the confidence predicted by the network to obtain a new bounding box, which is used as the final predicted bounding box. Then, the boxes that are not the most suitable are eliminated. In the weighted NMS formulation, $B_i$ represents the $i$-th initial prediction box generated by the model, $s_i$ is the prediction confidence of the "hand-text" hybrid object category for $B_i$, $M$ represents the bounding box currently being processed, $\tilde{M}$ is the weighted bounding box, and thres is the manually specified confidence threshold. Through the YOLO-V5 end-to-end object localization model, we directly obtain the bounding box coordinates of the text area pointed at by a finger and use the text area image as the output of the hybrid object detection task, which is also the input of the subsequent text-image superresolution task.
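The weighted NMS step described above can be sketched as follows. This is a minimal single-class sketch rather than the authors' implementation: the weights $w_i = s_i \cdot \mathrm{IoU}(M, B_i)$ follow a commonly used weighted NMS variant and are an assumption here, as are the function names `iou` and `weighted_nms`, the `[x1, y1, x2, y2]` box layout, and the default `thres=0.5`, which is applied to the IoU.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def weighted_nms(boxes, scores, thres=0.5):
    """Single-class weighted NMS; returns merged boxes and their scores."""
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)  # process boxes from highest to lowest score
    boxes, scores = boxes[order], scores[order]
    kept_boxes, kept_scores = [], []
    while len(boxes) > 0:
        m = boxes[0]                 # current highest-scoring box M
        overlaps = iou(m, boxes)     # includes M itself (IoU = 1)
        group = overlaps >= thres    # boxes that plain NMS would discard, plus M
        # Assumed weighting: confidence times overlap with M.
        w = scores[group] * overlaps[group]
        merged = (w[:, None] * boxes[group]).sum(axis=0) / w.sum()
        kept_boxes.append(merged)
        kept_scores.append(scores[0])
        boxes, scores = boxes[~group], scores[~group]
    return np.array(kept_boxes), np.array(kept_scores)
```

Because the task has a single "hand-text" category, no per-class grouping is needed before merging, which matches the simplification noted in the text.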

Text-Image Superresolution.
In a text reading scene, the image of the text area of interest obtained by a common visual sensor after HTD is often not of high quality, so the effect of text recognition declines, which makes it difficult for the auxiliary text reading task to achieve the expected effect. Text-image superresolution processing is usually an upstream task of text recognition and is used to improve the resolution of low-resolution text images and restore text-image details. We therefore use text-image superresolution processing to reconstruct text images, rather than using low-quality images for text recognition directly, so as to improve the accuracy of text recognition.
Mobile Information Systems
We exploit the prior that the text in the text ROI of interest to the reader maintains the same font style in auxiliary text reading scenarios and embed this prior knowledge into previous text-image superresolution methods. We try to let the machine help restore low-quality images to high-quality images by learning the font style consistency of text images.
Therefore, based on TSRN [10], we introduce a feed-forward convolution encoder (FCE) that extracts font-consistency prior information from text images; this prior is embedded into multiple text superresolution modules to improve the ability of the text-image superresolution network to construct high-quality text images. Figure 4 shows the architecture of our proposed text-image superresolution network with font-consistency priors (FCSRN).
Specifically, we take a low-resolution image $I_{LR} \in \mathbb{R}^{h \times w \times 3}$ as input and first use the spatial transformation network (STN) [39] to align low-resolution input images of different sizes. This process can be expressed as follows:

$$\hat{I}_{LR} = \mathrm{STN}(I_{LR}),$$

where $\mathrm{STN}(\cdot)$ represents the dimension alignment and $\hat{I}_{LR}$ represents the aligned low-resolution image. We perform two-branch processing on the aligned low-resolution text images. On the one hand, the FCE branch is used to extract font-consistency prior features to obtain $F_f$, and on the other hand, a $9 \times 9$ convolution operation is used to extract shallow image features to obtain $F_c$. Then, the extracted font-consistency prior features are embedded into the image features extracted by the backbone at different network depths through a concise and efficient method, concatenation along the channel dimension, and the result is used as the input of the sequential residual block (SRB). These processes can be formulated as follows:

$$F_f = \mathrm{FCE}(\hat{I}_{LR}), \qquad F_c = \mathrm{Conv}_{9 \times 9}(\hat{I}_{LR}),$$

$$F^{i}_{sr} = \mathrm{FSRB}_i(F^{i-1}_{sr}, F_f), \qquad i = 1, \ldots, N, \qquad F^{0}_{sr} = F_c,$$

where $\mathrm{FSRB}_i$ is the $i$-th text-image superresolution module embedded with font-consistency priors, $F^{i}_{sr}$ is the text-image superresolution feature obtained by the $i$-th FSRB, and $N$ is the number of stacked FSRBs. It is worth mentioning that we implement FCE using a stack of CBL modules consisting of convolution, batch normalization, and Leaky-ReLU activation functions. To ensure that the deeper layers remain effective, we use a residual structure between the $N = 5$ stacked SRBs. Finally, we use the pixel shuffle module to increase the resolution by a factor of 2 and use the convolution operation again to output the final superresolution text image.
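Two of the operations named above, channel concatenation of the prior features and pixel shuffle upsampling by a factor of 2, can be illustrated with plain NumPy. This is a shape-level sketch under assumptions, not the FCSRN implementation: the function names `concat_prior` and `pixel_shuffle` and the `H x W x C` memory layout are chosen for the example, and the channel ordering in `pixel_shuffle` follows the usual sub-pixel convolution convention.

```python
import numpy as np

def concat_prior(features, prior):
    """Embed prior features by concatenating along the channel axis (H x W x C)."""
    return np.concatenate([features, prior], axis=-1)

def pixel_shuffle(x, r=2):
    """Rearrange an H x W x (C*r*r) array into (H*r) x (W*r) x C."""
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c // (r * r)
    # Split channels into (C, r, r): input channel index = c*r*r + i*r + j.
    x = x.reshape(h, w, c_out, r, r)
    # Move the two sub-pixel offsets next to their spatial axes: H, i, W, j, C.
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(h * r, w * r, c_out)
```

For example, an image feature map of shape `16 x 64 x 32` concatenated with a prior of shape `16 x 64 x 8` yields `16 x 64 x 40`, and pixel shuffle with `r=2` turns `16 x 64 x 12` features into a `32 x 128 x 3` image.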
For the loss calculation part of model training, we follow the method in the baseline task. Generally, the quality of text-image superresolution can be measured not only by image quality metrics but also by the downstream task of text recognition. We evaluate the performance of text-image superresolution methods based on font-consistency priors in Section 4.2.

Text Recognition.
In the context of auxiliary text reading, the text images that we use to identify the text of interest to readers have the characteristics of regular text and large image ratios. Therefore, the text recognition operator that we use needs to strike a good balance between the recognition accuracy of regular text and the computational complexity. In addition, for our proposed auxiliary text reading method to satisfy different audiences, we need a text recognition operator that can recognize multiple languages.
After weighing various factors, we choose to use a convolutional recurrent neural network (CRNN) [11], which is an early deep learning method for regular text recognition. The network architecture of CRNN is shown in Figure 5; it is composed of convolutional layers, recurrent layers, and a transcription layer. At the head of CRNN, the convolutional layers automatically extract a feature sequence from each input image. In the neck, a recurrent network is built on the output of the convolutional layers to make a prediction for each frame of the feature sequence. The transcription layer is at the tail of CRNN; it converts the per-frame predictions of the recurrent layers into a label sequence and removes the blanks and redundant parts of the sequence. Although CRNN is composed of three parts (head, neck, and tail), it can still be trained jointly with a single loss function. The CRNN text recognition algorithm introduces a bidirectional long short-term memory (Bi-LSTM) network to enhance context modeling, feeds the output feature sequence to the CTC module, and directly decodes the sequence results. This algorithm is widely used in scene text image recognition tasks.
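The role of the transcription layer can be illustrated with greedy CTC decoding: consecutive repeated labels are merged, and blank labels are then dropped. The sketch below is a simplified illustration, not the CRNN code; the function name `ctc_greedy_decode` and the use of `"-"` as the blank symbol are assumptions, and the per-frame labels are taken as already chosen (for instance, by an argmax over the recurrent layer's output).

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse per-frame labels into a text string (greedy CTC decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)
```

Under these assumptions, the frame sequence `cc-aa-uuu-ss-e` decodes to `cause`, and the blank between the two `l` runs in `hh-e-ll-lo` is what allows the doubled letter in `hello` to survive the merge.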

The Overall Process of the System RWYI.
The proposed fast auxiliary text reading system RWYI achieves accurate localization of the text of interest to readers and improves the accuracy of text recognition by enhancing the quality of text images. A summary of the proposed method is shown in Algorithm 1, in which HTD is the hand-text hybrid object detection model, FCSRN is the text-image superresolution model with the font-consistency prior, and CRNN is the text recognition operator. The input $I_S$ is an image obtained by the visual sensor in the reading scene, and the output $C_I$ is the content of the text that the reader is interested in. In the first step, we use the single-stage hybrid object detection method HTD to locate the target text area pointed at by the finger in the input image $I_S$; the best prediction box $O^{*}_{ht}$ obtained by weighted NMS screening is used to crop the text image $I^{*}_{ht}$, which has a lower quality than the original input image. In the second step, we perform superresolution processing on the detected low-quality text image after alignment and obtain an enhanced high-resolution text image $I_{SR}$. In the third step, we use the high-quality text image for text recognition and use the CRNN text recognition operator to recognize the content $C_I$ of the text of interest pointed at by the finger.
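The three steps can be summarized as a short control-flow sketch. It is only an outline under assumptions and does not reproduce Algorithm 1: the function name `rwyi_pipeline` and the four callables `detect`, `crop`, `super_resolve`, and `recognize` are hypothetical placeholders standing in for HTD with weighted NMS, box cropping, FCSRN, and CRNN, respectively.

```python
def rwyi_pipeline(scene_image, detect, crop, super_resolve, recognize):
    """Return the recognized text of interest, or None if no hand-text is found.

    detect, crop, super_resolve, and recognize are caller-supplied callables
    (hypothetical stand-ins for HTD, box cropping, FCSRN, and CRNN).
    """
    # Step 1: locate the hand-text hybrid object and crop the text region.
    box = detect(scene_image)  # best box after weighted NMS, or None
    if box is None:
        return None
    text_image = crop(scene_image, box)

    # Step 2: enhance the low-quality text image.
    sr_image = super_resolve(text_image)

    # Step 3: recognize the text content.
    return recognize(sr_image)
```

Passing the stages in as callables keeps the sketch independent of any particular model library, and the early return covers frames in which no finger points at text.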

Experimental Results
The proposed method uses HTD to directly locate the text of interest to readers and embeds font-consistent priors for text-image superresolution recovery of text images. To verify whether these two modules can play a role in the auxiliary text reading scenario, we design experiments to verify the effectiveness of the two modules in the proposed method. In addition, a comparative experiment is conducted between our proposed fast auxiliary text reading method and the traditional multistage combination method to comprehensively consider the performance of our proposed method. All experiments were performed on a machine with an Ubuntu 18.04 operating system, two NVIDIA RTX3090 24 GB GPUs, 128 GB DDR4 RAM, and two Intel Xeon Gold 6148 processors, and the system was developed using the Python programming language.

Evaluation of Text Localization Performance.
The purpose of the HTD task is to locate the text of interest pointed at by a reader's hand in an image. As the primary task of auxiliary text reading, the accuracy and speed of its localization should be considered. The traditional localization methods using the combination of multiple object detection tasks can be roughly divided into three types.

Method 1. The global text and hand key points are detected first, and the coordinates of specific key points are matched with the text box to locate the text of interest.
Method 2. The key points of the hand are detected first, and the text of interest is located in the preset frame at the specific key point.
Method 3. First, the finger distribution is detected, and then the text closest to the finger is detected as the text of interest according to the finger distribution.
It should be noted that the detection algorithm for finger distribution and hand key points in the traditional multimodel combination methods is the same as that used in the proposed object detection method, and the text detection algorithm is EAST [29]. The three traditional methods above and the proposed HTD method are tested on a homemade private HTD dataset, and the detection accuracy (mAP) and the number of images processed per second (FPS) are taken as performance evaluation metrics for the models that localize the text of interest. Figure 6 shows the mAP values and speeds (FPS) of the four methods for locating the text of interest pointed at by a reader's finger on the private HTD dataset.
The experimental results show that the proposed hand-text hybrid object detection method is better than the traditional methods in detection accuracy and especially in localization speed. Our method is much faster than the localization methods that combine multiple detection models because it needs to process the image only once and requires no engineering strategies such as target position alignment or fixed area masks. In terms of accuracy, HTD additionally learns finger features, which makes the localization of the text of interest more accurate.

Figure 5: The network architecture of CRNN: convolutional layers (input image to convolutional feature maps and feature sequence), recurrent layers (Bi-LSTM), and transcription layers (predicted sequence).

To more intuitively demonstrate the advantages of HTD over traditional localization methods on the HTD dataset, we visualized some examples of the different methods for localizing the text of interest on the HTD test set. As shown in Figure 7, the first column is the text that the actual reader is interested in; the second, third, and fourth columns are the localization results of traditional methods 1, 2, and 3; and the fifth column is the localization result of the proposed hand-text hybrid object detection. From the visualization results, it can be seen that the proposed hand-text hybrid object detection module can more accurately locate the text of interest to readers.
We also apply different object detection networks, including YOLO-V4 [27], YOLO-V3 [25], and the single shot detector (SSD) [26], to HTD. Similarly, we evaluate the detection accuracy of these popular object detection networks on the HTD dataset. Table 3 shows the experimental results of our selected YOLO-V5 [9] model and the other networks. Although this is not the focus of our experiments, the experimental data show that the mAP value of the YOLO-V5 model we selected is higher than those of the other models because of its unique data augmentation method and novel feature extraction network.
Based on the above discussion, it is confirmed that the proposed method based on HTD is effective for locating the text of interest pointed at by a reader's finger in the auxiliary text reading scene and can quickly locate the text of interest from a reading scene image.

Evaluation of Text-Image Superresolution Performance.
TextZoom is the first dataset to focus on real text-image superresolution; it contains a total of 21,740 low-resolution/high-resolution text-image pairs, with the text content provided as the text recognition label for each sample. We conduct a performance evaluation of FCSRN on TextZoom. On the one hand, the purpose of FCSRN is to improve the quality of text images and thereby the accuracy of downstream text recognition. Therefore, we use the general CRNN [11] text recognition operator to compare different text-image superresolution methods with our method. On the other hand, FCSRN is still an image superresolution method, so regardless of the application scenario, it is necessary to use the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate the quality of the superresolution images. We compare our model with existing image superresolution algorithms, including SRCNN [17], VDSR [18], SRResNet [19], RRDB [20], RDN [21], and TSRN [10], on the TextZoom dataset. The results are shown in Table 4. The experimental data show that, with CRNN [11] text recognition accuracy as the evaluation metric, our method achieves advanced performance on the TextZoom test sets of different difficulty levels. The average text recognition accuracy, computed according to the distribution of the different difficulty subsets of the TextZoom test set, is also the best. Compared with other image superresolution methods, the image quality is also improved.
This may be because the font-consistency prior effectively helps the network extract text features from images of the same font category. It also suggests that the minimalist prior fusion method can fully transfer the feature information required for the superresolution of text images. Figure 8 shows some test results of our proposed FCSRN model on TextZoom [10], where the first row is the low-resolution image, the second row is the superresolution image obtained by our network, and the third row is the high-resolution image. These experiments confirm from several aspects that our proposed font-consistency prior text-image superresolution network can improve the quality of text images in the context of auxiliary text reading.
Table 3: Comparison of the detection accuracy of different object detection networks on the HTD dataset. The models compared are YOLO-V4 [27], YOLO-V3 [25], SSD [26], and YOLO-V5 [9].
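Two of the measures used in this subsection, the PSNR and the recognition accuracy averaged according to the sizes of the difficulty subsets, can be sketched as follows. This is a minimal sketch and not the evaluation code behind Table 4: images are assumed to be same-sized grayscale arrays given as nested lists with a peak value of 255, the function names are hypothetical, and SSIM is omitted because it needs a windowed implementation that does not fit a short example.

```python
import math
from typing import Dict, Sequence


def psnr(sr: Sequence[Sequence[float]],
         hr: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two same-sized images."""
    if len(sr) != len(hr) or any(len(a) != len(b) for a, b in zip(sr, hr)):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in hr)
    if n_pixels == 0:
        raise ValueError("images must not be empty")
    squared_error = sum((a - b) ** 2
                        for row_sr, row_hr in zip(sr, hr)
                        for a, b in zip(row_sr, row_hr))
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def weighted_average_accuracy(accuracy: Dict[str, float],
                              subset_size: Dict[str, int]) -> float:
    """Average per-subset accuracy, weighting each subset by its sample count."""
    total = sum(subset_size[name] for name in accuracy)
    if total == 0:
        raise ValueError("subsets must contain at least one sample")
    return sum(accuracy[name] * subset_size[name] for name in accuracy) / total
```

With placeholder values that are not taken from Table 4, `weighted_average_accuracy({"easy": 0.5, "hard": 0.25}, {"easy": 100, "hard": 300})` weights the larger subset three times as heavily and returns 0.3125, whereas an unweighted mean would give 0.375.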

Evaluation of the Overall Performance of RWYI.
For practicality, in addition to evaluating the individual modules, we should also consider the overall performance of the system. In this overall evaluation of the proposed fast auxiliary text reading method, we care only about how efficiently the text that readers are interested in is recognized from the reading scene. We therefore designed a corresponding experiment to explore the overall performance of RWYI.
We additionally annotate the test set of the HTD dataset with text content labels so that the system can recognize the text of interest pointed at by a reader's finger, and we use the recognition accuracy (Acc) as one of the system performance evaluation indicators. In addition, using a Logitech C1000 E multiresolution webcam, we adjust the lens resolution to obtain image frames of different quality from different reading scenes and input them into the auxiliary text reading system we built. The system processes each image, and the time spent (time) is used as another indicator for system performance evaluation. We divide our proposed RWYI into three main modules: text localization, text superresolution, and text recognition. Keeping the text recognition operator unchanged, we construct three other auxiliary text reading methods against which to evaluate our proposed method.
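The protocol described above can be sketched as a small evaluation harness that chains the three modules and reports the two indicators. This is a sketch under stated assumptions rather than the system we built: `compose` and `evaluate_reader` are hypothetical names, the three stages are passed in as arbitrary callables, Acc is assumed to be the fraction of exact string matches, and the time is assumed to be the mean wall-clock time per image.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def compose(localize: Callable[[Any], Any],
            super_resolve: Callable[[Any], Any],
            recognize: Callable[[Any], str]) -> Callable[[Any], str]:
    """Chain text localization, text superresolution, and text recognition."""
    def pipeline(image: Any) -> str:
        return recognize(super_resolve(localize(image)))
    return pipeline


def evaluate_reader(pipeline: Callable[[Any], str],
                    samples: Iterable[Tuple[Any, str]]) -> Tuple[float, float]:
    """Return (Acc, mean seconds per image) over (image, label) pairs."""
    correct, total, elapsed = 0, 0, 0.0
    for image, label in samples:
        start = time.perf_counter()
        predicted = pipeline(image)
        elapsed += time.perf_counter() - start
        correct += int(predicted == label)  # assumed exact-match criterion
        total += 1
    if total == 0:
        raise ValueError("at least one sample is required")
    return correct / total, elapsed / total
```

A variant without the superresolution module can be built by passing an identity function, for example `compose(localize, lambda crop: crop, recognize)`, which keeps the recognition operator unchanged while removing one stage.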
The experimental results are shown in Table 5. In the table, Traditional Method 3 is the third, and better-performing, of the three traditional text localization methods that combine multiple object detection models; it locates the text closest to the finger within a fixed area near the finger. HOD represents the HTD method for localizing text, and FCSRN is the text-image superresolution network embedded with the font-consistency prior.
From the abovementioned quantitative experimental results, it can be concluded that RWYI identifies the text of interest more accurately than the other methods constructed. Although the auxiliary text reading system that does not use FCSRN to restore text images processes each image faster, its recognition accuracy is 13% lower than that of the proposed RWYI method, which also shows that FCSRN benefits the text recognition task. Balancing image processing speed against text recognition accuracy, the proposed RWYI method is the best of the auxiliary text reading methods compared. The efficiency of each method in processing images of different resolutions is consistent with these conclusions. This also indirectly verifies that the hybrid object detection approach in the proposed method improves the speed of the auxiliary text reading task and that the text superresolution module improves its accuracy.

Conclusion
In this paper, we propose a fast auxiliary text reading method that improves a reader's text reading experience from a human-centered perspective. The method consists of three main tasks. First, the proposed hand-text hybrid object detection (HTD) model quickly locates the text of interest to a reader in the input reading scene image. Then, the proposed text-image superresolution model embedded with the font-consistency prior restores the low-resolution text image at that location to a high-resolution image, significantly improving the text image quality. Finally, the regular text recognition algorithm CRNN identifies the content of the text that the reader is interested in. To demonstrate the effectiveness of the proposed method, three quantitative comparative experiments were designed. The experimental results show that, in the task of locating the text that readers are interested in, the proposed HTD outperforms the existing methods that combine multiple detection models and can locate the target region more quickly and accurately. In the text-image superresolution task, the proposed FCSRN significantly improves the quality of the text images and benefits text recognition compared with other image superresolution methods. In summary, we adopt the idea of directly locating the text region of interest to the reader and improving the pixel quality of the text image, and we propose an efficient and accurate auxiliary text reading method. However, the text-of-interest localization used in RWYI does not make full use of the regular arrangement of text in books; it simply locates the text region of interest in the image. In the future, we hope to use the document image layout as an auxiliary condition for the hand-text hybrid object detection model, so as to realize the handwritten interaction in richer reading scenes.