A Novel Model for Intelligent Pull-Ups Test Based on Key Point Estimation of Human Body and Equipment



Introduction
Classical physical tests such as the pull-ups test in universities and in middle and primary schools are routine examinations in physical education. However, the examination is currently conducted manually in most cases, which leads to low efficiency, inconsistent standards, and subjectivity. Recently, China and other countries have paid increasing attention to ensuring the fairness and objectivity of sports tests. Vision-based AI technologies provide an efficient way to improve test efficiency and fairness. In addition, they reduce the workload of physical education teachers.
With the development of next-generation information technology such as the Internet of Things (IoT), cloud computing, wearable devices, big data, and machine learning, "intelligent sports" has become a hot area that attracts much attention from domain experts in the information and sports fields [1,2]. Automatic human pose estimation (HPE) is a common and critical task in an intelligent physical fitness test. The common practice is to analyze images or videos of the examinee's actions online or offline by employing computer vision and machine learning techniques.
More recently, deep learning has been widely applied in many fields, e.g., predicting malfunctions of sensors and machinery [3,4] or detecting intracranial aneurysms [5], owing to its powerful self-learning ability and its adaptability to visual processing tasks. Deep learning networks have also been introduced in the field of HPE [6]. Most HPE algorithms identify human body joints by capturing a key point graph of the human body. We adopted this strategy to realize the intelligent physical testing function in our first application software, i.e., the video stream captured by a camera was converted into a stream of key point graphs based on the traditional HPE algorithm. The size of each key point graph was normalized according to the distance from the nose to the hips. However, it was found that not all the captured key points were useful. Therefore, several sets of key points (wrists and shoulders, elbows and buttocks, left and right ankles, and left and right knees) were designated to reduce data redundancy and computational cost. Nevertheless, traditional HPE algorithms still have limitations when applied to practical physical testing. For example, in tests such as pull-ups or sit-ups that require equipment assistance, neglecting the key points of the auxiliary equipment may lead to misjudgment or to undetected cheating: if an examinee just stands on the ground and imitates the test actions, it is hard to discern the cheating by vision-based technology.
Therefore, comprehensive utilization of the key points of both the human body and the equipment is a promising strategy to improve the performance of HPE. Inspired by this idea, we design a lightweight deep learning network and a mobile application to estimate and analyze human poses in a pull-ups test. The main contributions are as follows:
(1) We establish a benchmark dataset containing 2,000 images for the pull-ups test and develop semiautomatic annotation software for labeling the key points of the human body and equipment.
(2) A novel deep learning network named PEPoseNet is designed to jointly estimate the key points of the human body and equipment. The network adopts depth-wise separable convolution (double encoder-decoder) to improve estimation accuracy and speeds up training by pretraining and by freezing the gradient backpropagation of the heatmap branch.
(3) An action quality assessment (AQA) algorithm is designed that operates on the key points of the human body and equipment estimated by PEPoseNet, and an intelligent pull-ups test mobile application is developed. The application can assess in real time the five states in one pull-ups test cycle, i.e., ready or end, hang, pull, achieved, and resume, and rate each cycle and the total movement.
The remainder of the paper is organized as follows. Related work on HPE in sports is reviewed in Section 2. The proposed model, i.e., PEPoseNet, and the related dataset are described in Section 3. Section 4 presents the experiments and results, as well as comparisons with state-of-the-art methods. Concluding remarks are drawn in Section 5.

The Related Work
Human pose estimation (HPE) refers to determining human body posture by processing and analyzing images or videos. Currently, HPE has wide applications in many fields such as virtual reality (VR) [7], human health [8,9], motion capture systems [10], and human-computer interaction (HCI) [11]. Originally, traditional machine learning methods were employed to estimate human posture. For example, Eichner et al. applied conditional random fields (CRF) to learn potential relationships between the appearance of different body parts and annotated images [12]. Shakhnarovich et al. utilized a parameter-sensitive hashing function to estimate the joints of the human body. However, since the first popular CNN model, i.e., AlexNet, emerged, deep learning methods have developed rapidly in HPE, owing to their powerful self-learning ability and remarkable performance [13,14]. For instance, Newell et al. proposed the classical stacked hourglass network architecture, which inspired many subsequent works [15]. Cao et al. proposed a convolutional pose machine to find the position of each joint and adopted a part affinity field to assemble the joints [16]. Bazarevsky et al. proposed BlazePose, which estimates human pose at a high frame rate (FPS) [17].
Building on HPE research, some studies have begun to pay attention to models of action quality assessment (AQA) [18]. The AQA task aims to design a system that can automatically and objectively evaluate specific human actions from videos or images. AQA is currently being developed in many practical application scenarios, such as surgical skill rating, medical rehabilitation testing, athlete posture correction, coaching systems, operation compliance analysis, and dangerous action monitoring. The evaluation module can be classified into three types, i.e., regression scoring, grading, and pairwise sorting. In this study, following human pose estimation, we adopt grading to assess the action quality of the five states in one pull-ups test cycle.
In the sports field, many actions differ from daily movements. It is usually difficult to track complicated movements and high-speed actions, such as the explosive actions in fencing and the challenging body postures in yoga. In addition, the occlusion and interference of sports equipment make it harder to localize the targets. Many efforts have been made to improve the performance of HPE and AQA in sports. Zecha et al. proposed a method of posture correction for underwater training [19]. Neher et al. improved a stacked hourglass network to predict the pose of hockey players and their sticks at the same time [20]. Trejo and Yuan developed an interactive system that adopted Adaboost to recognize several yoga postures and provided users with posture correction [21]. Promrit and Waijanya proposed video posture embedding by adopting the triplet-loss technique and applied one-shot learning to detect a badminton player's posture [22]. Suda et al. presented a method that predicts the ball trajectory of a volleyball toss 0.3 s before the actual toss by observing the motion of the setter [23]. Xu et al. proposed self-attentive LSTM and multiscale convolutional skip LSTM to predict the total element score (TES) and total program component score (PCS) in figure skating [24]. Xiang et al. [25] divided the diving process into four stages (beginning, jumping, dropping, and entering the water) and adopted four independent P3D models [26] to complete feature extraction.
For the pull-ups test, a horizontal bar is needed, just as equipment is needed to fix one's feet in sit-ups. However, the influence of auxiliary equipment is often ignored in posture estimation. Therefore, detecting and localizing the key points of the equipment may provide complementary information for HPE and AQA. In this work, we train a deep learning network on the key points of both the human body and the equipment. Then, a grading assessment of action quality is carried out using a random forest classifier. The well-trained network is ported to embedded platforms to confirm its practicality. The intelligent pull-ups test can be carried out with satisfactory performance.

The Proposed Method
The workflow for realizing the intelligent pull-ups test based on PEPoseNet is shown in Figure 1. There are three modules, marked with dotted boxes: the dataset module, the PEPoseNet module, and the assessment module. The dataset module handles data collection and labeling. The PEPoseNet module trains and tests samples with a lightweight network architecture. The assessment module is responsible for action quality assessment of the pull-ups test.

Data Collection and Annotation.
In the field of sport and physical exercise (SPE), there exist several popular datasets, e.g., Leeds Sports Pose (LSP) [27], Frames Labeled In Cinema (FLIC) [28], and Penn Action [29]. However, to the best of our knowledge, no public pull-ups dataset is available so far. Therefore, we established our own pull-ups dataset for research. The dataset is named SDUST-PUT and includes 263 images extracted from online videos and 1,737 images taken of volunteers. Each image must contain both a subject and a horizontal bar. Different postures were considered, e.g., standing under the horizontal bar, preparing to jump, and various stages of doing pull-ups. Figure 2 shows four example images of different states in SDUST-PUT. Figures 2(a) and 2(b) are taken from volunteers and represent ready or end and hang, respectively, while Figures 2(c) and 2(d) are extracted from online videos and represent pull or resume and achieved, respectively. To annotate the images efficiently, we developed a piece of software that runs on Windows or Mac systems and is based on the Flutter framework [30]. The annotator can label human joints and equipment key points in a semiautomatic manner. First, the OpenPifPaf algorithm [31] is employed to label human joints automatically. Then, the user only needs to annotate a small number of equipment key points and correct the few points that OpenPifPaf annotated inaccurately. The developed software greatly improves annotation efficiency. The annotation results are saved in .json format. To avoid inaccurate annotation for compressed images, the annotator records distance ratios instead of absolute distances.
The distance ratio between the key point and the left border is recorded as the horizontal coordinate, and the distance ratio between the key point and the upper border is recorded as the vertical coordinate. The annotator is publicly available for download at https://github.com/PEPoseNet/PEPoseNet.
The annotator can also be used to label similar datasets related to human posture estimation. Figure 3(a) shows the running interface of the annotator, while Figures 3(b) and 3(c) show two example labeling results.
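The distance-ratio coordinates described above can be illustrated with a minimal sketch. It assumes that the horizontal ratio is the distance from the left border divided by the image width and the vertical ratio is the distance from the upper border divided by the image height; the function names are hypothetical and are not taken from the annotator's source code.

```python
def to_ratio(x_px, y_px, width, height):
    """Convert pixel coordinates to border-distance ratios (assumed: divide by image size)."""
    return x_px / width, y_px / height


def to_pixels(x_ratio, y_ratio, width, height):
    """Map stored ratios back onto an image of any size."""
    return x_ratio * width, y_ratio * height


# A key point labeled at (480, 270) on a 1920 x 1080 image...
ratio = to_ratio(480, 270, 1920, 1080)
# ...keeps the same relative position after the image is resized to 512 x 512.
resized = to_pixels(*ratio, 512, 512)
```

Because only ratios are stored, the same label remains valid when an image is compressed or resized before training.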

The Proposed PEPoseNet.
Figure 4 illustrates the overall architecture of the proposed PEPoseNet, along with the structure of its different blocks. As shown in Figure 4(a), the backbone is inspired by Google's BlazePose [17]. It consists of two hierarchical networks, i.e., the heatmap network and the key point network. The backbone adopts three types of convolution layer structure, i.e., Block 1, Block 2, and Block 3, as illustrated in Figures 4(b)-4(d), respectively. The three blocks are filled with different colors so that they are easy to tell apart. Block 1 combines simple depth-wise separable convolution and regular convolution, as shown in Figure 4(b). In addition to similar depth-wise separable and regular convolution layers, Block 2 in Figure 4(c) adopts a max-pooling layer to hierarchically reduce the image scale, whereas Block 3 in Figure 4(d) adopts an upsampling layer to hierarchically increase the image scale. These blocks are designed so that the network can run on mobile platforms or embedded devices.
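To make concrete why depth-wise separable convolution suits mobile and embedded devices, the following sketch compares weight counts for a regular convolution and a depth-wise separable one. The kernel size and channel numbers (3, 48, 96) are illustrative assumptions and are not taken from the PEPoseNet blocks.

```python
def regular_conv_params(k, c_in, c_out):
    """Weights in a k x k regular convolution (bias omitted)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 48 input channels, 96 output channels.
regular = regular_conv_params(3, 48, 96)
separable = separable_conv_params(3, 48, 96)
reduction = regular / separable
```

For this hypothetical layer, the separable form needs roughly one eighth of the weights of the regular form, which is the kind of saving that lightweight backbones rely on.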

The Heatmap Network and the Key Point Network.
The basic structure of the heatmap network is illustrated on the right side of Figure 4(a) and is similar to the stacked hourglass networks proposed by Newell et al. [15]. The encoder receives the original image with a size of 512 × 512. A series of depth-wise separable convolutions, each followed by a max-pooling layer, are carried out in sequence. Along the way, the number of channels increases step by step to extract latent information at different scales. The encoder then outputs 8 × 8 × 288 heatmaps. Between the encoder and the decoder, a residual structure [35] is adopted to reduce information loss. The decoder adopts multilayer upsampling to progressively increase the size of the heatmap. Depth-wise separable convolution is used to further decode information at different scales.
In the training procedure, capturing the key points of the original images effectively is an important issue. Here, we employ a 2D Gaussian kernel in the loss function of the heatmap network to locate the rough centers of the key points as closely as possible. The loss function of the heatmap network is defined over the following quantities. P and E represent the predicted heatmaps of the human joints and the equipment key points, respectively, and P* and E* represent the corresponding ground truths. M_p and M_E represent the masks of the human joints and the equipment key points, respectively, which are used to assign different weights to positive and negative samples: the weight of positive samples is 1, while that of negative samples is 0.1. The symbol ⊙ denotes the element-wise product. C_p and C_E represent the training weights of the human joints and the equipment key points, respectively. As shown in Figure 5, the center of each heatmap is located from the label data (the coordinates of the key point), and a heatmap is formed by setting pixels nearer the center to values closer to 1, while pixels far from the center are set to 0. In this way, each original image produces 15 heatmaps of size 128 × 128, corresponding to the 15 key points.
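The ground-truth heatmaps and the masked, weighted loss described above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian width `sigma`, the threshold that separates positive from negative pixels, and the squared-error form of the penalty are all assumptions, since the text specifies only the masks (weights 1 and 0.1), the element-wise product, and the weights C_p and C_E.

```python
import math


def gaussian_heatmap(cx, cy, size=128, sigma=2.0):
    """Ground-truth heatmap: 1 at (cx, cy), decaying towards 0 away from it."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def weight_mask(target, threshold=0.01, pos_weight=1.0, neg_weight=0.1):
    """Weight 1 for positive pixels and 0.1 for negative ones (threshold is assumed)."""
    return [[pos_weight if v > threshold else neg_weight for v in row]
            for row in target]


def masked_loss(pred, target, mask):
    """Mean of the squared, mask-weighted error (squared error is an assumption)."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += (m * (p - t)) ** 2
            count += 1
    return total / count


def heatmap_loss(P, P_star, E, E_star, c_p=1.0, c_e=1.0):
    """Weighted sum of the human-joint term and the equipment term."""
    loss_p = sum(masked_loss(p, t, weight_mask(t)) for p, t in zip(P, P_star))
    loss_e = sum(masked_loss(e, t, weight_mask(t)) for e, t in zip(E, E_star))
    return c_p * loss_p + c_e * loss_e


# Small 32 x 32 example with one joint heatmap and no equipment heatmaps.
target = gaussian_heatmap(10, 12, size=32)
blank = [[0.0] * 32 for _ in range(32)]
perfect = heatmap_loss([target], [target], [], [])
missed = heatmap_loss([blank], [target], [], [])
```

A perfect prediction gives zero loss, while an all-zero prediction is penalized mainly around the key point center, where the mask weight is 1.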
The heatmap network outputs abundant key point information that is fed to the key point network for accurate key point localization. From Figure 4(a), we can see that the front layers of the heatmap network are connected to the decoder of the key point network. Two specific modifications are made in the construction of the key point network. First, we exploit the intermediate data of the heatmap decoder instead of its final output. It is observed that a large amount of effective information exists in the intermediate data rather than in the final convolution layer. Second, only forward propagation is retained in the training procedure of the key point network (denoted by dotted arrows in Figure 4(a)); the gradient backpropagation between the key point decoder and the output heatmaps is frozen. This modification effectively prevents the training of the key point decoder from affecting the generation of heatmaps. The loss function of the key point network consists of two parts, i.e., a classification loss and a regression loss. Here, l_HMN represents the classification loss, which is the same as the loss function of the heatmap network, and l_REG represents the regression loss, which measures the position difference between the predicted point and the label point. λ is a constant for balancing the two kinds of losses and is set to 0.05 in this work. In the regression loss, O and O* represent the predicted positions and the corresponding labels of the key points, respectively, and G is the set of label point types. Z represents the area of the human body, which is estimated from Δx and Δy, the maximum distances between the horizontal and vertical coordinates in the real label, respectively. Smooth L1 is a thresholding function.

Figure 6 illustrates a demo result of the heatmap network and the key point network. In Figure 6(a), the original image and the 15 predicted heatmaps are presented. It can be seen that the centers of the heatmaps are very close to the real positions of the human body joints and the key points of the horizontal bar. In Figure 6(b), the refined key points output by the key point network are shown. The two key points of the horizontal bar clearly provide a good positioning reference for the pull-ups test.
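The two-part loss of the key point network can be sketched as below. Several details are assumptions rather than the paper's exact formulas: Smooth L1 is taken in its standard form with a threshold of 1, Z is taken as the product Δx × Δy, the summed position error is divided by Z, and λ is applied to the regression term.

```python
def smooth_l1(x):
    """Standard Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def body_area(labels):
    """Z: product of the largest horizontal and vertical label extents (assumed form)."""
    xs = [x for x, _ in labels]
    ys = [y for _, y in labels]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def regression_loss(pred, labels):
    """Smooth L1 position error summed over key points, scaled by the body area Z."""
    z = body_area(labels)
    total = 0.0
    for (px, py), (lx, ly) in zip(pred, labels):
        total += smooth_l1(px - lx) + smooth_l1(py - ly)
    return total / z


def keypoint_loss(l_hmn, l_reg, lam=0.05):
    """Combine classification and regression losses with the balance constant lambda."""
    return l_hmn + lam * l_reg


# Three hypothetical key points: labels and slightly shifted predictions.
labels = [(10.0, 20.0), (30.0, 60.0), (20.0, 40.0)]
pred = [(10.5, 20.0), (30.0, 63.0), (20.0, 40.0)]
l_reg = regression_loss(pred, labels)
```

Scaling by the body area keeps the regression term comparable between subjects who appear large and small in the image.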

The Pretrained Network.
When training a deep learning network, overfitting occurs if the amount of data is small. There are two solutions to this problem, i.e., data augmentation and transfer learning based on a pretrained network. In this work, we adopt the pretraining scheme, as the LSP [27] and FLIC [28] datasets are suitable for pretraining the network. Since the LSP and FLIC datasets do not contain a horizontal bar, only the images whose labels are consistent with those of SDUST-PUT are selected for pretraining the deep network. In the pretraining procedure, we first train the heatmap network. Then, the key point network is trained with the parameters of the heatmap network fixed. The pretraining is satisfactory: the percentage of correct key points (PCK@0.2) reaches 85.1 after 200 epochs. Therefore, the parameters obtained by pretraining on the LSP and FLIC datasets are adopted as initialization parameters when training the [...]

[...] pose estimation. However, there are still no reports on automatic action quality assessment for the pull-ups test. In this study, we present a complete intelligent evaluation scheme for the pull-ups test. First, we divide the movements in one pull-ups cycle into five states, i.e., ready or end, hang, pull, achieved, and resume, as listed in Table 1. The division was proposed by experienced teachers who have worked in physical education and pull-ups testing for more than 20 years.
Figure 7 illustrates the five states and their sequence. The ready or end state represents the start and the stop of a set of pull-ups. The hang state means that the body hangs from the horizontal bar (fully extended arms are required). The pull state means lifting the body with the arms. The achieved state (listed as Attained in Table 1) means keeping the head above the horizontal bar. The resume state means relaxing the arms and returning to the hang state. Throughout one pull-ups cycle, no obvious bending or swinging of the body or legs is allowed.

Mobile Information Systems
Second, we design a grading assessment solution for each state in one pull-ups cycle. As shown in Figure 8, the standard and nonstandard actions of each state are illustrated. To grade automatically, we adopt a random forest classifier in the assessment module. Twenty-one videos (containing 8,718 frames) were collected from volunteers for training the classifier. To ensure the robustness of action evaluation, the distances and angles of all key points in the (n − 4)th, (n − 2)th, nth, (n + 2)th, and (n + 4)th frames are considered, as shown in Figure 9. The angle is the one between the horizontal direction and the line connecting two key points. The other four frames are included to obtain more distinctive features in the time dimension. In addition, several angles between lines that change noticeably were also selected as features. For each frame, there are 2,270 features that can be used for the assessment. The coordinate values of each set of key points are divided by the distance between the two hip joints to normalize the data. The spatial-temporal features of the key points of the human body and equipment, output by PEPoseNet, are fed to the classifier to obtain the state of the nth frame.
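The spatial-temporal features described above can be sketched as follows. This is a simplified illustration, not the paper's exact feature set: it computes only the pairwise distances and the angles to the horizontal for the five frames, so it does not reproduce the 2,270-feature vector, and the hip indices are hypothetical.

```python
import math
from itertools import combinations

FRAME_OFFSETS = (-4, -2, 0, 2, 4)


def normalize(points, left_hip, right_hip):
    """Divide all coordinates by the hip-to-hip distance (must be nonzero)."""
    (lx, ly), (rx, ry) = points[left_hip], points[right_hip]
    scale = math.hypot(lx - rx, ly - ry)
    return [(x / scale, y / scale) for x, y in points]


def pair_features(points):
    """Distance and angle to the horizontal for every pair of key points."""
    feats = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        feats.append(math.hypot(x2 - x1, y2 - y1))
        feats.append(math.atan2(y2 - y1, x2 - x1))
    return feats


def window_features(frames, n, left_hip=0, right_hip=1):
    """Concatenate pair features from frames n-4, n-2, n, n+2 and n+4."""
    feats = []
    for off in FRAME_OFFSETS:
        pts = normalize(frames[n + off], left_hip, right_hip)
        feats.extend(pair_features(pts))
    return feats


# Nine identical toy frames with three key points each (indices 0 and 1 act as hips).
frames = [[(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)] for _ in range(9)]
features = window_features(frames, 4)
```

A vector like `features`, built for every frame n, is the kind of input that a random forest classifier would receive to predict the state of that frame.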
In practical application, the state stream coming out of the random forest is filtered by a mode filter. Then, the software counts the number of pull-ups from the cycle of states and grades each cycle using action evaluation. Figure 10 illustrates the automatic scoring scheme of a practical pull-ups test. We assume that a complete pull-ups test has N cycles and that each cycle has M frames. The total score of the test is then calculated as follows: (1) Calculating the score of each cycle. For each frame in one cycle, if the action is standard, the grading value 1 is assigned; otherwise, the grading value 0.5 is assigned. The score of each cycle is obtained by summing the grading values of its frames and dividing by the number of frames (M). (2) Calculating the total score of one test. The scores of all cycles are summed directly to obtain the total score.
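The mode filter and the scoring scheme described above can be sketched as follows. The window size of 5 frames is an assumption; the text does not state the filter length.

```python
from collections import Counter


def mode_filter(states, window=5):
    """Replace each state with the most common state in a centered window."""
    half = window // 2
    out = []
    for i in range(len(states)):
        lo, hi = max(0, i - half), min(len(states), i + half + 1)
        out.append(Counter(states[lo:hi]).most_common(1)[0][0])
    return out


def cycle_score(frame_is_standard):
    """Average of per-frame grades: 1 for standard, 0.5 for nonstandard."""
    grades = [1.0 if ok else 0.5 for ok in frame_is_standard]
    return sum(grades) / len(grades)


def total_score(cycles):
    """Sum of the cycle scores over the whole test."""
    return sum(cycle_score(c) for c in cycles)


# A single misclassified frame inside a run of "hang" frames is smoothed away.
smoothed = mode_filter(["hang", "hang", "pull", "hang", "hang"])
# Two cycles: one fully standard, one with half of its frames nonstandard.
score = total_score([[True, True], [True, False]])
```

Under this scheme a fully standard cycle contributes 1 to the total, so the total score of a test with N cycles lies between 0.5 N and N.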

$$\mathrm{PCK@}T = \frac{\sum_{i} \delta\left(d_i / d \le T\right)}{\sum_{i} 1},$$
where i indexes the joint points, d_i represents the Euclidean distance between the ith predicted point and its ground truth, d represents the normalization scale factor, and δ(·) equals 1 when the condition holds and 0 otherwise. In this work, we adopt the Euclidean distance between the left shoulder and the right hip as d. T denotes the threshold value (T = 0.2 in this work).
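A minimal sketch of the PCK computation follows, assuming the common form in which a key point counts as correct when its normalized error d_i / d does not exceed the threshold T. The coordinates and the scale value are made-up numbers for illustration.

```python
import math


def pck(pred, truth, scale, threshold=0.2):
    """Fraction of key points whose normalized error is within the threshold."""
    hits = 0
    for (px, py), (tx, ty) in zip(pred, truth):
        d_i = math.hypot(px - tx, py - ty)
        if d_i / scale <= threshold:
            hits += 1
    return hits / len(truth)


truth = [(100.0, 100.0), (150.0, 100.0), (120.0, 200.0), (160.0, 210.0)]
pred = [(103.0, 104.0), (150.0, 130.0), (121.0, 200.0), (160.0, 210.0)]
scale = 100.0  # hypothetical left-shoulder-to-right-hip distance in pixels
pck_value = pck(pred, truth, scale)
```

In this toy example three of the four points fall within 0.2 of the scale, so `pck_value` is 0.75, which corresponds to 75 when reported as a percentage.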
To verify the practicability of the model and software proposed in this study, 21 pull-ups videos collected from volunteers were used to test the PEPoseNet and the traditional HPE algorithms. The key points are extracted separately by the PEPoseNet and by the traditional HPE algorithms. Then, action quality assessment is carried out with the random forest classifier.
In the classification metrics, the subscript s stands for a state or an action. TP_s represents the number of frames correctly classified as s, and TN_s represents the number of frames correctly classified as non-s. FP_s represents the number of frames wrongly classified as s, and FN_s represents the number of frames wrongly classified as non-s.
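The four counts defined above can be computed per state as in the sketch below. The text refers to four quantitative metrics without naming them here, so treating them as accuracy, precision, recall, and F1 is an assumption; the state sequences are made-up examples.

```python
def confusion_counts(y_true, y_pred, s):
    """Count TP, TN, FP and FN frames for state s against all other states."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == s and t == s:
            tp += 1
        elif p != s and t != s:
            tn += 1
        elif p == s:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the four counts (assumed metric set)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


y_true = ["hang", "hang", "pull", "pull", "hang", "resume"]
y_pred = ["hang", "pull", "pull", "pull", "hang", "hang"]
counts = confusion_counts(y_true, y_pred, "hang")
accuracy, precision, recall, f1 = metrics(*counts)
```

Computing the counts one state at a time in this one-versus-rest fashion yields a separate row of metrics for each state or action.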

Implementation Details.
The PEPoseNet was trained with the TensorFlow 2.0 Python library [36]. The input color images were resized to 512 × 512 × 3. The output was the coordinates of the 15 key points. The Adam optimizer [37] was employed to speed up training. The learning rate was 0.001. The initial weights were taken from the model pretrained on the LSP and FLIC datasets. Training ran for 200 epochs on an NVIDIA Tesla P100 GPU (16 GB). Considering the limited computing power of embedded and mobile platforms, we also tested the performance on an AMD Ryzen 7 3700X CPU without a GPU.
To the best of our knowledge, there are no deep networks for the pull-ups test to compare against. Thus, we compare with two recent models, OpenPifPaf [31] and MediaPipe [38] (optimized by BlazePose), as they also perform human posture estimation. The model parameters are those provided on their official websites.

Ablation Experiment Schemes.
We design four ablation experiments to evaluate the effects of the key modules of the proposed method. The PEPoseNet-A architecture is designed to evaluate the effect of the heatmap network, i.e., whether to feed the heatmap output directly to the key point network or to feed the intermediate-layer information of the heatmap network instead. The PEPoseNet-B architecture is designed to evaluate the stability of the output of the heatmap network. In the baseline architecture of PEPoseNet, the heatmap network is trained independently. Then, the key point network is trained [...]

Tables 4 and 5 list the performance of state classification and action classification, respectively, in the action quality assessment conducted with the PEPoseNet and the MediaPipe. The four quantitative metrics of the PEPoseNet were clearly superior to those of the MediaPipe. This reflects the effectiveness of introducing the key points of the equipment. The key points extracted from the horizontal bar provide reference localization information that is crucial for determining the movement states more accurately and robustly. For example, it is hard to identify the ready or end state from the key points of the human body alone. If the relative position of the examinee and the horizontal bar is taken into consideration, it is easy to make the correct determination by judging whether his or her hands hold the bar.
After successfully training and testing the PEPoseNet, we ported the model to the Android and iOS mobile platforms. TensorFlow Lite (TFLite) and the cross-platform Flutter framework are adopted. The developed mobile app can perform the intelligent pull-ups test with a friendly interface and an efficient implementation. Figure 11 illustrates the app interface in a practical pull-ups test. The application was tested on more than 100 volunteer students. The results indicate that the application is suitable for practical pull-ups tests, with satisfactory accuracy and robustness. It provides grading assessment and counting of pull-ups, which helps to prevent cheating and false scores.

Conclusions
In this work, we proposed a novel deep learning model named PEPoseNet for the intelligent pull-ups test based on key point estimation of the human body and the horizontal bar. A self-produced pull-ups dataset (SDUST-PUT) containing 2,000 color images collected from volunteers and the Internet was established. The data were normalized and annotated semiautomatically. The lightweight deep network adopted a backbone containing the heatmap network and the key point network. Depth-wise separable convolution was adopted to speed up training and convergence. A grading assessment standard for the 5 states in one pull-ups cycle was defined and implemented in the framework. A simple automatic grading scheme was designed. A robust and friendly mobile application was developed for practical pull-ups tests. Validation, comparison, and ablation experiments were carried out to evaluate the proposed model and software. The experimental results demonstrated that the proposed PEPoseNet and the mobile application can improve the efficiency, practicability, and fairness of the pull-ups test. In future work, we will continue to expand the dataset, investigate more efficient schemes to speed up the deep network, and explore more elaborate scoring schemes. Furthermore, the extension of the network and software to other sports will be explored and realized.

Figure 2: Examples of pull-ups images at different states in SDUST-PUT. (a) Ready or end and (b) hang are taken from volunteers; (c) pull/resume and (d) achieved are extracted from online videos.

Figure 5: The mechanism of producing the ground truth of the heatmaps by employing a 2D Gaussian kernel.

Figure 6: Demo heatmaps and key points extracted from one subject using the heatmap network and the key point network, respectively. (a) The original image and 15 predicted heatmap nodes and (b) the extracted key points of human joints and horizontal bar.

Figure 7: Illustration of the five states in one pull-ups cycle along with the sequence order.
[...] based on the trained heatmaps, with the heatmap channel frozen. The PEPoseNet-B architecture removes the freezing of the heatmap network and allows it to be adjusted during the training of the key point network. The PEPoseNet-C architecture is designed to evaluate the role of pretraining. As the ground truth of OpenPifPaf or MediaPipe is not consistent with the final output of the PEPoseNet, it is uncertain whether pretraining brings positive or negative effects. Therefore, the PEPoseNet-C architecture is trained only on the SDUST-PUT dataset without pretraining. The PEPoseNet-D architecture is designed to evaluate the effect of the two key points of the bar. That is, in the baseline PEPoseNet, the two key points of the bar are considered and take part in the decisions of the subsequent AQA algorithm, while the PEPoseNet-D architecture removes these two key points. Thus, the role of the two key points of the bar can be determined by comparing the results of the PEPoseNet-D and the baseline PEPoseNet.

4.4. Experiment Results

4.4.1. The Performance of the Baseline PEPoseNet.

Figure 10: The automatic scoring scheme of a practical pull-ups test. For each cycle, a grading value is assigned to each frame according to whether the action is standard: 1 for a standard action and 0.5 for a nonstandard action. The total score is obtained by summing the scores of all cycles.
[...] model into mobile or embedded devices. In addition, the PEPoseNet has the advantage of being able to locate the key points of the equipment. The PCK and FPS of the PEPoseNet-A decreased noticeably compared with the baseline model. This indicates that a large amount of feature data is lost in the heatmap, so using the heatmap directly is not conducive to key point localization. The PCK reduction of the PEPoseNet-B indicates that freezing the backpropagation routes is effective: the heatmap network is freed from the interference of the key point network. The results of the PEPoseNet-C demonstrated the effectiveness of a pretrained model based on the common HPE datasets. Tables [...]

Table 1: The five states in one pull-ups cycle and their assessment criteria.
State          Assessment criterion
Ready or end   Stand below the horizontal bar
Hang           Body hanging from the horizontal bar; arms are required to stretch completely
Pull           Lift one's body with his/her arms
Attained       Keep the head above the horizontal bar
Resume         Relax one's arms and return to the hang state

Table 2 lists the estimation accuracy of the 15 key points. The right bar and left bar refer to the two sides of the horizontal bar in the pull-ups test. The other 13 key points refer to the critical positions of the human body for posture estimation. It can be seen that all the PCK values are larger than 80. This reflects that the key points can be captured accurately and efficiently by introducing the cascaded operations of the heatmap and key point networks.

Table 2: The estimation accuracy of 15 key points using the PEPoseNet on SDUST-PUT.

Table 3 lists the comparison results in terms of PCK and frames per second (FPS). The PCK of OpenPifPaf reached 88.7, but its FPS was only 0.4. This means that the computational cost of OpenPifPaf is high, which limits its porting to mobile or embedded devices. The PCK of MediaPipe was 84.2, slightly higher than the 83.8 of the PEPoseNet. However, its FPS of 27 was lower than the 32 of the PEPoseNet. The slightly higher PCK of MediaPipe may come from the larger training set and the optimization tricks supported by Google engineers. In contrast, the PEPoseNet achieved the best FPS owing to depth-wise separable convolution. In practice, a high FPS is the most attractive characteristic for transplanting the [...]

Table 3: Average PCK and FPS of different models.

Table 4: Comparison of state classification conducted by PEPoseNet and MediaPipe in pull-ups test.
The values in bold represent the best result for each metric.

Table 5: Comparison of action classification conducted by PEPoseNet and MediaPipe in pull-ups test.