The Fusion Application of Deep Learning Biological Image Visualization Technology and Human-Computer Interaction Intelligent Robot in Dance Movements

The paper aims to apply deep learning-based image visualization technology to extract, recognize, and analyze human skeleton movements and to evaluate the effect of the deep learning-based human-computer interaction (HCI) system. Dance education is taken as the application scenario. Firstly, the Visual Geometry Group Network (VGGNet), a Convolutional Neural Network (CNN), is optimized. Then, the optimized VGGNet extracts human skeleton movements based on the OpenPose open-source library. Secondly, the Long Short-Term Memory (LSTM) network is optimized and used to recognize human skeleton movements. Finally, an HCI system for dance education is designed based on the extraction and recognition methods of human skeleton movements. Results demonstrate that the highest extraction accuracy is 96%, and the average recognition accuracy of different dance movements is stable. The effectiveness of the proposed model is verified. The recognition accuracy of the optimized F-Multiple LSTMs is increased to 88.9%, which is suitable for recognizing human skeleton movements. The interaction accuracy of the dance education HCI system built with deep learning-based visualization technology reaches 92%; the overall response time is distributed between 5.1 s and 5.9 s. Hence, the proposed model has excellent instantaneity. Therefore, deep learning-based image visualization technology has enormous potential in human movement recognition, and combining deep learning and HCI plays a significant role.


Introduction
Modern technologies, such as the Internet and multimedia technology, have developed rapidly. Multimedia systems based on computer information technology have been applied in many fields. Intelligent interactive multimedia is a new platform built on the foundation of computer technology [1,2]. However, the applications of traditional multimedia systems are often independent and mechanized and are inadequate to meet people's needs. Consequently, human-computer interaction (HCI) technology emerged. Through HCI, people can interact with the multimedia engine and obtain the required media information quickly and efficiently. Besides, HCI technology can promote the accurate transmission of information and improve work efficiency [3][4][5], which has triggered a research boom. In daily life, people can directly express their thoughts or emotions through movements. Therefore, movement recognition and analysis have become a critical direction in the field of HCI and have attracted widespread attention, leading to the wide popularity of human movement-based recognition technology [6,7].
With the advancement of social informatization, people increasingly demand a higher level of intelligence from computers. HCI no longer depends only on the original hardware-based interaction, and more intelligent interaction methods are gradually appearing in everyday life. The face recognition, gesture recognition, and speech recognition systems constructed with machine learning technology have established a bridge between humans and computers [8]. The emergence of these convenient interaction modes has become a major development trend in the field of HCI. The development of HCI modes aims to enable the computer to serve and adapt to human needs, so HCI focuses on humans rather than requiring humans to adapt to the computer. Therefore, friendly interaction between robots and humans is vital in the research of machine learning and HCI. When exploring people-centered interaction systems, some scholars focus on the importance of emotional factors in the interaction between people and computer systems [9]. Motion recognition technology is essentially a classification problem close to machine learning [10]. The above research results imply that the development of the Internet and multimedia technology has allowed multimedia systems to be applied successfully in many fields and that friendly interaction between robots and human beings plays an important role in the study of machine learning and HCI. Deep learning shows excellent application potential in feature extraction and HCI. A combination of deep learning and HCI is therefore proposed to extract and identify human skeleton movements and to expand the application field of HCI. The ultimate research purpose is to significantly reduce time costs and dependence on traditional equipment and facilities. The proposed ideas can also improve human-computer collaboration and interaction.
Moreover, by combining deep learning-based image visualization technology with the HCI system, it is envisaged that the Visual Geometry Group Network (VGGNet) and long short-term memory (LSTM) can be optimized. The final HCI system and the research results on the recognition and analysis of human dance provide a reference. The contributions based on the extraction and recognition of human dance movements are as follows: (1) An optimized VGGNet human skeleton movement extraction algorithm is proposed. Its extraction accuracy reaches 96%, which is significantly better than traditional algorithms. (2) An optimized multiple LSTM human skeleton movement recognition algorithm is proposed. Its recognition accuracy reaches 88.9%, which is significantly better than traditional LSTMs. (3) An HCI system based on image visualization is designed, and the interaction accuracy rate reaches 92%. (4) A reference is provided for more in-depth human movement extraction and recognition, and the application range of deep learning methods in HCI systems is expanded. The "Terpsichore" project, funded by the Horizon 2020 programme of the European Union, proposed a high-level method based on the digitization of cultural assets [13]. Doulamis et al. discussed the digitization of tangible and intangible cultural heritage and proposed that 3D digital assets would develop into a part of augmented, virtual, and mixed reality experience [14]. Lv studied the application of virtual reality (VR) in 3D environments and HCI systems and revealed the excellent performance of VR technology in 3D digitization [15]. The digitization of intangible cultural heritage has become an inevitable development trend, and so has that of dance.

Literature Review
On the recognition and extraction of dance movements, Rallis et al. proposed a dance summarization method based on 3D capture data from the Vicon motion capture system. They analyzed and studied the automatic extraction of dance patterns. This method was a hierarchical scheme based on the temporal and spatial changes of dance characteristics [16]. Aiming at the preservation and dissemination of dance performance, Aristidou et al. proposed a dance action recognition framework based on Laban analysis, which used feature space to capture different dance action components and pointed out a new direction for dance evaluation [17]. In terms of editing and synthesis of dance movements, Aristidou et al. used Laban analysis, radial basis function regression, and interpolation methods to map the movement features and emotional features in two directions and realized the stylization of high dynamic dance movements [18]. To sum up, there is a gap between the research on human action recognition and that on HCI, and there is little research on action recognition in dance education.

Research Progress of HCI.
Experts and scholars have made great efforts on deep learning and HCI. Bhardwaj et al. applied support vector machine and artificial neural network classifiers to fingerprint recognition. By integrating the relevant dynamic information from hundreds of biometric scanning sample datasets, they found that fusing the deep learning method improved the accuracy of dynamic fingerprint recognition by 5.3% [19]. Israelsen and Ahmed analyzed the influence of artificial intelligence (AI) agents in HCI and machine learning based on research on algorithm-guaranteed AI agents and discussed the advantages and disadvantages of different methods [20]. Based on similarity embedding, Spathis et al. proposed an interactive dimension reduction framework (iSP). In this framework, user interaction formed different goals. Gradient descent was used for learning, and an end-to-end composition structure could be trained. By evaluating the framework in two interaction scenarios, they found that it could be applied to semisupervised learning, transfer learning, and adaptive learning in the interaction field [21]. Using interactive machine learning, Wu et al. studied local decision-making in feature selection for the emotion classification task and analyzed the influence of interactive machine learning tools on feature selection results [22]. To improve the performance of multimodal image retrieval by using unmarked and marked multimodal web objects, Xu et al. proposed a semisupervised multiconcept retrieval method based on deep learning (SMRDL). Different from the traditional method of using multiple independent concepts in multiconcept semantic query, the proposed method regarded multiple concepts as a whole scene, which was used for multiconcept scene learning of unimodal retrieval. The comprehensive experimental results on the two datasets MIR flickr2011 and NUS-WIDE indicated that the proposed method was superior to some of the latest methods [23].
Long and Zhao held that intelligent teaching mode overcame the shortcomings of traditional online and offline teaching. However, there were some shortcomings in the real-time feature extraction of teachers and students. In view of this, they used particle swarm image recognition and deep learning technology to process the video teaching image of intelligent classroom. To overcome the shortcomings of premature convergence of standard particle swarm optimization (PSO) algorithm, they proposed an improved multi PSO algorithm strategy. Moreover, to improve the premature problem of PSO in search performance, they combined the algorithm with the useful attributes of other algorithms to improve the diversity of particles in the algorithm, enhance the global search ability of particles, and achieve effective feature extraction [24]. To sum up, there are many research results on the application of deep learning in HCI, but few studies on the combination of the two for dance action extraction.

Methods
In computer vision and image processing, movement recognition is a crucial component. However, some problems are found in its research and applications. For example, when extracting and recognizing human skeleton movements, bone modeling is challenging, movement amplitude can affect the extraction results, and feature extraction can be insufficient, which increases the difficulty of analyzing and classifying human movements. Deep learning has developed rapidly. CNN shows excellent performance in feature extraction, while LSTM performs well on time sequence problems. Therefore, CNN and LSTM are introduced to extract and recognize human skeleton movements. However, traditional CNN models have many parameters and use large convolution kernels to extract features. Traditional LSTM models do not consider the connections among different movements over a long time span. Hence, the CNN-based VGGNet is introduced and optimized in parallel. In the meantime, LSTM is improved and optimized before extracting and recognizing human skeleton movements.

Optimization of VGGNet CNN Model.
The deep learning-based CNN is inspired by the theory of the cat's visual cortex. Compared with the traditional neural network, CNN extracts the object's local feature information through the convolution layer, a critical CNN component that contains multiple convolution kernels [25]. VGGNet is a typical CNN. Unlike traditional CNNs that employ large convolution kernels to extract features, VGGNet utilizes several small 3 × 3 convolution kernels for feature extraction. Hence, VGGNet can extract richer features and reduce the calculation amount significantly [26][27][28]. The features extracted by the convolution layers are fused to improve the accuracy of VGGNet, i.e., the parallel CNN [29][30][31].
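The saving from small kernels can be checked with simple arithmetic. The sketch below compares two stacked 3 × 3 convolutions with a single 5 × 5 convolution covering the same receptive field. The channel width of 64 and the helper names are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in one 2D convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of stride-1 convolutions stacked num_layers deep."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # hypothetical width, used for both input and output channels

two_small = 2 * conv_params(3, channels, channels)  # two stacked 3x3 layers
one_large = conv_params(5, channels, channels)      # one 5x5 layer

print(stacked_receptive_field(3, 2))  # 5, same coverage as one 5x5 kernel
print(two_small, one_large)           # 73728 102400
```

Under these assumptions, the stacked small kernels need 28% fewer weights for the same receptive field and add one extra nonlinearity between the two layers.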
The input image features extracted before fusion are given by (1) and (2), where $F_1^A$ and $F_1^B$ represent the features. The feature information extracted by the two small convolution kernels is fused via the feature fusion module. The convolution operation is denoted as $G$.
The feature map after fusion processing is given by (3), and the fusion of the feature maps $y_1^c$, $y_1^A$, and $y_1^B$ is expressed by (4). This fusion processing can enrich and diversify the extracted features. Graphics Processing Unit (GPU) processing is utilized for training VGGNet to compare the performance of the CNN-based VGGNet before and after optimization. Images in the training set are taken by the Kinect camera and the host computer program. The selected human movements include clapping, slapping, standing, picking up objects, and sitting down.
Movement capture includes the following steps: (1) the demonstrator makes different movements in front of the Kinect camera and (2) Kinect is utilized for evaluating human skeleton changes in real time. Several demonstrators complete the collection of the entire training set. One thousand images are collected for each movement. Finally, a total of 5,000 human skeleton images under different movements are obtained. The skeleton images affected by the environment are removed, and the remaining human skeleton images are retained. These images train the VGGNet before and after feature fusion. Accuracy and loss rates are taken as evaluation indicators [32,33]. Parameter settings of the entire training process are shown in Table 1.

Extraction Algorithm of Human Skeleton Movements.
Traditional human pose estimation algorithms extract human skeleton features in a top-down manner. Each skeleton extraction object requires a detector, and each movement is estimated separately. Therefore, traditional algorithms have many problems, such as false detection, long running time, and poor instantaneity, which cannot meet the demands. Based on the OpenPose open-source library [34], the optimized VGGNet serves as the network architecture, and histogram equalization [35,36] is introduced to suppress noise, thereby extracting the 2D features of the human skeleton.
OpenPose is an open-source library for skeleton extraction released in 2017. Unlike traditional pose estimation algorithms, OpenPose uses a bottom-up method. The joint points of all human body parts are detected first. Then, the nodes are connected to obtain the skeleton, which significantly reduces the running time while improving detection accuracy. Figure 1 illustrates the video information processing by OpenPose.
The unique convolution kernel structure in the CNN can learn spatial information in human actions, and different convolution kernels can obtain more useful information. Compared with traditional machine learning methods, CNN is more systematic and comprehensive in task learning and performs better. As a typical CNN model, VGGNet differs from traditional CNN models in that it extracts features with many small convolution kernels. It can extract more features and reduce the calculation amount with satisfactory generalization performance. The optimized VGGNet consists of three parts. The first part processes the image data via the input layer and employs CNN to extract the feature values of body parts. Then, the extracted feature values enter the other two parts for critical point positioning and body-based 2D vector field positioning. Passing from input to output through the neural network takes a total of $k$ periods, and the information input to the current period is the output feature value obtained through the learning process of period $k - 1$. The optimized VGGNet's output is formed by a 2D vector field of crucial body parts and a confidence map. As the calculations proceed, the candidate human body parts and the corresponding structure division become apparent via this cyclic process. Here, CNN's first convolutional layer is a double convolutional layer, and each contains 64 convolution kernels of size 4 × 4. Simultaneously, an activation layer and a normalization layer are added after each convolutional layer to process the nonlinear data. A pooling layer, located between the two convolutional layers, is added after the normalization layer to reduce dimensionality and prevent overfitting. The Dropout layer comes after the second pooling layer. The Part Affinity Fields (PAFs) [37,38] are adopted to predict all the human body key points in the images.
In summary, extracting human skeleton information includes the following two processes: first, adding the corresponding image data to the input layer of VGGNet and, second, learning the feature value $F$ according to the body parts. The 2D vector field output corresponding to the human body in the $k = 1$ period is given by (5) and (6), where $S$ represents the set of 2D position confidence maps, $\rho$ and $\phi$ denote the set parameters, $t$ refers to the period corresponding to the feature value, and $L$ signifies the set of 2D vector fields. The confidence in the confidence map is solved by (7) and (8), where $S$ represents the position confidence atlas and $p$ denotes the output image in the corresponding period. Meanwhile, $k$ refers to the number of people in the input image, $j$ stands for the body part's serial number, and $\sigma$ is a constant.
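The symbols defined above match the multi-stage formulation commonly used with OpenPose, so that formulation is sketched here for orientation. It is an assumption that the paper's own equations take this form. In the sketch, $p$ is read as a pixel location, $k$ as a person index, and $x_{j,k}$ (a name not defined in the text) as the annotated position of body part $j$ for person $k$.

```latex
\begin{align}
  S^{1} &= \rho^{1}(F), &
  L^{1} &= \phi^{1}(F), \\
  S^{t} &= \rho^{t}\!\left(F, S^{t-1}, L^{t-1}\right), &
  L^{t} &= \phi^{t}\!\left(F, S^{t-1}, L^{t-1}\right), \quad t \ge 2, \\
  S^{*}_{j,k}(p) &= \exp\!\left(-\frac{\lVert p - x_{j,k} \rVert_2^{2}}{\sigma^{2}}\right), &
  S^{*}_{j}(p) &= \max_{k} S^{*}_{j,k}(p).
\end{align}
```

In this reading, each period refines the confidence maps and vector fields of the previous one, and the maximum over people keeps nearby peaks distinct instead of averaging them away.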
The joint point position in the 2D vector field is judged according to (9) and (10), where $p$ represents the pixel of the prejudgment part and $v$ denotes the unit vector. On this basis, the average value of the 2D vector field is given by (11). The OpenPose open-source library can achieve excellent results of skeleton extraction. However, image noise limits feature extraction. Therefore, histogram equalization is introduced, which enhances the contrast and reduces the noise by stretching the distribution range of pixel intensity. Videos based on image visualization are processed by Compute Unified Device Architecture (CUDA) to ensure the instantaneity of information extraction. Eighteen key part points are chosen as the input of skeleton movement recognition.
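To make the contrast-stretching step concrete, the following is a minimal pure-Python sketch of histogram equalization on a small grayscale image. It is an illustration only, not the paper's implementation: the function name, the list-of-rows image format, and the 4 × 4 test image are assumptions.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    flat = [pixel for row in image for pixel in row]
    total = len(flat)

    # Count how often each intensity occurs.
    hist = [0] * levels
    for pixel in flat:
        hist[pixel] += 1

    # Cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(value for value in cdf if value > 0)
    if total == cdf_min:
        # Constant image: there is no intensity range to stretch.
        return [row[:] for row in image]

    # Map each intensity so that the cumulative distribution becomes linear.
    scale = (levels - 1) / (total - cdf_min)
    lookup = [max(0, round((value - cdf_min) * scale)) for value in cdf]
    return [[lookup[pixel] for pixel in row] for row in image]


# Hypothetical low-contrast patch: intensities only span 100 to 104.
low_contrast = [
    [100, 100, 101, 101],
    [101, 102, 102, 102],
    [102, 103, 103, 103],
    [103, 103, 104, 104],
]
equalized = equalize_histogram(low_contrast)
print(min(map(min, equalized)), max(map(max, equalized)))  # 0 255
```

The narrow input range is stretched to the full 0 to 255 range while the ordering of intensities is preserved, which is the contrast enhancement the text refers to.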

Skeleton Movement Recognition Based on Optimized LSTM.
Traditional neural networks have major limitations in practical application. For example, in time series processing, traditional methods perform well only on short time series. For processing independent data, CNN's good learning and understanding abilities enable it to be applied in practice. However, CNN has limitations in processing sequence problems with temporal correlation. LSTM is a special Recurrent Neural Network (RNN). LSTM can solve the long-term dependence problem in RNN applications thanks to its particular gate structure, namely, the input gate, the forget gate, and the output gate. The input data are calculated according to (12), where $w$ represents the weight, $b$ corresponds to the bias, and $h_{t-1}$ denotes the output value at time $t - 1$. Meanwhile, $x_t$ refers to the input value, $\sigma$ represents the activation function, and $f$ stands for the forget gate. Moreover, the memory information $c_t$ is given by (13), where $c_{t-1}$ is the memory information at time $t - 1$, which the network decides whether to keep, and $j_t$ means the input gate.
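The gate structure described above corresponds to the standard LSTM cell, so the usual formulation is sketched here in the paper's notation. This is an assumption about the general form rather than a transcription of the paper's equations. The candidate memory $\tilde{c}_t$, the hidden output $h_t$, and the gate-specific subscripts on $w$ and $b$ are names introduced for illustration.

```latex
\begin{align}
  f_t &= \sigma\!\left(w_f \cdot [h_{t-1}, x_t] + b_f\right), \\
  j_t &= \sigma\!\left(w_j \cdot [h_{t-1}, x_t] + b_j\right), \\
  \tilde{c}_t &= \tanh\!\left(w_c \cdot [h_{t-1}, x_t] + b_c\right), \\
  c_t &= f_t \odot c_{t-1} + j_t \odot \tilde{c}_t, \\
  o_t &= \sigma\!\left(w_o \cdot [h_{t-1}, x_t] + b_o\right), \\
  h_t &= o_t \odot \tanh(c_t).
\end{align}
```

Here $\odot$ denotes element-wise multiplication. The forget gate scales the previous memory, the input gate scales the new candidate, and the output gate controls how much of the updated memory is exposed as $h_t$.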
Finally, the output gate $o_t$ is expressed by (14). Although LSTM performs well in many respects, it does not consider the correlation and feature influence between different skeleton movements over a long time. Hence, the LSTM model depends only on the human skeleton joints while recognizing human skeleton movements, which limits recognition. Therefore, the idea of time integral is introduced. First, the pre-acquired skeleton sequence information is transformed by operations such as translation and rotation. In this way, all movements can obtain their relative coordinates. If the human skeleton movement differs over time, a multiple LSTM model is used to extract and fuse features [39]. Finally, multiple types of movements are captured by integrating multiple LSTMs. Figure 2 reveals the overall implementation framework of the optimized multi-LSTM human skeleton movement recognition.
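The translation and rotation step can be illustrated with a small sketch that converts one frame of 2D joints into relative coordinates. The choice of reference joints is an assumption made for illustration: `root`, `left`, and `right` are hypothetical indices (for example, the neck and the two shoulders), since the text does not state which joints anchor the transform.

```python
import math


def normalize_skeleton(joints, root=0, left=1, right=2):
    """Return joints in coordinates relative to the skeleton itself.

    The joints are translated so that `root` sits at the origin and then
    rotated so that the vector from `left` to `right` points along +x.
    """
    root_x, root_y = joints[root]
    shifted = [(x - root_x, y - root_y) for x, y in joints]

    # Angle of the reference segment in the translated frame.
    dx = shifted[right][0] - shifted[left][0]
    dy = shifted[right][1] - shifted[left][1]
    angle = math.atan2(dy, dx)

    # Rotate every joint by -angle to cancel the body's in-plane rotation.
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in shifted]


# Hypothetical frame: root joint, then left and right reference joints.
frame = [(2.0, 3.0), (1.0, 2.0), (3.0, 4.0)]
relative = normalize_skeleton(frame)
```

After the transform, the root joint is at the origin and the reference segment is horizontal, so the same pose yields the same coordinates wherever and however the dancer is oriented in the image plane.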
The accuracy and loss of various LSTMs are compared to verify the effectiveness of the optimized multi-LSTM human skeleton movement recognition algorithm. Specifically, the algorithms selected for comparison include the single-LSTM and double-LSTM. A skeleton sequence input into the optimized F-Multi-LSTM contains 24 frames, and each frame consists of multiple 2D skeleton points. During analysis, the Adam optimization algorithm is used as the optimization tool, and the initial learning rate is set to $10^{-4}$ in an effort to achieve the model's global optimization. The single-LSTM has one input layer, while the double-LSTM has two input layers. The input is assumed to be a sentence. In double-LSTM, one side of the input corresponds to the word at the beginning of the sentence and the other side corresponds to the word at the end of the sentence.
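As an illustration of the 24-frame input, the sketch below cuts a longer keypoint sequence into fixed-length clips. The 24-frame length and the 18 keypoints come from the text; the stride, the dropping of incomplete trailing frames, and the function name are assumptions.

```python
CLIP_LENGTH = 24    # frames per input sequence, as stated in the text
NUM_KEYPOINTS = 18  # key points per frame, as stated in the text


def make_clips(frames, clip_length=CLIP_LENGTH, stride=CLIP_LENGTH):
    """Split per-frame keypoint lists into fixed-length clips.

    Trailing frames that do not fill a whole clip are dropped (an assumption).
    """
    clips = []
    for start in range(0, len(frames) - clip_length + 1, stride):
        clips.append(frames[start:start + clip_length])
    return clips


# Hypothetical 60-frame recording with placeholder (x, y) coordinates.
recording = [
    [(float(t), float(k)) for k in range(NUM_KEYPOINTS)] for t in range(60)
]

clips = make_clips(recording)                  # non-overlapping clips
overlapping = make_clips(recording, stride=12)  # half-overlapping clips
print(len(clips), len(overlapping))             # 2 4
```

Each clip has shape 24 × 18 × 2, which can be flattened to 24 steps of 36 values before being fed to a recurrent model.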

Figure labels: determination of key parts and joints; skeleton connection.

Design of HCI System Based on Dance Education.
Dance education based on physical education helps improve students' physical fitness and transforms traditional sports teaching. According to the above image visualization-based extraction and recognition method of human skeleton movements, the Web3D engine-oriented deep movement recognition system's functional modules are shown in Figure 3.
The system based on dance education and dance movement recognition consists of the front-end interactive function module and the back-end recognition function module. The former is a 3D world built on Web Graphics Library (WebGL) technology, including data processing of video images, 3D processing, and the HCI submodule. The latter consists of two subfunction modules, namely, node recognition and classification of human dance movements. In this HCI system, the OpenPose open-source library and the optimized VGGNet model can estimate facial expressions, the positioning of limbs and trunk, and people's feature information. This human skeleton extraction method can identify the critical points of the human body, and the optimized F-Multi-LSTM skeleton movement recognition network is then employed to determine the classification and label attribution of human dance movements. The designed system is based on recognizing and analyzing dance movements. Eight types of dance movements are analyzed and discussed, including stepping and knee lift (S), crouching (C), reaching out and jumping (R), turning and clapping (T), straight punch (B), arm circles (A), jumping (J), and high knee (H).
In the HCI system, the dance pose estimation module and dance movement classification module in the background recognition module are the keys. Accuracy and response time are evaluation indicators to analyze the chosen dance movements, thereby testing the feasibility of the HCI system based on dance education and movement analysis and recognition.

Data Preprocessing.
The image is preprocessed as follows to better meet the needs of behavior recognition: first, the image is uniformly scaled to 432 × 368 based on the center point; second, the image is denoised. Noise is common in images, and Gaussian noise is the most common type.
The Gaussian filter is used to effectively suppress the Gaussian noise in the image. The one-dimensional and two-dimensional Gaussian distributions are shown in (15) and (16), respectively. The Gaussian filter function in the open-source computer vision library (OpenCV) is used to realize image denoising, and the relevant parameters are optimized.
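The Gaussian filtering step can be sketched without any library as follows. The kernel weights follow the usual zero-mean Gaussian form, exp(-x² / (2σ²)) in one dimension and exp(-(x² + y²) / (2σ²)) in two, normalized so that they sum to 1. The kernel size, σ, and the edge handling (replicating border pixels) are assumptions, since the tuned parameters are not given in the text; in practice the paper relies on OpenCV's Gaussian filter instead of hand-written loops.

```python
import math


def gaussian_kernel_1d(size, sigma):
    """Normalized 1D Gaussian weights for an odd kernel size."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_kernel_2d(size, sigma):
    """2D Gaussian kernel as the outer product of two 1D kernels."""
    k = gaussian_kernel_1d(size, sigma)
    return [[a * b for b in k] for a in k]


def gaussian_blur(image, size=3, sigma=1.0):
    """Blur a grayscale image (list of rows), replicating border pixels."""
    kernel = gaussian_kernel_2d(size, sigma)
    half = size // 2
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for ky in range(size):
                for kx in range(size):
                    # Clamp coordinates so borders reuse the nearest pixel.
                    yy = min(max(y + ky - half, 0), height - 1)
                    xx = min(max(x + kx - half, 0), width - 1)
                    acc += kernel[ky][kx] * image[yy][xx]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 5x5 image with a single noisy spike in the centre.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 1.0
smoothed = gaussian_blur(spike, size=3, sigma=1.0)
```

The spike is spread over its neighbours while the total intensity is preserved, which is how the filter suppresses isolated noise at the cost of some sharpness.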

Results
This section analyzes the optimized VGGNet algorithm's performance through comparison with several human skeleton movement extraction algorithms. On this basis, the accuracy of the VGGNet algorithm in human skeleton movement extraction is analyzed and optimized. The effectiveness of the optimized model is verified. Besides, comparative analysis is conducted on the performance of the LSTM model, the single-LSTM model, and the double-LSTM model. Finally, interaction accuracy and system real-time performance are used to verify the HCI dance education system's performance. Table 2 presents the comparison of the extraction accuracy of human movements by several algorithms, including the original and optimized VGGNet. Table 2 suggests that the optimized VGGNet algorithm presents the best performance in extracting human movements, with the highest accuracy of 98.2%, showing apparent superiority over the traditional VGGNet algorithm. The 3D CNN model can only extract one type of feature from a three-dimensional space because the weights of the convolution kernel are the same in the whole space; that is, the weights are shared by the same convolution kernel, so the extraction accuracy of 3D CNN is only 91.2%. The spatial invariance of ST-CNN refers to invariance to spatial transformations of images such as rotation, translation, and scaling. Even if the input is transformed or slightly modified, the model can recognize and extract features. However, debugging the interpolation and image indexing is the most time-consuming and error-prone part of ST-CNN, so the extraction accuracy of ST-CNN is only 90.5%. The ODPM-CNN model is a variability network and behaves in the opposite way, and its recognition accuracy reaches 97.08%. The optimized VGGNet is also superior to the other human movement extraction algorithms. In this way, the effectiveness of the proposed skeleton extraction algorithm is verified preliminarily.

Extraction Results of Human Skeleton Movements.
The accuracy distribution of the extraction results of the eight human skeleton movements by the optimized VGGNet based on the OpenPose open-source library is shown in Figure 4.
A collection of 100 dance pictures is taken as the total sample, and each picture contains the action changes of eight body parts. S represents the extraction accuracy of the head, shoulder, elbow, wrist, hip, knee, and ankle bone nodes for the stepping and knee lift movement; likewise, C, R, T, B, A, J, and H represent the extraction accuracy of the above body parts under the correspondingly labeled movements. The extraction accuracy of the head is the highest, reaching 96%, and 100 images are correctly extracted. The extraction accuracy of the shoulder reaches 84.8%, with 90 pictures extracted correctly. The extraction accuracy of the elbow reaches 92.6%, with 89 pictures extracted correctly. The extraction accuracy of the wrist reaches 87.6%, with 86 pictures correctly extracted. The extraction accuracy of the hip reaches 91.0%, with 100 pictures extracted correctly. The extraction accuracy of the knee reaches 95.8%, with 90 pictures extracted correctly. The extraction accuracy of the ankle reaches 86.7%, with 88 pictures extracted correctly. Figure 4 signifies that the extraction accuracy of bone nodes differs among the eight body parts, as does the proportion of correctly extracted samples. Moreover, Figure 4 implies that body parts occupying a larger portion of the image have a significantly higher proportion of correct extractions.

Skeleton Movement Recognition Results of Multiple LSTMs.
The single-LSTM, double-LSTM, F-Multi-LSTM, and A-Multi-LSTM are compared. The results are shown in Figure 5. The abscissa in Figure 5 denotes the different neural network models. The left axis refers to the accuracy, and the right axis stands for the loss rate. Single-LSTM supports one-way sequence input and output, while double-LSTM supports two-way sequence input and output. Multi-LSTM is a multidimensional LSTM for high-frequency time series, which supports multiple parallel input sequences rather than the planar structure of multiple inputs in other models. F-Multi-LSTM is an optimized multidimensional LSTM, and A-Multi-LSTM is expressed as a pair of optimized multidimensional LSTMs. According to the comparison of the loss rate and accuracy of the single-LSTM and multi-LSTM, the double-LSTM has higher accuracy than the single-LSTM. The recognition accuracy reaches 79.8%, and the loss rate is 0.0685. Compared with the single-LSTM model, the difference is 43.8%; overall, the recognition accuracy and loss rate of the proposed multi-LSTM model are the best. Specifically, the F-Multi-LSTM model's recognition accuracy reaches 88.9%, and the loss rate is 0.0748, which is the best among the comparative algorithms. Compared with the traditional LSTM model before improvement, the optimized LSTM model has higher recognition accuracy. The optimized LSTM model has the best applicability in recognizing human skeleton movements.

HCI System Performance Based on Dance Education and Movement Analysis.
The eight dance movements are chosen as the benchmark. According to the indicators of interaction accuracy and system instantaneity, the HCI system's performance for dance education is shown in Figure 6.
In the dance education HCI system, the overall interaction accuracy of the eight dance movements is above 70%. The interaction accuracy of movement B is the highest, reaching 92%. The overall accuracy of interactive recognition is distributed in the range of 72%-92%, with a large span. The overall response time corresponding to the eight dance movements is distributed within 5.1 seconds to 5.9 seconds, showing that the dance education HCI system has a high instantaneity.

Discussion
The above results indicate that the recognition accuracy of the optimized VGGNet model based on the OpenPose open-source library varies across body parts. The reason is that the head has almost no changes in coordinates or rotation angle. Besides, the movement range of the head is small. Therefore, the accuracy of classification and recognition of the head is the highest. In contrast, the shoulders are greatly affected by external factors, such as rotation angle and abscissa among different movements. Hence, the classification and recognition accuracy of the shoulders is relatively low. The elbow movements and the wrist movements are affected by changes in moving speed and longitudinal coordinates. If the human body's moving speed is slow and the position between the arm and the camera is not parallel, the classification and recognition accuracy will be high. The hips are easily affected by changes in the leg movements. The overall accuracy of classification and recognition corresponding to the knees is high, but movements with large fluctuations, such as movement H, can significantly affect classification and recognition accuracy. Therefore, the accuracy is low. The classification and recognition accuracies of the ankles and other parts are low, probably because of external factors such as clothes and shoes. Although the classification and recognition accuracy differs considerably among dance movements, the average accuracy is high, confirming the proposed algorithm's effectiveness. The multiple LSTM model is also advantageous in skeleton movement recognition. Because the optimized LSTM model is robust, its learning and classification abilities are increased, thereby increasing its accuracy in recognizing different dance movements. The distribution changes of the interaction accuracy corresponding to the dance education HCI system reveal that the interaction accuracy corresponding to different movements has a large span.
Compared with the model training process, the actual interaction is affected by complex environmental conditions, such as different lighting, the restraint between different dance movements, and the conversion frequency of various dance movements. Under such conditions, the interaction accuracy of the HCI system drops. Hence, attention should also be paid to improving the datasets in actual HCI applications.
Meanwhile, the proposed algorithm is compared with the methods proposed by other scholars [40][41][42][43][44][45][46][47][48][49] to verify its superiority. For the training, the input image size is set to 432 × 368, the number of cycles is set to 50, the batch size is set to 16, and the initial learning rate is set to 0.001. Table 3 reflects the results. Table 3 demonstrates that the proposed multi-LSTM model has the highest accuracy in bone motion recognition, and the recognition accuracy has been improved by 27.79%, 17.69%, and 27.62%, respectively, compared with the comparative methods.

Conclusions
For the dance education HCI system, the CNN-based VGGNet model is optimized and applied to extract human skeleton movements based on the OpenPose open-source library and histogram equalization. The proposed extraction algorithm for human skeleton movements shows satisfactory performance in extracting eight different dance movements, with the highest accuracy rate reaching 96%. From the comparison of loss value and accuracy between a single-LSTM model and a multi-LSTM model, the accuracy of bone motion recognition by the multi-LSTM model is 79.8%, which is higher than that by a single-LSTM model. The optimized multi-LSTM model has higher accuracy in recognizing human skeleton movements than the traditional LSTM models. The constructed HCI system has an interaction accuracy of 92%.
This work extends the application range of deep learning in skeleton movement recognition and achieves an organic combination of deep learning and HCI. The contributions based on the extraction and recognition of human dance movements are as follows: (1) An optimized VGGNet human skeleton movement extraction algorithm is proposed, which achieves a better extraction accuracy than traditional algorithms, attaining 96%. (2) An optimized multiple LSTM human skeleton movement recognition algorithm is proposed. Its recognition accuracy reaches 88.9%, which is significantly better than traditional LSTMs. (3) An HCI system based on image visualization is designed, with an interaction accuracy rate of 92%. (4) A reference is provided for more in-depth human movement extraction and recognition, and the applicability of deep learning to the HCI system is strengthened.
Limited by computing resources, other larger and more complex datasets are not explored in this experiment [50][51][52][53][54][55][57,58]. In addition, although the algorithm can meet the real-time requirements, its recognition speed is still slow [59][60][61]. In view of the above problems, subsequent work will further expand the datasets, especially in complex scenes, and further optimize the model to improve the detection speed [56].

Data Availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Consent
Informed consent was obtained from all individual participants included in the study.