Learning-Replay Based Automated Robotic Testing for Mobile App

Record-replay testing is widely used in mobile app testing as an automated testing method. However, current record-replay methods depend closely on the internal information of the device or app under test. Due to the diversity of mobile devices and system platforms, their practical use is limited. To break this limitation, this paper proposes an entirely black-box learning-replay testing approach that combines robotics and vision technology to achieve record-replay testing that supports cross-device and cross-platform use. Firstly, vision technology is used to extract the critical information of GUI and gesture actions during the tester’s testing process; secondly, the GUI composition and test actions are analyzed to form a test sequence; finally, the robotic arm is guided to complete the replay of the test sequence through visual judgment. On the one hand, the approach in this paper does not access the interior of the app, shielding the association between test actions and device; on the other hand, it captures more abstract test action information instead of simple operation location records and supports more flexible test action replay. We demonstrate the effectiveness of this approach by evaluating the learning-replay of 12 popular apps for 13 typical scenarios on the same device, across devices, and across platforms.


Introduction
Mobile apps have developed rapidly in recent years. They have been gradually replacing desktop software. Compared with desktop software, mobile apps have new features such as rapid growth in number, frequent version iteration, and diverse system platforms. However, these features present challenges for software testing in both workload and difficulty. In view of this, automated testing methods are more urgently needed for mobile apps to ensure quality [1].
Record-replay testing is a classic automated testing method. It records the tester's test actions, converts them into test scripts, and then uses the scripts to replay the test actions on the app under test (AUT) [2]. Record-replay plays a vital role in software testing, especially in regression testing that requires repeated verification. Industry and academia continue to make efforts to improve the performance of record-replay testing.
Event-driven record-replay testing captures the events of an app for replay. Reran [3], a typical representative of record-replay testing, can accurately reproduce the test actions by capturing the underlying system event stream and can flexibly define the event-triggered time limit during replay. Reran obtains test information from the system level and can simulate nondiscrete actions, such as irregular sliding, but it is closely tied to the device under test. Mosaic [4] and Appetizer [5] are similar to Reran. In addition, operating system platforms provide specialized testing frameworks or tools, such as Monkeyrunner [6] and Espresso [7] for Android apps and XCTest [8] for iOS apps, but they only work for testing apps on their own platform. Appium [9] implements cross-platform testing through script conversion by encapsulating the above testing frameworks of each platform. Calabash [10] also realizes cross-platform testing by converting test scripts for the testing frameworks of different platforms. However, platform differences still make cross-platform testing difficult in this way. SARA [11] achieves a high replay success rate for single-device and cross-device replay by binding events to Android widgets. Amalfitano [12] considered the acquisition of GUI layout file information during recording to associate action events with GUI controls. In the subsequent replay, accurate event replay can be performed by searching for the corresponding controls in a targeted manner. These works combine events with GUI layout to improve the accuracy of replay.
GUI-driven record-replay testing further captures GUI images for replay. With the adoption of computer vision in software engineering [13], GUI images are included in mobile app testing. Sikuli [14] and Eyeautomate [15] can generate scripts that contain screenshots of GUI elements and replay across devices by comparing image pixels of the elements. Based on this idea, Airtest [16] and Ranorex [17] utilize modern visual technology to generate such visual scripts and achieve higher accuracy replay. Furthermore, LIRAT [18,19] tries to match image and layout characterization of GUI to solve the problem that the test cannot be replayed after fine-tuning due to version update or different platforms.
In the current record-replay testing techniques, an essential basis is that they all need to access the interior of the app to obtain either GUI information through layout files or test action information through event streams. Thus, they are closely dependent on the system platform or AUT and cannot support cross-device and cross-platform testing well. However, mature mobile apps usually need to adapt to multiple mobile devices and multiple system platforms (e.g., iOS, Android, and Web). Differences in screen size, resolution, operating system type, and operating system version of mobile devices will lead to the failure of record and replay. In actual use, the maintenance workload for the record scripts to ensure the validity of the record-replay testing is still huge [20]. In addition, multiplatform version apps require independent record-replay testing tools for each platform, which will also increase the cost and complexity of testing.
At the same time, mobile apps require interaction with the outside world, so the traditional testing method using simulated signals to drive simulated events can no longer support the test coverage of mobile apps. Therefore, robotic testing for mobile apps is proposed to enhance the realism of the testing [21][22][23]. The industry has also launched various robotic arm based automated software testing platforms [24][25][26]. However, in the current automated robotic testing research, the robot is only used as an actuator, and a whole-process automated robotic testing approach has not been formed.
Regarding the issues above, this paper proposes a learning-replay based automated robotic testing approach that learns and imitates the test actions from a complete black-box perspective. We expect the robot to recognize and test apps from the external appearance, just like humans. The robot captures test information through vision technology and uses the robotic arm to implement test execution, realizing the decoupling of testing from devices and apps. Thereby, it can address the shortcomings of current record-replay testing in cross-platform and cross-device settings. The main idea of the learning-replay based automated robotic testing is as follows: firstly, the tester's testing process is recorded as a video; secondly, vision technologies are used to visually capture the test information from the video, and then the test sequence is formed; finally, the test replay is completed by the robotic arm according to the test sequence automatically. The purpose of this paper is to achieve entirely automated black-box testing for mobile apps by learning and mimicking human test actions. It can identify what operations can be done on what elements under what kind of GUI structure. The approach is not only a record of testers' test actions but can also identify more information and achieve comprehensive judgment as follows: (a) The ability to test based on GUI structure, for cross-platform and cross-device testing. (b) The ability to select test scripts autonomously and execute interactively.
Figure 1 shows an overview of our approach based on the automated robotic testing environment. The environment provides complete black-box testing without relying on any internal information of AUT. In detail, the camera captures the GUI and the test actions on AUT, and the robotic arm imitates the interaction between tester and AUT. Like record-replay testing, our approach consists of a learning (record) phase and a replay phase.

Approach Description
In the learning phase, the camera is used to record the tester's testing process on AUT, and then vision technologies are utilized to identify test actions, including GUI and gesture information in each test step, thereby forming a test sequence. The goal of this phase is as follows: we expect the robot to learn the tester's testing intentions instead of rigidly recording the test.
In the replay phase, the robot first captures the GUI of AUT through the camera, then determines the executable action based on the learned test sequence, and finally drives the robotic arm to complete the action. The goal of this phase is as follows: we expect to achieve a visual-driven automated test execution process like humans.
The details of the learning-replay based automated robotic testing are given below. The framework of the learning-replay approach is shown in Figure 2. Moreover, we design algorithms for the test learning and test replay processes, respectively. Algorithm 1 describes the generation process of test sequence scripts in the learning phase. The time cost of Algorithm 1 is O(n + mn), where n is the number of video frames, and m is the number of elements in a single GUI. Algorithm 2 describes the test execution process in the replay phase. The time cost of Algorithm 2 is O(pq), where p is the number of AUT's GUIs, and q is the total number of test scripts.

Video Recording.
Video recording means recording the tester's testing process by video, as shown in Figure 3. Because the test environment only uses an upper monocular camera, during the test execution, the hand needs to enter the shooting area from outside and completely leave the area after completing a certain action. In addition, the shooting area needs to be able to capture the GUI and the hand movement trajectory. A recorded video is generated for each test case executed by the tester.

Video Analysis.
When the recorded video is obtained, it is necessary to further segment the GUI frames and the gesture action frames in the video. Here, we introduce OpenPose [27], a model that can detect the 2D pose of people in real time, and it is used to detect hand trajectories in the video.
As shown in Figure 4, OpenPose can output the key points of the hand in each video frame. From this, we can extract the fingertip trajectory of the index finger (the finger that touches the screen) for each test action in the video. Therefore, the frames where the fingertip trajectory continuously appears are the gesture action frames, and the frames where the hand is not detected are the GUI frames; Figure 5 gives an example. Figure 5 shows the coordinate change value of the fingertip position in each frame of the video. Frames with a coordinate change value (i.e., the hand is detected) are gesture frames. Frames without a coordinate change value (i.e., no hand is detected) are GUI frames. The GUI frames before and after the gesture action frames represent, respectively, the initial GUI before each action and the response GUI after the action is executed. We add split tags to each frame by judging the transition between the GUI frames and gesture frames (lines 4-11 in Algorithm 1).
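The frame-splitting rule described here can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: it assumes the hand detector has already produced one fingertip coordinate per frame, with `None` where no hand was detected, and the name `segment_frames` is hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def segment_frames(fingertips: List[Optional[Point]]) -> List[Tuple[str, int, int]]:
    """Group consecutive frames into ("gui" | "gesture", start, end) runs.

    A frame with no detected fingertip (None) is treated as a GUI frame;
    a frame with a fingertip is treated as a gesture frame. Indices are
    0-based and `end` is inclusive.
    """
    segments: List[Tuple[str, int, int]] = []
    for index, tip in enumerate(fingertips):
        label = "gui" if tip is None else "gesture"
        if segments and segments[-1][0] == label:
            # Same kind as the previous frame: extend the current run.
            kind, start, _ = segments[-1]
            segments[-1] = (kind, start, index)
        else:
            # A GUI/gesture transition: this is where a split tag would go.
            segments.append((label, index, index))
    return segments


if __name__ == "__main__":
    frames = [None, None, (120.0, 300.0), (121.0, 302.0), None]
    print(segment_frames(frames))
```

Under this reading, the GUI run immediately before a gesture run is the initial GUI of that action, and the GUI run immediately after it is the response GUI.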

Test Gesture Identification.
For the test actions, we expect the robot to learn what actions the tester did rather than just record the action trajectory. We do not directly judge the action but identify the micromotions of the gesture action. We refer to the standard gesture actions supported in the development instructions of iOS [32] and Android [33], as shown in Table 1 (here, we only consider the frequently used single-finger actions), all of which can be represented by micromotions.
Tap is a click action that triggers a control by clicking on a control element in the GUI. Long press holds a control element to activate its function, such as a long press on a text to trigger the copy menu. Double tap taps a control element twice, such as double tapping a picture to zoom in or out. Flick is a quick touch and movement of the surface, such as scrolling the surface vertically to browse all the contents in a list. Drag is to hold down an element or surface and move it, such as moving an element to another location or sorting items in a list. Swipe is to slide the control element or surface to the left or right, such as switching tabs or removing an item from a list by sliding horizontally. The interactive objects of all actions contain elements and surfaces. Some gestures are only valid for elements, some for surfaces, and some for both. Gestures against surfaces tend to be more guided, while gestures against elements have both guided and behavioral implications. In general, conflicting gestures are not recommended for iOS and Android designs [32,33].
Analysis of the characteristics of these gesture actions reveals that all of them can be composed of the four micromotions: touch, move, hold, and leave. Touch is the initial sign of all gesture actions, and leave is the end sign. When the tester's hand moves towards the interactive object, it presents a speed change from fast to slow and finally appears to be relatively static for multiple frames when the action occurs and then accelerates away. Taking Figure 7 as an example, we calculate the coordinate change value for each frame of a long press action. The coordinate change is almost maintained at a relatively low position when the long press action occurs. To trigger an action, the finger needs to contact the screen for a certain period. The length of this period can determine the type of the intermediate micromotion (touch time requirement: none < touch < hold). Moreover, according to the change of the contact area before and after the period, it is determined whether a move micromotion occurs. Therefore, we first calculate the fingertip position detection error when the finger is in a continuous static state as the threshold for judging whether the finger is in contact with the screen. If there is a constant frame segment lower than the threshold, it is determined to be the period when the finger touches the screen, that is, the intermediate micromotion frames. Then, we evaluate the average frame number of intermediate micromotion for tap, long press, and double tap gestures as the measure threshold for none, hold, touch, and the average deviation of fingertip position within the intermediate micromotion frames as the measure threshold for move. Thus, the test gesture identification is completed. The gesture identification process is described as function identifyGesture in Algorithm 1 (lines 22-33).
For our robotic testing environment, f_threshold is set as 7.1 to judge and extract the frames where the intermediate micromotion occurs, c_threshold is set as 15 to judge the change of fingertip position between the first and last frames of the intermediate micromotion, n_threshold is set as 42 to judge whether it is a none micromotion, and h_threshold is set as 125 to judge whether it is a hold micromotion. For the nonmove micromotions, the function returns the micromotion category and the fingertip coordinate of the first frame; for the move micromotion, the function returns the category and the fingertip coordinates of the first and last frames.
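A possible reading of this thresholding is sketched below. It is not the authors' identifyGesture: the four threshold values come from the text, but the way they are combined here is an assumption. The sketch takes the longest run of frames whose per-frame fingertip change stays below f_threshold as the contact period, checks displacement against c_threshold first, and then classifies by run length (below n_threshold is none, at or above h_threshold is hold, otherwise touch).

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]

F_THRESHOLD = 7.1   # per-frame fingertip change below this counts as contact
C_THRESHOLD = 15    # first-to-last displacement above this counts as move
N_THRESHOLD = 42    # contact runs shorter than this are "none" (assumed reading)
H_THRESHOLD = 125   # contact runs at least this long are "hold" (assumed reading)


def identify_gesture(tips: List[Point]) -> Tuple[str, Optional[Point], Optional[Point]]:
    """Classify the intermediate micromotion of one gesture segment.

    Returns (category, first_contact_point, last_contact_point); the last
    point is only filled in for "move".
    """
    changes = [math.dist(a, b) for a, b in zip(tips, tips[1:])]

    # Find the longest run of consecutive low-change frames.
    best_start, best_len, run_start = 0, 0, None
    for i, change in enumerate(changes + [math.inf]):  # sentinel closes the last run
        if change < F_THRESHOLD:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None

    if best_len == 0:
        return ("none", None, None)

    first, last = tips[best_start], tips[best_start + best_len]
    if math.dist(first, last) > C_THRESHOLD:
        return ("move", first, last)
    if best_len < N_THRESHOLD:
        return ("none", first, None)
    if best_len >= H_THRESHOLD:
        return ("hold", first, None)
    return ("touch", first, None)


if __name__ == "__main__":
    long_press = [(0.0, 0.0), (50.0, 50.0)] + [(100.0, 100.0)] * 131 + [(200.0, 200.0)]
    print(identify_gesture(long_press))
```

Because these values were calibrated for the authors' camera and frame rate, a different setup would need its own thresholds.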

Test Action Identification.
After the test gesture and GUI information are extracted from the video, it can be inferred what kind of action the tester performed on an interactive object from the contact position when the action occurs. Then, the generated test sequence is defined as [Serial number, GUI skeleton, Micromotion decomposition, Interactive object]. Serial number denotes sequence order, and it will be organized into a tree structure if there are multiple videos for a single app. GUI skeleton is constructed in Section 2.3. Micromotion decomposition is generated in Section 2.4. Interactive object is determined by the object detection results of GUI elements in Section 2.4 and the fingertip position of the gesture in Section 2.5. The function identifyObject defines the judgment of the interactive object (lines 34-42 in Algorithm 1). If the coordinates of the fingertip fall in the bounding box of a GUI element, it is the element; else, it is the surface. The test sequence script is shown in Figure 8. The script uses a more abstract form of expression instead of recording the underlying location information to better support cross-platform testing.
Figure 6: Detect GUI elements and construct GUI skeleton.
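The bounding-box test and the four-field test sequence entry can be illustrated as follows. This is a sketch under assumptions: the `Element` and `TestStep` types, their field names, and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical and are not taken from the authors' script format in Figure 8.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Element:
    """A detected GUI element (hypothetical representation)."""
    label: str
    box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


@dataclass
class TestStep:
    """One entry of the test sequence, mirroring the four fields in the text."""
    serial_number: int
    gui_skeleton: List[Element]
    micromotions: List[str]
    interactive_object: str


def identify_object(fingertip: Point, elements: List[Element]) -> str:
    """Return the label of the element under the fingertip, else "surface"."""
    x, y = fingertip
    for element in elements:
        x_min, y_min, x_max, y_max = element.box
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return element.label
    return "surface"


if __name__ == "__main__":
    gui = [Element("search_button", (10, 10, 60, 40)), Element("list", (0, 50, 200, 400))]
    target = identify_object((30, 25), gui)
    step = TestStep(1, gui, ["touch", "leave"], target)
    print(step.interactive_object)
```

Storing the element label rather than the raw coordinate is what lets the same step be located again on a screen with a different size or resolution.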

Test Replay.
In the replay phase, we do not specify the test sequence to be executed but let the robot determine the executable test based on the learned test sequence. The detailed replay process is as follows:
(a) At the beginning of the test, the initial GUI of AUT is randomly placed in the camera shooting area. After the robot captures the current GUI, it matches the GUI with the initial GUI in the test sequence, looking for a possible test action (lines 6-13 in Algorithm 2). Here, the method of comparing the vector similarity of two GUI skeletons in Section 2.4 is adopted to match the GUI. For our robotic testing environment, s_threshold is set as 0.85.
(b) The robot locates the interactive object on the current GUI according to the interactive object and the test gesture (micromotion decomposition) defined by the test sequence. Then, it drives the robotic arm to complete the action execution. The function executeAction describes this process (lines 29-41 in Algorithm 2).
(c) After completing the test action, the robot obtains the response GUI of AUT. First, the robot determines whether the interface has changed, that is, whether the test action is executed successfully; second, it determines whether the current GUI matches the GUI in the next sequence defined by the test sequence, that is, whether the response of AUT meets the expectations. If it meets the expectations, execute the next sequence; otherwise, report an exception (lines 15-25 in Algorithm 2).
The above process is repeated until the test sequence is completed. Test result records are generated whether the test replay is successful or not. During the replay process, the test is executed by the robot through real-time interactive judgment rather than rigid action playback.
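The match, execute, and verify loop can be sketched as below. This is not the authors' Algorithm 2. It assumes each GUI skeleton has already been encoded as a numeric vector, uses cosine similarity as the vector similarity measure (the text does not name one), and treats an unchanged vector as "the interface has not changed". The `capture_gui` and `execute_action` callbacks stand in for the camera and the robotic arm and are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

S_THRESHOLD = 0.85  # similarity threshold reported for the authors' environment

Vector = Sequence[float]
Step = Tuple[Vector, str, Vector]  # (initial GUI, action, expected response GUI)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gui_matches(a: Vector, b: Vector) -> bool:
    return cosine_similarity(a, b) >= S_THRESHOLD


def select_script(current: Vector, scripts: List[List[Step]]) -> Optional[List[Step]]:
    """Pick the first learned script whose initial GUI matches the current GUI."""
    for script in scripts:
        if script and gui_matches(current, script[0][0]):
            return script
    return None


def replay(scripts: List[List[Step]],
           capture_gui: Callable[[], Vector],
           execute_action: Callable[[str], None]) -> List[str]:
    """Replay one matching script and return a record per executed step."""
    script = select_script(capture_gui(), scripts)
    if script is None:
        return ["no executable test"]
    records: List[str] = []
    for _initial, action, expected in script:
        before = capture_gui()
        execute_action(action)
        after = capture_gui()
        if list(after) == list(before):
            records.append(f"{action}: GUI unchanged")
            break
        if not gui_matches(after, expected):
            records.append(f"{action}: unexpected response")
            break
        records.append(f"{action}: ok")
    return records


if __name__ == "__main__":
    screens = iter([[1, 0, 0], [1, 0, 0], [0, 1, 0]])
    script = [([1, 0, 0], "tap search_button", [0, 1, 0])]
    print(replay([script], lambda: next(screens), lambda action: None))
```

A record is produced for every outcome, so a failed replay still leaves a trace of which action stopped the sequence.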

Experiment
To evaluate the effectiveness of our approach, we perform a study of testing popular apps under different platform versions and devices.

Experiment Setup.
We select 6 mobile devices with different screen sizes, resolutions, and operating systems and install popular apps from the iOS and Android App Stores, covering categories such as social, communication, entertainment, and news. Details are shown in Tables 2 and 3. In terms of recording the test video, we write test cases for each app around the 13 typical scenarios, including select, search, login, forward, comment, open, add to, setting, zoom-out, browse, hidden menu, tabs-switch, and move, and then assign multiple testers (senior graduate students) to complete the execution of the test cases. All testing procedures are recorded by the camera. The distribution of scenarios across apps is shown in Table 4.
After the recorded videos are obtained, the learning phase of the robotic testing begins. The robot learns the test actions from the videos and generates test sequence scripts. Then, the replay phase of the robotic testing can be conducted to verify the validity of the learning-replay testing.
In the replay phase, three groups of experiments are carried out: replay on the original device, that is, replay on the device that recorded the test video; replay across devices of the same platform, that is, replay between devices of the same system platform; and replay across devices and platforms, that is, replay between devices of different system platforms. The success rate (SR) is the metric to evaluate the correctness of the learning-replay testing, which is denoted as

$$\mathrm{SR} = \frac{P_{\text{actual}}}{P_{\text{true}}} \times 100\%,$$

where $P_{\text{actual}}$ represents the actual result, and $P_{\text{true}}$ represents the ideal result. In the testing experiment, the test scripts generated by each app are replayed in the robotic testing environment, each test script is executed once, and the SR of each test action execution is calculated.
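As a small worked example of the metric, the sketch below reads SR as the ratio of the actual result to the ideal result, expressed as a percentage. That reading is an assumption, as is treating the actual result as the number of correctly replayed actions and the ideal result as the number of actions that should have been replayed; the function name is hypothetical.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of test actions that were replayed successfully.

    `outcomes` holds one boolean per executed test action; the ideal
    result is taken to be "every action succeeds".
    """
    if not outcomes:
        raise ValueError("at least one test action is required")
    p_actual = sum(outcomes)   # actions actually replayed correctly
    p_true = len(outcomes)     # actions that should have been replayed
    return 100.0 * p_actual / p_true


if __name__ == "__main__":
    tap_outcomes = [True, True, False, True]
    print(success_rate(tap_outcomes))
```

Grouping the outcomes by app gives a per-app SR, and grouping them by gesture gives a per-gesture SR.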

Results and Discussion.
The results of the experiment are shown in Figure 9. TP, LP, DT, FK, DG, and SP represent tap, long press, double tap, flick, drag, and swipe gestures, respectively. The vertical SR gives the replay success of each app, and the horizontal SR gives the replay success of each gesture.

Replay on the Original Device.
For the replay on the original device, the average SR reached 93.9%. WeChat and Zoom even achieved 100% replay SR; namely, all test actions were successfully replayed. The worst-performing apps, Amazon Shopping and Booking.com, also reached more than 85%. As for the replay performance of individual test actions, the SRs of long press and drag are lower than 90%, while those of the remaining actions are over 90%. The replay on the original device is relatively simple, and there is no need to consider differences in devices, platforms, and apps. Therefore, the replay SR on the original device is only affected by the performance of our proposed approach.
In terms of visual identification for apps, overly complex GUI elements can easily cause confusion in GUI identification. For example, there are many videos or pictures in YouTube and Reddit, and the text or logos appearing in the videos or pictures will interfere with GUI identification. The design of some GUIs will confuse object detection, such as Booking.com, which uses a lot of text, and some texts are buttons or links that can be clicked. The variability of the GUI causes the action to fail to execute, such as Reddit or Opera News, which constantly update their content, resulting in changes in the GUI structure. Their browse lists usually consist of plain text, a combination of image and text, or a video.
In terms of executing actions, some actions have a large contact area and high operating tolerance, such as flick and …
… original device replay. Cross-device replay on the same platform uses the same app of the same platform, so there are only differences in properties of devices. (In the experiment, the tablet is still in portrait mode.) Compared with the original device replay results, the SR of cross-device replay on the same platform has almost all dropped. Changes in resolution or screen size still affect the recognition of GUI elements, especially from high-resolution devices to low-resolution devices.
In addition, an unexpected result is that the SR of a few actions is higher than that of the original device replay. Two reasons cause this result. One is that our approach abstracts the test actions into a test script, so the execution of the test is independent of the original device. Another is that the adopted visual detection algorithm has an inevitable detection error fluctuation. Changes in the accuracy of the detection box size can influence the execution results. Such subtle fluctuations may impact the test accuracy. However, in practice, multiple rounds of testing can be used to eliminate this effect.
It can be seen from the replay results that the execution SR error is low, and the proposed approach can effectively support replay across devices of the same platform.

Replay on Cross-Platform Devices.
In the cross-platform replay, the test replay execution between iOS, Android, and Web platforms is completed, respectively. Compared with the first two sets of experiments, the cross-platform test results of each app have dropped significantly. Android to iOS, iOS to Android, Web to Android, and Web to iOS achieved SR between 75% and 85%. However, the SR of iOS to Web and Android to Web are only 64.2% and 49.3%.
We find that the main reasons for the decline of SR are as follows:
(i) First, it is undeniable that the GUI design of cross-platform apps is slightly different due to the differences of platforms. Especially between the web platform and the mobile system platform, the web is more about the display adaptation to the screen size of the mobile phone as well as the adaptation of ordinary gestures (click, flick) but does not consider the adaptation of uncommon gestures. For example, for the swipe gesture, the tab pages can be switched by swipe on the mobile system platform, but clicking the tab to switch pages is more common on the web platform. Among the experimental apps, only Apple Music's web version supports swipe to switch the tab pages.
(ii) Second, some apps use a hybrid development approach, nesting the web within the mobile app. For example, for Amazon Shopping, the browsing area of Android and Web is the same but different from iOS. The navigation buttons outside the browsing area are different for each platform.
(iii) Third, the web version of some apps is entirely different from the mobile version, such as Wikipedia.
Despite the drop in SR, cross-platform replay of more than half of the test actions is still achieved through the abstraction ability of our approach. To the best of our knowledge, this is the first time that record-replay testing has been applied to replay on multiple platforms from a black-box perspective. To a certain extent, we eliminate the impact of device and app differences on testing, thereby supporting record-replay across platforms.
The learning SR and replay SR of each functional scenario are shown in Table 5. It demonstrates that our approach can support test execution in various scenarios. However, some scenarios that require a sequence of actions to complete, such as login and settings, will affect the SR. Because of the sequential nature of test actions, the failure to execute some key actions will cause the subsequent actions to fail as well.

Conclusion
In this paper, aiming at the shortcomings of current record-replay testing in cross-platform and cross-device testing, we propose a learning-replay testing approach based on robot vision. Our approach extracts the key information of GUI and gesture action in the testing process through vision technology and then transforms it into a test sequence, driving the robotic arm to complete the replay of test actions. The results of the experiment on popular apps and multiple scenarios show that this approach is capable of cross-device and cross-platform black-box testing.
There are a large number and various types of mobile apps. We only use limited devices and apps for the verification of our approach. Thus, we may not have verified all the conditions. However, we have tried our best to collect representative apps, devices, and typical scenarios to participate in our experiment to reduce this impact.
There are still some limitations in this work. First, some apps that require sensor signals (e.g., gravity sensing) still challenge this entirely black-box testing. They will put higher requirements on the robotic test environment and supporting algorithms. Second, most game apps have relatively independent characteristics of GUI and interaction, which will lead to the failure of our approach. Third, we only implemented the recognition and replay of single-finger gestures, while multifinger actions are still used in some scenarios.
In the future, we will continue to improve this approach and promote its integration with the commercial mobile apps testing process. In addition, the essence of our approach is to judge the GUI structure to complete replay, but how to replay according to the meaning of GUI is the key to solving the cross-platform testing and even the cross-app testing of the same functional scenario, which will further enhance the automation of mobile app testing.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no potential conflicts of interest.