Burner: Recipe Automatic Generation for HPC Container Based on Domain Knowledge Graph

National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Australia-China Joint Research Centre for Energy Informatics and Demand Response Technologies, Centre for Distributed and High Performance Computing, School of Computer Science, University of Sydney, Sydney, Australia


1. Introduction
With the rapid development of IoT technology, its application scenarios have become very broad. When the computing power of edge devices in IoT applications is limited, high-performance computing (HPC) clouds can supplement it with powerful computing capacity. Container technology is widely popular due to its lightweight nature and convenience, and researchers in HPC have also recognized its value. Singularity [1] is currently the most widely used container technology in the HPC field, with many optimizations made for HPC applications. First, Singularity prevents user privilege escalation within the container. Second, it makes full use of the host's high-speed interconnect hardware such as InfiniBand and simplifies access to acceleration devices such as GPUs. By now, most of the world's top HPC centers use Singularity to containerize HPC applications in production environments.
Container technology simplifies the packaging of applications so that the dependent environment can be easily maintained. The encapsulation of the application running environment in container technology depends on recipes.
The recipe, a script written in a domain-specific language (DSL) for building container images, is the core of application-dependent environment encapsulation. It records all instructions for building the application running environment. The use of recipes improves the transparency of the research process and facilitates the reproduction of scientific results [2][3][4]. However, the effort involved in manually constructing an environment specification is non-trivial. An experienced developer may spend 20 minutes to 2 hours creating a recipe for an application and often fails to build an accurate specification [5]. Common challenges in writing recipes include selecting base images and packages and determining the correct installation order for transitive dependencies.
Henkel et al. designed the Binnacle toolset [6] to parse the 178,000 Dockerfiles present in the collected Github projects. This toolset is capable of mining semantic rules and best practices in Dockerfiles, providing friendly suggestions to Dockerfile developers. Unfortunately, the Binnacle toolset cannot be directly applied to Singularity recipe parsing, nor can it mine dependencies between packages. DockerizeMe [7] reproduces the running environment of Python code by building a Docker image and uses a combination of static analysis and dynamic analysis to solve the import error problem in Python. However, DockerizeMe mainly analyzes the Python language, which is only suitable for specific scenarios and cannot deal with the diversity of software systems.
HPC Container Maker [8] is an open source tool that makes it easier to generate container specification files. HPCCM can generate Dockerfiles or Singularity Definition Files from a high-level Python recipe. However, HPCCM essentially uses Python to define its own recipe specification, which places relatively high demands on users. Moreover, due to the lack of domain knowledge, HPCCM cannot recommend key entities to users. Users must therefore have relatively professional computer knowledge (such as Python and recipe syntax specifications) to implement customized recipes for HPC applications.
There are two challenges in realizing the automatic generation of recipes in the HPC field: (C1) how to parse recipe files and extract entities and entity relationships from them, and (C2) how to apply the obtained entities and entity relationships to the automatic generation of recipes. To address C1, we design and implement a two-phase parsing method for Singularity recipes and a relationship miner that extracts the key entities of recipes and mines the relationships between them. For C2, we observe that package dependencies can be expressed more efficiently with graph data structures, so we store the acquired knowledge in a standardized graph database, Neo4j. The knowledge graph provides data support for the automatic generation of recipes and has excellent scalability. For the generation step, we improve the tag-based recommendation method to meet HPC users' personalized and diverse needs.
In summary, we make four core contributions:
(1) A unique toolset designed for Singularity recipes that automatically extracts the knowledge required for image construction and mines the associations between entities.
(2) A knowledge graph of HPC containers that supports automatic recipe generation and also provides functions such as entity recognition and entity-relationship query.
(3) An improved recommendation method based on TF-IDF that significantly improves recommendation performance.
(4) Burner, an automatic recipe generation system. Notably, Burner supports both the Singularity Definition File and Dockerfile rule specifications.
The original recipe dataset and parsing results can be obtained at https://github.com/jhshz520/BurnerRecipe. The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the overall design of Burner. Section 4 introduces the construction of the domain knowledge graph, mainly including two-phase parsing of recipes, entity extraction, and entity-relationship mining. The automatic generation of recipes based on the knowledge graph is described in Section 5. Section 6 evaluates our toolset and the Burner system. The last section draws conclusions and proposes future work.

2. Related Work

2.1. Container Technology in HPC.
Charliecloud is an open source tool based on the user-defined software stack (UDSS), emphasizing that it can be executed without root permissions. Charliecloud is a lightweight container implementation with a small code base of only about 800 lines. However, its functionality, portability, and dependency handling are somewhat limited, and it cannot provide a powerful reproduction mechanism [9]. NERSC cooperated with Cray to develop Shifter [10,11]. The main idea of Shifter is to reuse some components of the Docker workflow and improve the runtime engine to meet the needs of HPC applications; it reuses key components of the Docker ecosystem while rewriting the Docker runtime. However, the setup and management of Shifter are relatively complex. Sarus [12] is built around the OCI specification, uses runc as the container runtime, and extends functionality for HPC use cases through OCI hooks, but it is not fundamentally different from Charliecloud and Shifter: all of them must be used with modified Docker containers to target HPC applications. Singularity is currently the best container solution in the HPC environment. It has a unique security model that allows untrusted users to safely run untrusted containers on multi-tenant systems. A special image format, the Singularity Image Format (SIF), is used to package and distribute containers; this compressed single-layer format greatly reduces the storage space of the image and facilitates image distribution with better performance. In addition, Singularity implements cryptographic signing and verification with excellent portability and repeatability [13]. Singularity allows user-defined, user-managed, and user-created containers to be easily integrated into existing HPC workflows and also provides compatibility with older OS versions via the setuid launcher. As of December 2021, Singularity has gone through three major version iterations with many useful features.

2.2. Recipe Analysis.
Singularity has emerged only in recent years, and its application scenarios are not yet as extensive as Docker's. At present, research on the Singularity Definition File is still scarce, but existing research on Dockerfiles offers valuable reference. Cito et al. [14] conducted an exploratory analysis of the Docker container ecosystem on GitHub with a dataset of more than 70,000 Dockerfiles. Comparing the most popular top 100 and top 1,000 projects, they found that up to 34% of the Dockerfiles in these projects could not be successfully built due to various problems, and 28.6% of the quality problems were caused by missing version tags. Schermann and Zumberi [15] collected structured data about the status and changes of Dockerfiles from over 15,000 projects on GitHub and stored them in a PostgreSQL database. Zhang et al. [16] studied the impact of the Dockerfile evolution trajectory on Dockerfile quality and image build latency; they found that the fewer the image layers and the larger the space occupied by each layer, the fewer the image quality problems and the shorter the build latency. By using the Dockerfile lint tool Hadolint [17] to perform static analysis on a large number of Dockerfiles, Lu et al. discovered a problem they called the "temporary file smell": due to Docker's copy-on-write mechanism, an inappropriate ordering of Dockerfile instructions leaves redundant temporary files in the Docker image [18,19]. Yin et al. [20] proposed the STAR method, which solves the tag recommendation problem for Docker image repositories without training data. Hassan et al. [21] developed the Rudsea tool, which implements Dockerfile update prompts based on analysis of changes in the software environment.
2.3. Software Domain Knowledge Graph.
Knowledge graphs are not only widely used in search engines, question answering systems [22], and medical service support but also play an important role in software reuse. Lin

3. Burner
The main purpose of Burner is to solve the problem of automatic recipe generation. We solve the dependency problem by building an offline knowledge base and designing an inference algorithm that returns dependencies in a feasible installation order. We use the Django framework to implement Burner as a web application that researchers can use by visiting the website. As shown in Figure 1, Burner uses the knowledge graph to automatically generate recipes. Its core modules include the knowledge graph read/write module, which mainly provides data support for the other modules. The entity recommendation module recommends entities such as base images and software packages according to the tag selected by a user. The installation order inference module infers the order of the package entities selected by the user to form an ordered installation list. Finally, the recipe generation module generates instructions according to the rules of the recipe and saves them as a file.

4. Construction of Knowledge Graph in HPC Container Domain
The knowledge graph is the cornerstone of our automatic recipe generation and provides strong support for dependency inference. Therefore, we start with the construction of the domain knowledge graph. As shown in Figure 2, the construction of the knowledge graph in the HPC container field mainly includes four parts: raw data acquisition, recipe parsing, knowledge fusion, and knowledge storage.

4.1. Ontology of Knowledge Graph.
The ontology of the knowledge graph in the HPC container domain is shown in Figure 3; it includes 4 entity types with their attributes and 8 entity relationships.

4.2. Data Acquisition.
The data mainly comes from Singularity Hub, a publicly available platform for building and deploying scientific containers, which provides great convenience for reproducible science [26]. We use a customized crawler to collect and organize the recipes together with their authors, tags, and other information. The raw data we obtained contains 530 tags and more than 1,000 published recipes for HPC applications.

4.3. Two-Phase Parsing of Recipe.
Code 1 is an example of a Singularity Definition File, in which bash statements are usually nested [27]. We design a two-phase parsing method to parse the nested bash statements in a recipe. The first phase is instruction parsing: AST parsing is performed according to the grammar specification defined by Singularity. The second phase parses the bash statements nested in the ASTs.
The first phase identifies each instruction according to the grammar of the recipe. The Singularity Definition File has different instruction blocks. Except for the Bootstrap and From instructions, all other instructions start with %. Therefore, regular expressions can be used to match and divide instructions, and each instruction can be parsed into an AST node, as shown in Figure 4.
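As a hedged sketch of this first phase (the function name, regular expressions, and example file are illustrative simplifications, not Burner's actual code), a minimal splitter separates the header keywords from the %-prefixed sections:

```python
import re

# Header keywords sit at the top; every other section starts with "%".
HEADER_RE = re.compile(r"^(Bootstrap|From)\s*:\s*(.+)$")
SECTION_RE = re.compile(r"^%(\w+)\s*$")

def parse_definition_file(text):
    """Split a Singularity Definition File into a header dict and a
    dict mapping section name -> raw bash lines (second-phase input)."""
    header, sections = {}, {}
    current = None
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m and current is None:
            header[m.group(1)] = m.group(2).strip()
            continue
        m = SECTION_RE.match(line)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return header, sections

recipe = """\
Bootstrap: docker
From: ubuntu:16.04

%post
    apt-get -y update
    apt-get -y install fortune cowsay lolcat

%runscript
    fortune | cowsay | lolcat
"""
header, sections = parse_definition_file(recipe)
# header["From"] -> "ubuntu:16.04"; sections["post"] holds the bash lines
```

The raw lines collected per section are exactly what the second phase receives for bash-level parsing.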
In the second phase of parsing, command analysis shows that the information about the base image is in the "From" field of the recipe and that information about packages is mainly in the "%post," "%environment," and "%runscript" instructions. However, bash statements, which are numerous and varied, are often nested in "%post" and "%runscript."
It is impractical to design corresponding parsing methods for all bash instructions, so we classified and counted them and found that 80% of bash command-line calls are covered by the 50 most commonly used commands. The names and classifications of the 50 commands are shown in Table 1. We design a bash statement parser for these 50 commands by referring to their command manuals and official documentation. The bash statement parser is implemented by modifying the shellcheck tool [28]. As shown in Figure 5, the second parsing phase generates the enriched AST. Parsing commonly used bash statements greatly enriches the content of the abstract syntax tree, which also provides a foundation for the extraction of software package entities and the mining of dependencies.
In the parsing example, "apt-get -y update," "apt-get -y install fortune cowsay lolcat," and "fortune | cowsay | lolcat" cannot be parsed in the first phase. In the second phase, apt-get is one of the common commands and can be further recognized and parsed by the bash statement parser, while "fortune | cowsay | lolcat" cannot be parsed further because fortune is not a common command.
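The second-phase idea can be illustrated with a minimal sketch. The command set, node layout, and function name below are our own simplifications, not the shellcheck-based parser: one bash line is tokenized, recognized commands are lifted into typed AST nodes, and everything else becomes an UNKNOWN node.

```python
import shlex

# A tiny stand-in for the 50 recognized common commands (illustrative).
COMMON_COMMANDS = {"apt-get", "yum", "pip", "wget", "git"}

def parse_bash_line(line):
    """Classify one bash line into a typed node or an UNKNOWN node."""
    tokens = shlex.split(line)
    if not tokens or tokens[0] not in COMMON_COMMANDS:
        return {"type": "UNKNOWN", "raw": line}
    if tokens[0] == "apt-get" and "install" in tokens:
        # Everything after "install" that is not a flag is a package name.
        pkgs = [t for t in tokens[tokens.index("install") + 1:]
                if not t.startswith("-")]
        return {"type": "APT-GET-INSTALL",
                "children": [{"type": "PACKAGES", "values": pkgs}]}
    return {"type": tokens[0].upper(), "raw": line}

node = parse_bash_line("apt-get -y install fortune cowsay lolcat")
# -> APT-GET-INSTALL node with PACKAGES ["fortune", "cowsay", "lolcat"]
unknown = parse_bash_line("fortune | cowsay | lolcat")
# -> UNKNOWN node, since fortune is not a recognized common command
```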

4.4. Entity Extraction and Entity Relationship Mining.
Entity extraction can be performed on the ASTs generated by parsing. The entities we focus on are mainly base images and software packages. The information about a base image can be obtained from the child nodes of the From node. We can traverse the subtrees rooted at the %post and %runscript nodes to find the PACKAGES nodes carrying package information. During the traversal, we can obtain the installation method of the packages from nodes such as APT-GET-INSTALL, YUM-INSTALL, and PIP-INSTALL. While traversing the tree, the appearance order of software packages in the PACKAGES nodes is also recorded, which facilitates the subsequent mining of package dependencies.

The main purpose of software package association mining is to find the predecessor and successor relationships between software packages [29]. We use the Apriori algorithm to mine package dependencies [30]. If the confidence of the association rule pkg1 → pkg2 is 1.0, pkg2 is always installed whenever pkg1 is known to be installed; we can then consider pkg2 a dependency of pkg1. The mining algorithm is shown in Algorithm 1. We set min_support to the reciprocal of the minimum frequency of software packages so that the dependencies between packages can be mined to the greatest extent, and we set the minimum confidence to the commonly used value of 0.8.
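To make the rule-mining idea concrete, the following minimal sketch (illustrative, not Algorithm 1 itself; a full Apriori implementation would also prune candidates by min_support) computes support counts and confidence for pairwise rules over the package lists extracted from recipes:

```python
from collections import Counter
from itertools import permutations

def mine_rules(recipes, min_conf=0.8):
    """Return {(a, b): confidence} for rules a -> b whose
    confidence = count(a and b) / count(a) meets the threshold."""
    single = Counter()
    pair = Counter()
    for pkgs in recipes:
        seen = set(pkgs)                      # one vote per recipe
        single.update(seen)
        pair.update(permutations(seen, 2))    # ordered co-occurrence
    rules = {}
    for (a, b), n_ab in pair.items():
        conf = n_ab / single[a]
        if conf >= min_conf:
            rules[(a, b)] = conf
    return rules

recipes = [
    ["numpy", "scipy"],
    ["numpy", "scipy", "matplotlib"],
    ["numpy"],
]
rules = mine_rules(recipes)
# scipy -> numpy has confidence 2/2 = 1.0 and is kept;
# numpy -> scipy has confidence 2/3 and is pruned.
```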

4.5. Knowledge Fusion and Knowledge Graph Construction.
Knowledge fusion [31] unifies and standardizes the knowledge extracted from different recipes. The acquired knowledge is uniformly encoded, with all entities as nodes and all entity relationships as edges; this unified code is the unique identifier of the entity or relationship in the knowledge graph. Finally, the knowledge is stored in the graph database Neo4j (see Figure 6), which contains 2,832 entities and 62,614 relationships. The standardized knowledge base supports the subsequent customized generation of recipes. The entities and their attributes are shown in Table 2, and the entity relationships are shown in Table 3.

5. Implementation of Burner
The most important modules of Burner are the entity recommendation module and the installation order inference module. In the recommendation module, we improve the tag-based recommendation method to avoid the influence of popular tags and popular items on the recommendation results. In the installation order inference module, we use graph algorithms to complement package dependencies and determine the package installation order.

5.1. Tag-Based Base Image and Package Recommendation.
A tag is a label that the user attaches to a recipe, but a recipe contains a base image and multiple package entities, and there is no direct relationship between tags and these entities. Moreover, the most common software, such as git and wget, is present in nearly all recipes. How to find the entities in the recipes that best represent a tag is therefore a problem worth studying.
In Figure 7, we count the number of occurrences of tags; it can be observed that the tags conform to a long-tailed distribution [32]. To better meet users' personalized needs, we design a tag-based recommender system inspired by the idea of TF-IDF [33]. The simplest tag-based recommendation method counts how many times items are tagged in order to recommend them: in an actual system, according to the tags selected by the user, the corresponding most popular items are returned as recommendations. However,

the apparent disadvantage of this recommendation method is that popular tags and popular items carry considerable weight, which dramatically reduces the novelty of the recommendation results. To this end, we optimize the tag-based recommendation algorithm by drawing on the idea of TF-IDF. The core of our algorithm is the observation that if a software package appears under a certain tag and hardly appears under other tags, the package can be considered a core package of that tag. As shown in Formulas (1), (2), and (3), $n_{i,j}$ represents the number of times that $pkg_i$ is marked with $tag_j$, and $\sum_k n_{k,j}$ represents the total number of times that all software packages are marked with $tag_j$. $|Tag|$ indicates the total number of tags, and $|\{j : pkg_i \in t_j\}|$ represents the number of tags used to mark $pkg_i$. In Formula (3), the term $|\{j : pkg_i \in t_j\}| + 1$ prevents the denominator from being 0:

$$TF(pkg_i, tag_j) = \frac{n_{i,j}}{\sum_k n_{k,j}}, \quad (1)$$

$$IDF(pkg_i) = \log \frac{|Tag|}{|\{j : pkg_i \in t_j\}| + 1}, \quad (2)$$

$$score(pkg_i, tag_j) = TF(pkg_i, tag_j) \times IDF(pkg_i). \quad (3)$$

Table 3: Entity relationship types.

Relation type | Explanation
Include | The relationship between a recipe and its base image or software package, indicating that the recipe contains a certain base image or software package
In | The relationship between a base image or software package and a recipe, indicating that the base image or software package is included in the recipe
RelyOn | The relationship between a package and a base image, meaning that the package depends on the base image for installation and configuration
IsFoundationOf | The relationship between a base image and a package, meaning that the image is a starting image for the package to install and configure
IsTagOf | The relationship between a tag and a recipe, indicating that the tag is a label of the recipe
HasTag | The relationship between a recipe and a tag, indicating that the recipe has the tag it points to
Previous | The relationship between packages, indicating that the current package appears in the recipe before the package it points to
Next | The relationship between packages, indicating that the current package appears in the recipe after the package it points to

5.2. Dependency Complement and Order Inference.
Algorithms 2 and 3 show the complete process of package dependency complementation and package order inference. Algorithm 2 is implemented with depth-first search (DFS), taking advantage of the transitive nature of package dependencies. After a user specifies the core packages related to the application, some packages that these core packages depend on may not yet be included. In the graph, a package dependency is represented as a directed edge between the package and its dependency.
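The transitive complement idea can be sketched as follows; the function name and dependency map are illustrative (the real Algorithm 2 walks dependency edges stored in Neo4j), but the DFS collects exactly the reachable set described above:

```python
def complete_dependencies(core_pkgs, deps):
    """Iterative DFS: starting from the user-selected core packages,
    follow dependency edges transitively and return every package
    that must end up in the installation set."""
    to_install = set()
    stack = list(core_pkgs)
    while stack:
        pkg = stack.pop()
        if pkg in to_install:
            continue
        to_install.add(pkg)
        stack.extend(deps.get(pkg, []))   # push direct dependencies
    return to_install

# Illustrative dependency graph: matplotlib -> numpy -> python3.
deps = {"matplotlib": ["numpy"], "numpy": ["python3"], "python3": []}
# complete_dependencies(["matplotlib"], deps)
# -> {"matplotlib", "numpy", "python3"}
```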
After obtaining the dependencies of the packages to be installed, the inference module adds the dependency packages, together with the core packages specified by the user, to the final set of packages to be installed. Figure 8 shows the possible dependencies of software packages in the graph. In practice, we only consider the Previous relationship, because Previous and Next appear in pairs and are functionally equivalent. The packages pointed to by Previous edges in the subgraph come first in the order.
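Once the full set is assembled, a topological sort over the dependency subgraph yields a feasible installation order. A minimal Kahn-style sketch (illustrative names and graph representation, not Algorithm 3's exact code) that repeatedly emits zero-out-degree nodes:

```python
from collections import deque

def install_order(deps):
    """Topological sort: deps maps each package to the packages it
    depends on (its outgoing edges); every package must appear as a key.
    Packages with out-degree zero depend on nothing and go first."""
    out_degree = {p: len(d) for p, d in deps.items()}
    dependents = {p: [] for p in deps}        # reverse edges
    for p, ds in deps.items():
        for d in ds:
            dependents[d].append(p)
    queue = deque(p for p, deg in out_degree.items() if deg == 0)
    order = []
    while queue:
        pkg = queue.popleft()
        order.append(pkg)
        for dep in dependents[pkg]:           # remove pkg's incoming edges
            out_degree[dep] -= 1
            if out_degree[dep] == 0:
                queue.append(dep)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

deps = {"python3": [], "numpy": ["python3"],
        "matplotlib": ["numpy", "python3"]}
# install_order(deps) -> ["python3", "numpy", "matplotlib"]
```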
The order inference in Algorithm 3 uses topological sorting to sort all the packages to be installed. The core idea of topological sorting is to continuously remove the nodes with zero out-degree from the package subgraph until the subgraph is empty. If a node has no outgoing edges, the package it represents can be installed directly without depending on other packages. In each iteration, a node with out-degree zero is added to the sorted list, and all dependency edges of that node are removed from the subgraph. Repeating this process until the graph is empty finally yields the installation order of the packages.

6. Evaluation

6.1. Burner Demonstration.
Burner is friendly even to HPC users with no computer background. When generating recipes with Burner, users do not need to perform any text input; throughout the interaction, they can complete the generation of customized recipes using selection operations alone.
In Figure 9, we take the nextflow tag as an example to demonstrate the generation of a Singularity recipe. Figure 9(a) shows the recommendation of software packages and base images based on the tags selected by the user; the recommended entities are displayed in a dynamic word cloud. The user can then select the required base image and software packages and add them to the material library (see Figure 9(b)); the back end of the material library uses the Redis database to quickly perform operations such as additions and deletions. As shown in Figure 9(c), after completing the selection of materials, the user can also specify the type of recipe; currently, the system supports the Singularity Definition File and the Dockerfile. Users can preview the generated recipe online or edit, download, and delete it (see Figure 9(d)).

Algorithm 3: Package installation order inference algorithm.

To quantify the performance of our recipe parser, we define a parsing success rate (PSR). After parsing is completed, we uniformly mark the nodes that cannot be parsed as UNKNOWN (the gray nodes in Figures 4 and 5). $|Node_{total}|$ represents the total number of nodes in the AST, and $|Node_{unknown}|$ represents the number of UNKNOWN nodes. The definition of PSR is shown in Formula (4); a larger PSR means that the parser is stronger:

$$PSR = \frac{|Node_{total}| - |Node_{unknown}|}{|Node_{total}|}. \quad (4)$$

Evaluation Metrics for Recommendation Performance.
In practical application, we use tags for Top-N recommendation [34]. Recommended entities include base images and packages. We use the mainstream Precision, Recall, and F1 metrics to measure the recommendation performance of the system. Their definitions are shown in Formulas (5), (6), and (7), where $R(u)$ denotes the set of recommended entities and $T(u)$ the set of entities actually used:

$$Precision = \frac{|R(u) \cap T(u)|}{|R(u)|}, \quad (5)$$

$$Recall = \frac{|R(u) \cap T(u)|}{|T(u)|}, \quad (6)$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}. \quad (7)$$

We use the build success rate (BSR) as a functional indicator of the system: whether an image can be successfully built intuitively reflects the quality of the generated recipe. The definition of BSR is shown in Formula (8), where $|Recipe_{total}|$ represents the total number of generated recipes and $|Recipe_{bs}|$ represents the number of recipes that successfully build container images:

$$BSR = \frac{|Recipe_{bs}|}{|Recipe_{total}|}. \quad (8)$$
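The recommendation metrics above can be sketched directly (a straightforward set-based implementation; the variable names follow the notation in the text and are not tied to Burner's code):

```python
def precision_recall_f1(recommended, actual):
    """Compute Precision, Recall, and F1 for one recommendation list
    (R(u)) against the entities actually used (T(u))."""
    hits = len(set(recommended) & set(actual))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(["numpy", "scipy", "wget", "git"],
                               ["numpy", "scipy", "pandas"])
# p = 2/4 = 0.5, r = 2/3, f1 = 4/7
```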

6.3. Experimental Results and Analysis
6.3.1. Two-Phase Parsing Method. We performed statistical analysis on the ASTs generated from 1,000 recipes. The density histograms of the distribution of UNKNOWN nodes in the two parsing stages are shown in Figure 10.
After the first stage of parsing, 28.3% of the nodes are marked as UNKNOWN, as shown in Figure 10(a). As shown in Figure 10(b), in the second stage the density of recipes with high PSR increases significantly, and the PSR of some ASTs even reaches 100%. On average, only 18.2% of the nodes could not be parsed after the second stage, so the PSR increased by 10.1%. We then analyzed the recipes with low PSR and found that the main reason was that some bash commands in these recipes were not commonly used or the bash statements were nested too deeply. The results show that our two-phase parsing method can effectively perform recipe parsing and entity extraction.

6.3.2. Recommendation Performance Test.
Taking the recommendation of software packages as an example, we compare four recommendation methods: UserCF, ItemCF, TB, and TB-TFIDF. UserCF and ItemCF do not use tag information and take only user and item information as input; TB simply uses statistical information, and TB-TFIDF improves TB with TF-IDF. UserCF and ItemCF have two hyperparameters, K and N: K is the number of users whose interests are most similar to the target user, and N is the number of items recommended to the user. After grid search and tuning, the optimal value of K is set to 80. Figures 11 and 12 show the performance of the four methods under different values of N; the detailed results are shown in Table 4. It can be observed that as N increases, the Precision of UserCF and ItemCF declines markedly. TB-TFIDF is relatively stable, with Precision usually above 10%. In terms of the F1 value, which measures the overall performance of a recommender system, TB-TFIDF is also the best. The average number of packages contained in each recipe is 23, so in practical applications we set N to 20 or 25. The experiments show that TB-TFIDF performs best; it also does not suffer from the cold-start problem, which makes it better suited to the actual application scenario.

6.3.3. Image Build Test.
In the image build test, 50 different tags are selected from the tag list by a random function. For each tag, entity recommendation and automatic recipe generation are performed by the Burner system. In the recipe corpus of this paper, operating system images account for a large proportion of all base images and are dominated by Ubuntu and CentOS, which together account for more than 80% (see Figure 13). Therefore, for the sake of standardization, the base images are uniformly designated as ubuntu:16.04 and centos:7. For the same tag, we use Burner to generate both types of recipes (Singularity Definition File and Dockerfile).
Figure 13: Types and proportions of OS images in recipes.

After the image build test, we used the Hadolint tool to check the generated Dockerfiles; the results are shown in Table 6. Violations such as DL3006 and DL4000 have been eliminated in the automatically generated Dockerfiles, and DL3008 and DL3013 have also been reduced. DL3008 and DL3013 cannot be fully eliminated because the knowledge of software packages and their versions in the current knowledge base is not yet sufficient; it can be foreseen that as the knowledge base is enriched, the DL3008 and DL3013 problems will be further improved.
Finally, we analyzed the build logs of the recipes that failed to build and found that the causes of failure included environment variable setting errors, missing compilation statements, and "apt-get update" network errors. Solving these problems requires further manual measures such as configuring environment variables. The highest BSR is 80%, which shows that the system can effectively help users write recipes, and the Hadolint results also show that the recipes automatically generated by Burner are of high quality.

7. Conclusion and Future Work
Compared with a one-phase parsing method, the two-phase parsing method we designed can parse recipes more efficiently. We use the extracted knowledge to build a relatively complete domain knowledge base. On the one hand, the knowledge graph enables a fine-grained representation of knowledge; on the other hand, graph data and graph algorithms can better solve the dependency problem. Automatically generating recipes from this knowledge can greatly reduce the burden on developers. The recipe generation system Burner can meet the individual needs of different users while improving the correctness of recipes. The design of Burner revolves around the two core issues of automation and personalization, and the automatic generation of recipes is finally achieved through the construction of the knowledge base and the recommendation of entities.
At present, the amount of knowledge in the prototype system is still relatively small, and dependency inference over this knowledge base may lack version information. In the future, higher-quality recipe generation can be achieved by expanding the scale of the knowledge base. In addition, the software packages in our knowledge base are all officially packaged software (OPS) registered in public repositories, which can be installed with package management tools such as apt-get and pip. Some unofficially packaged software (UOPS) cannot be automatically downloaded and installed by package management tools: UOPS usually requires specifying a download address and performing operations such as decompression, changing directories, and compilation to install. Further research is required for UOPS.
The new versions of Docker and Singularity have added a multi-stage build function, which supports separating the compilation environment from the running environment and allows multiple FROM instructions. This feature can greatly reduce the size of the final image. We plan to support multi-stage builds: we will collect recipe examples that use multi-stage builds and extend our research to support this function.

Data Availability
The original recipe dataset and parsing results can be accessed at https://github.com/jhshz520/BurnerRecipe.

Conflicts of Interest
The authors declare that they have no conflicts of interest.