Lolisa: Formal Syntax and Semantics for a Subset of the Solidity Programming Language

This article presents the formal syntax and semantics for a large subset of the Solidity programming language developed for the Ethereum blockchain platform. This subset is denoted as Lolisa, which, to our knowledge, is the first mechanized and validated formal syntax and semantics developed for Solidity. The formal syntax of Lolisa adopts a stronger static type system than Solidity for enhanced type safety. In addition, Lolisa not only includes nearly all the syntax components of Solidity, such as mapping, modifier, contract, and address types, but it also contains general-purpose programming language features, such as multiple return values, pointer arithmetic, struct, and field access. Therefore, the inherent compatibility of Lolisa allows Solidity programs to be directly translated into Lolisa with a line-by-line correspondence without rebuilding or abstracting, and, in addition, the inherent generality of Lolisa allows it to be extended to express other programming languages as well. To this end, we also present a preliminary scheme for extending Lolisa to other languages systematically. In recent work, we developed a general, extensible, and reusable formal memory (GERM) framework that can simultaneously support different formal verification specifications, particularly at the code level, for verifying the properties of programs based on higher-order logic theorem proving. The framework simulates physical memory hardware structure, including a low-level formal memory space, and provides a set of simple, nonintrusive application programming interfaces and assistant tools using Coq that can support different formal verification specifications simultaneously. The proposed GERM framework is independent and customizable, and was verified entirely in Coq.
We also developed an extension of the Curry-Howard isomorphism, denoted as execution-verification isomorphism (EVI), which combines symbolic execution and theorem proving to increase the degree of automation in higher-order logic theorem-proving assistant tools. To capitalize on these results, the semantics of Lolisa follows EVI, and is designed based on the GERM framework using natural semantics that observes both terminating and diverging executions. Therefore, in conjunction with the formal interpreter developed in Coq in the present work, it is theoretically possible for programs written in Lolisa to be symbolically executed in higher-order theorem-proving assistants directly, and to have their properties verified automatically at the same time. The semantics of Lolisa are validated, and we certify that Lolisa satisfies EVI. This work is part of our project to build a general and powerful formal symbolic process virtual machine for certifying and verifying smart contracts operating on the blockchain platform easily and automatically without consistency problems.


Introduction
The blockchain platform [1] is one of the emerging technologies developed to address a wide range of disparate problems, such as those associated with cryptocurrency [6] and distributed storage [14]. Ethereum is one of the most widely adopted blockchain systems. One of the most important features of Ethereum is that it implements a general-purpose Turing-complete programming language denoted as Solidity [7]. This allows for the development of arbitrary applications and scripts that can be executed in a virtual runtime environment denoted as the Ethereum Virtual Machine (EVM) to conduct blockchain transactions automatically. These applications and scripts (i.e., programs) are collectively denoted as smart contracts. The growing use of smart contracts has led to increased scrutiny of their security. Smart contracts can include particular properties (i.e., bugs) making them susceptible to deliberate attacks that can result in direct economic loss. Some of the largest attacks on smart contracts are well known, such as the attacks on the decentralized autonomous organization (DAO) [9] and Parity wallet [10] contracts. In fact, many classes of subtle bugs, ranging from transaction-ordering dependencies to mishandled exceptions, exist in smart contracts [11].
One of the challenges that must be confronted in the development of smart contracts is that the programming process differs from that of normal programs. Smart contracts are a special type of digital contract, which means the source code is the law, so, as in contracts written in natural language, the obligations and terms should be stated explicitly in the smart contract. However, smart contract developers are generally programmers rather than legal experts, and their grasp of obligations and terms is secondary to their grasp of programming. So, even though a programming language is a type of formal language, smart contracts contain many legal loopholes caused by the programming habits of programmers. An additional challenge is that conventional software testing methods, such as α and β testing, do not perform well for smart contracts because of the near impossibility of patching a contract once it has been deployed, owing to the anonymous Byzantine execution environment of a public blockchain. Moreover, these loopholes are harder to find than common bugs, because they often cause economic loss silently instead of making the program crash. Finally, software engineering techniques employing such static and dynamic analysis tools as Manticore [12] and Mythril [13] have not yet been proven to be effective at increasing the reliability of smart contracts.
Due to these difficulties, formal verification for the blockchain platform has become a subject of particular interest in recent years because it is one of the most powerful program verification technologies. Many research works focus on the formal verification of Solidity bytecode. For example, KEVM [18] is a formal semantics for the EVM written using the K-framework, like the formalization conducted in Lem [19]. KEVM is executable, and therefore can run the validation test suite provided by the Ethereum foundation. The symbolic reasoning conducted for KEVM programs involves specifying properties in Reachability Logic, and verifying them with a separate analysis tool.
While these represent currently available mechanized formalizations of operational semantics, axiomatic semantics, and formal low-level programming verification tools for EVM and Solidity bytecode, they are not well suited to high-level programming languages, such as Solidity.
In response, the Ethereum community has placed open calls for formal verification proposals [15] as part of a concerted effort to develop formal verification strategies [16].
In other fields of computer science, a number of interesting studies have focused on developing mechanized formalizations of operational semantics for different high-level programming languages. For the C language, the Cholera project [32] represents one of the most interesting works that formalized the static and dynamic semantics of a large subset of C using the HOL proof assistant. The CompCert project [33] is another influential verification work for C and GCC that developed a formal semantics for a subset of C denoted as Clight. This work formed the basis for VST [34] and CompCertX [35]. In addition, a number of interesting formal verification studies have been conducted for operating systems based on the CompCert project. Tews et al. [36] developed a denotational semantics for a subset of the C++ language, which was presented as a shallow embedding in the PVS prover. For Java, the work of Igarashi et al. [37] is particularly inspiring because it presents a minimal core calculus for Java and Generic Java (GJ), and the important core properties are certified. Another similar work has been conducted for proving Java type soundness [38]. In addition, the operational semantics of JavaScript have been investigated [39], which is of particular importance to the present study because Solidity is a JavaScript-like programming language. However, not all these frameworks can be executed in higher-order logic theorem-proving assistants directly.
In addition, the formal syntax and semantics of programming languages play a very important role in several areas of computer science, particularly in program verification. For advanced programmers and compiler developers, formal semantics provide a more precise alternative to the informal English descriptions that usually pass as language standards. In the context of formal methods, such as static analysis, model checking, and program proof, formal semantics are required to validate the abstract interpretations and program logic (e.g., axiomatic semantics) used to analyze and verify programs. The verification of programming tools such as compilers, type-checkers, static analyzers, and program verifiers is another area where formal semantics for the languages involved is a prerequisite. However, the development of high-level formal specifications of Solidity and relevant formal verification tools has attracted considerably less interest from researchers despite its importance for programming and debugging smart contract software. Therefore, work that fills this gap is urgently needed.
Higher-order logic theorem proving is one of the most rigorous technologies for verifying the properties of programs. This involves establishing a formal model of a software system, and then verifying the system according to a mathematical proof of the formal model.
However, numerous problems regarding reusability, consistency, and automation are encountered when applying theorem-proving technology to program verification. One of the available solutions for addressing these problems is to design a formal symbolic process virtual machine (FSPVM) in a higher-order theorem-proving system such that real-world programs can be symbolically executed, and their properties verified automatically using the execution result. However, the successful implementation of such an FSPVM must overcome a number of challenges. In a recent work [3], we addressed some of these challenges by developing a general, extensible, and reusable formal memory (GERM) framework based on higher-order logic using Coq. It includes a formal memory space, and provides a set of simple and nonintrusive memory management application programming interfaces (APIs) and a set of assistant tools. The GERM framework can express the interaction relationships between special and normal memory blocks. On the one hand, the framework functions independently of higher-level specifications, so it can be used to represent intermediate states of any high-level specifications designated by general users, which facilitates the reuse of intermediate representations in different high-level formal verification models. On the other hand, the framework can be used as an operating environment to facilitate automated higher-order logic theorem proving. We also developed an extension of the Curry-Howard isomorphism (CHI), denoted herein as execution-verification isomorphism (EVI), which can combine theorem proving and symbolic execution technology in the operating environment of the GERM framework to facilitate automated higher-order logic theorem proving. The use of EVI makes it possible to execute a real-world program logically while simultaneously verifying the properties of the program automatically in Coq or using another proof assistant that supports higher-order logic proving based on CHI without suffering consistency problems.
The work presented in this paper also aims to address an additional challenge associated with designing an FSPVM in a higher-order theorem-proving system, as part of our ongoing project to build a general and powerful FSPVM for certifying and verifying smart contracts operating on multiple blockchain platforms. This challenge involves formalizing real-world programming languages as an extensible intermediate programming language (IPL), and integrating the IPL into the logic operating environment. Here, the formal syntax and semantics of the IPL should be equivalent to those of the respective real-world target programming languages. Such an IPL contributes to addressing the reusability and consistency problems because it standardizes the process of building a formal model for programs. As such, the equivalent formal versions of programs written in the IPL can serve as their formal models without additional abstracting or rebuilding. The developed formal interpreter based on the GERM framework and EVI should be able to automatically execute the formal version of programs written in the IPL in the logic operating environment. Therefore, the present article capitalizes upon our past work by defining the formal syntax and operational semantics for a large subset of the Solidity programming language. This subset is denoted herein as Lolisa, which, to our knowledge, is the first mechanized and validated formal syntax and semantics developed for Solidity. Lolisa has the following features.
 Consistency
Lolisa formalizes most of the types, operators, and mechanisms of Solidity, including reference arithmetic, references to functions, and struct, address, and mapping types, as well as the contract inheritance mechanism, and it includes nearly all of the Solidity syntax according to the Solidity documentation [21]. In addition, we build a standard library based on Lolisa to represent the built-in data structures and functions of EVM, such as msg, block, and send. As such, programs written in Solidity can be translated into Lolisa, and vice versa, with a line-by-line correspondence without rebuilding or abstracting, which are operations that can negatively impact consistency.

 Static Type System
The formal syntax in Lolisa is defined using generalized algebraic datatypes (GADTs) [2], which imparts static type annotation to all the values and expressions of Lolisa. In this way, Lolisa has a stronger static type system than Solidity for checking the construction of programs. As such, it is impossible to construct ill-typed terms in Lolisa, which also assists in discovering ill-typed terms in Solidity source code. Moreover, the formal syntax ensures that all expressions and values in Lolisa are deterministic.

 Executable and Provable
In contrast to similar efforts focused on building formal syntax and semantics for high-level programming languages, the formal semantics of Lolisa are defined based on the GERM framework in conjunction with EVI. As such, when combined with a formal interpreter, it is theoretically possible for programs written in Lolisa to be symbolically executed directly in higher-order logic theorem-proving assistants, in the same way as programs are executed in the real world, and to have their properties verified automatically at the same time.

 Mechanized and Validated
The syntax and semantics of Lolisa are mechanized using the Coq proof assistant [5]. We also develop a formally verified interpreter in Coq to validate whether Lolisa satisfies the above Executable and Provable feature and the meta-properties of the semantics. The details regarding the implementation of our formal interpreter will be presented in our next paper.

 Extensible and Universal
Although Lolisa is designed for Solidity, it includes many general features of other general-purpose high-level programming languages. As such, the core syntax and semantics of Lolisa can be extended to formalize other similar programming languages. We therefore provide a preliminary scheme to systematically extend Lolisa to support different high-level programming languages.
The remainder of this paper is structured as follows. Section 2 introduces preparatory work about the modification of the GERM framework to support the semantics of Lolisa. Section 3 elaborates on the formal abstract syntax of Lolisa, and compares this with the formal abstract syntax of Solidity. Section 4 presents the module system of Lolisa to formalize the behavior of contract inheritance and member access. Section 5 presents the formal dynamic semantics of Lolisa, including the program execution semantics and the formal standard library for the built-in data structures and functions of EVM. Section 6 describes the integration of the Lolisa programming language and its semantics within the formally verified interpreter and its validation. Section 7 introduces the extensibility and universality of Lolisa, and proposes a preliminary scheme to systematically extend Lolisa to support different high-level programming languages. Finally, Section 8 presents the conclusions of our work.

Preparation Work
Before defining the formal specifications of Lolisa, it is necessary to complete some preparatory work that defines the basic environment.

Predefinitions
Table 1 summarizes the helper functions used in the dynamic semantic definitions. State functions calculate commonly needed values from the current state of the program, and all of these state functions will be encountered in the following sections. Components of specific states will be denoted using the appropriate Greek letter subscripted by the state of interest.
First of all, because all Lolisa formal specifications are constructed on the GERM framework, the context of the formal memory space is denoted as , the context of the execution environment is represented as ℰ, and we assign  to denote the set of memory addresses and employ the meta-variable  to represent an arbitrary address. Specifically, the function return address    is, in the current version of Lolisa, assumed to be the next address after a function identifier. In addition, struct is an important data structure in Lolisa. Therefore,  represents the Lolisa struct information context, and  is employed to represent the set of pointers of struct types. Because the following type judgements may involve variables, our typing rules refer to variable-typing contexts, which we denote as ,  1 , etc. Such contexts are finite mappings from variable names to types. We also assign  as the native value set of the basic logic system. Programs may also contain references to the declared functions of a Solidity program, so another map, from function identifiers to types, is needed; for brevity, it is written ,  1 , etc. As shorthand, we assign ℱ to represent the formal system combination of , , , , , and . All these terms carry structural information about different parts of Lolisa and of the formal system, and they are needed throughout the following discussion.

Modification of the GERM framework
As discussed, the dynamic semantics of Lolisa are designed based on our GERM framework, which is also employed to represent or generate the intermediate memory state. We write  to denote the specific memory state. The level of the formal memory space simulates a real-world physical memory structure, and consists of formal memory blocks used to store information, and the formal memory addresses used to index the respective memory blocks. Because of the formal memory space definition employed in the GERM framework, we can define special memory addresses to index special memory blocks isolated from the normal memory blocks. We define the formal memory space architecture by enumerating the memory blocks using the Record type to simulate the physical memory space intuitively in the higher-order logic system of Coq. In the formal specification based on these useful features of the Record type, the memory address denoted by address is the field identifier of the record type memory, and each field can record a term denoted as value. Each formal memory block can be abstracted as a Cartesian product ⟨m_addr, m_value⟩: address*value, where the metavariable m_addr is an arbitrary memory address, and the metavariable m_value is the value term stored in m_addr, which includes the input data, and the respective data environment and memory block state information. In the original version of the proposed framework, value can record 11 basic datatypes, including undefined, machine integer, Boolean, floating point, string, and array datatypes, in addition to pointers for variables, parameters, and functions, as well as program statement and struct datatypes.
The level of low-level memory management operations analyzes requests for high-level memory management operations, and interacts with the formal memory space to generate the resulting memory state for those operation requests. Finally, the operation requests are executed at this level. The interactions on this level involve distinct low-level operations for normal memory blocks and special memory blocks. In the low-level memory management layer, the label address type is a transitional type in Coq that is employed to provide a simple memory address identifier for operation functions, and to isolate the low-level formal memory space from high-level formal specifications. We generally adopt the metavariable L_address to represent a label address.
However, to support the syntax and semantics of Lolisa, we must modify and extend the GERM framework. The modification includes: 1) defining special label addresses and respective memory blocks for the built-in data structures and functions of EVM; 2) specifying abstract definitions of the environment and memory block states; 3) extending the range of types of m_value.
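To make the block structure concrete, the following Python fragment is a minimal sketch of a memory space whose blocks are pairs ⟨m_addr, m_value⟩ and whose special blocks are isolated from ordinary writes. It is not the Coq Record used in the GERM framework: the class name Memory, the methods read and write, the privileged flag, and the address names "_msg" and "_block" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical label addresses; GERM enumerates them as fields of a Coq Record.
NORMAL_ADDRESSES = frozenset(range(4))               # blocks for user data
SPECIAL_ADDRESSES = frozenset({"_msg", "_block"})    # assumed names for built-ins


@dataclass
class Memory:
    """A formal memory space: each block is a pair <m_addr, m_value>."""
    blocks: Dict[Any, Any] = field(default_factory=dict)

    def write(self, addr, value, *, privileged=False):
        """Return a NEW memory state; special blocks reject ordinary writes."""
        if addr in SPECIAL_ADDRESSES and not privileged:
            raise PermissionError(f"special block {addr!r} is isolated from user writes")
        if addr not in NORMAL_ADDRESSES and addr not in SPECIAL_ADDRESSES:
            raise KeyError(f"unknown address {addr!r}")
        new_blocks = dict(self.blocks)   # copy, so earlier states stay intact
        new_blocks[addr] = value
        return Memory(new_blocks)

    def read(self, addr):
        """Return the stored value, or None for a block that was never written."""
        return self.blocks.get(addr)
```

Returning a fresh state from every write mirrors the idea that each operation request generates a resulting memory state, so intermediate states remain available for later reasoning.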
First, Lolisa operates at the high-level specification in the GERM framework, such that the low-level memory address is transparent for Lolisa, and it can only operate on label addresses. Therefore, for simplicity, the label address will be denoted as memory address or address in the following discussion. Each identifier of Lolisa programs (including identifiers for variables, parameters, struct types, functions, and contracts) is bound to an address that indexes the respective memory block in the GERM framework. In the formal system of Coq or similar proof assistants, an identifier and the address bound to it are equivalent logic symbols, which can be abstracted as rule ID-ADDR. Therefore, in the following sections, it can be assumed that each identifier represents a memory address.

𝑖𝑑𝑒𝑛𝑡𝑖𝑓𝑖𝑒𝑟 ≝ 𝛼. (ID-ADDR)
However, Solidity employs some built-in structures and functions in addition to the structures designed by users, and we assume in Lolisa that these built-in components are correct and trusted. These data structures and functions are defined and implemented using Lolisa syntax in advance, and we package them as a standard library, which will be discussed in detail in a later section. Obviously, these must also be stored in memory blocks indexed by memory addresses. To avoid overwriting these standard data during the execution (verification) of formal programs in theorem-proving assistants, we add some special addresses and respective memory blocks in the GERM framework to isolate these standard data structures from user data. Currently, we support the built-in data structures and functions send (also denoted as transfer in current EVM), call, msg, address, and block. Their addresses are defined as follows.
Special address:   ::
In addition, we introduce some special functions in the following sections that need no special addresses, such as require [21].
Second, we specify the abstract definitions of the execution environment and memory block states, which are used to store the logic information of the native formal system, and are provided to the dynamic semantics of Lolisa. The abstract definitions of the environment and memory block states are given according to Solidity specifications below as Redefinitions 1 and 2.
Redefinition 1 (execution environment; env) In the current framework for Lolisa, the execution environment (denoted as env) includes five components:
 ℎ: a list of terms with   type to store inheritance relationships between different contracts;
   : terms with   type to store the super-scope of the current scope;
   : terms with   type to store the current scope;
   : terms with ℕ type to store the level of the current scope, where levels 0, 1, and 2 represent a local variable, a contract member variable, and a global variable, respectively;
 : represents the remaining amount of gas [21] in the current execution environment.
The inductive definition of environment follows the rule ENV-IND given below.
≝  〈ℎ,   ,   ,   , 〉 (ENV-IND)
We formally define env according to the rule ENV-TERM below. Here, env represents an arbitrary term constructed by only a single constructor Env according to the rule CST-ENV below.
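As an informal reading of Redefinition 1, the environment can be pictured as an immutable record with five fields built by a single constructor. The Python sketch below is only an illustration: the field names, the use of integers for addresses, and the helper consume_gas (including its policy of returning None when gas is exhausted) are assumptions, not definitions taken from the Coq formalization.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Address = int  # assumption: label addresses modeled as plain integers


@dataclass(frozen=True)
class Env:
    """Execution environment with the five components of Redefinition 1."""
    inherits: Tuple[Address, ...]    # inheritance relationships between contracts
    super_scope: Optional[Address]   # super-scope of the current scope
    scope: Optional[Address]         # current scope
    level: int                       # 0 local, 1 contract member, 2 global
    gas: int                         # remaining gas


def consume_gas(env: Env, cost: int) -> Optional[Env]:
    """Return the environment after paying `cost`, or None if gas runs out.

    Hypothetical helper: it only illustrates that every step yields a new
    environment term rather than mutating the old one.
    """
    if cost > env.gas:
        return None
    return replace(env, gas=env.gas - cost)
```

Because the record is frozen, an interpreter built on it has to thread the environment explicitly from one step to the next, which is close to how a functional definition in a proof assistant would treat it.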
 : This represents whether a current memory block is presently allocated. In Coq formalization, it is defined as an inductive type, and has two representations: 
 : This refers to the access modifier used to represent the access authority of the current memory block. In Coq formalization, it is defined as an inductive type, and has three representations: 
 : The GERM framework employs a strongly typed memory space. Therefore, if a memory block is allocated to an identifier having a type , the memory block will be initialized by the respective initial memory value that satisfies the mapping relation with type .
As such,   will store a Lolisa type annotation that represents an expected type of the respective memory block indexed by the current identifier.This is defined in detail in the following section.
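The three components of the memory block state (allocation status, access modifier, and expected type) and the typed initialization described above can be sketched in Python as follows. This is an illustration only: the enumeration members, the TYPE_TABLE mapping from type annotations to initial values, and the helpers allocate and store are assumed names, and the source does not state which three access modifiers or which two allocation states are used.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Tuple


class Alloc(Enum):
    """Whether a block is currently allocated (member names are assumptions)."""
    OCCUPIED = "occupied"
    UNOCCUPIED = "unoccupied"


class Access(Enum):
    """Access modifier of a block (these three names are assumptions)."""
    PUBLIC = "public"
    PROTECTED = "protected"
    PRIVATE = "private"


# Assumed mapping: type annotation -> (Python carrier type, initial value).
TYPE_TABLE = {
    "Tbool": (bool, False),
    "Tint": (int, 0),
    "Tstring": (str, ""),
}


@dataclass(frozen=True)
class BlockInfo:
    alloc: Alloc
    access: Access
    expected_type: str


def allocate(expected_type: str, access: Access) -> Tuple[Any, BlockInfo]:
    """Allocate a block and initialize it with the initial value of its type."""
    _, initial = TYPE_TABLE[expected_type]
    return initial, BlockInfo(Alloc.OCCUPIED, access, expected_type)


def store(info: BlockInfo, value: Any) -> Any:
    """Accept `value` only if it matches the expected type of the block."""
    carrier, _ = TYPE_TABLE[info.expected_type]
    if info.alloc is not Alloc.OCCUPIED:
        raise ValueError("block is not allocated")
    if type(value) is not carrier:  # exact match, so a bool is not accepted as an int
        raise TypeError(f"expected {info.expected_type}, got {type(value).__name__}")
    return value
```

The exact type comparison in store is a deliberate choice for this sketch: Python treats bool as a subclass of int, which would otherwise let a Boolean slip into an integer block.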
We formally define b_infor according to the rule BLC-TERM below. Here, b_infor represents an arbitrary term constructed by only a single constructor Blc according to the rule CST-BLC below. In addition,   is contained within .
  :  (BLC-TERM)
 and VALUE-SEND,   is defined by rule MAP-V below, where the symbol ⊝ is adopted to represent deleting rather than listing its rules redundantly.
In Coq formalization,   is defined as a new type whose constructors are prefixed by i.

Formal Syntax of Lolisa
Lolisa is a large formal subset of Solidity that is structured into type annotations, values, expressions, statements, functions, and modules, and its syntax is organized into five levels: the type level, value level, expression level, statement level, and module level. In Coq formalization, the formal abstract syntax is presented as inductive predicates, and therefore achieves a deep embedding of Lolisa in Coq. The formal abstract syntax tree of Lolisa is nearly equivalent to that of Solidity, in that it contains nearly all the components of the original Solidity syntax. However, owing to the stated goal of Lolisa, it also includes some modifications to ensure more effective program verification. One of the differences between Solidity and Lolisa is that the Lolisa syntax of values and expressions is redefined as strongly typed using GADTs, which allows us to define types of syntax constructors directly. As discussed, it is therefore impossible to construct ill-typed terms, and the formal syntax ensures that all expressions and values in Lolisa are deterministic. Moreover, the formal static and dynamic semantics are more easily defined and understood, and the evaluator for these semantics is easier to read and write, as well as being more efficient. In addition, the type annotations not only transmit the type information, but can also store the value information, which will be discussed in detail in the following subsections. Other differences are that the current version of Lolisa omits some unnecessary qualifiers, does not support the inline assembly of Solidity, and only wei units are supported implicitly.
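The effect of typing the syntax constructors can be approximated outside Coq. The following Python sketch is not the Lolisa syntax: the node classes IntLit, BoolLit, Add, and Less and the function evaluate are hypothetical, and, unlike GADTs, the check here happens when a term is constructed at run time rather than during type checking. It only illustrates that once ill-typed terms cannot be built, the evaluator needs no type checks of its own.

```python
from dataclasses import dataclass
from typing import Union


def _require(expr, typ):
    """Reject an operand whose type tag differs from the one the constructor needs."""
    if expr.typ != typ:
        raise TypeError(f"expected {typ}, got {expr.typ}")


@dataclass(frozen=True)
class IntLit:
    value: int
    typ = "Tint"      # class-level type tag, not a dataclass field


@dataclass(frozen=True)
class BoolLit:
    value: bool
    typ = "Tbool"


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"
    typ = "Tint"

    def __post_init__(self):
        _require(self.left, "Tint")
        _require(self.right, "Tint")


@dataclass(frozen=True)
class Less:
    left: "Expr"
    right: "Expr"
    typ = "Tbool"

    def __post_init__(self):
        _require(self.left, "Tint")
        _require(self.right, "Tint")


Expr = Union[IntLit, BoolLit, Add, Less]


def evaluate(expr):
    """Evaluate a term; no type checks are needed because terms are well-typed."""
    if isinstance(expr, (IntLit, BoolLit)):
        return expr.value
    if isinstance(expr, Add):
        return evaluate(expr.left) + evaluate(expr.right)
    if isinstance(expr, Less):
        return evaluate(expr.left) < evaluate(expr.right)
    raise TypeError(f"not an expression: {expr!r}")
```

For instance, Less(Add(IntLit(1), IntLit(2)), IntLit(5)) is accepted and carries the tag Tbool, whereas Add(IntLit(1), BoolLit(True)) is rejected at construction time.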

Types
The formal abstract syntax of Lolisa types is given in Figure 2. Supported types include arithmetic types (integers in various sizes and signedness), byte types, array types, mapping types, as well as function types and struct types. Although Solidity is a JavaScript-like language, it supports pointer reference. Therefore, Lolisa also includes pointer types (including pointers to functions) based on label address specification.
Furthermore, these type annotations and relevant components can be easily formalized by enumerating inductively in Coq or other higher-order logic theorem-proving assistants. Lolisa does not support any of the type qualifiers such as const, volatile, and restrict, and these qualifiers are simply erased during parsing. The types fill two roles in Lolisa. Firstly, they serve as type declarations of identifiers, and, secondly, they serve as tags to specify the constructor types of values and expressions. In Coq formalization, the term  is declared as type type according to rule TYPE-TERM below.
:  (TYPE-TERM)
Note that many types are defined in Figure 2 as parameterized types recursively. In this way, a specific type is dependent on the specified parameters, and can abstract and express many different Solidity types.
One of the most important data types of Solidity is the mapping type. In the Solidity documentation [21], mapping types are declared as mapping(_KeyType => _ValueType). Here, _KeyType can be nearly any type except for a mapping, a dynamically sized array, a contract, and a struct. As shown in Figure 2, it is defined as Tmap(  , ), where   represents the _KeyType and  represents the _ValueType. The best way to keep the terms in Lolisa well-typed and to ensure type safety is to maintain type isolation rather than adding corollary conditions.
Therefore, we define a coordinate type   for the _KeyType employed in mapping. In particular, the address types in Lolisa are treated as a special struct type, so that _KeyType is allowed to be a struct type in Lolisa. In Coq formalization,   shares the same constructors as  except for Tmap, and a term with type   is recorded as   according to the rule MAP-TYPE-TERM below. and ] are both still with type . Therefore, the array type and mapping type of Lolisa can be abstracted as rules EXP-ARR-N' and EXP-MAP-N', in which   represents the recursive type declaration.
Array types in Solidity can be marked by the qualifiers storage and memory, which divide array types into storage arrays and memory arrays, respectively. For storage arrays, the element type can be arbitrary. For memory arrays, the element type cannot be a mapping. In Lolisa, we omit these two qualifiers, and set storage arrays as the default because a memory array is a special case of a storage array, and the difference between the memory array and storage array can be easily checked by the Solidity compiler directly. We classify  and   into normal form types and non-normal form types. The normal form types are the types whose typing rules have no recursive definition, and the non-normal form types are the remaining ones. For example, the normal form of  (  , ) should be Tbool. As defined in Figure 2, in Lolisa, the non-normal form types are the array type and mapping type, and the normal form types include Tundef, Tint(signedness, intsize), Tbool, Tstring, Tfloat, Tbytes(bytesize), Tstruct(), Tvid(), Tpid(), Tfid(), Tcid(), Tstt and Tmodi. The predicate () represents a type  ∈   , . And, as shown in EXP-ARR-N and EXP-MAP-N above, the type of value yielded by evaluation is the final type   , which cannot be an array or mapping type. Therefore,   should satisfy the predicate (  ).
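The distinction between normal form and non-normal form types can be illustrated with a small recursive function. The Python sketch below assumes a reading in which the normal form of an array or mapping type is obtained by repeatedly descending into the element type or the value type until a non-recursive type is reached; the classes Base, Tarray, and Tmap and the functions is_normal and normal_form are names chosen for this illustration, and the array size parameter is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Base:
    """A normal form type such as Tbool, Tstring, or Tint(signedness, intsize)."""
    name: str


@dataclass(frozen=True)
class Tarray:
    elem: "Type"
    size: int          # assumed parameter; Figure 2 is not reproduced here


@dataclass(frozen=True)
class Tmap:
    key: "Type"
    value: "Type"

    def __post_init__(self):
        # The key type has every constructor of a type except Tmap.
        if isinstance(self.key, Tmap):
            raise TypeError("a mapping cannot be used as the key type of a mapping")


Type = Union[Base, Tarray, Tmap]


def is_normal(t) -> bool:
    """Predicate: the type has no recursive structure."""
    return isinstance(t, Base)


def normal_form(t):
    """Descend through arrays and mappings to the final type of stored values."""
    while not is_normal(t):
        t = t.elem if isinstance(t, Tarray) else t.value
    return t
```

Under these assumptions, a mapping from integers to arrays of Booleans, Tmap(Base("Tint"), Tarray(Base("Tbool"), 3)), has the normal form Base("Tbool").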
Specifically, the Tundef type serves as the void type. The integer and byte types, denoted as Tint(signedness, intsize) and Tbytes(bytesize), fully specify the bit size and the signedness. Typically, the interpreter maps Tint to a signed integer with a size I64. Fixed point number types (Tfloat) are not yet fully supported by Solidity, and, although they can be declared, they cannot be modified by side effect. Therefore, float types in Lolisa are only supported in basic operations.
The struct type denoted as Tstruct() in Lolisa requires special treatment. Solidity is a very flexible language that allows the usage of struct datatypes and access to their respective fields under a number of different conditions. In an effort to retain the flexibility of Solidity while maintaining strict formal type rules in Lolisa, we separate the struct formal syntax into four levels, and assign different functions to the definition of each level. These different definitions are given in Convention 1. Basically, Tstruct() is used at the type level as an annotation for variable declaration. The parameter  is also an identifier specified by a memory address that points to the specific struct type, which is declared at the statement level and stored in memory space. The details about struct definition at the value level, expression level, and statement level will be introduced in Sections 3.2, 3.3, and 3.4, respectively.

Convention 1. Formal struct datatype
Lolisa includes four kinds of pointer types: variable pointers, parameter pointers, function pointers, and contract pointers, all of which are formally defined as dependent types. Each type is specified by an optional memory address oα with the type    , which is used to express the case of the NULL pointer. Specifically, if a reference identifier points to a NULL pointer or fails to be allocated a logic memory address, oα is None. Otherwise, oα is Some α. However, the NULL pointer is invalid in the semantics for Solidity during evaluation. During evaluation, a formal interpreter or compiler can index the respective memory block by  directly, and this design is used for the extension explained in Section 7.
In the subsequent discussion, we employ the construction , ,  ⊢ () to represent that type  is well-formed in the Lolisa struct information context , assuming that all pointers to struct types with tags in the set  are also well-formed. In addition, types may require a label address, so the parameter  is necessary. Furthermore, the construction  ⊢ () represents the case in which  and  are equal to the empty set, which is our typical meaning when employing the term well-formed. First, the normal form types are well-formed:
(WF-MAP)
Henceforth, our meta-variables τ, τ_1, τ_2, etc. will range only over well-formed types.

Values
Most similar formalization studies focusing on high-level programming languages do not include a value-level formalization. Our motivation for defining a value level for Lolisa is that the goal of this project is to formalize mechanized syntax and semantics for a subset of the Solidity language that can be executed and verified directly in higher-order logic theorem-proving assistants. Therefore, Solidity values must be evaluated like the native values of the formal system. The ideal situation would be to employ the values of Solidity, or those of some mainstream high-level programming language, explicitly in the formal system. However, owing to the strict type system of the trusted core and the adoption of different paradigms, nearly all higher-order logic theorem-proving assistants, such as Coq (whose specification language is Gallina), do not directly support complex values, such as array values and mapping values. It is therefore essential to define an interlayer between the values of real-world languages and the native values of the formal system, which can represent real-world values directly with an equivalent syntax and can translate them into native values using the formal semantics. The syntax of values employed in Lolisa is given in Figure 3. All values are annotated with the types defined in the previous subsection. In fact, value is a type that depends on a specific type signature, as given by the rule LOS-VAL below.

𝑣𝑎𝑙: 𝑡𝑦𝑝𝑒 → 𝑇𝑦𝑝𝑒 (LOS-VAL)
In this way, all values in Lolisa are well-typed, and it is impossible to construct an ill-typed value such as Vint(true). In addition, the type information is transmitted to the expression level to keep values well-typed there. For example, suppose a value v has type   1 , and the constant expression Econst, which will be defined in Section 3.3, has type ∀ (: ),   →  ⇓  ⇓  . Then the  in Econst(v) is determined by  1 . In this way, the type information of the value level is transmitted to the expression level. More specific details are explained in the next section. We assign v to represent terms with type (∀ τ: type, val τ), as given by the rule LOS-VAL-TERM below.

𝑣: (∀ 𝜏: 𝑡𝑦𝑝𝑒, 𝑣𝑎𝑙 𝜏) (LOS-VAL-TERM)
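The idea that every value carries its type signature, so that an ill-typed value such as Vint(true) cannot be built, can be approximated in Python as follows. This is only a sketch: in Coq the rejection happens statically through the dependent type val τ, whereas here it is a runtime check, and the names Val, vbool, and vint are hypothetical.

```python
from dataclasses import dataclass


class Tbool:
    """Type-level tag for Booleans."""


class Tint:
    """Type-level tag for integers."""


@dataclass(frozen=True)
class Val:
    """A value paired with its type tag, loosely mirroring val: type -> Type."""
    tau: type
    payload: object


def vbool(b) -> Val:
    """Build a Boolean value; anything other than a Python bool is rejected."""
    if not isinstance(b, bool):
        raise TypeError("Vbool expects a Boolean payload")
    return Val(Tbool, b)


def vint(n) -> Val:
    """Build an integer value; bool is excluded although it subclasses int."""
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError("Vint expects an integer payload")
    return Val(Tint, n)
```

With these constructors, vint(True) raises a TypeError instead of silently producing an integer value that holds a Boolean.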
Another important component of the values in Lolisa is the parallel mapping value   , which corresponds to   . In this way, type safety and type isolation can be transmitted from the type level to the value level. Note that we do not distinguish between values and mapping values in what follows, because the syntax of mapping values is nearly equivalent to that of values, which is defined in Figure 3. The only differences between their respective syntaxes are that mapping values are annotated by   and that mappings are not defined recursively. These differences do not affect the value-level formalization.
As was done for types, the value formalizations are also classified into normal and non-normal forms. However, unlike the classification principle for types, the normal form value   signifies that a value can access the native value of the formal system directly, rather than by searching and matching values stored in the memory space, and that   types have a one-to-one correspondence with   types. Moreover,   types can easily be formalized using inductive enumeration. For example, the rule EXM-VAL-BOOL below presents the typing rule of Boolean values, while the remaining   typing rules are defined in an equivalent manner.
As such, values are obviously well-formed, and are written as ().
Typically, field access is defined as a value rather than as an expression or statement. This is the case in Solidity, which allows both field access and method access to be defined as values. An example is shown by the Solidity code segment given in Figure 4 [22], where the members of the struct pledge[i] are invoked as values, which is common usage in Solidity. Therefore, field access is formalized in Lolisa at the value level. The formal type assessment of field access is defined as the rule VAL-FIELD below.

,, ⊢  0  1 : ,, ⊢ℎ:   0 , ⊢:   ,,, ⊢ :  ( ) ∧ () ,,, ⊢ :(∀  0  1 :,(   0 )→   → ( )→  1 ) ,,, ⊢ ((ℎ,,)) (VAL-FIELD)

Figure 4. Smart sponsor contract code [22]

Note that, if the final member of the member list is a function call, we employ the list ival to transmit the arguments. Here, ival, defined in Figure 3, is similar to val, except that it does not include a type annotation. It is unnecessary to limit the types of the arguments, because they are checked by the helper function in the dynamic semantics. Finally, the rule VAL-STR below defines the formal type assessment of struct. As given by Convention 1, struct is also treated in Lolisa as a value, which is used to represent an expression value in the right-hand position with a struct type stored in memory space.
Following the above definitions, we can assume that the values in Lolisa are well-formed, and our meta-variables v, v_1, v_2, etc. will range only over values satisfying ().

Expressions
Having formally specified all the possible forms of values that may be declared and manipulated in Solidity programs, we now discuss the expressions used in programs to wrap values. The formal syntax of expressions is given in Figure 5. All expressions and their sub-expressions are annotated by two type signatures according to the rule EXPR-TYPE below.

𝑒𝑥𝑝𝑟: 𝜏 0 → 𝜏 1 → 𝑇𝑦𝑝𝑒 (EXPR-TYPE)
Here, τ_0 refers to the current expression type and τ_1 refers to the normal form type after evaluation. For instance, we would define the type of an integer variable expression e as   ()  . In this way, the formal syntax of expressions becomes clearer and more abstract, and the type safety of Lolisa expressions can be maintained strictly. In addition, combining the two type annotations makes it possible to define a very large number of different expressions based on equivalent constructors. Of course, the use of τ_0 and τ_1 may be subject to different limitations depending on the situation.
Within expressions, Lolisa does not support assignment operators (=, +=, ++, etc.) or function calls; it supports only those Solidity operators that are free of side effects during evaluation. In Lolisa, assignments and function calls are presented as statements and cannot occur within expressions. In addition, unary assignment operators, such as the increment and decrement operators, are treated as syntactic sugar. As a consequence, all Lolisa expressions terminate during evaluation and are pure, i.e., their evaluation incurs no side effects. Syntactic expressions can therefore be used safely as components of logical assertions, which makes it much easier to define axiomatic semantics such as Hoare logic and separation logic. Likewise, abstract interpretation and other forms of static analysis are greatly simplified if expressions are pure, because most static analysis and program verification tools begin by removing assignments and function calls from expressions, and only then perform analyses over the resulting pure expressions [23].
The expressions of Lolisa are classified into four categories: constant expressions, location reference expressions, special expressions, and operator expressions, each of which is discussed in detail below. Therefore, τ_0 and τ_1 should satisfy the limitation TYPE-FORM given below.
To satisfy the limitation TYPE-FORM, the array types and mapping types should be analyzed and simplified according to the type definitions given by Figure 2 into   ∈   , which can be formulated as   ⊢  →  ′ → ⋯ →   ∧   ∈   . We denote this process as ⇓  . Therefore, the type assignment of constant expressions is defined as the rule EXPR-CONS below.
Location reference expressions include the identifiers of variables, parameters, functions, and contracts, which are summarized as eaddr in Figure 5. For the type assignment of location reference expressions, the types of the constructors are defined as functors. For example, Evar has the type defined by FUNCTOR-VID below.

𝜆 (𝑜𝛼 : 𝑜𝑝𝑡𝑖𝑜𝑛 𝐿 𝑎𝑑𝑑𝑟 ). 𝜆 (𝜏 : 𝑡𝑦𝑝𝑒). (𝑒𝑥𝑝𝑟 (𝑇𝑣𝑖𝑑 𝑜𝛼) 𝜏) (FUNCTOR-VID)
Here, the GADT-style definition takes an optional address (the variable identifier) and another type as parameters to denote a specific type. Note that the address is optional because it must also cover the NULL pointer, which is reserved as extensible space for extending Lolisa to support more general-purpose programming languages, although the NULL pointer is invalid in the current Lolisa semantics for Solidity. The type assignment of location reference expressions is defined as the rules EXPR-PAR, EXPR-VAR, EXPR-FUN, and EXPR-CON below for the identifiers of parameters, variables, functions, and contracts, respectively.
Here, the value list is defined as   because, while the formal syntax of Lolisa values depends on an arbitrary type τ, the list elements should be of equivalent types. Thus, we can define a wrapper in the Coq formalization that hides the type τ.
For Emodifier expressions, Solidity includes the modifier as a special kind of functor [21] that can limit the behavior of the functions annotated by it. Therefore, the only function of Emodifier in Lolisa is to declare an identifier for a modifier, and thereby maintain type isolation from normal functions. It is constructed using Efun to represent the identifier, and employs the list ival to transmit its parameters, as given by the typing rule EXPR-MODI below.
For operator expressions, Lolisa supports nearly all binary and unary operators. In the Coq formalization, binary and unary operators are abstracted as two inductive types  2 and  1 that are also defined by GADTs, and specific operators serve as their constructors. In this way, operator expressions are clearer and more concise, and can be extended more easily than when employing a weaker static type system. The binary and unary operators are annotated by two type signatures, as given in rules EXPR-BOP-TYPE and EXPR-UOP-TYPE below, respectively.
2 :  0 →  1 →  (EXPR-BOP-TYPE)
As was presented for the standard rule EXPR-TYPE, τ_0 refers to the operator input type and τ_1 refers to the output type after evaluation. Therefore, τ_0 and τ_1 must satisfy τ_0, τ_1 ∈   . For example, the conjunction operator for Boolean values is defined as :  2  . In addition, due to the requirement for type limitation, each operator belongs to an operator class. For example, we define distinct equivalence operators for integer values, Boolean values, and address values as :  2  (, ) , :  2  , and :  2  , respectively.
On the one hand, this method maintains strong type limitations for each operator and ensures that the behavior of each operator is deterministic. As a result, clear rules are given to the compiler and interpreter. On the other hand, the method ensures that Lolisa operators can be extended easily by adding new operator constructors. The Coq formalization of Lolisa includes nearly 100 kinds of specific operator constructors for  2 and  1 . For the remainder of this paper, we adopt   () to simplify the formal abstract syntax.
Based on the operator formalizations, we define the formal typing rules of operator expressions according to rules EXPR-UOP and EXPR-BOP below. The static typing rules guarantee that operator expressions are well-formed: it is impossible to construct ill-formed operations, such as "error" + 1, because they cannot pass the GADT-style formal type-checking rules defined above. In particular, type casting is defined as a unary operator because it can be seen as transforming the type τ_1 of an expression into a type τ_2.
In the following discussion, our meta-variables e, e_1, e_2, etc. will range only over well-formed expressions satisfying ().
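The way typed operator constructors rule out expressions such as "error" + 1 can be sketched in Python. This is an illustration under stated assumptions rather than the Coq definition: GADTs enforce the constraint at type-checking time, whereas the sketch checks it when a node is constructed, and the names Const, BinOp, Ebop, ADD_INT, EQ_INT, and AND_BOOL are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    """Constant expression annotated with its type tag."""
    tau: str
    value: object


@dataclass(frozen=True)
class BinOp:
    """Operator tag with one input type and one output type."""
    name: str
    tau_in: str
    tau_out: str


@dataclass(frozen=True)
class Ebop:
    """Binary operator expression; its type is the operator's output type."""
    op: BinOp
    left: object
    right: object

    def __post_init__(self):
        # Reject operands whose type differs from the operator's input type.
        for operand in (self.left, self.right):
            if operand.tau != self.op.tau_in:
                raise TypeError(
                    f"{self.op.name} expects {self.op.tau_in}, got {operand.tau}"
                )

    @property
    def tau(self) -> str:
        return self.op.tau_out


# One constructor per operator class, in the spirit of the distinct
# equivalence operators for integers, Booleans, and addresses.
ADD_INT = BinOp("add_int", "Tint", "Tint")
EQ_INT = BinOp("eq_int", "Tint", "Tbool")
AND_BOOL = BinOp("and_bool", "Tbool", "Tbool")
```

For instance, Ebop(EQ_INT, Ebop(ADD_INT, Const("Tint", 1), Const("Tint", 2)), Const("Tint", 3)) has type tag "Tbool", while Ebop(ADD_INT, Const("Tstring", "error"), Const("Tint", 1)) raises a TypeError. Adding a new operator only requires a new BinOp value, which reflects the extensibility argument made above.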

Statements
Figure 6 defines the syntax of Lolisa statements. Nearly all the structured control statements of Solidity (i.e., conditional statements, loops, structure declarations, modifier definitions, contracts, returns, multi-value returns, and function calls) are supported, but Lolisa does not support unstructured statements such as goto or unstructured switches like the infamous "Duff's device" [29]. As previously discussed, the assignment  1 =  2 of a right-value (r-value)  2 to a left-value (l-value)  1 , modifier declarations, function calls, and structure declarations are all treated as statements. In addition, statements are also classified into normal form and non-normal form categories, where the normal form statement, given as   , represents a statement that halts after being evaluated. In fact, although Solidity is a Turing-complete language, smart contract programs written in Solidity always halt in practice, because their execution is limited by gas [25], which we have defined in ℰ of Lolisa.
Because the formal syntax of Lolisa is defined using GADTs, all Lolisa statements are well-formed. For example, Solidity or other formal languages such as Clight [24] allow erroneous statements such as (if ("error")  1  2 ) or (bool b = 4) to be formed. Although such errors are discovered during compilation, they can seriously affect the evaluation of programs in higher-order theorem-proving assistants. Constructing ill-typed statements is not possible in Lolisa, however, because the type annotations are fixed in the Lolisa formal abstract syntax tree. For example, the r-value   and l-value   in the rule STT-ASSIGN must have equivalent final types, and the return type of the condition expression in the rule STT-IF must be Boolean. In addition, the current statement  0 in the rule STT-SEQ cannot be a sequence, because this would create a confusing program structure. Errors are therefore discovered through static type checking, which ensures that all statements are well-formed. In addition, conditional and sequence statements are well-formed only if their sub-statements are not function declarations or contract declarations. We denote the sequence of sub-statements by   ().
The specification of a function in Solidity is given in [21], where square brackets ([]) indicate an optional component. This specification is too complex to express clearly using a single typing rule. Therefore, single-value-return functions (Fun) and multi-value-return functions (Funs) are defined separately to maintain clear and well-formed definitions, and we adopt the option type to indicate optional components. The variable pars represents a list of parameters that stores expressions whose current type must be ( ). Owing to the strong typing provided by GADTs, the expression constructor Epar in the higher-order logic system of Coq is a family of types in which each specific parameter has a different type. Therefore, pars is defined as a heterogeneous list, which allows its elements to have different types. The qualifier access indicates that functions come in three flavors: public, protected, and private. In addition, the return types of Funs statements are stored in an additional list. To ensure that such statements are well-formed, their bodies must not contain function or contract declarations. Note that anonymous functions are forbidden in Lolisa because all functions must have a binding identifier to ensure that they are well-formed. The formal typing rules of Fun and Funs are defined according to the rules STT-FUN and STT-FUNS below, respectively.
Here, the result can be either assigned to an l-value term by defining syntactic sugar or discarded. The variable fpars is similar to pars in the function declaration rules, and the expressions allowed in the list are not limited by type annotation, except that they cannot be Emodifier expressions. Note that the new statement in Solidity is treated as a special function call in Lolisa, which can only invoke the constructor function of a contract. In addition, new can be encapsulated as syntactic sugar.
Loops in Lolisa include the while loop ( ℎ ) and the for loop (  ), which have been commonly defined in other similar studies [24], and the formal definitions in Lolisa are equivalent. We have not defined the do-while loop; if needed, it can be defined as syntactic sugar using a while loop. The formal rules of   and  ℎ are defined as STT-FOR-LOOP and STT-WHILE-LOOP below, respectively.
(STT-CON)
Because most higher-order theorem-proving assistants are pure, the inheritance mechanism is formalized based on a module system that is explained in the following section. However, we retain the inheritance component in the Lolisa syntax for consistency checking, which is also explained in the following section. For a contract declaration to be well-formed, its body must not contain any other contract declarations.
Structure declarations are defined as statements because general-purpose high-level languages employ structure declarations as a way of defining new composite data types (or record types) according to the requirements of users. Statements   represents a list wrapper of names (defined by the string type) and types. In this way, Lolisa provides sufficient flexibility for programmers to define complex customized data types in the formal system. Figure 7 provides an example, where the left-hand side is a structure declaration statement in Solidity and the right-hand side is the equivalent declaration in Lolisa, which defines a complex recursive struct type, struct A { struct B {…; struct N;}…}, that effectively demonstrates the facility of the process. The formal typing rule for structure declarations is given below by the rule STT-STR.

(STT-STR)
A struct declaration is well-formed if all of its member types are well-formed and non-void, and if all of its member names are distinct. Finally, struct types with no members are forbidden. We assign () to denote the sequence of member names and types associated with a struct tag s. In addition, we assign  1 () to denote the sequence of member names, and  2 () to denote the sequence of member types, associated with struct tag s.
In the following discussion, our meta-variables ,  1 ,  2 , etc. will range only over well-formed statements satisfying ().
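The well-formedness conditions stated for struct declarations can be written down as a short Python check. This is a sketch only: a struct is modeled as a list of (name, type tag) pairs, and the function names, the string tags, and the is_well_formed_type parameter are hypothetical stand-ins for the corresponding Lolisa notions.

```python
def struct_is_well_formed(members, is_well_formed_type=lambda tag: True):
    """Check the conditions stated for struct declarations.

    members: list of (name, type_tag) pairs.
    is_well_formed_type: hypothetical predicate for member types.
    """
    if not members:  # struct types with no members are forbidden
        return False
    names = [name for name, _ in members]
    if len(set(names)) != len(names):  # member names must be distinct
        return False
    # every member type must be non-void (not Tundef) and well-formed
    return all(
        tag != "Tundef" and is_well_formed_type(tag) for _, tag in members
    )


def member_names(members):
    """Sequence of member names associated with a struct."""
    return [name for name, _ in members]


def member_types(members):
    """Sequence of member types associated with a struct."""
    return [tag for _, tag in members]
```

For example, a struct with members ("name", "Tstring") and ("amount", "Tint") passes the check, whereas an empty member list, a repeated member name, or a member of type Tundef is rejected.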

Module Semantics
Two problems must be addressed before we can formally define the dynamic semantics of Lolisa. First, Lolisa was developed as a formal subset of Solidity to facilitate contract program execution and verification in higher-order theorem-proving assistants. In contrast to standard program execution, this requires that all identifiers of the target programs be defined in advance of their use in the source code, and an unambiguous address must consequently be allocated to each identifier. It is easy to see that different functions may use the same variable identifiers to refer to different memory addresses. Because the variables of different functions live in different namespaces, it is valid for them to share identifiers in real-world programs. However, owing to its built-in mechanisms, the basic logic context of Coq and similar proof assistants includes only a single namespace. Moreover, once real-world programs have been translated into logic symbols, their identifiers must have unique addresses.

Table 2. A simple example of an erroneous definition in Coq.
As shown in Table 2, if we declare two identical identifiers for two different functions in a single logic context, Coq returns an error message when the second one is defined, and it is difficult to formally simulate different namespaces using the type system of Coq. Second, because most theorem-proving assistants are pure and based on lambda calculus, expressing the behavior of inheritance or implementing a formal compiler in such assistants is a difficult task. To address these two problems, we introduce the ML module system [27] into Lolisa. This has several advantages. First, a number of higher-order theorem-proving tools, such as Coq, already support the powerful ML module system. Second, the ML module system can define subtyping [28] and subdomains, which can therefore be defined directly in Lolisa. Third, although the ML module system and object-oriented programming languages employ different type systems, they have similar behaviors. For example, sub-modules in Coq can inherit types and members from their super-modules or imported modules. In fact, some mainstream languages, such as OCaml, support an ML module system and an object system simultaneously. However, the ML module system allows users to access the members of any module if the members are in an equivalent context. Therefore, we retain the inheritance component in the formal syntax of Lolisa to provide the formal interpreter with inheritance information for checking whether users access invalid modules while writing theorems manually. Of course, the best means of avoiding this is to associate a single main contract module and its related modules with a single context.
The abstract syntax of modules, including contract module declarations L, function module declarations N, and member declarations K, is given at the top of Figure 8. The meta-variables C, C_0, C_1, C_2, etc. range over contract module names; F, F_0, F_1, F_2, etc. range over function module names; ℳ and p range over arbitrary module sets and terms with module types; and id ranges over identifiers. We assign ℳ̅ to denote possible sequences ℳ_0, …, ℳ_n, and assign C̅, F̅, and i̅d̅ to denote similar sequences for C, F, and id, respectively. The predicate call(id) denotes the act of the source code accessing (applying) the respective identifier. According to the abstract syntax, a function module must be inside a contract module. In other words, Lolisa forbids a function module from being defined independently, and forbids a contract module from being defined inside a function module.

𝑺𝒚𝒏𝒕𝒂𝒙 𝛤[ℳ]
The module semantics can now be defined based on the abstract syntax shown in Figure 8. Modules are constructed from contracts, functions, and their recursive combinations. A subtype relationship between modules is denoted by A <: B, where A is a subtype of B. The rules BAS-C and BAS-F indicate that the subtype relation on contract modules and function modules is reflexive, while the rule TRANS indicates that the relation on contract modules is also transitive. In addition, the rules INHERT and IN-FUN indicate that a main contract module is a subtype of each module it inherits from (in Solidity, the keyword is denotes inheritance) and that a function module is a subtype of its super-contract module.
The remaining rules represent member access type assignments. The rule SING defines that the operation call(id) for an arbitrary identifier id in a module p is equivalent to accessing the id of p, i.e., call(p.id). The same holds if both the super-modules ℳ̅ and p have an equivalent identifier id. Therefore, we do not check whether other super-modules have equivalent identifiers, as given by the rule MULT-IN. When the identifier id exists in ℳ̅ but does not exist in p, the source code accesses the identifier in the nearest module, as given by the rule MULT-NOT-IN. Alternatively, as given by the rule MULT-OUT, the source code can access an id in ℳ directly, denoted as call(ℳ.id). The special identifier this employed in Solidity and other object-oriented languages is accommodated in Lolisa according to the rules THIS-T and THIS-F. Here, this serves a single function: if a function module F has an identifier combined with this (i.e., this.id), the identifier in the super-contract module of F is accessed (i.e., (.)). Otherwise, an incorrect result will be obtained.
We have thus formalized the domain separation and inheritance behavior of Lolisa. The following discussion of the dynamic semantics is based on these module semantics.
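The member access rules can be paraphrased as a name-resolution procedure. The Python sketch below is an informal reading of SING, MULT-IN, MULT-NOT-IN, and the this rules, not the Coq module encoding: modules are modeled as (name, identifier set) pairs, and the function names, as well as returning None for an unbound identifier, are assumptions made for illustration.

```python
def resolve(identifier, module, supers):
    """Resolve call(id) issued from inside `module`.

    module: (name, ids) for the current module p.
    supers: list of (name, ids) pairs, nearest super-module first.
    Returns the qualified name that is accessed, or None if id is unbound.
    """
    name, ids = module
    if identifier in ids:
        # SING / MULT-IN: p.id is used; super-modules are not inspected
        return f"{name}.{identifier}"
    for super_name, super_ids in supers:
        # MULT-NOT-IN: fall back to the nearest module that declares id
        if identifier in super_ids:
            return f"{super_name}.{identifier}"
    return None


def resolve_this(identifier, contract):
    """this.id inside a function module refers to the super-contract's id."""
    name, ids = contract
    return f"{name}.{identifier}" if identifier in ids else None
```

Suppose a function module deposit declares amount and total, its contract Main declares total and deposit, and Main inherits from Base, which declares owner. Then total resolves to deposit.total, owner resolves to Base.owner, and this.total resolves to Main.total.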

Formal Semantics
The deep correspondences established by the Curry-Howard isomorphism (CHI) make it very useful for unifying formal proofs and program computation. In brief, CHI proposes that a deep correspondence exists between the world of logic and the world of computation. This correspondence can be expressed according to three general principles: propositions as types, proofs as programs, and proofs as evaluation of programs. However, as discussed, most higher-order logic theorem-proving assistants are based on lambda calculus, whereas most mainstream programming languages employed in the real world are not, and cannot be analyzed directly in a higher-order logic environment. Programs written in these languages are very difficult or even impossible to verify directly and automatically using CHI. This forms the basis of EVI [3], which extends the formal relations of CHI to include three corollaries: proofs as evaluation of programs as execution of programs, properties as propositions as types, and verifications as proofs. Based on these corollaries, the correspondences of CHI can be deepened to obtain a fourth general principle: verifications as execution of programs. Accordingly, Lolisa is defined to be a formal subset of Solidity that can be executed and reasoned about in higher-order logic theorem-proving assistants based on EVI. The formal semantics are therefore based on the GERM framework, with memory management operations conducted through the APIs that the framework defines. As a result, we can guarantee type-safe memory access, because every memory block of the GERM framework stores the native logic information directly and employs specific types for different memory values.
We now formalize the dynamic semantics of Lolisa using natural semantics, also known as big-step operational semantics [4]. Because the static semantics of Lolisa (that is, the typing rules of its formal syntax) have been formally specified, the dynamic semantics are defined under the assumption that programs written in Lolisa are well-typed and, in particular, that the type annotations of expressions are consistent. However, programs written in Lolisa may still exhibit undefined behaviors when accessing memory space, owing to execution authority or other limitations. Clearly, we must know whether the behavior of a program is correct if we wish to reason about it in a higher-order logic world. Therefore, with reference to the basic API definitions of the GERM framework, we employ a monadic option type [30] to represent the different conditions of return values. Here, the return value is annotated as Some if it is meaningful and as None if it is nothing; otherwise, it is assigned an error message Error.
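The three-way result described above can be sketched in Python as follows. This is an illustrative approximation, not the GERM API: Nothing stands in for None to avoid clashing with Python's built-in None, and the read function, its memory model, and the bind helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Some:
    """A meaningful result."""
    value: object


@dataclass(frozen=True)
class Nothing:
    """An empty result (plays the role of None in the text)."""


@dataclass(frozen=True)
class Error:
    """A failed operation together with its error message."""
    message: str


def bind(result, step):
    """Run `step` on the payload of Some; propagate Nothing and Error unchanged."""
    if isinstance(result, Some):
        return step(result.value)
    return result


def read(memory, address):
    """Hypothetical memory read: Error for an invalid address, Nothing if unset."""
    if address not in memory:
        return Error(f"invalid address {address}")
    cell = memory[address]
    return Nothing() if cell is None else Some(cell)
```

Chaining evaluation steps through bind means that the first Error or Nothing short-circuits the remaining steps, which is what allows a proof to distinguish a correct result from an undefined behavior.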

Predefinitions of semantics evaluations
Each subsection of the semantics defines its own operators and miscellaneous notation. However, a number of general observations regarding notation can be given here. All the following subsections present an evaluation relation of the form  0 ⇓  0 〈  3 respectively summarize the helper functions used in the dynamic semantic definitions. One of the important aspects of the dynamic semantics of Lolisa is the environment information used to observe changes in the environment and to determine whether programs are executing in valid environments. This information is composed of two terms: the current environment env, which stores the current execution environment information, and the super-environment fenv, which stores the super-environment information of env. In the initial state, env and fenv are equivalent except for the gas value, because env stores the remaining amount of gas whereas fenv stores the minimal gas limit. The helper functions listed in Table 3 are typically used as abbreviations for relatively complicated expressions regarding states, and are not particularly interesting in their own right. A few of these functions and their components will be defined in the course of this section.

Evaluation of Values
The formal semantics of value evaluation involve evaluating Lolisa values, obtaining native value information that can be computed or reasoned about in the base formal system, and generating the respective memory values in the GERM framework. Here, we adopt the meta-symbol  to represent both a Lolisa value  and a mapping value   . This is possible because   has the same static typing rules as , except that it has no definitions related to mapping types. In addition, each value maps to a unique respective memory value.
Here, the operation ⇓   yields an offset based on a current base address indexed by a respective identifier. Then, the basic API   of the GERM framework employs the offset as a parameter to yield the final address. If   returns an error message, then the array address to which it pertains is illegitimate. Otherwise, the basic API  ℎ of the GERM framework adopts the final address as a parameter and attempts to extract the memory value stored in the respective memory block. Owing to the recursive array type definition in rule EXP-ARR-N', we should check the type   to determine whether information exists for the next dimension. According to the verification of the basic APIs of the GERM framework [3], we can assume that, if  ℎ is invoked successfully, the type  and the memory value satisfy the map relation. Then, we should call the rule EVAL-V-ARR2 again, as in the rule EVAL-V-ARR3.
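The array evaluation steps described above (compute an address from an offset, read the memory block, check the stored type, and recurse into the next dimension) can be sketched in Python. Everything concrete here is an assumption for illustration: memory is a dictionary from addresses to (type, payload) pairs, an inner array cell stores the base address of the next dimension, and get_addr and eval_array are hypothetical names rather than GERM APIs.

```python
def get_addr(base, offset, length):
    """Hypothetical address API: base + offset, or None if out of range."""
    return base + offset if 0 <= offset < length else None


def eval_array(memory, base, array_type, indices):
    """Evaluate an array access one dimension at a time.

    array_type: ("Tarray", length, elem_type); elem_type is either another
                such tuple or a normal form tag such as "Tint".
    Returns ("Some", value) or ("Error", message).
    """
    if not indices:
        return ("Error", "missing index")
    _, length, elem_type = array_type
    address = get_addr(base, indices[0], length)
    if address is None:
        return ("Error", "illegitimate array address")
    if address not in memory:
        return ("Error", "no memory value at address")
    stored_type, payload = memory[address]
    if stored_type != elem_type:
        return ("Error", "stored value does not match the element type")
    if isinstance(elem_type, tuple):  # information exists for the next dimension
        if len(indices) == 1:
            return ("Error", "missing index for the next dimension")
        return eval_array(memory, payload, elem_type, indices[1:])
    if len(indices) > 1:
        return ("Error", "too many indices")
    return ("Some", payload)
```

The recursive call plays the role of applying the array rule again for the next dimension, and each failure case yields an error instead of a value.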
For mapping values, the evaluation process has three parts. First, the operation ⇓   yields the key value (:    ), and  ℎ attempts to extract the mapping value stored in the initial address. If ⇓   or  ℎ fails, the evaluation process returns an error message. Otherwise, the evaluation process is simplified as   , which takes the results of ⇓   and  ℎ as parameters. Second, if   successfully obtains the memory block whose key is equal to   , then   extracts the mapping value and obtains the stored value through   . Here,  ℎ has already ensured that the memory block stores a mapping value, so it can be extracted by   directly. Finally, if a next dimension exists and the result of   is , then the next dimension is evaluated; otherwise, an error message is returned. The evaluation semantics of mapping values are defined according to the following rules.
At the value level, struct is only employed to represent a memory value with a struct type. Therefore, it is similar to a normal form value, and we need only extract the struct value by  ℎ directly, as given by the rule EVAL-V-STR below.

ℰ⊢𝑒𝑛𝑣,𝑓𝑒𝑛𝑣
⊢,    , ⊢      ℎ (,,  ,  )↪ ℰ,,ℱ ⊢〈,,,(  ,  )〉 ⇒ 〈,,, ⊑ {,   }〉 (EVAL-V-STR)
The semantics of field access are very complex in Solidity and consist of two parts: contract member access and struct field access. If a contract member access derives from an inheritance relationship or a special identifier such as this, the contract members can be accessed directly based on the ML module system, as discussed above. If a contract member access derives from a variable, the contract information stored in the respective memory block is searched, and the identified member is accessed according to the rule EVAL-V-CONS. For the second part of the field access semantics, Lolisa supports all kinds of struct field access, but introduces one convention: the intermediate members cannot be functions. For example, a struct field access A.B.f(a,b,c).C, where f(•) is a function, is forbidden in Lolisa. We need not worry about the built-in EVM functions in a standard structure, such as msg or block, because these have been defined in the Lolisa standard library in advance, as will be discussed later; therefore, they can be treated as normal structures in the semantics. We assign ⇓  to denote the process of evaluating a base address   and a struct-type address   . Then,   will take  0 ..  ,   , and   as parameters and attempt to obtain the memory value indexed by   . If   is invoked successfully, it returns a pair (,   ). Here,  actually refers to the address of  −1 because Solidity will take  −1 as an implicit argument if   is a function call pointer, such as

𝑎. 𝑠𝑒𝑛𝑑(𝑣, 𝑚𝑠𝑠) ≡ 𝑠𝑒𝑛𝑑(𝑎, 𝑣, 𝑚𝑠𝑠).
This indicates that the identifier a is a parameter of the send function during interpretation or compilation, which is a common usage in Solidity. Therefore, if the last member is a function pointer for a function call, the implicit address and the function input should be combined into a single argument to facilitate their transmission to the next level. The above evaluation process is summarized by the semantics EVAL-V-FIELD-T and EVAL-V-FIELD-F.

(EVAL-V-FIELD-F)

Finally, Solidity includes some special cases in which built-in EVM functions are invoked through special variable types, such as a mapping-type variable v invoking the built-in function length (v.length). This case is also handled by Lolisa standard library functions, which are discussed in Subsection 5.5 below.
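The implicit-receiver convention and the ban on intermediate function members can be illustrated with a minimal Python sketch. The dictionary-based memory, the function name eval_field, and the exception FieldAccessError are hypothetical illustrations and are not part of Lolisa or its Coq development; the sketch only mirrors the behavior described in the text.

```python
from typing import Any, Sequence


class FieldAccessError(Exception):
    """Raised when a member chain cannot be evaluated (illustrative only)."""


def eval_field(base: dict, members: Sequence[str], args: tuple = ()) -> Any:
    """Evaluate base.m0.m1...mn over nested dicts (a toy stand-in for memory).

    If the last member is callable, the struct that owns it is passed as an
    implicit first argument, so a.send(v, msg) behaves like send(a, v, msg).
    A callable in any earlier position is rejected.
    """
    if not members:
        raise FieldAccessError("empty member chain")
    current: Any = base
    for position, name in enumerate(members):
        if not isinstance(current, dict) or name not in current:
            raise FieldAccessError(f"unknown member: {name}")
        value = current[name]
        if callable(value):
            if position != len(members) - 1:
                raise FieldAccessError("intermediate members cannot be functions")
            # Combine the implicit receiver with the explicit inputs.
            return value(current, *args)
        current = value
    return current
```

Under these assumptions, a chain such as ["a", "send"] with inputs (v, msg) calls send with the struct stored under a as its first argument, whereas a chain in which a function appears before the last position raises FieldAccessError.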

Evaluation of Expressions
The semantics of expression evaluation are the rules governing the evaluation of Lolisa expressions into the memory address values of the GERM framework. This process includes two parts: l-value position evaluation and r-value position evaluation. Modifier expressions are a special case that cannot be evaluated according to these expression evaluation semantics; instead, they are evaluated according to the statement semantics given later.

Evaluating expressions in the l-value position
In the following, we assign an l-value variant of ⇓ to denote the evaluation of expressions in the l-value position, which yields the respective memory addresses. First, most expressions constructed by the Econst constructor cannot be employed as l-values, because most of them directly represent a Lolisa constant value at the expression level. For example,

𝐸𝑐𝑜𝑛𝑠𝑡(𝑉𝑏𝑜𝑜𝑙(𝑡𝑟𝑢𝑒)) ∶ 𝑒𝑥𝑝𝑟 𝑇𝑏𝑜𝑜𝑙 𝑇𝑏𝑜𝑜𝑙
represents a Boolean value at the expression level. As an exception, Lolisa allows l-value positions to employ the constant expressions specified by the value constructors Varray and Vmap. As mentioned previously, the value constructors Varray and Vmap are actually address pointers for indexing specific memory blocks. For example, an indexed assignment of the form a[i] = v is common in most general-purpose programming languages. Thus, they not only can represent memory values at the value level, but can also represent memory addresses at the expression level. This facilitates a simplification of the Lolisa syntax. For brevity, we assign dedicated search operators to denote the recursive processes of Varray and Vmap employed for searching indexed addresses, as defined in the previous subsection. Note that Vstruct and Vfield are forbidden from specifying expressions in the l-value position, even though they are also address pointers. This is because, for Vstruct, Evar can already represent any variable address of any type. Therefore, to avoid confusion, Vstruct can only represent a memory value at the value level, as required by Convention 1. For Vfield, both Solidity and Lolisa include a number of special structures, such as msg and block, whose members must not be altered arbitrarily. Moreover, in most cases, it is invalid in Solidity to modify the values stored in a field. Therefore, to ensure that Lolisa is well-formed and well-behaved, Vfield is forbidden from specifying expressions in the l-value position. The only means allowed in Lolisa of altering the fields of a structure is to use Estruct, either to change all fields or to declare a new field. Although this limitation may be unfriendly to programmers or verifiers, it avoids potential risks.
In the previous subsection, we defined the semantics of array values. Accordingly, we can define the address searching process on that basis. Taking array values as an example, as given by the rules EVAL-LEXP-ARR1 and EVAL-LEXP-ARR2 below, if ARR-SEARCH obtains the address successfully, the address can be transmitted to the statement level; otherwise, an error message is returned. We similarly define the rules of mapping values, as given by the rules EVAL-LEXP-MAP1 and EVAL-LEXP-MAP2 below.

(EVAL-REXP-CONS1)
(EVAL-REXP-CONS2)

Here, we note that, because constant expressions store Lolisa values directly, the results can be obtained by applying the value-level evaluation directly. At the expression level, a struct-type term in the r-value position is specified by Estruct. This is also the only means of initializing or changing the value of a struct-type term. The rules EVAL-REXP-STR1 and EVAL-REXP-STR2 below define this process.

(EVAL-REXP-STR2)

Here, if evaluating any member's value fails, the evaluation of Estruct yields an error message. Otherwise, the member's value set is obtained and the respective struct memory value is returned. The semantics of address pointer expressions are defined by the rules EVAL-REXP-ADDR1 and EVAL-REXP-ADDR2 below. According to the formal syntax definition of address pointer expressions, the results can be obtained directly by a checked read, as in EVAL-REXP-ADDR1. In particular, the expression Efun should read the function return address to obtain the result of the respective function pointer, as in EVAL-REXP-ADDR2. Finally, the semantics of binary and unary operations are defined according to the rules EVAL-REXP-BOP, EVAL-REXP-UOP, and EVAL-REXP-OP-F below.

In the present version of Lolisa, the above definition forbids mixed arithmetic operations, such as "int + float", because, as discussed previously, Solidity does not completely support the float datatype, and float values are also rarely employed in smart contract programs. Including mixed arithmetic operations would therefore add unnecessary complexity and computational burden to the formal interpreter or compiler implementation. Nothing in the semantic framework itself precludes mixed arithmetic operations, however, and the formal syntax of expressions preserves sufficient extensibility that users can extend Lolisa themselves if deemed necessary. In addition, the results of the binary and unary operation evaluation functions should be wrapped in the option monad type.

Although the formal syntax of expressions is designed using GADTs, so that expressions can be assumed to be well-formed and to follow the typing rules statically without invalid combinations, invalid behaviors can still occur when evaluating values. The simplest and most obvious invalid case is a divisor of 0 in a division operation. The specific design target of Lolisa requires that division by zero be addressed formally. As such, we adopt a return value of Some to express valid division results, and Error to represent undefined results caused by invalid division, which is expressed by rules EXM-1 and EXM-2 below.
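The Some/Error treatment of division can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes Some and Error and the function eval_div are hypothetical stand-ins for the option-style results described above, not the definitions used in the Coq development.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Some:
    """A valid evaluation result (hypothetical stand-in for the option type)."""
    value: int


@dataclass(frozen=True)
class Error:
    """An undefined result, e.g. one caused by an invalid division."""
    reason: str


Result = Union[Some, Error]


def eval_div(lhs: Result, rhs: Result) -> Result:
    """Divide two already-evaluated operands, propagating earlier errors."""
    if isinstance(lhs, Error):
        return lhs
    if isinstance(rhs, Error):
        return rhs
    if rhs.value == 0:
        # Division by zero is reported as a value instead of raising.
        return Error("division by zero")
    # Integer division truncating toward zero, as Solidity integer division does.
    quotient = abs(lhs.value) // abs(rhs.value)
    negative = (lhs.value < 0) != (rhs.value < 0)
    return Some(-quotient if negative else quotient)
```

Returning an explicit Error value keeps the evaluator total: every well-typed division yields a result that the caller must inspect, which is the role the option wrapper plays in the semantics.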

Evaluation of Statements
In the following, we assign a statement-level variant of ⇓ to denote the evaluation process of statements. Most evaluations employ two helper functions: an environment check and a gas deduction. The environment check takes the current environment env and the super-environment fenv as arguments, and checks conditions such as gas limitations and the congruence of execution levels. For example, if the domains in env and fenv are equal but the execution levels differ, the program is stopped and env is reset by fenv. This is formally defined in the rule ENV-F below, where the current statement is executed only if the result of the environment check is true; otherwise, the program stops and returns to the initial memory state.

(ENV-F)

The gas deduction helper takes the current statement and environment as arguments, deducts the gas recorded in env, and generates a new environment env' with the new gas amount. In Coq, it is implemented as a matching tree whose branches are the gas deduction cases following the gas price sheet in [40]. As defined by the rule GAS-F, if the gas deduction fails, the current program execution is stopped.

(GAS-F)

Contract declarations are among the most important statements of Solidity. In Lolisa, a contract declaration involves two operations. First, the consistency of inheritance information is checked by an inheritance-checking helper function, which takes the inheritance relations in the module context and the source code as arguments. Second, the initial contract information, including all member identifiers, is written into a designated memory block. The formal semantics of contract declaration are defined as EVAL-STT-CON below. Here, we assume that the current logical context based on the GERM framework includes sufficient logical memory space, such that each identifier has a valid and free address. Therefore, all indexed memory blocks have been initialized, and the write operation always succeeds.
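The environment check and gas deduction that guard each statement, as described earlier in this subsection, can be sketched as follows. All names here (Env, env_check, set_gas, step) and the cost table GAS_COST are hypothetical: the actual helper names and the gas price sheet of [40] are not reproduced, and the costs are made-up numbers chosen only to make the sketch runnable.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Made-up per-statement costs; the real values come from the gas price sheet.
GAS_COST = {"assign": 5, "declare": 3, "call": 40}


@dataclass(frozen=True)
class Env:
    domain: str  # identifier of the function or contract being executed
    level: int   # execution level
    gas: int     # remaining gas


def env_check(env: Env, fenv: Env) -> bool:
    """Return True only if execution may continue."""
    if env.gas <= 0:
        return False
    # Same domain but a different execution level is treated as a failure.
    if env.domain == fenv.domain and env.level != fenv.level:
        return False
    return True


def set_gas(env: Env, stmt_kind: str) -> Optional[Env]:
    """Deduct the cost of one statement; None signals that deduction failed."""
    cost = GAS_COST.get(stmt_kind)
    if cost is None or cost > env.gas:
        return None
    return replace(env, gas=env.gas - cost)


def step(state: dict, initial_state: dict, env: Env, fenv: Env,
         stmt_kind: str) -> Tuple[dict, Env, str]:
    """Guard one statement: on failure, return to the initial state and reset env."""
    if not env_check(env, fenv):
        return initial_state, fenv, "error"
    new_env = set_gas(env, stmt_kind)
    if new_env is None:
        return initial_state, fenv, "error"
    return state, new_env, "ok"
```

In this sketch a failed check returns the initial memory state together with fenv, which corresponds to stopping the program and resetting env by fenv; a successful step hands the statement a new environment with the reduced gas amount.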
With respect to struct declarations, a new struct type is declared at the statement level according to Convention 1. The address is the new struct type identifier, and the struct type information is written into the respective memory block directly. The dynamic semantics of this process are defined by the rule EVAL-STT-STRUCT below.

Here, the first step is an attempt to extract the function declaration statements stored in the respective memory address by a checked read. If the read operation is successful, the second step sets the current execution environment level to 0, and also sets the domain to the called function identifier. The final step executes the function body with the new environment env'.
Modifier declarations are a special kind of function declaration that requires three steps and is subject to a single limitation. The parameter values are set by a dedicated predicate. As defined by the rule EVAL-STT-MODI below, the first step initializes and sets the parameters. The second step stores the modifier body into the respective memory block. The third step attempts to initialize the return address. Because multiple return values are supported, this initialization takes a return type list as an argument. Then, we attempt to execute the modifier body. Here, although not formally expressed in the rules below, an attempt to execute the body returns an error if any of the three steps fails. According to the rule MODI-LIMIT below, the modifier body can only yield the initial memory state, and therefore cannot change memory states.

Due to the separate definitions given by the rules STT-FUN and STT-FUNS discussed in Subsection 3.4, the dynamic semantics of function declarations are also defined separately below by the rules EVAL-STT-FUN-T, EVAL-STT-FUN-F, EVAL-STT-FUNS-T, and EVAL-STT-FUNS-F. The difference between modifier semantics and function semantics is the checking of modifier limitations. Specifically, before a function is invoked, the modifier restricting the function is executed, and the result of the modifier evaluation is checked. If the check fails, the function invocation is rejected. Otherwise, the function is executed. Here, two predicates determine the truth value of the modifier result. However, the type of that result need not be checked because, as discussed previously, we have limited the type of each abstract syntax by GADTs, which guarantees that it is well-typed.
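The modifier check that precedes a function invocation can be sketched as below. This is a simplified illustration, not the Coq semantics: the function name call_with_modifier, the dictionary-based memory state, and the use of None to signal a rejected invocation are all assumptions made for the example.

```python
import copy
from typing import Callable, Optional

State = dict


def call_with_modifier(modifier: Callable[[State], bool],
                       body: Callable[[State], State],
                       state: State) -> Optional[State]:
    """Run the modifier first; run the body only if the modifier allows it.

    The modifier is evaluated on a private copy of the state. If it changes
    that copy, it has violated the limitation that a modifier must leave the
    memory state unchanged, and the invocation is rejected.
    """
    probe = copy.deepcopy(state)
    allowed = modifier(probe)
    if probe != state:
        return None  # modifier tried to change memory
    if not allowed:
        return None  # limitation check failed
    return body(copy.deepcopy(state))
```

Evaluating the modifier on a copy means that even a misbehaving modifier cannot alter the caller's state, which mirrors the requirement that the modifier body yields only the initial memory state.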
The semantics of sequence statements are very simple: if evaluation of the first statement yields a new memory state and the output is normal, then the next statement is evaluated. Otherwise, an error is returned and the evaluation is stopped. This process is defined by the rules EVAL-SEQ1 and EVAL-SEQ2 below.
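A minimal Python sketch of this short-circuiting behavior follows. The representation is an assumption made for illustration: statements are modeled as functions from a memory state to a new state, with None standing for an error outcome, and the helper assign is hypothetical.

```python
from typing import Callable, Optional

State = dict
Statement = Callable[[State], Optional[State]]


def eval_seq(first: Statement, second: Statement, state: State) -> Optional[State]:
    """Evaluate two statements in order, stopping at the first error."""
    intermediate = first(state)
    if intermediate is None:
        # The first statement failed, so the second is never evaluated.
        return None
    return second(intermediate)


def assign(name: str, value: int) -> Statement:
    """Build a statement that writes value to name in a copy of the state."""
    def run(state: State) -> Optional[State]:
        updated = dict(state)
        updated[name] = value
        return updated
    return run


def fail(state: State) -> Optional[State]:
    """A statement that always produces an error."""
    return None
```

For example, eval_seq(assign("x", 1), assign("y", 2), {}) evaluates both statements in order, while eval_seq(fail, assign("y", 2), {}) returns None without evaluating the second statement.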

Extensibility and Universality
While it is essential to ensure that the developed set of formal syntax definitions and semantics faithfully captures the intended behaviors of programs written in real-world programming languages, further ensuring that this set can be applied to multiple programming languages is also of great value. Therefore, extensibility and universality were design goals for Lolisa from the beginning of its development. As such, we deliberately left sufficient room in Lolisa for extending features such as pointer formalization and the implementation of independent operator definitions. Extensibility is further accommodated by the independence of syntax inductive predicates within the same level, as indicated by Convention 1, which is also supported in the semantics definitions. Therefore, Lolisa is easily extended to incorporate the features of other languages by adding new typing rule constructors in the formal abstract syntax and the respective formal semantics in the interpreter. Moreover, except for the accommodation of specific Solidity data structures, such as contracts and mappings, the remainder of the syntax definitions and semantics of Lolisa were designed to be applicable to other general-purpose programming languages. Finally, Lolisa was designed based on the GERM framework and EVI, which are appropriate for the formalization of any programming language. However, two problems remain that impede the extensibility and universality of Lolisa.
First, the formal syntax of Lolisa is too complex to be adopted readily by general users. While the syntax of Lolisa includes the same components as those employed in Solidity, it has stricter formal typing rules. Therefore, Lolisa syntax must include some additional components not supported in Solidity, such as type annotations and a monad-type option. Moreover, the syntax of Lolisa is formally defined in Coq as inductive predicates. As a result, Lolisa code looks much more complicated than the corresponding Solidity code, as a comparison of Figure 10 with its formal Lolisa version in Figure 11 shows. Thus, Lolisa is unfriendly for general users, which increases the difficulty of writing code in Lolisa manually and of developing a translator between Lolisa and Solidity or another language. Second, Lolisa, serving as the core of a universal formal intermediate programming language specification, can be expected to become very large after being extended, and this will increase the difficulty of adapting the formal syntax to a variety of languages if the syntax remains complex and lacks explicit classification.
Fortunately, Coq and some other higher-order theorem-proving assistants provide a special macro-mechanism. Taking Coq as an example, this mechanism is denoted as the notation mechanism [5]. A notation is a symbolic abbreviation denoting some term or term pattern, which is parsed by Coq automatically. For example, an assignment in Lolisa can be wrapped as follows.

Notation "t0 '::=' t1" := (Assignv (Evar (Some t0) Tint) (Evar (Some t1) Tuint)).

Substituting a and b from the previous example yields NOTATION below, which demonstrates that the notation is nearly equivalent to the original Solidity syntax. Through this mechanism, we can hide the fixed formal syntax components, and thereby provide a simpler syntax to users. Moreover, this mechanism makes the equivalence between real-world languages and Lolisa far more intuitive. Therefore, we provide a preliminary scheme based on this macro-mechanism to improve the extensibility and universality of Lolisa systematically. The architecture of the proposed preliminary scheme is illustrated in Figure 14.

Here, we treat Lolisa as the core formal language, which is transparent to real-world users, and we logically classify the formal syntax and semantics of Lolisa into a general component and n special components, as defined by rule 1 below. Correspondingly, a general-purpose programming language ℒ_i can be formalized by the Lolisa subset consisting of the general component together with the i-th special component, by wrapping that subset in notations that serve as symbolic abbreviations for ℒ_i and adopt syntax symbols nearly equivalent to the original syntax symbols of ℒ_i. Through this method, each ℒ_i has a respective notation set that is a subset of the full set of symbolic abbreviations. This relation is defined as rule 2 below. As discussed, the notation layer can hide the details of the formal syntax, making it more user-friendly and clarifying the equivalence between real-world languages and Lolisa. In addition, the symbolic abbreviation set facilitates the systemization and classification of the formal syntax and semantics of Lolisa. As such, the proposed scheme addresses the problems impeding the extensibility and universality of Lolisa discussed above. Notations such as "(+)" and "(~>)" already appear in Figure 11; we are presently encapsulating Lolisa according to the above architecture, and this work will be completed in the near future.

Conclusion and Future Work
In this paper, we defined the formal syntax and semantics for a large subset of Solidity, which we denoted as Lolisa. To our knowledge, Lolisa is the first mechanized and validated formal syntax and semantics for Solidity. The formal syntax of Lolisa is strongly typed according to GADTs. The syntax of Lolisa includes nearly all the syntax in Solidity, and the two languages are therefore equivalent over this subset. As such, Solidity programs can be translated to Lolisa directly, line by line, without rebuilding or abstracting, which are operations that are too complex to be conducted by general programmers and may introduce inconsistencies. By basing the formal semantics of Lolisa on our GERM framework in conjunction with EVI, programs written in Lolisa can, in theory, be symbolically and automatically executed in higher-order theorem-proving assistants, thereby verifying the corresponding Solidity programs simultaneously. Moreover, we have mechanized Lolisa in Coq completely, and have developed a formal interpreter in Coq based on Lolisa. The formal interpreter was employed to validate the semantics of Lolisa, and to certify that Lolisa satisfies the propositions of EVI. We also presented an example to demonstrate the execution and verification process of Lolisa in Coq. In addition, we validated the semantics of Lolisa using two other distinct approaches: proving properties of the semantics, and establishing equivalence with alternate verified semantics. Finally, we illustrated the extensibility and universality of Lolisa, and proposed an initial scheme for systematically simplifying and extending Lolisa to support the formalization of multiple general-purpose programming languages. As a result of the present work, we can now directly verify smart contracts written in Solidity using Lolisa.

In the future, we hope that Lolisa will be sufficiently powerful and friendly to be used by general programmers to verify their programs easily. Presently, we are working toward verifying the correctness of FEther, and developing a proof of the equivalence between computable semantics and inductive semantics. Subsequently, we will implement our proposed preliminary scheme based on the notation mechanism of Coq to extend Lolisa along two important avenues. First, we will seek to support other features of Solidity, such as inline assembly. Second, we will seek to support other high-level programming languages, including Serpent [31]. Finally, we will build a general formal verification toolchain for blockchain smart contracts based on EVI to achieve the ultimate goal of automatic smart contract verification.

Figure 5. Abstract syntax of Lolisa expressions

Constant expressions are used to denote the native values of the basic formal system, which are transformed from respective Lolisa values.

Figure 8. Module declaration abstract syntax, declarations, and member access type assignment rules of Lolisa

Lolisa has two kinds of modules: contract modules and function modules. The members of contract and function modules are specified by the identifier and Lolisa source code ℂ, and we assign [ℳ] to denote the current global context. The abstract syntax of Lolisa in terms of [ℳ], contract module declarations L, function module declarations N, and member declarations K is given at the top of Figure 8.

Figure 11. The formal version of Figure 10 written in Lolisa

Figure 12 .Figure 13 .
Figure 12. Execution and verification of the Lolisa program in Figure 10 using the formal interpreter FEther in Coq

Figure 10, even though Lolisa and Solidity code present a line-by-line correspondence. An example of this difficulty is illustrated by the following code, in which a Tuint expression b is assigned to a Tint expression a in Lolisa on the left, with the equivalent assignment in Solidity on the right.

Table 1 .
State functions to correspond to the memory values   and   of Lolisa, which will be described in detail in the following sections.Because   is a subset of  that does not include the typing rules VALUE-MAP ) respectively, where  represents an identifier indexed by an address, and mems represents the member name of structure members.Specially, because the index value Amap_id is forbidden to employ itself recursively,   array is the coordinate index defined in Figure2for mapping values, which shares the same constructor with that of   except for Amap_id.Therefore, array types of  are parameterized in Lolisa by   , while array types of   are parameterized by    , which does not include a mapping identifier.As shown in Figure2,   and    contain the respective formal constructors to represent different index forms.Here, we assume that all indices with memory addresses in the set  are well-formed.The formalizations   and    not only can refer to the size of respective arrays, but can also be used to refer to the index number during evaluation.For example, in Lolisa, on the one hand, (  ( (_(10), ))) (EXP-ARR-VAR)   (  ( (_(10), ))) (EXP-ARR-ASSIGN) Because the size of array types in Solidity can be dynamic, the dynamic size array type in Lolisa is treated as a special mapping type of   (Iint Signed I64).In addition, array types and mapping types are defined recursively.Due to the recursive inductive definition, Lolisa can express n-dimensional array types and n-dimensional mapping types easily, which is illustrated below by examples EXP-ARR-N and EXP-MAP-N, respectively.
Second, array types are well-formed if their normal form type is well-formed and not void (Tundef), and if the number of elements is greater than zero. We write two subscripted variants of ⇓ to represent evaluations of array indices, and their type assessments are respectively defined as rules WF-ARR and WF-ARR-MAP below.

(WF-ARR) (WF-ARR-MAP)

Finally, mapping types are well-formed only if their arguments _KeyType and _ValueType are well-formed, which is expressed by the rule WF-MAP below.

Non-normal form values include the values of arrays, mappings, and field access. Essentially, array and mapping values can be obtained by evaluating their indices. As shown in EXP-ARR above, an array index not only can refer to the size of an array, but can also be used as the key of array elements. Therefore, array values in Lolisa can be constructed according to the rule VAL-ARR below.
Special expressions include struct expressions Estruct and modifier expressions Emodifier. First, at the expression level given in Convention 1, the only function of Estruct is to represent an expression value in the r-value position, which is used to initialize or modify struct-type terms. Therefore, its typing rule has two parameters: the respective struct type identifier and the value list for each member field, as defined by the rule EXPR-STR below.

Table 3 .
Helper functionsTable 1,  1 〉, where  0 and  1 are the initial and final memory states, respectively,  0 represents the form of Lolisa syntax being defined, and the nature of  1 depends on the precise evaluation relation being defined.We employ the notation  ⊑ {} to indicate that the term a will be at least simplified as a kind of normal form existing in the set b.

Table 4 .
Due to the static type limitations in the formal abstract syntax definition based on GADTs, the expressions, sub-expressions, and operations are all guaranteed to be well-formed, and the type dependence relations need not be checked using, e.g., informal assistant functions, as required by other formal semantics such as Clight. The evaluation functions for binary and unary operations take the results of expression evaluations and the required operations as arguments, and combine them to generate new memory values. The example given in Table 4 is the cases corresponding to Case study

(EVAL-STT-CON)

Variable declarations are one of the most basic statements in Lolisa. As discussed in Section 4, the namespace of identifiers is controlled by the ML module system and stored in env and fenv. Therefore, with variable declarations, we can use env and fenv directly. The initialization function used here, which is a special case of a more general helper function, takes the variable type, indexed address, and environment information as parameters, and initializes the respective memory blocks, as defined by the rule EVAL-STT-VAR below.