A Conceptual Approach to Complex Model Management with Generalized Modelling Patterns and Evolutionary Identification

Modeling and simulation of complex systems are powerful ways to investigate a multitude of natural phenomena, providing extended knowledge of their structure and behavior. However, enhanced modeling and simulation require the integration of various data and knowledge sources, models of various kinds (data-driven models, numerical models, simulation models, etc.), and intelligent components into one composite solution. The growing complexity of such a composite model creates a need for specific approaches to managing it. This need grows where the model itself becomes a complex system. One of the important aspects of complex model management is dealing with uncertainty of various kinds (context, parametric, structural, input/output) to control the model. When the system being modeled or the modeling requirements change over time, specific methods and tools are needed to perform modeling and application procedures (meta-modeling operations) automatically. To support automatic building and management of complex models, we propose a general evolutionary computation approach which enables managing complexity and uncertainty of various kinds. The approach is based on an evolutionary investigation of the model phase space to identify the best model structure and parameters. Examples from different areas (healthcare, hydrometeorology, social network analysis) were elaborated with the proposed approach and solutions.


Introduction
Today the area of modeling and simulation of complex systems evolves rapidly. A complex system [1] is usually characterized by a large number of elements, complex long-distance interaction between elements, and multi-scale variety. One of the results of the area's development is the growing complexity of the models used for investigation of complex systems. As a result, a contemporary model of a complex system could easily be characterized by the same features as a natural complex system. Usually, the complexity of a model is considered in tight relation to the complexity of the modeled system. Nevertheless, in many cases, the complexity of a model does not mimic the complexity of the system under investigation (at least not exactly). This leads to additional issues in managing a complex model during identification, calibration, data assimilation, verification, validation, and application. One of the core reasons for these issues is uncertainty of various kinds [2,3] arising on the levels of system, data, and model. In addition, complexity is extended even further within multi-disciplinary models and models which incorporate additional complex and/or third-party sub-models. From the application point of view, complex models are often difficult to support and integrate with a practical solution because of the low level of automation and the high modeling skills needed to support and adapt a model to changing conditions.
On the other hand, evolutionary approaches have recently become popular for various types of model-centered operations such as model identification [4], equation-free methods [5], ensemble management [6], data assimilation [7], and others. Evolutionary computation (EC) provides the ability to implement automatic optimization and dynamic adaptation of the system within a complex state space. Still, most of the solutions remain tightly bound to a particular application and modeled system.
Within the current research, we are trying to develop a unified conceptual and technological approach to support core operations with a complex model by distinguishing concepts and operations on the model, data, and system levels. We consider a combination of EC and data-driven approaches as a tool for building intelligent solutions that manage (and lower) uncertainty more precisely and systematically and provide the required level of automation, adaptability, and extensibility.

Conceptual Basis
The proposed approach is based on several key ideas aimed at extending uncertainty management in complex system modeling and simulation.
1) Disjoint consideration of model, data, and system in terms of structure, behavior, and quality is aimed at a system-level review of the modeling and simulation process and at distinguishing between the kinds of uncertainty originating from different levels [9].
2) Intelligent technologies like data mining, process mining, machine learning, and knowledge-based approaches are to be employed to fill the gap in automation of modeling and simulation. Key sources for the development of such a solution include formalization of various knowledge within composite solutions [8] and data-driven technologies to support the identification of model components.
3) EC approaches are widespread in modeling and simulation of complex systems [9,10]. We believe that systematization of this process with separate consideration of spaces for a system (with its sub-systems) and a model (with its sub-models) could enhance such solutions significantly.
4) The aim of the approach's development is twofold. First, it is aimed at automation of modeling operations to extend the functionality of possible model-based applications. Second, working with a combination of EC and intelligent data-driven technologies could be considered as an additional knowledge source for system and model analysis.
The remainder of this section considers the conceptual basis of the proposed approach with a special focus on the role of EC algorithms and data-driven intelligent technologies in building and exploiting complex models.

Core Concepts
To distinguish between main modeling concepts and operations, we propose a conceptual framework (see Fig. 1) for consideration of key processes and operations during modeling of a complex system. The framework may be considered as a generalization and extension of a framework [11,12]   involved in the operation. Transitions between concepts and between layers are denoted with 1 → 2 and 1 → 2 respectively, e.g., operator Γ → Ξ reflects observation of quantitative parameters, and operator Γ → Ξ stands for basic data assimilation. Also, a set of operators may refer to a single modeling operation, e.g., operators Γ Φ→Ξ and Γ Σ→Ξ are often implemented within a single monolithic model.
Operators are mainly related to a specific sub-model within a complex model. We consider three key classes of models. F-models are usually classical continuous models developed with knowledge of a system. DD-models are data-driven models based on analysis of available data sets with corresponding techniques (statistics, data mining, process mining, etc.). A-models are mainly intelligent components of a system, usually based on machine learning or knowledge-based approaches. We also consider EC-based components as belonging to the A-model class.
A key problem within complex system modeling and simulation is related to the absence, or at least significant limitation, of the possibility to observe the structure and functional characteristics of the for direct discovery of structure and functional characteristics directly and operators Γ Σ→Φ , Γ Φ→Σ , Γ Σ→Φ , Γ Φ→Σ for interconnection of discovered characteristics in available data and within the used model). In the proposed approach, primary attention is paid to the kinds of solutions where DD- and A-models enhance the complex modeling process with an additional level of automation, adaptation, and knowledge provision.

Complex Modeling Patterns
Considering the defined conceptual framework, we identify several patterns of modeling and simulation of a complex system (see Fig. 2). The patterns are defined as combinations in a context of surrogate models for functional characteristics (Γ Ξ→Φ ); c) providing estimation of investigated parameters with machine learning (ML) algorithms and models (Γ′ Ξ→Ξ ). In contrast to the previous pattern, data-driven models usually operate quickly (although training the model may require significant time). Still, such models have lower quality than the original "full" models. Nevertheless, combining this pattern with others provides a significant enhancement in functionality and performance, e.g., data-driven models can be used in an optimization loop (see the previous pattern).

P3. Ensemble-based modeling (Fig. 2c) extends P1 for working with sets of objects (models, data sets, states) reflecting uncertainty, variability, or alternative solutions (e.g., models). Previously [11] we identified five classes of ensembles (see E1-E5 in Fig. 2c): decomposition ensemble, alternative models ensemble, data-driven ensemble, parameter diversity ensemble, and meta-ensemble. All these patterns can be applied within the context of the proposed framework. Still, an extension of the ensemble structure increases the structural complexity of the model and thus leads to the need for additional (automatic) control procedures. Moreover, the performance issues of P1 become even more severe in ensemble modeling.

P4. One of the key ideas of the proposed approach is the implementation of data-driven analysis of model states, structure, and behavior. To implement it within the conceptual framework, we propose a pattern for data-driven complex modeling (Fig. 2d). It includes identification and prediction of a model structure through data mining (DM) and process mining (PM) techniques (Γ → Ξ→Σ ) and generation of surrogate models for injection into the complex model (Γ → Ξ→Φ ). In addition, it is possible to use data-driven techniques to predict the quality of the considered model and use it for model optimization (Γ′ → Ξ , Γ′ → Ξ→Σ , Γ′ → Ξ→Φ ).

P5. A key pattern for EC implementation is presented in Fig. 2e. Here EC is used to identify a model structure (Γ → Ξ→Σ ) and surrogate sub-models (Γ → Ξ→Φ ) with consideration of a population of models. As a result, the modeling result is also (as in P3) presented in multiple instances which may be analyzed, filtered, and evolved within subsequent iterations over changing time (while processing incoming observations of the system) or within a single timestamp (with fixed observation data).
P6. Finally, the last presented pattern (Fig. 2f) is aimed at investigating the system phase space using DD-models and/or EC to reflect the unobservable landscape for estimating model positioning and assessing its quality in inferring (sub-)optimal structural (Γ′ → Ξ→Σ and Γ′ → -enhanced ways of domain knowledge discovery for applications and general investigation of a system.

Composite Solution Development
The proposed structure of core concepts and patterns may be applied in various ways to form a solution which combines operators implemented originally within the solution with operators implemented as external model calls. Fig. 3 shows the essential elements (artifacts and procedures) in a typical composite solution within the proposed conceptual layers (system, data, and model). The system layer includes the actual system's state, which can be assessed through the observation procedure and described by explicit domain knowledge. The data layer includes datasets divided into observation data and simulation/modeling data, with procedures for data processing and data assimilation. Finally, the model layer includes a set of available basic models which may be identified and calibrated with available data, yielding tuned models as a result. Here, essential elements are model composition (which may be performed either automatically or by the modeler) and application of the model.

Figure 3 - Artifacts and procedures within a typical composite solution
The key benefit of the approach is the application of a combination of EC, data-driven, and intelligent procedures to manage the whole composite solution, including data processing, modeling, and simulation, to lower uncertainty in Σ × Ξ × Φ. Within the shown structure these procedures may be applied:
- to rank and select alternative models;
- to support model identification, calibration, composition, and application;
- to manage artifacts on various conceptual layers in a systematic way;
- to infer implicit knowledge from available data and explicitly presented domain knowledge.

Evolutionary Model Identification
When implementing the evolution of models within a complex modeling task, the structure, functional characteristics, and quantitative parameters are usually considered as the genotype, whereas the model output (data layer) is considered as the phenotype. Within the proposed approach we can adapt the definition of basic EC operations within genotype-phenotype mapping [13]: programming, etc.) with data-driven operations to overcome or, at least, to lower complex modeling issues.
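The genotype-phenotype distinction can be illustrated with a minimal sketch. Everything named here is a hypothetical illustration rather than the authors' implementation: the `Genotype` class, the toy `SUBMODELS` library, and the mean-absolute-error fitness are assumptions chosen only to show a genotype (structure plus parameters) being mapped to a phenotype (model output on the data layer).

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Genotype:
    """Hypothetical genotype: model structure plus quantitative parameters."""
    structure: Tuple[str, ...]     # names of the sub-models that are switched on
    parameters: Tuple[float, ...]  # one parameter per selected sub-model


# Toy library of functional characteristics (an assumption for illustration).
SUBMODELS: Dict[str, Callable[[float, float], float]] = {
    "linear": lambda x, p: p * x,
    "quadratic": lambda x, p: p * x * x,
    "offset": lambda x, p: p,
}


def phenotype(g: Genotype, inputs: Sequence[float]) -> List[float]:
    """Genotype-to-phenotype mapping: run the composed model on the inputs."""
    return [
        sum(SUBMODELS[name](x, p) for name, p in zip(g.structure, g.parameters))
        for x in inputs
    ]


def fitness(g: Genotype, inputs: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute error between the phenotype and the observations."""
    output = phenotype(g, inputs)
    return sum(abs(o - y) for o, y in zip(output, observed)) / len(observed)


def mutate(g: Genotype, rng: random.Random, sigma: float = 0.1) -> Genotype:
    """Parametric mutation only; the structure is left unchanged."""
    return Genotype(g.structure, tuple(p + rng.gauss(0.0, sigma) for p in g.parameters))
```

Selection then acts on the fitness computed in phenotype space, while mutation and crossover act on the genotype. A structural mutation (adding or removing a sub-model) could be added in the same style.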

Model Management Approach and Algorithm
By model management we mean operations with models within the development and application of a problem domain solution. This includes identification, calibration, data assimilation (DA), optimization, prediction, forecasting, etc. To systematize model management in the presented patterns, we propose an approach for explicit consideration of spaces , , within hybrid modeling with EC and DD-modeling. To summarize complex modeling procedures within the approach, we developed a high-level algorithm which includes a series of steps to be implemented within the context of complex model management.
Step 1. Space discovery. This step identifies the description of phase space (in most cases, ) in case of lack of knowledge or for automation purposes. For example, the step could be applied in the discovery of system state space or model structure. The space description may include a) distance metrics; b) proximity structure (e.g., graph, clustering hierarchy, density, etc.); c) positioning function. One possible way to perform this step is to apply DM and EC algorithms to available data (see pattern P6).
Step 2. Identification of supplementary functions. Data-driven functions (Φ) are applied in model evolution, with the space (landscape) representation treated as available information.
Step 3. Evolutionary processing of a set of models. This step is described by a combination   -complicated mapping of genotype to phenotype space.
To overcome these issues, the proposed approach involves two options. First, the intelligent procedures may be used to tune EC hyper-parameters (P5), predict features of genotype-phenotype mapping, boundaries, etc. (P4), and discover interpretable states and filters (for system, data, and model) to control convergence and adaptation of the population (P2, P4, P5) in an interpretable and reproducible way (through the defined control procedure). Second, the composite model may use various approaches, methods, and elements to obtain better quality and performance of the solution:
- surrogate models (P2, P4, P5) which may increase performance (for example within preliminary and intermediate optimization steps);
- ensemble models (P3) which may be considered as an interpretable and controllable population;
- interpretation and formal inference using explicit domain-specific knowledge and results of data mining to feed procedures of EC and infer parameters in both models and EC;
- controllable space decomposition (P6) with predictive models for possible areas and directions of population migration in EC to explicitly lower uncertainty and obtain additional interpretability.
Finally, an essential feature of the proposed approach is a holistic analysis of a composite solution with possible co-evolution of models (sub-models within a composite model) and data processing procedures.
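The three steps can be sketched together in a compact form. This is a minimal illustration under strong assumptions, not the authors' algorithm: the space description is reduced to a Euclidean metric, the supplementary data-driven function is a 1-nearest-neighbour surrogate built from already evaluated points, and the evolutionary step is a simple elitist scheme with Gaussian mutation. The names `evolve`, `make_surrogate`, and `full_model_error` are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Params = Tuple[float, ...]
Scored = Tuple[Params, float]


def distance(a: Params, b: Params) -> float:
    """Step 1 stand-in: a Euclidean metric describing the parameter space."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def make_surrogate(archive: List[Scored]) -> Callable[[Params], float]:
    """Step 2 stand-in: 1-nearest-neighbour predictor over evaluated points."""
    def predict(p: Params) -> float:
        return min(archive, key=lambda rec: distance(rec[0], p))[1]
    return predict


def evolve(
    full_model_error: Callable[[Params], float],
    dim: int,
    generations: int = 30,
    pop_size: int = 10,
    n_offspring: int = 40,
    sigma: float = 0.3,
    seed: int = 0,
) -> Scored:
    """Step 3 stand-in: elitist evolution with surrogate pre-screening."""
    rng = random.Random(seed)
    population = [
        tuple(rng.uniform(-1.0, 1.0) for _ in range(dim)) for _ in range(pop_size)
    ]
    scored = sorted(
        ((p, full_model_error(p)) for p in population), key=lambda rec: rec[1]
    )
    archive = list(scored)  # every point evaluated with the "full" model so far
    for _ in range(generations):
        surrogate = make_surrogate(archive)
        parents = [p for p, _ in scored]
        candidates = [
            tuple(x + rng.gauss(0.0, sigma) for x in rng.choice(parents))
            for _ in range(n_offspring)
        ]
        # Cheap pre-screening: only the most promising candidates reach the full model.
        candidates.sort(key=surrogate)
        evaluated = [(c, full_model_error(c)) for c in candidates[:pop_size]]
        archive.extend(evaluated)
        # Elitist selection over parents and newly evaluated offspring.
        scored = sorted(scored + evaluated, key=lambda rec: rec[1])[:pop_size]
    return scored[0]
```

In this sketch the expensive model is called `pop_size` times per generation instead of `n_offspring` times, which is the kind of saving surrogate models are meant to provide in preliminary and intermediate optimization steps.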

Problem #1: Evolution in Models for Metocean Simulation
Environmental simulation systems usually contain several numerical models serving different purposes (complementary simulation processes, improving the reliability of a system by providing alternative results, etc.). Each model can typically be described by a large number of quantitative parameters and functional characteristics that should be adjusted by an expert or by intelligent automated methods (e.g., EC). Alternative models inside the environmental simulation system can be joined in an ensemble according to a complex modeling pattern based on evolutionary computing (a combination of patterns P3 and P5). In the current case study, we introduce an example illustrating the ensemble concept in the forms of the alternative models ensemble, parameter diversity ensemble, and meta-ensemble. To identify the parameters of the proposed ensembles, the least squares method (in the case of model linearity) or optimization methods (in the case of non-linearity) can be used.
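For the linear case, identification of ensemble parameters by least squares can be sketched as follows. The sketch assumes a linear ensemble with a bias term and one weight per member model. The function names and the synthetic data in the usage note are illustrative assumptions and do not come from the case study.

```python
import numpy as np


def fit_ensemble_weights(model_outputs: np.ndarray, observations: np.ndarray) -> np.ndarray:
    """Least-squares fit of a linear ensemble.

    model_outputs: array of shape (n_times, n_models), one column per member model.
    observations:  array of shape (n_times,).
    Returns the coefficients [bias, w_1, ..., w_k].
    """
    design = np.column_stack([np.ones(len(observations)), model_outputs])
    coef, *_ = np.linalg.lstsq(design, observations, rcond=None)
    return coef


def ensemble_forecast(coef: np.ndarray, model_outputs: np.ndarray) -> np.ndarray:
    """Apply the fitted bias and weights to the member model outputs."""
    return coef[0] + model_outputs @ coef[1:]
```

As a usage example, if two member forecasts are stacked column-wise in an array `members` and `observed` holds the measurements, `fit_ensemble_weights(members, observed)` returns three coefficients that can be passed to `ensemble_forecast`. In the non-linear case the same interface could wrap a numerical optimizer instead of `np.linalg.lstsq`.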
As we need to take into account not only the functional space Φ and the space of parameters Ξ for a single model but also the optimal coexistence of models in the system (i.e., Σ), evolutionary and co-evolutionary approaches appear to be applicable techniques for this task. It is worth mentioning that the co-evolutionary approach can be applied to independent model realizations through an ensemble as a connecting element. In this case, parameters (weights) in the ensemble can be estimated separately from the co-evolution procedure, in a constant form or dynamically. As a case study of complex  Although error landscapes for the pair of implementations SWAN+ERA and SWAN+NCEP are close to each other, separate evolution does not consider optimization of the ensemble result. For this purpose, we apply a co-evolutionary approach that produces a set of Pareto-optimal solutions for each generation. Fig. 4b shows that the error of each model in the ensemble is significant (coNCEP and coERA for the models alone), but the error of the whole ensemble (coEnsemble) converges to a minimum very fast.
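Selection of Pareto-optimal solutions within one generation can be sketched as below. The objectives are assumed to be error values to be minimised (for example, one error per model implementation). The exact objectives and selection scheme of the case study are not reproduced here, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Return the non-dominated points, keeping their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among the error pairs `(1, 5)`, `(2, 2)`, `(3, 3)`, `(5, 1)`, and `(4, 4)`, the pairs `(3, 3)` and `(4, 4)` are dominated by `(2, 2)`, so the front consists of the remaining three. This quadratic-time filter is adequate for small populations; larger ones would call for a non-dominated sorting scheme.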
The obtained result can be analyzed from the uncertainty reduction point of view. Optimization of model parameters helps to reduce parameter uncertainty, which can be estimated through the error function. But when we apply an ensemble approach to evolutionary optimized results, it is also appropriate to speak of a reduction of the uncertainty connected with the input data sources (NCEP and ERA). Moreover, the meta-ensemble approach made it possible to reduce the uncertainty connected with ensemble parameters.
Summarizing the results of the metocean case study, we can note that the EC approach is up to 120 times more efficient than grid search without loss of accuracy. According to this experimental study, the quality of the ensemble with evolutionary optimized models is similar to the results of the grid search: the MAE metric equals 0.24 m and the DTW metric equals 51. We can also mention that the co-evolutionary approach provides a 10% accuracy gain compared with the results of single evolution of model implementations, but this is still similar to the ensemble result with evolutionary optimized models. Nevertheless, the co-evolutionary approach achieved a 200-fold acceleration. Within the context of the proposed approach, space Φ was investigated using the defined structure of the model in space Σ for the purpose of model calibration.
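The two quality metrics mentioned above can be computed as in the following sketch. It assumes the textbook definitions: mean absolute error over paired samples, and dynamic time warping (DTW) with an absolute-difference local cost and no window constraint. The exact DTW variant used in the case study is not stated in this section, so this is an assumption.

```python
from typing import List, Sequence


def mae(simulated: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute error between two series of equal length."""
    return sum(abs(s - o) for s, o in zip(simulated, observed)) / len(observed)


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Unlike MAE, DTW tolerates small shifts in time: a series and a copy of it with one repeated sample have a DTW distance of zero, whereas a point-by-point comparison would penalise the shift.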

Problem #2: Modeling Health Care Process
Modeling health care processes is usually associated with enormous uncertainty and variability, even when modeling a single disease. One way to identify a model of such a process is PM [20]. Still, direct implementation of PM methods does not remove a major part of the uncertainty. Within the current research, we applied the proposed approach for identification purposes both in the analysis of historical cases and in the prediction of the development of a single process. Here we consider processes of providing health care in acute coronary syndrome (ACS) cases; ACS is usually considered one of the major causes of death in the world. We used a set of 3434 ACS cases collected during 2010-2015 in Almazov National Medical Research Centre, one of the leading cardiological centers in Russia. The data set contains electronic health records of these patients with all registered events and characteristics of a patient. To simplify consideration of the multi-dimensional space of possible processes (Γ′ → Ξ→Σ Γ → Ξ for analysis of Σ on layer ) we introduced a graph-based representation of this space with vertices representing cases and edges representing proximity of cases. Analysis of such a structure enables easy discovery of common cases (e.g., as communities in the graph). Such discovery enables explicit, interpretable structuring of the space and representation of a further landscape for EC in terms of pattern P6. Moreover, direct interactive investigation of a visual representation of such a structure (see Fig. 6) provides significant insights for medical researchers.
Figure 6 - Graph-based representation of processes space in healthcare (interactive view)
We have developed an evolutionary-based algorithm for pattern identification and clustering in such a representation with two criteria to be optimized (see Fig. 7). Here processes were represented by a sequence of labels (symbols) denoting key events in the PM model. Typical patterns were then selected for the Pareto frontier. The convergence process is demonstrated in Fig. 8 (the 10 best individuals from the Pareto frontier according to the integral criterion were selected).
As a result, this solution may be referred to pattern P5 and operator Γ → Ξ→Σ while discovering model structure. Fig. 9 shows an example of a typical process model (i.e., a structural characteristic of the model) for one of the identified clusters.
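A graph-based representation of process space, with cases as vertices and proximity as edges, can be sketched as follows. The sketch rests on assumptions that are not taken from the study: each case is encoded as a string of event labels, proximity is measured by edit (Levenshtein) distance with a fixed threshold, and connected components serve as a crude stand-in for community detection.

```python
from typing import Dict, List, Sequence, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def proximity_graph(cases: Sequence[str], max_dist: int) -> Dict[int, Set[int]]:
    """Connect two cases when their label sequences are within max_dist edits."""
    edges: Dict[int, Set[int]] = {i: set() for i in range(len(cases))}
    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            if edit_distance(cases[i], cases[j]) <= max_dist:
                edges[i].add(j)
                edges[j].add(i)
    return edges


def components(edges: Dict[int, Set[int]]) -> List[List[int]]:
    """Connected components found by depth-first search."""
    seen: Set[int] = set()
    result: List[List[int]] = []
    for start in edges:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in edges[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(sorted(comp))
    return result
```

With hypothetical label strings such as `"AFED"`, `"AFD"`, and `"AFEDD"` on one side and `"ICN"`, `"ICNN"` on the other, a threshold of one edit separates the cases into two groups. A real analysis would likely use a modularity-based community detection method on a weighted graph instead of plain components.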
A detailed description of the approach, algorithms, and results on CP discovery, clustering, and analysis, including a performance comparison of three versions of CP discovery algorithms, can be found in [10]. A demonstration of the interactive view shown in Fig. 6 is available at https://www.youtube.com/watch?v=EH74f1w6EeY. An important outcome of applying the approach in this application is the interpretability of the clusters and identified patterns. For example, 10 clusters and the corresponding CPs were interpreted by cardiologists from Almazov National Medical Research Centre. The obtained interpretation and further discovery and application with CP structure are presented in [17]. Another important benefit of such space structure discovery is lowering the uncertainty of a patient's treatment trajectory by hierarchical positioning of an evolving process (selection of a cluster and selection of a position within the cluster). For example, the discrete-event simulation model described in [17] provides a more appropriate length of stay distribution within simulation with discovered classes of CPs (the Kolmogorov-Smirnov statistic decreased by 51%, from 0.255 to 0.124). Here a combination of patterns P2, P5, and P6 in the implementation of the proposed algorithm (see Section 2.5) enables interactive investigation of the process space and data assimilation into a population of possible continuations of a single process as it evolves. This solution can be applied in exploratory modeling and simulation of patient flow processing as well as in decision support in specialized medical centers.
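The Kolmogorov-Smirnov comparison of length-of-stay distributions can be illustrated with a short sketch. It assumes a two-sample setting (simulated versus observed lengths of stay), which is not stated explicitly here, and the function names are illustrative. The second function only reproduces the arithmetic behind the reported relative decrease.

```python
import numpy as np


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])
    # Empirical CDF values of each sample at every observed point.
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def relative_decrease(before: float, after: float) -> float:
    """Fractional reduction of a statistic, e.g. (0.255 - 0.124) / 0.255."""
    return (before - after) / before
```

With the values reported above, `relative_decrease(0.255, 0.124)` is approximately 0.514, which corresponds to the stated decrease of about 51%.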

Problem #3: Mining Social Media
Nowadays social media analysis (which began with static network models emphasizing the topology of connections between users) strives to explore dynamic behavioral patterns of individuals which can be recovered from their digital traces on the web. The prediction of social media activities requires combining analytical and data-driven models as well as identifying the optimal structure and parameters of these models according to the available data. Here we show an example of a problem in this field involving evolutionary identification of a model. We applied the technique described in Section 3.2 to analyze the processes. Still, the considered process has a significantly different structure. By default, it is continuous with random repetition of events, while the healthcare process in ACS cases has a finite and "stronger" structure. N-gram analysis is often used to detect patterns in people's behavior [21,22]. N-gram analysis is based on counting frequencies of combinations or sequences of activities. We collected all sorted 3-grams (so-called 3-sets) for each user's sequence to analyze the frequency of event combinations. As a result, three clusters of vectors with 3-set chains were identified with the k-means clustering method. Fig. 13 shows all combinations and transitions between them for cluster #3 as an example. Using Fig. 12 and Table 1, it is possible to see that cluster #3 includes users who often make new records (P) and sometimes comment on records (C). So, cluster #3 mostly consists of "bloggers".
Cluster #2 includes "spreaders" who frequently copy other records (R). The biggest cluster, #1, consists of people who make new records and copy other ones equally often but less intensively compared to other clusters. That may be considered typical behavior for a user of an online social network (OSN). N-gram analysis allows detecting typical behavioral patterns and obtaining process models for social media activities using chains of different lengths as input data. Thus, this type of data-driven modeling is more appropriate for researching continuous processes. Fig. 14 shows a graph-based representation of process space with all users' patterns.
Figure 13 - Example of process model for cluster #3 using n-gram analysis
Figure 14 - Graph-based representation of processes' space for three clusters in social media activity
This subsection provides very early results. The next step in applying the proposed approach to this application includes an extension of the process model structure by a) temporal labeling (gaps between events); b) considering the process within a sliding time window to get more structured processes; c) linking the model with causal inference; d) introducing DM techniques for EC positioning of ongoing processes in model space. We believe that these extensions could enhance discovery of model structure (P4) and provide deeper insight into social media activity.
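The 3-set counting described in this section can be sketched as follows. The event labels P, C, and R follow the text (new record, comment, copy of another record). The function names, the fixed vocabulary, and the normalisation to relative frequencies are assumptions made for illustration. The resulting vectors are the kind of input a k-means implementation would cluster.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import List

EVENTS = "CPR"  # C: comment, P: new record, R: copy of another record

# All unordered triples of event labels: 10 possible 3-sets for 3 labels.
VOCABULARY: List[str] = [
    "".join(t) for t in combinations_with_replacement(EVENTS, 3)
]


def three_sets(sequence: str) -> Counter:
    """Count sorted 3-grams (3-sets) over a sliding window of length 3."""
    return Counter(
        "".join(sorted(sequence[i:i + 3])) for i in range(len(sequence) - 2)
    )


def frequency_vector(sequence: str) -> List[float]:
    """Relative frequency of each 3-set, in VOCABULARY order."""
    counts = three_sets(sequence)
    total = sum(counts.values())
    if total == 0:  # sequences shorter than three events carry no 3-sets
        return [0.0] * len(VOCABULARY)
    return [counts[key] / total for key in VOCABULARY]
```

For a hypothetical user sequence `"PPCP"`, both windows (`"PPC"` and `"PCP"`) sort to the same 3-set `"CPP"`, so ordering inside a window is ignored and only the combination of events is counted.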

Conclusion and Future Work
The development of the proposed approach is still an ongoing project. We aim at further systematization and detailing of the proposed concepts, methods, and algorithms, as well as a more comprehensive and deeper implementation of EC-based applications. Further work includes the following directions:
- detailing of the role of data-driven and intelligent operations in the proposed approach and described patterns;
- extended analysis of various EC techniques applicable within the approach;
- investigation of EC-based discovery for models of complex systems with lacking or inconsistent observations;
- detailed formalization of expertise and knowledge-based methods within the approach;
- extending the approach with interactive user-centered modelling and phase space analysis;
- development of a multi-layered approach for decision support and control of the system and process, available data, and complex model.