Chain Graph Models to Elicit the Structure of a Bayesian Network

Bayesian networks are possibly the most successful graphical models to build decision support systems. Building the structure of large networks is still a challenging task, but Bayesian methods are particularly suited to exploit experts' degree of belief in a quantitative way while learning the network structure from data. In this paper details are provided about how to build a prior distribution on the space of network structures by eliciting a chain graph model on structural reference features. Several structural features expected to be often useful during the elicitation are described. The statistical background needed to effectively use this approach is summarized, and some potential pitfalls are illustrated. Finally, a few seminal contributions from the literature are reformulated in terms of structural features.

The core of this class of models is made by a directed acyclic graph (DAG) G, where nodes in the graph are labels of modeled variables (elements of vector X), and oriented edges (arrows) capture probabilistic and/or causal relationships. The joint distribution of X is represented by the product of conditional distribution functions following from the structure of G. If substantial prior information is available on a given problem domain, it is possible that an expert defines the structure of G and even the parameters inside the conditional distribution functions at a reasonable extent. Otherwise, structure and parameters have to be estimated from data (structural and parameter learning), for example, from a collection of exchangeable observations D = (x 1 , . . . , x ).
It is often the case that an expert knows some features of DAG G, but knowledge is not enough to define a DAG because several of its aspects are affected by relevant uncertainty. Following the Bayesian paradigm [7,8], the expert is invited to state his/her degree of belief about G by eliciting a prior distribution on the set of DAGs which can be considered given a fixed set of variables (nodes). All other unknown quantities, like missing values and parameters, are considered in the prior distribution [9].
Learning the structure of a BN remains nowadays still challenging for the combinatorial explosion of candidate structures with the increase in the number of considered nodes. Following Robinson [10], the number of possible DAGs on 6 nodes is 3781503; thus the enumeration of all structures while eliciting expert beliefs is unfeasible. Computational difficulties in the full enumeration of DAGs follow from about a dozen of nodes on. For these reasons, several restrictions and simplifications in stating prior beliefs were considered in the past, with the aim of making structural learning tasks affordable in large networks. Widely adopted elicitation techniques are based on restrictions like a total ordering of nodes, the presence of sharp order constraints on nodes, the marginal independence of unknowns, or the existence of a prior network which is a good summary of expert beliefs.
In a recent work [11], graphical models were proposed to elicit beliefs about the structure of a BN. The approach is characterized by the possibility of expressing beliefs about 2 The Scientific World Journal limited (but relevant) aspects of the problem domain, called structural features [12], that by their own nature are often only indirectly related to oriented edges in the unknown DAG.
Here the most general approach based on chain graph (CG) models is reconsidered with the aim of establishing connections with seminal contributions from the literature on structural learning, of detailing a general parameterization, and of describing one way to perform the refinement of elicited beliefs. Common useful structural features are defined and some issues related to their implementation and revision are examined.

Methods
Notation and some background information on graphs and Markov properties are provided below. A comprehensive account may be found in [13][14][15]. An approach to the elicitation of structural features by CG models is described, together with methods for the revision.

Graphs.
A graph G is a pair ( , ) where = {V 1 , V 2 , . . . , V } is a finite set of nodes and ⊂ × is the set of edges. The set represents the structure of the graph because it defines which nodes are linked by an edge and if such edge is directed (arrow) or not. If (V , V ) ∈ and (V , V ) ∈ then an undirected edge joins V and V , indicated as V -V , and V is neighbor of V . If (V , V ) ∈ but (V , V ) ∉ the ordered pair corresponds to the directed edge V → V ; V is said to be the child of V and V is a parent of V . The set pa(V ) includes all parents of node V , that is, all nodes originating arrows with end in V , while the set ch(V ) collects all children of V , namely, all nodes in which arrows originated from V end.
A path is a sequence of vertices such that there is an edge for each pair of subsequent nodes in the sequence, that is, V -V +1 or V ← V +1 or V → V +1 . A directed path is a path in which all edges maintain the head-to-tail orientation, for example, In a directed graph all edges are directed. The ancestors an(V ) of node V are nodes on a directed path reaching V , while descent nodes de(V ) are nodes on a directed path starting from node V . Note that V ∈ de(V ) and that V ∈ an(V ). The extension of the above definition to an( ) with ⊂ is obtained by union of sets an(V ) for each V ∈ . A similar extension holds for de( ).
In a directed graph without cycles it is not possible to visit the same node more than one time by following a directed path, and in this case the graph is called directed acyclic graph. A moralized DAG is an undirected graph obtained by joining pairs of nodes sharing children (if not yet connected) with an undirected edge and by removing the direction of all edges. A subgraph G on ⊂ is obtained by removing all nodes in \ and all edges in which at least one node is in \ from the graph G.
A graph without directed edges is called undirected graph (UG). An UG with = × is said to be complete. A subgraph on a subset ⊂ of nodes is obtained by removing nodes not in and all edges reaching-leaving nodes not in .
Note that a subset of nodes is indicated by capital letters. A clique is a maximal complete subgraph of an UG.
A chain graph, also called partially directed acyclic graph (PDAG), is made by an ordered collection of chain components = ( 1 , 2 , . . .) which are undirected graphs and by directed edges between nodes located in different chain components, so that the arrow V → V is allowed only if V belongs to a chain component preceding the chain component in which V is located. Therefore directed edges are forbidden within a chain component and in the direction from to − , with > 0.
The moralization of a chain graph mimics the moralization of a DAG. A moralized CG, indicated as G , is obtained by the following steps: (1) let = 2; (2) join with undirected edges all pairs of nodes in pa( ), with pa( ) being the union of parents sets for each node in chain component ; (3) iterate step (2) for = 3, 4, . . .; (4) remove directions to all edges.

Some Markov Properties.
Conditional independence [16] is fundamental to reason out highly structured stochastic systems and to simplify the representation of high dimensional distributions.
In this paper the random vector X refers to random variables included in the BN. The notation used hereafter is based on nodes in ; thus X is also indicated as = is the Cartesian product of sample spaces of considered variables.
The joint probability distribution of a random vector is Markov with respect to an UG G if with C being a collection of graph cliques in G and with nonnegative functions called clique potentials; note that is the coordinate projection of vector on the subset of coordinates defined by ; = ∑ Ω ∏ ∈C ( ) is the partition function that normalizes the product of potentials in (1).
Markov properties for positive distributions with respect to an undirected graph may be read using the separation theorem [14]. Let ⊂ , ⊂ , ⊂ be disjoint subsets of nodes. The separation theorem states that subvectors and are conditionally independent given the subvector if and only if all paths from a node in to a node in include nodes located in ; thus nodes in separate nodes in from nodes in .
The joint probability distribution of random variables indexed in is Markov with respect to a DAG G if the following factorization holds: The Scientific World Journal 3 where pa(V ) is the random vector made by variables whose labels belong to the parents set of V . Markov properties on a DAG may be read using the separation theorem for directed graphs [14]. Given three disjoint sets of nodes , , , consider the subgraph G defined on an( ∪ ∪ ) after moralization, say G an( , , ) . Random subvectors and are conditionally independent given the subvector if and only if nodes in separate nodes in from nodes in in G an ( , , ) .
A joint probability distribution is Markov with respect to a CG G if with ∈ , the chain components of G. Furthermore factors on r.h.s. of (3) may be factorized by considering the subgraph on nodes defined by ∪ pa( ): where −1 , ∈ Ω pa( ) are normalization constants, one for each conditioning value of the random subvector pa( ) . The working UG in (4) is obtained by removing the orientation of edges from parents of nodes in and by joining them into a complete undirected subgraph. Conditional independence relationships in CGs are also obtained from an extension of the separation theorem in UGs and DAGs for positive distributions [17]. Let G be a chain graph and ⊂ , ⊂ , ⊂ be three disjoint subsets of . The separation theorem states that subvectors and are conditionally independent given the subvector if and only if all paths from a node in to a node in in G an( ∪ ∪ ) include nodes of ; thus nodes in separate nodes in from nodes in . Note that G an( ∪ ∪ ) is the moral graph of the smallest ancestral set for ∪ ∪ , that is, a subgraph of G described in Section 2.1.

Causal DAGs.
A DAG may represent causal relations among variables. According to the causal semantic, an arrow V → V indicates that V is a direct cause of V with respect to nodes included in , that is, at the considered model granularity. In principle the intervention on variable V may bear an effect on V . The intervention on a subset of variables ⊂ indicates the external setting of variables in to prescribed values; thus the system or process is perturbed, not merely observed.
Pearl [2, pp. 27-32] starts with the definition of functional causal models, which are deterministic in nature, and he demonstrates [2, theorem 1.4.1] that such formulation induces the Markov factorization in (2), the so-called Markov causal assumption. An equivalent representation embeds exogenous variables into the node of interest and transforms the deterministic relationships into probabilistic conditioning, thus leading to Bayesian networks.
A key property of a casual DAG G is the stability under external intervention: if a variable V is manipulated all the other variables maintain their relationships as represented by G. In other terms the intervention is local on manipulated variables and it does not break all the other relationships represented in a causal DAG. The intervention regime is in contrast with the plain observation of values taken by the random vector . The granularity of a causal DAG depends on the variables included in the model. A variable not included in a causal DAG may eventually affect just one variable V , with V ∈ ; otherwise if several variables are affected then a more general class of models is needed, called semi-Markovian networks (not considered further in this work).
For an updated presentation of the approach see [18] while [19] warns against the blind definition of DAGs within the causal semantic in observational studies. He also reconsiders the foundations of causal inference by anchoring them to the extended conditional independence (C.I.), both at algebraic level and with a graphical counterpart based on influence diagrams.

Inference about the Structure of Bayesian Networks.
In all cases in which strong prior information is absent, structural learning is performed by means of a database D = ( 1 , . . . , ) of exchangeable observations of the random vector . Several algorithms have been proposed to infer causal and probabilistic relations, but in Bayesian inference key quantities enter into the joint probability distribution of D and network's unknowns given the context : where = ( V 1 ,pa(V 1 ) , . . . , V ,pa(V ) ) are vectors of parameters characterizing the conditional probability distributions of V given pa(V ) ; variable indicates the unknown DAG, and it is built as a bijection from the set of DAGs (fixed ) to a subset Ω of natural numbers.
The likelihood function (D | , , ) is typically expressed as a product of multinomials by using sufficient statistics. Whenever expert beliefs are reasonably captured by Dirichlet prior distributions for elements of , and they are elicited as marginally independent, closed-form integration marginalizes out thetas. The resulting marginal distribution (D | , ) has a reduced dimensionality and may be optimized with respect to while looking for optimal structures characterized by the highest posterior probability values [9]: with being the number of states taken by and of pa(V ) ; sufficient statistics are , , for the th variable taking 4 The Scientific World Journal the th state while its parent configuration is in the th state; , = ∑ =1 , , and , = ∑ =1 , , . A clever choice of hyperparameters guarantees the likelihood equivalence; that is, all BNs equivalent as regards the set of encoded conditional independence relationships are equally scored by (6).
Equation (6) shows that the elicitation of the prior distribution ( | ) is a key step to perform Bayesian structural learning using D.

Plausible Network Features.
A candidate structure ∈ Ω on a fixed set of nodes is plausible if it has got structural features (SFs) believed to be relevant by the expert. Following Stefanini [11,Definition 1], we recall the formal definition.
Definition 1 (SF). A structural feature (SF) R ( , ) in a reference set R for the set of DAGs on is a predicate describing a plausible probabilistic or causal characteristic of the unknown directed acyclic graph ∈ Ω . Argument is in the partition W of a given numeric domain Ω of variable . An atomic structural feature (ASF) R ( ) does not depend on any auxiliary variable .
A reference set R is a collection R = {R : ∈ } of SFs indexed in a set , with being the number of considered SFs. A proposition might be defined to carry disbelief to a candidate structure, but the feature-rises-belief direction is conveniently adopted here.
An example makes the previous definition operational. Let us define R 1 ( , ) = "The number of immediate causes of V 3 is in " to capture the expert belief about the number of parents of node V 3 in the unknown DAG. An expert may consider Ω = {0, 1, . . . , 12} and W = ({0}, {1, 2, 3}, {4, 5}, {6, . . . , 12}). Given a candidate structure , its plausibility will be determined by the element ∈ W that makes the predicate true. A simple representation of configurations taken by reference features is obtained by descriptors [11,Definition 2].
Vector induces equivalence classes on Ω ; that is, Z = {Z : ∈ Ω } contains sets of structures, with Z made by all those DAGs sharing the same configuration .
Note that members of the same equivalence class must be associated with the same degree of belief by the principle of insufficient reason: they differ in irrelevant and unconsidered ways by construction.
Despite the generality of Definition 1 some features are expected to occur more often than others. Below, some of them are described without pretending to be exhaustive.

Indegree and Outdegree.
In applications characterized by a small sample size, it is useful to impose a sharp constraint during the greedy search of top-scored candidate structures. The maximum number of arrows entering into a node, the indegree, is set to a small integer, say 2 to 5, to exclude structures with a large number of parents from the consideration. A large number of parents implies a CPT with a huge number of parameters which are affected by large uncertainty after conditioning to observed data because of sampling zeros. A similar constraint may be set on the number of arrows leaving a node.
Two reference features naturally embed this kind of information.
(i) "The maximum number of arrows reaching a node is in for all nodes in , " where is a small set of integers close to 1 elicited from the expert.
(ii) "The maximum number of arrows leaving a node is in for all nodes in , " where is a small set of integers close to 1 elicited from the expert.
The above two features may be exploited to increase the plausibility of candidate structures which are sparsely connected. Higher control on connectivity is obtained by considering the fraction of nodes showing a given degree of connectivity, as described below.

Partitioned Connectivity.
A different way of characterizing the connectivity is obtained by eliciting the minimum fraction of nodes in DAG showing a given number of parents (children) and by iterating the elicitation from 0 parents (children) up to a small integer .
Let be a small integer representing the maximum number of parents (children) to be considered. Let = ( 0 , . . . , ) be a vector of nondecreasing numbers in [0, 1], with being the sample space. The elicitation of this feature is based on a vector = ( 0 , 1 , 2 , . . . , , . . . , ), with 0 ≤ ≤ +1 < = 1, and on the induced partition W = { 0 , 1 } where and 0 = Ω \ 1 . Two reference features naturally embed the extended evaluation of connectivity: (i) "The -partitioned inconnectivity of degree is in 1 , " with , , and 1 defined above; a candidate structure shows this feature if the cumulative relative The Scientific World Journal 5 frequency of nodes in with number of parents equal or less than is greater than ; that is, ≥ , = 1, 2, . . . , .
(ii) "The -partitioned outconnectivity of degree is in 1 , " with , , and 1 defined above; a candidate structure shows this feature if the cumulative relative frequency of nodes in with number of children equal or less than is greater than ; that is, ≥ , = 1, 2, . . . , .
Trained experts could prefer a conventional total number of nodes equal to 100 to elicit cumulative percentages instead of cumulative fractions of nodes.
In Table 1, an example is shown where = 5 and the elicited vector is defined on fractions. A candidate structure has the partitioned inconnectivity feature if the fraction of root nodes is equal or above 0.01, while the proportion of nodes with at least 1 parent is equal or above 0.2 and so on.

Direct Cause and Direct Effect.
The reference feature "The variable V is an immediate cause of variable V " refers to V as a parent of V so that by setting (intervening on) the value of the variable V to a given value, the distribution of V is modified. This relation holds at the level of selected granularity; thus it may change if the collection of variables (nodes in ) is modified.

Causal Ancestors.
The "direct cause" feature may be extended by considering a variable V which is on a causal (directed) path reaching node V . In this case the reference feature is "The variable V is an indirect cause of variable V "; thus the expert believes that one or more variables mediate the effect of V on V .

Causal Hubs.
A hub node in a network is characterized by a high number of arrows leaving it. A reference feature which captures the local connectivity of node V is "Node V is a hub node of at least outdegree , " with the outdegree indicating the number of arrows originated in V and a set of integers. Note that the defined feature is a localized version of the outdegree feature.
The expert might believe that a hub node should be present, but without indicating a specific node. In this case the -partitioned outconnectivity feature (with a large ) can be exploited for this purpose.

Conditional Independence Relationships.
A statement about C.I. among three disjoint subsets of random variables may take the following form: "The random vector is conditionally independent from the random vector given vector , " with , , being disjoint subsets of nodes in .
2.6. The Degree of Belief. The prior distribution on the space of structures on is obtained by "extending the argument;" that is, By recognizing that induces the partition Z, it follows that where [ ] is the cardinality of the equivalence class in which is located and where the structural configuration of DAG is [ ] . In [12] the size of each equivalence class Z was estimated by Monte Carlo simulation to face the combinatorial explosion in the number of DAGs to be enumerated with the increase of the number of nodes: where is the total number of DAGs uniformly sampled from the space of all DAGs on , is the size of such space (see [10]), and ≤ is number of sampled DAGs showing configuration .
The numerator on the right of (11) represents the joint belief on each configuration of descriptors. While the elicitation in full generality becomes cognitively and numerically overwhelming around 7 descriptors on, some parsimony is achieved if a small number of descriptors may be considered at one time, so that conditional independence relationships among descriptors may be exploited. This is a choice available to the expert through the definition of an order relation on descriptors.

Definition 3 (ordered partition). The ordered partition
of descriptors is defined by the expert to indicate disjoints subsets of SFs to be jointly considered during the elicitation, from the first subset O 1 to the last.
Otherwise stated, the expert decomposes the whole elicitation problem following an order taken from the substantive content of the specific problem domain: features are grouped according to the priority in the elicitation.
The elicitation of the ordered partition is performed by formulating questions in the language typical of a given problem domain. It is indeed difficult to define those questions in general terms, because they result to be quite abstract and they are likely to be obscure for the domain expert (see [11]). We assume here that questions are properly phrased and that an ordered partition is defined. 6 The Scientific World Journal If a strict order relation is elicited, each element of O contains one descriptor: this is a special case addressed in [20]. Another special case is represented by a trivial partition of just one subset that contains all descriptors [21]. In the two special cases above it is possible to define, respectively, a Bayesian network and a Markov network on descriptors. The general case made by several subsets, each one of cardinality two or more, was addressed by using CG models in [11] for ASFs. Below details are provided about a reference set of nonatomic features.
The joint probability distribution of descriptors The resulting CG may support the elicitation if model parameters are cognitively suited for the quantitative step, that is, if they are interpretable and easy to assess for the expert, at least after some training.
The first chain component is elicited as a marginal distribution which is not conditioned on other descriptors. An undirected graphical model on descriptors in O 1 is defined through the multiplicative model in (4), under empty conditioning. Nevertheless some care is needed because potential functions on cliques are not uniquely defined. Here a log-linear parameterization is suggested, following [13]. For example, if the first chain component has just one clique made by three descriptors, say 1 , 2 , 3 , we define the multiplicative model as below, after exponentiation and rearrangement of log-linear terms: Thus the odds value with respect to the no-feature configuration 1,2,3 = (0, 0, 0) is explained by a multiplicative model where each factor, for example, 2,3 ( 2 3 ), is equal to one if one or more descriptors are null, otherwise they are positive (the so-called treatment parameterization). The elicitation in (14) is performed on the odds scale by asking the domain expert how many times the configuration ( 1 , 2 , 3 ) is more plausible than (0, 0, 0) for descriptors 1 , 2 , 3 . This question is iterated for all configurations in the Cartesian product Ω 1 × Ω 2 × Ω 3 after exclusion of (0, 0, 0). Questions are posed from single main effects towards higher order interactions terms, so that one factor at a time is considered (see algorithm below).
The procedure is iterated for all cliques in the first CG component 1 , and indeed factors already elicited in previously considered cliques are not reconsidered anymore.
The general algorithm for the first chain component is summarized below.
(1) Consider the undirected graph on descriptors elicited as the first chain component. Control questions include the following: "Are all the relevant features included in the elicitation?", "Are all pairs of features jointly affecting the probability of a structure linked by an undirected edge?". If needed, revise the order relation and (or) links within chain component.
(2) Check out that each descriptor V given its neighbors ne(V ) is independent of all other descriptors in the first CG component.
A control question for this step is: "How much above one is the odds value for the sole presence of feature = , > 0?" where ( , ), ( , ) are already elicited; this is the cross product ratio of features , , after dividing by ( , ), ( , ). It is clear that if the interaction is absent , ( , , , ) = 1, while , ( , , , ) ∈ (0, 1) means that the interaction reduces the plausibility and , ( , , , ) > 1 raises the plausibility of the considered configuration. (iii) Iterate the step above with higher order interaction terms (for all configurations) with two constraints: before moving to higher interaction terms all the terms of the same degree must have been already elicited; moreover the maximum degree of interaction among a subset of features is defined by the size of the clique they belong to (model generator). For example, with the interaction of order two { , , ( , , , , , )}, the control question might be: "Having already elicited values of ( , ), ( , ), ( , ), , ( , , , ), , ( , , , ), and , ( , , , ), which is the value of the multiplicative term { , , ( , , , , , )} needed to adjust the odds value after considering the interaction among features whose configurations are , > 0, , > 0, , > 0?"; the helper expression is where factor is already elicited. Similar expressions follow straightforwardly for higher order interactions.  The algorithm presented above may be also applied to chain components 2 , 3 , . . ., that is, with a nonempty set of parents. The elicitation of parameters in a conditional Markov network is performed by iterating the algorithm for the first chain component over each configuration of conditioning parents. It follows that the elicitation burden depends on the number of parents, more precisely, on the cardinality of the Cartesian product where factors are samples spaces of parents.
The algorithms defined above lead to a prior distribution on configurations, but the elicitation does not end before the revision of elicited values takes place.

Implementation and Revision
Issues. The flexibility achieved by defining predicates entails potential pitfalls that should be considered during an elicitation run with structural features.
The revision of elicited beliefs is always needed because the expert is not expected to provide unbiased elicited values, especially with limited training or in complex problem domains. Revision and elaboration of the prior distribution [22, and references therein] are applied to control elicitation bias and other causes of poor elicitation.
Overelicitation is an important practice to check for the presence of elicited quantities that do not reflect the expert degree of belief. Nevertheless, in large networks elaboration of elicited probability values is the main resource for checking the quality of elicitation. In the proposed framework, the elicitation through reference features produces proper probability distributions; therefore one elaboration consists of inspecting one or more margins of the probability distribution ( 1 , . . . , | ), for example, the bivariate margin ( , | ), to look for configurations whose plausibility causes surprise or disbelief in the expert. This operation is particularly meaningful if inspected margins do not straightforwardly relate to elicited odds, for example, taking descriptors belonging to different cliques. Surprise and disbelief ask for the revision of elicited values. If an association between selected margins and bias is suspected, random selection of margins is an option.
The stability of elicited values against different order relations on descriptors (reference features) should not be assumed. Nevertheless, in complex problem domains overelicitation made by the repetition of the interview with other ordered partitions not only seems unacceptable as regards the work load, but it could even be cognitively unfeasible, for example, if the original ordered partition selected by the expert is induced by a scientific hypothesis.
The core of the proposed approach is based on predicates representing structural features. The expert might believe that some configurations of features are plausible although they are incompatible with DAGs, for example, because they imply the presence of cycles. A positive probability value would be assigned to such a configuration * , but this is not an instance of elicitation bias if the elicited distribution properly matches expert beliefs. If this is the case, ( | ) = 0 because [ = | = * ] = 0.
The way a predicate is specified determines the granularity of the elicitation and the cardinality of equivalence classes in Z. For example, let us consider two nodes V ∈ and V ∈ and the reference feature R 1 = "Nodes V , V are not descendent of other nodes in ".
Reworking the original proposition we have two simpler predicates: (i) R 1 = "Node V is not a descendent of other nodes in "; (ii) R 1 = "Node V is not a descendent of other nodes in . " and they can be considered by conjunction. In Table 2, the relation between the original reference feature (right) and the conjoint components (left) is shown: ¬R 1 collects three configurations generated by the conjunction of simpler predicates. It follows that simpler predicates are needed if the expert degree of belief changes over collapsed configurations in R 1 , that is, ¬R 1 ∩ R 1 , R 1 ∩ ¬R 1 , ¬R 1 ∩ ¬R 1 . While nothing prevents the expert from defining rich predicates, care should be taken to select a granularity suited to properly represent expert beliefs. There are indeed several different ways of formulating a predicate. If two reference sets of features induce the same set of equivalence classes Z then they are operationally equivalent. Nevertheless, from a cognitive standpoint, substantial 8 The Scientific World Journal Table 2: Relation between the original feature (right) and the conjunction of two simpler structural features (left).

Simpler feature 1
Simpler feature 2 Original feature differences in the ease of elicitation might depend on the way propositions are formulated [22]. Let us consider the reference feature R 1 = "Nodes V , V precede all other nodes. " Does the expert use "precede" as "come before" all other nodes in the order relationship of nodes in ? Given a DAG with V disconnected from other nodes how to answer? Does the expert use "precede" meaning "are ancestors of all other nodes in " or to indicate that those nodes do not receive arrows coming from other nodes? This example makes clear that some training is mandatory in order to make an expert effective in the elicitation: a trained expert is expected to choose the right granularity in the elicitation and to define meaningful predicates, that is, statements straightforwardly true/false when applied to any DAG defined on . Several reference features are jointly considered in actual applications, a number typically far beyond what the expert may simultaneously consider with success. Thus there is the possibility that exclusion and implication relations are not recognized. For example, let us consider two features: R 3 = "The indegree is three or less" and R 25 = "V is a sink node. " After reworking the last feature, the expert reformulates the statement as "Node V has indegree ten or more. " Clearly the plausibility of R 3 ∧ R 25 should be null; otherwise the actual interpretation of R 3 is "The indegree of all but V is three or less. " Implication must be recognized to maintain the probabilistic coherence; for example, let R 3 = "Variable V is a direct cause of V " and R 5 = "Variable V is an ancestral cause of V " be two reference features. Clearly R 3 implies R 5 and the joint plausibility of both features is bound to the plausibility of R 3 . In a normalized reference set logical relationships among features are properly handled.

Results and Discussion
A few seminal approaches to the elicitation of beliefs on structures are reconsidered as special cases in the proposed framework based on structural features. A published case study on breast cancer (BC case-study) [23] will be (partially) exploited at illustrative purposes. The whole set of nodes includes: age (AGE), the proliferative index-marker Ki67/MIB-1 (PROLN), oestrogen receptors (ER), progesterone receptors (PR), the receptor tyrosine kinase HER2/neu (NEU), and the P53 protein.

Buntine 1991.
In the seminal paper of Buntine [24], a prior distribution on the set of DAGs for a fixed set is defined by assuming a total ordering of nodes in the context . The probability of the parent set for node V is defined by the product of probability for events like "There is an edge → V, " shortly [" → V"], extended to each preceding V in the order relation. The subjective probability value elicited for a network structure is calculated by marginal independence of parent sets. The original formalization defines a total ordering ≺, so that if ≺ it may belong to the parent's set Π of ; then a full specification of beliefs for each edge in the directed graph is needed and measured in units of subjective probability; finally the independence of parent sets (Π 1 , . . . , Π ) is assumed. The distribution on the set of structures is ( [24], modified) Given the order relation Despite the huge importance of Buntine's seminal work [24], some limitations should be underlined. Under the causal semantic of BNs, the expert might fail in stating such node order which defines the "causal flow" along nodes. In large regulatory networks of system biology a lot of assignments are expected to be 0.5 because expert beliefs may involve a small subset of arrows. We also expect that some plausibility assignments depend on what is already assigned, for example, due to biological substantive laws, but the need of such conditioning is not accounted for. Finally, we remark that several node orders are compatible with the same sparse DAG on ; therefore the specification of a strict order should not be enforced.
The above example may be straightforwardly cast in terms of reference features. The order relation is part of the context , and we define the set of reference features , = "An arrow is from V to V , " with all possible pairs (V , V ) in which V precedes V in the total ordering ≺ equal to (V 3 , V 4 , V 1 , V 2 ). The reference set is R = {R 3,4 , R 3,1 , R 3,2 , R 4,1 , R 4,2 , R 1,2 }. In this case a trivial Bayesian network without arrows is equivalent to the prior distribution in (19) and (20) if conditional probability tables are defined by the same values specified in (20) for each feature; that is, features are marginally independent. To see this, note that the context puts a sharp constraint on the space of structures and that the cardinality of the equivalence classes for each configuration of reference features is equal to one; that is, The formal approach proposed in this paper allows much more flexibility, for example, by restricting the set of nodes to be ordered and by introducing dependence among arrows.
The Scientific World Journal 9 For example, let = (V 3 , V 4 , V 1 , V 2 ) ⊂ be an ordered subset of nodes taken from BC case study. The reference feature R = " is the order on the relevant subset of nodes" induces two equivalence classes, the first made by DAGs on which do not satisfy R , the second one is made by structures without arrows violating the left-to-right order defined in . Besides the sharp constraint obtained by setting [ = 1 | ] = 1, the expert might consider such feature uncertain, thus preferring a degree of belief in the set (0, 1). Further features could be defined to build the reference set: and the ordered partition captures weaker causal relationships by features like R , = "Variable V is a causal ancestor of V . " A CG model made by two components makes possible to elicit conditional beliefs about V 4 → V 2 given the presence of an arrow V 3 → V 4 and given the lack of V 3 → V 1 , without assuming a strict order relation on nodes in .

Heckerman et al. (1995).
In Heckerman et al. [9], a prior network P was elicited and compared to a candidate network by counting the number of different edges, , with a high degree of belief assigned to structures closely resembling the prior network. The authors suggested to elicit the hyperparameter 0 < < 1 and to define the prior distribution to be proportional to .
Among the limitations penalizing the use of this prior we found the following: (i) The impossibility of specifying the degree of belief if it depends not only on the number of different edges but also on their position and type; the presence/absence/direction of an arrow may have an impact on the belief about other edges. (ii) The elicitation about a subset of is not addressed. (iii) The causal semantic is natural for this approach because each arrow represents an immediate cause; it seems difficult to mix probabilistic and causal beliefs by counting differences in arrows because a DAG in the probabilistic semantic is just a member of an equivalence class of DAGs representing the same collection of conditional independence relationships.
A simple reformulation is oriented to computation and it involves three operators: arrow deletion, insertion, and change of direction. The reference set of atomic features is R = {R ( ) : = 0, 1, 2, . . . , } ∪ { ( ) : = + 1} with R ( ) = "The application of operations produces P , " and where ( ) = "The application of or more operations produces P . " An undirected graph made by just one clique is associated with potentials represented in Table 3, where = 3. On the right of Table 3, values of the potential function are shown. The normalization constant is = ( 0 + 1 + 2 + 3 + ) −1 . It is obviously possible to make the two approaches as close as desired by setting ∝ with ≤ 3. Table 3: Potential function for the Heckerman et al. [9] prior. An even simpler reformulation exploits the incompatibility of the above features and it is based on just one reference feature: R 1 ( , ) = "The application of operations produces P , " with ∈ A 1 = ({0}, {1}, . . . , {4, 5, . . .}), with arrow manipulations (insert-delete-change) defined as above. Feature R 1 essentially defines a plausible neighborhood with respect to a prior network. Further reference features may be introduced to refine such plausibility, for example, by also considering one causal, R 2 , and one C.I., R 3 , relationships. In other words, a candidate network "close" to the prior network could be associated with a high prior probability which is then tuned according to the presence/absence of two other relevant features. A natural ordered partition on descriptors could be O = ({ 1 }, { 2 , 3 }); thus a twocomponent chain graph model may support the elicitation.
A different kind of reformulation is based on the fullprobabilistic semantic in which a prior DAG P is just a way to define a collection of C.I. relationships. In this case, it is natural to define a structural feature for each conditional independence relationship if it is relevant according to the expert among those represented by P . The reference set R = {R 1 , . . . , R } in this case is a collection of plausible C.I. statements taken from the prior DAG. Note that, although it is not essential to draw such prior DAG, it may be useful because a general collection of C.I. statements does not necessarily imply the existence of a compatible DAG.
If none among the C.I. relations is preeminent the ordered partition of descriptors contains just one element, say O = ({ 1 , . . . , }), and a CG model made by one component (undirected graphical model) is suited for the elicitation. Imoto et al. (2003). In the seminal paper of Imoto et al. [25], the authors developed a framework for combining microarray data and biological knowledge while learning the structure of a BN representing relationships among genes. The proposed criterion has two components, and the second one is particularly interesting because it captures the a priori biological knowledge.

3.3.
Following their original notation with minor modifications, ( ) is the prior distribution of network . Then, the interaction energy , of the edge from (gene) V to (gene) V is defined on a sample space which is categorized into values, say 1 , 2 , . . . , . For example if gene V regulates gene V then , = 1 > 0, but if not much is known about such potential regulation then , = 2 > 1 . The total energy of a network is ( ) = ∑ (V ,V )∈ , ; thus the sum is taken over existing edges in . The total energy may be 10 The Scientific World Journal rewritten by collecting the parents of each node; thus ( ) = ∑ V ∈ ∑ V ∈ (V ) , = ∑ V ∈ . The (prior) probability of network is modeled by the Gibbs distribution: where > 0 is a hyperparameter and is the normalizing constant, also called the partition function = ∑ ∈G exp(− ( )), with G being the collection of all DAGs on a fixed set of nodes .
Operationally, the prior information is coded into a square matrix of size defined by the number of genes, with each , corresponding to The seminal approach of Imoto et al. [25] suffers of two main limitations. They stated that the biological knowledge should suggest the partitioning of the underlining continuous energy function but it is not clear how, even after invoking the metaphor of energy from physics. In Example 3.2, they tried 1 = 0.5 (but also 1 = 1) and optimized the selection of 2 = 2.5, a procedure not much in line with pure preexperimental Bayesian elicitation. Moreover, the sum in the partition function is taken on the set of DAGs on , and it becomes quite intractable from 5 nodes on. It follows that their approach is substantially a way to build a score function in a spirit similar to [26], but without providing a flexible support for the calibration of the score function.
The prior information considered by these authors can be also expressed in terms of reference features, for example, by considering features like R , = "Gene V regulates gene V , " for all relevant pairs of genes. If preeminent features are absent, an UG model formally resembles the expression based on energy functions but with some major differences: (1) the parameterization is related to subjective probability through odds values; (2) the normalization constant is easier to calculate, at least if the number of features is less than the total pair of genes (some genes omitted); (3) the general calibration constant disappears, although a similar tuning has been considered elsewhere [21] to smooth raw elicited odds values while trying to compensate for the well known human tendency towards overstating odds values. Werhli and Husmeier (2007). The authors [27], building on the work of Imoto et al. [25], defined a prior information matrix whose elements , ∈ [0, 1], with , being a pair of integers for nodes V and V and the relation V → V . If no prior preference about the presence of such arrow is elicited, then , = 0.5; if 0 ≤ , < 0.5 elicited beliefs put more plausibility on the lack of arrow V → V ; if 0.5 < , ≤ 1 higher plausibility favors the presence of the arrow V → V . Note that elements , in are not probabilities.

3.4.
The calculation of the prior probability for a candidate DAG is straightforward if it is represented through an adjacency matrix in which the element , in row and column is 1 if the DAG has an arrow form V to V , and it is zero otherwise. At first the energy of the DAG is calculated as with , and , being, respectively, the elements of and . The probability elicited for is where is a hyperparameter regulating the overall strength of the elicited degree of belief and with being the partition function which depends on a sum of energy values over the collection of all DAGs on a fixed set of nodes . It follows that an increase of energy is related to a larger mismatch with the prior . In the limit of → 0 a noninformative prior is obtained, while for diverging to infinity peaked priors are defined.
The relationship with the seminal approach of Imoto et al. [25] is clear but despite the easier parameterization chosen to define the energy function, several limitations are still present. First of all, the overall strength is not straightforwardly related to (subjective) probability; therefore at some point the expert has to play numerically with fake examples to get the feeling on reasonable values for such hyperparameter. The calibration of the whole approach is difficult because it depends on the calculation of the normalization constant which is hard to obtain due to the calculation of energy values for all DAGs on a fixed set of nodes . Despite the shortcut proposed by the authors, the calculation of the normalization constant remains as a bottleneck of the approach.
The authors increased the flexibility of their approach in representing prior beliefs by using two matrices (1) and (2) : with E 1 ( ), E 2 ( ) being the energy functions, respectively, depending on (1) and (2) . In the elicitation based on a widely adopted database of metabolic pathways, values of are defined as the ratio , / , , with , being the number of times two genes appear in a pathway and , the number of times that they are linked inside such pathway. Results of a study on 25 genes measured at 73 time points suggested that the procedures using prior information outperformed those without it. Nevertheless, the above mentioned limitations get even worse under such generalization; for example, the explosion in the number of terms entering the partition function must be taken under control by introducing a limitation on the number of arrows entering a node, a technical constraint which may lead to biased elicited beliefs. The expressivity obtained by these authors through the use of two matrices may be similarly obtained using reference features with some advantages. Back to the BC case study, a major hypothesis under which the elicitation may be performed could be R 1 = "ER is a hub gene, " so that given the configuration 1 = 1 the expert has to express further conditional beliefs about the other markers; for example, R 2 = "PR regulates P53, " R 3 = "P53 acts on KI67, " R 4 = "P53 acts on NEU. " In this context it is natural to consider the order relation on features O = ({ 1 }, { 2 , 3 , 4 }); thus a chain graph model is suited for the elicitation with just node 1 in the first chain component and as many arrows leaving from 1 as needed to capture changes of conditional beliefs due to a switch in the major hypothesis, from 1 = 1 to 1 = 0. Note that, using a chain graph on features, it is possible to tune the flexibility exactly of the amount needed, without the unnecessary and blind consideration of all pairs of genes.
3.5. Discussion. The expressivity achieved through reference features is wide whether probabilistic or causal information is elicited. Many limitations found in other approaches depend on the consideration of arrows as the key building block of the elicitation. Further restrictions are due to the use of marginal relationships among arrows, so that severe constraints on the expressivity of the approach follow. The elicitation based on reference features has maximum resolution because, for a suitable set of reference features, it is possible to define a prior distribution characterized by a probability value for each DAG defined on .
There are useful side effects in the approach based on structural features. First, the elicitation effort does not depend on the size of the space of DAGs on , but on the size of the collection of features. Another side effect is related to the cardinality of equivalence classes in Z. Two candidate DAGs and characterized by two different configurations of structural features may receive a very different prior probability value just because the cardinality of the equivalence classes they belong to is very different, say [ ] ⋘⋙ [ ] (see (11)). Note that the cardinality of equivalence classes must be taken into account to preserve probabilistic coherence.
The above description of the elicitation process did not deal with computational issues that are very important for applications. The proposed approach is suited to large networks if the number of DAGs within each equivalence class is available, for example, as a Monte Carlo point estimate. Monte Carlo simulation makes it possible to explore the space of DAGs and it provides evidences about the presence of features which are logically incompatible; thus it may also suggest predicates on which the expert should focus to improve the definition of reference features. Moreover, the elicitation defines a proper prior distribution, so that different MCMC algorithms could be developed to obtain the posterior distribution of . New greedy search algorithms to find plausible structures in Ω could be investigated to exploit the elicited reference features, for example, by developing an algorithm that generates candidate DAGs belonging to different equivalence classes in Z at each iteration of the optimization.
The approach to elicitation based on graphical models does not necessitate very esoteric software, although software libraries for platforms commonly adopted in scientific computing would offer the opportunity of performing extensive testing and of investigating human heuristics specifically relevant in this elicitation framework. It is well known that graphical user interfaces facilitate the elicitation, especially if experts are not much trained in Bayesian statistics, and this is a resource which is anyway almost mandatory to face the elicitation in problem domains involving a large number of structural features.

Conclusions
Graphical models may be exploited to elicit beliefs about the structure of an unknown BN from experts. The joint plausibility on configurations of structural features is decomposed according to conditional independence relationships that are considered plausible by an expert. The expert may use CG models to elicit structural prior information in quite complex domains without leaving a full Bayesian framework. From the elicited CG model, and eventually by using an auxiliary Monte Carlo simulation to estimate the cardinality of equivalence classes, a (proper) subjective prior distribution on the space of DAGs is built and ready to be used with the likelihood function in order to find BN structures supported both by expert beliefs and by collected observations.
No surprise that in complex domains the definition of a prior distribution may be costly. A trade-off should be found by considering the goal of the analysis, how much prior information is available, and the cost and importance of collected data. Here system biology and medicine are expected to be fields in which the proposed approach might be useful, because subjective prior information, besides providing the above mentioned benefits, also tempers the curse of dimensionality caused by structures defined on a high number of variables.

Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.