{"title": "Gates", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": "Gates are a new notation for representing mixture models and context-sensitive independence in factor graphs. Factor graphs provide a natural representation for message-passing algorithms, such as expectation propagation. However, message passing in mixture models is not well captured by factor graphs unless the entire mixture is represented by one factor, because the message equations have a containment structure. Gates capture this containment structure graphically, allowing both the independences and the message-passing equations for a model to be readily visualized. Different variational approximations for mixture models can be understood as different ways of drawing the gates in a model. We present general equations for expectation propagation and variational message passing in the presence of gates.", "full_text": "Tom Minka\n\nMicrosoft Research Ltd.\n\nCambridge, UK\n\nGates\n\nAbstract\n\nJohn Winn\n\nMicrosoft Research Ltd.\n\nCambridge, UK\n\nGates are a new notation for representing mixture models and context-sensitive\nindependence in factor graphs. Factor graphs provide a natural representation for\nmessage-passing algorithms, such as expectation propagation. However, message\npassing in mixture models is not well captured by factor graphs unless the en-\ntire mixture is represented by one factor, because the message equations have a\ncontainment structure. Gates capture this containment structure graphically, al-\nlowing both the independences and the message-passing equations for a model\nto be readily visualized. Different variational approximations for mixture models\ncan be understood as different ways of drawing the gates in a model. 
We present\ngeneral equations for expectation propagation and variational message passing in\nthe presence of gates.\n\n1 Introduction\n\nGraphical models, such as Bayesian networks and factor graphs [1], are widely used to represent\nand visualise \ufb01xed dependency relationships between random variables. Graphical models are also\ncommonly used as data structures for inference algorithms since they allow independencies between\nvariables to be exploited, leading to signi\ufb01cant ef\ufb01ciency gains. However, there is no widely used\nnotation for representing context-speci\ufb01c dependencies, that is, dependencies which are present or\nabsent conditioned on the state of another variable in the graph [2]. Such a notation would be\nnecessary not only to represent and communicate context-speci\ufb01c dependencies, but also to be able\nto exploit context-speci\ufb01c independence to achieve ef\ufb01cient and accurate inference.\n\nA number of notations have been proposed for representing context-speci\ufb01c dependencies, includ-\ning: case factor diagrams [3], contingent Bayesian networks [4] and labeled graphs [5]. None of\nthese has been widely adopted, raising the question: what properties would a notation need, to\nachieve widespread use? We believe it would need to be:\n\n\u2022 simple to understand and use,\n\u2022 \ufb02exible enough to represent context-speci\ufb01c independencies in real world problems,\n\u2022 usable as a data structure to allow existing inference algorithms to exploit context-speci\ufb01c\n\nindependencies for ef\ufb01ciency and accuracy gains,\n\n\u2022 usable in conjunction with existing representations, such as factor graphs.\n\nThis paper introduces the gate, a graphical notation for representing context-speci\ufb01c dependencies\nthat we believe achieves these desiderata. Section 2 describes what a gate is and shows how it can\nbe used to represent context-speci\ufb01c independencies in a number of example models. 
Section 3 motivates the use of gates for inference, and section 4 expands on this by showing how gates can be used within three standard inference algorithms: Expectation Propagation (EP), Variational Message Passing (VMP) and Gibbs sampling. Section 5 shows how the placement of gates can trade off the cost of inference against its accuracy. Section 6 discusses the use of gates to implement inference algorithms.\n\nFigure 1: Gate examples. (a) The dashed rectangle indicates a gate containing a Gaussian factor, with selector variable c. (b) Two gates with different key values used to construct a mixture of two Gaussians. (c) When multiple gates share a selector variable, they can be drawn touching, with the selector variable connected to only one of the gates. (d) A mixture of N Gaussians constructed using both a gate and a plate. For clarity, factors corresponding to variable priors have been omitted.\n\n2 The Gate\n\nA gate encloses part of a factor graph and switches it on or off depending on the state of a latent selector variable. The gate is on when the selector variable has a particular value, called the key, and off for all other values. A gate allows context-specific independencies to be made explicit in the graphical model: the dependencies represented by any factors inside the gate are present only in the context of the selector variable having the key value. Mathematically, a gate represents raising the contained factors to the power zero if the gate is off, or one if it is on: (∏_i f_i(x))^{δ(c=key)}, where c is the selector variable. In diagrams, a gate is denoted by a dashed box labelled with the value of key, with the selector variable connected to the box boundary. The label may be omitted if c is boolean and key is true. 
Whilst the examples in this paper refer to factor graphs, gate notation can also be used in both directed Bayesian networks and undirected graphs.\n\nA simple example of a gate is shown in figure 1a. This example represents the term N(x; m, p^{−1})^{δ(c=true)}, so that when c is true the gate is on and x has a Gaussian distribution with mean m and precision p. Otherwise, the gate is off and x is uniformly distributed (since it is connected to nothing).\n\nBy using several gates with different key values, multiple components of a mixture can be represented. Figure 1b shows how a mixture of two Gaussians can be represented using two gates with different key values, true and false. If c is true, x will have distribution N(m_1, p_1^{−1}); otherwise x will have distribution N(m_2, p_2^{−1}). When multiple gates have the same selector variable but different key values, they can be drawn as in figure 1c, with the gate rectangles touching and the selector variable connected to only one of the gates. Notice that in this example, an integer selector variable is used and the key values are the integers 1, 2, 3.\n\nFor large homogeneous mixtures, gates can be used in conjunction with plates [6]. For example, figure 1d shows how a mixture of N Gaussians can be represented by placing the gate, Gaussian factor and mean/precision variables inside a plate, so that they are replicated N times.\n\nGates may be nested inside each other, implying a conjunction of their conditions. 
To avoid ambiguities, gates cannot partially overlap, nor can a gate contain its own selector variable.\n\nFigure 2: Examples of models which use gates. (a) A line process where neighboring pixel intensities are independent if an edge exists between them. (b) Testing for dependence between a genetic variant gn and an observed quantitative trait xn. The selector variable c encodes whether the linear dependency represented by the structure inside the gate is present or absent.\n\nGates can also contain variables, as well as factors. Such variables have the behaviour that, when the gate is off, they revert to having a default value of false or zero, depending on the variable type. Mathematically, a variable inside a gate represents a Dirac delta when the gate is off: δ(x)^{1−δ(c=key)}, where δ(x) is one only when x has its default value. Figure 2b shows an example where variables are contained in gates; this example is described in the following section.\n\n2.1 Examples of models with gates\n\nFigure 2a shows a line process from [7]. The use of gates makes clear the assumption that two neighboring image pixels xi and xj have a dependency between their intensity values, unless there is an edge eij between them. An opaque three-way factor would hide this context-specific independence. Gates can also be used to test for independence. In this case the selector variable is connected only to the gate, as shown in the example of figure 2b. This is a model used in functional genomics [8] where the aim is to detect associations between a genetic variant gn and some quantitative trait xn (such as height, weight, intelligence etc.) given data from a set of N individuals. 
The binary selector variable c switches on or off a linear model of the genetic variant's contribution yn to the trait xn, across all individuals. When the gate is off, yn reverts to the default value of 0, and so the trait is explained only by a Gaussian-distributed background model zn. Inferring the posterior distribution of c allows associations between the genetic variation and the trait to be detected.\n\n3 How gates arise from message-passing on mixture models\n\nFactor graph notation arises naturally when describing message passing algorithms, such as the sum-product algorithm. Similarly, the gate notation arises naturally when considering the behavior of message passing algorithms on mixture models.\n\nAs a motivating example, consider the mixture model of figure 1b when the precisions p1 and p2 are constant. Using 1 and 2 as keys instead of true and false, the joint distribution is p(x, c, m1, m2) = p(c) p(m1) p(m2) f(x|m1)^{δ(c−1)} f(x|m2)^{δ(c−2)}, where f is the Gaussian distribution. If we apply a mean-field approximation to this model, we obtain the following fixed-point system:\n\nq(c=k) ∝ p(c=k) exp( ∑_x q(x) ∑_{m_k} q(m_k) log f(x|m_k) )   (1)\nq(m_k) ∝ p(m_k) [ exp( ∑_x q(x) log f(x|m_k) ) ]^{q(c=k)}   (2)\nq(x) ∝ ∏_k [ exp( ∑_{m_k} q(m_k) log f(x|m_k) ) ]^{q(c=k)}   (3)\n\nThese updates can be interpreted as message-passing combined with "blurring" (raising to a power between 0 and 1). For example, the update for q(m_k) can be interpreted as (message from prior)×(blurred message from f). The update for q(x) can be interpreted as (blurred message from m1)×(blurred message from m2). Blurring occurs whenever a message is sent from a factor having a random exponent to a factor without that exponent. Thus the exponent acts like a container, affecting all messages that pass out of it. 
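As a concrete illustration (not from the paper), the fixed-point system (1)-(3) can be simulated by coordinate ascent with the variables discretized onto a grid, so that the sums are evaluated directly. The grid, the priors over the means, and the stand-in data term for q(x) below are all assumptions made for this sketch:

```python
import numpy as np

# Hypothetical sketch: coordinate-ascent on updates (1)-(3) for the
# two-Gaussian mixture of figure 1b with fixed precisions. The grid,
# priors and the stand-in data term for q(x) are illustrative choices.

grid = np.linspace(-5.0, 5.0, 201)        # shared grid for x and m_k
prec = [1.0, 1.0]                         # fixed precisions p_1, p_2
p_c = np.array([0.5, 0.5])                # prior p(c)

def normalize(p):
    return p / p.sum()

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# L[k][i, j] = log f(x_i | m_j) for component k (Gaussian, up to a constant)
L = [-0.5 * pk * (grid[:, None] - grid[None, :]) ** 2 for pk in prec]

prior_m = [normalize(np.exp(-0.05 * (grid + 1.0) ** 2)),   # broad prior on m_1
           normalize(np.exp(-0.05 * (grid - 1.0) ** 2))]   # broad prior on m_2
q_m = [p.copy() for p in prior_m]
q_x = normalize(np.exp(-0.25 * (grid - 0.5) ** 2))         # stand-in data term

for _ in range(50):
    # (1): q(c=k) ∝ p(c=k) exp( Σ_x q(x) Σ_{m_k} q(m_k) log f(x|m_k) )
    q_c = softmax(np.log(p_c) + np.array([q_x @ L[k] @ q_m[k] for k in range(2)]))
    # (2): q(m_k) ∝ p(m_k) [exp( Σ_x q(x) log f(x|m_k) )]^{q(c=k)}  -- blurred
    for k in range(2):
        q_m[k] = normalize(prior_m[k] * np.exp(q_c[k] * (q_x @ L[k])))
    # (3): q(x) ∝ Π_k [exp( Σ_{m_k} q(m_k) log f(x|m_k) )]^{q(c=k)}  -- blurred
    q_x = softmax(sum(q_c[k] * (L[k] @ q_m[k]) for k in range(2)))
```

Note how the blurring exponent q(c=k) appears only on messages leaving the exponentiated factors, in updates (2) and (3), while the message into the selector in update (1) is an unblurred expectation.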
Hence, we use a graphical notation where a gate is a container, holding all the factors switched by the gate. Graphically, the blurring operation then happens whenever a message leaves a gate. Messages passed into a gate and within a gate are unchanged.\n\nThis graphical property holds true for other algorithms as well. For example, EP on this model will blur the message from f to m_k and from f to x, where "blurring" means a linear combination with the 1 function followed by KL-projection.\n\n3.1 Why gates are not equivalent to 'pick' factors\n\nIt is possible to rewrite this model so that the f factors do not have exponents, and therefore would not be in gates. However, this will necessarily change the approximation. This is because the blurring effect caused by exponents operates in one direction only, while the blurring effect caused by intermediate factors is always bidirectional. For example, suppose we try to write the model using a factor pick(x|c, h1, h2) = δ(x − h1)^{δ(c−1)} δ(x − h2)^{δ(c−2)}. We can introduce latent variables (h1, h2) so that the model becomes p(x, c, m1, m2, h1, h2) = p(c) p(m1) p(m2) f(h1|m1) f(h2|m2) pick(x|c, h1, h2). The pick factor will correctly blur the downward messages from (m1, m2) to x. However, the pick factor will also blur the message upward from x before it reaches the factor f, which is incorrect.\n\nAnother approach is to pick from (m1, m2) before reaching the factor f, so that the model becomes p(x, c, m1, m2, m) = p(c) p(m1) p(m2) f(x|m) pick(m|c, m1, m2). In this case, the message from x to f is not blurred, and the upward messages to (m1, m2) are blurred, which is correct. However, the downward messages from (m1, m2) to f are blurred before reaching f, which is incorrect.\n\n3.2 Variables inside gates\n\nNow consider an example where it is natural to consider a variable to be inside a gate. 
The model is: p(x, c, m1, m2, y) = p(c) p(m1) p(m2) ∏_k ( f1(x|y) f2(y|m_k) )^{δ(c−k)}. If we use a structured variational approximation where y is conditioned on c, then the fixed-point equations are [9]:\n\nq(c=k) ∝ p(c=k) exp( ∑_y q(y|c=k) ∑_x q(x) log f1(x|y) ) exp( ∑_y q(y|c=k) ∑_{m_k} q(m_k) log f2(y|m_k) ) exp( −∑_y q(y|c=k) log q(y|c=k) )   (4)\nq(y|c=k) ∝ exp( ∑_x q(x) log f1(x|y) ) exp( ∑_{m_k} q(m_k) log f2(y|m_k) )   (5)\nq(m_k) ∝ p(m_k) [ exp( ∑_y q(y|c=k) log f2(y|m_k) ) ]^{q(c=k)}   (6)\nq(x) ∝ ∏_k [ exp( ∑_y q(y|c=k) log f1(x|y) ) ]^{q(c=k)}   (7)\n\nNotice that only the messages to x and m_k are blurred; the messages to and from y are not blurred. Thus we can think of y as sitting inside the gate. The message from the gate to c can be interpreted as the evidence for the submodel containing f1, f2, and y.\n\n4 Inference with gates\n\nIn the previous section, we explained why the gate notation arises when performing message passing in some example mixture models. In this section, we describe how gate notation can be generally incorporated into Variational Message Passing [10], Expectation Propagation [11] and Gibbs sampling [7] to allow each of these algorithms to support context-specific independence.\n\nFor reference, Table 1 shows the messages needed to apply standard EP or VMP using a fully factorized approximation q(x) = ∏_i q(x_i). Notice that VMP uses different messages to and from deterministic factors, that is, factors which have the form f_a(x_i, x_a\i) = δ(x_i − h(x_a\i)), where x_i is the derived child variable. Different VMP messages are also used to and from such deterministic derived variables. For both algorithms the marginal distributions are obtained as q(x_i) = ∏_a m_{a→i}(x_i), except for derived child variables in VMP, where q(x_i) = m_{par→i}(x_i). The (approximate) model evidence is obtained as a product of contributions, one from each variable and each factor. 
Table 1 shows these contributions for each algorithm, with the exception that deterministic factors and their derived variables contribute 1 under VMP.\n\nWhen performing inference on models with gates, it is useful to employ a normalised form of gate model. In this form, variables inside a gate have no links to factors outside the gate, and a variable outside a gate links to at most one factor inside the gate. Both of these requirements can be achieved by splitting a variable into a copy inside and a copy outside the gate, connected by an equality factor inside the gate. A factor inside a gate should not connect to the selector of the gate; it should be given the key value instead. In addition, gates should be balanced by ensuring that if a variable links to a factor in a gate with selector variable c, the variable also links to factors in gates keyed on all other values of the selector variable c. This can be achieved by connecting the variable to uniform factors in gates for any missing values of c. After balancing, each gate is part of a gate block, that is, a set of gates activated by different values of the same condition variable. See [12] for details.\n\nTable 1: Messages and evidence computations for EP and VMP. The top part of the table shows messages between a variable x_i and a factor f_a. The notation j ∈ a refers to all neighbors of the factor, j ≠ i is all neighbors except i, par is the parent factor of a derived variable, and ch is the child variable of a deterministic factor. The proj[p] operator returns an exponential-family distribution whose sufficient statistics match p. The bottom part of the table shows the evidence contributions for variables and factors in each algorithm.\n\nMessages:\nEP (any variable): variable to factor m_{i→a}(x_i) = ∏_{b≠a} m_{b→i}(x_i); factor to variable m_{a→i}(x_i) = proj[ ∑_{x_a\x_i} ( ∏_{j∈a} m_{j→a}(x_j) ) f_a(x_a) ] / m_{i→a}(x_i)\nVMP (stochastic variable): variable to factor m_{i→a}(x_i) = ∏_{b∋i} m_{b→i}(x_i); factor to variable m_{a→i}(x_i) = exp( ∑_{x_a\x_i} ( ∏_{j≠i} m_{j→a}(x_j) ) log f_a(x_a) )\nVMP (deterministic, to parent): variable to factor m_{i→a}(x_i) = ∏_{b≠a} m_{b→i}(x_i); factor to variable m_{a→i}(x_i) = exp( ∑_{x_a\(x_i,x_ch)} ( ∏_{k≠(i,ch)} m_{k→a}(x_k) ) log f̂_a(x_a) ), where f̂_a(x_a) = ∑_{x_ch} m_{ch→a}(x_ch) f_a(x_a)\nVMP (deterministic, to child): variable to factor m_{i→a}(x_i) = m_{par→i}(x_i); factor to variable m_{a→i}(x_i) = proj[ ∑_{x_a\x_i} ( ∏_{j≠i} m_{j→a}(x_j) ) f_a(x_a) ]\n\nEvidence:\nEP: s_i = ∑_{x_i} ∏_a m_{a→i}(x_i); s_a = [ ∑_{x_a} ( ∏_{j∈a} m_{j→a}(x_j) ) f_a(x_a) ] / [ ∑_{x_a} ∏_{j∈a} m_{j→a}(x_j) m_{a→j}(x_j) ]\nVMP: s_i = exp( −∑_{x_i} q(x_i) log q(x_i) ); s_a = exp( ∑_{x_a} ( ∏_{j∈a} m_{j→a}(x_j) ) log f_a(x_a) )\n\n4.1 Variational Message Passing with gates\n\nVMP can be augmented to run on a gate model in normalised form, by changing only the messages out of the gate and by introducing messages from the gate to the selector variable. Messages sent between nodes inside the gate and messages into the gate are unchanged from standard VMP. The variational distributions for variables inside gates are implicitly conditioned on the gate selector, as at the end of section 3. In the following, an individual gate is denoted g, its selector variable c and its key k_g. See [12] for the derivations.\n\nThe messages out of a gate are modified as follows:\n\n• The message from a factor f_a inside a gate g with selector c to a variable outside g is the usual VMP message, raised to the power m_{c→g}(c = k_g), except in the following case.\n\n• Where a variable x_i is the child of a number of deterministic factors inside a gate block G with selector variable c, the variable is treated as derived and the message is a moment-matched average of the individual VMP messages. 
Then the message to x_i is\n\nm_{G→i}(x_i) = proj[ ∑_{g∈G} m_{c→g}(c=k_g) m_{g→i}(x_i) ]   (8)\n\nwhere m_{g→i}(x_i) is the usual VMP message from the unique parent factor in g and proj is a moment-matching projection onto the exponential family.\n\nThe message from a gate g to its selector variable c is a product of evidence messages from the contained nodes:\n\nm_{g→c}(c=k_g) = ∏_{a∈g} s_a ∏_{i∈g} s_i,   m_{g→c}(c≠k_g) = 1   (9)\n\nwhere s_a and s_i are the VMP evidence messages from a factor and variable, respectively (Table 1). The set of contained factors includes any contained gates, which are treated as single factors by the containing gate. Deterministic variables and factors send evidence messages of 1, except where a deterministic factor f_a parents a variable x_i outside g. Instead of sending s_a = 1, the factor sends:\n\ns_a = exp( ∑_{x_i} m_{a→i}(x_i) log m_{i→a}(x_i) )   (10)\n\nThe child variable x_i outside the gate also has a different evidence message:\n\ns_i = exp( −∑_{x_i} m_{G→i}(x_i) log m_{i→a}(x_i) )   (11)\n\nwhere m_{G→i} is the message from the parents (8) and m_{i→a} is the message from x_i to any parent. To allow for nested gates, we must also define an evidence message for a gate:\n\ns_g = ( ∏_{a∈g} s_a ∏_{i∈g} s_i )^{q(c=k_g)}   (12)\n\n4.2 Expectation Propagation with gates\n\nAs with VMP, EP can support gate models in normalised form by making small modifications to the message-passing rules. Once again, messages between nodes inside a gate are unchanged. Recall that, following gate balancing, all gates are part of gate blocks. In the following, an individual gate is denoted g, its selector variable c and its key k_g. See [12] for the derivations.\n\nThe messages into a gate are as follows:\n\n• The message from a selector variable to each gate in a gate block G is the same. 
It is the product of all messages into the variable excluding messages from gates in G.\n\n• The message from a variable to each neighboring factor inside a gate block G is the same. It is the product of all messages into the variable excluding messages from any factor in G.\n\nLet nbrs(g) be the set of variables outside of g connected to some factor in g. Each gate computes an intermediate evidence-like quantity s_g defined as:\n\ns_g = ∏_{a∈g} s_a ∏_{i∈g} s_i ∏_{i∈nbrs(g)} s_{ig},   where s_{ig} = ∑_{x_i} m_{i→g}(x_i) m_{g→i}(x_i)   (13)\n\nwhere m_{g→i} is the usual EP message to x_i from its (unique) neighboring factor in g. The third term is used to cancel the denominators of s_a (see definition in Table 1). Given this quantity, the messages out of a gate may now be specified:\n\n• The combined message from all factors in a gate block G with selector variable c to a variable x_i is the weighted average of the messages sent by each factor:\n\nm_{G→i}(x_i) = proj[ ∑_{g∈G} m_{c→g}(c=k_g) s_g s_{ig}^{−1} m_{g→i}(x_i) m_{i→g}(x_i) ] / m_{i→g}(x_i)   (14)\n\n(Note that m_{i→g}(x_i) is the same for each gate g.)\n\n• The message from a gate block G to its selector variable c is:\n\nm_{G→c}(c=k_g) = s_g / ∑_{g∈G} s_g   (15)\n\nFinally, the evidence contribution of a gate block with selector c is:\n\ns_c = ( ∑_{g∈G} s_g ) / ( ∏_{i∈nbrs(g)} ∑_{x_i} m_{i→g}(x_i) m_{G→i}(x_i) )   (16)\n\n4.3 Gibbs sampling with gates\n\nGibbs sampling can easily extend to gates which contain only factors. Gates containing variables require a facility for computing the evidence of a submodel, which Gibbs sampling does not provide. Note also that Gibbs sampling does not support deterministic factors. Thus the graph should only be normalised up to these constraints. The algorithm starts by setting the variables to initial values and sending these values to their neighboring factors. Then for each variable x_i in turn:\n\n1. 
Query each neighboring factor for a conditional distribution for xi. If the factor is in a gate\nthat is currently off, replace with a uniform distribution. For a gate g with selector xi, the\nconditional distribution is proportional to s for the key value and 1 otherwise, where s is\nthe product of all factors in g.\n\n2. Multiply the distributions from neighboring factors together to get the variable\u2019s conditional\n\ndistribution. Sample a new value for the variable from its conditional distribution.\n\n5 Enlarging gates to increase approximation accuracy\n\nGates induce a structured approximation as in [9], so by moving nodes inside or outside of gates,\nyou can trade off inference accuracy versus cost. Because one gate of a gate block is always on, any\nnode (variable or factor) outside a gate block G can be equivalently placed inside each gate of G.\nThis increases accuracy since a separate set of messages will be maintained for each case, but it may\nincrease the cost.\n\nFor example, Archambeau and Verleysen [14] suggested a structured approximation for Student-t\nmixture models, instead of the factorised approximation of [13]. Their modi\ufb01cation can be viewed\nas a gate enlargement (\ufb01gure 3). By enlarging the gate block to include unm, the blurring between\nthe multiplication factor and unm is removed, increasing accuracy. 
This comes at no additional cost, since u_{nm} is only used by one gate and therefore only one message is needed per n and m.\n\nFigure 3: Student-t mixture model using gates. (a) Model from [13]. (b) Structured approximation suggested by [14], which can be interpreted as enlarging the gate.\n\n6 Discussion and conclusions\n\nGates have proven very useful to us when implementing a library for inference in graphical models. By using gates, the library allows mixtures of arbitrary sub-models, such as mixtures of factor analysers. Gates are also used for computing the evidence for a model, by placing the entire model in a gate with binary selector variable b. The log evidence is then the log-odds of b, that is, log P(b = true) − log P(b = false). Similarly, gates are used for model comparison by placing each model in a different gate of a gate block. The marginal over the selector gives the posterior distribution over models.\n\nGraphical models not only provide a visual way to represent a probabilistic model, but they can also be used as a data structure for performing inference on that model. We have shown that gates are similarly effective both as a graphical modelling notation and as a construct within an inference algorithm.\n\nReferences\n[1] B. Frey, F. Kschischang, H. Loeliger, and N. Wiberg. Factor graphs and algorithms. In Proc. of the 35th Allerton Conference on Communication, Control and Computing, 1998.\n[2] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proc. 
of the 12th conference on Uncertainty in Artificial Intelligence, pages 115–123, 1996.\n[3] D. McAllester, M. Collins, and F. Pereira. Case-factor diagrams for structured probabilistic modeling. In Uncertainty in Artificial Intelligence, 2004.\n[4] B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov. Approximate inference for infinite contingent Bayesian networks. In Proc. of the 6th workshop on Artificial Intelligence and Statistics, 2005.\n[5] E. Mjolsness. Labeled graph notations for graphical models: Extended report. Technical Report TR# 04-03, UCI ICS, March 2004.\n[6] W. L. Buntine. Operations for learning with graphical models. JAIR, 2:159–225, 1994.\n[7] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Anal. Machine Intell., 6:721–741, 1984.\n[8] E. S. Lander and D. Botstein. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121(1):185–199, 1989.\n[9] W. A. J. J. Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In UAI, pages 626–633, 2000.\n[10] J. Winn and C. M. Bishop. Variational Message Passing. JMLR, 6:661–694, 2005.\n[11] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369, 2001.\n[12] T. Minka and J. Winn. Gates: A graphical notation for mixture models. Technical report, Microsoft Research Ltd, 2008.\n[13] M. Svensén and C. M. Bishop. Robust Bayesian mixture modelling. Neurocomputing, 64:235–252, 2005.\n[14] C. Archambeau and M. Verleysen. Robust Bayesian clustering. Neural Networks, 20:129–138, 2007.", "award": [], "sourceid": 875, "authors": [{"given_name": "Tom", "family_name": "Minka", "institution": null}, {"given_name": "John", "family_name": "Winn", "institution": null}]}