## Graphical Models

In the previous tutorials, we've learned how to train a family of models, all of which relied on the same three-step procedure:

- define a model, and define a set of trainable/adjustable parameters
- choose a loss function, an objective to minimize
- pick an optimization algorithm, to minimize the loss function above, by modifying the trainable parameters
These three steps are really general, and can be found in many bodies of research: statistics, machine learning, computer vision, operations research, ... In this tutorial, we're going to explore a different class of models, which also relies on the procedure above: undirected graphical models, with pairwise and unary potentials. To describe graphical models, we use the gm package.

Although a bit out of scope in a deep-learning tutorial session, drawing the parallel between deep-learning models and graphical models is of interest. In fact, there are many tasks in signal processing that benefit from a structured prediction framework, in which a graphical model imposes a structure on the predictions, and the unary terms of the model come from some classifier. This applies to tasks like scene understanding, or OCR, or in general, tasks where there's a need for jointly segmenting and labeling entities.

The code associated with this tutorial is provided on GitHub, on this page.

## Defining a Model

In pairwise undirected graphical models, the joint probability of a particular assignment to all the variables x_i is represented as a normalized product of a set of non-negative potential functions:

$$ p(x_1,x_2,\dots,x_N) = \frac{1}{Z}\prod_{i=1}^{N}\phi_i(x_i)\prod_{e=1}^{E}\phi_e(x_{e_j},x_{e_k}) $$

There is one potential function ϕ_i for each node i, and one potential function ϕ_e for each edge e. Each edge connects two nodes, e_j and e_k. In a full graph, there is one edge between each possible pair of nodes. For common applications, such as computer vision, it's much more common to have locally connected graphs, such as lattices.

The normalization constant Z, or partition function, is a scalar value that forces the distribution to sum to one, over all possible joint configurations of the variables:

$$ Z = \sum_{x_1}\sum_{x_2}\dots\sum_{x_N}\prod_{i=1}^{N}\phi_i(x_i)\prod_{e=1}^{E}\phi_e(x_{e_j},x_{e_k}) $$

This normalizing constant ensures that the model defines a valid probability distribution.

Given the graph structure and the potentials, the following tasks are typically of interest:

- Decoding: finding the joint configuration of the variables with the highest probability.
- Inference: computing the normalization constant Z, as well as the probabilities of each variable taking each state.
- Training (or parameter estimation): the task of computing the potentials that maximize the likelihood of a set of data.
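To make these definitions concrete, here is a toy two-node, two-state model in plain Lua (no Torch or gm needed), with made-up potentials, where Z and a node marginal are computed by brute-force enumeration:

```lua
-- toy 2-node, 2-state model; the potentials are made up for illustration
phi1 = {2, 1}               -- unary potential of node 1
phi2 = {1, 3}               -- unary potential of node 2
phiE = {{4, 1}, {1, 4}}     -- Potts-like edge potential: favors agreement

-- partition function Z: sum the unnormalized score over all joint states
Z = 0
for x1 = 1, 2 do
  for x2 = 1, 2 do
    Z = Z + phi1[x1] * phi2[x2] * phiE[x1][x2]
  end
end

-- inference of one marginal: p(x1 = 1), summing out x2, then normalizing by Z
p = 0
for x2 = 1, 2 do
  p = p + phi1[1] * phi2[x2] * phiE[1][x2]
end
p = p / Z

print(Z)  -- 27
print(p)  -- 14/27, about 0.5185
```

Decoding would instead keep the single joint state with the highest score (here (2,2), with score 12). The exponential cost of these nested sums is exactly why approximate inference is needed for larger graphs.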
## A Simple Model

Here we first show how to describe a very simple model. The first thing to define is the adjacency matrix of the graph, which by itself fully describes the architecture of the model. The package provides a couple of standard matrices:

```lua
-- define graph
nNodes = 10
adjacency = gm.adjacency.full(nNodes)
```

Once the adjacency matrix is defined, we create a graph, using the gm.graph constructor:

```lua
nStates = 2
g = gm.graph{adjacency=adjacency, nStates=nStates, maxIter=10, verbose=true}
```

In this first example, we are not going to train the potentials, but assume that we already have them. We need to define two types of potentials: the ϕ_i potentials and the ϕ_e potentials. For each possible configuration, we need to provide a potential. Potentials are positively correlated with probabilities, so a high potential means a high probability.

```lua
-- unary potentials
nodePot = Tensor{{1,3}, {9,1}, {1,3}, {9,1}, {1,1},
                 {1,3}, {9,1}, {1,3}, {9,1}, {1,1}}

-- joint potentials: these are Potts potentials
edgePot = Tensor(g.nEdges,nStates,nStates)
basic = Tensor{{2,1}, {1,2}}
for e = 1,g.nEdges do
  edgePot[e] = basic
end

-- set potentials
g:setPotentials(nodePot,edgePot)
```

At this stage, we have a fully parametrized graphical model, in which we can do inference and decoding. But first, a few definitions:

- the
*decoding* task is to find the most likely configuration of the variables.
- the *inference* task is to find the normalizing constant Z, as well as all the marginal probabilities of individual nodes taking each state.
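Before running these tasks through gm, here is a sketch of the idea behind sum-product belief propagation, on a toy two-node model with made-up potentials, in plain Lua: a single message from node 2 to node 1 summarizes node 2's influence, and node 1's belief is its own potential times that message. On a tree like this one, the result is the exact marginal:

```lua
-- toy 2-node, 2-state model (illustrative potentials)
phi1 = {2, 1}
phi2 = {1, 3}
phiE = {{4, 1}, {1, 4}}

-- message from node 2 to node 1: m(x1) = sum over x2 of phi2(x2) * phiE(x1,x2)
m = {0, 0}
for x1 = 1, 2 do
  for x2 = 1, 2 do
    m[x1] = m[x1] + phi2[x2] * phiE[x1][x2]
  end
end

-- belief at node 1: proportional to phi1(x1) * m(x1), then normalized
b = { phi1[1] * m[1], phi1[2] * m[2] }
local s = b[1] + b[2]
b[1], b[2] = b[1] / s, b[2] / s

print(b[1], b[2])  -- 14/27 and 13/27: exact marginals, since this graph is a tree
```

On loopy graphs, messages are passed iteratively until they (hopefully) converge, and the resulting beliefs are only approximate.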
There are multiple methods to do inference and decoding. Exact methods involve an exhaustive search through all possible combinations of states:

```lua
optimal = g:decode('exact')
print('<gm.testme> maximum likelihood configuration:')
print(optimal)

local nodeBel,edgeBel,logZ = g:infer('exact')
print('<gm.testme> node beliefs:')
print(nodeBel)
print('<gm.testme> edge beliefs:')
print(edgeBel)
print('<gm.testme> log(Z):')
print(logZ)
```

Exact inference is of course intractable for any reasonably-sized model. A very well known approximate inference technique is belief propagation. Here is the same code, using belief propagation:

```lua
optimal = g:decode('bp')
print('<gm.testme> optimal config with belief propagation:')
print(optimal)

local nodeBel,edgeBel,logZ = g:infer('bp')
print('<gm.testme> node beliefs:')
print(nodeBel)
print('<gm.testme> edge beliefs:')
print(edgeBel)
print('<gm.testme> log(Z):')
print(logZ)
```

So that's basically all there is to it: describing arbitrary graphs, thanks to the adjacency matrix; doing inference; and decoding the optimal joint configuration of states.

## Conditional Random Field for Images

Let's now move on to a slightly more complex scenario: instead of defining the potential functions by hand, we're now going to train a set of parameters to produce them.

In this example, we will try to denoise an image which was corrupted by some Gaussian noise. In this case, we have input variables (the observed, noisy pixels) and output variables (the clean labels). This is a typical scenario in which we can use conditional random fields (CRFs). In CRFs, we have two types of variables: (i) the observed variables, which are always conditioned upon, and (ii) the output variables, whose joint distribution we model. I invite you to take a look at this great tutorial on CRFs, for a more formal introduction.

In our case, the observed variables are the noisy pixels, and the output variables are the binary labels of the clean image.

The first step, as before, is to define the graph structure.
We use a 4-connectivity lattice, which is rather typical in the image world:

```lua
-- assuming images of geometry nRows x nCols:
nRows = 32
nCols = 32
nNodes = nRows*nCols

-- define adjacency matrix (4-connectivity lattice)
local adj = gm.adjacency.lattice2d(nRows,nCols,4)

-- create graph
nStates = 2
g = gm.graph{adjacency=adj, nStates=nStates, verbose=true, type='crf', maxIter=10}
```

Note the type='crf' argument: it declares the graph as a conditional random field, whose potentials will be parametrized and trained.

## Training our Model (CRF)

We then need to generate some training data:

```lua
-- load a clean, black and white image of an X
sample = image.load('some_32x32_image.png')

-- how many instances to generate in the training data:
nInstances = 100

-- generate some labels (y), which correspond to the MAP solution to the CRF:
y = tensor(nInstances,nRows*nCols)
for i = 1,nInstances do
  y[i] = sample
end
y = y + 1 -- Lua is 1-based

-- generate an ensemble of noisy versions of that image:
X = tensor(nInstances,1,nRows*nCols)
for i = 1,nInstances do
  X[i] = sample
end
X = X + randn(X:size())/2
```

Here are some examples of the noisy input samples:

Once we have training data, we need to generate all the features:

```lua
-- create node features (normalized X and a bias)
Xnode = tensor(nInstances,2,nNodes)
Xnode[{ {},1 }] = 1 -- bias

-- normalize features:
nFeatures = X:size(2)
for f = 1,nFeatures do
  local Xf = X[{ {},f }]
  local mu = Xf:mean()
  local sigma = Xf:std()
  Xf:add(-mu):div(sigma)
end
Xnode[{ {},2 }] = X -- features (simple normalized grayscale)
nNodeFeatures = Xnode:size(2)

-- create edge features
nEdges = g.edgeEnds:size(1)
nEdgeFeatures = nNodeFeatures*2-1 -- sharing the bias, but not the grayscale features
Xedge = zeros(nInstances,nEdgeFeatures,nEdges)
for i = 1,nInstances do
  for e = 1,nEdges do
    local n1 = g.edgeEnds[e][1]
    local n2 = g.edgeEnds[e][2]
    for f = 1,nNodeFeatures do
      -- get all features from node1
      Xedge[i][f][e] = Xnode[i][f][n1]
    end
    for f = 1,nNodeFeatures-1 do
      -- get all features from node2, except the bias (which is shared)
      Xedge[i][nNodeFeatures+f][e] = Xnode[i][f+1][n2]
    end
  end
end
```

The last step before training is to decide what the trainable
parameters are going to be. To do so, we create a map that associates trainable parameters (weights) with nodes and edges in the graph. Here is the procedure:

```lua
-- tie node potentials to the parameter vector
nodeMap = zeros(nNodes,nStates,nNodeFeatures)
for f = 1,nNodeFeatures do
  nodeMap[{ {},1,f }] = f
end

-- tie edge potentials to the parameter vector
local f = nodeMap:max()
edgeMap = zeros(nEdges,nStates,nStates,nEdgeFeatures)
for ef = 1,nEdgeFeatures do
  edgeMap[{ {},1,1,ef }] = f+ef
  edgeMap[{ {},2,2,ef }] = f+ef
end

-- initialize parameters
g:initParameters(nodeMap,edgeMap)
```

And voilà! We can now train the CRF using arbitrary optimization techniques, as we did with the MLPs, ConvNets, and autoencoders. In this case, we minimize the negative log-likelihood in order to maximize the likelihood. The negative log-likelihood and its gradient are computed by the graph's nll method.

With log-linear CRFs, the negative log-likelihood is still differentiable and jointly convex in the node and edge parameters, so we could compute its optimal value using a batch algorithm, such as L-BFGS. As usual though, if the training data is quite large, starting with an epoch or two of SGD can be significantly faster; switching to simple averaged SGD afterwards might then be an option, as shown by Leon Bottou.
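To see what this log-linear parametrization amounts to, here is a sketch in plain Lua, with a hypothetical (untrained) weight vector: a node's potential for state 1 is the exponential of the dot product between the tied weights and that node's features, while state 2, which is left untied in the map above, acts as a reference with potential exp(0) = 1:

```lua
-- hypothetical weights and features for a single node (made up, not trained);
-- two node features, as in this tutorial: a bias and a normalized grayscale value
w = {0.5, -1.2}
feat = {1, 0.7}

-- log-linear node potential: state 1 gets exp(w . feat), state 2 is the
-- untied reference state, with potential exp(0) = 1
score = 0
for f = 1, #w do
  score = score + w[f] * feat[f]
end
nodePot = { math.exp(score), math.exp(0) }

-- normalizing the two potentials gives this node's unary probability of state 1
p1 = nodePot[1] / (nodePot[1] + nodePot[2])
```

Training adjusts w so that these potentials, combined with the edge potentials, make the clean labels the most likely joint configuration.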
```lua
-- train with SGD, drawing random training samples
sgdconf = {learningRate=1e-3}
for iter = 1,100 do
  local i = floor(uniform(1,nInstances)+0.5)
  local feval = function()
    return g:nll(Xnode[i],Xedge[i],y[i],'bp')
  end
  optim.sgd(feval,g.w,sgdconf)
end
```

Note/Exercise: the trainable parameters are available in g.w.

The results of the training procedure are shown here:

```
t7> gm.examples.trainCRF()
SGD @ iteration 1: objective = 709.78271289324
SGD @ iteration 2: objective = 86.476069966592
SGD @ iteration 3: objective = 22.048118240666
SGD @ iteration 4: objective = 10.34447115497
SGD @ iteration 5: objective = 12.507434214913
SGD @ iteration 6: objective = -0.24103941715157
SGD @ iteration 7: objective = 13.513625673658
SGD @ iteration 8: objective = 23.680651547275
SGD @ iteration 9: objective = 13.759790868401
SGD @ iteration 10: objective = 24.419772339165
SGD @ iteration 11: objective = 19.331090159347
...
```

This is the result of inference (the node beliefs), and this is the result of decoding (the most likely joint configuration):

tutorial_graphicalmodels.txt · Last modified: 2012/10/02 13:28 (external edit)
