12.1 Optimization and Sampling in Probabilistic Models
12.1.1 Drawbacks of Markov Chain Monte Carlo sampler
From the lecture, the instructor has two complaints about the Markov Chain Monte Carlo sampler: (1) the random variables are fake (pseudo-random), and (2) the sample data are correlated. The sampling process is also demanding because the sampler needs a "burn-in" period before it produces usable samples. This "burn-in" hint is shown by the edX system when you do not select the right multiple-choice answers -- whatever "burn-in" means.
12.1.2 Markov Chain Monte Carlo method
Does a Monte Carlo method always reach the same results for two different approaches to a problem? The hint of the question says the two Monte Carlo random-variable configurations may be very different. Therefore, my guess was NOT the same results, and my answer is right.
12.1.3 Can a quantum sampler generate correlated data?
I think quantum researchers are managing to use quantum control to correlate data. So the sample data points are normally not correlated -- and that is the answer to this quiz question -- but quantum control can generate correlated data. Well, I always wonder what a quantum sampler is. Does it generate sample data? From today's lecture, yes, the quantum sampler generates data -- but why do we need a quantum or digital computer to generate samples instead of getting data from the actual environment? Also, I really thought quantum computers are used to run quantum algorithms, not to generate samples, so why is it a big deal to call it a "quantum sampler"? (However, from this QML course's topic on kernel methods, I understand that quantum data preparation is a big deal -- the effort of figuring out how the quantum circuit is built may be more than that of the algorithm processing!) Also, there is a big reason to have quantum correlation, especially for our "quantum intelligence" on a bosonic quantum sampler.
12.3 Assignment of 12_Training_Probabilistic_Graphical_Models
12.3.1 Probabilistic Graphical Models
Probabilistic graphical models capture a compact representation of a joint probability distribution through conditional independence: random variable X is conditionally independent of Y given Z (X⊥Y|Z) if P(X=x, Y=y|Z=z) = P(X=x|Z=z)P(Y=y|Z=z) for all x∈X, y∈Y, z∈Z.
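To make the definition concrete, here is a toy check of my own (not part of the assignment): build a joint distribution of three binary variables as P(Z)P(X|Z)P(Y|Z), so the identity holds by construction, and verify it numerically.
import numpy as np

# Toy joint distribution P(X, Y, Z) = P(Z) * P(X|Z) * P(Y|Z), so X ⊥ Y | Z by construction
p_z = np.array([0.4, 0.6])                 # P(Z)
p_x_given_z = np.array([[0.8, 0.2],        # P(X|Z=0)
                        [0.3, 0.7]])       # P(X|Z=1)
p_y_given_z = np.array([[0.5, 0.5],        # P(Y|Z=0)
                        [0.9, 0.1]])       # P(Y|Z=1)

joint = np.zeros((2, 2, 2))                # indexed as [x, y, z]
for z in range(2):
    for x in range(2):
        for y in range(2):
            joint[x, y, z] = p_z[z] * p_x_given_z[z, x] * p_y_given_z[z, y]

# Check P(X=x, Y=y | Z=z) == P(X=x | Z=z) * P(Y=y | Z=z) for every x, y, z
for z in range(2):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    p_x = p_xy_given_z.sum(axis=1)
    p_y = p_xy_given_z.sum(axis=0)
    assert np.allclose(p_xy_given_z, np.outer(p_x, p_y))
print("X is conditionally independent of Y given Z")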
The graph can be directed -- these are called Bayesian networks in general -- or undirected, in the case of Markov networks (also known as Markov random fields). Graphical models are quintessentially generative: we explicitly model a probability distribution. Thus, generating new samples is trivial and we can always introduce extra random variables to ensure certain properties. These models also take us a step closer to explainability, either by the use of the random variables directly for explanations (if your model is such) or by introducing explanatory random variables that correlate with the others.
In a Markov random field, we can allow cycles in the graph and switch from local normalization (a conditional probability distribution at each node) to global normalization of probabilities (i.e. a partition function). Examples include countless applications in computer vision, pattern recognition, and artificial intelligence, but also the Ising models that we have seen before: the factors are defined as first- or second-degree monomials of the random variables connected in the graph. The factorization of the joint probability distribution is given as P(X1, …, XN) = (1/Z) exp(−∑k E[Ck]), where the Ck are the cliques of the graph and E[.] is an energy defined over the cliques. If P is a Boltzmann distribution over G, all local Markov properties hold; the converse also holds if P is a positive distribution.
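As a minimal illustration of this factorization (a toy example of my own, with an arbitrary sign convention for the energy), the Boltzmann distribution of a three-spin Ising chain can be evaluated by brute force:
import itertools
import numpy as np

h_toy = {0: 1, 1: 1, 2: 1}            # external fields (degree-1 terms)
J_toy = {(0, 1): -1, (1, 2): -1}      # couplings (degree-2 terms)

def toy_energy(spins):
    # Ising energy E(x) = sum_i h_i x_i + sum_ij J_ij x_i x_j
    return (sum(h_toy[i] * s for i, s in enumerate(spins)) +
            sum(J_toy[(i, j)] * spins[i] * spins[j] for (i, j) in J_toy))

configs = list(itertools.product([-1, 1], repeat=3))
boltzmann = np.array([np.exp(-toy_energy(c)) for c in configs])
Z_toy = boltzmann.sum()               # partition function (global normalization)
for c, p in zip(configs, boltzmann / Z_toy):
    print(c, round(p, 4))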
Let us define a Markov field of binary variables. This will be an Ising model over random variables.
12.3.2 The dimod sampler
Since I never knew how a quantum sampler generates data points, here is an example:
Define a Markov random field of four binary random variables in dimod. Random variables X1 and X3 are conditionally independent given X2. The random variable X4 is independent of all the other variables. The coupling strength on all edges in the graph is -1. Apart from the coupling between nodes, we also consider an external field of strength 1 applied to all nodes. Use dimod's BinaryQuadraticModel as our model. Here is the data-generating process:
import matplotlib.pyplot as plt
import numpy as np
import dimod

n_spins = 4
h = {v: 1 for v in range(n_spins)}   # external field of strength 1 on every node
J = {(0, 1): -1,                     # edge between variables 0 and 1 (X1 and X2)
     (1, 2): -1}                     # edge between variables 1 and 2 (X2 and X3); variable 3 (X4) stays isolated
model = dimod.BinaryQuadraticModel(h, J, 0.0, dimod.BINARY)
print([i for i in model.linear])
print([i for i in model.linear.values()])
print([i for i in model.quadratic])
print([i for i in model.quadratic.values()])
sampler = dimod.SimulatedAnnealingSampler()
So far, I have put the h and J of the Ising model in and built the sampler. (So is it me who generates the data, not the sampler?!)
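To see the sampler actually generate the data points, here is a small follow-up of my own (not part of the assignment): ask the sampler for a few reads and print them.
# The sampler only generates data when we call sample(); each returned
# record is one sampled configuration of the four variables.
response = sampler.sample(model, num_reads=10)
for sample, energy in response.data(['sample', 'energy']):
    print(sample, energy)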
12.3.3 Networkx Graph
You can also build a networkx graph to see what the four nodes and their edges look like and how the D-Wave Chimera chip handles them.
import networkx
G = networkx.Graph()
G.add_nodes_from([0, 1, 2, 3])
G.add_edges_from([(0, 1), (1, 2)])
print (list(G.nodes))
print (list(G.edges))
networkx.draw(G)
Embed the network graph in a single D-Wave Chimera cell using minorminer. This is a very simple Markov network that does not need multiple physical qubits to represent a logical qubit. Note that the independent random variable X4 does not appear in the embedding.
import dwave_networkx
import minorminer

connectivity_structure = dwave_networkx.chimera_graph(1, 1)
embedded_graph = minorminer.find_embedding(G.edges(), connectivity_structure.edges())
dwave_networkx.draw_chimera_embedding(connectivity_structure, embedded_graph)
12.3.4 Normalizing coefficient (partition function)
Now, the network above does not seem to affect how the following code runs; it estimates the normalizing coefficient Z of the Boltzmann probabilities at temperature T=1 from 100 samples (see the next code snippet).
In this case, the conditional independences are already encapsulated by the model: for instance, spins 0 and 2 do not interact. In general, it is hard to learn the structure of a probabilistic graphical model given a set of observed correlations in the sample S. We can only rely on heuristics: the typical way is to define a scoring function and do some heuristic global optimization.
Once we have identified or defined the graph structure G, we must learn the probabilities in the graph. We again rely on our sample and its correlations, and use a maximum likelihood or a maximum a posteriori estimate of the corresponding parameters θG with the likelihood P(S|θG). This is again a hard problem.
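To make the maximum-likelihood step concrete, here is a minimal sketch of my own (a two-spin toy model with brute-force expectations, not the course's code) of the familiar gradient "data correlation minus model correlation":
import itertools
import numpy as np

# With the convention E(x) = -coupling * x0 * x1, the gradient of the average
# log-likelihood is d/d(coupling) log P(data) = <x0*x1>_data - <x0*x1>_model.
data = np.array([[1, 1], [1, 1], [-1, -1], [1, -1]])   # four observed samples
configs = np.array(list(itertools.product([-1, 1], repeat=2)))

coupling = 0.0
learning_rate = 0.5
for step in range(200):
    # Negative phase: model expectation of x0*x1 under P(x) ∝ exp(coupling * x0 * x1)
    weights = np.exp(coupling * configs[:, 0] * configs[:, 1])
    model_corr = np.sum(weights * configs[:, 0] * configs[:, 1]) / weights.sum()
    # Positive phase: empirical expectation of x0*x1
    data_corr = np.mean(data[:, 0] * data[:, 1])
    coupling += learning_rate * (data_corr - model_corr)
print("learned coupling:", coupling)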
Applying the learned model means probabilistic inference to answer queries of the following types:
· Conditional probability: P(Y|E=e) = P(Y,e)/P(e).
· Maximum a posteriori: argmax_y P(y|e) = argmax_y ∑_Z P(y, Z|e).
This problem is in #P. Contrast this to deep learning: once the neural network is trained, running a prediction on it is relatively cheap. In the case of probabilistic graphical models, inference remains computationally demanding even after training the model. Instead of solving the inference problem directly, we use approximate inference with sampling, which is primarily done with Monte Carlo methods on a classical computer. These have their own problems of slow burn-in time and correlated samples, and this is exactly the step we can replace by sampling on a quantum computer.
For instance, let us do a maximum a posteriori inference on our Ising model. We clamp the first spin to -1 and run simulated annealing for the rest of them to find the optimal configuration. We modify the simulated annealing routine in dimod to account for the clamping. See the second code snippet below (not the one immediately following, but the one after it).
temperature = 1
response = sampler.sample(model, beta_range=[1/temperature, 1/temperature], num_reads=100)

g = {}  # dictionary that associates to each energy E the degeneracy g[E]
for solution in response.aggregate().data():
    if solution.energy in g.keys():
        g[solution.energy] += 1
    else:
        g[solution.energy] = 1
print("Degeneracy", g)

probabilities = np.array([g[E] * np.exp(-E/temperature) for E in g.keys()])
Z = probabilities.sum()
print("normalizing coefficient", Z)
probabilities /= Z

fig, ax = plt.subplots()
ax.plot([E for E in g.keys()], probabilities, linewidth=3)
ax.set_xlim(min(g.keys()), max(g.keys()))
ax.set_xticks([])
ax.set_yticks([])
ax.set_xlabel('Energy')
ax.set_ylabel('Probability')
plt.show()
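As a cross-check of my own (not part of the assignment), the exact normalizing coefficient of this tiny model can be computed by enumerating all 2^4 configurations with model.energy and compared with the sampled estimate above:
import itertools

exact_Z = 0.0
for assignment in itertools.product([0, 1], repeat=n_spins):   # BINARY variables take values 0/1
    sample = {v: assignment[v] for v in range(n_spins)}
    exact_Z += np.exp(-model.energy(sample) / temperature)
print("exact normalizing coefficient", exact_Z)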
And here is the modified simulated annealing routine that clamps the first spin to -1:
from dimod.reference.samplers.simulated_annealing import greedy_coloring

clamped_spins = {0: -1}
num_sweeps = 1000
βs = [1.0 - 1.0*i / (num_sweeps - 1.) for i in range(num_sweeps)]

# Set up the adjacency matrix.
adj = {n: set() for n in h}
for n0, n1 in J:
    adj[n0].add(n1)
    adj[n1].add(n0)

# Use a vertex coloring for the graph and update the nodes by color
__, colors = greedy_coloring(adj)

spins = {v: np.random.choice((-1, 1)) if v not in clamped_spins else clamped_spins[v]
         for v in h}
for β in βs:
    energy_diff_h = {v: -2 * spins[v] * h[v] for v in h}

    # for each color, do updates
    for color in colors:
        nodes = colors[color]
        energy_diff_J = {}
        for v0 in nodes:
            ediff = 0
            for v1 in adj[v0]:
                if (v0, v1) in J:
                    ediff += spins[v0] * spins[v1] * J[(v0, v1)]
                if (v1, v0) in J:
                    ediff += spins[v0] * spins[v1] * J[(v1, v0)]
            energy_diff_J[v0] = -2. * ediff

        for v in filter(lambda x: x not in clamped_spins, nodes):
            logp = np.log(np.random.uniform(0, 1))
            if logp < -1. * β * (energy_diff_h[v] + energy_diff_J[v]):
                spins[v] *= -1
The dictionary spins at the end of the code holds the most likely configuration.
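As a sanity check of my own (feasible only because the model is tiny), we can enumerate every configuration with spin 0 clamped to -1 and confirm that the lowest-energy one matches what the modified simulated annealing found:
import itertools

def ising_energy(s):
    # same energy convention as the annealing routine above
    return (sum(h[v] * s[v] for v in h) +
            sum(J[(u, v)] * s[u] * s[v] for (u, v) in J))

best = min(
    ({0: -1, 1: a, 2: b, 3: c} for a, b, c in itertools.product([-1, 1], repeat=3)),
    key=ising_energy)
print("optimal clamped configuration:", best, "energy:", ising_energy(best))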
12.3.5 Boltzmann Machines
Sampling the negative phase of the Boltzmann machine log-likelihood gradient is not a trivial topic. Here is the TA's advice:
You can read more in books on deep learning, for instance Deep Learning by Goodfellow et al.
The positive phase can be interpreted as pushing down on the energy of training examples, and the negative phase as pushing up on the energy of samples drawn from the model. For most undirected models, in particular Boltzmann machines, the positive phase is tractable, but the negative phase is not.
There are Monte Carlo methods for approximately maximizing the likelihood of models with an intractable partition function. The most frequently used algorithm is Persistent Contrastive Divergence. It is well explained in [this] video 1.
You can also read about an implementation on D-Wave here.
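To make the negative phase concrete, here is a minimal sketch of my own of Persistent Contrastive Divergence for a tiny binary restricted Boltzmann machine (the standard textbook recipe, not the course's or D-Wave's reference implementation):
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, n_chains = 4, 3, 8
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)          # visible biases
b_h = np.zeros(n_hidden)           # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

data = rng.integers(0, 2, size=(20, n_visible)).astype(float)       # made-up toy data set
fantasy_v = rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)

learning_rate = 0.1
for epoch in range(100):
    # Positive phase: hidden activations driven by the data
    p_h_data = sigmoid(data @ W + b_h)

    # Negative phase: advance the persistent Gibbs chains by one full step;
    # the chains survive across parameter updates, which is the "persistent" part
    p_h_fantasy = sigmoid(fantasy_v @ W + b_h)
    h_sample = (rng.random(p_h_fantasy.shape) < p_h_fantasy).astype(float)
    p_v_fantasy = sigmoid(h_sample @ W.T + b_v)
    fantasy_v = (rng.random(p_v_fantasy.shape) < p_v_fantasy).astype(float)
    p_h_fantasy = sigmoid(fantasy_v @ W + b_h)

    # Gradient: <v h>_data - <v h>_model (positive phase minus negative phase)
    W += learning_rate * (data.T @ p_h_data / len(data)
                          - fantasy_v.T @ p_h_fantasy / n_chains)
    b_v += learning_rate * (data.mean(axis=0) - fantasy_v.mean(axis=0))
    b_h += learning_rate * (p_h_data.mean(axis=0) - p_h_fantasy.mean(axis=0))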
12.4 Quiz of Module 3
12.4.1 Encoding the Traveling Salesman Problem
What is the most natural encoding for the traveling salesman problem (TSP)? Basis encoding, since a road connecting two cities is a binary choice? Amplitude encoding, since the distance between two cities can be treated as a probability? Or Hamiltonian encoding, since an Ising model can encode the minimum cost of traveling between two cities? The answer is Hamiltonian encoding. I was under the wrong impression that "most natural" means "most straightforward", as in the section 9.3.3 quiz problem, where basis encoding is the first step and "most straightforward" is the answer, which is then improved to amplitude encoding and then to Hamiltonian encoding. In fact, that is just the way lecturer Peter leads students from shallow to deep understanding, which is good teaching. Although the table in section 9.2 says the various encoding mechanisms have their own advantages and drawbacks, it does demonstrate progressive improvements (from basis to amplitude to Hamiltonian) in problem-solving capability. So whether it is "natural" or "straightforward" is not the point. The point is that the TSP is characterized by (1) being NP-hard and (2) having an optimal cost (shortest route) as the goal. Only Hamiltonian encoding involves both features.
12.4.2 Most efficient circuit to prepare state |111>
· Quantum Circuits. I guess I am totally dumb in quantum circuits. I suppose a “Quantum Engineering Institute” would have a hardware department and a software department. The hardware department would have at least three courses, on quantum mechanics, quantum computing, and quantum circuits (similar to the EE courses on electromagnetics, electronics, and circuits). The software department would have at least three courses, on quantum information processing, quantum machine learning, and quantum emergence / cognition. A textbook for the quantum circuits course should cover quantum states, quantum operators, quantum circuit rules, state preparation with quantum circuits, and more complicated quantum circuits.
· Ket Reversing. A Pauli-X gate (i.e., a NOT gate) flips |0> to |1> (is this correct?). So three qubits, each in state |0>, can be flipped to get |111> (see the short sketch after this list).
· Alternative Circuit Design. A state can be prepared in multiple ways, some more efficient, some more controllable (NOT just a single way!). I guess this quiz problem shows that efficiency is more important than controllability: the circuit on the right of the accompanying diagram, which prepares |111>, is the better design.
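As for the ket-reversing bullet above, here is a minimal sketch of my own (assuming Qiskit and its quantum_info module, which I believe the course notebooks also use) that prepares |111> with three X gates:
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

circuit = QuantumCircuit(3)
circuit.x([0, 1, 2])                           # three X gates: |000> -> |111>
print(Statevector.from_instruction(circuit))   # all amplitude sits on |111>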
12.4.3 Hamiltonian Encoding for Non-Hermitian matrix A
I still don't know what this quiz question is about: “whether you can use Hamiltonian encoding for a non-Hermitian matrix A”. If yes, how?
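One standard trick I am aware of (my own note, not necessarily the answer the quiz intends): a non-Hermitian A can be embedded in the larger block matrix [[0, A], [A†, 0]], which is Hermitian and can therefore serve as a Hamiltonian. A quick numerical check:
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0j]])                       # clearly non-Hermitian
dilation = np.block([[np.zeros_like(A), A],
                     [A.conj().T, np.zeros_like(A)]])
print(np.allclose(dilation, dilation.conj().T))   # True: the block matrix is Hermitian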
First, how is this non-Hermitian matrix A involved in Hamiltonian encoding? For instance, in the Ising model, could J and h be like matrix A? Or are we trying to calculate a huge matrix A or its determinant after the Hamiltonian model is formed? Or is matrix A a quantum state to be prepared by Hamiltonian encoding? If this is a state-preparation
optimize w and possibly add a regularization term to do better.
12.4.7 Small data set and large number of weak predictors (classifiers)
You optimize an objective that heavily penalizes the inclusion of weak predictors. Never vote with all weak learners included: since every predictor is weak, you do not want to listen to all of their voices.
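Here is a sketch of my own of the kind of objective meant here, a QBoost-style QUBO: binary inclusion weights for the weak classifiers, a squared loss against the labels, and a penalty lam * sum_k w_k that discourages including predictors. The data and lam are made up, and the toy instance is solved exactly because it is tiny.
import numpy as np
import dimod

predictions = np.array([[ 1,  1, -1,  1],     # weak classifier 0 on 4 samples
                        [ 1, -1, -1,  1],     # weak classifier 1
                        [-1,  1,  1, -1]])    # weak classifier 2
labels = np.array([1, 1, -1, 1])
K = len(predictions)
lam = 0.5                                     # strength of the inclusion penalty

# Expand sum_i ((1/K) sum_k w_k h_k(x_i) - y_i)^2 + lam * sum_k w_k into a QUBO
linear = {}
quadratic = {}
for k in range(K):
    linear[k] = (np.sum(predictions[k]**2) / K**2
                 - 2 * np.dot(predictions[k], labels) / K
                 + lam)
    for l in range(k + 1, K):
        quadratic[(k, l)] = 2 * np.dot(predictions[k], predictions[l]) / K**2

qubo = dimod.BinaryQuadraticModel(linear, quadratic, 0.0, dimod.BINARY)
best = dimod.ExactSolver().sample(qubo).first.sample
print("included weak classifiers:", [k for k, w in best.items() if w == 1])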
12.4.8 Correlated predictors
12.4.15 Interference circuit of kernel method
The interference circuit for calculating a kernel relies on a single operation, a Hadamard gate. The steps before it are just state preparation, and the steps afterwards are just measurement and post-processing. This matters because the number of gates we can use on contemporary quantum computers is limited, and this gives a recipe for a shallow circuit. I wrongly selected the answer "The Hadamard gate creates entanglement, so we use quantum correlations in generating the output", thinking entanglement must have something to do with the importance. However, I guess a Hadamard gate acts on a single qubit, so by itself it cannot create entanglement!
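Here is a small numpy sketch of my own of why a single Hadamard on the ancilla is enough: starting from (|0⟩|a⟩ + |1⟩|b⟩)/√2, interfering the ancilla gives P(ancilla = 0) = (1 + Re⟨a|b⟩)/2, which is the kernel value up to rescaling.
import numpy as np

a = np.array([1.0, 0.0])                       # data state |a>
b = np.array([1.0, 1.0]) / np.sqrt(2)          # data state |b>

# Full state with the ancilla as the first (leftmost) qubit
state = (np.kron([1, 0], a) + np.kron([0, 1], b)) / np.sqrt(2)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard on the ancilla only
state = np.kron(H, np.eye(2)) @ state

p_ancilla_0 = np.sum(np.abs(state[:2])**2)     # ancilla = 0 is the first half of the vector
print(p_ancilla_0, (1 + np.dot(a, b)) / 2)     # both print the same overlap-based value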
12.4.16 Second Measurement for interference circuit
The interference circuit relies on post-selection, which is a common and important pattern in quantum computing. We only keep the output and perform a second measurement if we first measured the outcome 0 on the ancilla. Here is the reason: after interference, the part of the superposition corresponding to |0⟩ on the ancilla contains the state we are looking for. By measuring the ancilla, we get rid of the part of the superposition we are not interested in.
12.4.17 Conditional independence and Uncorrelatedness of Markov network variables
It is interesting that we say x2 and x3 are conditionally independent given x1 in the following diagram, rather than saying x2 and x3 are unrelated. I guess it is important to always keep track of the conditioning part of a conditional probability.
12.4.19 Hidden nodes of Boltzmann Machine
In a Boltzmann machine, we marginalize over the hidden nodes to reproduce a probability distribution: P(v)=∑hP(v,h). What we achieve is that the hidden nodes can encode long-range correlations between the visible nodes.
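A toy sketch of my own of this marginalization, with one visible and one hidden binary unit and made-up parameters:
import itertools
import numpy as np

w, bias_v, bias_h = 1.5, -0.5, 0.2            # arbitrary toy parameters

def bm_energy(v, hid):
    return -(w * v * hid + bias_v * v + bias_h * hid)

Z_bm = sum(np.exp(-bm_energy(v, hid)) for v, hid in itertools.product([0, 1], repeat=2))
for v in [0, 1]:
    p_v = sum(np.exp(-bm_energy(v, hid)) for hid in [0, 1]) / Z_bm   # P(v) = sum_h P(v, h)
    print("P(v =", v, ") =", round(p_v, 4))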
12.4.20 Value assignment of a simple Markov network random variables
In the following network, x1 has an observed value of 1, and x2 and x3 are binary. The possible assignments to (x2, x3) are {(0,0), (0,1), (1,0), (1,1)}, i.e. four ways to do the assignment in total. From the diagram, x2 depends on x1 and x3 depends on x1. However, we do not know what the dependences are. Nothing says that the two dependences are of the same kind (i.e. both repulsive, giving (1,1), or both attractive, giving (0,0)), which would suggest only 2 assignments; and nothing says that the dependence must be attraction (0,0), which would suggest only 1. These are pitfalls a student can fall into.
12.4.22 Thermalization of Markov Network
For a Markov network of 5 variables, how many qubits are needed to do thermalization sampling? The hint of this quiz problem says to use the ancilla qubits of topic 8 (thermalization). First, it is unwise to enumerate 2^5 = 32 assignments and hence to require 32 qubits. Since we have no idea how these 5 variables are correlated through conditional probabilities "locally" in the classical case, and hence no way to know the number of cliques (i.e. the number of data qubits or spins), I wonder if ancilla qubits are the way to make the Markov network (topic 12) quantum method "global". Maybe the introduced extra ancilla qubits can make the data qubits conditionally depend on them? However, how many extra ancilla qubits do we need? Here are the wrong guesses rejected by edX and their reasoning:
· 1 qubit in total (i.e. just the ancilla), since topic 8.3.2 says one qubit is sufficient to do state preparation;
· 2 qubits in total, since just like the Boltzmann machine case we could use 2 ancilla qubits (vertical and horizontal) so that there are only 2 cliques for the entire 5-variable Markov network;
· 3 qubits in total: one ancilla plus two cliques of data;
· 4 qubits in total: two cliques, each needing one ancilla plus one data qubit;
· 5 qubits in total, since we only need 5 data points, already entailing the ancilla qubits;
· 6 qubits in total: 5 data qubits plus 1 ancilla;
· 7 qubits in total: 5 data qubits plus 2 ancillas as the vertices of the vertical and horizontal cliques;
· 8 qubits in total (= 5 + 1 + 2), since in the assignment of topic 8 the initial circuit has n_spins = 5, the later variational circuit has n_qubits = 1, and n_system = n_qubits * 2 = 2;
· 9 qubits in total: the above 8 qubits plus one ancilla.
Now, edX responds to my final trial that 10 is the right answer: the initial circuit attached to the variational circuits gives 8 qubits (= 5 initial data + variational 1 classical + 2 quantum), plus 2 ancillas (i.e. 2 cliques) = 10. (Or is it 5 data + 5 data = 10, which is very unlikely?)