PHYS771 Lecture 15: Computational Learning
Scribe: Chris Granade
Last lecture, you were given Hume's Problem of Induction as a homework assignment.
Puzzle. If you observe 500 black ravens, what basis do you have for supposing that the next one you observe will also be black?
Many people's answer would be to apply Bayes' Theorem. For this to work, though, we'd need to make some assumption, such as that all the ravens are drawn from the same distribution. If we don't assume that the future resembles the past at all, then it's very difficult to get anything done. This kind of problem has led to lots of philosophical arguments like the following.
Suppose you see a bunch of emeralds, all of which are green. This would seem to lend support to the hypothesis that all emeralds are green. But then, define the word grue to mean “green before 2050 and blue afterwards.” Then, the evidence equally well supports the hypothesis that all emeralds are grue, not green. This is known as the grue paradox.
If you want to delve even “deeper,” then consider the “gavagai” paradox. Suppose that you’re trying to learn a language, and you’re an anthropologist visiting an Amazon tribe speaking the language. (Alternatively, maybe you’re a baby in the tribe. Either way, suppose you’re trying to learn the language from the tribe.) Then, suppose that some antelope runs by and some tribesman points to it and shouts “gavagai!” It seems reasonable to conclude from this that the word “gavagai” means “antelope” in their language, but how do you know that it doesn’t refer to just the antelope’s horn? Or it could be the name of the specific antelope that ran by. Worse still, it could mean that a specific antelope ran by on some given day of the week! There’s any number of situations that the tribesman could be using the word to refer to, and so we conclude that there is no way to learn the language, even if we spend an infinite amount of time with the tribe.
There’s a joke about a planet full of people who believe in anti-induction: if the sun has risen every day in the past, then today, we should expect that it won’t. As a result, these people are all starving and living in poverty. Someone visits the planet and tells them, “Hey, why are you still using this anti-induction philosophy? You’re living in horrible poverty!”
“Well, it never worked before...”
What we want to talk about today is the efficiency of learning. We’ve seen all these philosophical problems that seem to suggest that learning is impossible, but we also know that learning does happen, and so we want to give some explanation of how it happens. This is sort of a problem in philosophy, but in my opinion the whole landscape around the problem has been transformed in recent years by what's called “computational learning theory.” This is not as widely known as it should be. Even if you’re (say) a physicist, it’s nice to know something about this theory, since it gives you a framework---different from the better-known Bayesian framework, but related to it, and possibly more useful in some contexts---for deciding when you can expect a hypothesis to predict future data.
I think a key insight that any approach has to take on board---whether it's Bayesianism, computational learning theory, or something else---is that we're never considering all logically conceivable hypotheses on an equal footing. If you have 500 ravens, each either white or black, then in principle there are 2^500 hypotheses that you have to consider. If the ravens could also be green, that would produce still more hypotheses. In reality, though, you're never considering all of these as equally possible. You're always restricting your attention to some minuscule subset of hypotheses---broadly speaking, those that are "sufficiently simple"---unless the evidence forces you to a more complex hypothesis. In other words, you're always implicitly using what we call Occam's Razor (although it isn't at all clear that this is what Occam meant).
Why does this work? Fundamentally, because the universe itself is not maximally complicated. We could well ask why it isn’t, and maybe there’s an anthropic explanation, but whatever the answer, we accept as an article of faith that the universe is reasonably simple, and we do science.
This is all talk and blather, though. Can we actually see what the tradeoffs are between the number of hypotheses we consider and how much confidence we can have in predicting the future? One way we do this was formalized by Leslie Valiant in 1984. His framework is called PAC-learning, where PAC stands for “probably approximately correct.” We aren’t going to predict everything that happens in the future, nor will we even predict most of it with certainty, but with high probability, we’ll try to get most of it right.
So how does this work? We’ll have a set S which could be finite or infinite, called our sample space. For example, we’re an infant trying to learn a language, and are given some examples of sentences which are grammatical or ungrammatical. From this, we need to come up with a rule for deciding whether a new sentence is grammatical or not. Here, our sample space is the set of possible sentences.
A concept is a Boolean function f : S → {0,1} that maps each element of the sample space to either 0 or 1. (We can later remove the assumption that concepts are Boolean, but for simplicity, we'll stick with it for now.) In our example, the concept is the language that we're trying to learn: given a sentence, the concept tells us whether it is or isn't grammatical. Then, we can have a concept class, denoted C. Here, C can be thought of as the set of languages that our baby comes into the world thinking a priori to be possible, before gathering any data about the actual language spoken.
Q: Good thing there aren’t any experimental philosophers.
Scott: You can actually connect some of this stuff to experiments. For example, this theory has been used in experiments on things like neural networks and machine learning. When I was writing a paper on PAC-learning recently, I wanted to find out how the theory was actually used, so I looked on Google Scholar. The paper by Valiant was cited about 2,000 times and about half of the citations seemed to be from experimenters of various sorts. Based on this, we can infer that further papers are likely.
For now, we’re going to say that we have some probability distribution D over the samples. In the infant example, this is like the distribution from which the child’s parents or peers draw what sentences to speak. The baby does not have to know what this distribution is. We just have to assume that it exists.
So what's the goal? We're given m examples x_1, ..., x_m drawn independently from the distribution D, and for each x_i, we're given f(x_i); that is, we're told whether each of our examples is or isn't grammatical. Using this, we want to output a hypothesis language h such that

Pr_{x∼D} [h(x) = f(x)] ≥ 1 − ε.

Can we hope to achieve this with certainty, no matter which samples we're given?
Response from the floor: You might get unlucky with the samples you’re given. If there are only finitely many of them, I guess you can do it with certainty.
Scott: Even then, you could be given the same sentence over and over again. If the only sentence you’re ever exposed to as a baby is “what a cute baby!” then you’re not going to have any basis for deciding whether “we hold these truths to be self-evident” is also a sentence.
Response from the floor: So if you could hear all the sentences, then you’d be sure.
Scott: That’s correct, but you don’t know that you’d hear all the sentences. In fact, we should assume that there are exponentially many possible sentences, of which the baby only hears a polynomial number.
So instead, we say that we only need to output an ε-good hypothesis (one satisfying the inequality above) with probability 1 − δ over the choice of samples. Now, we can give the basic theorem from Valiant's paper:
Theorem: In order to satisfy the requirement that the output hypothesis h agrees with a 1 − ε fraction of future data drawn from D, with probability 1 − δ over the choice of samples, it suffices to find any hypothesis h ∈ C that agrees with

m ≥ (1/ε) ln (|C|/δ)

samples chosen independently from D.
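To make this concrete, here's a minimal Python sketch (my illustration, not from the lecture): it computes the sample bound and runs a toy PAC experiment. The threshold concept class, the uniform distribution, and all the names are assumptions made up for the example.

```python
import math
import random

def valiant_sample_bound(num_hypotheses, epsilon, delta):
    """Samples sufficient to PAC-learn a finite class:
    m >= (1/epsilon) * ln(|C| / delta)."""
    return math.ceil((1.0 / epsilon) * math.log(num_hypotheses / delta))

# Toy setup: sample space {0, ..., 99}, concept class C of threshold
# functions f_t(x) = 1 iff x >= t, and D uniform over the sample space.
random.seed(0)
C = [lambda x, t=t: int(x >= t) for t in range(100)]
true_f = C[37]  # the unknown target concept

m = valiant_sample_bound(len(C), epsilon=0.1, delta=0.05)
samples = [random.randrange(100) for _ in range(m)]
labels = [true_f(x) for x in samples]

# Output any hypothesis in C that agrees with all the samples.
h = next(f for f in C if all(f(x) == y for x, y in zip(samples, labels)))

# Estimate h's true error on fresh data from the same distribution D.
test = [random.randrange(100) for _ in range(10000)]
error = sum(h(x) != true_f(x) for x in test) / len(test)
print(f"m = {m} samples; estimated error of h: {error:.3f}")
```

With these parameters the bound gives m = 77, and the consistent hypothesis it finds will typically be ε-good; the point is that m grew with log |C|, not with |C|.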
The key point about this bound is that it's logarithmic in the number of possible hypotheses |C|. Even if there are exponentially many hypotheses, this bound is still polynomial. Now, why do we ask that the distribution D on which the learning algorithm will be tested is the same as the distribution from which the training samples are drawn?
Response from the floor: Because if your example space is a limited subset of sample space, then you're hosed.
Scott: Right. This is like saying that nothing should be on the quiz that wasn't covered in class. If the distribution of sentences you hear people speaking has support only on English sentences, and you want a hypothesis that agrees with French sentences, you're out of luck. There's going to have to be some assumption about the future resembling the past.
Once you make this assumption, then Valiant’s theorem says that for a finite number of hypotheses, with a reasonable number of samples, you can learn.
Q: So there’s no other assumption involved?
Scott: There’s really no other assumption involved.
Q: But a Bayesian would certainly tell you that if your priors are different, then you’ll come to an entirely different conclusion.
Scott: Certainly. If you like, you can see this entire lecture as a critique of the Bayesian religion. I mean, I respect their faith and all, but not when they try to impose it on others.
Q: But there should be some point where the two either are reconciled, or disagree.
Scott: I can speak to that. The Bayesians start out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’ Rule. That’s one way to do it, but computational learning theory tells us that it's not the only way. You don’t need to start out with any assumption about a probability distribution over the hypotheses. You can make a worst-case assumption about the hypothesis (which we computer scientists love to do, being pessimists!), and then just say that you'd like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians' probability distribution over hypotheses for a probability distribution over sample data. In a lot of cases, this is actually preferable: you have no idea what the true hypothesis is, which is the whole problem, so why should you assume some particular prior distribution? We don’t have to know what the prior distribution over hypotheses is in order to apply computational learning theory. We just have to assume that there is a distribution.
The proof of Valiant's theorem is really simple. Given a hypothesis h, call it bad if it disagrees with f on at least an ε fraction of the distribution D. Then, for any specific bad hypothesis h, since x_1, ..., x_m are drawn independently from D, we have

Pr [h agrees with all of x_1, ..., x_m] ≤ (1 − ε)^m.

By the union bound, the probability that some bad hypothesis in C agrees with all m samples is at most |C| (1 − ε)^m.
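Completing the calculation is one line of algebra: setting this at most δ and using the standard estimate 1 − ε ≤ e^{−ε},

|C| (1 − ε)^m ≤ |C| e^{−εm} ≤ δ, which holds whenever m ≥ (1/ε) ln (|C|/δ),

exactly the bound in the theorem. So any hypothesis that survives all m samples is, with probability at least 1 − δ, not bad.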
This gives us a bound on the number of samples needed for a finite set of hypotheses, but what about infinite concept classes? For example, what if we're trying to learn a rectangle in the plane? Then our sample space is the set of points in the plane, and our concept class is the set of all filled-in rectangles. Suppose we're given m points, and for each one we're told whether or not it belongs to a "secret rectangle."
Well, how many possible rectangles are there? There are 2^ℵ_0 possibilities, so we can't apply the previous theorem! Nevertheless, given 20 or 30 random points in the rectangle, and 20 or 30 random points not in the rectangle but near it, intuitively it seems like we have a reasonable idea of where the rectangle is. Can we come up with a more general learning theorem to apply when the concept class is infinite? Yes, but first we need a concept called shattering.
For some concept class C, we say that a subset {s_1, s_2, ..., s_k} of the sample space is shattered by C if, for each of the 2^k possible classifications of s_1, s_2, ..., s_k, there is some function f ∈ C that agrees with that classification. Then, define the VC dimension of the class C, denoted VCdim(C), as the size of the largest subset shattered by C.
What is the VC dimension of the concept class of rectangles? We need the largest set of points such that, for each possible setting of which points are or aren't in the rectangle, there is some rectangle that contains exactly the points we want and none of the others. Four points arranged in a diamond pattern do the trick. On the other hand, there's no way to do it with five points (proof: exercise for you!).
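Here's a small self-contained Python check of these claims (again my own illustration): for axis-aligned rectangles, a labeling of points is achievable exactly when the bounding box of the positively labeled points excludes every negatively labeled point, which makes shattering easy to test by brute force.

```python
from itertools import product

def rectangle_realizes(points, labels):
    """Check whether some axis-aligned rectangle contains exactly the
    points labeled 1. The minimal candidate is the bounding box of the
    positive points; the labeling is realizable iff that box excludes
    every negative point."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # an empty rectangle realizes the all-zeros labeling
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for p, y in zip(points, labels) if y == 0)

def shattered_by_rectangles(points):
    """True iff all 2^k labelings of the points are realizable."""
    return all(rectangle_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# A four-point "diamond" is shattered...
print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2)]))  # True
# ...but adding a fifth point in the middle breaks it (try others too):
print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))  # False
```

The brute-force check confirms the diamond is shattered; showing that no five-point set works is the exercise above.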
One corollary of the next theorem is that one can perform PAC learning, with a finite number of samples, if and only if the VC dimension of the concept class is finite.
Theorem (Blumer, Ehrenfeucht, Haussler, Warmuth 1989). In order to produce a hypothesis h that will explain a 1 − ε fraction of future data drawn from distribution D, with probability 1 − δ, it suffices to output any h in C that agrees with

m ≥ (K/ε) (VCdim(C) · log(1/ε) + log(1/δ))

sample points drawn independently from D, where K is a universal constant. Furthermore, this bound is tight (up to the dependence on ε).
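As a quick illustration (mine; the theorem's constant K is unspecified, so the placeholder value below makes the absolute numbers meaningless, but the shape of the bound is the point):

```python
import math

def vc_sample_bound(vcdim, epsilon, delta, K=4.0):
    """Blumer et al. sample bound:
    m >= (K/epsilon) * (VCdim(C) * ln(1/epsilon) + ln(1/delta)).
    K is an unspecified universal constant; K=4 is an arbitrary placeholder."""
    return math.ceil((K / epsilon) *
                     (vcdim * math.log(1.0 / epsilon) + math.log(1.0 / delta)))

# Rectangles in the plane have VC dimension 4:
print(vc_sample_bound(vcdim=4, epsilon=0.05, delta=0.01))
```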
This theorem is harder to prove than the last one, and would take a whole lecture in itself, so we'll skip it here. The intuition behind the proof, however, is simply Occam's Razor. If the VC dimension is finite, then after you've seen a number of samples larger than the VC dimension, the entropy of the data you've seen grows only roughly as the VC dimension. In particular, after m observations, the number of possible labelings you could have seen is less than 2^m, since otherwise those observations would be shattered and we'd have VCdim(C) ≥ m. It follows that describing these m observations takes fewer than m bits. This means that you can come up with a theory that explains the past data, and that has fewer parameters than the data itself.
If you can do that, then intuitively, you should be able to predict the next observation. On the other hand, supposing you had some hypothetical theory in (say) high-energy physics such that, no matter what the next particle accelerator found, there'd still be some way of–I don't know–curling up extra dimensions or something to reproduce those observations [pause for laughter]---well, in that case you’d have a concept class whose VC dimension was at least as great as the number of observations you were trying to explain. In such a situation, computational learning theory gives you no reason to expect that whatever hypothesis you output will be able to predict the next observation.
The upshot is that this intuitive trade-off between the compressibility of past data and the predictability of future data can actually be formalized and proven; given reasonable assumptions, Occam’s Razor is a theorem.
What if the thing that we're trying to learn is a quantum state, say some mixed state ρ? We could have a POVM measurement E with two outcomes. In this case, I'll say that E "accepts" ρ with probability tr (Eρ), and "rejects" ρ with probability 1 − tr (Eρ). For simplicity, we'll restrict ourselves to two-outcome measurements. If we're given some state ρ, what we'd like to be able to do is predict the outcome of any measurement made on the state; that is, to predict tr (Eρ) for any two-outcome POVM measurement E. This is easily seen to be equivalent to quantum state tomography, which is recovering the density matrix ρ itself.
But, what is ρ? It's some n-qubit state represented as a 2^n × 2^n matrix with 4^n independent parameters. The number of measurements needed to do tomography on an n-qubit state is well known to grow exponentially with n. Indeed, this is already a serious practical problem for the experimenters. To learn an 8-qubit state, you might have to set your detector in 65,536 different ways, and to measure in each way hundreds of times to get reasonable accuracy.
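To see the scaling concretely, here's a short numpy sketch (illustrative only; the helper names are mine): it builds a random n-qubit mixed state and a random two-outcome POVM element, then evaluates tr (Eρ). Already at n = 8 the matrices are 256 × 256, and each extra qubit quadruples the number of entries.

```python
import numpy as np

def random_density_matrix(n_qubits, rng):
    """A random mixed state: rho = A A† / tr(A A†) is positive
    semidefinite with unit trace."""
    d = 2 ** n_qubits
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def random_povm_effect(n_qubits, rng):
    """A random two-outcome POVM element: any E with 0 <= E <= I.
    Here E = U diag(u) U† with eigenvalues u uniform in [0, 1]."""
    d = 2 ** n_qubits
    H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    U, _ = np.linalg.qr(H)  # a random unitary from QR decomposition
    return U @ np.diag(rng.uniform(size=d)) @ U.conj().T

rng = np.random.default_rng(0)
n = 8  # already a 256 x 256 density matrix
rho = random_density_matrix(n, rng)
E = random_povm_effect(n, rng)
print("Pr[E accepts rho] =", np.trace(E @ rho).real)
```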
So again, this is a practical problem for experimenters. But is it a conceptual problem as well? Some quantum computing skeptics seem to think so; we saw in the last lecture that one of the fundamental criticisms of quantum computing is that it involves manipulating these exponentially long vectors. To some skeptics, this is an inherently absurd way of describing the physical world, and either quantum mechanics is going to break down when we try to do this, or there's something else that we must not have taken into account, because you clearly can't have 2^n "independent parameters" in your description of n particles.
Now, if you need to make an exponential number of measurements on a quantum state before you know enough to predict the outcome of further measurements on it, then this would seem to be a way of formalizing the above argument and making it more persuasive. After all, our goal in science is always to come up with hypotheses that succinctly explain past observations, and thereby let us predict future observations. We might have other goals, but at the least we want to do that. So if, to characterize a general state of 500 qubits, you had to make more measurements than you could in the age of the universe, that would seem to be a problem with quantum mechanics itself, considered as a scientific theory. I'm actually inclined to agree with the skeptics about that.
Recently I had a paper where I tried to use computational learning theory to answer this argument. At a conference in London recently, Umesh Vazirani had a really nice way of explaining my result. He said, suppose you're a baby trying to learn a rule for predicting whether or not a given object is a chair. You see a bunch of objects labeled "chair" or "not-chair", and based on that you come up with general rules ("a chair has four legs," "you can sit on one," etc.) that work pretty well in most cases. Admittedly, these rules might break down if (say) you're in a modern art gallery, but we don’t worry about that. In computational learning theory, we only want to predict most of the future observations that you’ll actually make. If you’re a Philistine, and don’t go to MOMA, then don’t you worry about any chair-like objects that might be there. We need to take into account the future intentions of the learner, and for this reason, we relax the goal of quantum state tomography to the goal of predicting the outcomes of most measurements drawn from some probability distribution D.
More formally: given a mixed state ρ on n qubits, as well as measurements E_1, E_2, ..., E_m drawn independently from D and estimated probabilities p_j ≈ tr (E_j ρ) for each j ∈ {1, 2, ..., m}, the goal is to produce a hypothesis state σ that, with probability at least 1 − δ over the choice of measurements, satisfies

Pr_{E∼D} [ |tr (Eσ) − tr (Eρ)| > γ ] ≤ ε.
For this goal, I gave a theorem that bounds the number of sample measurements needed:
Theorem. Fix error parameters ε, δ and γ, and fix η > 0 such that γε ≥ 7η. Call E = (E_1, ..., E_m) a "good" training set of measurements if any hypothesis state σ that satisfies |tr (E_i σ) − tr (E_i ρ)| ≤ η for all i also satisfies

Pr_{E∼D} [ |tr (Eσ) − tr (Eρ)| > γ ] ≤ ε.

Then, there exists a constant K > 0 such that E is a good training set with probability at least 1 − δ over E_1, ..., E_m drawn from D, provided that m satisfies

m ≥ (K / γ^2 ε^2) [ (n / γ^2 ε^2) log^2 (1/γε) + log (1/δ) ].
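To illustrate the scaling discussed next (with K, the theorem's unspecified constant, set to an arbitrary placeholder, so only the growth rate matters):

```python
import math

def quantum_sample_bound(n_qubits, epsilon, delta, gamma, K=1.0):
    """Training-set size sufficient per the theorem:
    m >= (K/(gamma^2 eps^2)) * ((n/(gamma^2 eps^2)) * log^2(1/(gamma eps))
                                 + log(1/delta)).
    K is the theorem's unspecified constant; K=1 is a placeholder."""
    ge = gamma * epsilon
    return math.ceil((K / ge**2) *
                     ((n_qubits / ge**2) * math.log(1.0 / ge) ** 2
                      + math.log(1.0 / delta)))

# Doubling the number of qubits roughly doubles the bound:
for n in (100, 200, 400):
    print(n, quantum_sample_bound(n, epsilon=0.1, delta=0.01, gamma=0.1))
```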
It's important to note that this bound is linear in the number of qubits n. So if we only want to predict the outcomes of most measurements drawn from D, the number of sample measurements we need grows only linearly, not exponentially, with the number of qubits.
Why is this theorem true? Remember the result of Blumer et al, which said that you can learn with a number of samples that grows linearly with the VC dimension of your concept class. In the case of quantum states, we’re no longer dealing with Boolean functions. You can think of a quantum state as a real-valued function that takes as input a two-outcome measurement E, and produces as output a real number in [0,1] (namely, the probability that the measurement accepts). That is, ρ takes a measurement E and returns tr (Eρ).
So, can one generalize the Blumer et al. result to real-valued functions? Fortunately, this was already done for me by Alon, Ben-David, Cesa-Bianchi, and Haussler, and by Bartlett and Long among others.
Next, recall from Lecture 13 Ambainis, Nayak, et al.'s lower bound on random access codes, which tells us how many classical bits can be reliably encoded into a state of n qubits. Given an m-bit classical string x, suppose we want to encode x into a quantum state of n qubits, in such a way that any bit x_i of our choice can later be retrieved with probability at least 1 − ε. Ambainis et al. proved that we really can't get any savings by packing classical bits into a quantum state in this way: n still must be linear in m. Since this is a lower bound, we can view it as a limitation of quantum encoding schemes. But we can also turn it on its head, and say: this is actually good, as it implies an upper bound on the VC dimension of quantum states considered as a concept class. Roughly speaking, the theorem tells us that the VC dimension of n-qubit states, considered as a concept class, is at most m = O(n). To make things more formal, we need a real-valued analogue of VC dimension (called the "fat-shattering" dimension; don't ask), as well as a theorem saying that we can learn any real-valued concept class using a number of samples that grows linearly with its fat-shattering dimension.
What about actually finding the state? Even in the classical case, I’ve completely ignored the computational complexity of finding a hypothesis. I’ve said that if you somehow found a hypothesis consistent with the data, then you’re set, and can explain future data, but how do you actually find the hypothesis? For that matter, how do you even write down the answer in the quantum case? Writing out the state explicitly would take exponentially many bits! On the other hand, maybe that’s not quite so bad, since even in the classical case, it can take exponential time to find your hypothesis.
What this tells us is that, in both cases, if you care about computational and representational efficiency, then you’re going to have to restrict the problem to some special case. The results from today's lecture, which tell us about sample complexity, are just the beginning of learning theory. They answer the first question, the information-theoretic question, telling us that it suffices to take a linear number of samples. The question of how to find and represent the hypothesis comprises much of the rest of the theory. As yet, almost nothing is known about this part of learning theory in the quantum world.
I can tell you, however, some of what's known in the classical case. Maybe disappointingly, a lot of what's known takes the form of hardness results. For example, with a concept class of Boolean circuits of polynomial size, we believe it's a computationally hard problem to find a circuit (or equivalently, a short efficient computer program) that outputs the data that you've already seen, even supposing such a circuit exists. Of course we can't actually prove that this problem has no polynomial-time algorithm (for that would prove P≠NP), nor, as it turns out, can we even prove in our current state of knowledge that it's NP-complete. What we do know is that the problem is at least as hard as inverting one-way functions, and hence breaking almost all modern cryptography. Remember when we were talking about cryptography in Lecture 8, we talked about one-way functions, which are easy to compute but hard to invert? As we discussed then, Håstad, Impagliazzo, Levin, and Luby proved in 1997 that from any one-way function one can construct a pseudorandom generator, which maps n "truly" random bits to (say) n^2 bits that are indistinguishable from random by any polynomial-time algorithm. And Goldreich, Goldwasser and Micali had shown earlier that from any pseudorandom generator, one can construct a pseudorandom function family: a family of Boolean functions f : {0,1}^n → {0,1} that are computed by small circuits, but are nevertheless indistinguishable from random functions by any polynomial-time algorithm. And such a family of functions immediately leads to a computationally intractable learning problem.
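To make the Goldreich-Goldwasser-Micali step concrete, here's a toy Python sketch. To be clear about the assumptions: SHA-256 is only a stand-in for a genuine length-doubling pseudorandom generator (it carries no security proof here), and the function names are mine. The construction evaluates f_key(x) by walking a binary tree, applying the generator once per bit of x:

```python
import hashlib

def G(seed: bytes) -> tuple:
    """Stand-in length-doubling generator G(s) = (G_0(s), G_1(s)).
    A real instantiation would be a PRG built from a one-way function
    (Hastad et al.); SHA-256 here is purely illustrative."""
    return (hashlib.sha256(b'0' + seed).digest(),
            hashlib.sha256(b'1' + seed).digest())

def ggm_prf(key: bytes, x: str) -> int:
    """GGM pseudorandom function: follow the path spelled out by the
    bits of x down a binary tree rooted at the key, taking the left or
    right half of G's output at each step; output one bit of the leaf."""
    s = key
    for bit in x:
        s = G(s)[int(bit)]
    return s[0] & 1

key = b'seed (would be chosen uniformly at random)'
print([ggm_prf(key, f"{x:08b}") for x in range(8)])
```

Learning this family from input/output examples is exactly the hard problem: a learner that found a small circuit agreeing with the data would thereby distinguish f_key from a truly random function.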
Thus, we can show based on cryptographic assumptions that these problems of finding a hypothesis to explain data that you’ve seen are probably hard in general. By tweaking this result a bit, we can say that if finding a quantum state consistent with measurements that you’ve made can always be done efficiently, then there’s no one-way function secure against quantum attack. What this is saying is that we kind of have to give up hope of solving these learning problems in general, and that we have to just look at special cases. In the classical case, there are special concept classes that we can learn efficiently, such as constant-depth circuits or parity functions. I expect that something similar will be true in the quantum world.
In addition to the aforementioned rectangle learning puzzle, here's another raven puzzle, due to Carl Hempel. Let's say we want to test our favorite hypothesis that all ravens are black. How do we do this? We go out into the field, find some ravens, and see if they're black. On the other hand, consider the contrapositive of our hypothesis, which is logically equivalent: "all non-black things are non-ravens." This suggests that I can do ornithology research without leaving my office! I just have to look at random objects, note that they are not black, and check that they are not ravens. As I go along, I gather more and more data confirming that all non-black things are non-ravens, thereby confirming my hypothesis. The puzzle is whether this approach works. You're allowed to assume for this problem that I do not go out bird-watching in fields, forests or anywhere else.
[Discussion of this lecture on blog]