Language learning and program synthesis
Applying the "i+1" principle
December 2020
The "i+1" principle
One particularly compelling theory I’ve heard about how humans learn languages is the „i+1“ principle. It states that if „i“ is your current level of understanding of a language, you learn the most when you’re learning from content at a level „i+1“ (just a little bit harder).
In particular, it is not only overwhelming, but inefficient, for the brain to learn sentences in a foreign language where you don’t know multiple of the words in a sentence. But if you mostly hear sentences where you don’t know just one of the words, your brain tends to be good at filling in that blank.
Although this theory describes how we unconsciously fill in blanks of a sentence, it has been successfully applied to consciously learning words too. In particular, given the choice between learning a new sentence that has five words we don’t know (an i+5 sentence), or five sentences each with one word we don’t know (i+1 sentences), it may be more efficient to focus on the i+1 sentences.
Consider learning five words either through this sentence:
- Tomar toale choob nongra cheelo.
Or through these sentences (where, let's assume, because you already know most of the words in these sentences in your target foreign language, most of the words "look like English" to you.)
- Is that my phone or tomar phone ringing?
- I don’t think I need dry off with separate hair toale.
- This soup turned out choob spicy.
- I wouldn't use that nongra bathroom if I were you.
- He cheelo very selfish, but isn't anymore.
In my personal experience, I’ve noticed I learn those five words a lot quicker when I spend my time learning them in the „i+1“ contexts rather than spending the same amount of time learning them in the "i+5" sentence.
What could this have to do with program synthesis?
When learning a foreign language, there are so many sentences you can explore, we want to learn how to be most efficient. It is the same with program synthesis: program synthesis is by its nature exponential in the number of building-block functions we give it, so we always need to come up with intelligent tricks to make the program search more efficient.
In language learning, we find this efficiency by narrowing down the sentences to focus on using the "i+1" principle. In program synthesis, we can narrow down the tasks we focus on solving using the "i+1" principle.
Say we want to find a program that solves the following (quite elaborate) task from the ARC dataset:
A quick rundown on this dataset: A main argument in the 2019 paper On the Measure of Intelligence is that what we refer to as „intelligence“ is often not visible in the final AI system. Rather, intelligence is how efficiently the machine learns, whereas its skill is shown in the final product. According to this definition, you might say that a girl who plays chess with a grandmaster once and then beats her on the second try is intelligent. One who does chess training for years and then beats that grandmaster might be skilled. And so, Chollet proposes a dataset that seeks to measure intelligence over skill using the ARC dataset. This is in part enforced by instructing developers to only provide their AI with the „core knowledge“ human priors (which are skills humans are more or less born with: an ability to detecting objects, adding and subtracting small numbers, a basic understanding of geometry), and nothing else. Starting with that, and not loading your machine up with a ton of training data, how can you solve the ARC tasks?
This is where the „i+1“ principle of language learning comes in. If we want to learn efficiently, as we want to do both in language learning and program synthesis, it makes sense to use an "i+1" approach.
It doesn’t make sense for your machine to start out with a domain specific language (DSL) containing only the core priors of counting, geometry, and object detection, as the dataset instructions mandate, and then try to solve a complicated problem like above. The solution program would be too long when written out in terms of these simple functions, and program enumeration would take an extraordinary amount of time before arriving at that solution.
Rather, it might make sense for the machine to first solve a problem like this:
An ARC problem that requires counting.
...and after the machine has solved a problem like this, it might be wise to add a count_objects function to its library.
And then, because count_objects is already in its library, it can more quickly find a program to solve a problem like this:
An ARC problem that requires counting & then returning the most common object.
...and once it's solved this, it can add a most_commonly_counted_object function to its library, which will help it more efficiently find a solution to a problem like this:
An ARC problem that requires counting & then returning the object that has red squares most frequently occurring inside it.
...at which point it may devise an object_with_most_squares_of_color(x) function to add to its library, which will then enable it to quickly solve our goal problem:
An ARC problem that requires counting & then returning the object that has the most black squares frequently occurring inside it & then adding a gray border between the squares.
To find the solution to the above ARC task working from only the initial library of core human priors is way too complicated and unrealistic to solve by brute force.
And...people in the program synthesis community already know this. The recently-released program synthesis tool Dreamcoder was built to take advantage of this principle. It finds the tasks that can be solved after having only explored a shallow search, then adds new functions to the function library if they allow multiple of the currently solved tasks to be expressed more succinctly, and then embarks on another shallow search to solve all the given tasks in another iteration. Because of the growing DSL, Dreamcoder can solve more and more complicated problems in the same amount of time on successive iterations.
Given basic functional programming primitives like map, fold, and cons, Dreamcoder takes advantage of "i+1" learning to discover how to sort a list.
Note that the solution to the sort list task (bottom of the image) if expressed in initial primitives would be infeasible to find through a brute-force search -- the search tree would have to extend at least thirty layers deep.
By the nature of the Dreamcoder algorithm which limits the program search enumeration time on each iteration, it will only solve the "i+1" type tasks on each iteration. Then, the library grows, and there are a whole new set of "i+1" tasks ripe for the picking. Just as in when learning languages -- as your vocabulary expands, more sentences that used to be "i+2" sentences are now "i+1."
Applying "i+1" to tackle ARC: a success story
So, can we apply Dreamcoder to solve ARC?
My research colleague Simon Alford at MIT ran a preliminary experiment, showing the effectiveness of this method on a subset of ARC tasks -- the tasks related to symmetry.
On the first iteration, given a certain amount of time, Dreamcoder solved 19 tasks, including the following:
An ARC problem that requires mirroring across the x-axis
Program that solves the task:(lambda (combine_grids_vertically (input $0) (x_mirror (input $0))))
An ARC problem that requires mirroring across the x-axis.
Program that solves the task:(lambda (combine_grids_vertically (input $0) (x_mirror (input $0))))
From these solved tasks, Dreamcoder added a some new functions to its library (where $0 indicates a function argument):
#(lambda (x_mirror (input $0)))
#(lambda (combine_grids_vertically (input $0) (#(lambda (x_mirror (input $0)))) )).
Then using these new primitives, on the next iteration, in the same amount of time, it could now solve 25 tasks, including the following:
An ARC problem that requires four-way mirroring.
Program that solves the task:lambda (combine_grids_horizontally (#(lambda (combine_grids_vertically (input $0) (#(lambda (x_mirror (input $0))) $0))) $0) (y_mirror (#(lambda (combine_grids_vertically (input $0) (#(lambda (x_mirror (input $0))) $0))) $0)))
The above task could not be feasibly solved using only the starting primitives, given the length of the program. But, by first isolating the "i+1" tasks and then adding relevant primitives, and giving it another try, it could.
One thing I find interesting is that at first glance, it may seem just like curriculum learning, but the crucial thing is that the developers are not providing the curriculum. The machine itself is looking at all of the possible tasks it could solve, and figuring out its own best curriculum, by focusing on only solving "i+1" tasks (in this case, the tasks it can solve given a limited time and therefore a limited program length).
Especially interesting is that this experiment confirms our ideas about "i+1" being the most efficient route. When we give a program synthesis machine, say, 8000 seconds to solve i+1 tasks and i+2 tasks, it usually only solves the i+1 tasks. But if we give it 4000 seconds, a break to synthesize new primitives to add to the library, and 4000 more seconds, then it solves the i+1 tasks in the first iteration, and then i+2 tasks in the second iteration. So by only going for the low-hanging fruit in each iteration, the machine actually solves more tasks in the same amount of total time.
This is my intuition about language learning --- that given the same amount of time, I learn more vocabulary given five i+1 sentences rather than one i+5 sentence. But of course, a statement about how humans efficiently acquire languages is more complicated to prove than how a computer efficiently searches a program space.
Applying "i+1" to solve ARC: a not-success story
It is worth noting where the "i+1" principle breaks down.
So far, we've demonstrated that you can chain together primitives to get more-and-more specific primitives to add to a library. However, it could be useful to take it a step further, and show that you can create primitives that are not only more specific, but more abstract.
For example, say we provide a synthesis tool with the following primitives: (1) move object downward (2) drawing line downward (3) rotate grid. Then, it should theoretically also be able to move objects to the left, and draw lines to the left. Indeed, with our synthesis machine, we are able to solve the following tasks (which, admittedly, are not ARC tasks, but rather tasks we created to be similar to ARC tasks):
A "move object left" task.
Program that solves the task:(lambda (rotate_cw (move_down (rotate_ccw $0))))
A "draw line towards the left" task.
Program that solves the task:(lambda (rotate_cw (draw_line_down (rotate_ccw $0))))
So now, we might expect our synthesis machine will generate a new function that can apply any initially-downward-action to the left: #(lambda (lambda (rotate_cw ($0 (rotate_ccw $1))))).
...at which point, this more generic and complicated task is now an easily-solvable "i+1" task:
A "crop object left and then reflect object left" task.
Program that solves the task:(lambda (rotate_cw (crop_down(reflect_down(rotate_ccw $0)))))
But, our synthesis machine never generates that #(lambda (lambda (rotate_cw ($0 (rotate_ccw $1))))) primitive. This is because it only adds a new function to the library when use of the new function will result in a shorter program to solve the task. And as it turns out, by a trick of lambda calculus, the expression
#(lambda (lambda (rotate_cw ($0 (rotate_ccw $1))))) (move_down)
is not actually any shorter than
(lambda (rotate_cw (move_down (rotate_ccw $0)))).
So is there another way to tell our program synthesis engine that the above function might be a useful one to add to the library? We could certainly tell our engine to add a new function to the library whenever it sees two programs that share some sort of common structure, regardless of whether that new function will lead to quicker enumeration in the future. But this sort of rule is dangerous. It’s not obvious that the more abstract primitive will be useful in the future. And adding functions regardless of their ability to make program search more efficient could lead to overwhelming the library with lots of primitives, resulting in slowing down program enumeration to the point that it doesn't matter that our machine has added generalization ability -- because it just doesn't solve many tasks.
In any case, the point is that "i+1" learning is less straightforward to implement in the context of creating program synthesis tools that truly generalize. Just because we as humans might see a primitive as a useful "i+1" stepping stone, doesn't mean we can clearly tell a computer why it is.
Figuring out a way to add more generalized versions of functions to a program synthesis library seems worthwhile -- it somehow seems more intelligent than just gluing existing primitives together and increasing the machine's "tunnel vision." Ideally, the machine could "get the general idea" behind something, and use that general idea to solve an entirely different task.
Takeaways
So, yes, using "i+1" learning implemented by a clever program synthesis tool like Dreamcoder can allow machines to solve successively more complicated tasks, even using very few training examples such as is the case with the ARC data set. This strategy proves promising on the ARC dataset -- I believe our best shot at cracking this dataset in the future will involve some process involving going through all 400 ARC training tasks on a first iteration, adding new functions to the program library, and going through all 400 again, and repeating this process. And this process is somehow reminiscent of how humans learn -- we tend to achieve the most efficient results when we go for the low hanging fruit first, which allows us to solidify our skills, and then unlock a whole new level of low hanging fruit.
To be clear, I didn't present any novel ideas here -- I only wanted to demonstrate how powerful the "i+1" language learning principle is in a not-too-dissimilar field -- program synthesis, and provide an example of when it works and when its application is less straightforward (because, as is the case with most AI algorithms, it seems like there's more work to be done before we can use this principle to achieve truly generalizing algorithms).