CNRS, LaBRI, Bordeaux, and The Alan Turing Institute, London
Program Synthesis
in the learning era
Two choices
Program space:
models from formal languages (automata, grammars)
models from machine learning (linear models, neural networks)
more syntactically structured programs
Specification:
set of I/O examples (supervised or unsupervised)
logical formulas over I/O or traces
Examples
FlashFill (Gulwani, Microsoft Research)
Included in Microsoft Excel 2013!
Correct from a single I/O example in 60% of cases!
Hackers' Delight benchmarks
Can we compute the average of two numbers in 32 bits without using 64 bits?
Problem: $x + y$ may overflow, so it is prohibited!
A solution: $$(x\ \&\ y) + ((x \oplus y) >> 1)$$
where $\&$ is bitwise and, $\oplus$ is bitwise xor, and $>>$ is right shift.
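A quick sanity check (a Python sketch, not part of the original benchmarks; 32-bit unsigned arithmetic is simulated by masking):

```python
import random

MASK = 0xFFFFFFFF  # simulate 32-bit unsigned arithmetic

def average(x, y):
    # (x & y) collects the bits x and y share (they contribute fully to the
    # average); ((x ^ y) >> 1) halves the bits where they differ.
    # No intermediate value ever needs more than 32 bits.
    return ((x & y) + ((x ^ y) >> 1)) & MASK

for _ in range(10_000):
    x, y = random.getrandbits(32), random.getrandbits(32)
    assert average(x, y) == (x + y) // 2  # Python ints never overflow
```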
Legacy code rewriting at Siemens
Some legacy software uses (assembly-like) code which can still be executed, but whose documentation (and semantics!) has been lost.
Task: rewrite the programs
Probabilistic programming
A program defines a parameterised family of distributions.
Inference task: find parameters to match a dataset
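A minimal sketch of the idea (the geometric model below is a made-up example): the program defines one distribution per value of the parameter $p$, and inference picks the $p$ under which the observations are most likely.

```python
import math, random

def program(p):
    # a tiny generative program: toss a p-biased coin until the first head
    tosses = 1
    while random.random() >= p:
        tosses += 1
    return tosses

def log_likelihood(p, data):
    # the program above induces a geometric distribution: P(k) = (1-p)^(k-1) p
    return sum(math.log(p) + (k - 1) * math.log(1 - p) for k in data)

data = [program(0.3) for _ in range(1000)]   # a synthetic dataset
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, data))
print(p_hat)                                 # close to 0.3
```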
Automata Learning
The setting
Specification: a function $f : A^* \to \mathbb{R}$
Program space: weighted automata
Weighted automata
A weighted automaton induces a function $f : A^* \to \mathbb{R}$
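Concretely, a weighted automaton is an initial vector, one transition matrix per letter, and a final vector; the value of a word is a product of matrices. A numpy sketch (the weights here are chosen so that the automaton counts the letter a):

```python
import numpy as np

alpha = np.array([1.0, 0.0])                    # initial vector
M = {"a": np.array([[1.0, 1.0], [0.0, 1.0]]),   # one matrix per letter
     "b": np.array([[1.0, 0.0], [0.0, 1.0]])}
beta = np.array([0.0, 1.0])                     # final vector

def f(word):
    # f(w) = alpha^T . M[w_1] . ... . M[w_n] . beta
    v = alpha
    for letter in word:
        v = v @ M[letter]
    return v @ beta

print(f("aba"))  # 2.0: this automaton computes the number of a's
```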
Hankel matrix
Let $f : A^* \to \mathbb{R}$.
The Hankel matrix of $f$ is the bi-infinite matrix $H_f \in \mathbb{R}^{A^* \times A^*}$ defined by
$$H_f(u,v) = f(uv)$$
Theorem: (Fliess '74)
any automaton recognising $f$ has at least $\text{rank}(H_f)$ many states;
there exists (effectively!) an automaton recognising $f$ with exactly $\text{rank}(H_f)$ many states.
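For instance, with $f(w)$ = number of a's in $w$, a small sub-block of $H_f$ already has rank 2, matching the 2-state automaton above (a numpy sketch):

```python
import numpy as np

def f(w):                        # running example: number of a's in w
    return w.count("a")

words = ["", "a", "b", "aa", "ab"]
H = np.array([[f(u + v) for v in words] for u in words])
print(np.linalg.matrix_rank(H))  # 2: every automaton for f needs >= 2 states
```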
Angluin-style learning
A learner and a teacher
membership: the learner can choose $w \in A^*$ and ask the teacher for $f(w)$
equivalence: the learner can submit a hypothesis automaton $\mathcal{A}$ to the teacher, who agrees or returns a word $w$ such that $f(w) \neq \mathcal{A}(w)$
A polynomial time algorithm
Theorem: (Beimel, Bergadano, Bshouty, Kushilevitz, Varricchio, 2000) Weighted automata are efficiently learnable.
Key idea: use a partial Hankel matrix as data structure!
Invariant: maintain sets of words $X, Y$ such that $H_f(X,Y)$ has full rank
Extending the matrix
Learning procedure:
Using $H_f(X,Y)$, construct a hypothesis automaton $\mathcal{A}$ and submit it to the teacher
Using the counter-example, construct $X' = X \cup \{u\}$ and $Y' = Y \cup \{v\}$ such that $H_f(X',Y')$ has full rank
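The hypothesis construction from a full-rank block is pure linear algebra. A numpy sketch of the standard spectral reconstruction (membership queries are simulated here by direct access to $f$):

```python
import numpy as np

def f(w):                           # the target, accessed via membership queries
    return w.count("a")

X, Y = ["", "a"], ["", "a"]         # prefixes/suffixes: H_f(X, Y) has full rank
H = np.array([[f(u + v) for v in Y] for u in X])
Hinv = np.linalg.inv(H)

# one transition matrix per letter, plus initial and final vectors
M = {a: Hinv @ np.array([[f(u + a + v) for v in Y] for u in X]) for a in "ab"}
alpha = np.array([f(v) for v in Y]) # the row of the empty prefix
beta = Hinv @ np.array([f(u) for u in X])

def hypothesis(w):
    v = alpha
    for letter in w:
        v = v @ M[letter]
    return v @ beta

print(hypothesis("abaa"))           # 3.0 (up to rounding): agrees with f
```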
Applications
Best known algorithm to learn deterministic (Boolean) automata
Best known algorithm to learn Boolean circuits
Best known algorithm to learn polynomials
Best known algorithm to learn Boolean formulas
(Generative) probabilistic automata
Induces a distribution over finite words $\mathbb{P} : A^* \to [0,1]$
The anchor assumption
Open problem: efficiently learn probabilistic automata
If we learn it as a weighted automaton, the learned object need not be probabilistic!
Anchor assumption: for every state $q$, there exists a word (anchor) $w$ which characterises $q$
Theorem: (Arora, Ge, Kannan, and Moitra; also Stratos, Collins, and Hsu) Probabilistic anchored automata are efficiently learnable
Probabilistic context-free grammars
$$S \rightarrow \frac{1}{2} SS + \frac{1}{3} a\ S\ b + \frac{1}{6} b\ S\ a \qquad ; \qquad S \rightarrow \varepsilon$$
Open problem: efficiently learn probabilistic context-free grammars from words
It is easy if we observe derivations (trees)
Theorem: (F., Clark) Probabilistic anchored context-free grammars are efficiently learnable
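To see why observed derivations make it easy: maximum likelihood then reduces to counting how often each rule is used. A Python sketch on made-up trees:

```python
from collections import Counter

# a derivation tree is (rule, children); two made-up observed derivations
t1 = ("S -> aSb", [("S -> eps", [])])
t2 = ("S -> SS",  [("S -> eps", []), ("S -> aSb", [("S -> eps", [])])])

def rules(tree):
    rule, children = tree
    yield rule
    for child in children:
        yield from rules(child)

counts = Counter(r for t in (t1, t2) for r in rules(t))
# MLE: relative frequency of each rule among the expansions of its
# left-hand side (here every rule rewrites S)
total = sum(counts.values())
print({rule: c / total for rule, c in counts.items()})
```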
Puzzle
I have a biased coin $(p,1-p)$
you pick a biased coin $(p',1 - p')$
I toss my coin
you toss your coin until you get the same outcome as mine
Your score is the (expected) number of tosses
How should you choose $p'$ to minimise the score?
Syntax-guided Synthesis
The SyGuS setting
Program space: a context-free grammar generates suitable programs
Specification: an unknown function $f : I \to O$
The problem is:
$$\exists E \text{ expression}, \forall x \in I, E(x) = f(x)$$
Subtask: Given $E$, check whether $\forall x,\ E(x) = f(x)$:
easy using a SAT / SMT solver
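For instance, the Hackers' Delight expression from earlier can be checked against its specification over 32-bit bitvectors with z3's Python API (a sketch; the specification takes a detour through 33 bits so that $x + y$ itself cannot overflow):

```python
from z3 import BitVec, ZeroExt, Extract, LShR, prove

x, y = BitVec("x", 32), BitVec("y", 32)

candidate = (x & y) + LShR(x ^ y, 1)   # the synthesised expression
# specification: floor((x + y) / 2), computed exactly in 33 bits
spec = Extract(31, 0, LShR(ZeroExt(1, x) + ZeroExt(1, y), 1))

prove(candidate == spec)               # prints: proved
```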
Main question: How do we enumerate expressions?
A solution framework
A learner and a teacher maintain a set of I/O
the learner chooses an expression consistent with the set of I/O
the teacher rules out the given expression by expanding the set of I/O
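A minimal sketch of this loop, with a toy program space and random testing standing in for the teacher's verifier:

```python
import random

# toy program space, listed in enumeration order
EXPRS = [("x", lambda x: x), ("x+1", lambda x: x + 1),
         ("2*x", lambda x: 2 * x), ("2*x+1", lambda x: 2 * x + 1)]

def target(x):                # the unknown function f (made up)
    return 2 * x + 1

examples = []                 # the shared set of I/O pairs

while True:
    # learner: first expression consistent with the current examples
    name, expr = next((n, e) for n, e in EXPRS
                      if all(e(i) == o for i, o in examples))
    # teacher: search for a counter-example (here by random testing)
    cex = next((i for i in (random.randrange(-100, 100) for _ in range(1000))
                if expr(i) != target(i)), None)
    if cex is None:
        break                 # no counter-example: accept the expression
    examples.append((cex, target(cex)))

print(name)                   # 2*x+1
```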
Weaknesses of the SyGuS framework
If the CFG is too large, exploration becomes very inefficient
The set of programs is not very well described by a CFG: most expressions are rubbish
Programming by example
INPUT: a few* I/O
OUTPUT: synthesise a program
a few = 5
The program space
38 high-level functions operating on lists of integers in $[-255,255]$
DeepCoder's line of attack
train a model using programs and I/O.
The model reads I/O and predicts which functions appear in the target program
given I/O, use the model as an oracle to guide the search using a simple DFS
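A sketch of the guided search; the predicted probabilities and the four-function DSL below are made up for illustration:

```python
# hypothetical model output: probability that each DSL function occurs in
# the target program, given the I/O (in DeepCoder, a neural network's output)
predicted = {"sort": 0.9, "reverse": 0.7, "map(*2)": 0.2, "filter(>0)": 0.1}

DSL = {"sort": sorted,
       "reverse": lambda xs: list(reversed(xs)),
       "map(*2)": lambda xs: [2 * x for x in xs],
       "filter(>0)": lambda xs: [x for x in xs if x > 0]}

examples = [([3, 1, 2], [3, 2, 1])]       # I/O of the unknown program

def run(program, xs):
    for fn in program:
        xs = DSL[fn](list(xs))
    return list(xs)

def dfs(program, depth):
    if all(run(program, i) == o for i, o in examples):
        return program
    if depth == 0:
        return None
    # the model's guidance: explore the likely functions first
    for fn in sorted(DSL, key=lambda f: -predicted[f]):
        found = dfs(program + [fn], depth - 1)
        if found is not None:
            return found
    return None

print(dfs([], depth=2))                   # ['sort', 'reverse']
```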
Idea: the model learns
patterns relating I/O to programs
biases in the distribution of programs
Major problem
We do NOT have data! Solution: synthetic training data
This talk branches in two directions
On the importance of data generation and the use of formal methods for it
On the underlying algorithmic foundations of these techniques
Program generation
Remove syntactically inept programs
Remove (probably) equivalent programs
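Since equivalence is undecidable in general, a standard approximation is observational equivalence: evaluate every program on a fixed bank of random inputs and keep one representative per behaviour. A Python sketch:

```python
import random

random.seed(0)
test_bank = [[random.randint(-255, 255) for _ in range(5)] for _ in range(20)]

def signature(program):
    # fingerprint of a program: its outputs on the fixed test bank
    return tuple(tuple(program(xs)) for xs in test_bank)

programs = [sorted, lambda xs: sorted(xs), lambda xs: xs[::-1]]
unique = {}
for p in programs:
    unique.setdefault(signature(p), p)  # keep one program per behaviour

print(len(unique))  # 2: the two implementations of sort collapse
```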
I/O generation
Problem: given a program, find interesting I/O
Domain restriction (original DeepCoder approach)
Non-uniform sampling
Constraint-based (SMT solver z3)
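A sketch of the constraint-based option with z3's Python API (the program map(*3) and the constraints are made-up examples): ask the solver for an input whose intermediate values stay within the DSL's range.

```python
from z3 import Ints, Solver, And, sat

xs = Ints("x0 x1 x2 x3 x4")
s = Solver()
for x in xs:
    s.add(And(-255 <= x, x <= 255))           # input values in the domain
    s.add(And(-255 <= 3 * x, 3 * x <= 255))   # map(*3) must stay in range
s.add(xs[0] > xs[1])                          # e.g., force an unsorted input

if s.check() == sat:
    print([s.model()[x] for x in xs])         # an interesting input
```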
Semantic variation
Input distribution
Timing distribution
Home court / away court
What does the model learn?
patterns relating I/O to programs
biases in the distribution of programs
Example: dependence between "ordered output" and "sort", or "large numbers" and "scanl(*)"
Mean Mastermind
Set of combinations: subsets of size $k$ of $\{1,\dots,n\}$
Assumption: the secret combination is sampled according to an unknown distribution $\mathcal{D}$
When presented with a combination, the master answers YES or NO
Question: what is the best algorithm for finding the secret combination?
Key assumption: we know properties of the distribution $\mathcal{D}$,
such as marginals $\mathcal{D}(c)$ for each $c \in \{1,\dots,n\}$, or conditionals $\mathcal{D}(c \mid c')$
Two settings:
Memoryless: the algorithm is a fixed distribution $\mathcal{D}'$ from which guesses are sampled
General: the algorithm remembers previously rejected combinations
Theorem: (Ohlmann, Lagarde, F.)
With full knowledge of $\mathcal{D}$, the best memoryless algorithm is
$$\mathcal{D}'(x) = \frac{\sqrt{\mathcal{D}(x)}}{\sum_y \sqrt{\mathcal{D}(y)}}$$
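A short derivation for the memoryless case: sampling guesses from $\mathcal{D}'$ hits a secret $x$ after $1/\mathcal{D}'(x)$ guesses in expectation, so
$$\mathbb{E}[\text{score}] = \sum_x \frac{\mathcal{D}(x)}{\mathcal{D}'(x)} \geq \left( \sum_x \sqrt{\mathcal{D}(x)} \right)^2$$
by Cauchy-Schwarz, with equality exactly when $\mathcal{D}'(x)$ is proportional to $\sqrt{\mathcal{D}(x)}$.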
Solution: against the biased coin $(\frac{1}{3},\frac{2}{3})$, your best bet is
$$\left( \frac{1}{1 + \sqrt{2}}, \frac{\sqrt{2}}{1 + \sqrt{2}} \right)$$