Nathanaël Fijalkow

CNRS, LaBRI, Bordeaux, and The Alan Turing Institute for data science, London

Program Synthesis

in the learning era

Two choices


Program space:
Specification:

Examples

FlashFill (Gulwani, Microsoft Research)


Included in Microsoft Excel 2013!
Correct with one I/O in 60% of the cases!

Hacker's Delight benchmarks


Can we compute the average of two 32-bit numbers without using 64 bits? Problem: computing $x + y$ directly is prohibited, since it may overflow!

A solution: $$(x\ \&\ y) + ((x \oplus y) >> 1)$$ where $\&$ is bitwise and, $\oplus$ is bitwise xor, and $>>$ is right shift.
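A quick sanity check of this identity (mine, not part of the original benchmark suite): a Python sketch simulating 32-bit unsigned arithmetic with an explicit mask, tested against $(x + y) / 2$ on random inputs.

```python
import random

MASK = 0xFFFFFFFF  # simulate 32-bit unsigned arithmetic

def average(x, y):
    # (x & y) counts once each bit shared by x and y,
    # (x ^ y) >> 1 contributes half of the bits where they differ,
    # so the sum equals (x + y) // 2 without ever forming x + y.
    return ((x & y) + ((x ^ y) >> 1)) & MASK

for _ in range(10_000):
    x, y = random.randrange(2**32), random.randrange(2**32)
    assert average(x, y) == (x + y) // 2
print("ok")
```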

Legacy code rewriting at Siemens


Some legacy software uses (assembly-style) code that can still be executed, but whose documentation (and semantics!) has been lost.

Task: rewrite the programs

Probabilistic programming


A program defines a parameterised family of distributions.

Inference task: find parameters to match a dataset
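A toy illustration (mine, not from the talk): a tiny probabilistic program whose parameter is recovered from a dataset by maximum likelihood.

```python
import random

def program(theta):
    # A tiny probabilistic program: toss a theta-biased coin until heads
    # and return the number of tosses; varying theta gives a parameterised
    # family of (geometric) distributions.
    n = 1
    while random.random() > theta:
        n += 1
    return n

def infer(dataset):
    # Inference: find the parameter matching the dataset; for a geometric
    # law the maximum-likelihood estimate is 1 / empirical mean.
    return len(dataset) / sum(dataset)

data = [program(0.3) for _ in range(100_000)]
print(infer(data))  # close to 0.3
```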

Automata Learning

The setting


Specification: a function $f : A^* \to \mathbb{R}$
Program space: weighted automata

Weighted automata

[Figure: a weighted automaton run step by step, showing the values computed while reading the words "", "b", "bb", "bba", and "bbab".]

Induces   $f : A^* \to \mathbb{R}$
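In matrix form (the numbers below are an arbitrary two-state example, not the automaton of the figure): a weighted automaton is an initial vector, one matrix per letter, and a final vector; the induced function multiplies them along the word.

```python
import numpy as np

alpha = np.array([1.0, 0.0])                     # initial weights
M = {"a": np.array([[0.0, 1.0], [1.0, 0.0]]),    # one transition matrix per letter
     "b": np.array([[1.0, 1.0], [0.0, 2.0]])}
beta = np.array([1.0, -1.0])                     # final weights

def f(word):
    # The induced function f : A* -> R
    v = alpha
    for letter in word:
        v = v @ M[letter]
    return float(v @ beta)

print(f(""), f("b"), f("bbab"))
```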

Hankel matrix


Let $f : A^* \to \mathbb{R}$. The Hankel matrix of $f$ is the bi-infinite matrix $H_f \in \mathbb{R}^{A^* \times A^*}$ defined by $$H_f(u,v) = f(uv)$$
Theorem: (Fliess '74)
  • Any automaton recognising $f$ has at least $\text{rank}(H_f)$ many states,
  • One can effectively construct an automaton recognising $f$ with $\text{rank}(H_f)$ many states.
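A numerical illustration of the theorem (the series $f$ below is my own example): build a finite block of the Hankel matrix and compute its rank.

```python
import numpy as np
from itertools import product

def f(w):
    # Example series: f(w) = number of 'a's in w; its Hankel matrix has rank 2.
    return w.count("a")

# Finite block of H_f(u, v) = f(uv), indexed by all words of length <= 2 over {a, b}.
words = [""] + ["".join(w) for n in (1, 2) for w in product("ab", repeat=n)]
H = np.array([[f(u + v) for v in words] for u in words], dtype=float)

# By Fliess' theorem this lower-bounds the number of states of any automaton
# recognising f, and a 2-state weighted automaton for f indeed exists.
print(np.linalg.matrix_rank(H))  # 2
```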

Angluin-style learning


A learner and a teacher

A polynomial time algorithm


Theorem: (Beimel, Bergadano, Bshouty, Kushilevitz, Varricchio, 2000) Weighted automata are efficiently learnable.
Key idea: use a partial Hankel matrix as data structure!

Invariant: maintain sets of words $X,Y$ such that $H_f(X,Y)$ has full rank

Extending the matrix


Learning procedure:
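A sketch of one core step of such a procedure, assuming membership queries $f(w)$ are available and leaving counterexample handling out: extend $X$ as long as some row $H_f(xa, Y)$ is linearly independent from the rows $H_f(X, Y)$.

```python
import numpy as np

def close(f, X, Y, alphabet="ab"):
    # Invariant: the finite block H_f(X, Y) has full row rank.
    # Extend X while some row H_f(xa, Y) increases the rank.
    block = lambda rows: np.array([[f(u + v) for v in Y] for u in rows], dtype=float)
    changed = True
    while changed:
        changed = False
        for x in list(X):
            for a in alphabet:
                if x + a in X:
                    continue
                if np.linalg.matrix_rank(block(X + [x + a])) > np.linalg.matrix_rank(block(X)):
                    X.append(x + a)   # new linearly independent row
                    changed = True
    return X

print(close(lambda w: w.count("a"), [""], ["", "a"]))  # ['', 'a']
```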

Applications



(Generative) probabilistic automata



Induces a distribution over finite words $\mathbb{P} : A^* \to [0,1]$
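A sampling sketch (the numbers form an arbitrary two-state example, not taken from the slide): from each state, either stop or emit a letter and move, with probabilities summing to 1; this induces a distribution over finite words.

```python
import random

# (probability, move) pairs per state: None means "stop here".
rules = {0: [(0.2, None), (0.5, ("a", 1)), (0.3, ("b", 0))],
         1: [(0.5, None), (0.1, ("a", 0)), (0.4, ("b", 1))]}

def sample(q=0):
    # Sampling from the induced distribution P : A* -> [0, 1].
    word = ""
    while True:
        probs, moves = zip(*rules[q])
        move = random.choices(moves, weights=probs)[0]
        if move is None:
            return word
        letter, q = move
        word += letter

print([sample() for _ in range(5)])
```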

The anchor assumption


Open problem: efficiently learn probabilistic automata
If we learn it as a weighted automaton, the learned object need not be probabilistic!

Anchor assumption: for every state $q$, there exists a word (anchor) $w$ which characterises $q$

Theorem: (Arora, Ge, Kannan, and Moitra; also Stratos, Collins, and Hsu) Probabilistic anchored automata are efficiently learnable

Probabilistic context-free grammars


$$S \rightarrow \frac{1}{2} SS + \frac{1}{3} a\ S\ b + \frac{1}{6} b\ S\ a \qquad ; \qquad S \rightarrow \varepsilon$$
Open problem: efficiently learn probabilistic context-free grammars from words

It is easy if we observe derivations (trees)
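A sketch of why derivations make it easy (the two toy trees are mine): with observed trees, the maximum-likelihood rule probabilities are just normalised rule counts.

```python
from collections import Counter

# Derivation trees written as (nonterminal, rule, children).
trees = [("S", "a S b", [("S", "eps", [])]),
         ("S", "S S", [("S", "eps", []),
                       ("S", "a S b", [("S", "eps", [])])])]

def count_rules(tree, counts):
    nonterminal, rule, children = tree
    counts[(nonterminal, rule)] += 1
    for child in children:
        count_rules(child, counts)

counts = Counter()
for t in trees:
    count_rules(t, counts)

# Maximum likelihood: probability of A -> w is count(A -> w) / count(A).
totals = Counter()
for (A, _), c in counts.items():
    totals[A] += c
print({rule: c / totals[A] for (A, rule), c in counts.items()})
# {'a S b': 1/3, 'eps': 1/2, 'S S': 1/6} up to rounding
```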

Theorem: (F., Clark) Probabilistic anchored context-free grammars are efficiently learnable

Puzzle



Your score is the (expected) number of tosses

How should you choose $p'$ to minimise the score?

Syntax-guided Synthesis

The SyGuS setting


Program space: a context-free grammar generates suitable programs
Specification: an unknown function $f : I \to O$
The problem is: $$\exists E \text{ expression}, \forall x \in I, E(x) = f(x)$$
Subtask: Given $E$, check whether $\forall x,\ E(x) = f(x)$:
easy using a SAT / SMT solver
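A minimal sketch with the z3 Python bindings, on a toy integer specification (the candidate and the spec below are placeholders): correctness of $E$ is checked by asking the solver for a counterexample.

```python
from z3 import Int, Solver, sat

x = Int("x")
candidate = x + x      # the expression E proposed by the learner
spec      = 2 * x      # the target behaviour f (known here, for illustration)

s = Solver()
s.add(candidate != spec)   # a model would refute "forall x, E(x) = f(x)"
if s.check() == sat:
    print("counterexample:", s.model()[x])
else:
    print("E agrees with f on all inputs")
```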

Main question: How do we enumerate expressions?
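One standard answer, sketched on a toy grammar of my choosing: enumerate expressions bottom-up, by increasing size.

```python
from itertools import count, islice

# Bottom-up enumeration by size for the toy grammar  E ::= x | 1 | E + E | E * E.
# Each item is a pair (string, Python function of x); an operator costs size 1.
def enumerate_expressions():
    by_size = {1: [("x", lambda x: x), ("1", lambda x: 1)]}
    yield from by_size[1]
    for size in count(2):
        new = []
        for k in range(1, size - 1):
            for ls, lf in by_size[k]:
                for rs, rf in by_size[size - 1 - k]:
                    new.append((f"({ls}+{rs})", lambda x, lf=lf, rf=rf: lf(x) + rf(x)))
                    new.append((f"({ls}*{rs})", lambda x, lf=lf, rf=rf: lf(x) * rf(x)))
        by_size[size] = new
        yield from new

for name, _ in islice(enumerate_expressions(), 8):
    print(name)   # x, 1, then all expressions of size 3
```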

A solution framework


A learner and a teacher maintain a set of I/O examples
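A minimal counterexample-guided (CEGIS-style) loop on a toy search space (everything below, including the hidden target, is illustrative): the learner proposes the first candidate consistent with the current examples, the teacher checks it and returns a counterexample.

```python
def f(x):                      # the hidden target, known only to the teacher
    return 2 * x + 1

candidates = [("x", lambda x: x), ("x + 1", lambda x: x + 1),
              ("2 * x", lambda x: 2 * x), ("2 * x + 1", lambda x: 2 * x + 1)]

examples = []                  # the shared set of I/O examples
while True:
    # Learner: first candidate consistent with every example seen so far.
    name, e = next((n, g) for n, g in candidates
                   if all(g(a) == b for a, b in examples))
    # Teacher: look for an input where the candidate and the target disagree.
    cex = next((a for a in range(-100, 100) if e(a) != f(a)), None)
    if cex is None:
        print("synthesised:", name)
        break
    examples.append((cex, f(cex)))
```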

Weaknesses of the SyGuS framework


Programming by example


INPUT: a few* I/O examples
OUTPUT: synthesise a program

a few = 5

The program space


38 high-level functions operating on lists of integers in $[-255,255]$
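For a feel of the program space (the three functions below are illustrative, not the exact DSL): a program is a short straight-line composition of such list functions.

```python
def program(xs):
    ys = sorted(xs)              # SORT
    zs = [2 * v for v in ys]     # MAP (*2)
    return sum(zs[:3])           # SUM of TAKE 3

print(program([5, -2, 7, 0]))    # 2 * (-2 + 0 + 5) = 6
```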

DeepCoder's line of attack



Idea: the model learns
  • patterns relating I/O to programs
  • biases on the distribution of programs

Major problem


We do NOT have data!
Solution: synthetic training data
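A sketch of the pipeline, with naive placeholder samplers for both programs and inputs: sample a random program, sample random inputs, run the program, and keep the (I/O, program) pairs as training data.

```python
import random

PRIMITIVES = {"sort": sorted,
              "reverse": lambda xs: xs[::-1],
              "double": lambda xs: [2 * v for v in xs]}

def random_program(length=3):
    # Naive program sampler: a random straight-line composition of primitives.
    return random.choices(list(PRIMITIVES), k=length)

def run(prog, xs):
    for name in prog:
        xs = PRIMITIVES[name](xs)
    return xs

def training_example(num_ios=5):
    prog = random_program()
    ios = []
    for _ in range(num_ios):
        xs = random.sample(range(-255, 256), 5)   # naive input sampler
        ios.append((xs, run(prog, xs)))
    return ios, prog

print(training_example())
```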

This talk is branching


Program generation


I/O generation

Problem: given a program, find interesting I/O

  • Domain restriction (original DeepCoder approach)
  • Non-uniform sampling
  • Constraint-based (SMT solver z3)
  • Semantic variation

Input distribution


Timing distribution


Home court / away court


What does the model learn?



Example: dependence between "ordered output" and "sort", or "large numbers" and "scanl(*)"

Mean Mastermind


Set of combinations: subsets of size $k$ of $\{1,\dots,n\}$

Assumption: the secret combination is sampled according to an unknown distribution $\mathcal{D}$

When presented a combination, the master answers YES or NO

Question: what is the best algorithm for finding the secret combination?
Key assumption: we know properties of the distribution $\mathcal{D}$, such as marginals $\mathcal{D}(c)$ for each $c \in \{1,\dots,n\}$, or conditionals $\mathcal{D}(c \mid c')$

Two settings:
Theorem: (Ohlmann, Lagarde, F.) With full knowledge of $\mathcal{D}$, the best memoryless algorithm is $$\mathcal{D'}(x) = \frac{\sqrt{\mathcal{D}(x)}}{\sum_y \sqrt{\mathcal{D}(y)}}$$

Solution: against the biased coin $(\frac{1}{3},\frac{2}{3})$, your best bet is $$\left( \frac{1}{1 + \sqrt{2}}, \frac{\sqrt{2}}{1 + \sqrt{2}} \right)$$
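A numerical check of this answer, under one concrete reading of the earlier puzzle (the model is my reading, consistent with the stated theorem): the secret is drawn from $\mathcal{D} = (\frac{1}{3},\frac{2}{3})$, you guess by sampling from $(q, 1-q)$ until you hit it, so the expected number of tosses is $\frac{1/3}{q} + \frac{2/3}{1-q}$.

```python
import numpy as np

D = np.array([1 / 3, 2 / 3])
qs = np.linspace(0.01, 0.99, 9999)
scores = D[0] / qs + D[1] / (1 - qs)   # expected number of tosses
print(qs[np.argmin(scores)])           # ~0.414
print(1 / (1 + np.sqrt(2)))            # the claimed optimum, ~0.414
```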

DeepSynth hires!


For internship / PhD / postdoc positions at LaBRI, Bordeaux

Theoretical:
Experimental: