Nathanaël Fijalkow
CNRS, LaBRI, Bordeaux,
and The Alan Turing Institute, London
Data generation
Feynman: What I cannot create, I do not understand.
Scenario: two-step solutions
- Step 1: Learn from the input distribution
- Step 2: Guide the algorithm using the model
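A minimal sketch of the two-step loop, with everything invented for illustration: a toy DSL of unary integer functions, a stub in place of the trained model, and an enumeration that tries compositions in decreasing likelihood before checking them against the examples.

```python
from itertools import product

# Toy DSL: unary integer functions, stand-ins for real primitives.
DSL = {"inc": lambda x: x + 1, "double": lambda x: 2 * x, "neg": lambda x: -x}

def likelihood(names, probs):
    p = 1.0
    for n in names:
        p *= probs[n]
    return p

def synthesize(examples, probs, max_depth=4):
    """Step 2: enumerate compositions in decreasing likelihood under the
    model's predicted probabilities; return the first consistent one."""
    for depth in range(1, max_depth + 1):
        for names in sorted(product(DSL, repeat=depth),
                            key=lambda ns: -likelihood(ns, probs)):
            def run(x, ns=names):
                for n in ns:
                    x = DSL[n](x)
                return x
            if all(run(i) == o for i, o in examples):
                return names
    return None

# Step 1 would train a model on sampled inputs; here, a stub prediction.
predicted = {"inc": 0.2, "double": 0.7, "neg": 0.1}
print(synthesize([(1, 4), (3, 8)], predicted))  # ('inc', 'double')
```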
Key idea: the input distribution is heavily biased!
Perils of data generation
- The input distribution is usually not known!
- Bad data generation leads to poor performance
Even worse: deceptively good performance!
Confucius: When a wise man points at the moon, the imbecile examines the finger.
The model should learn the input distribution, not the data generation.
Research program
- Unbiased data generation algorithms
- Heuristic-based algorithms with provable guarantees
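One concrete way to make generation unbiased (a sketch of the counting idea, not necessarily the algorithm behind this research program): count how many programs of each size the DSL derives, then sample top-down with probabilities proportional to those counts, which makes the draw uniform over all programs of a fixed size.

```python
import random
from functools import lru_cache

LEAVES = ["x", "0", "1"]   # toy grammar: leaves and binary operators
OPS = ["+", "*"]

@lru_cache(maxsize=None)
def count(n):
    """Number of expression trees with exactly n nodes."""
    if n == 1:
        return len(LEAVES)
    return len(OPS) * sum(count(i) * count(n - 1 - i) for i in range(1, n - 1))

def sample(n):
    """Draw a tree uniformly at random among all trees of size n."""
    if n == 1:
        return random.choice(LEAVES)
    op = random.choice(OPS)
    # Choose the left/right size split proportionally to how many
    # trees each split contributes, which keeps the draw uniform.
    weights = [count(i) * count(n - 1 - i) for i in range(1, n - 1)]
    i = random.choices(range(1, n - 1), weights=weights)[0]
    return (op, sample(i), sample(n - 1 - i))

print(sample(7))
```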
Programming by example
FlashFill (Gulwani, Microsoft Research)
Included in Microsoft Excel 2013!
Correct from a single input/output pair in 60% of cases!
Problem
INPUT: a few* input/output pairs
OUTPUT: synthesize a program
*a few = 5
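To make the problem concrete, a toy FlashFill-flavoured instance (my own illustration, far simpler than the real DSL): five input/output pairs, and a brute-force search over short compositions of string transformations.

```python
from itertools import product

# Toy string transformations, stand-ins for FlashFill's operators.
OPS = {
    "upper": str.upper,
    "first_word": lambda s: s.split()[0],
    "initial": lambda s: s[0],
    "add_dot": lambda s: s + ".",
}

def synthesize(pairs, max_depth=3):
    for depth in range(1, max_depth + 1):
        for names in product(OPS, repeat=depth):
            def run(s, ns=names):
                for n in ns:
                    s = OPS[n](s)
                return s
            try:
                if all(run(i) == o for i, o in pairs):
                    return names
            except IndexError:  # transformation undefined on this input
                continue
    return None

pairs = [("grace hopper", "G."), ("alan turing", "A."),
         ("ada lovelace", "A."), ("kurt godel", "K."),
         ("emmy noether", "E.")]
print(synthesize(pairs))  # ('upper', 'initial', 'add_dot')
```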
The domain specific language (DSL)
DeepCoder: Learning to Write Programs (ICLR 2017)
38 high-level functions operating on lists of integers in [-255, 255]
Objective: train a model using (synthetic) programs and I/O
The model takes I/O and predicts which functions appear in the target program
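The shape of such a model, as a rough sketch (assuming PyTorch; the I/O encoding and architecture are placeholders, not DeepCoder's actual network): a multi-label classifier with one output per DSL function, trained with binary cross-entropy since several functions can appear in the same program.

```python
import torch
import torch.nn as nn

N_FUNCTIONS = 38   # size of the DSL
IO_DIM = 120       # placeholder: dimension of the encoded I/O examples

class FunctionPredictor(nn.Module):
    """Maps encoded I/O examples to one probability per DSL function:
    'does this function appear in the target program?'"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IO_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, N_FUNCTIONS),  # logits, one per function
        )

    def forward(self, io):
        return self.net(io)

model = FunctionPredictor()
loss_fn = nn.BCEWithLogitsLoss()  # multi-label: functions are not exclusive
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One (fake) training step on random tensors, for shape-checking only.
io_batch = torch.randn(32, IO_DIM)
labels = torch.randint(0, 2, (32, N_FUNCTIONS)).float()
loss = loss_fn(model(io_batch), labels)
loss.backward()
optimizer.step()
print(loss.item())
```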
Program generation
- Remove syntactically invalid programs
- Remove (probably) equivalent programs
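The "(probably) equivalent" pruning is commonly done by observational equivalence; a minimal sketch, with toy programs and an invented bank size: run every program on a fixed bank of random inputs and keep one representative per output signature.

```python
import random

random.seed(0)
# Fixed bank of random inputs; programs agreeing on all of them are
# treated as (probably) equivalent.
BANK = [[random.randint(-255, 255) for _ in range(5)] for _ in range(20)]

def signature(program):
    return tuple(tuple(program(xs)) for xs in BANK)

def prune(programs):
    """Keep one representative per observational equivalence class."""
    seen, kept = set(), []
    for p in programs:
        sig = signature(p)
        if sig not in seen:
            seen.add(sig)
            kept.append(p)
    return kept

# Example: reversing twice is (actually) equivalent to the identity.
progs = [lambda xs: xs,
         lambda xs: list(reversed(list(reversed(xs)))),
         lambda xs: sorted(xs)]
print(len(prune(progs)))  # 2
```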
I/O generation
Problem: given a program, find interesting I/O
- Domain restriction (original DeepCoder approach)
- Non-uniform sampling
- Constraint-based (leveraging z3)
- Semantic variation
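A sketch combining three of these ideas, with invented parameters: non-uniform sampling (mostly small magnitudes, occasional extremes), domain restriction (reject out-of-range outputs), and semantic variation (reject inputs on which the program acts like the identity).

```python
import random

def sample_int():
    """Non-uniform: mostly small magnitudes, occasionally extreme values."""
    if random.random() < 0.1:
        return random.choice([-255, 255])   # rare boundary values
    return random.randint(-10, 10)          # common small values

def generate_io(program, n_pairs=5, max_tries=1000):
    """Find I/O pairs that actually exercise the program: reject inputs
    on which the output equals the input or leaves the domain."""
    pairs = []
    for _ in range(max_tries):
        xs = [sample_int() for _ in range(random.randint(1, 8))]
        ys = program(xs)
        if ys != xs and all(-255 <= y <= 255 for y in ys):
            pairs.append((xs, ys))
            if len(pairs) == n_pairs:
                return pairs
    return pairs  # possibly fewer than n_pairs for near-identity programs

prog = lambda xs: sorted(x * 2 for x in xs)
print(generate_io(prog))
```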
Experiments
Input distribution
Timing distribution
Home court / away court: each model tested on its own data generation ("home") and on the others' ("away")