Nathanaël Fijalkow

CNRS, LaBRI, Bordeaux,
and The Alan Turing Institute, London

Data generation

Feynman: What I cannot create, I do not understand.

Scenario: two-step solutions


Key idea: the input distribution is heavily biassed!

Perils of data generation


Even worse, too good performances!
Confucius: When a wise man points at the moon, the imbecile examines the finger.

The model should learn the input distribution, not the data generation.

Research program


Programming by example

FlashFill (Gulwani, Microsoft Research)


Included in Microsoft Excel 2013!
Correct with one I/O in 60% of the cases!

Problem


INPUT: a few* input / output pairs
OUTPUT: synthetise a program

*a few = 5

The domain specific language (DSL)


DeepCoder: Learning to Write Programs (ICLR 2017)
38 high-level functions operating on lists in [-255,255]

Objective: train a model using (synthetic) programs and I/O



The model takes I/O and predicts which functions appear in the target program

Program generation


I/O generation

Problem: given a program, find interesting I/O

  • Domain restriction (original DeepCoder approach)
  • Non-uniform sampling
  • Constraint-based (leveraging z3)
  • Semantic variation

Experiments

Input distribution


Timing distribution


Home court / away court