Nathanaël Fijalkow

CNRS, LaBRI, Bordeaux,
and The Alan Turing Institute, London

Data generation

Feynman: What I cannot create, I do not understand.

Scenario: two-step solutions

Key idea: the input distribution is heavily biassed!

Perils of data generation

Even worse, too good performances!
Confucius: When a wise man points at the moon, the imbecile examines the finger.

The model should learn the input distribution, not the data generation.

Research program

Programming by example

FlashFill (Gulwani, Microsoft Research)

Included in Microsoft Excel 2013!
Correct with one I/O in 60% of the cases!


INPUT: a few* input / output pairs
OUTPUT: synthetise a program

*a few = 5

The domain specific language (DSL)

DeepCoder: Learning to Write Programs (ICLR 2017)
38 high-level functions operating on lists in [-255,255]

Objective: train a model using (synthetic) programs and I/O

The model takes I/O and predicts which functions appear in the target program

Program generation

I/O generation

Problem: given a program, find interesting I/O

  • Domain restriction (original DeepCoder approach)
  • Non-uniform sampling
  • Constraint-based (leveraging z3)
  • Semantic variation


Input distribution

Timing distribution

Home court / away court