My main research theme is similarity detection in source code, especially between projects that contain obfuscation patterns.
I'm mostly interested in exploiting token sequences and syntax tree representations.
Plade
I'm the creator of and main contributor to Plade. Plade is a flexible software framework for finding and evaluating similarities in source code using token sequences or abstract syntax tree representations. Further information about Plade and its companion libraries will be publicly available very soon.
A selection of publications
- Finding similarities in source code through factorization (LDTA'2008, Budapest). We present a technique to highlight duplicated code using functions of token sequences, organized as a call graph for each project. The call graphs of the projects are then merged into a common call graph, with new synthesized functions that correspond to nested duplicated chunks of code. Some slides presented at LDTA are available here.
- Fingerprinting syntax trees (ICPC'2009, Vancouver). Several hashing methods for subtrees, combined with some abstraction operations, are studied in order to find similar subtrees in a database of indexed projects.
- Towards a multiscale approach to find approximate matches (IWSC'2010, Cape Town). A general approach for inferring approximate matches from exact-match germs (seeds) is outlined.
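
To give a concrete flavour of the subtree hashing idea mentioned above, here is a minimal sketch in Python. It is not the method from the ICPC paper and it is not Plade code: it relies on Python's standard ast module, and the abstraction it applies (hashing node types only, so identifier names and literal values are ignored) is an assumption chosen for illustration. The names fingerprint and similar_functions are hypothetical.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(node, index):
    """Hash the subtree rooted at `node` and record it in `index`.

    Illustrative abstraction (an assumption of this sketch): only node
    types are hashed, so identifier names and literal values are ignored.
    """
    child_hashes = [fingerprint(child, index)
                    for child in ast.iter_child_nodes(node)]
    payload = type(node).__name__ + "(" + ",".join(child_hashes) + ")"
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    index[digest].append(node)
    return digest


def similar_functions(sources):
    """Group function definitions that share a fingerprint.

    `sources` maps a project name to its source code (hypothetical input
    format). Returns a list of groups of (project, function name) pairs.
    """
    groups = defaultdict(list)
    for project, code in sources.items():
        index = defaultdict(list)
        fingerprint(ast.parse(code), index)
        for digest, nodes in index.items():
            for node in nodes:
                if isinstance(node, ast.FunctionDef):
                    groups[digest].append((project, node.name))
    return [members for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    sources = {
        "a": "def add(x, y):\n    return x + y\n",
        "b": "def plus(u, v):\n    return u + v\n\ndef neg(u):\n    return -u\n",
    }
    # "add" and "plus" differ only by identifier names, so they are grouped.
    print(similar_functions(sources))
```

Because each hash is built from the hashes of the children, every subtree gets its own fingerprint in a single bottom-up pass. A real index would store those fingerprints for all subtrees of all projects, not only for function definitions.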
Talks and seminar presentations
- Une introduction à la recherche de correspondances dans du code source (An introduction to finding matches in source code; presentation in French given at the Bordeaux Computer Science Laboratory, LaBRI, on December 6, 2010)
PhD thesis
An introduction to my PhD thesis and an online version can be found here.
Web sites of my co-authors
- Gilles Roussel (my PhD supervisor)
- Étienne Duris