All posts by sealexandru

“Reducing Redundancies in Multi-Revision Code Analysis” @ SANER’17

We’re happy to announce that the paper

“Reducing Redundancies in Multi-Revision Code Analysis”

written by Carol V. Alexandru, Sebastiano Panichella and Harald C. Gall, has been accepted into the technical research track of SANER 2017.


Software engineering research often requires analyzing multiple revisions of several software projects, be it to make and test predictions or to observe and identify patterns in how software evolves. However, code analysis tools are almost exclusively designed for the analysis of one specific version of the code, and the time and resources requirements grow linearly with each additional revision to be analyzed. Thus, code studies often observe a relatively small number of revisions and projects. Furthermore, each programming ecosystem provides dedicated tools, hence researchers typically only analyze code of one language, even when researching topics that should generalize to other ecosystems. To alleviate these issues, frameworks and models have been developed to combine analysis tools or automate the analysis of multiple revisions, but little research has gone into actually removing redundancies in multi-revision, multi-language code analysis. We present a novel end-to-end approach that systematically avoids redundancies every step of the way: when reading sources from version control, during parsing, in the internal code representation, and during the actual analysis. We evaluate our open-source implementation, LISA, on the full history of 300 projects, written in 3 different programming languages, computing basic code metrics for over 1.1 million program revisions. When analyzing many revisions, LISA requires less than a second on average to compute basic code metrics for all files in a single revision, even for projects consisting of millions of lines of code.

Use and extend LISA:

Or try out LISA using a simple template:


A Search-based Training Algorithm for Cost-aware Defect Prediction @ GECCO 2016

We’re happy to announce that the paper

“A Search-based Training Algorithm for Cost-aware Defect Prediction”

has been accepted into GECCO 2016 as a full paper. The paper was written in collaboration with our good friends at the TU Delft software engineering research group (SERG). It is authored by Annibale Panichella and co-authored by Carol V. Alexandru, Sebastiano
Panichella, Alberto Bacchelli and Harald C. Gall. GECCO is an A conference on Genetic and Evolutionary Computations.

Research has yielded approaches to predict future defects in software artifacts based on historical information, thus assisting companies in effectively allocating limited development resources and developers in reviewing each others’ code changes. Developers are unlikely to devote the same effort to inspect each software artifact predicted to contain defects, since the effort varies with the artifacts’ size (cost) and the number of defects it exhibits (effectiveness). We propose to use Genetic Algorithms (GAs) for training prediction models to maximize their cost-effectiveness. We evaluate the approach on two well-known models, Regression Tree and Generalized Linear Model, and predict defects between multiple releases of six open source projects. Our results show that regression models trained by GAs significantly outperform their traditional counterparts, improving the cost-effectiveness by up to 240%. Often the top 10% of predicted lines of code contain up to twice as many defects.

Paper accepted for ICSE’15 NIER

We are excited to announce that our paper Rapid Multi-Purpose, Multi-Commit Code Analysis was accepted for the New Ideas and Emerging Results (NIER) track of the 37th International Conference on Software Engineering (ICSE) in Florence.

Existing code- and software evolution studies typically operate on the scale of a few revisions of a small number of projects, mostly because existing tools are unsuited for performing large-scale studies. We present a novel approach, which can be used to analyze an arbitrary number of revisions of a software project simultaneously and which can be adapted for the analysis of mixed-language projects. It lays the foundation for building high-performance code analyzers for a variety of scenarios. We show that for one particular scenario, namely code metric computation, our prototype outperforms existing tools by multiple orders of magnitude when analyzing thousands of revisions.

You can find a preprint of this paper online.