Abstract

Code clone detection is an important problem for software maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These techniques also depend on generic, handcrafted features to represent code fragments. We introduce learning-based detection techniques where everything for representing terms and fragments in source code is mined from the repository. Our code analysis supports a framework, which relies on deep learning, for automatically linking patterns mined at the lexical level with patterns mined at the syntactic level. We evaluated our novel learning-based approach for code clone detection with respect to feasibility from the point of view of software maintainers. We sampled and manually evaluated 398 file- and 480 method-level pairs across eight real-world software systems; 93% of the file- and method-level samples were evaluated to be true positives. Among the true positives, we found pairs mapping to all four clone types. We compared our approach to a traditional structure-oriented technique and found that our learning-based approach detected clones that were either undetected or suboptimally reported by the prominent tool Deckard. Our results affirm that our learning-based approach is suitable for clone detection and a tenable technique for researchers.

Subject Systems

Our subject systems included eight real-world Java systems.

Thresholds

We applied the following general thresholds to generate 1,573 file- and 60,474 method-level candidates.

System
File-levelMethod-level
AST-basedGreedyAST-basedGreedy
ANTLR 41.00E-051.00E-091.00E-161.00E-16
Apache Ant 1.9.61.00E-051.00E-071.00E-161.00E-16
ArgoUML 0.341.00E-051.00E-101.00E-161.00E-16
CAROL 2.0.51.00E-051.00E-061.00E-051.00E-07
dnsjava 2.0.01.00E-051.00E-051.00E-051.00E-07
Hibernate 21.00E-051.00E-081.00E-051.00E-07
JDK 1.4.21.00E-051.00E-111.00E-161.00E-16
JHotDraw 61.00E-051.00E-141.00E-161.00E-16

Empirical Results

Research Question 1

To support consistent evaluations, we used the editing taxonomy for code fragments proposed by Roy et al. in their paper, Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach.

The complete results of the manual evaluation can be found at the following links:
  1. File-level Results
  2. Method-level Results
Interesting true and false positives can be found in this online appendix:
Our second research question was intended to frame an exploratory study on our results as compared to state-of-the practice results where differences may admit important practical impacts and theoretical advances. For a traditional, structure-oriented technique, we selected the prominent tool Deckard, which has been used in many evaluations and seeded many practical applications of detection tools. Our exploratory study found that our learning-based approach detected clones that were either undetected or suboptimally reported by Deckard. Here, we provide a few remarks on a small sample of learning-based candidates that were undetected or suboptimally reported by Deckard.

apache-ant0

Replaced control statements. Both classes extend BaseResourceCollectionContainer with their main functionality in the method getCollection. The first eight lines of getCollection are identical between the two fragments, but the classes differ on how they populate the collection. Difference uses a for loop to iterate over the list of ResourceCollection objects whereas Intersect uses a while loop. Our learning-based paradigm detected the clone pair despite the classes using distinctly different control statements to loop over an iterable object. Deckard only detected similarities between the package declarations and import statements.

argouml0

Modified control flow. Both classes extend the abstract class AbstractPerspectiveRule which implements the interface PerspectiveRule. The interface specifies three methods: getRuleName, getChildren, and getDependencies; these are the only methods implemented in both classes. While getRuleName and getDependencies are Type I clones, getChildren has two different implementations. GoNamespaceToDiagram defines an ArrayList wrapped as a List, iterates through a list of Diagram objects, and performs checks before adding them to the list. On the other hand, GoProjectToStateMachine defines an ArrayList wrapped as a Collection and iterates through a list of Model objects, adding them to a collection. Our approach was robust against variations in syntax from the conditional statements. Deckard only reported similarities between the package declarations and import statements.

dnsjava0

Inserted and deleted lines. Both classes extend the Record class and have a similar set of methods save a few additional getters since SRVRecord has a few more additional private fields than MINFORecord. Despite the syntactic differences between constructors and helper methods, there are evident similarities among the implementations. For example, both constructors execute a call to super before setting their private fields. Likewise, many of the helper methods have the same interface and are exclusively designed to set private fields. Deckard did not report any similar fragments between these classes yet Deckard did link some fragments of these classes to other files in the system.

hibernate0

Overloaded constructor. Both classes have the same private fields and implement the same methods using similar syntax. Discounting Type I and Type II variations, the few notable differences are reordered data independent statements in the constructors, minor syntactic differences in getMessage, and one class overloads its constructor with an additional (one-line) method. Deckard does not report any similar fragments for this pair of files; however, Deckard does report clones between NonUniqueObjectException and UnresolvableObjectException, which overloads its constructor too.

hibernate2

Reordered data-dependent statements. Deckard detected similar fragments from the package declarations up the method declaration for processCollection, the core method for each class. Both classes extend the abstract class ReattachVisitor, and the required method processCollection similar across the implementations. We manually analyzed the two implementations and found that while the implementations are distinctly different, the methods are similar. Notably, one difference provides evidence that our learning-based approach is capable of handling reordered data-dependent statements such as session.removeCollection(persister, key);, which is placed in different positions relative to the first if statement.

jhotdraw0

Fragmentation. The classes share most of their source code (with identical syntax) except for small numbers of additional lines (in some cases one line) in different locations throughout the files. These were larger files, which indicates that our approach is capable of handling gaps throughout a pair of large files and detecting their similarity. Deckard reported nearly 20 clone pairs that covered most of the files; however, from a software maintainer’s point of view, splintering these files makes it difficult to detect these (strong) Type III clones.

Binary Grammar

In order to obtain a binary tree, subtrees rooted at Case II nodes (i.e., nodes with degree greater than two) need to be reorganized so the children are suitably arranged. We defined a grammar-based approach, for each nonterminal type, to systematically reorganize the children of Case II nodes. To do so, we modified the original grammar and introduced new artificial nonterminal types.

The original and modified production rules can be found in this document.

Tree Statistics

During our analysis, we generated Tree Statistics of the ASTs associated with each source code file in the analyzed systems. The following resources can be downloaded: