RQ1: How effective are different representations
in detecting similar code fragments?
In this research question, we assess the effectiveness of each representation in detecting similar code fragments. Here we report:
Complete Results
Manually validated sample
All the candidates extracted by each representation
Interesting similar code fragments
False Positives
Results - Partitions & Representations
The following spreadsheets show the results for each individual candidate partition and aggregated by representation.
Use the sheet tabs at the bottom to switch between the partition and representation views.
Manual Evaluation
The zipped file contains the manual evaluation performed on the sample, for both classes and methods. In particular, the .csv files report the labels assigned by each of the three human evaluators and the final label for each candidate in the sample. The sample files can be inspected by downloading the file below.
Candidates
The zipped file contains all the candidate clone fragments extracted by each representation for each project, at both class and method level. The folder structure is the following:
<project>
<representation>
methods.cand.csv
types.cand.csv
Both methods.cand.csv and types.cand.csv contain the following three fields:
ID_A: numerical ID of the first clone fragment;
ID_B: numerical ID of the second clone fragment;
Distance: the distance between the fragments' embeddings.
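As a rough sketch of how the candidate files could be consumed, the snippet below parses a .cand.csv file in the format described above and filters pairs by distance. The header row and the sample data are assumptions for illustration; the real files may differ.

```python
import csv
import io

# Illustrative sample in the documented format (ID_A, ID_B, Distance).
# Assumption: the real .cand.csv files include a header row.
sample = """ID_A,ID_B,Distance
12,47,0.031
12,98,0.120
47,98,0.275
"""

def load_candidates(fh, max_distance=None):
    """Yield (id_a, id_b, distance) tuples, optionally keeping only
    pairs whose embedding distance is at most max_distance."""
    reader = csv.DictReader(fh)
    for row in reader:
        dist = float(row["Distance"])
        if max_distance is None or dist <= max_distance:
            yield int(row["ID_A"]), int(row["ID_B"]), dist

# Keep only the closest candidate pairs (threshold is arbitrary here).
pairs = list(load_candidates(io.StringIO(sample), max_distance=0.15))
print(pairs)
```

A lower distance means the two fragments' embeddings are closer, so sorting or thresholding on this field is a natural way to rank candidate clones.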
Interesting Cases
Here we show similar code fragments (true positives) identified only by a specific representation. It is interesting to note the variations in the vocabulary (identifiers).
Bytecode Only
CFG Only
AST Only
False Positives
Here we show pairs of code fragments labeled as false positives (i.e., not clones), but for which structural and low-level similarities are still evident. They mostly correspond to Java Beans with similar structure and low-level operations.