RQ1: How effective are different representations
in detecting similar code fragments?
In this research question, we assess the effectiveness of each representation in detecting similar code fragments. Here we report:
Complete Results
Manually validated sample
All the candidates extracted by each representation
Interesting similar code fragments
False Positives
Results - Partitions & Representations
The following spreadsheets show the results for each individual candidate partition and aggregated by representation.
Use the sheet tabs at the bottom to switch between the partition and representation views.
Manual Evaluation
The zipped file contains the manual evaluation performed on the sample, for both classes and methods. In particular, the .csv files report the labels assigned by each of the three human evaluators and the final label for each candidate in the sample. The sample files can be inspected by downloading the file below.
Candidates
The zipped file contains all the candidate clone fragments extracted by each representation for each project, at both class and method level. The folder structure is the following:
<project>
<representation>
methods.cand.csv
types.cand.csv
Both methods.cand.csv and types.cand.csv contain the following three fields:
ID_A: numerical ID of the first clone fragment;
ID_B: numerical ID of the second clone fragment;
Distance: the distance between the fragments' embeddings.
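As a rough sketch of how the candidate files could be consumed, the snippet below parses a .cand.csv file in the format described above and filters pairs by distance. The header row and the sample data are assumptions for illustration; the real files may differ.

```python
import csv
import io

# Illustrative sample in the documented format (ID_A, ID_B, Distance).
# Assumption: the real .cand.csv files include a header row.
sample = """ID_A,ID_B,Distance
12,47,0.031
12,98,0.120
47,98,0.275
"""

def load_candidates(fh, max_distance=None):
    """Yield (id_a, id_b, distance) tuples, optionally keeping only
    pairs whose embedding distance is at most max_distance."""
    reader = csv.DictReader(fh)
    for row in reader:
        dist = float(row["Distance"])
        if max_distance is None or dist <= max_distance:
            yield int(row["ID_A"]), int(row["ID_B"]), dist

# Keep only the closest candidate pairs (threshold is arbitrary here).
pairs = list(load_candidates(io.StringIO(sample), max_distance=0.15))
print(pairs)
```

A lower distance means the two fragments' embeddings are closer, so sorting or thresholding on this field is a natural way to rank candidate clones.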
Interesting Cases
Here we show similar code fragments (true positives) identified only by a specific representation. It is interesting to note the variations in the vocabulary (identifiers).
Bytecode Only
CFG Only
AST Only
False Positives
Here we show pairs of code fragments labeled as false positives (i.e., not clones), but for which structural and low-level similarities are still evident. They mostly correspond to Java Beans with similar structure and low-level operations.