RQ1: How effective are different representations in detecting similar code fragments?

In this research question we assess how effectively each representation detects similar code fragments. Here we report:

  • Complete Results

  • Manually validated sample

  • All the candidates extracted by each representation

  • Interesting similar code fragments

  • False Positives

Results - Partitions & Representations

The following spreadsheets show the results for each individual candidate partition, as well as the results aggregated by representation.

Use the sheet tabs at the bottom of each spreadsheet to switch between partitions and representations.

Methods

Classes

Manual Evaluation

The zipped file contains the manual evaluation performed on the sample, for both classes and methods. In particular, the .csv files contain the evaluations of all three human evaluators and the final label for each candidate in the sample. The sample files can be inspected by downloading the file below.

Candidates

The zipped file contains all the candidate clone fragments extracted by each representation for each project, at both class and method level. The folder structure is as follows:

  • <project>

    • <representation>

      • methods.cand.csv

      • types.cand.csv
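As an illustration of the layout above, the following minimal sketch lists every candidate file in an unzipped copy of the archive. The function name `list_candidate_files` and the `root` argument are hypothetical, and the sketch assumes the archive unpacks exactly into `<project>/<representation>/` folders as described.

```python
from pathlib import Path

# File names as given in the folder structure above.
CANDIDATE_FILES = ("methods.cand.csv", "types.cand.csv")


def list_candidate_files(root):
    """Return sorted (project, representation, file name, path) tuples under root.

    Assumption: root is the directory obtained by unzipping the archive,
    with one sub-folder per project and one per representation inside it.
    """
    found = []
    for path in Path(root).glob("*/*/*.cand.csv"):
        if path.name in CANDIDATE_FILES:
            representation = path.parent.name
            project = path.parent.parent.name
            found.append((project, representation, path.name, path))
    return sorted(found)
```

Calling `list_candidate_files("candidates")`, where `candidates` is a placeholder for the unzipped folder, would give one entry per project, representation, and granularity.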

Both methods.cand.csv and types.cand.csv contain the following three fields:

  • ID_A: Numerical ID for the first clone fragment;

  • ID_B: Numerical ID for the second clone fragment;

  • Distance: the distance between the fragments' embeddings.
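The three fields listed above can be read with a few lines of Python. This is only a sketch under stated assumptions: the files are assumed to be comma-separated with the columns in the order ID_A, ID_B, Distance; any row that does not parse as numbers (for example a header line) is skipped; and the names `Candidate`, `read_candidates`, `closest`, and the `max_distance` threshold are hypothetical, not part of the replication package.

```python
import csv
from typing import NamedTuple


class Candidate(NamedTuple):
    id_a: int        # ID_A: numerical ID of the first fragment
    id_b: int        # ID_B: numerical ID of the second fragment
    distance: float  # Distance between the fragments' embeddings


def read_candidates(lines):
    """Parse ID_A, ID_B, Distance rows from an iterable of CSV lines."""
    candidates = []
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        try:
            candidates.append(Candidate(int(row[0]), int(row[1]), float(row[2])))
        except ValueError:
            continue  # assumption: a non-numeric row is a header line
    return candidates


def closest(candidates, max_distance):
    """Keep pairs with distance <= max_distance, nearest first."""
    kept = [c for c in candidates if c.distance <= max_distance]
    return sorted(kept, key=lambda c: c.distance)
```

A file can be passed directly, for example `with open("methods.cand.csv", newline="") as f: pairs = read_candidates(f)`. Any threshold passed to `closest` is illustrative only and is not a value taken from the study.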

Interesting Cases

Here we show similar code fragments (true positives) identified by only one specific representation. It is interesting to note the variations in the vocabulary (identifiers).

Bytecode Only

18A
18B

CFG Only

65A
65B

AST Only

120A
120B
152A
152B

False Positives

Here we show pairs of code fragments that were labeled as false positives (i.e., not clones), but for which structural and low-level similarities are still evident. Most of them are JavaBeans with similar structure and low-level operations.

Bytecode

1A
1B
17A
17B

CFG

49A
49B
51A
51B

Bytecode & CFG

91A
91B