Code Representations
We use the following representations of the code:
Identifiers: a stream of identifiers and constants used in the code;
AST: a stream of the node types composing the abstract syntax tree (AST) of the code;
Bytecode: a stream of bytecode mnemonic opcodes (e.g., iload, invokevirtual) forming the compiled code;
CFG: the control flow graph (CFG) of the code fragment.
Representations for Projects
The zipped file contains one folder for each project in the Projects dataset. Each project folder contains one folder for each representation, and each representation folder contains text files with the method-level and class-level representations. Each line of a representation file corresponds to a single artifact (i.e., a method or a class). The signature of each artifact is stored in the .key files for classes and methods, and the lines of each .key file map to the lines of the corresponding representation file. The following summarizes the structure of the dataset:
<project>
    <representation>
        methods.src
        types.src
        methods.src.key
        types.src.key
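The layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the function names load_artifacts and _read_lines are hypothetical, and it assumes that the mapping is positional (line i of a .key file describes line i of the matching .src file) and that the files are UTF-8 text.

```python
from pathlib import Path


def _read_lines(path):
    # newline="\n" keeps Python from splitting on anything other than "\n",
    # so unusual characters inside a line do not shift the line numbering.
    # Assumption: the files are UTF-8; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace", newline="\n") as f:
        return [line.rstrip("\r\n") for line in f]


def load_artifacts(rep_dir, kind="methods"):
    """Return (signature, representation) pairs for one representation folder.

    kind is "methods" or "types", matching the file names in the layout above.
    Assumption: the key/representation mapping is by line position.
    """
    rep_dir = Path(rep_dir)
    keys = _read_lines(rep_dir / f"{kind}.src.key")
    reps = _read_lines(rep_dir / f"{kind}.src")
    if len(keys) != len(reps):
        raise ValueError(
            f"{kind}: {len(keys)} signatures but {len(reps)} representations"
        )
    return list(zip(keys, reps))
```

For example, load_artifacts("<project>/<representation>", kind="types") would return the class-level pairs for that project and representation. The length check is there so that a broken line mapping is reported instead of silently pairing the wrong signature with a representation.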
Representations for Libraries
The zipped file contains the bytecode representation for all 47 libraries in the Library dataset. The representation is at class level only and is aggregated across all the libraries. The following summarizes the structure of the dataset:
types.src
types.key
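Because the library file is aggregated, it may be convenient to stream it rather than load it whole. The sketch below is illustrative only: iter_library_classes and opcode_counts are hypothetical names, and it assumes that line i of types.key describes line i of types.src and that the opcodes on a line are separated by whitespace.

```python
from collections import Counter
from itertools import zip_longest
from pathlib import Path


def iter_library_classes(lib_dir):
    """Yield (signature, opcodes) pairs, reading both files in lockstep."""
    lib_dir = Path(lib_dir)
    with open(lib_dir / "types.key", encoding="utf-8",
              errors="replace", newline="\n") as keys, \
         open(lib_dir / "types.src", encoding="utf-8",
              errors="replace", newline="\n") as reps:
        for key, rep in zip_longest(keys, reps):
            # zip_longest pads the shorter file with None, which exposes
            # a mismatch in the number of lines.
            if key is None or rep is None:
                raise ValueError("types.key and types.src differ in length")
            # Assumption: opcodes are whitespace-separated tokens.
            yield key.rstrip("\r\n"), rep.split()


def opcode_counts(lib_dir):
    """Count how often each opcode mnemonic occurs across all classes."""
    counts = Counter()
    for _signature, opcodes in iter_library_classes(lib_dir):
        counts.update(opcodes)
    return counts
```

Calling opcode_counts on the unzipped folder and then most_common(10) on the result would list the ten most frequent mnemonics, if the whitespace assumption holds for the files.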