Code Representations

We use the following representations of the code:

  • Identifiers: a stream of identifiers and constants used in the code;

  • AST: a stream of the node types that make up the code's abstract syntax tree (AST);

  • Bytecode: a stream of bytecode mnemonic opcodes (e.g., iload, invokevirtual) forming the compiled code;

  • CFG: the control flow graph (CFG) of the code fragment.
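
To make the first two stream representations concrete, here is a minimal sketch that derives an identifier stream and an AST node-type stream from a small snippet. It is an analogy only: it uses Python's standard tokenize and ast modules on Python source, whereas the bytecode mnemonics above suggest the dataset was built from Java code with different tooling. The function names identifier_stream and ast_node_type_stream are hypothetical, and the traversal order (breadth-first, as produced by ast.walk) is an assumption, not the order used in the dataset.

```python
import ast
import io
import keyword
import tokenize


def identifier_stream(source: str) -> list[str]:
    """Return identifiers and constants in source order, skipping keywords."""
    wanted = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING}
    stream = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in wanted and not keyword.iskeyword(tok.string):
            stream.append(tok.string)
    return stream


def ast_node_type_stream(source: str) -> list[str]:
    """Return the type name of every AST node, in ast.walk (breadth-first) order."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b + 1\n"
    print(" ".join(identifier_stream(snippet)))
    print(" ".join(ast_node_type_stream(snippet)))
```

Each function returns a flat list of tokens, which is the shape implied by the word "stream" in the list above.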


Representations for Projects

The zipped file contains one folder per project in the Projects dataset. Each project folder contains one folder per representation, and each representation folder contains two text files: methods.src for method representations and types.src for class representations. Each line of a representation file describes a single artifact (i.e., a method or a class). The signature of each artifact is stored in the .key files (methods.src.key for methods, types.src.key for classes). Lines in the .key files map to lines in the corresponding representation files. The dataset is structured as follows:

  • <project>

    • <representation>

      • methods.src

      • types.src

    • methods.src.key

    • types.src.key
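
The layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the mapping between a .key file and its representation file is taken to be positional (line n of one corresponds to line n of the other), tokens within a line are taken to be whitespace-separated, and the files are taken to be UTF-8. None of these details is confirmed by the description, and load_artifacts is a hypothetical helper name.

```python
from pathlib import Path


def load_artifacts(rep_file, key_file):
    """Pair each signature with the token stream on the same line.

    Assumes a positional, one-to-one mapping between the two files
    and whitespace-separated tokens (both are assumptions).
    """
    rep_lines = Path(rep_file).read_text(encoding="utf-8").splitlines()
    key_lines = Path(key_file).read_text(encoding="utf-8").splitlines()
    if len(rep_lines) != len(key_lines):
        raise ValueError(
            f"{rep_file} has {len(rep_lines)} lines, "
            f"but {key_file} has {len(key_lines)}"
        )
    # A list of pairs is used rather than a dict in case signatures repeat.
    return [(key, rep.split()) for key, rep in zip(key_lines, rep_lines)]
```

For example, load_artifacts("<project>/<representation>/methods.src", "<project>/methods.src.key") would return one (signature, tokens) pair per method, with the placeholders replaced by real folder names.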

Representations for Libraries

The zipped file contains the bytecode representation of all 47 libraries in the Library dataset. The representation is at class level only and is aggregated across all libraries. The dataset is structured as follows:

  • types.src

  • types.key
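
Because the library file aggregates every class from all libraries, it may be large, so reading it lazily can help. The sketch below streams the two files in parallel and counts how often each opcode mnemonic occurs. It assumes that line n of types.key corresponds to line n of types.src and that opcodes are whitespace-separated; iter_library_classes and opcode_frequencies are hypothetical names, not part of the dataset.

```python
from collections import Counter
from typing import Iterator


def iter_library_classes(src_path, key_path) -> Iterator[tuple[str, list[str]]]:
    """Yield (class signature, opcode list) pairs without loading whole files.

    Assumes the two files are line-aligned; zip stops at the shorter
    file, so a length mismatch is not detected here.
    """
    with open(key_path, encoding="utf-8") as keys, \
            open(src_path, encoding="utf-8") as reps:
        for key, rep in zip(keys, reps):
            yield key.rstrip("\n"), rep.split()


def opcode_frequencies(src_path, key_path) -> Counter:
    """Count occurrences of each opcode mnemonic across all classes."""
    counts = Counter()
    for _signature, opcodes in iter_library_classes(src_path, key_path):
        counts.update(opcodes)
    return counts
```

Calling opcode_frequencies("types.src", "types.key").most_common(10) would list the ten most frequent mnemonics, assuming both files sit in the current directory.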