https://drive.google.com/drive/folders/1Hc0t5SNymjfHu6PQt-Um-ZsANAi23aNY?usp=sharing
https://drive.google.com/file/d/1EN1f6W_XX5QtaGxhY4_a9Ft94qN_e3gM/view?usp=sharing
Here is the early-released source code.
Before running any script, always run `export PYTHONPATH=/path/to/src`, where the path points to the `src` folder in the released archive.
Please install the packages in `requirements.txt` first.
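If imports from the released `src` folder fail, a quick check like the one below may help confirm that the exported `PYTHONPATH` actually reached the interpreter. This is only a minimal sketch; the helper name `src_on_path` is made up for illustration and is not part of the released code.

```python
import os
import sys


def src_on_path(src_dir, path_entries=None):
    """Return True if src_dir is one of the import path entries.

    path_entries defaults to sys.path; it is a parameter only so the
    check can be exercised without touching the real interpreter state.
    """
    entries = sys.path if path_entries is None else path_entries
    target = os.path.realpath(src_dir)
    return any(os.path.realpath(p) == target for p in entries if p)


if __name__ == "__main__":
    # Take the first PYTHONPATH entry, assuming it is the src folder.
    first = os.environ.get("PYTHONPATH", "").split(os.pathsep)[0]
    print("src on sys.path:", bool(first) and src_on_path(first))
```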
The compiled binaries link against code from submodules and installed libraries, so the B2B database is too large to be released.
Please run the following tools to generate the embeddings yourself. (The released CodeCMR embedding vectors exclude all submodules.)
SAFE:
https://github.com/gadiluna/SAFE
Please install the latest Radare2 (https://github.com/radareorg/radare2). The version installed by package managers, e.g. `apt install radare2`, is outdated and cannot analyze many samples.
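To see which Radare2 build is picked up from `PATH` before running SAFE, something like the following sketch can be used. It only calls `radare2 -v` and returns the first line of the output; the function name `radare2_version` is an illustrative assumption, and the sketch does not decide whether the reported version is new enough.

```python
import shutil
import subprocess


def radare2_version(binary="radare2"):
    """Return the first line printed by `<binary> -v`, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        # The binary is not on PATH at all.
        return None
    result = subprocess.run(
        [path, "-v"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = radare2_version()
    print(version if version else "radare2 not found on PATH")
```

If the reported version is the distribution's packaged one, building Radare2 from the repository above is the safer choice.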
Asm2vec:
https://github.com/oalieno/asm2vec-pytorch
PalmTree_G:
https://drive.google.com/drive/folders/12usxXV6WianhDOfXUq8SptyXFtxKIdNP?usp=sharing
The files contain the embedding vectors of the samples and the vectors of the OSS database.
After unzipping, load the files with `pickle.load` in Python 3.
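Loading one of the unzipped files might look like the sketch below. The file name in the usage part is a placeholder, and the helper names `load_embeddings` and `describe` are illustrative; the layout of the loaded object is not documented here, so the sketch only reports its type and length rather than assuming a structure.

```python
import pickle


def load_embeddings(path):
    """Load one unzipped embedding file with pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


def describe(obj):
    """Summarize a loaded object without assuming its internal layout."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


if __name__ == "__main__":
    # Placeholder path: replace with a file from the unzipped archive.
    data = load_embeddings("/path/to/unzipped/embedding_file")
    print(describe(data))
```

Since `pickle.load` can execute arbitrary code, only load files obtained from the links above or from another source you trust.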
To test our framework with the released CodeCMR data, please download the source code and have a look at `src/data_format_transfer/CodeCMR_to_palmtree`. The scripts `query_data_formatter.py` and `oss_data_formatter.py` convert the data to the format used in our study.
Additionally, you need to use Centris (https://github.com/wooseunghoon/Centris-public) to do redundancy elimination and code segmentation first.
After building the OSS database, please see scripts in `src/analysis/codecmr_scripts`.
Modify the paths in `config.py`, and generate files like `src/analysis/codecmr_scripts/data/multiplex_db_list.txt` and `src/analysis/codecmr_scripts/oss_db_list.txt`.
Run `build_milvus_db.py`. You need to install Milvus (v1.1.1) and sqlite3 first. (We tested Milvus v2.+, but it has bugs on our server. See `milvus_database/FuncVecOssDB.py` if you want to try the latest Milvus.)
Then run the scripts for querying and analyzing. The script `src/analysis/codecmr_scripts/batch.sh` shows how this step is done.
Please see the section above, Test with Data of CodeCMR.
After converting your data to the format we use, the remaining steps are exactly the same.
You may find some useful scripts here: https://drive.google.com/drive/folders/1wU5FxQxY_bgo4uImVgq63UhAfTR2tk4n?usp=sharing