CantReader

This is the code release page for the paper "Reading Thieves’ Cant: Automatically Identifying and Understanding Dark Jargons from Cybercrime Marketplaces".

CantReader discovers and interprets potential dark jargon in a given text file. A detailed description can be found in the paper.

You can download the code here and follow the steps below to replicate some of the results reported in the paper.

The raw text file we collected and used in the paper can be found here.

The code is organized according to the system design described in Figure 3 of the paper. The following steps give a quick look at how to use the code and what the output looks like. The first step is to train a Semantics Comparison Model (SCM):

    • cd discoverer/scm
    • chmod +x *.sh
    • ./train.sh textFile1 textFile2 outputPath

Training is time-consuming, so the code archive includes a pre-trained SCM. You can skip training and go directly to the following steps.
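
Before moving on, it may help to confirm that the Python packages the remaining steps rely on are importable. The following is a minimal sketch, not part of the released code: the helper name find_missing is hypothetical, and the module names are the usual import names for NumPy, Gensim, scikit-learn, pandas and NLTK.

```python
import importlib.util
import sys

# Usual import names for the required packages (an assumption of this sketch).
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "gensim", "sklearn", "pandas", "nltk"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3,):
        sys.exit("Python 3 is required")
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only locates each module without importing it, so it is fast but does not verify package versions.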

1. Make sure you have Python 3 installed and the following packages available: NumPy, Gensim, scikit-learn (sklearn), pandas, NLTK.

2. To replicate the detection thresholds reported in Table 4, row 1 of the paper, run:

    • python discover.py -gb silkroad.it100 -gr silkroad.it100

3. To generate the final dark jargon results:

    • python rf_classifier.py -gb silkroad.it100
    • The jargon can be found in /resource/Interpret/, with one TSV file per jargon category.
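
To take a quick look at the generated results, the per-category TSV files can be loaded with the Python standard library. This is a minimal sketch under stated assumptions: the function name load_jargon_tsvs is hypothetical, the files are assumed to end in .tsv and to be UTF-8 encoded, and the directory path is assumed to be relative to the code root. The column layout is not described on this page, so rows are returned as plain lists of strings.

```python
import csv
from pathlib import Path


def load_jargon_tsvs(directory):
    """Read every *.tsv file in `directory` into a dict: file stem -> list of rows."""
    results = {}
    for path in sorted(Path(directory).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t")
            # Skip empty lines; keep every other row as a list of column strings.
            results[path.stem] = [row for row in reader if row]
    return results


if __name__ == "__main__":
    # Assumed output location; adjust to match your checkout.
    for category, rows in load_jargon_tsvs("resource/Interpret").items():
        print(category, len(rows))
```

Run from the code root after step 3, this prints each category file name with its row count.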

A more detailed description of the code is in the README file.

Please feel free to use the code and try it on new text corpora for interesting results.