Data & Code

Datasets

1. IM Group discussions

For each role, we joined the 50 most active and popular groups among the search results on the IM platform. For each group, we tracked the chat logs over 16 months (07/2017 ~ 10/2018). In total, one million group chat traces and 50K dialog pairs were collected. Data available upon request.

QQ_group_msgs.txt data format:

currentTime;;groupID;;groupNickname;;senderID;;senderNickname;;senderFirstSeen;;senderLastSeen;;msgText
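A minimal sketch of how one record in this format could be parsed. It assumes one record per line and that only msgText may itself contain ";;"; the names GroupMessage and parse_group_message are illustrative and are not part of the repository. Timestamps are kept as strings because their exact format is not stated here.

```python
from typing import NamedTuple, Optional

FIELD_SEP = ";;"   # field delimiter from the format line above
N_FIELDS = 8       # number of fields in the format line above


class GroupMessage(NamedTuple):
    current_time: str
    group_id: str
    group_nickname: str
    sender_id: str
    sender_nickname: str
    sender_first_seen: str
    sender_last_seen: str
    msg_text: str


def parse_group_message(line: str) -> Optional[GroupMessage]:
    """Parse one line; return None if it does not have all eight fields."""
    # maxsplit keeps any ";;" that appears inside the message text intact
    # (assumption: the earlier fields never contain the delimiter).
    parts = line.rstrip("\n").split(FIELD_SEP, N_FIELDS - 1)
    if len(parts) != N_FIELDS:
        return None
    return GroupMessage(*parts)
```

Returning None for malformed lines lets a caller skip them rather than abort on a single bad record.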

2. Underground forum discussions

We gathered discussion threads from two popular underground e-commerce forums: htys123.com and zuanke8.com. In total, 700K dialog pairs were gathered.

Data available upon request.

htysPosts.qa.txt & zuanke8.qa.txt data format:

postID;:;threadTitle;:;threadPost1;:;threadPost2;:;threadPost3;:;threadPost4...
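A minimal sketch of reading one thread in this format. It assumes one thread per line, a variable number of posts after the title, and that the ";:;" delimiter never occurs inside a post; ForumThread and parse_thread are illustrative names, not part of the repository.

```python
from typing import List, NamedTuple, Optional

THREAD_SEP = ";:;"  # field delimiter from the format line above


class ForumThread(NamedTuple):
    post_id: str
    title: str
    posts: List[str]  # threadPost1, threadPost2, ... in original order


def parse_thread(line: str) -> Optional[ForumThread]:
    """Parse one line; return None if it lacks an ID, a title, or any post."""
    parts = line.rstrip("\n").split(THREAD_SEP)
    if len(parts) < 3:
        return None
    return ForumThread(post_id=parts[0], title=parts[1], posts=parts[2:])
```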

Code

1. Data Collection

1.1 Forum crawlers

Our study uses two forums, htys123 and zuanke8, which are popular among e-commerce fraudsters, to collect the underground e-commerce corpus.

htys123Crawler.py crawls the forum threads on htys123.com.

zuanke8Crawler.py crawls the forum threads on zuanke8.com.

After crawling, the result should be in the format:

postID;:;threadTitle;:;threadPost1;:;threadPost2;:;threadPost3;:;threadPost4...


1.2 Knowledge extension

After collecting the thread content, we extract dialogue pairs to extend our knowledge base. Topic detection with the SinglePass algorithm is applied to each thread to segment it into topic blocks, and dialogue pairs are then extracted from those blocks.

knowledgeExt.py performs topic segmentation on forum threads as well as IM group discussions and generates dialogue pairs for each topic.
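To illustrate the idea of single-pass topic clustering, here is a minimal sketch; it is not the code in knowledgeExt.py. It assumes posts are already tokenized, uses bag-of-words cosine similarity, and uses an arbitrary threshold of 0.3; the similarity measure, the threshold, and the function names are all assumptions. Each post is compared once against the existing topic centroids and either joins the most similar topic or starts a new one.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def single_pass(posts: List[List[str]], threshold: float = 0.3) -> List[int]:
    """Assign each tokenized post a topic label in one pass over the thread."""
    centroids: List[Counter] = []  # summed term counts per topic
    labels: List[int] = []
    for tokens in posts:
        vec = Counter(tokens)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)   # fold the post into the topic
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # start a new topic
            labels.append(len(centroids) - 1)
    return labels
```

Posts that share a label can then be grouped into a topic block; how dialogue pairs are drawn from each block is left out because it is not specified here.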


1.3 FSM extension

The questions in the seed conversations are not enough for Aubrey to chat with cybercriminals, so we enlarge the FSM question base through FSM extension.

FSMExt.py extends the questions asked by the FSMs with topic keywords.


2. Chatbot implementation

2.1 QQ manager

We used QQ as the IM platform to chat with criminals. To manage message sending and receiving, we used the public tool CoolQ and the CoolQ HTTP API (https://cqhttp.cc/).

QQManager.py starts the CoolQ service and uses the API to control the chat.
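As a rough illustration of driving the CoolQ HTTP API from Python, the sketch below builds and sends a private-message request. It assumes the plugin is listening locally on port 5700 and exposes a /send_private_msg endpoint taking user_id and message, which should be checked against the plugin documentation for the installed version; the function names are illustrative, and QQManager.py may work differently.

```python
import json
import urllib.request

API_ROOT = "http://127.0.0.1:5700"  # assumed local address of the HTTP API plugin


def build_private_msg_request(
    user_id: int, message: str, api_root: str = API_ROOT
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a private message."""
    payload = json.dumps({"user_id": user_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_root}/send_private_msg",  # assumed endpoint name
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_private_msg(user_id: int, message: str) -> dict:
    """Send the message and return the decoded JSON response."""
    request = build_private_msg_request(user_id, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect without a running CoolQ instance.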


2.2 FSM construction

FSMs are built to control the conversations with target roles.

Miscrants.py defines the three roles targeted in our study and the corresponding FSMs.
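A minimal sketch of a keyword-driven conversation FSM, to make the idea concrete. The class, the state names, the questions, and the keyword-matching rule are all hypothetical and are not taken from Miscrants.py: each state holds one question to ask, and a reply containing a trigger keyword moves the FSM to the next state.

```python
from typing import Dict, Optional


class ConversationFSM:
    """Hypothetical FSM: one question per state, keyword-triggered transitions."""

    def __init__(
        self,
        start: str,
        questions: Dict[str, str],
        transitions: Dict[str, Dict[str, str]],
    ) -> None:
        self.state = start
        self.questions = questions      # state -> question to ask
        self.transitions = transitions  # state -> {keyword: next state}

    def question(self) -> Optional[str]:
        """Question for the current state, or None in a terminal state."""
        return self.questions.get(self.state)

    def advance(self, reply: str) -> str:
        """Move to the next state if the reply contains a trigger keyword."""
        for keyword, next_state in self.transitions.get(self.state, {}).items():
            if keyword in reply:
                self.state = next_state
                break
        return self.state
```

For example, with made-up states:

```python
fsm = ConversationFSM(
    start="greet",
    questions={"greet": "Is this still available?", "ask_price": "How much is it?"},
    transitions={"greet": {"yes": "ask_price"}, "ask_price": {"yuan": "end"}},
)
fsm.advance("yes, it is")  # moves from "greet" to "ask_price"
```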


2.3 Aubrey

aubrey.py combines the FSMs with the retrieval model and cooperates with QQManager to chat with the target roles.


3. Resources

3.1 dict.txt

dictionary words used in building the Word2Vec model, for better tokenization

3.2 endingwords.txt

meaningless words that can be considered the end of a conversation

3.3 stopwords.txt

stopwords to be filtered
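A minimal sketch of how the two word lists might be applied to a tokenized message. It assumes each file holds one word per line in UTF-8, and it treats a message as a conversation ending when every token is an ending word; that rule and the function names are assumptions for illustration, not the logic used in aubrey.py.

```python
from typing import Iterable, List, Set


def load_wordlist(path: str) -> Set[str]:
    """Load a word list, assuming one word per line in UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]


def is_conversation_end(tokens: Iterable[str], ending_words: Set[str]) -> bool:
    """Assumed rule: a non-empty message made up only of ending words."""
    token_list = list(tokens)
    return bool(token_list) and all(t in ending_words for t in token_list)
```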