The author outlines the creation of a system for detecting online sexual predators in an online IM chat feed. The paper references the work of a number of previous researchers and seeks to build on and combine the successes of these previous systems. The system was built on SAS Enterprise Guide and uses a number of programming features that SAS offers and that open-source systems may not have. The author points to previous research on the characteristics of sexual predators, their online behaviour, sources of calibration data and indicator variables of sexual predation, and presents key code from the final working system. The author does not provide a stand-alone system or code for general use. The word corpora that were used are also referenced.
The link to download the full paper is here.
Author's contact: BoydEOwens at gmail dot com
Knowledge of SAS programming and statistical modeling is assumed. The author makes reference to many SAS products and in no way intends to infringe on any of SAS corporation's intellectual property. Readers are referred to the SAS website for information about licensing, fees and support for SAS products. The author does not work for SAS and has not received any payment from SAS for work on this or any other project. The author does not represent SAS in any way. Neither the author nor SAS assumes any liability for any code in this paper, and all information is provided as-is. Nothing in this paper is intended to create any legal contract, liability or guarantee of performance.
Some of the emotional models cited are copyrighted material and their use may require fees. The reader is cautioned to research sources prior to use. I have attempted to indicate those sources which I know may be copyrighted, but may have missed some. Please bring any errors to my attention for correction, as it is not my intent to infringe in any way.
The purpose of this paper is to provide an outline of how to create and program a system to detect online sexual predators of children using advanced analytics. It is not my goal to develop revolutionary new techniques, but to combine established methods into a robust system that works in the real world. The goal is a system that identifies the majority of the predators with minimum false alarms and is scalable to a variety of applications, websites and business models.
The science and technology behind creating these types of systems involves knowledge of Linguistics, Statistics and Mathematics, Criminal psychology, Law Enforcement, Advanced Analytics and programming. This is not exactly a combination of skills that most organizations have among their employees, or that is cost effective to obtain via contract work. In creating these systems, the developer ends up having to understand knowledge from a diverse set of disciplines. Because of this, one primary goal of this paper is to summarize and simplify an approach to detecting sexual predators that presents the key necessary information but provides references that can provide further details.
This paper is formatted with several conventions for ease of use. Any sizable word corpus or code script is referenced, but placed in an appendix of this paper. Several publicly available corpora are too large to print and are referenced with hyperlinks to their storage location on the public internet. Short code scripts are included in the front section of the paper. The reader is cautioned that an entire functioning system is not provided, as much of that is architecture dependent and will vary by company or organization. The relative strength of each variable may also change depending on the site the system is monitoring. For example, the web site of a non-profit addressing teenage sex would have a higher level of explicit words than a site offering puppy adoptions. The key pieces of code for building a functioning system are provided, but the details of data feeds and final statistical models should be customized by the user.
As this paper is addressing development of an actual application for detecting Online Sexual Predators, I would like to define some terminology. This will help to bridge the gap between the disciplines of Law Enforcement, Analytics, Programming and Behavioural Psychology. It should be noted that I will attempt to be consistent in terminology, but make no guarantee.
Author - A participant in a conversation. Linguists prefer to use “interlocutor”, which the reader may see used in some of the references.
Predator - For our purposes, a Predator is an Author who is attempting to groom an underage victim for real world criminal sexual contact. In law enforcement terms, a Pedophile or Child Sexual Predator. We do not distinguish between the various subtypes of Child Sexual Predators.
Victim - A victim is an author who participates in a conversation with a predator.
Bystander - A bystander is an author who is neither a victim nor a predator.
Molester - A person who commits criminal sexual molestation against a minor.
Pedophile - A person suffering from the mental disorder of Pedophilia as defined by the “Diagnostic and Statistical Manual of Mental Disorders” published by the American Psychiatric Association.
Message - A block of text sent by a particular author, with an associated time stamp. This represents a single carriage return. Also referred to as a “Post”.
Conversation - A sequence of messages from two or more authors. It never contains a gap of more than 25 minutes between consecutive messages.
Conversation Thread - A sequence of conversations over time (days, weeks or months from start to finish) between at least a victim and a predator that encompasses the predatory grooming process.
Troll - An online psychological predator whose intent is to abuse a victim mentally, not sexually, even though they may use sexually related methods as an abuse method. The troll's final goal is not to meet the victim for criminal sexual contact.
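The 25-minute rule in the Conversation definition above can be turned into a conversation number with a short DATA step. The following is a minimal sketch, not the author's production code. It assumes DATE and TIME are numeric SAS date and time values, that the table holds a single chat channel, and that the input table is the parsed chat log WORK.QUERY_FOR_CHATLOG_0000. The names CHATLOG_SORTED, CHATLOG_CONV, MSG_DT, GAP_SECONDS and CONV_ID are hypothetical and introduced only for illustration.

```sas
/* Sketch: assign a conversation number using the 25-minute gap rule.  */
/* Assumes numeric DATE and TIME values and a single chat channel.     */
PROC SORT DATA=WORK.QUERY_FOR_CHATLOG_0000 OUT=WORK.CHATLOG_SORTED;
    BY DATE TIME;
RUN;

DATA WORK.CHATLOG_CONV;
    SET WORK.CHATLOG_SORTED;
    RETAIN CONV_ID 0;                 /* hypothetical conversation counter     */
    FORMAT MSG_DT DATETIME20.;
    MSG_DT = DHMS(DATE, 0, 0, TIME);  /* combine date and time into a datetime */
    GAP_SECONDS = DIF(MSG_DT);        /* seconds since the previous message    */
    /* Start a new conversation on the first row or after a gap of more than 25 minutes */
    IF _N_ = 1 OR GAP_SECONDS > 25*60 THEN CONV_ID = CONV_ID + 1;
    DROP GAP_SECONDS;
RUN;
```

If several channels or chat rooms are merged into one feed, the sort and the gap test would need to be done within each channel.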
A number of challenges exist in getting an adequate data-set for the purpose of training, testing and validating an analytic based detection system. Some of these challenges are listed here.
Profiles of Sexual Predators
In the interest of updating and improving the predation models used to detect Child Sexual Predators, we will first review some characteristics as identified by the FBI Behavioral Science Unit and try to incorporate these findings into the model framework. From SSA Kenneth Lanning [9] we get the following characteristics of sexual predator behavior.
Child molesters fall into two categories, situational and preferential. The situational molester does not actually have a true sexual preference for children but may engage in sex with them for a variety of reasons. Frequency can vary from once to a long term pattern. Situational molesters usually have fewer victims and may victimize other vulnerable people.
The preferential molesters have a definite preference for children. They are sexually attracted to and prefer children and typically engage in highly predictable forms of sexual behavior called sexual rituals. These rituals are often used even if they increase the chances of getting caught. Preferential molesters are fewer in number than situational molesters, but have the potential of higher numbers of victims. The following is an outline list of characteristics of child molesters, but it is important to note that no one characteristic is indicative of being a child molester.
It is important to know that no single characteristic above is an indicator of a child molester, but when taken together, they can be strong indicators or at least characteristics of typical molester profiles. It is important to note that people suffering from Pedophilia may not be child molesters, but one of the big questions is how many pedophiles are not molesters.
Predation Models
A number of the sexual predation models used in current research have moved to simplification of the models typically used by law enforcement. These simplified models have worked well, as many steps of the predation process do not produce any variables that were detectable with the NLP or Bag of Words approaches used so far by researchers. I have chosen to migrate back toward a more complex predation model for several reasons. The first is that the SAS analytics platform that I use has a number of built-in NLP functions that may produce valuable variables for the detection of sexual predators. The second is that some recent research advances in the use of sentence structures as detection variables get excluded in the simpler models [15, 16]. The third is that the addition of gender and age detection to the models may enhance the strength of some known variables, and these get excluded in the simpler models [17].
While the actual model that is developed will be different for each website, blog or IM feed, the steps for developing the model can be more consistent. Entire books have been written discussing the process of developing statistical or analytic models; a basic approach is outlined below.
Those who are familiar with SAS programming and data mining will recognize this as the SEMMA model of data mining. Any similar approach to developing a predictive model should work well.
The next several figures show diagram comparisons of several predation models used by researchers and the model used by the author. Some of the diagrams also show where in the predation model some of the common variables typically come into play. Readers should review the references for more details about the predation models from other authors.
Figure 1: Historical Predation analytical models.
Figure 2: Proposed Predation Model (Owens).
For the purpose of this paper the data feed is assumed to be Blog posts, IM Feeds, Twitter feeds, AOL Feeds, or other similar feeds. Nearly all of these types of systems have some common data elements, but the process of bringing multiple feeds together and merging them will take some customized code to achieve. We will discuss some basic example coding for getting the data into a SAS data format. We will discuss starting with either an XML file format feed or a single field CSV format, but the key data contained in the field will include the author name, the date and time of the message, and the message text.
A typical line of incoming XML format would look like this:
<date>2014-02-14</date><time>08:16:19</time><user>minkks</user><msg>OH MY GOODNESS</msg>
If your organization has licensed the SAS XML MAPPER tool, that is the easiest way to bring in XML data. The tool allows you to import the defined fields and define an output SAS dataset. Properties in each of the defined fields can be retained and brought into the SAS server.
If you do not have the SAS XML MAPPER tool, you can bring in the XML file as a single field text or CSV feed / file and parse it into the defined fields that you need. PROC SQL works well for this task.
Single field CSV feeds from Instant Message (IM) systems typically look like this.
<Name>minkks<date-time>2014-02-14:08:16:19<Message>OH MY GOODNESS
An example of the Import and parsing code used by the author is below.
/* Code for importing the single field data using a DATA step from “ChatLog” */
DATA WORK.ChatLog;
LENGTH
F1 $ 513 ;
FORMAT
F1 $CHAR513. ;
INFORMAT
F1 $CHAR513. ;
INFILE '<path name for the data file or data feed>'
LRECL=32767
ENCODING="LATIN1"
DLM='09'x
MISSOVER
DSD ;
INPUT
F1 : $CHAR513. ;
RUN;
/* Code for Parsing a single field feed into NAME, DATETIME and MESSAGE fields */
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG AS
SELECT t1.F1,
/* NAME */
(STRIP(SUBSTR(t1.F1,1,(INDEX(t1.F1," "))))) LABEL="NAME" AS NAME,
/* DATETIME */
(SUBSTR(t1.F1, INDEX(t1.F1,"(")+1, 20)) LABEL="DATETIME" AS DATETIME,
/* MESSAGE */
(STRIP(SUBSTR(t1.F1,(INDEX(t1.F1,")")+2)))) LABEL="MESSAGE" AS MESSAGE
FROM WORK.CHATLOG t1;
QUIT;
If the user has a need, additional PROC SQL code can be added to further parse the DATETIME field into separate DATE and TIME fields.
/* PROC SQL code for separating DATETIME into DATE and TIME */
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0000 AS
SELECT t1.F1,
t1.NAME,
t1.DATETIME,
/* DATE */
(INPUT(SUBSTR(t1.DATETIME, 1, 8), MMDDYY10.)) FORMAT=MMDDYY10. LABEL="DATE" AS DATE,
/* TIME */
(INPUT(SUBSTR(t1.DATETIME, 10, 8), IS8601TM8.)) FORMAT=IS8601TM8. LABEL="TIME" AS TIME,
t1.MESSAGE
FROM WORK.QUERY_FOR_CHATLOG t1;
QUIT;
In the case of the author's system, the incoming data field was parsed into separate DATE and TIME fields. The balance of the paper will work code examples based on this, but users should be able to substitute DATETIME fairly easily.
Many of the metrics used to detect online predators function by comparing against a non-predator or a victim using normalized data. Therefore we need to have appropriate normalization values for our data. In some cases we will be focusing on how a predator uses individual graphs (letters or symbols) or characters, so a count of graphs in each post as well as in total is needed. We will also look at how they use words, so again, word count by individual post and in total are needed. The last normalization would be the total number of message posts or message posts per unit time.
All of these baselines can be created within a PROC SQL command against our message log data. The following code does this.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1, /* F1 is the Message index number. */
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* WORDCOUNT */
(COUNTW(t1.MESSAGE)) LABEL="WORDCOUNT" AS WORDCOUNT,
/* LETTERCOUNT */
(LENGTHN(t1.MESSAGE)) LABEL="LETTERCOUNT" AS LETTERCOUNT,
/* MESSAGE COUNT: total rows; PROC SQL remerges this summary value onto every row */
(COUNT(1)) LABEL="MESSAGECOUNT" AS MESSAGECOUNT
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
The above code, in combination with summation by the t1.NAME variable, will be used for normalizing the various indexes used in the program. Most of the following index calculations can be added to the same PROC SQL statement as above, right after the MESSAGECOUNT code.
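The summation by NAME mentioned above could look like the following. This is a minimal sketch under the assumption that the per-message table WORK.QUERY_FOR_CHATLOG_0001 holds the WORDCOUNT and LETTERCOUNT columns described in the text. The output table name AUTHOR_BASELINE and the TOTAL_ and AVG_ column names are hypothetical.

```sas
/* Sketch: per-author totals used as normalization baselines. */
PROC SQL;
    CREATE TABLE WORK.AUTHOR_BASELINE AS
    SELECT t1.NAME,
           COUNT(*)            AS TOTAL_MESSAGES,  /* message posts per author */
           SUM(t1.WORDCOUNT)   AS TOTAL_WORDS,     /* words per author         */
           SUM(t1.LETTERCOUNT) AS TOTAL_LETTERS,   /* characters per author    */
           /* average words per message for this author */
           CALCULATED TOTAL_WORDS / CALCULATED TOTAL_MESSAGES AS AVG_WORDS_PER_MSG
    FROM WORK.QUERY_FOR_CHATLOG_0001 t1
    GROUP BY t1.NAME;
QUIT;
```

Any of the index variables described later can then be divided by TOTAL_WORDS or TOTAL_MESSAGES to compare authors who post at different volumes.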
Interrogatives form the basis for questions within the English language and are used by people to request and gather information from others during a conversation. For sexual predators, this is vital for Step 1 of the Attack Sequence, and failure here stops the entire process. Included in this list are the standard ones: who, what, when, where, why, whom, which and how. With the advent of internet slang, a larger list of words, abbreviations and symbols that function as interrogatives has entered the vernacular. These include ?, dig, 3rd degree, 3rd, third degree, fish, grill, hammer, pimp, pump, d, D. The d and D are short for “Details” and get used as follows:
VICTIM: I am going out with a friend tonight.
PREDATOR: d
VICTIM: I am going out with a friend tonight.
PREDATOR: D
Words like Fish and Hammer can also have Emoticon substitutes, which may also be added to the list at the programmer's discretion. The variable becomes a count of Interrogatives and can be normalized in the same manner as other variables. A corpus of Interrogatives is listed in Appendix E, and the reader is also referred to the section on Forensic Countermeasures.
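One way to build the interrogative count is a DATA step that walks each message word by word. The sketch below is illustrative only. It uses just the standard interrogatives named above plus “d” and the question mark, not the full Appendix E corpus, and multi-word entries such as “3rd degree” would need separate handling. The table name CHATLOG_INTERROG and the variables INTERROGATIVE, WORD and I are hypothetical.

```sas
/* Sketch: count interrogatives per message using whole-word matching. */
DATA WORK.CHATLOG_INTERROG;
    SET WORK.QUERY_FOR_CHATLOG_0000;
    LENGTH WORD $ 40;
    /* every "?" counts as one interrogative */
    INTERROGATIVE = COUNT(MESSAGE, "?");
    /* split on blanks and basic punctuation, then compare in lower case */
    DO I = 1 TO COUNTW(MESSAGE, " ?.,!");
        WORD = LOWCASE(SCAN(MESSAGE, I, " ?.,!"));
        IF WORD IN ("who", "what", "when", "where", "why",
                    "whom", "which", "how", "d")
            THEN INTERROGATIVE = INTERROGATIVE + 1;
    END;
    DROP I WORD;
RUN;
```

Whole-word matching is used here so that a short entry like “d” is not counted inside every word that contains the letter.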
The use of ALL CAPS for an entire message post is the online equivalent of shouting. For a predator, this is an emotional outburst, which sexual predators are known to use often due to emotional immaturity. A simple yes/no count of the number of posts using ALL CAPS is one indicator of a sexual predator and can be used in conjunction with other variables for detection. In SAS code we can create an ALLCAPS indicator variable as follows. We use the ANYLOWER function, which returns the position of the first lowercase letter in the MESSAGE string, or 0 if there is none. Any nonzero return means the message contains lowercase text, so we use the IFN function to flip the result and output ALLCAPS = 1 if the MESSAGE is ALL CAPS and 0 if it is not.
/* ALLCAPS Function to determine if the message is all caps or not. */
(IFN(ANYLOWER(t1.MESSAGE)=0,1,0)) LABEL="ALLCAPS" AS ALLCAPS
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
The total number of hashtags used is also an indicator of child sexual predators. The predator tends to over-use hashtags in an attempt to talk like a child. We will typically see normal adults using hashtags at some level per post or per 1000 words, children using hashtags at a higher level, and sexual predators at a higher level still. The actual numerical levels will vary depending on the forum or site, so this detection compares the level of use between the predator and the victim, or between the predator and other adults of similar age.
To create a variable called HASHTAG from our MESSAGE posts, we can use SAS code like this.
/* HASHTAG Function to detect the use of a Hashtag in a MESSAGE post. */
(IFN(t1.MESSAGE CONTAINS "#", 1,0) ) LABEL="HASHTAG" AS HASHTAG
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
The increased use of the word “Not” for negation to modify other words is a characteristic of sexual predators. By using positive words and negating them, the predator attempts to stay “more approachable” and “open” and avoid appearing “pessimistic” or “negative” to their victims. Normalized comparisons can be made against victims or other adults to identify predators. To create a NEGATION variable from a MESSAGE post, the following SAS code can be used.
/* NEGATION */
(IFN(INDEXW(t1.MESSAGE, "not")>=1, 1, 0) ) LABEL="NEGATION" AS NEGATION
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
The overuse of relation words during the grooming phase of a sexual predator's online attack can also be used to identify the predator. Relation words like “meet”, “hookup”, “boyfriend” and “girlfriend” are typically overused by sexual predators [6]. A corpus of relation words for English has been developed and is used to create the RELATIONWORD variable from the MESSAGE field. This is a count of the number of such words in each message post. This variable can be aggregated in total as well as compared by trend usage during different attack phases as an indicator of a sexual predator. The SAS code for creating this variable is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* RelationWord */
(COUNT(t1.MESSAGE, "meet", 'i' ) + COUNT(t1.MESSAGE, "date", 'i' ) + COUNT(t1.MESSAGE, "boyfriend", 'i' ) +
COUNT(t1.MESSAGE, "girlfriend", 'i' ) + COUNT(t1.MESSAGE, "hookup", 'i' ) + COUNT(t1.MESSAGE, "hang", 'i'
) + COUNT(t1.MESSAGE, "acquainted" , 'i' )+ COUNT(t1.MESSAGE, "affiliated" , 'i' )
+ COUNT(t1.MESSAGE, "arm’s-length" , 'i' )+ COUNT(t1.MESSAGE, "brittle" , 'i' )
+ COUNT(t1.MESSAGE, "broken" , 'i' )+ COUNT(t1.MESSAGE, "bromantic" , 'i' )
+ COUNT(t1.MESSAGE, "brotherly" , 'i' )+ COUNT(t1.MESSAGE, "chummy" , 'i' )
+ COUNT(t1.MESSAGE, "clannish" , 'i' )+ COUNT(t1.MESSAGE, "close" , 'i' )
+ COUNT(t1.MESSAGE, "connected" , 'i' )
+ COUNT(t1.MESSAGE, "cosy" , 'i' )+ COUNT(t1.MESSAGE, "cozy" , 'i' )
+ COUNT(t1.MESSAGE, "dysfunctional" , 'i' )+ COUNT(t1.MESSAGE, "estranged" , 'i' )
+ COUNT(t1.MESSAGE, "fragile" , 'i' )+ COUNT(t1.MESSAGE, "fraternal" , 'i' )
+ COUNT(t1.MESSAGE, "friendly" , 'i' )
+ COUNT(t1.MESSAGE, "go" , 'i' )+ COUNT(t1.MESSAGE, "have" , 'i' )
+ COUNT(t1.MESSAGE, "heavy" , 'i' )
+ COUNT(t1.MESSAGE, "illicit" , 'i' )+ COUNT(t1.MESSAGE, "immediate" , 'i' )
+ COUNT(t1.MESSAGE, "inseparable" , 'i' )+ COUNT(t1.MESSAGE, "interpersonal" , 'i' )
+ COUNT(t1.MESSAGE, "intimate" , 'i' )+ COUNT(t1.MESSAGE, "intimately" , 'i' )
+ COUNT(t1.MESSAGE, "long-lost" , 'i' )+ COUNT(t1.MESSAGE, "loveless" , 'i' )
+ COUNT(t1.MESSAGE, "maternal" , 'i' )+ COUNT(t1.MESSAGE, "matrilineal" , 'i' )
+ COUNT(t1.MESSAGE, "monogamous" , 'i' )+ COUNT(t1.MESSAGE, "monogamously" , 'i' )
+ COUNT(t1.MESSAGE, "mouth" , 'i' )+ COUNT(t1.MESSAGE, "one-sided" , 'i' )
+ COUNT(t1.MESSAGE, "one-to-one" , 'i' )+ COUNT(t1.MESSAGE, "one-way" , 'i' )
+ COUNT(t1.MESSAGE, "patriarchal" , 'i' )+ COUNT(t1.MESSAGE, "patrilineal" , 'i' )
+ COUNT(t1.MESSAGE, "personal" , 'i' )+ COUNT(t1.MESSAGE, "personally" , 'i' )
+ COUNT(t1.MESSAGE, "platonic" , 'i' )+ COUNT(t1.MESSAGE, "platonically" , 'i' )
+ COUNT(t1.MESSAGE, "political" , 'i' )+ COUNT(t1.MESSAGE, "polyandrous" , 'i' )
+ COUNT(t1.MESSAGE, "polygamous" , 'i' )+ COUNT(t1.MESSAGE, "related" , 'i' )
+ COUNT(t1.MESSAGE, "rocky" , 'i' )+ COUNT(t1.MESSAGE, "same-sex" , 'i' )
+ COUNT(t1.MESSAGE, "serious" , 'i' )+ COUNT(t1.MESSAGE, "sexual" , 'i' )
+ COUNT(t1.MESSAGE, "shifting" , 'i' )+ COUNT(t1.MESSAGE, "strong" , 'i' )
+ COUNT(t1.MESSAGE, "suited" , 'i' )+ COUNT(t1.MESSAGE, "symbiotic" , 'i' )
+ COUNT(t1.MESSAGE, "thick" , 'i' )+ COUNT(t1.MESSAGE, "tight" , 'i' )
+ COUNT(t1.MESSAGE, "tightknit" , 'i' )+ COUNT(t1.MESSAGE, "unstable" , 'i' )
+ COUNT(t1.MESSAGE, "warming" , 'i' )+ COUNT(t1.MESSAGE, "a hungry mouth" , 'i' )
+ COUNT(t1.MESSAGE, "a hungry mouth to feed" , 'i' )+ COUNT(t1.MESSAGE, "an old friend" , 'i' )+ COUNT(t1.MESSAGE, "an old ally" , 'i' )+ COUNT(t1.MESSAGE, "an old enemy" , 'i' )
+ COUNT(t1.MESSAGE, "an old student" , 'i' )+ COUNT(t1.MESSAGE, "an old girlfriend" , 'i' )
+ COUNT(t1.MESSAGE, "thick as thieves" , 'i' )+ COUNT(t1.MESSAGE, "at arm’s length" , 'i' )
+ COUNT(t1.MESSAGE, "at arms length" , 'i' )+ COUNT(t1.MESSAGE, "be on good terms" , 'i' )
+ COUNT(t1.MESSAGE, "be on bad terms" , 'i' )+ COUNT(t1.MESSAGE, "be on friendly terms" , 'i' )+ COUNT(t1.MESSAGE, "get along famously" , 'i' )+ COUNT(t1.MESSAGE, "get on famously" , 'i' )+ COUNT(t1.MESSAGE, "not on speaking terms" , 'i' )+ COUNT(t1.MESSAGE, "on the good side of" , 'i' )+ COUNT(t1.MESSAGE, "on the bad side of" , 'i' )
+ COUNT(t1.MESSAGE, "on the right side of" , 'i' )+ COUNT(t1.MESSAGE, "on the wrong side of" , 'i' )+ COUNT(t1.MESSAGE, "nodding acquaintance" , 'i' )+ COUNT(t1.MESSAGE, "nodding terms" , 'i' )+ COUNT(t1.MESSAGE, "the best of friends" , 'i' )) LABEL="RelationWord" AS RelationWord
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
In a pattern similar to RELATION words, FAMILY words are also used at an increased rate by sexual predators. Predators use family references to gather information about the victim's relationship to family members, such as how close a victim is to their parents and whether they are likely to confide in them. These words are also used by the predator to show the victim how different they are from their family, in order to induce the victim to emotionally separate from their family and the inherent protection they offer.
Again, the FAMILYWORD index is a total count of the number of usages of these words, and it will need to be normalized. The SAS code to generate this variable is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* FamilyWords */
(+ COUNT(t1.MESSAGE, "mom", 'i' )
+ COUNT(t1.MESSAGE, "father", 'i' )
+ COUNT(t1.MESSAGE, "dad", 'i' )
+ COUNT(t1.MESSAGE, "parent", 'i' )
+ COUNT(t1.MESSAGE, "children", 'i' )
+ COUNT(t1.MESSAGE, "son", 'i' )
+ COUNT(t1.MESSAGE, "daughter", 'i' )
+ COUNT(t1.MESSAGE, "sister", 'i' )
+ COUNT(t1.MESSAGE, "brother", 'i' )
+ COUNT(t1.MESSAGE, "grandmother", 'i' )
+ COUNT(t1.MESSAGE, "grandfather", 'i' )
+ COUNT(t1.MESSAGE, "grandparent", 'i' )
+ COUNT(t1.MESSAGE, "grandson", 'i' )
+ COUNT(t1.MESSAGE, "granddaughter", 'i' )
+ COUNT(t1.MESSAGE, "grandchild", 'i' )
+ COUNT(t1.MESSAGE, "aunt", 'i' )
+ COUNT(t1.MESSAGE, "uncle", 'i' )
+ COUNT(t1.MESSAGE, "niece", 'i' )
+ COUNT(t1.MESSAGE, "nephew", 'i' )
+ COUNT(t1.MESSAGE, "cousin", 'i' )
+ COUNT(t1.MESSAGE, "husband", 'i' )
+ COUNT(t1.MESSAGE, "wife", 'i' )
+ COUNT(t1.MESSAGE, "sister-in-law", 'i' )
+ COUNT(t1.MESSAGE, "brother-in-law", 'i' )
+ COUNT(t1.MESSAGE, "mother-in-law", 'i' )
+ COUNT(t1.MESSAGE, "father-in-law", 'i' )
+ COUNT(t1.MESSAGE, "partner", 'i' )
+ COUNT(t1.MESSAGE, "fiancé", 'i' )
+ COUNT(t1.MESSAGE, "fiancée", 'i' )
+ COUNT(t1.MESSAGE, "fiance", 'i' )
+ COUNT(t1.MESSAGE, "fiancee", 'i' )
+ COUNT(t1.MESSAGE, "sis", 'i' )
+ COUNT(t1.MESSAGE, "mum", 'i' )
+ COUNT(t1.MESSAGE, "cuz", 'i' )
+ COUNT(t1.MESSAGE, "bro", 'i' )
+ COUNT(t1.MESSAGE, "pop", 'i' ) ) LABEL="FamilyWords" AS FamilyWords
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
Overuse of personal pronouns as a bonding method is common with sexual predators. A comparison needs to be drawn between victims, normal forum users and the predators on any particular venue, as the level of absolute usage can vary. It is the delta between the normal user and the predator that we look for with this variable. The SAS code for the creation of PERSPRONOUN from MESSAGE is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* PersPronoun */
(COUNT(t1.MESSAGE, "I ") + COUNT(t1.MESSAGE, "me ", 'i') + COUNT(t1.MESSAGE, "my ", 'i') +
COUNT(t1.MESSAGE, "mine ", 'i') + COUNT(t1.MESSAGE, "you", 'i') + COUNT(t1.MESSAGE, "your ", 'i') +
COUNT(t1.MESSAGE, "yours", 'i') + COUNT(t1.MESSAGE, "he ", 'i') + COUNT(t1.MESSAGE, "she ", 'i') +
COUNT(t1.MESSAGE, " it ", 'i') + COUNT(t1.MESSAGE, "him", 'i') + COUNT(t1.MESSAGE, "his", 'i') +
COUNT(t1.MESSAGE, "her", 'i') + COUNT(t1.MESSAGE, "its", 'i') + COUNT(t1.MESSAGE, "ours", 'i') +
COUNT(t1.MESSAGE, "they", 'i') + COUNT(t1.MESSAGE, "hers", 'i') + COUNT(t1.MESSAGE, "we ", 'i') +
COUNT(t1.MESSAGE, " us ", 'i') + COUNT(t1.MESSAGE, "our ", 'i') + COUNT(t1.MESSAGE, "them", 'i') +
COUNT(t1.MESSAGE, "their", 'i') + COUNT(t1.MESSAGE, "theirs", 'i') + COUNT(t1.MESSAGE, "u ", 'i') +
COUNT(t1.MESSAGE, "u?", 'i')) LABEL="PersPronoun" AS PersPronoun
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
A list of English personal pronouns is here: [ I me my mine you your yours he she it him his her its hers we us our ours they them their theirs, and the slang versions “u” and “u?” ]
McGhee [2] indicated in his results that separating the pronouns into 1st, 2nd and 3rd person and tracking each as a separate variable increased the power of detection of sexual predators. He offered no details on how they were actually used in the program code, but I suspect it was in combination with other variables within a given post. There may also be value in counting pronouns in the Case Dimension (SUBJECTIVE, OBJECTIVE, POSSESSIVE) and in the PLURALITY Dimension (SINGULAR, PLURAL).
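A split by person could be sketched as follows. This is an assumption about how such variables might be built, not McGhee's implementation, which the reference does not detail. It counts whole words in lower case, and the table name CHATLOG_PRONOUNS and the variables PRON_1ST, PRON_2ND and PRON_3RD are hypothetical.

```sas
/* Sketch: separate 1st, 2nd and 3rd person pronoun counts per message. */
DATA WORK.CHATLOG_PRONOUNS;
    SET WORK.QUERY_FOR_CHATLOG_0000;
    LENGTH WORD $ 40;
    PRON_1ST = 0;
    PRON_2ND = 0;
    PRON_3RD = 0;
    /* split on blanks and basic punctuation, then compare in lower case */
    DO I = 1 TO COUNTW(MESSAGE, " ?.,!");
        WORD = LOWCASE(SCAN(MESSAGE, I, " ?.,!"));
        IF WORD IN ("i", "me", "my", "mine", "we", "us", "our", "ours")
            THEN PRON_1ST = PRON_1ST + 1;
        ELSE IF WORD IN ("you", "your", "yours", "u")
            THEN PRON_2ND = PRON_2ND + 1;
        ELSE IF WORD IN ("he", "she", "it", "him", "his", "her", "hers", "its",
                         "they", "them", "their", "theirs")
            THEN PRON_3RD = PRON_3RD + 1;
    END;
    DROP I WORD;
RUN;
```

The same loop could be extended with further counters for the case and plurality dimensions if those prove useful on a given site.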
A Personal_Pronoun_Corpus is included in APPENDIX
Overuse of reflexive pronouns like “myself” or “yourself” is another indicator. Again, these get overused during the grooming process to draw a distinction between the victim and others.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* REFPronouns, also contain archaic Logophors */
(COUNT(t1.MESSAGE, "myself", 'i') + COUNT(t1.MESSAGE, "yourself", 'i')
+ COUNT(t1.MESSAGE, "thyself", 'i')+ COUNT(t1.MESSAGE, "himself", 'i')
+ COUNT(t1.MESSAGE, "hisself", 'i')+ COUNT(t1.MESSAGE, "herself", 'i')
+ COUNT(t1.MESSAGE, "itself", 'i')+ COUNT(t1.MESSAGE, "oneself", 'i')
+ COUNT(t1.MESSAGE, "ourselves", 'i')+ COUNT(t1.MESSAGE, "ourself", 'i')
+ COUNT(t1.MESSAGE, "yourselves", 'i')+ COUNT(t1.MESSAGE, "themself", 'i')
+ COUNT(t1.MESSAGE, "themselves", 'i')+ COUNT(t1.MESSAGE, "theirselves", 'i')
) LABEL="REFPronouns" AS REFPronouns
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
The use of “xo” or “XO” to mean hugs and kisses is also a form of verbal grooming by sexual predators. This introduces the idea of being touched and interacting sexually with the predator and is used during the grooming phase of the attack. Extended variants of this like “xoxoxoxoxo” are common and can be detected by a simple search for “xo”, which occurs infrequently enough in regular English usage that the error induced is typically close to zero. Like the measures above, a normalized comparison can be used, but predators often stand out in a pure count as well.
SAS code for use inside of a PROC SQL command follows.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* XOCount */
(COUNT(t1.MESSAGE, "xo", 'i')) LABEL="XOCount" AS XOCount
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
A modal verb is a type of auxiliary verb that is used to indicate likelihood, ability, permission and obligation. Examples include the English verbs can, could, may, might, must, will/would and shall/should. These types of words are used by predators to obligate the victim to certain actions, trick them into giving permission, control their actions, or transfer responsibility for some action back to the victim. These words get overused by manipulative predators and stand out easily. This variable is usually significant in any detection model of predators in general and sexual predators especially.
SAS code to calculate MODALVERB from MESSAGE inside a PROC SQL statement is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* ModalVerbs */
(COUNT(t1.MESSAGE, "can", 'i') + COUNT(t1.MESSAGE, "could", 'i') + COUNT(t1.MESSAGE, "may", 'i') +
COUNT(t1.MESSAGE, "might", 'i') + COUNT(t1.MESSAGE, "shall", 'i') +
COUNT(t1.MESSAGE, "should", 'i') + COUNT(t1.MESSAGE, "will", 'i') + COUNT(t1.MESSAGE, "would", 'i') +
COUNT(t1.MESSAGE, "must", 'i') + COUNT(t1.MESSAGE, "ought", 'i') + COUNT(t1.MESSAGE, "dare", 'i') +
COUNT(t1.MESSAGE, "need", 'i') + COUNT(t1.MESSAGE, "darest", 'i') + COUNT(t1.MESSAGE, "had better", 'i') +
COUNT(t1.MESSAGE, "used to", 'i')) LABEL="ModalVerbs" AS ModalVerbs
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
Stretch words are words that are elongated for emphasis, typically used by the predator to sound more like a child in their use of words and lingo as a way to bond with the victim. For example, Noooooooooo! Often this variable would need to be combined with some type of Age index or measurement in order to be an indicator of a predator. Children and younger people often use this type of word play for emphasis on short message systems, so unless it is combined with an age indicator, or other variables, it is not an indicator on its own.
The SAS code for calculating the STRETCH_INDEX from MESSAGE is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* Stretch_Index */
(COUNT(t1.MESSAGE, "aaa", "i")+ COUNT(t1.MESSAGE, "bbb", "i")+ COUNT(t1.MESSAGE, "ccc", "i")+
COUNT(t1.MESSAGE, "ddd", "i")+ COUNT(t1.MESSAGE, "eee", "i")+ COUNT(t1.MESSAGE, "fff", "i")+
COUNT(t1.MESSAGE, "ggg", "i")+ COUNT(t1.MESSAGE, "hhh", "i")+ COUNT(t1.MESSAGE, "iii", "i")+
COUNT(t1.MESSAGE, "jjj", "i")+ COUNT(t1.MESSAGE, "kkk", "i")+ COUNT(t1.MESSAGE, "lll", "i")+
COUNT(t1.MESSAGE, "mmm", "i")+ COUNT(t1.MESSAGE, "nnn", "i")+ COUNT(t1.MESSAGE, "ooo", "i")+
COUNT(t1.MESSAGE, "ppp", "i")+ COUNT(t1.MESSAGE, "qqq", "i")+ COUNT(t1.MESSAGE, "rrr", "i")+
COUNT(t1.MESSAGE, "sss", "i")+ COUNT(t1.MESSAGE, "ttt", "i")+ COUNT(t1.MESSAGE, "uuu", "i")+
COUNT(t1.MESSAGE, "vvv", "i")+ COUNT(t1.MESSAGE, "www", "i")+ COUNT(t1.MESSAGE, "xxx", "i")+
COUNT(t1.MESSAGE, "yyy", "i")+ COUNT(t1.MESSAGE, "zzz", "i")+ COUNT(t1.MESSAGE, "111", "i")+
COUNT(t1.MESSAGE, "222", "i")+ COUNT(t1.MESSAGE, "333", "i")+ COUNT(t1.MESSAGE, "444", "i")+
COUNT(t1.MESSAGE, "555", "i")+ COUNT(t1.MESSAGE, "666", "i")+ COUNT(t1.MESSAGE, "777", "i")+
COUNT(t1.MESSAGE, "888", "i")+ COUNT(t1.MESSAGE, "999", "i")+ COUNT(t1.MESSAGE, "000", "i")+
COUNT(t1.MESSAGE, "```", "i")+ COUNT(t1.MESSAGE, "~~~", "i")+ COUNT(t1.MESSAGE, "!!!", "i")+
COUNT(t1.MESSAGE, "@@@", "i")+ COUNT(t1.MESSAGE, "###", "i")+ COUNT(t1.MESSAGE, "$$$", "i")+
/* single quotes keep the macro processor from resolving % and & */
COUNT(t1.MESSAGE, '%%%', "i")+ COUNT(t1.MESSAGE, "^^^", "i")+ COUNT(t1.MESSAGE, '&&&', "i")+
COUNT(t1.MESSAGE, "***", "i")+ COUNT(t1.MESSAGE, "(((", "i")+ COUNT(t1.MESSAGE, ")))", "i")+
COUNT(t1.MESSAGE, "===", "i")+ COUNT(t1.MESSAGE, "+++", "i")+ COUNT(t1.MESSAGE, "\\\", "i")+
COUNT(t1.MESSAGE, "|||", "i")+ COUNT(t1.MESSAGE, "[[[", "i")+ COUNT(t1.MESSAGE, "]]]", "i")+
COUNT(t1.MESSAGE, "}}}", "i")+ COUNT(t1.MESSAGE, "{{{", "i")+ COUNT(t1.MESSAGE, "", "i")+
COUNT(t1.MESSAGE, ";;;", "i")+ COUNT(t1.MESSAGE, ":::", "i")+ COUNT(t1.MESSAGE, "???", "i")+ COUNT(t1.MESSAGE, "///", "i")+
COUNT(t1.MESSAGE, "...", "i")+ COUNT(t1.MESSAGE, ",,,", "i")+ COUNT(t1.MESSAGE, "<<<", "i")+
COUNT(t1.MESSAGE, ">>>", "i")) LABEL="Stretch_Index" AS Stretch_Index
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
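The same idea can be sketched with a single regular expression instead of one COUNT call per character. The Python below is an illustration only, not the author's code, and the name `stretch_index` is hypothetical. It counts non-overlapping runs of three identical characters, ignoring case, which is how the COUNT calls above scan for "aaa", "bbb" and so on. Unlike the fixed list, it covers every non-space character.

```python
import re

# Three identical non-space characters in a row, case-insensitive,
# e.g. "ooo" in "Noooo!" or "!!!" in "wow!!!".
_STRETCH_RUN = re.compile(r"(\S)\1\1", re.IGNORECASE)


def stretch_index(message):
    """Count non-overlapping runs of three identical characters."""
    return sum(1 for _ in _STRETCH_RUN.finditer(message))


if __name__ == "__main__":
    print(stretch_index("N" + "o" * 10 + "!"))
```

Because the score grows with the length of the run, it is usually worth normalizing by message length before comparing users.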
#HASHTAG usage by predators also tends to be high. Similar to Stretch words, this variable also may not be an indicator by itself and should be used with other variables or with an Age indicator variable.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* HASHTAG */
(IFN(t1.MESSAGE CONTAINS "#", 1,0) ) LABEL="HASHTAG" AS HASHTAG
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
Negation of words is used to sound more positive and is a common word trick used by all types of online predators, not just sexual predators. This variable is typically used in combination with other variables.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* NEGATION */
(IFN(INDEXW(t1.MESSAGE, "not")>=1, 1, 0) ) LABEL="NEGATION" AS NEGATION
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
The Affect word score of a message post, Message thread or a Message conversation is broken out along the dimensions of Pleasantness, Activation, Imagery, and total Affect. These dimensions can be obtained by scoring each word in a message post, summing the respective scores and using them as variables for comparing predators to other message posters. Because sexual predators exhibit emotional immaturity, both the average scores and the variation of the scores can be used to distinguish them from a normal user. In SAS code a PROC SQL can join the corpus to the text entry for scoring purposes. This can take a long time depending on the size of the data-sets.
The use of this material may require a fee for copyrighted material.
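The scoring step described above can be sketched as a lookup and sum. This Python sketch is an assumption-laden illustration: the lexicon entries and their numbers are invented placeholders, not values from any published affect dictionary, and the function names are hypothetical. It sums each dimension per message, then takes the mean and variance of one dimension across a user's messages.

```python
from statistics import mean, pvariance

# Placeholder lexicon: word -> (pleasantness, activation, imagery).
# These numbers are invented for illustration only.
AFFECT = {
    "happy": (3.0, 2.0, 1.0),
    "run": (2.0, 3.0, 2.0),
    "sad": (1.0, 1.0, 1.0),
}


def score_message(message):
    """Sum each affect dimension over the words found in the lexicon."""
    totals = [0.0, 0.0, 0.0]
    for word in message.lower().split():
        for i, value in enumerate(AFFECT.get(word, (0.0, 0.0, 0.0))):
            totals[i] += value
    return tuple(totals)


def user_profile(messages):
    """Mean and variance of per-message pleasantness for one user."""
    pleasantness = [score_message(m)[0] for m in messages]
    return mean(pleasantness), pvariance(pleasantness)
```

A dictionary lookup like this avoids the large join, which may help with the run time noted above.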
During the initial approach phase, a sexual predator will use non-explicit words so they don’t alarm their victim. They want the child to first get used to talking to them, but once that happens they want to start a process called “Desensitization”. This is the process where the predator conditions the victim to the idea of having sexual relations by getting them to accept conversation that uses sexually explicit words.
The predator will start by working a few explicit words into the conversation. If the victim objects or “calls them out” on the use of those words, the predator will usually respond in one of three ways: 1. apologize, 2. chastise them for being “a baby” or “a little kid”, or 3. challenge them to “grow up”. If the apology option is used, the predator will continue to use explicit words and just apologize each time until the victim stops challenging them on the word usage. In any of these 3 scenarios, the predator will ramp up the use of desensitizing words as the conversation progresses.
If apology words [7] are used with the desensitizing words, they will ramp up initially with them, but at some point will fall off after the victim stops any challenges. These two word types can form a clear pattern that can also be detected as a signature of a sexual predator. This pattern is illustrated in the graph below.
FIGURE 3: Desensitising and Apology Word use pattern.
There can be a lot of explicit words used for desensitizing victims, along with slang variations of those words. Desensitization may also take place across racial or gender boundaries, and explicit racial or gender slurs or compliments may be used as well, all depending on the race and preferences of the sexual predator. It may be necessary to add or delete racial or gender explicit words from the corpus depending on the social media site subject in order to deal with these types of words properly.
It is important to note that in an early regression model with an accuracy of 91%, desensitizing words were not a statistically significant variable and were dropped from the model. The use of other word patterns was likely influenced by the presence of explicit desensitizing words, but DESENSWORD itself was not used in the model to detect predators.
The SAS code for the DESENSWORD index, a count of words in a post, is in Appendix C1. It uses the Explicit Word Corpus in Appendix B.
Apology words may be used along with explicit words as part of a desensitisation process by sexual predators, but are seldom used by psychological predators (aka Trolls) except in a mocking manner. Apology words fall into a number of categories: apology words, apology antonyms, acknowledgments, amends, defense, excuse, justification, and parody words. All of these get used as various forms of apology. Counts or percentage use of these words form the basis for measurement variables. A full corpus of these words is in Appendix C [7] and the SAS code is in Appendix C2.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CHATLOG_0001 AS
SELECT t1.F1,
t1.NAME,
t1.DATE,
t1.TIME,
t1.MESSAGE,
/* STOPWORDS */
(COUNT(t1.MESSAGE, "a", "i")+ COUNT(t1.MESSAGE, "able", "i")+ COUNT(t1.MESSAGE, "about", "i")+
COUNT(t1.MESSAGE, "across", "i")+ COUNT(t1.MESSAGE, "after", "i")+ COUNT(t1.MESSAGE, "all", "i")+
COUNT(t1.MESSAGE, "almost", "i")+ COUNT(t1.MESSAGE, "also", "i")+ COUNT(t1.MESSAGE, "am", "i")+
COUNT(t1.MESSAGE, "among", "i")+ COUNT(t1.MESSAGE, "an", "i")+ COUNT(t1.MESSAGE, "and", "i")+
COUNT(t1.MESSAGE, "any", "i")+ COUNT(t1.MESSAGE, "are", "i")+ COUNT(t1.MESSAGE, "as", "i")+
COUNT(t1.MESSAGE, "at", "i")+ COUNT(t1.MESSAGE, "be", "i")+ COUNT(t1.MESSAGE, "because", "i")+
COUNT(t1.MESSAGE, "been", "i")+ COUNT(t1.MESSAGE, "but", "i")+ COUNT(t1.MESSAGE, "by", "i")+
COUNT(t1.MESSAGE, "can", "i")+ COUNT(t1.MESSAGE, "cannot", "i")+ COUNT(t1.MESSAGE, "could", "i")+
COUNT(t1.MESSAGE, "dear", "i")+ COUNT(t1.MESSAGE, "did", "i")+ COUNT(t1.MESSAGE, "do", "i")+
COUNT(t1.MESSAGE, "does", "i")+ COUNT(t1.MESSAGE, "either", "i")+ COUNT(t1.MESSAGE, "else", "i")+
COUNT(t1.MESSAGE, "ever", "i")+ COUNT(t1.MESSAGE, "every", "i")+ COUNT(t1.MESSAGE, "for", "i")+
COUNT(t1.MESSAGE, "from", "i")+ COUNT(t1.MESSAGE, "get", "i")+ COUNT(t1.MESSAGE, "got", "i")+
COUNT(t1.MESSAGE, "had", "i")+ COUNT(t1.MESSAGE, "has", "i")+ COUNT(t1.MESSAGE, "have", "i")+
COUNT(t1.MESSAGE, "he", "i")+ COUNT(t1.MESSAGE, "her", "i")+ COUNT(t1.MESSAGE, "hers", "i")+
COUNT(t1.MESSAGE, "him", "i")+ COUNT(t1.MESSAGE, "his", "i")+ COUNT(t1.MESSAGE, "how", "i")+
COUNT(t1.MESSAGE, "however", "i")+ COUNT(t1.MESSAGE, "i", "i")+ COUNT(t1.MESSAGE, "if", "i")+
COUNT(t1.MESSAGE, "in", "i")+ COUNT(t1.MESSAGE, "into", "i")+ COUNT(t1.MESSAGE, "is", "i")+
COUNT(t1.MESSAGE, "it", "i")+ COUNT(t1.MESSAGE, "its", "i")+
COUNT(t1.MESSAGE, "just", "i")+
COUNT(t1.MESSAGE, "least", "i")+ COUNT(t1.MESSAGE, "let", "i")+ COUNT(t1.MESSAGE, "like", "i")+
COUNT(t1.MESSAGE, "likely", "i")+ COUNT(t1.MESSAGE, "may", "i")+ COUNT(t1.MESSAGE, "me", "i")+
COUNT(t1.MESSAGE, "might", "i")+ COUNT(t1.MESSAGE, "most", "i")+ COUNT(t1.MESSAGE, "must", "i")+
COUNT(t1.MESSAGE, "my", "i")+ COUNT(t1.MESSAGE, "neither", "i")+ COUNT(t1.MESSAGE, "no", "i")+
COUNT(t1.MESSAGE, "nor", "i")+ COUNT(t1.MESSAGE, "not", "i")+ COUNT(t1.MESSAGE, "of", "i")+
COUNT(t1.MESSAGE, "off", "i")+ COUNT(t1.MESSAGE, "often", "i")+ COUNT(t1.MESSAGE, "on", "i")+
COUNT(t1.MESSAGE, "only", "i")+ COUNT(t1.MESSAGE, "or", "i")+ COUNT(t1.MESSAGE, "other", "i")+
COUNT(t1.MESSAGE, "our", "i")+ COUNT(t1.MESSAGE, "own", "i")+ COUNT(t1.MESSAGE, "rather", "i")+
COUNT(t1.MESSAGE, "said", "i")+ COUNT(t1.MESSAGE, "say", "i")+ COUNT(t1.MESSAGE, "says", "i")+
COUNT(t1.MESSAGE, "she", "i")+ COUNT(t1.MESSAGE, "should", "i")+ COUNT(t1.MESSAGE, "since", "i")+
COUNT(t1.MESSAGE, "so", "i")+ COUNT(t1.MESSAGE, "some", "i")+ COUNT(t1.MESSAGE, "than", "i")+
COUNT(t1.MESSAGE, "that", "i")+ COUNT(t1.MESSAGE, "the", "i")+ COUNT(t1.MESSAGE, "their", "i")+
COUNT(t1.MESSAGE, "them", "i")+ COUNT(t1.MESSAGE, "then", "i")+ COUNT(t1.MESSAGE, "there", "i")+
COUNT(t1.MESSAGE, "these", "i")+ COUNT(t1.MESSAGE, "they", "i")+ COUNT(t1.MESSAGE, "this", "i")+
COUNT(t1.MESSAGE, "tis", "i")+ COUNT(t1.MESSAGE, "to", "i")+ COUNT(t1.MESSAGE, "too", "i")+
COUNT(t1.MESSAGE, "twas", "i")+ COUNT(t1.MESSAGE, "us", "i")+ COUNT(t1.MESSAGE, "wants", "i")+
COUNT(t1.MESSAGE, "was", "i")+ COUNT(t1.MESSAGE, "we", "i")+ COUNT(t1.MESSAGE, "were", "i")+
COUNT(t1.MESSAGE, "what", "i")+ COUNT(t1.MESSAGE, "when", "i")+ COUNT(t1.MESSAGE, "where", "i")+
COUNT(t1.MESSAGE, "which", "i")+ COUNT(t1.MESSAGE, "while", "i")+ COUNT(t1.MESSAGE, "who", "i")+
COUNT(t1.MESSAGE, "whom", "i")+ COUNT(t1.MESSAGE, "why", "i")+ COUNT(t1.MESSAGE, "will", "i")+
COUNT(t1.MESSAGE, "with", "i")+ COUNT(t1.MESSAGE, "would", "i")+ COUNT(t1.MESSAGE, "yet", "i")+
COUNT(t1.MESSAGE, "you", "i")+ COUNT(t1.MESSAGE, "your", "i")) LABEL="STOPWORDS" AS STOPWORDS
FROM WORK.QUERY_FOR_CHATLOG_0000 t1;
QUIT;
The actual list of English stop words is below.
a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your
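One point to watch when counting these words: a substring count such as COUNT(t1.MESSAGE, "a", "i") also matches the "a" inside "cat", so short stop words can inflate the total. The following Python sketch shows a whole-word variant. It is an illustration under assumptions, not the author's code: the name `stopword_count` is hypothetical and only a few entries from the list above are included.

```python
import re

# A few entries from the stop word list above; extend with the full list.
STOP_WORDS = {"a", "the", "is", "on", "and", "to", "you"}


def stopword_count(message):
    """Count tokens that are stop words, matching whole words only."""
    tokens = re.findall(r"[a-z']+", message.lower())
    return sum(1 for token in tokens if token in STOP_WORDS)
```

For "The cat is on a mat" this counts the, is, on and a, and ignores the letter a inside "cat" and "mat".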
Child sexual predators overuse emoticons in their attempts to speak like a child and bond with the victim. Emoticon usage typically decreases with the increased age of the online poster.
This is also a form of forensic countermeasure. It is fairly easy to go on the internet and find lists of emoticons with conflicting meanings, so if the online predator uses only posts with emoticons, a defense attorney can challenge the meaning of the post in court. Some of the emoticon meanings are context sensitive and may only have meaning because of the message threads in posts before the emoticon. The usage of lesser known emoticons that have dual meanings, combined with “null” posts that put some distance between the previous post and the emoticon, can function as a forensic countermeasure.
Because emoticon usage declines with age, all of the following indicators can function as variables to detect predators.
The 20 most popular emoticons represent 91% of all the emoticons used, and the top 40 represent 99% of the usage. There are currently over 2200 documented emoticons in use around the world. About 400 of these contain Asian language characters, so they may or may not render correctly in English. That still leaves over 1700 rare emoticons that most people will never see.
During the grooming phase, one of the things that a sexual predator wants to do is appear interesting to a child. One way to do that is to use some obscure emoticons to appear “smart” or “worldly” to the child. By using some rare emoticons that the child has to look up or ask about, the predator increases interest and interaction and furthers the grooming process. The additional benefit to the predator is that many of these rare emoticons may have multiple meanings, and can function as a forensic countermeasure. Predators will often use obscure emoticons along with “null” posts as a distancing tactic to create forensic countermeasures with explicit emoticons and text posts.
We can create an EMO_OBSCURITY score by taking a list of all 1400+ emoticons, counting their frequency of occurrence, converting that to a percent / proportion of total emoticon use and taking the inverse of that value. This creates a power law rating of emoticon rarity, or obscurity, based on actual usage, which we can then use to score message posts. By summing or averaging these EMO_OBSCURITY scores by user, we create a metric with which we can detect this grooming behaviour.
EMO_OBSCURITY = 1 / P[Emoticon X | Emoticon ALL]
This score will range from about 2.5 for the most popular emoticons to 94 million for the least used emoticons. Because these emoticons follow a power law distribution of usage, the score can become a very strong indicator of predatory behaviour, especially when used in combination with NULL_POST variables. It is important to note that not all sexual predators use (or are smart enough to use) this approach to grooming or as a countermeasure.
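The EMO_OBSCURITY calculation can be sketched in a few lines. This Python sketch is an illustration only: the function names are hypothetical and the input is assumed to be a flat list of every emoticon observed in a reference sample. A score of 2.5 corresponds to an emoticon that makes up 40% of all use, since 1 / 0.4 = 2.5.

```python
from collections import Counter


def emo_obscurity(observed):
    """Map each emoticon to 1 / (its share of all emoticon uses)."""
    counts = Counter(observed)
    total = sum(counts.values())
    return {emoticon: total / n for emoticon, n in counts.items()}


def user_obscurity(posted, scores):
    """Average obscurity of the scored emoticons one user posted."""
    known = [scores[e] for e in posted if e in scores]
    return sum(known) / len(known) if known else 0.0
```

Emoticons missing from the reference sample are skipped here. Treating them as maximally obscure instead is another reasonable choice.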
One common form of forensic countermeasure is a Chiral or left-handed emoticon. Many IM services will render the text of an emoticon into a GIF image or an animated picture. The author knows of no IM service that renders a left-handed version into a picture. These left-handed versions remain as text, but typically cannot be detected or blocked as being obscene. The emoticon for Oral Sex :-* becomes *-:, which conveys the same meaning but will get past ISP blocking, avoids most automatic detection and also becomes a legal forensic countermeasure during court testimony. A count or normalized count of chiral emoticons can be a valuable variable in detecting predatory behaviour.
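A chiral count can be sketched by mirroring a list of known emoticons and searching for the mirrored forms. The Python below is a minimal sketch under assumptions: the function names are hypothetical, and mirroring is taken to mean reversing the text and flipping brackets and slashes.

```python
# Characters that flip when an emoticon is written right to left.
_FLIP = str.maketrans("()[]{}<>/\\", ")(][}{><\\/")


def mirror(emoticon):
    """Return the left-handed (chiral) form of an emoticon."""
    return emoticon[::-1].translate(_FLIP)


def chiral_count(message, known):
    """Count left-handed versions of known emoticons in a message."""
    variants = {mirror(e) for e in known if mirror(e) != e}
    return sum(message.count(v) for v in variants)
```

Symmetric emoticons are skipped because their mirrored form is identical to the original.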
Null posts are where a user enters a carriage return, a space, a period (.), a ? or some other placeholder character. In general messaging lingo this means “waiting” or functions as an “are you there?” ping to the other user. Not all IM services support a carriage return as a null post, so the user hits the spacebar and then enter. If the other user is focussing on another browser window, this action will flash the IM icon at the bottom of the screen to get that person’s attention.
It is characteristic of predators to be patient if they are making progress and impatient if they are not.
Characters such as k or K, which are short for “OK”, do not count toward a NULL_POST index. Other excludable characters include numbers (0-9), f or F, which is short for “fuck” (this can be counted in an explicit word index), and u or U, which is short for “you” and is often used to turn the same question or inquiry back on the other user.
Predator: I am going out tonight.
Predator: u
Victim: Staying home.
Single letters “o” or “O” which mean “Oh” and function as a response statement.
Predator: I am going out tonight.
Victim: o
Single letter “y” or “Y” which means “why” and are a response prompt question.
Victim: I am going out tonight.
Predator: y
The ? symbol may or may not count depending on context. It can also mean a “why” question, but can be a “wait” marker as well.
Victim: I am going out tonight.
Predator: ?
or usage as
Predator: Are you there?
Predator: ?
Predator: ?
Predator: ?
Predator: ?
Predator: ?
Predator: ?
Because not all predators use these posts, and because of these variations, single character or Null Character posts can be difficult to process. Some require the context of related posts in order to interpret them. Some IM services have active spell check and will not post “Null” or space-only posts because they do not appear in a dictionary. These types of services would exclude this type of behavior.
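The rules above can be sketched as a small classifier. This Python sketch is an illustration under stated assumptions, not the author's code. The function names are hypothetical, single letters and digits are treated as meaningful (k, u, o, y and so on), and a lone ? is assumed to be a ping only when the user is following up their own previous post.

```python
def is_null_post(text, follows_own_post=False):
    """Return True when a post is only a placeholder or ping."""
    stripped = text.strip()
    if stripped == "":
        return True  # carriage return or spaces only
    if len(stripped) > 1 or stripped.isalnum():
        return False  # real text, or a single letter/digit with a meaning
    if stripped == "?":
        # Assumption: "?" is a ping only when it follows the user's own post.
        return follows_own_post
    return True  # "." or another single placeholder character


def null_post_count(posts, author):
    """Count null posts by one author in a list of (name, text) pairs."""
    count = 0
    previous = None
    for name, text in posts:
        if name == author and is_null_post(text, previous == name):
            count += 1
        previous = name
    return count
```

The context rule for ? is the simplest one that separates the two examples above. A production rule would likely need more of the surrounding thread.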
There are a number of Sentiment models based on emotion that can be used in text mining. In this paper we are going to discuss three of those models that are both popular and generally free to use. The main theory behind these models is that different words or emoticons have emotional power and some emotional dimension that relates to the speaker's (actor's) emotional state. By converting the words to numbers or scores on the different emotional scales we can quantify the emotional state of a speaker, and in turn use those scores as variables to detect predatory behaviour. These models have found use in many sentiment scoring applications, from measuring whether customers are happy with a product to detecting emotional stress or depression, in addition to predation.
I will give a brief overview of three emotional models and their use. These models are:
1. the Ekman Emotional Model,
2. SentiWordNet from ISTI-CNR,
3. the Plutchik emotional model.
Along with these models I will also cover the Word Affect scoring model, which is more of a word influence model that also has use in detecting predators who are trying to groom a victim. The reader is cautioned that some of these sentiment tools may be closed source or copyrighted and may require a fee to use.
The Ekman model is based on the work of Paul Ekman and includes the idea of people having 6 basic emotions. These are Anger, Fear, Disgust, Happiness, Sadness and Surprise. Each word or Emoticon is linked to a primary emotion and each word has a polarity and strength. Anger, Fear, Disgust and Sadness have a negative polarity and Happiness and Surprise are positive. The strength scores go either from 0 to 1 or 0 to -1 based on polarity. Some words or emoticons can be neutral and have a score of 0.
Each message can be scanned for words or emoticons and converted to a basic emotion and summed in net strength for a sentiment score. Hence each message post will have six scores and can function as six variables in a statistical model for detecting predators.
The corpus file EMO_Ekman has the emoticon list with the Ekman emotional scores included.
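The six-score idea can be sketched as follows. This Python sketch is an illustration only: the lexicon entries and strengths are invented placeholders, not values from the EMO_Ekman corpus file, and the function name is hypothetical. Each token maps to one primary emotion with a signed strength, and the strengths are summed per emotion.

```python
EMOTIONS = ("anger", "fear", "disgust", "happiness", "sadness", "surprise")

# Placeholder entries: token -> (primary emotion, signed strength).
# Negative emotions carry strengths between 0 and -1.
LEXICON = {
    "hate": ("anger", -0.8),
    "scared": ("fear", -0.6),
    ":)": ("happiness", 0.5),
    "wow": ("surprise", 0.4),
}


def ekman_scores(message):
    """Return one net strength per Ekman emotion for a message."""
    scores = dict.fromkeys(EMOTIONS, 0.0)
    for token in message.lower().split():
        if token in LEXICON:
            emotion, strength = LEXICON[token]
            scores[emotion] += strength
    return scores
```

The six values in the returned dictionary can be used directly as six model variables per message post.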
Sentiment word scoring works off the idea that all words can be used to convey a sentiment. Sentiment has several dimensions, the first being the Positiveness or Negativeness of the word and the second the Objectiveness (or Subjectiveness) of the word. It is possible for a word to convey both positive and negative sentiment at the same time. The SentiWordNet scale therefore becomes a 3-axis scale. The positivity score and the negativity score are the first two axes, and their sum gives a net polarity score. The Objectivity score is the third axis, which is a measure on the Subjectivity / Objectivity scale (SO-polarity). These scores support three tasks:
1. Determining text SO-polarity, as in deciding whether a given text has a factual nature or expresses an opinion on its subject matter. This amounts to performing binary text categorization under the categories Subjective and Objective, and is accomplished with the Objective score of a word.
2. Determining text PN-polarity, as in deciding if a given Subjective text expresses a Positive or a Negative opinion on its subject matter. This uses the PosScore and NegScore values such that -1 <= (PosScore + NegScore) <= 1, which assumes the negativity score is recorded as a negative number. Positivity and negativity can also both be scored as yes=1, no=0 attributes for scoring messages. This is less sensitive, but computationally easier.
3. Determining the strength of text PN-polarity, as in deciding e.g. whether the Positive opinion expressed by a text on its subject matter is Weakly Positive, Mildly Positive, or Strongly Positive. Words or emoticons can convey both positive and negative sentiment depending on usage.
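The three scores can be sketched as below. This Python sketch is an illustration under assumptions: the word scores are invented placeholders, the function names are hypothetical, and both PosScore and NegScore are stored as values between 0 and 1. The net polarity is therefore PosScore minus NegScore, which is the same quantity as the sum when the negativity score is recorded as a negative number. The objectivity score is taken as 1 minus the sum of the other two.

```python
# Placeholder scores: word -> (PosScore, NegScore), each between 0 and 1.
SCORES = {"good": (0.75, 0.0), "bad": (0.0, 0.625), "odd": (0.125, 0.25)}


def word_polarity(word):
    """Return (net polarity, objectivity) for one word."""
    pos, neg = SCORES.get(word, (0.0, 0.0))
    objectivity = 1.0 - (pos + neg)
    return pos - neg, objectivity


def message_polarity(message):
    """Average net polarity over the scored words in a message."""
    nets = [word_polarity(w)[0] for w in message.lower().split() if w in SCORES]
    return sum(nets) / len(nets) if nets else 0.0
```

Words that are not in the lexicon come back as fully objective with a net polarity of zero.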
Robert Plutchik's psychoevolutionary theory of emotion is one of the most influential classification approaches for general emotional responses. He considered there to be eight primary emotions—anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. Plutchik proposed that these 'basic' emotions are biologically primitive and have evolved in order to increase the reproductive fitness of the animal. Plutchik argues for the primacy of these emotions by showing each to be the trigger of behaviour with high survival value, such as the way fear inspires the fight-or-flight response.
Plutchik's psychoevolutionary theory of basic emotions has ten postulates.
Interrogatives form the basis for questions within the English language and are used by people to request and gather information from others during a conversation. For sexual predators, this is vital for Step 1 of the Attack Sequence, and failure here stops the entire process. The list includes the standard ones: who, what, when, where, why, whom, which and how. With the advent of internet slang, a larger list of words, abbreviations and symbols that function as interrogatives has entered the vernacular. These include ?, dig, 3rd degree, 3rd, third degree, fish, grill, hammer, pimp, pump, d, D. The d and D are short for “Details” and get used as follows:
VICTIM: I am going out with a friend tonight.
PREDATOR: d
VICTIM: I am going out with a friend tonight.
PREDATOR: D
Words like Fish and Hammer can also have Emoticon substitutes which may also be added to the list at the programmer's discretion.
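An interrogative count built from the list above can be sketched as follows. This Python sketch is an illustration only and the function name is hypothetical. Two-word phrases are counted first and removed, so that "3rd degree" is not counted a second time as "3rd".

```python
import re

SINGLE_WORDS = {"who", "what", "when", "where", "why", "whom", "which", "how",
                "dig", "3rd", "fish", "grill", "hammer", "pimp", "pump", "d"}
PHRASES = ("3rd degree", "third degree")


def interrogative_count(message):
    """Count interrogative words, phrases and question marks in a post."""
    text = message.lower()
    count = text.count("?")
    for phrase in PHRASES:
        count += text.count(phrase)
        text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z0-9]+", text)
    return count + sum(1 for token in tokens if token in SINGLE_WORDS)
```

Slang entries such as fish or hammer will also match their literal meanings, so this count is best used alongside other variables.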
For this development, the author was trying to detect predators on a NASCAR fan site sponsored by a company. The site attracted a good percentage of underage fans, but also experienced an above average rate of explicit word use by the typical, non-predatory fan. This represents a higher than normal rate of noise for the detection model. Any variable that works by detecting a predator's usage of explicit words as being higher than normal would be reduced in effectiveness (power).
Nearly all statistical models will fall into one of 3 categories: 1. regression models, 2. neural networks or 3. decision trees. The author used a regression model for the initial attempts in detecting predators. The basic steps used to develop the model are listed below.
In the model development process, only a handful of the 30+ variables were determined to be statistically significant. The author developed several regression models to characterize the data and created a 5 dimensional surface equation that divided the user conversations into predator and non-predator categories. Several equations were developed and I chose the one that provided the best correct classification rate. The model is used in the manner of a Support Vector Machine (SVM), in that the equation divides the data instead of characterizing it. The model and SAS code that was used is below.
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_STATS_WORD_COUNTS_0000 AS
SELECT t1.NAME,
t1.SOURCE1,
t1.Target,
/* PREDICTION YES=1, or Predator=1*/
(IFN(ROUND((-0.557) + (29.865 * t1.N_MODALVERBS) + (2.336 * t1.N_PERPRONOUN) +( -23.643 * t1.N_FAMILY) + (
t1.N_PERPRONOUN - 0.204) * (( t1.N_REFPRONOUN - 0.001) * 869.195) + ( t1.N_PERPRONOUN -0.204) * ((
t1.N_STRETCHWORD - 0.0079) * 179.58),1)<0.5,0,1)
) LABEL="PREDICTION" AS PREDICTION
FROM WORK.QUERY_FOR_STATS_WORD_COUNTS t1;
QUIT;
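For readers who want to check the arithmetic, the equation in the query above can be restated outside SAS. This Python sketch copies the coefficients from the code and uses hypothetical function and argument names. ROUND(x, 1) rounds to the nearest integer, so as far as I can tell the test ROUND(score, 1) < 0.5 reduces to score < 0.5, and the sketch uses that simpler form.

```python
def predator_score(modal, perpron, family, refpron, stretch):
    """Evaluate the surface equation from the PROC SQL step."""
    centered = perpron - 0.204  # N_PERPRONOUN centered, as in the SAS code
    return (-0.557
            + 29.865 * modal
            + 2.336 * perpron
            - 23.643 * family
            + centered * (refpron - 0.001) * 869.195
            + centered * (stretch - 0.0079) * 179.58)


def predict(modal, perpron, family, refpron, stretch):
    """Return 1 (predator) when the score reaches 0.5, otherwise 0."""
    score = predator_score(modal, perpron, family, refpron, stretch)
    return 1 if score >= 0.5 else 0
```

When N_PERPRONOUN equals 0.204 both interaction terms vanish, which makes a convenient spot check against the SAS output.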
/* The following code outputs a filtered dataset where the training data set targets=1 or predator and the Prediction values =1 or predator so Type 1 and Type 2 error rates can be estimated */
PROC SQL;
CREATE TABLE WORK.FILTER_FOR_QUERY_FOR_STATS_WORD_ AS
SELECT t1.NAME,
t1.SOURCE1,
t1.Target,
t1.PREDICTION,
t1.PREDICTION2
FROM WORK.QUERY_FOR_STATS_WORD_COUNTS_0000 t1
WHERE t1.PREDICTION = 1 OR t1.Target = 1;
QUIT;
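The error rate estimate can be sketched as below. This Python sketch is an illustration with a hypothetical function name. It assumes the input is every (Target, PREDICTION) pair from the full scored table. The filtered table above holds only rows where either value is 1, so it contains no true negatives, and the false alarm rate needs the total count of non-predators from the unfiltered data.

```python
def error_rates(rows):
    """Return (Type 1 rate, Type 2 rate) from (target, prediction) pairs."""
    rows = list(rows)
    negatives = sum(1 for target, _ in rows if target == 0)
    positives = sum(1 for target, _ in rows if target == 1)
    false_pos = sum(1 for target, pred in rows if target == 0 and pred == 1)
    false_neg = sum(1 for target, pred in rows if target == 1 and pred == 0)
    type1 = false_pos / negatives if negatives else None
    type2 = false_neg / positives if positives else None
    return type1, type2
```

A rate comes back as None when the data contain no cases of that class.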
The final model uses only the following variables.
Variables like Explicit Words or Reflexive Personal Pronouns dropped out of the model. It is very likely that a sexual predator's use of explicit words affected these variables even if the direct usage of those variables did not happen in this model. A number of other variables did not make the cutoff for being statistically significant, but the author suspects that on a website where the average usage of explicit words is lower, these variables would become stronger indicators and likely get included in a model. FIGURE 4 below has some regression graphs of some of the key variables in the model.
FIGURE 4: Regression graphs of some key variables of the author's model.
The final end product desired from this system was a daily output report to the company security team that showed the username, the predator prediction and which social media feed the user was on, so that security personnel could do a manual review of the posts and then block anyone confirmed as a likely predator. An example of the output table is below along with the code that generates it.
FIGURE 5: Sample daily output report for the security team.
PROC TABULATE
DATA=WORK.FILTER_FOR_QUERY_FOR_STATS_WORD_;
VAR PREDICTION PREDICTION2;
CLASS NAME / ORDER=UNFORMATTED MISSING;
CLASS SOURCE1 / ORDER=UNFORMATTED MISSING;
CLASS Target / ORDER=UNFORMATTED MISSING;
TABLE /* Row Dimension */
SOURCE1*
NAME*
N,
/* Column Dimension */
PREDICTION
PREDICTION2 ;
;
RUN; QUIT;
TITLE; FOOTNOTE;
FORTRAN 90 Code (Text file)
WORD CORPUS LINKS
DICTIONARY OF AFFECT($$ CLOSED SOURCE)
PERSONAL INFORMATION WORD CORPUS
BoydEOwens at gmail dot com