Predicting Vulnerable Software Components via Text Mining

To enable the replication of our study, this page provides additional resources for our paper "Predicting Vulnerable Software Components via Text Mining" by Riccardo Scandariato, Aram Hovsepyan, James Walden and Wouter Joosen.

Feature extraction from Java files

Each Java file is treated as plain text and split into tokens with the following R function:

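tokenize <- function(text) {
    # Delimiters: any whitespace character, or one of the listed punctuation and operator characters
    delimiters <- '[[:space:]]|[]"\'.,;:()?!{}[<>+=~*&^%/|\\-]'

    tokens <- unlist(strsplit(text, delimiters)) # split the text at every delimiter
    tokens <- tokens[tokens != ""] # remove the empty strings left between adjacent delimiters

    return(tokens)
}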
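
As an illustration of how this function might be used, the following sketch tokenizes a small Java snippet and counts how often each token occurs. The snippet and the use of table() to build the counts are assumptions made for this example only; the features we actually extracted are the ones provided in the archive under "Features and vulnerabilities".

```r
# Same tokenizer as above, repeated so that the example is self-contained.
tokenize <- function(text) {
    delimiters <- '[[:space:]]|[]"\'.,;:()?!{}[<>+=~*&^%/|\\-]'
    tokens <- unlist(strsplit(text, delimiters))
    tokens[tokens != ""]
}

# Hypothetical Java source lines (not taken from any analyzed application).
java_lines <- c(
    "if (user != null) {",
    "    log(user);",
    "}"
)

tokens <- tokenize(java_lines)
counts <- table(tokens)  # frequency of each distinct token
print(counts)
```

Note that characters such as the underscore and digits are not delimiters, so identifiers like my_var2 stay in one piece.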

The applications often bundle the source code of external libraries, which would normally be packaged as a JAR file instead. To exclude this third-party code, we analyzed only the Java files under the following namespaces:

Application     Namespace
Anki-Android    com/ichi2
Boardgamegeek   com/boardgamegeek
Connectbot      org/connectbot
CoolReader      org/coolreader
Crosswords      org/eehouse/android/xw4
FBReaderJ       org/geometerplus/fbreader, org/geometerplus/android/fbreader
K9              com/fsck/k9
KeePassAndroid  com/keepassdroid
Mileagetracker  com/switkows
Mustard         org/mustard

Application     Namespace
Browser         com/android/browser
Calendar        com/android/calendar
Camera          com/android/camera
Contacts        com/android/contacts
DeskClock       com/android/deskclock
Dialer          com/android/dialer
Email           com/android
Gallery2        com/android
Mms             com/android/mms
QuickSearchBox  com/android/quicksearchbox
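
The namespace restriction above can be sketched as a simple path filter. This is only a minimal illustration, not the script used in the study: the function name in_namespace and the example paths are hypothetical, and the sketch assumes that each namespace appears as a directory inside the file path.

```r
# Keep only the .java paths that lie under one of the given namespace directories.
# `paths` and `namespaces` are character vectors; matching is literal (no regex).
in_namespace <- function(paths, namespaces) {
    keep <- rep(FALSE, length(paths))
    for (ns in namespaces) {
        # Require the namespace to be followed by "/" so it is matched as a directory.
        keep <- keep | grepl(paste0(ns, "/"), paths, fixed = TRUE)
    }
    paths[keep & grepl("\\.java$", paths)]
}

# Hypothetical file paths, used here with the two FBReaderJ namespaces.
paths <- c(
    "src/org/geometerplus/fbreader/Example.java",
    "src/org/geometerplus/android/fbreader/Example.java",
    "src/org/apache/commons/Util.java"
)
namespaces <- c("org/geometerplus/fbreader", "org/geometerplus/android/fbreader")

selected <- in_namespace(paths, namespaces)
print(selected)
```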

Analyzed versions

  1. AnkiDroid: Mar'10 (0.4.0), Aug'10 (0.4.1), Sep'10 (0.4.2), Feb'11 (0.4a1), Mar'11 (0.4b1), Apr'11 (0.6), Aug'11 (0.7), Nov'11 (1.0)
  2. BoardGameGeek: Jul'10 (2.3), Aug'10 (2.4), Apr'11 (3.0rc), May'11 (3.0), Jun'11 (3.2), Sep (3.3), Oct (3.4)
  3. ConnectBot: Sep'10 (1.7.0), Oct'10 (1.7.1), Nov'10 (1.7.1), Dec'10 (1.7.1), Feb'11 (1.7.1), May'11 (1.7.1), Jul'11 (1.7.1), Aug'11 (1.7.1), Sep'11 (1.7.1), Oct'11 (1.7.1), Dec'11 (1.7.1)
  4. CoolReader: Nov'10 (0.36), Dec'10 (0.36-32), Jan'11 (0.39-33), Feb'11 (0.43-4), Apr'11 (0.45-2), May'11 (0.45-12), Jun'11 (0.46-1), Jul'11 (0.48-1), Aug'11 (0.49-7), Sep'11 (0.51-3), Oct'11 (0.51-18), Nov'11 (0.51-29), Dec'11 (0.53-16)
  5. Crosswords: Jul'10 (11), Aug'10 (14), Sep'10 (16), Nov'10 (17), Dec'10 (20), Jan'11 (21), Mar'11 (22), Apr'11 (25), May'11 (26), Jun'11 (28), Jul'11 (29), Aug'11 (31), Sep'11 (34), Oct'11 (35), Nov'11 (38), Dec'11 (39)
  6. FBReader: Nov'10 (0.7.14), Dec'10 (0.7.17), Jan'11 (0.99.0), Feb'11 (0.99.12), Mar'11 (0.99.12), Apr'11 (1.0.0), May'11 (1.0.10), Jun'11 (1.1.0), Jul'11 (1.1.2a), Aug'11 (1.1.2b), Oct'11 (1.2.0), Nov'11 (1.2.3), Dec'11 (1.2.4)
  7. K9Mail: Jan'10 (2.308), Feb'10 (2.504), Mar'10 (2.511), Apr'10 (2.590), May'10 (2.600), Jun'10 (2.710), Jul'10 (2.803), Aug'10 (2.912), Sep'10 (3.003), Oct'10 (3.112), Nov'10 (3.206), Dec'10 (3.320), Jan'11 (3.504), Feb'11 (3.596), Mar'11 (3.604), Apr'11 (3.709), May'11 (3.900), Jun'11 (3.901), Jul'11 (3.902), Aug'11 (3.902), Sep'11 (3.907), Oct'11 (3.908), Nov'11 (3.910), Dec'11 (3.991)
  8. KeePassAndroid: Sep'10 (1.7.2), Oct'10 (1.8.4), Nov'10 (1.8.5), Dec'10 (1.8.6.0.1), Jan'11 (1.8.6.4), Feb'11 (1.9), Mar'11 (1.9.1), Apr'11 (1.9.2), Jun'11 (1.9.3.1), Jul'11 (1.9.4), Aug'11 (1.9.5), Dec'11 (1.9.6)
  9. MileageTracker: Jun'10 (r11), Sep'10 (r28), Oct'10 (r30), Nov'10 (r34), Feb'11 (r36), Mar'11 (r42)
  10. Mustard: May'10 (0.1.9.3), Jul'10 (0.1.9.5rc1), Aug'10 (0.1.9.5rc8), Sep'10 (0.1.9.6), Oct'10 (0.1.9.7b), Nov'10 (0.1.10), Feb'11 (0.1.11), Mar'11 (0.1.12c), Jun'11 (0.1.14), Jul'11 (0.2.0), Oct'11 (0.2.1a)
  11. Browser: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  12. Calendar: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  13. Camera: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  14. Contacts: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  15. DeskClock: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  16. Dialer: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  17. Email: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  18. Gallery2: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  19. Mms: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)
  20. QuickSearchBox: 2.1 (Eclair), 2.2 (Froyo), 2.3 (Gingerbread), 2.3.3 (Gingerbread), 4.0.1 (Ice Cream Sandwich), 4.0.3 (Ice Cream Sandwich)

Features and vulnerabilities

This compressed [archive] contains the features that have been extracted and the list of files that are classified as vulnerable. The archive contains one folder per application. Under each folder, there are two CSV files per version:
  • .features.csv: each line contains the features of one Java file
  • .vulns.csv: each line contains the name of one vulnerable Java file
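
The two files of a version could be combined into a labeled data set along the following lines. This is a rough sketch under explicit assumptions: it assumes that the first column of the features file holds the Java file name and that neither file has a header row, which should be checked against the actual files in the archive. The file names and values below are made up for the example.

```r
# Hypothetical miniature versions of the two CSV files, written to a temporary
# directory. ASSUMPTION: the first column of the features file is the Java file
# name and there is no header row; the real files may be laid out differently.
tmp_dir <- tempfile("example")
dir.create(tmp_dir)
features_path <- file.path(tmp_dir, "example.features.csv")
vulns_path <- file.path(tmp_dir, "example.vulns.csv")
writeLines(c("A.java,3,0,1", "B.java,0,2,5", "C.java,1,1,0"), features_path)
writeLines("B.java", vulns_path)

features <- read.csv(features_path, header = FALSE, stringsAsFactors = FALSE)
vulnerable <- readLines(vulns_path)

# Label each row: TRUE if the file name is listed in the vulnerabilities file.
features$vulnerable <- features[[1]] %in% vulnerable
print(features)
```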

Additional graphs

This compressed [archive] contains additional graphs that could not be included in the paper due to space limitations. For example, the additional graphs show the size (in number of files) of the applications.

The archive also contains a chart showing the types of vulnerabilities reported by HP Fortify SCA.