IJCAI 2016 results and experiments

This page presents the results and experiments used for the IJCAI 2016 article.

Abstract

We introduce a framework for knowledge-based sequence mining based on Answer Set Programming (ASP). We begin by modeling the basic task and refine it in the sequel in several ways. First, we show how easily condensed patterns can be extracted by modular extensions of the basic approach. Second, we illustrate how ASP's preference handling capacities can be exploited for mining patterns of interest. In doing so, we want to demonstrate how easy it is to incorporate knowledge into an ASP-based mining process. Since this comes with a trade-off in effectiveness, we provide an empirical study comparing our approach with closely related sequence mining approaches.

Our encodings have been designed for the solvers of the Potassco ASP tool suite. We use clingo 4.5 as the solver and the ASPRIN system to extract preferred patterns (an update for ASPRIN-3 is included).

Encodings

  • instance example : instance.lp
  • mining frequent sequences
      • running command: clingo 0 -c k=2 instance.lp frequent.lp
  • closed sequences
      • running command: clingo 0 -c k=2 instance.lp frequent.lp closed.lp
  • maximal sequences
      • running command: clingo 0 -c k=2 instance.lp frequent.lp maximal.lp
  • preference encodings (require asprin)
      • running command: asprin 0 -c k=2 instance.lp frequent.lp preference.lp asprin.lib
      • the preference encoding changed with ASPRIN-3: the updated encoding is preferences-3.lp (same running commands) [thanks to Javier Romero for the update]
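
The running commands above differ only in the solver and in the list of encoding files. As a minimal sketch (the helper name build_command and the task labels are hypothetical, not part of the distributed files), the command lines can be assembled in Python as follows. The sketch only builds the argument list; it assumes clingo and asprin are installed and available on the PATH if the list is then passed to subprocess.run.

```python
# Encoding files for each mining task, copied from the running commands above.
# The task labels ("frequent", "closed", ...) are hypothetical names for this sketch.
TASK_ENCODINGS = {
    "frequent": ["frequent.lp"],
    "closed": ["frequent.lp", "closed.lp"],
    "maximal": ["frequent.lp", "maximal.lp"],
    "preferred": ["frequent.lp", "preference.lp", "asprin.lib"],
}


def build_command(task, instance="instance.lp", k=2):
    """Return the argument list for one mining task.

    Preferred patterns are computed with asprin; every other task uses clingo.
    The leading "0" asks the solver to enumerate all answer sets.
    """
    solver = "asprin" if task == "preferred" else "clingo"
    return [solver, "0", "-c", f"k={k}", instance, *TASK_ENCODINGS[task]]


if __name__ == "__main__":
    for task in TASK_ENCODINGS:
        print(" ".join(build_command(task)))
```

Keeping the mapping in one table makes it easy to run every task on the same instance with the same value of k.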

Datasets

Below you will find the ASP facts of the datasets we used to benchmark our ASP encodings. The simulated datasets are also available in a format that can be read by CPSM (Constraint Based Sequence Mining).

Simulated datasets used to evaluate computational performance:

    • dataset generator: generator.py
      • use the -h option for details on how to use this generator
    • ZIP: an archive containing a set of databases of simulated sequences. Each database is of size 500. The mean length of the sequences ranges from 10 to 40. For example, database_40_50_4.lp contains sequences of mean length 40 built on a vocabulary of size 50; the 4 means that it is the fourth database with these characteristics. The file with the same base name and the .dat extension is the same database in the CPSM format, and the file with the .pat extension describes the hidden patterns.
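
The file naming convention described above can be decoded mechanically. The following is a minimal sketch, assuming that every file in the archive follows the pattern database_<mean length>_<vocabulary size>_<index> with one of the extensions .lp, .dat or .pat; the names DatabaseInfo and parse_database_name are hypothetical and are not part of the archive or of generator.py.

```python
import os
import re
from typing import NamedTuple


class DatabaseInfo(NamedTuple):
    """Parameters encoded in a simulated database file name (hypothetical type)."""
    mean_length: int      # mean length of the sequences
    vocabulary_size: int  # number of distinct items
    index: int            # rank among databases sharing these characteristics
    extension: str        # "lp" (ASP facts), "dat" (CPSM format) or "pat" (hidden patterns)


# Assumed naming pattern, inferred from the example database_40_50_4.lp.
_NAME = re.compile(r"database_(\d+)_(\d+)_(\d+)\.(lp|dat|pat)")


def parse_database_name(path):
    """Extract the generation parameters from a database file name."""
    match = _NAME.fullmatch(os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected database file name: {path!r}")
    mean_length, vocabulary_size, index, extension = match.groups()
    return DatabaseInfo(int(mean_length), int(vocabulary_size), int(index), extension)


if __name__ == "__main__":
    print(parse_database_name("database_40_50_4.lp"))
```

Such a helper is convenient for grouping benchmark results by mean sequence length or vocabulary size.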

Real datasets used to compare mining tasks:

Results

Computing time: comparison of ASP solving wrt CPSM-emb

    • CPSM-emb is a version of the CPSM encodings that allows declarative constraints on embeddings.
    • CPSM-emb was run on the same server with the following command (10 is the frequency threshold):
      • cpsm-emb file.dat 10

Number of patterns: comparison of our different encodings on real datasets