Kingfisher - a tool for searching statistical dependencies/association rules

Download newest version (1.2c)

1. What is Kingfisher?
2. Who can use Kingfisher?
3. How to install Kingfisher?
4. How to use Kingfisher?
5. What if my data is not binary?
6. How to make your own modifications?
7. Where to report bugs?

What is Kingfisher?

Kingfisher is a data mining tool, which searches for the best non-redundant dependency rules, which express statistically significant dependencies in data. The special feature of Kingfisher is that it scales up well without any minimum frequency thresholds or other suboptimal heuristics. This means that one can find globally optimal dependencies from data. Another unique feature is that it supports Fisher's p-value as a goodness measure, which produces very reliable results (the discovered dependencies hold in future data which is not always true with other measures).

The discovered rules are of form X->A or X->~A, where X is a set of binary attributes and consequence is any single attribute or its negation. No negations can occur in the antecedent of rules (you can circumvent this restriction with suitable preprocessing). Notice that with many measures (like Fisher's p and chi2), the positive dependence between X and A is the same as positive dependence between ~X and ~A or negative dependence between X and ~A or ~X and A. This means that you can find both positive and negative dependencies with these measures.

The current version evaluates the statistical significance by three measures: Fisher's pF (Fisher's exact test), the chi^2-measure and mutual information (MI). It is also possible to add your own measure functions (see 5).

All discovered rules are required to be non-redundant, which means that they do not contain any redundant attributes, which do not improve the rule. Rule X->A is considered non-redundant, if it is more significant than more general rules with the same consequence. For example, if measure M is increasing by goodness (a high value indicates a good rule) and M(A1 A2 A3 -> A5)<=M(A1 -> A5), then A1 A2 A3 -> A5 is considered redundant in respect of A1 -> A5.

The details are described in

Hämäläinen, W.: Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 196-205 IEEE Computer Society 2010. (See here)

and (more detailed)

Hämäläinen, W.: Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems: An International Journal (KAIS) 32(2):383-414, 2012. (See here)

Who can use Kingfisher?

Kingfisher has the following copyright, which is also included in the source file:

------------------------------------------------------------------------ (C) Copyright 2010 Wilhelmiina Hämäläinen. The code can be freely used for academic/research purposes. Direct or indirect use for commercial advantage is not allowed without written permission from the author. The code can be modified and redistributed if this note is left intact. -------------------------------------------------------------------------

If you use it for your own publications, use one of the references above.

How to install Kingfisher?

If you are using Linux/Unix with gcc, installing is easy and you shouldn't have any problems. The program may work well in other environments and/or with other compilers, but you have to find it out yourself.

First, load the source code package here. Unpack it in a selected directory. Then simply run command "make". (The package contains a README file with these same instructions.)

Notice that handling really high-dimensional data sets requires relatively much memory. For the classical benchmark data sets (Mushroom, Chess, T10I4D100K, T40I10D100K, Accidents, Pumsb, and Retail) 1 GB memory is sufficient with measure pF, but the chi2-measure may already require more.

Notice also that the default makefile compiles the program by default into 64-bit code. If you want to save memory (which can also speed up the search), you can use the 32-bit mode. Just modify the makefile (add and remove corresponding comment marks). Still, note that some really difficult data sets (especially with the chi2-measure) may require so large address space that 32 bits isn't sufficient. In this case the program may just halt without warning (a well-known malloc bug).

How to use Kingfisher?

The basic syntax is

kingfisher -i<inputfile> -k<maximum attribute number> [-w<measure>] -M<threshold for the measure> [-q<number of best rules to search for> -t<type of the rule>] [extra options]

where

- input file should be in the transaction form: All binary attributes are represented by non-negative integers, and each row lists attributes which are true on the row. Separator is space. E.g. row "1 2 4 7 9" corresponds attribute combination A1 A2 ~A3 A4 ~A5 ~A6 A7 ~A8 A9 ~A10 (where ~is negation, 10 is the number of attributes).
- maximum attribute number means the largest integer which can occur in the input data. For example, in the previous example it was 10. Attribute numbers can begin from either 0 or 1.
- measure M can be either pF (Fisher's exact test, option -w1), ln(pF) (Fisher's exact test, option -w2; recommended, when the maximum p-value is very small), chi2 (option -w3) or MI (option -w4). If no measure is selected, the default is w2. Notice that with the chi2-measure (option -w3), continuity correction is used by default. If you want to disable the continuity correction (not recommended for the sake of accuracy), give option -d.
- threshold for the M-value depends on the measure. For Fisher's pF, it should be in range ]0,1[; for ln(pF), it should be <0, and for chi2, in range ]0,n] (where n is the number of data rows), and for MI, >0.
  - Since Kingfisher searches for the Q best rules, it updates the threshold automatically, when new good rules are found. However, the first levels are proceeded faster, if you give a sufficiently demanding threshold in the beginning. At each level, the program outputs the currently used threshold value. So, for example, if the program gets stuck at a level, where the updated threshold is x, you can stop the search (ctrl-c) and use x (or a more demanding) value as a threshold. Alternatively, you can search a smaller number of best rules (option -q).
- the number of best rules to search for is optional. By default, the program searches for the 100 best rules. If the program gets stuck (with very dense data sets), try a smaller number. With easier data sets you can also search efficiently for much more rules, e.g. 1000 or 10 000.
- type of the rule is optional. By default, Kingfisher searches for only positive dependency rules. If you want to search for only negative dependencies, use option -t2, and for both positive and negative dependencies, option -t3. Notice that typically negative dependency rules are more demanding to search for (the search continues further). So, if the programs gets stuck, you can try to search for only positive dependency rules.
  - Notice also that if you have created new negation (complement) attributes for all attributes in the preprocessing phase, then you should not search for negative dependency rules - otherwise you would get a lot of trivial rules of the form A->~negA, where neg A is a new attribute for ~A. Note that in this case there is no need for negative dependency rules, because you will get them anyway (positive rule X->negA is the same as X->~A).
- all other options are optional. They let you, for example, define minimum frequency or confidence thresholds, output format and file as well as different kinds of extra constraints. You get a full list of options when you give command "kingfisher", without any options. The same help is output, if you give a wrong number of options or obligatory options are missing.
  - Examples:
  - kingfisher -i mushroom.dat -k124 -M-1000
- searches for the 100 best positive dependency rules from mushroom.dat. The measure is by default ln(pF) and the initial threshold is -1000.
  - kingfisher -imushroom.dat -k124 -M-1000 -t3 -q50
- searches for the 50 best positive and negative rules from mushroom.dat. Otherwise like the previous one.
  - kingfisher -i mushroom.dat -k124 -w3 -M8000
- searches for the 100 best positive rules using chi2-measure (-w3) whose initial threshold is 8000.
  - What if my data is not binary?
  - Kingfisher like most rule discovery tools was originally designed for occurrence data (like market basket data or species or gene occurrence data). However, it is easy to preprocess different types of data into suitable format. Here, we give some hints for preprocessing. Kingfisher itself supports several kinds of constraints which faciliate different preprocessing schemes.
  - For any nominal data, you can always create a binary attribute for each nominal value. If the original variable A can get values a, b and c, just create binary attributes B1, B2 and B3 for A=a, A=b and A=c. You may also want to create attributes B4, B5 and B6 for A!=a, A!=b and A!=c. This corresponds to creating attributes for negations ~B1, ~B2 and ~B3. Now it is important to prevent trivial rules which would be just a byproduct of your binarization. Otherwise, you could easily get rules like B1->~B4 (A=a -> ~(A!=a)) or B4 B5 -> B6 (A!=a, A!=b -> A=c). You can define such forbidden combinations in a separate constraint file which you give to Kingfisher with option -e <constraint file>. Each row of the constraint file should list attributes (separated by spaces or commas) any two of which cannot occur together. If you created new attributes for all negations, you may also want to use option -t1 - see above.
  - Numerical variables require always some discretization. In practice, you can either use some static discretization or simulate a dynamic discretization, which let's Kingfisher to define the discretization on-line. In the static discretization you should predefine mutually exclusive intervals and create new attributes for each of them. Like in the nominal case, it is advisable to add constraints which prevent trivial rules.
  - For the dynamic discretization, you can try the following technique: Given numeric variable A, decide a set of split points a_1,...,a_m (potential end points for an interval). Then create binary attributes for all inequalities A<=a_i and A>=a_i. Add constraints (with option -e) which forbid attributes for A<=a_i to occur together and similarly for A>=a_is. In addition, you probably want to prevent rules of the form A>=2->A<=5. This can be done with option -b <extra constraint file>. Here extra constraint file lists on each row first a consequence attribute and then its forbidden condition attributes.
  - Notice that when nominal variables are transformed into binary, no information is lost, and you can still find globally optimal dependency rules. With numerical variables this is no more the case, because discretization can lose information (while it can also clean information on dependencies from noise). If the optimality of results is important to you, try to simulate some dynamic discretization scheme with your preprocessing so that the program can itself set optimal discretization intervals on-line. Depending on the number of numerical variables and their domains, you may be able to find globally optimal rules even when numerical variables are involved. However, if the number of numerical variables is large and/or they have many different values, this is no more feasible, but even then, you can try a coarse-level dynamic discretization.
  - How to make your own modifications?
  - Adding your own measure functions is easy. The only places where you have to make changes are measures.h and measures.c. Short instructions are given there.
  - For any goodness measure M it is required that
    - you can define bounds (an upper bound for an increasing measure and a lower bound for a decreasing measure) for any rule XQ->A=a, a={0,1}, in two cases:
      - when only m(X) and m(A=a) are known, and
        when m(XA=a), m(X) and m(A=a) are known.
    - M(A->B=b)=M(B->A=a) (a=0,1 and b=0,1) e.g. M(A->~B)=M(B->~A). (This condition is not crucial to the search, but you should modify kingfisher.c to remove it. If your search for only positive rules, then it suffices that M(A->B)=M(B->A).)
  - If measure M is well-behaving, you can easily define the required upper or lower bounds.
  - Where to report bugs?
  - If you find any bugs, please report to Wilhelmiina Hämäläinen.

Page updated

Google Sites

Report abuse

Kingfisher - a tool for searching statistical dependencies/association rules

Contents

What is Kingfisher?

Who can use Kingfisher?

How to install Kingfisher?

How to use Kingfisher?