Paper (appeared in ACM WSDM 2017)
Some features of DiSMEC
- C++ code that uses OpenMP for parallel training and prediction, built on the 64-bit Liblinear code-base and extended to handle multi-label datasets.
- Handles extreme multi-label classification problems with hundreds of thousands of labels; such datasets can be downloaded from the Extreme Classification Repository
- Takes only a few minutes on the EURLex-4K (eurlex) dataset, which has about 4,000 labels, and a few hours on the WikiLSHTC-325K dataset, which has about 325,000 labels
- Learns models in batches of 1,000 labels (the default; can be changed to suit your setting)
- Allows two-fold parallel training: (a) a single batch of labels (say, 1,000) is learnt in parallel using OpenMP, exploiting multiple cores on a single machine/node; (b) training can be invoked on multiple machines/nodes simultaneously, such that each successive launch starts from the next batch of 1,000 labels (see the -i option in the sample command line)
- Tested on 64-bit Ubuntu only
The code also includes instructions and a demonstration of how to run on the EURLex-4K dataset downloaded from the Extreme Classification Repository. For EURLex-4K, the final output should report the following prec@k and nDCG@k values
Results for EURLex-4K dataset
========================
precision at 1 is 82.51
precision at 3 is 69.48
precision at 5 is 57.94
ndcg at 1 is 82.51
ndcg at 3 is 72.89
ndcg at 5 is 67.05
========================
When run on the WikiLSHTC-325K dataset, the program should give results close to the following
========================
precision at 1 is 64.14
precision at 3 is 42.45
precision at 5 is 31.52
ndcg at 1 is 64.14
ndcg at 3 is 58.37
ndcg at 5 is 58.34
========================
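For reference, a minimal sketch of the two reported metrics, using the standard definitions of precision@k and nDCG@k for a single test instance given a ranked list of predicted labels and the true label set (this is an illustration, not DiSMEC's own evaluation code):

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <vector>

// Fraction of the top-k predicted labels that are relevant.
double precision_at_k(const std::vector<int>& ranked,
                      const std::set<int>& relevant, int k) {
    int hits = 0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (relevant.count(ranked[i])) ++hits;
    return (double)hits / k;
}

// DCG@k (gain 1 for a relevant label at rank i, discounted by log2(i+1))
// normalized by the ideal DCG achievable with |relevant| labels.
double ndcg_at_k(const std::vector<int>& ranked,
                 const std::set<int>& relevant, int k) {
    double dcg = 0.0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (relevant.count(ranked[i]))
            dcg += 1.0 / std::log2(i + 2.0);
    double idcg = 0.0;
    int m = std::min<int>(k, relevant.size());
    for (int i = 0; i < m; ++i)
        idcg += 1.0 / std::log2(i + 2.0);
    return idcg > 0.0 ? dcg / idcg : 0.0;
}
```

Note that prec@1 and nDCG@1 always coincide (both reduce to whether the top prediction is relevant), which is why the tables above show identical values at k = 1.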