
DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification

Paper (appeared in ACM WSDM 2017)

Some features of DiSMEC

- C++ code that uses OpenMP for parallel training and prediction, built on the 64-bit Liblinear code base and extended to handle multi-label datasets.
- Handles extreme multi-label classification problems with hundreds of thousands of labels; the datasets can be downloaded from the Extreme Classification Repository.
- Takes only a few minutes on the EURLex-4K (eurlex) dataset, which has about 4,000 labels, and a few hours on the WikiLSHTC-325K dataset, which has about 325,000 labels.
- Learns models in batches of 1000 labels by default (the batch size can be changed to suit your setup).
- Allows two-fold parallel training: (a) a single batch of labels (say 1000) is learnt in parallel using OpenMP, exploiting multiple cores on a single machine/node/computer; (b) the program can be invoked on multiple computers/machines/nodes simultaneously, such that each successive launch starts from the next batch of 1000 labels (see the -i option in the sample command line).
- Tested on 64-bit Ubuntu only
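The two-fold parallel scheme above can be sketched as a launch loop. This is only an illustration of the batch-offset idea: the binary name `dismec_train`, the file arguments, and everything except the -i option mentioned above are assumptions, not DiSMEC's actual command line.

```shell
# Hypothetical sketch: one launch per batch of 1000 labels.
# In a real run each command would go to a different node (e.g. via ssh);
# here we only print the commands that would be issued.
NUM_LABELS=4000   # e.g. EURLex-4K
BATCH=1000        # default batch size
i=0
while [ $(( i * BATCH )) -lt "$NUM_LABELS" ]; do
  # -i selects which batch of 1000 labels this launch starts from
  echo "./dismec_train -i $i train.txt model_dir/"
  i=$(( i + 1 ))
done
```

With 4,000 labels and the default batch size, this issues four launches (-i 0 through -i 3), each training one disjoint batch of label models.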

The code also includes a demonstration of how to run DiSMEC on the EURLex-4K dataset downloaded from the Extreme Classification Repository, together with instructions. For the EURLex-4K dataset, the final output should show the following prec@k and nDCG@k values:
Results for EURLex-4K dataset
precision at 1 is 82.51
precision at 3 is 69.48
precision at 5 is 57.94

ndcg at 1 is 82.51
ndcg at 3 is 72.89
ndcg at 5 is 67.05
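For reference, prec@k and nDCG@k are standard ranking metrics over each test point's predicted label ranking. The following is a hedged sketch of how they are typically computed for multi-label predictions, not DiSMEC's own evaluation code; the function names and signatures are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <set>
#include <vector>

// Precision@k: fraction of the top-k ranked labels that are relevant.
double precision_at_k(const std::vector<int>& ranked,
                      const std::set<int>& truth, int k) {
    int hits = 0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (truth.count(ranked[i])) ++hits;
    return (double)hits / k;
}

// nDCG@k: DCG of the top-k ranking divided by the best achievable DCG
// (an ideal ranking that puts all relevant labels first).
double ndcg_at_k(const std::vector<int>& ranked,
                 const std::set<int>& truth, int k) {
    double dcg = 0.0, idcg = 0.0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (truth.count(ranked[i])) dcg += 1.0 / std::log2(i + 2.0);
    int m = std::min(k, (int)truth.size());
    for (int i = 0; i < m; ++i) idcg += 1.0 / std::log2(i + 2.0);
    return idcg > 0.0 ? dcg / idcg : 0.0;
}
```

The numbers reported above are these per-point scores averaged over the whole test set.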

When run on the WikiLSHTC-325K dataset, the program should give results close to the following:

precision at 1 is 64.14
precision at 3 is 42.45
precision at 5 is 31.52

ndcg at 1 is 64.14
ndcg at 3 is 58.37
ndcg at 5 is 58.34