Paper (appeared in ACM WSDM 2017)
Some features of DiSMEC
- C++ code that uses OpenMP for parallel training and prediction, built on the 64-bit Liblinear code-base and extended to handle multi-label datasets.
- Handles extreme multi-label classification problems with hundreds of thousands of labels; such datasets can be downloaded from the Extreme Classification Repository
- Takes only a few minutes on the EURLex-4K (eurlex) dataset, which has about 4,000 labels, and a few hours on the WikiLSHTC-325K dataset, which has about 325,000 labels
- Learns models in batches of 1,000 labels (the default; can be changed to suit your setting)
- Allows two-fold parallel training: (a) a single batch of labels (say, 1,000) is learnt in parallel using OpenMP, exploiting multiple cores on a single machine/node; (b) training can be invoked on multiple machines/nodes simultaneously, such that each successive launch starts from the next batch of 1,000 labels (see the -i option in the sample command line)
- Tested on 64-bit Ubuntu only
The code also includes instructions and a demonstration of how to run on the EURLex-4K dataset downloaded from the Extreme Classification Repository. For EURLex-4K, the final output should report the following prec@k and nDCG@k values
Results for EURLex-4K dataset
========================
precision at 1 is 82.51
precision at 3 is 69.48
precision at 5 is 57.94
ndcg at 1 is 82.51
ndcg at 3 is 72.89
ndcg at 5 is 67.05
========================
When run on the WikiLSHTC-325K dataset, the program should give results close to the following
========================
precision at 1 is 64.14
precision at 3 is 42.45
precision at 5 is 31.52
ndcg at 1 is 64.14
ndcg at 3 is 58.37
ndcg at 5 is 58.34
========================
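For reference, a minimal sketch of the two reported metrics, using the standard definitions of precision@k and nDCG@k for a single test instance given a ranked list of predicted labels and the true label set (this is an illustration, not DiSMEC's own evaluation code):

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <vector>

// Fraction of the top-k predicted labels that are relevant.
double precision_at_k(const std::vector<int>& ranked,
                      const std::set<int>& relevant, int k) {
    int hits = 0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (relevant.count(ranked[i])) ++hits;
    return (double)hits / k;
}

// DCG@k (gain 1 for a relevant label at rank i, discounted by log2(i+1))
// normalized by the ideal DCG achievable with |relevant| labels.
double ndcg_at_k(const std::vector<int>& ranked,
                 const std::set<int>& relevant, int k) {
    double dcg = 0.0;
    for (int i = 0; i < k && i < (int)ranked.size(); ++i)
        if (relevant.count(ranked[i]))
            dcg += 1.0 / std::log2(i + 2.0);
    double idcg = 0.0;
    int m = std::min<int>(k, relevant.size());
    for (int i = 0; i < m; ++i)
        idcg += 1.0 / std::log2(i + 2.0);
    return idcg > 0.0 ? dcg / idcg : 0.0;
}
```

Note that prec@1 and nDCG@1 always coincide (both reduce to whether the top prediction is relevant), which is why the tables above show identical values at k = 1.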