MPBoost++

Andrea Esuli
ISTI-CNR

Introduction

MPBoost++ is a C++ implementation of MPBoost (SPIRE06a), a variant of the multi-label AdaBoost.MH algorithm that improves its effectiveness and efficiency by performing a multiple pivot selection at each boosting iteration.

Quick tutorial

Download links are at the end of the page; please read this tutorial and the license terms first.

The following quick tutorial is based on the assumption that you have a dedicated mpboost directory with three subdirectories:

/bin – in which you put the binary files (compiling from source code automatically creates the files in this directory)
/source – in which you put the source code files (not required if you download the binary files)
/data – in which you put the data on which you want to run experiments.

Compiling MPBoost++


~/mpboost/> tar zxvf mpboost_latest_source.tar.gz
~/mpboost/> cd source
~/mpboost/source/> make
~/mpboost/source/> cd ../bin
~/mpboost/bin/> ls -l
total 256
-rwxrwxr-x 1 esuli esuli 91156 2010-08-16 16:09 boostTest
-rwxrwxr-x 1 esuli esuli 86596 2010-08-16 16:09 boostTrain
-rwxrwxr-x 1 esuli esuli 25216 2010-08-16 16:09 mergeEvaluation
-rwxrwxr-x 1 esuli esuli 40025 2010-08-16 16:09 showEvaluation

That is all that is needed, provided that you have make and g++ installed (version 4.3.3 of g++ is currently used).
Binaries are created directly in the bin directory.

Input format

MPBoost and AdaBoost.MH use binary features: a feature is either present in a document or absent from it, so no weight information is provided for the features.
MPBoost and AdaBoost.MH are multi-label classification algorithms, i.e., a document can belong to zero, one, or more than one category.

The input format for MPBoost++ is based on a sparse vector representation, using text files and describing one vector per line.
The format of a line describing a vector is:


<ID> <featureID>* | <categoryID>*

with feature IDs and category IDs sorted in ascending order.

For example, the line describing the document with ID 3, which contains features 3, 6, 103, and 201 and belongs to categories 3 and 9, is:


3 3 6 103 201 | 3 9

The pipe character marks where the feature list ends and the category list starts. If a document has no category, the pipe at the end of the line can be omitted.
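As a minimal sketch of this format, the following Python helpers write and read one vector line. The names format_line and parse_line are hypothetical and not part of MPBoost++. The sketch assumes that all IDs are integers, as in the example above, and it removes duplicate IDs, since the text does not say how duplicates would be handled.

```python
def format_line(doc_id, features, categories=()):
    """Build one input line: <ID> <featureID>* | <categoryID>*."""
    # IDs are sorted in ascending order, as the format requires.
    fields = [str(doc_id)] + [str(f) for f in sorted(set(features))]
    cats = sorted(set(categories))
    if cats:
        # The pipe is written only when the document has categories.
        fields.append("|")
        fields.extend(str(c) for c in cats)
    return " ".join(fields)


def parse_line(line):
    """Split one input line into (doc_id, feature_ids, category_ids)."""
    left, _, right = line.partition("|")
    tokens = left.split()
    doc_id = int(tokens[0])
    features = [int(t) for t in tokens[1:]]
    # With no pipe, `right` is empty and the category list is empty too.
    categories = [int(t) for t in right.split()]
    return doc_id, features, categories


print(format_line(3, [201, 6, 3, 103], [9, 3]))  # prints: 3 3 6 103 201 | 3 9
```

Sorting inside format_line means the caller can pass features and categories in any order.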

Train and Test

To run a train-and-test experiment, MPBoost++ must be provided with a training data file and one or more test data files.
The boostTrain program learns a classification model from the training data file.
The boostTest program classifies test data using the learned classification model.
The showEvaluation program shows the evaluation of the classification results, in the form of contingency tables and effectiveness measures. Its output is formatted for direct copy & paste into a spreadsheet.
When multiple test data files are used (e.g., for RCV1 v2), the mergeEvaluation program can be used to merge the partial evaluations of the individual test data files into a single evaluation.
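To make the notion of a contingency table concrete, here is a small Python sketch that counts TP, TN, FN, and FP for each category from the true and the predicted category sets of a few documents. It illustrates the standard definition only and is not the MPBoost++ code; the function name and the toy data are hypothetical.

```python
def contingency_tables(true_sets, predicted_sets, categories):
    """Count TP, TN, FN, FP per category over a list of documents."""
    tables = {c: {"TP": 0, "TN": 0, "FN": 0, "FP": 0} for c in categories}
    for truth, predicted in zip(true_sets, predicted_sets):
        for c in categories:
            if c in truth:
                # The document belongs to c: hit or miss.
                key = "TP" if c in predicted else "FN"
            else:
                # The document does not belong to c: false alarm or correct rejection.
                key = "FP" if c in predicted else "TN"
            tables[c][key] += 1
    return tables


# Three toy documents; the last one has no category at all.
true_sets = [{3, 9}, {3}, set()]
predicted_sets = [{3}, {3, 9}, set()]
tables = contingency_tables(true_sets, predicted_sets, categories=[3, 9])
for cat, table in tables.items():
    print(cat, table["TP"], table["TN"], table["FN"], table["FP"])
```

Because the classification is multi-label, every document contributes one count to every category's table.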


~/mpboost/> cd data
~/mpboost/data/> bunzip2 reuters21578.tar.bz2
~/mpboost/data/> tar xvf reuters21578.tar
~/mpboost/data/> cd ../bin
~/mpboost/bin/> ./boostTrain -t ../data/reuters21578/training
Loading data
Using uniform distribution
Data loaded
Starting training
Iteration 1
...
Iteration 100
Training completed in 111.86 seconds.
Serializing model to file: ../data/reuters21578/training.model
Serialization completed
Serializing distribution to file: ../data/reuters21578/training.distribution
Serialization completed
~/mpboost/bin/> ./boostTest -t ../data/reuters21578/test -m ../data/reuters21578/training.model
Loading data
Using all the hypothesis in the model (100)
Data loaded
Starting test
Test completed in 9.04 seconds.
Serializing evaluation to file: ../data/reuters21578/test.evaluation
Serialization completed
Serializing prediction to file: ../data/reuters21578/test.prediction
Serialization completed
~/mpboost/bin/> ./showEvaluation -e ../data/reuters21578/test.evaluation
Evaluation file: ../data/reuters21578/test.evaluation
Global table
TP TN FN FP
2984 375235 760 406
Per-category tables
cat TP TN FN FP
0 619 2553 100 27
...
114 0.999394 0.866667 1 0.928571
MACRO-average evaluation
accuracy precision recall F1
0.996927 0.479644 0.410518 0.427168
MICRO-average evaluation
accuracy precision recall F1
0.996927 0.880236 0.797009 0.836557
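The MICRO-average figures in the session above can be recomputed from the global table with the usual definitions of accuracy, precision, recall, and F1. The Python sketch below does this; the function name is hypothetical, and returning 0 when a denominator is zero is an assumption, not necessarily what showEvaluation does.

```python
def measures(tp, tn, fn, fp):
    """Return (accuracy, precision, recall, F1) for one contingency table."""
    total = tp + tn + fn + fp
    accuracy = (tp + tn) / total
    # Assumption: an undefined ratio (zero denominator) is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fn + fp) if 2 * tp + fn + fp else 0.0
    return accuracy, precision, recall, f1


# Global table from the session above: TP TN FN FP = 2984 375235 760 406
acc, prec, rec, f1 = measures(tp=2984, tn=375235, fn=760, fp=406)
print(f"{acc:.6f} {prec:.6f} {rec:.6f} {f1:.6f}")
# prints: 0.996927 0.880236 0.797009 0.836557
```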

Options

boostTrain

Usage:
./boostTrain -t trainfile

Optional parameters:
-i <number of iterations to perform> (default: 100)
-m <previous model file to load to continue training>
-d <assigned distribution to load at beginning of training>
-A (use AdaBoost.MH algorithm, default: MPBoost)
-dB (use a balanced distribution, default: uniform)
-om <name of output model file> (default: trainfile.model)
-od <name of output distribution file> (default: trainfile.distribution)
-v <verbosity> (default: 1)
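As an illustration of how these options combine, the Python sketch below assembles boostTrain command lines for training MPBoost and AdaBoost.MH on the same data without overwriting the default output files. Only options listed above are used; the helper name, the iteration count, and the output file names are hypothetical.

```python
import shlex


def boost_train_command(train_file, iterations=None, adaboost_mh=False,
                        model_out=None, dist_out=None, binary="./boostTrain"):
    """Assemble a boostTrain argument list from the documented options."""
    cmd = [binary, "-t", train_file]
    if iterations is not None:
        cmd += ["-i", str(iterations)]
    if adaboost_mh:
        cmd.append("-A")  # switch from MPBoost (the default) to AdaBoost.MH
    if model_out is not None:
        cmd += ["-om", model_out]
    if dist_out is not None:
        cmd += ["-od", dist_out]
    return cmd


train = "../data/reuters21578/training"
# Hypothetical output names, chosen so the two runs do not overwrite each other.
mp_cmd = boost_train_command(train, iterations=200,
                             model_out=train + ".mp.model",
                             dist_out=train + ".mp.distribution")
mh_cmd = boost_train_command(train, iterations=200, adaboost_mh=True,
                             model_out=train + ".mh.model",
                             dist_out=train + ".mh.distribution")
print(shlex.join(mp_cmd))
print(shlex.join(mh_cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the bin directory to actually start the training.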

boostTest

Usage:
./boostTest -t testfile -m modelfile

Optional parameters:
-h <number of hypotheses to use from model> (default: all)
-oe <name of output evaluation file> (default: testfile.evaluation)
-op <name of output prediction file> (default: testfile.prediction)
-v <verbosity> (default: 1)
-nE (do NOT save evaluations, default: save)
-nP (do NOT save predictions, default: save)
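One possible use of the -h option is to evaluate a single trained model with an increasing number of hypotheses, for example to see how effectiveness changes with the number of boosting iterations. The Python sketch below only builds the corresponding boostTest command lines from the options listed above; the helper name and the evaluation file naming scheme are hypothetical.

```python
import shlex


def boost_test_sweep(test_file, model_file, hypothesis_counts,
                     binary="./boostTest"):
    """Build one boostTest command per hypothesis count."""
    commands = []
    for h in hypothesis_counts:
        commands.append([
            binary, "-t", test_file, "-m", model_file,
            "-h", str(h),
            # Hypothetical naming scheme: one evaluation file per count.
            "-oe", f"{test_file}.h{h}.evaluation",
            "-nP",  # predictions are not needed for this comparison
        ])
    return commands


sweep = boost_test_sweep("../data/reuters21578/test",
                         "../data/reuters21578/training.model",
                         [10, 50, 100])
for cmd in sweep:
    print(shlex.join(cmd))
```

Each resulting evaluation file can then be inspected with showEvaluation -e.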

showEvaluation

Usage:
./showEvaluation -e evaluationfile

Optional parameters:
-c (show only contingency table)
-p (show only performance measures)
note: -c and -p are mutually exclusive, default is show both
-M (compute only MACRO-average measures)
-m (compute only MICRO-average measures)
note: -M and -m are mutually exclusive, default is compute both
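The difference between the two kinds of average can be sketched in a few lines of Python: micro-averaging sums the per-category contingency tables and computes the measure once, while macro-averaging computes the measure for each category and then takes the mean. The two toy tables are hypothetical, and treating an undefined ratio as 0 is an assumption that may differ from what showEvaluation does.

```python
def precision(t):
    # Assumption: precision is 0 when nothing was predicted positive.
    return t["TP"] / (t["TP"] + t["FP"]) if t["TP"] + t["FP"] else 0.0


def recall(t):
    # Assumption: recall is 0 when the category has no positive documents.
    return t["TP"] / (t["TP"] + t["FN"]) if t["TP"] + t["FN"] else 0.0


# Hypothetical tables: one frequent category and one rare category.
tables = {
    "frequent": {"TP": 90, "TN": 890, "FN": 10, "FP": 10},
    "rare": {"TP": 1, "TN": 990, "FN": 9, "FP": 0},
}

# Micro: sum the tables first, then compute the measure once.
total = {k: sum(t[k] for t in tables.values()) for k in ("TP", "TN", "FN", "FP")}
micro_precision, micro_recall = precision(total), recall(total)

# Macro: compute the measure per category, then average.
macro_precision = sum(precision(t) for t in tables.values()) / len(tables)
macro_recall = sum(recall(t) for t in tables.values()) / len(tables)

print(f"micro precision {micro_precision:.4f} recall {micro_recall:.4f}")
print(f"macro precision {macro_precision:.4f} recall {macro_recall:.4f}")
```

In this toy case the poor recall on the rare category pulls the macro-averaged recall down to 0.5, while the micro-averaged recall stays above 0.8 because the frequent category dominates the summed table.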

mergeEvaluation

Usage:
./mergeEvaluation output_evaluationfile+
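Conceptually, merging partial evaluations can be thought of as adding up the contingency tables category by category. The Python sketch below shows that idea on in-memory tables; it is an assumption about what merging means, not a description of how mergeEvaluation works internally or of its file format, and all names and numbers are hypothetical.

```python
from collections import Counter


def merge_tables(partials):
    """Sum per-category contingency tables from several partial evaluations."""
    merged = {}
    for partial in partials:
        for cat, table in partial.items():
            # Counter.update adds the counts to those already stored.
            merged.setdefault(cat, Counter()).update(table)
    return {cat: dict(counts) for cat, counts in merged.items()}


# Two hypothetical partial evaluations, e.g. from two test data files.
part1 = {0: {"TP": 5, "TN": 90, "FN": 3, "FP": 2}}
part2 = {0: {"TP": 1, "TN": 8, "FN": 0, "FP": 1},
         1: {"TP": 2, "TN": 7, "FN": 1, "FP": 0}}
merged = merge_tables([part1, part2])
print(merged[0])
print(merged[1])
```

Measures computed on the merged tables then cover all test documents at once, rather than being an average of per-file measures.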

License

This program is granted free of charge for research and education purposes. However, you must obtain a license from the author to use it for commercial purposes.

Scientific results produced using the software provided shall acknowledge the use of MPBoost++. Please cite as:

Andrea Esuli, Tiziano Fagni, and Fabrizio Sebastiani
MP-Boost: A Multiple-Pivot Boosting Algorithm and its Application to Text Categorization
Proceedings of the 13th International Symposium on String Processing and Information Retrieval (SPIRE’06), Glasgow, UK, 2006, pages 1-12. Lecture Notes in Computer Science n. 4209, Springer Verlag.
SPIRE06a

Moreover, the author of MPBoost++ shall be informed about the publication.

The software must not be modified and distributed without prior permission of the author.

By using MPBoost++ you agree to the licensing terms.

NO WARRANTY

BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Downloads

Source code

C++ source code mpboost_latest_source.tar

Binaries

Linux binaries (32 bits) mpboost_latest_binaries_linux.tar.gz

Windows binaries (32 bits) mpboost_latest_binaries_win.zip

Test corpora

Reuters21578 *

RCV1 v2 *

* For details on how these vectorial representations have been generated from the test corpora, please see the MPBoost paper.