Classification Datasets
Reuters
RCV1 (Reuters Corpus Volume 1)
RCV1: A New Benchmark Collection for Text Categorization Research (2004) (pdf)
Reuters Corpus
RCV1-v2/LYRL2004: The LYRL2004 Distribution of the RCV1-v2 Text Categorization Test Collection
lyrl2004_vectors_test_pt0.dat.gz: 159879168
lyrl2004_vectors_test_pt1.dat.gz: 161878016
lyrl2004_vectors_test_pt2.dat.gz: 158580736
lyrl2004_vectors_test_pt3.dat.gz: 149512192
lyrl2004_vectors_train.dat.gz: 18620416
LIBSVM data sets
Data: Classification (Multi-class)
heart
glass
UCI
iris
Connect-4
wine
news20
Source: [KL95a]
Preprocessing: First 80/20 training/testing split. Also see this page [JR01a]
# of classes: 20
# of data: 15,935 / 3,993 (testing)
# of features: 62,061 / 62,060 (testing)
news20.scale.bz2 (scaled to binary encoding; then unit length for each instance)
news20.t.scale.bz2 (testing) (scaled to binary encoding; then unit length for each instance)
./svm-train news20.scale
./svm-predict news20.t.scale news20.scale.model news20.t.predict
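The scaling noted for the news20 files ("binary encoding; then unit length for each instance") can be illustrated with a small sketch. It assumes that "unit length" means an L2 norm of 1 and that the files use the LIBSVM sparse text format (`label index:value ...`). The function names and the sample line are hypothetical and are not taken from the data set or from LIBSVM itself.

```python
import math


def parse_libsvm_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {idx: val})."""
    parts = line.split()
    label = parts[0]
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return label, features


def binarize_and_normalize(features):
    """Set every nonzero feature to 1, then scale the vector to L2 norm 1.

    Assumption: this is what 'binary encoding; then unit length' refers to.
    With k nonzero features, each one ends up with the value 1/sqrt(k).
    """
    nonzero = [idx for idx, val in features.items() if val != 0.0]
    if not nonzero:
        return {}
    scale = 1.0 / math.sqrt(len(nonzero))
    return {idx: scale for idx in nonzero}


# Hypothetical raw line with made-up term counts (not from news20).
label, feats = parse_libsvm_line("3 5:2 17:1 42:7 99:4")
scaled = binarize_and_normalize(feats)
norm = math.sqrt(sum(v * v for v in scaled.values()))
print(label, scaled, norm)
```

Under this reading, an instance with four nonzero features would store 0.5 for each of them, so the stored values depend only on how many distinct features occur, not on the original counts.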
OHSUMED Test Collection
UCI
UCI Knowledge Discovery in Databases Archive
Classification
20 Newsgroups
20news-bydate.tar.gz - 20 Newsgroups sorted by date; duplicates and some headers removed (18846 documents)