We will be using one or more subsets of the Reuters dataset[1]. Begin your experiments on this rather small miniReuters file. I will let you know if I create larger datasets. The documents (docs) are already in raw frequency format: they have been scanned, stop words (i.e. common non-informative words, such as “to” or “the”) have been removed, and the remaining words have been stemmed (e.g. “trade” and “trading” are replaced by their stem “trad”). Refer to the entire repository to get a feel for the contents of the documents (newswire stories related to economics, such as commodities).
The format of the data is as follows. Each category begins with the string “************** category name” (e.g. “************** soy-oil”). The next line contains “*****Exemplars:”, followed by the list of training documents (i.e. the examples of the category). After the list of training documents, the list of test documents starts with the line “*****Tests:”. A document begins with its id number, followed by the vector of the document's terms and their frequencies in the document. The string “###” marks the end of a document.
Finally, for readability, a blank line separates one document from another, the last training document from the “*****Tests:” line, and the last test document from the next category. The same document id (with the same vector of word frequencies) may appear in multiple categories.
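The format described above can be read with a short line-based parser. The following is a minimal sketch, not a reference implementation. The function name `parse_reuters`, the nested-dictionary layout, and the sample ids and terms are all made up for illustration. It assumes that tokens are separated by whitespace, that each document's vector is written as alternating term and integer-frequency tokens after the id, and that “###” stands as its own token. Check these assumptions against the actual file before relying on it.

```python
def parse_reuters(text):
    """Parse the category/document format into nested dictionaries.

    Returns {category: {"train": {doc_id: {term: freq}}, "test": {...}}}.
    Assumes whitespace-separated tokens and alternating term/frequency pairs.
    """
    data = {}
    category = None   # name of the category currently being read
    split = None      # "train" after *****Exemplars:, "test" after *****Tests:
    tokens = []       # tokens of the document currently being read

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines are only there for readability
        if line.startswith("*****Exemplars:"):
            split = "train"
            line = line[len("*****Exemplars:"):]
        elif line.startswith("*****Tests:"):
            split = "test"
            line = line[len("*****Tests:"):]
        elif line.startswith("*****"):
            # Category header: a run of asterisks followed by the name.
            category = line.lstrip("*").strip()
            data[category] = {"train": {}, "test": {}}
            split = None
            continue

        for tok in line.split():
            if tok != "###":
                tokens.append(tok)
                continue
            # End of document: first token is the id, the rest are pairs.
            doc_id, pairs = tokens[0], tokens[1:]
            data[category][split][doc_id] = {
                pairs[i]: int(pairs[i + 1]) for i in range(0, len(pairs) - 1, 2)
            }
            tokens = []
    return data


# Made-up sample in the described layout (not taken from miniReuters).
SAMPLE = """\
************** soy-oil
*****Exemplars:
101 soy 3 oil 2 ###

102 trad 1 ###

*****Tests:
103 oil 4 ###

************** trade
*****Exemplars:
102 trad 1 ###

*****Tests:
104 trad 2 export 1 ###
"""

data = parse_reuters(SAMPLE)
print(data["soy-oil"]["train"]["101"])
```

Keeping each category's copy of a shared document (here id 102) makes it easy to verify later that repeated ids really carry the same vector.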
You should assume that training (resp. test) documents that are not listed under the training (resp. test) docs of a category are negative training (resp. test) examples for that category.
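One way to apply this rule is to pool every document id seen in a split and label each id as positive or negative per category. The sketch below assumes you already have, for one split, a mapping from category name to the set of ids listed under it. The helper name `binary_labels` and the ids are hypothetical.

```python
def binary_labels(docs_by_category):
    """Map each category to {doc_id: True/False} over the whole document pool.

    docs_by_category: {category: set of doc ids listed under that category}
    for a single split (training or test).
    """
    # Every id listed under any category belongs to the pool for this split.
    pool = set().union(*docs_by_category.values())
    return {
        category: {doc_id: doc_id in positives for doc_id in sorted(pool)}
        for category, positives in docs_by_category.items()
    }


# Hypothetical training ids; document "102" is listed under both categories.
train_ids = {"soy-oil": {"101", "102"}, "trade": {"102", "105"}}
labels = binary_labels(train_ids)
print(labels["soy-oil"])
```

Training and test ids should be labeled separately, since the rule applies within each split.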
There are 7 categories in miniReuters, with the following numbers of train and test docs: (14, 11), (11, 5), (94, 30), (8, 6), (4, 2), (15, 14), and (389, 189), in order of appearance of the categories in the file.
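These counts make a convenient sanity check for whatever loader you write. The snippet below is only a sketch: `count_mismatches` is a hypothetical helper that takes the (train, test) counts your loader observed, in file order, and reports the positions that disagree with the numbers above.

```python
# (train, test) counts per category, in file order, as stated above.
EXPECTED_COUNTS = [(14, 11), (11, 5), (94, 30), (8, 6), (4, 2), (15, 14), (389, 189)]


def count_mismatches(observed):
    """Return the 0-based positions where observed counts differ from the expected ones."""
    if len(observed) != len(EXPECTED_COUNTS):
        raise ValueError(
            f"expected {len(EXPECTED_COUNTS)} categories, got {len(observed)}"
        )
    return [
        i
        for i, (got, want) in enumerate(zip(observed, EXPECTED_COUNTS))
        if tuple(got) != want
    ]


# Example: pretend the loader missed one test doc in the third category.
observed = list(EXPECTED_COUNTS)
observed[2] = (94, 29)
print(count_mismatches(observed))

# Totals over all listings; a document listed under several categories is
# counted once per category, so the number of distinct docs may be smaller.
total_train = sum(train for train, _ in EXPECTED_COUNTS)
total_test = sum(test for _, test in EXPECTED_COUNTS)
```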
[1] There is more than one version of the dataset; we are using a subset of the collection called Reuters-21578. This is now considered the “standard” variant of the Reuters dataset for the purposes of benchmarking results.