ABCs of Text Categorization
This document describes some of the basics of text categorization, enough to get you started on the project. The basic problem is to determine the category or categories that a given (test) document belongs to using classifiers trained on the set of training documents (docs for short). We sometimes refer to a set of docs as a corpus (e.g. training corpus). In many of the learning algorithms we will deal with, we train a binary classifier for each category separately in order to simplify things, so that if there are m categories, we will train m classifiers, one for each category. Given a test document, each classifier determines whether or not the document belongs to its category, independent of other classifiers. An exception to this includes the nearest neighbor classifier, where there is no real classifier per category, and the votes of the nearest neighbor training docs determine the final category or categories of the test document.
For most of the training algorithms you may write, documents should be represented as vectors of numbers.
In such representations, documents are viewed as n-dimensional vectors, where n is the total number of unique words appearing in some training document. As n is relatively large (in the thousands), efficient vector representations only keep track of the words actually appearing in a document. Such a representation is called a sparse representation.
Binary Representation.
One simple and common representation is the binary representation, where appearance of a word in the document is indicated with a 1 in the document vector. All non-present words have a weight of zero and are not represented explicitly. You may use unique ids for each word (implemented through a hash map for example) and represent a document by an array of integers, each entry corresponding to the id of the word present. For computing dot products and other operations on documents, it would be best to represent the documents as sorted arrays. Here is an example. Say we have the following mapping of words to ids: (“boeing”, 23), (“flight”, 5), (“aviation”, 312), and (“police”, 11). And a document d contained only the first three words. Then its binary sparse representation (sorted on word id) is:
|
5 |
23 |
312 |
Note that the first id being 5 means that words with ids less than 5 do not appear in d, and “flight” appears in d. The next id being 23 similarly implies that words with ids less than 23, but greater than 5, including “police”, do not appear in d. Writing operations such the dot product of two vectors becomes relatively efficient and easy (how?). For training classifiers such as the perceptron learning algorithm and exponential gradient, try using the binary representation, as training them may take a long time otherwise (see below under feature space reduction). In any case, I strongly recommend that you use a sparse array sorted on id in any case.
tfidf Representation and Normalization
For the centroid algorithm, you should use the cosine normalized tfidf (cn for short) representation. I also recommend it for the nearest neighbors algorithm if you choose to implement it. In general, as the cn representation of a document has more information than the binary representation, it should aid the training algorithm, but may slow the training process, which can be significant for some algorithms. Here’s how the representation is defined and constructed. First we define the frequency vector of a document, which is the vector of frequency of appearance of each term of the document. For example, if document d of the above example has “boeing” (id 23) appearing 3 times, “flight” appearing 2 times, and “aviation” appearing once, its (sparse, sorted on id) frequency vector is:
|
5 |
23 |
312 |
|
2 |
3 |
1 |
Now, appearance of some of the words in a document is more informative, i.e. more indicative of the attributes of the document (for example its category) than others. For instance, if a word appears in almost all docs, we expect that it is not very informative. The tfidf measure of a word, which stands for “term frequency, inverse document frequency” is a re-weighting used to account for exactly this (training corpus frequency) aspect of words. Denote the term frequency of a word w, i.e. its frequency in a doc d, by tf (document dependent), denote the total number of (training) corpus docs by N, and let df(w) denote the document frequency of the word (document independent), i.e. the number of docs in which the word appears. We define the tfidf weight of w in d:
![]()
In our example above, assume N=1000, and df(“boeing”)=10,
then tfidf(“boeing”)=3log(1000/10)=6. Assuming we have df(“aviation”)=1 and df(“flight”)=4, we
get the tfidf vector corresponding to doc d:
|
5 |
23 |
312 |
|
5.40 |
6 |
3 |
Note that the higher the document frequency factor of a word, the lower its tfidf weight. In our example, the relative weight of “aviation” to other term weights in the tfidf vector is higher than in the (raw) frequency vector. There are other definitions of tfidf too. In one variant, log(tf+1), is used instead of tf in the above formula. You may use either definition.
Finally, as it stands, longer documents, with more terms and
higher term frequencies, tend to get larger dot products than smaller
documents, and this skews and biases similarity measures. So for several algorithms (including the
centroid algorithm to be discussed soon) it is best to adjust the weights of
words by a measure of the length of a document vector. This process is called (vector) normalization. A very common normalization is dividing the
weights by the
norm (or the
length or the
Euclidean length) of the documents.
Let
denote the vector corresponding to document d,
with
denoting its ith
component. The
norm of vector
, commonly denoted by
is simply the square root of the dot product of the document
vector by itself:
![]()
Division by the
norm is also
called cosine normalization. You
will use cosine normalized representation of the tfidf document vectors for the
centroid algorithm. For the above tfidf
vector, its
norm is around
8.61, so its cosine normalized (or
normalized vector)
is
, or in our array format:
|
5 |
23 |
312 |
|
0.63 |
.70 |
.35 |
Note that the
norm of an
normalized vector is
1.
There a few other norms that may turn out to be useful to
you depending on the learning algorithm you may use. In general, the
norm is defined as:
![]()
where
denotes the absolute value of
. Two other common
norms are when k=1, so that the length is just the sum of the absolute
values of the vector components, and when
from which we get
, which is called the max norm.
The centroid algorithm is one of the simplest to describe
and implement. For each category one
centroid (classifier) is created. For a
certain category (say sports) let
denote the set of positive training documents, and let
denote its size. Then
the centroid of the category is:
![]()
Therefore, the centroid is basically the average of the documents belonging to the set. You should use the cosine-normalized (l2 normalized) tfidf representation of the docs to get best results for the centroid algorithm. However, you can experiment with other representations (no tfidf or no normalization) and investigate categorization quality.
Now given a document d, a measure of similarity of document d and centroid c is the cosine of the angle between them, or:
![]()
Note that if d is already
normalized, then
there is no need to compute and divide by
as it is 1. Also with entries of d and c
being positive, we have
In general, recall
that the cosine is in the [-1,1] range.
Finally, if a threshold t is chosen (below, we
say how), then the algorithm reports that d belongs to the
category if and only if
Note that another
method of categorization would be to assign to the document the category whose
centroid has the highest similarity to the document (what are the pros and cons
of this method?).
![]()
(if |
|=0 define recall to be 1.0). On the other hand, let FP denote the false positives, i.e.
those the classifier falsely attributes to the category, then precision of the
classifier is:
![]()
Note that
is the total number of docs that the classifier reported as
positive. For example, if a classifier
classifies 25 docs to belong to a certain category, out of which 20 truly
belong to the category (are true positives) and a total of 30 docs belong to
the category, then the precision is 20/25=0.8 and recall is 20/30=0.67. In this case, the classifier had 5 false
positives and 10 false negatives.
Precision and recall usually trade off against each other, for example, if we seek relatively high precision, we may need to give up on recall. For classifiers that output scores[1], one gets different precision and recall measures by applying different thresholds. For example, consider the following scenario for a hypothetical classifier whose score ranges in [0,1].
In the picture, we have 10 positive examples. One positive example gets a score of over 0.9 by the classifier (in case of the centroid classifier this would be its similarity to the centroid). Setting the threshold at 0.9, we get a precision of 1.0, but recall is only 0.1. As we decrease the threshold, at 0.8, precision drops down to 0.5 while recall remains the same. From then on, as we decrease further precision as well as recall go up, till t=0.6.
Often, we may want a combined measure of performance of the classifier. The F measure is such:
![]()
The F measure is more sensitive to the difference between recall and precision than just taking the average.
When one wants to classify unseen documents, one needs to pick a threshold. The choice of the threshold may come down to the minimum level of recall or precision that one can live with. For example, pick a threshold so that recall is maximized while precision remains above 0.9 (or the maximum precision, if 0.9 is not reachable). In this project, you should pick the threshold that gives the highest F measure on the training data (or a validation subset), for calculating the precision, recall and the F measure of the resulting classifier on the test data. Also, compare such an F measure to the maximum F measure of the centroid on the test set as well (i.e. without picking a threshold before hand).
In order to save space and speed up (and at times improve) training, it is common to drop terms that will likely not be informative or useful. This process is referred to as feature selection or dimensionality reduction in machine learning. The centroid algorithm is pretty fast and such a reduction is not required, but for the perceptron and your algorithm of choice, feature reduction is likely very useful for speeding up. I will describe two methods you can use to achieve this: reduction by document frequency and by the mutual information (MI) measure. You should implement the MI measure and I leave the implementation of the other method to you.
Term reduction by document frequency is simple: drop all words that appear in no more than a specified number m of (training) docs. Even for modest values of m such as 1, the number of terms usually halves. I recommend using m=2.
Reduction through mutual information is done per category basis, i.e. the set of words that are retained for one category is different from the set of words kept for another. The mutual information of a word w with respect to a category C, is defined by:

The indices i and j correspond to absence or
presence of the concept or word. For
example,
is the probability that a document does not contain word w
(the event
) but belongs to the concept (the event
), and
is the probability that a document contains w. Thus there are four possibilities and we get
four terms in the sum. The
probabilities are estimated from the training corpus. For example, if there are a 1000 documents, 30 docs belong to the
category, 50 docs contain the word, and 20 docs belong to the category and
contain the word, then
, and
, and
. A word that appears
more frequently in the positive docs than in negative docs (or more in negative
docs than in positive docs) has a relatively higher MI for such a category,
compared to a word whose appearance frequency does not differ significantly.
In MI reduction, you compute the MI score of all words with respect to the category you want to build a classifier for, and then keep the top m such words. Around a few hundred kept words (say 300) should suffice. You may experiment with smaller of higher m. Then, the words eliminated are dropped from the vectors of all docs. For simplicity, I suggest you use the binary representation when you apply the MI reduction.
[1] Strictly speaking, those outputting scores are not yet “classifiers,” but with addition of a threshold, they become classifiers.