Evaluation methods in text categorization
Categorizing textual data is the task of determining the correct class or classes of unseen documents based on some learning examples. Categories or classes are predetermined therefore a supervised learning algorithm is needed to classify documents.
Formally, following the notations used by Sebastiani in [Sebastiani02], let and be the set of the documents and categories respectively. The task is to approximate as precisely as it is possible the unknown target function by , where and denotes "True" and "False", respectively. If there are only two classes, then the categorization is called binary, while if is called multiclass. Moreover, if the categories are disjunctive or non-overlapping, that is a document belongs to exactly one category, is called single-label, while if the categories are overlapping, that is a document can belong to arbitrary number of categories, it is said multilabel.
A text categorization system can use hard or ranking policy. Hard categorization means that the classifier takes a hard decision about a document-category pair, while ranking categorization ranks the categories for a given document based on some distance metric. This metric is called categorization status value, , . The greater the CSV-value of a category for a document the better the document belongs to the category.
Precision and recall
Precision and recall are the most common measures for evaluating an information retrieval system. Precision is the proportion of returned documents that are targets, while recall is the proportion of target documents returned.
The definition can be rewritten using probabilities. Suppose the learning algorithm was run and decisions had been made. This set constitutes our space. The precision can be written as
while recall is
Besides these accuracy and error another two common IR measures are sometimes used to characterize TC systems:
There are two conventional methods of calculating the performance of a text categorization system based on precision and recall. The first is called micro-averaging, while the second one macro-averaging. Micro-averaged values are calculated by constructing a global contingency table (see above for one class) and then calculating precision and recall using these sums. In contrast macro-averaged scores are calculated by first calculating precision and recall for each category and then taking the average of these. The notable difference between these to calculations is that micro-averaging gives equal weight to every document (it is called a document-pivoted measure) while macro-averaging gives equal weight to every category (category-pivoted measure).
Precision-recall breakeven point
The properties or the expected behaviours of text categorization/information retrieval systems can vary. For example for one system it is better to return mostly correct answers, while in another it is better to cover more true positives. There is a tradeoff between precision and recall: if a classifier says "True" to every category for every document, then it receives perfect recall, but very low precision. However it can be easily seen that if a classifier says "False" for every category, except one which is correct () then it will have a precision equal to but a very low recall. That is why it makes comparison between systems easier if the system is characterized by a single value, the breakeven point (BEP), which is the point at which precision equals recall. This can be achieved by tuning the parameters of the system. When there is no such point (because , and are natural numbers) the average of the nearest precision and recall is used, and is called interpolated BEP.
For example in ranking categorization models for each class an optimal CSV threshold has to be determined such that . If then the classifier says "True", otherwise says "False".
11-point average precision
The 11-point average precision is another measure for representing performance with a single value. For every category the CSV threshold is repeatedly tuned such that allow the recall to take the values . At every point the precision is calculated and at the end the average over these eleven values is returned [Sebastiani02]. The retrieval system must support ranking policy.
Yang [Yang99] gives the following detailed algorithm for the calculation of this value. The precision and recall values for a given document and a threshold is calculated as
- For each document calculate the precision and recall at each position in the ranked list where a correct category is found.
- For each interval between thresholds use the highest precision value in that interval as the "representative" precision value at the left boundary of this interval.
- For the recall threshold of the "representative" precision is either the exact precision value if such point exists, or the precision value at the closest point in terms of recall. If the interval is empty we use the default precision value of .
- Interpolation: At each of the above recall thresholds replace the "representative" precision using the highest score among the "representative" precision values at this threshold and the higher thresholds.
- Per-interval averaging: Average per-document data points over all the test documents at each of the above recall thresholds respectively. This step results in 11 per-interval precision scores.
- Global averaging: Average of the per-interval average precision scores to obtain a single-numbered performance average. The resulting value is called the 11-point average precision.
The E and F-measures
Van Rijsbergen [VanRijsbergen79, Ch.7] introduced two measures for evaluating information retrieval systems: the E and the F-measure. Since then the F-measure has become the most frequently used such measure.
The E-measure introduced by van Rijsbergen measures the non-overlapping grade between the retrieved and the actually relevant documents. Let and denote the set of relevant and the set of retrieved documents, respectively (see Fig. ). Lets denote the non-shaded region by . Then, by definition,
where the denominator stands for normalizing the value, because we are interested only in the proportion of the relevant and non-relevant documents. Moreover from the contingency table in Section we can write and . Using these the E-measure becomes
which is the harmonic mean of the precision and recall. Thus, the more efficient the system is, the lower the E-value we obtain. The measure achieves its minimum at the highest values such that or if such a point does not exist.
The F-measure is simply the complement of the E-measure, indicating the extent of overlap of the above sets, that is
A more generalized version of these are the and -measures, introducing a parameter as a weighting factor for the importance of the recall (or precision) [VanRijsbergen79], [Lewis95]:
By setting to , we obtain the above and measures, that is why the conventional notation uses and denoting them. The most frequently applied setting is using . If , it means that recall is half as important as precision, in case of they get equal importance, while for example is used to set recall twice as important as the precision.
The relation of the BEP and the F-value
Because the BEP and the F-value captures the tradeoff between precision and recall, and using these a single value can be used for characterizing system performance, the researchers by choice use one these. But since there is no convention about which one to use, different papers use different measures. Both measures achieve its maximum at the highest values such that , in which case the BEP and the F-value are also equal. This section analyses the relation between them, which will be denoted by a bold question mark, .
By multiplying the corresponding parts we get
Now if , then also (from the definition), therefore "=". If , then also , hence similarly "=". In the case and , we can proceed in the following way: divide both sides of the relation by , and denote ,
This can be written in the form of a product, that is
Now we can differentiate three cases:
- : , thus "="
- : , thus and , therefore ">"
- : , hence . Deciding the sign of the second component is somewhat longer. Solving the equation we get . The function achieves its minimum at , so in monotonically decreases, and in monotonically increases, and . Therefore
- if , that is "<"
- if , i.e. "". So, unless , the BEP is greater or equal to .
Because is just a specific setting of this is also valid for it. But if we examine the third case (because the other cases do not depend on ), we see that , because , which means that
that is BEP (this could be shown by simply recalling that the arithmetic mean is always greater or equal than the harmonic mean). From these it follows that if the performance of an information retrieval system is computed using , it will show a lowest performance unless .
Receiver Operator Characteristics
- The same in PDF: [pdf]
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. [pdf]
- [Lewis95] David D. Lewis. Evaluating and optimizing autonomous text classification systems. In SIGIR '95, pages 246--254, New York, NY, USA, 1995. ACM Press.
- [VanRijsbergen79] Cornelis J. Van Rijsbergen. Information Retrieval. Butterworths, 1979.
- [Sebastiani02] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.
- [Yang99] Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69--90, 1999.