Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization
José María Gómez Hidalgo*, Enrique Puertas Sanz*, and Manuel J. Maña López**
* Departamento de Inteligencia Artificial, Universidad Europea de Madrid – CEES (Spain) email: {jmgomez, epuertas}@dinar.esi.uem.es
ABSTRACT
In recent years, Unsolicited Bulk Email (UBE) has become an increasingly
important problem with a large economic impact. In this paper, we discuss
cost-sensitive Text Categorization methods for UBE filtering. Specifically, we
have evaluated a range of
Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector
Machines and Rocchio), made cost sensitive through several methods (Threshold
optimization, Instance Weighting, and MetaCost). For the evaluation, we have
used the Receiver Operating Characteristic Convex Hull method, which best suits
classification problems in which target conditions are not known, as is the
case here. Our results do not show a
dominant algorithm or method for making algorithms cost-sensitive, but they are
the best reported on the test collection used, and they approach the accuracy of
real-world manual classifiers.
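
As an illustration of the evaluation idea named above, the following is a minimal sketch of how a Receiver Operating Characteristic convex hull can be computed from a set of classifier operating points, and how a point on the hull can be selected once class distribution and misclassification costs become known. It is not the implementation used in this work. The function names (`roc_convex_hull`, `best_operating_point`), the sample (false positive rate, true positive rate) pairs, and the cost values are all hypothetical and chosen only for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of ROC points, sorted by false positive rate.

    Each point is a (fpr, tpr) pair. The trivial classifiers (0, 0) and
    (1, 1) are always added, since they are always available.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # joining its predecessor to the new point (no clockwise turn).
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_operating_point(hull, p_pos, cost_fp, cost_fn):
    """Pick the hull point with the lowest expected misclassification cost.

    p_pos is the prior probability of the positive class (here, UBE);
    cost_fp and cost_fn are the costs of a false positive and a false negative.
    """
    def expected_cost(point):
        fpr, tpr = point
        return p_pos * (1.0 - tpr) * cost_fn + (1.0 - p_pos) * fpr * cost_fp

    return min(hull, key=expected_cost)


# Hypothetical operating points of four classifiers (illustrative values only).
sample_points = [(0.1, 0.6), (0.3, 0.7), (0.2, 0.4), (0.5, 0.9)]
hull = roc_convex_hull(sample_points)
```

Classifiers whose points fall below the hull are never optimal under any combination of class distribution and costs, which is why the hull is useful when target conditions are not known in advance. In the sketch, raising `cost_fp` relative to `cost_fn` moves the selected point toward the lower-left corner of ROC space, that is, toward classifiers that rarely misclassify legitimate email.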