In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We provide a systematic analysis of how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success. Based on these results, we analyze the strengths and limitations of current rule-based IE methods in general. Specifically, we explain limitations in the information made available to these methods, and in the representations they use. We also discuss how confidence values returned during extraction are not true probabilities. In this analysis, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can improve the regularity of, and hence performance on, natural text extraction tasks. We conclude with proposals for enriching the representational power of rule-based IE methods to exploit these and other types of regularities.
The authors of these documents have submitted their reports to this technical report series for the purpose of non-commercial dissemination of scientific work. The reports are copyrighted by the authors, and their existence in electronic format does not imply that the authors have relinquished any rights. You may copy a report for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect the author's copyright. For information concerning the use of this document for other than research or instructional purposes, contact the authors. Other information concerning this technical report series can be obtained from the Computer Science and Engineering Department at the University of California at San Diego, firstname.lastname@example.org.