Study On Text Preprocessing And Automatic Rule Learning Technology For Information Extraction

Posted on: 2006-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: N Ye
Full Text: PDF
GTID: 2168360155958183
Subject: Computer application technology
Abstract/Summary:
With the rapid spread and development of Internet technology, the amount of online information is growing explosively, and how to find useful information in this huge source has become a common concern. As a precursor to deep data mining, information extraction extracts specified facts from natural language documents through shallow analysis, and it has become a hot research topic in natural language processing.

Information extraction refers to the task of extracting information from a text in the form of text strings, which are placed into slots labeled to indicate the kind of information that can fill them. The technology integrates many natural language processing techniques, including text preprocessing, text structure analysis, and inter-text inference. Most information extraction systems perform extraction on the basis of patterns (rules), so the construction of the rule library determines the performance of the whole extraction system. This thesis studies text preprocessing and automatic rule acquisition for information extraction.

In text preprocessing, we implemented the recognition of simple named entities with a deterministic finite automaton. Recognizable entity types include money, time, email, phone number, web address, number string, and other symbols. The design of the automaton takes the characteristics of each kind of entity into account, and it achieved good recognition results when tested on the large-scale People's Daily corpus.

Traditional information extraction systems require experts to build rules by hand. Constructing the rule base is the knowledge acquisition bottleneck, and the knowledge representation capability also limits extraction performance. Inductive logic programming (ILP), which is based on first-order predicate logic, can describe and learn complex relations naturally, and is therefore well suited to knowledge representation and automatic rule acquisition in information extraction. We put forward an automatic multi-slot rule acquisition method under the ILP framework that solves the knowledge acquisition and representation bottleneck. The learned rules have good extensibility. Linguistic resource requirements are greatly reduced because traditional semantic and syntactic analysis and complex named entity recognition are no longer necessary. Experimental results show that the rules acquired by this algorithm achieve higher precision and recall than zero-order rules.
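
To illustrate the automaton-based recognition of simple entities mentioned in the abstract, the following is a minimal sketch and not the thesis implementation. It assumes one small table-driven automaton per entity type, covers only two of the listed types (number string and email) in heavily simplified ASCII form, and scans with a longest-match policy. All names (`char_class`, `DFAS`, `longest_match`, `find_entities`) and the transition tables are hypothetical.

```python
import string

def char_class(ch):
    """Map a character to the input symbol used by the transition tables."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "letter"
    if ch in ".@":
        return ch
    return "other"

# Hypothetical, simplified automata. State 0 is always the start state.
DFAS = {
    # digits, optionally followed by "." and more digits
    "NUMBER": {
        "delta": {
            (0, "digit"): 1, (1, "digit"): 1,
            (1, "."): 2, (2, "digit"): 3, (3, "digit"): 3,
        },
        "accept": {1, 3},
    },
    # local part, "@", then at least two dot-separated domain labels
    "EMAIL": {
        "delta": {
            (0, "letter"): 1, (0, "digit"): 1,
            (1, "letter"): 1, (1, "digit"): 1, (1, "."): 1,
            (1, "@"): 2,
            (2, "letter"): 3, (2, "digit"): 3,
            (3, "letter"): 3, (3, "digit"): 3, (3, "."): 4,
            (4, "letter"): 5, (4, "digit"): 5,
            (5, "letter"): 5, (5, "digit"): 5, (5, "."): 4,
        },
        "accept": {5},
    },
}

def longest_match(dfa, text, start):
    """Run one automaton from `start`; return the end index of the longest
    accepted prefix, or -1 if no prefix is accepted."""
    state, last_accept, i = 0, -1, start
    while i < len(text):
        state = dfa["delta"].get((state, char_class(text[i])))
        if state is None:  # no transition defined: the automaton is stuck
            break
        i += 1
        if state in dfa["accept"]:
            last_accept = i
    return last_accept

def find_entities(text):
    """Scan left to right, keeping the longest match over all automata."""
    results, i = [], 0
    while i < len(text):
        best_end, best_type = -1, None
        for name, dfa in DFAS.items():
            end = longest_match(dfa, text, i)
            if end > best_end:
                best_end, best_type = end, name
        if best_type is None:
            i += 1  # nothing starts here; move on one character
        else:
            results.append((best_type, text[i:best_end]))
            i = best_end
    return results

print(find_entities("Call 12.5 or mail ye@example.com now."))
```

Keeping a separate table per entity type makes each automaton easy to tune to the characteristics of its entity, at the cost of running several automata at every position. A production recognizer for a Chinese corpus would also need character classes for full-width digits, currency and time units, and similar symbols, which this sketch omits.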
Keywords/Search Tags: information extraction, text preprocessing, deterministic finite automaton, automatic rule acquisition, inductive logic programming