Study On Text Preprocessing And Automatic Rule Learning Technology For Information Extraction

Posted on:2006-06-02

Degree:Master

Type:Thesis

Country:China

Candidate:N Ye

Full Text:PDF

GTID:2168360155958183

Subject:Computer application technology

Abstract/Summary:

With the rapid popularization and development of Internet technology, the amount of online information is growing explosively, and how to find useful information in such a huge source has become a common concern. As a precursor of deep data mining technology, information extraction extracts specified facts from natural language documents through shallow analysis, and has thus become a hot research topic in natural language processing.

Information extraction refers to the task of extracting information from a text in the form of text strings, which are placed into slots labeled to indicate the kind of information that can fill them. The technology integrates many natural language processing techniques, including text preprocessing, text structure analysis, and inter-text inference. Most information extraction systems perform extraction on the basis of patterns (rules), so the construction of the rule library determines the performance of the whole extraction system. This thesis studies text preprocessing and automatic rule acquisition technology for information extraction.

In text preprocessing, we implemented the recognition of simple named entities with deterministic finite automata. Recognizable entity types include money, time, email, phone number, web address, number string, and other symbols. The design of each automaton takes full account of the characteristics of its entity type, and achieved good recognition results when tested on the large-scale People's Daily corpus.

Traditional information extraction systems require experts to build rules by hand. Constructing the rule base is the knowledge acquisition bottleneck, and the knowledge representation capability also limits extraction performance. Inductive logic programming (ILP), which is based on first-order predicate logic, can describe and learn complex relations naturally, and is therefore well suited to knowledge representation and automatic rule acquisition in information extraction. In this thesis we put forward an automatic multi-slot rule acquisition method under the ILP framework that solves the knowledge acquisition and representation bottleneck. The learned rules have good extensibility. The requirement for linguistic resources is largely reduced because traditional semantic and syntactic analysis and complex named entity recognition are no longer necessary. Experimental results show that the rules acquired by this algorithm achieve higher precision and recall than zero-order rules.
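
The abstract's idea of recognizing simple named entities with a deterministic finite automaton can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the state names, the character classes, the two entity labels (`NUMBER`, `MONEY`), and the helper functions `char_class`, `longest_match`, and `find_entities` are all assumptions chosen for illustration, and the automaton covers only plain numbers and currency-prefixed amounts.

```python
# Illustrative DFA for two simple entity types (assumed, not from the thesis).

def char_class(ch):
    """Map a character to the DFA's input alphabet (hypothetical classes)."""
    if ch in "0123456789":
        return "digit"
    if ch == ".":
        return "dot"
    if ch in "$¥":
        return "currency"
    return "other"

# Transition table: (state, character class) -> next state.
# A missing entry means the automaton rejects and scanning stops.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("start", "currency"): "sign",
    ("int", "digit"): "int",
    ("int", "dot"): "int_dot",
    ("int_dot", "digit"): "frac",
    ("frac", "digit"): "frac",
    ("sign", "digit"): "money_int",
    ("money_int", "digit"): "money_int",
    ("money_int", "dot"): "money_dot",
    ("money_dot", "digit"): "money_frac",
    ("money_frac", "digit"): "money_frac",
}

# Accepting states and the entity label each one yields.
ACCEPTING = {
    "int": "NUMBER",
    "frac": "NUMBER",
    "money_int": "MONEY",
    "money_frac": "MONEY",
}

def longest_match(text, start):
    """Run the DFA from `start`; return (end, label) of the longest accepted prefix."""
    state = "start"
    best = None
    pos = start
    while pos < len(text):
        state = TRANSITIONS.get((state, char_class(text[pos])))
        if state is None:
            break
        pos += 1
        if state in ACCEPTING:
            best = (pos, ACCEPTING[state])
    return best

def find_entities(text):
    """Scan left to right, emitting the longest entity found at each position."""
    entities = []
    pos = 0
    while pos < len(text):
        match = longest_match(text, pos)
        if match is None:
            pos += 1
        else:
            end, label = match
            entities.append((text[pos:end], label))
            pos = end
    return entities

print(find_entities("It cost $12.50 for 3 items."))
```

Keeping the automaton as an explicit transition table makes each entity type's structure visible and makes recognition a single left-to-right pass; further types such as time or email would, under the same assumptions, add their own states and character classes.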

Keywords/Search Tags:

information extraction, text preprocessing, deterministic finite automaton, automatic rule acquisition, inductive logic programming

Related items

1	Genetic Inductive Logic Programming Research
2	Research On Interrupt Processing Mechanism Of Embedded System Based On Deterministic Finite Automaton
3	Research On Inductive Learning Of Discrete Dynamic Systems Based On Logic Programs
4	Deterministic inductive logic: A multi-valued logic for reasoning about categories
5	Memory-Efficient Regular Expression Matching Algorithms For Deep Packet Inspection
6	Web Information Extraction Based On Inductive Study
7	Parallel inductive logic programming for pharmacophore discovery
8	A Combining Deterministic Finite Automaton With Logic Rules Approach For Analyzing Of E-commerce Protocol
9	Research And Implementation Of Protocol Identification Based On Regular Expression
10	Design And Implement Of The Embedded HTML Parser Based On Automaton