Research Of Information Extraction Algorithm

Posted on:2007-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:F F Wu

Full Text:PDF

GTID:2178360182996428

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the popularization of Computer and the development of theinternet techniques, large amounts of information appear in the form ofonline documents. For the sake of dealing with the information on theinternet increasing exponentially, some automatic tools are needed which canhelp the people to find the information quickly that they really require. Aresearch focusing on information extraction comes forth on this background.Text information extraction is a natural language processing task thatinvolves automatically extracting specific types of information from text,such as events and facts, forms structured data, and then populates databaseslots for queries. i.e., Information extraction can extract events, facts andrelations interested by people from the text, and then analyze the direction,give the tabloid, or serve online from the database. This thesis describes thehistory of information extraction, introduces the different and relationbetween information extraction and information research, text understanding,and lists the objects that information extraction technology can deal with. Thethesis analyzes the research outcome about the technology of informationextraction on English text, and brings forward some new ideals and methods.A late research on Chinese text focuses on the identification of Chinesenamed entity, and states an exploring moment at resigning and establishing afull system of information extraction.The thesis summarizes information extraction in general in chapter 1,lists the models of it and enumerates the work in the thesis. From chapter 2,the thesis studies some arithmetic. In chapter 2, we will introduce HiddenMarkov Model (HMM). HMM is a random process of double Markov,including Markov chain which has probability of transition between statesand observation value of outputs. States in HMM are uncertain and invisible,which can be represented by the random process of observation list. HMM isa powerful probabilistic tool for modeling time series data, and has beenapplied with success to many language-related tasks such as part of speechrecognition, text segmentation and topic detection. HMM has three problems:evaluation problem which allows us to choose the model which best matchesthe sequence;decoding problem, i.e., given the parameters and a sequence ofsymbols, how can we efficiently computer the most likely state sequenceassociated with the symbol sequence;training problem which deals with thetraining of an HMM given a set of observation sequences. Text informationextraction will need to deal with training problem and decoding problem. Atlast, we will introduce the problem of data sparse and methods to deal with it.Chapter 3 firstly analyzes how to learn model topology structure andconcrete parameters for text information extraction, introduces the materialapplication of training problem and decoding problem in informationextraction, and then making use of the information of layout and segmentlists, an algorithm is proposed using HMM and technology of data sparse forinformation extraction based text blocks. The result shows: the recall andprecision rate are all improved a lot compared with existing methods whichare based on words and traditional HMM, efficiency has also been raisedgreatly.HMM and Viterbi algorithm are applied abroad in the area ofinformation extraction, almost become the standard algorithm for the systemof information extraction. But in fact, whether the capability of a systembased on Stat. is good or bad rests with a fit statistical philology model thatcan deal with the language, a valid algorithm based on the model and amoderate scale corpus. When the model and corpus are certain, a validalgorithm can find the proximal result to the best one, so the capability of themodel is exerted best. In chapter 4, we present a heuristic informationextraction algorithm on text blocks and build a system with it. The systemutilizes the semantic characteristic and structure characteristic of the text tomake certain the states with characteristic. On the basis of this result, wecarry on extracting the remainder states having no characteristic with analgorithm incorporating backward-dynamic-programming and forward-A*algorithm. This thesis has tested 100 pieces of headers of computer sciencepaper of the data provided by the search-engine research group from CMUuniversity of USA. The result shows: the recall and precision rate are allimproved a lot compared with existing methods which are based on wordsand traditional HMM.We also do research on the identification of Chinese named entity.Extraction of Chinese names is the importance and difficulty. For the problemof name extraction whose tail character and the following character is a word,the present methods have certain objection. In chapter 5, we have designed anextraction system of Chinese names. The system adopts the neural network todeal with the Chinese word segmentation, and then carries on extraction ofname by rearmounted characteristic word according to name. The systemsolves the question of extracting Chinese name successfully whose tailcharacter and the following character is a word. In this chapter, the testingdata is the sentences with this kind of name in the corpus base of "PeopleDaily" of January in 1998, the result shows: the recall and precision rate areall improved a lot compared with existing methods.In the final chapter of this thesis, we sum up the algorithm used in thethesis and the experiment results, and then we introduce the work in thefuture simply.

Keywords/Search Tags:

Information Extraction, HMM, A~* algorithm, backward-dynamic, neural network

PDF Full Text Request

Related items

1	Double-layer HDR Is Backward Compatible With Parameter Prediction In Codec Systems
2	Research On Dynamic Hand Gesture Recognition Algorithm Based On Convolutional Neural Network
3	Research Of Information Extraction Algorithm On Sentiment Analysis Of Household Appliances Enterprises Network Reviews
4	A Study On Direct Supervised Learning Algorithm For Deep Spikliig Neural Networks
5	The Research On Text Relation Extraction Based On Convolution Neural Network
6	Research On The Method Of Gait Feature Extraction And Recognition
7	Chinese Information Extraction Based On Hybrid Neural Network
8	The Offline Processing Technology Of GNSS/SINS Integrated Navigation Based On The Smoothing And Neural Network
9	Based On Neural Networks For Web Information Extraction System
10	Study On Model And Algorithm Of Dynamic Feature Fusion Based On Information Sources Selection And Sequential Extraction