Font Size: a A A

Research Of Information Extraction Algorithm

Posted on:2007-03-07Degree:MasterType:Thesis
Country:ChinaCandidate:F F WuFull Text:PDF
GTID:2178360182996428Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the popularization of Computer and the development of theinternet techniques, large amounts of information appear in the form ofonline documents. For the sake of dealing with the information on theinternet increasing exponentially, some automatic tools are needed which canhelp the people to find the information quickly that they really require. Aresearch focusing on information extraction comes forth on this background.Text information extraction is a natural language processing task thatinvolves automatically extracting specific types of information from text,such as events and facts, forms structured data, and then populates databaseslots for queries. i.e., Information extraction can extract events, facts andrelations interested by people from the text, and then analyze the direction,give the tabloid, or serve online from the database. This thesis describes thehistory of information extraction, introduces the different and relationbetween information extraction and information research, text understanding,and lists the objects that information extraction technology can deal with. Thethesis analyzes the research outcome about the technology of informationextraction on English text, and brings forward some new ideals and methods.A late research on Chinese text focuses on the identification of Chinesenamed entity, and states an exploring moment at resigning and establishing afull system of information extraction.The thesis summarizes information extraction in general in chapter 1,lists the models of it and enumerates the work in the thesis. From chapter 2,the thesis studies some arithmetic. In chapter 2, we will introduce HiddenMarkov Model (HMM). HMM is a random process of double Markov,including Markov chain which has probability of transition between statesand observation value of outputs. States in HMM are uncertain and invisible,which can be represented by the random process of observation list. HMM isa powerful probabilistic tool for modeling time series data, and has beenapplied with success to many language-related tasks such as part of speechrecognition, text segmentation and topic detection. HMM has three problems:evaluation problem which allows us to choose the model which best matchesthe sequence;decoding problem, i.e., given the parameters and a sequence ofsymbols, how can we efficiently computer the most likely state sequenceassociated with the symbol sequence;training problem which deals with thetraining of an HMM given a set of observation sequences. Text informationextraction will need to deal with training problem and decoding problem. Atlast, we will introduce the problem of data sparse and methods to deal with it.Chapter 3 firstly analyzes how to learn model topology structure andconcrete parameters for text information extraction, introduces the materialapplication of training problem and decoding problem in informationextraction, and then making use of the information of layout and segmentlists, an algorithm is proposed using HMM and technology of data sparse forinformation extraction based text blocks. The result shows: the recall andprecision rate are all improved a lot compared with existing methods whichare based on words and traditional HMM, efficiency has also been raisedgreatly.HMM and Viterbi algorithm are applied abroad in the area ofinformation extraction, almost become the standard algorithm for the systemof information extraction. But in fact, whether the capability of a systembased on Stat. is good or bad rests with a fit statistical philology model thatcan deal with the language, a valid algorithm based on the model and amoderate scale corpus. When the model and corpus are certain, a validalgorithm can find the proximal result to the best one, so the capability of themodel is exerted best. In chapter 4, we present a heuristic informationextraction algorithm on text blocks and build a system with it. The systemutilizes the semantic characteristic and structure characteristic of the text tomake certain the states with characteristic. On the basis of this result, wecarry on extracting the remainder states having no characteristic with analgorithm incorporating backward-dynamic-programming and forward-A*algorithm. This thesis has tested 100 pieces of headers of computer sciencepaper of the data provided by the search-engine research group from CMUuniversity of USA. The result shows: the recall and precision rate are allimproved a lot compared with existing methods which are based on wordsand traditional HMM.We also do research on the identification of Chinese named entity.Extraction of Chinese names is the importance and difficulty. For the problemof name extraction whose tail character and the following character is a word,the present methods have certain objection. In chapter 5, we have designed anextraction system of Chinese names. The system adopts the neural network todeal with the Chinese word segmentation, and then carries on extraction ofname by rearmounted characteristic word according to name. The systemsolves the question of extracting Chinese name successfully whose tailcharacter and the following character is a word. In this chapter, the testingdata is the sentences with this kind of name in the corpus base of "PeopleDaily" of January in 1998, the result shows: the recall and precision rate areall improved a lot compared with existing methods.In the final chapter of this thesis, we sum up the algorithm used in thethesis and the experiment results, and then we introduce the work in thefuture simply.
Keywords/Search Tags:Information Extraction, HMM, A~* algorithm, backward-dynamic, neural network
PDF Full Text Request
Related items