
Extract Information Based On Semantic And Layout Of Online Characters

Posted on: 2009-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: M Yan
Full Text: PDF
GTID: 2208360242985845
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the rapid growth in popularity and development of the Internet and Web technology, an ever-increasing volume of data is published on the Web. The World Wide Web has become the largest information resource, but for any particular user the useful information is only a small fraction of what the Web contains. This is the so-called "rich data, poor information" problem. It has motivated Web information extraction (IE), which has recently become an active research topic.
This thesis adopts an IE algorithm that combines rules with statistics, and describes the information to be extracted using ideas from ontology in order to generate the extraction rules. An IE system for information about people, called PeopleInfoAbstract, has been developed; it automatically extracts information from semi-structured web pages about people.
The system is composed of four modules: the web page collecting module, the web page preprocessing module, the IE module, and the selecting module. The web page collecting module first defines and classifies the research object of the thesis, then sets out the collection criteria in terms of range, quantity, principle, and approach. The web page preprocessing module performs two kinds of preprocessing, extracting the body area of a web page and removing all HTML tags, by parsing HTML files into DOM trees. It uses the web page format analysis developed by Hylanda to obtain the body area. The IE module performs the automatic extraction of information from semi-structured people web pages. It builds a field name dictionary of 4,624 valid field names, obtained by programmatically counting people-related field names in a large corpus. Checking each extracted field name against this dictionary greatly raises the precision of the extraction.
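The tag removal and dictionary check described above might look roughly like the following minimal Python sketch. It is an illustration under stated assumptions, not the PeopleInfoAbstract implementation: the four-entry `FIELD_NAMES` set stands in for the 4,624-entry dictionary, the "name: value" line format is assumed, and the standard library's event-based `HTMLParser` is swapped in for the DOM-tree parsing the thesis uses.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the thesis's 4,624-entry field name dictionary.
FIELD_NAMES = {"Name", "Born", "Occupation", "Nationality"}


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of a page, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the non-empty text nodes of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_fields(html, field_names=FIELD_NAMES):
    """Keep only 'name: value' lines whose name is in the dictionary."""
    fields = {}
    for line in strip_tags(html):
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Assumed line format: a candidate field is "name: value".
        if sep and value and name in field_names:
            fields[name] = value
    return fields


page = """
<div><p>Name: Li Ming</p><p>Born: 1970</p>
<p>Updated: yesterday</p><p>Occupation: Engineer</p></div>
"""
print(extract_fields(page))
```

In this sketch the line "Updated: yesterday" has the same shape as a real field but is rejected because "Updated" is not in the dictionary, which is the kind of false positive the dictionary check is meant to filter out.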
The algorithm classifies field values into short field values and long field values, which are extracted with different rules. Following the ontology idea, the thesis describes the characteristics of each extracted field value in order to generate rules that check its validity.
In testing, the average precision and recall both reach 90% or above, and the system shows good adaptability.
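As a rough illustration of the short/long split, the validity check, and the evaluation metrics, consider the sketch below. The length threshold, the field names, and the regular-expression rules are all hypothetical, since the abstract does not state the actual values; precision and recall follow the standard definitions (correct extractions divided by all extractions, and by all gold-standard items, respectively).

```python
import re

# Hypothetical threshold and rules; the abstract does not give the real ones.
SHORT_MAX_CHARS = 30
SHORT_VALUE_RULES = {
    "Born": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),
    "Gender": re.compile(r"male|female", re.IGNORECASE),
}


def classify_value(value):
    """Label a field value 'short' or 'long' by its length."""
    return "short" if len(value) <= SHORT_MAX_CHARS else "long"


def is_valid(name, value):
    """Check a short value against the rule for its field, if one exists."""
    rule = SHORT_VALUE_RULES.get(name)
    if rule is None or classify_value(value) == "long":
        return True  # this sketch has no rule to apply in these cases
    return rule.fullmatch(value) is not None


def precision_recall(extracted, gold):
    """Precision and recall of extracted (name, value) pairs against gold pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For example, `is_valid("Born", "1970")` passes while `is_valid("Born", "unknown")` fails, and scoring two extracted pairs of which one matches a two-pair gold standard gives a precision and recall of 0.5 each.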
Keywords/Search Tags: Web IE, IE rules, DOM tree, XML, semi-structured web page, web page format analysis