Extract Information Based On Semantic And Layout Of Online Characters

Posted on:2009-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:M Yan

Full Text:PDF

GTID:2208360242985845

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:

With the rapid growth in popularity and development of the Internet and Web technologies, an ever-increasing volume of data is published on the Web. The World Wide Web has already become the largest information resource. For any particular user, however, the useful information is small compared with the size of the Web as a whole. This is the so-called "rich data, poor information" problem. It has motivated Web information extraction (IE), which has recently become an active research topic.

This thesis adopts an IE algorithm that combines rules with statistics, and describes the information to be extracted using ideas from ontology in order to generate the extraction rules. An IE system for information about people, called PeopleInfoAbstract, has been developed; it automatically extracts information from semi-structured web pages about people.

The system is composed of four modules: the web page collecting module, the web page preprocessing module, the IE module, and the selecting module. The web page collecting module first defines and classifies the research object of this thesis, then introduces the collection criteria in terms of range, quantity, principle, and approach. The web page preprocessing module performs two kinds of preprocessing by parsing HTML files into DOM trees: extracting the body area of the web page and removing all HTML tags. This module uses the web page format analysis developed by Hylanda to obtain the body area of the web page. The IE module performs the automatic extraction of information from semi-structured web pages about people. It builds a field name dictionary containing 4,624 valid field names, obtained by programmatically counting people-related field names in a large corpus. Checking each extracted field name against this dictionary greatly raises the precision of the extraction.

The algorithm classifies field values into short field values and long field values, which use different extraction rules. The thesis applies the ontology idea to check field values: it describes the characteristics of each extracted field value in order to generate validity-checking rules. In testing, the average precision and average recall both reach 90% or above, and the system shows good adaptability.
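
The preprocessing and extraction steps described in the abstract can be illustrated with a minimal sketch. This is not the PeopleInfoAbstract implementation: the small `FIELD_NAMES` set stands in for the 4,624-entry field name dictionary, the `SHORT_VALUE_MAX_CHARS` threshold and the "name: value" line pattern are assumptions made for illustration, and Python's standard `html.parser` replaces both the DOM-tree analysis and the Hylanda body-area detection.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks and drop tags, script, and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


# Hypothetical stand-in for the thesis's field name dictionary.
FIELD_NAMES = {"name", "gender", "birthplace", "occupation", "biography"}

# Assumed cut-off between short and long field values (not from the thesis).
SHORT_VALUE_MAX_CHARS = 30

# Assumed layout: "field name: field value" (ASCII or full-width colon).
FIELD_PATTERN = re.compile(r"^\s*([^:：]{1,20})[:：]\s*(.+)$")


def extract_fields(html):
    """Return {field name: (kind, value)} for names found in FIELD_NAMES."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    records = {}
    for chunk in parser.chunks:
        match = FIELD_PATTERN.match(chunk)
        if not match:
            continue
        name = match.group(1).strip().lower()
        value = match.group(2).strip()
        # Dictionary check: discard candidates whose name is not a known field.
        if name not in FIELD_NAMES:
            continue
        # Short and long values would go through different rules downstream.
        kind = "short" if len(value) <= SHORT_VALUE_MAX_CHARS else "long"
        records[name] = (kind, value)
    return records


SAMPLE_PAGE = """
<html><head><style>p {color: red}</style></head><body>
<p>Name: Li Ming</p>
<p>Contact: click here</p>
<p>Biography: Li Ming studied pattern recognition and later worked on web information extraction systems.</p>
</body></html>
"""

if __name__ == "__main__":
    for field, (kind, value) in extract_fields(SAMPLE_PAGE).items():
        print(field, kind, value)
```

In this sketch the "Contact" line is rejected by the dictionary check, which is the mechanism the abstract credits for the gain in precision. A real system would replace the single length threshold with separate extraction and validity rules for short and long values.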

Keywords/Search Tags:

Web IE, IE rules, DOM tree, XML, semi-structured web pages, web page format analysis

Related items

1	Research On Mining Structure Of WEB Page For Information Extraction
2	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
3	A Web Structure Clustering Algorithm For Mobile Page Adaptive Platform
4	Web Page-oriented Handheld Devices Automatically Cutting Technology Research
5	Research On Web Data Extraction Based On Web Page Structure
6	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
7	A Study Of Hybrid Cache Management Mechanism Based On Page Classifier And Page Placer
8	Structure Information Extraction: Study And Implementation On Semi-auto Wrapper
9	Research Of Web Page Purifying Method Based On Document Object Model
10	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website