Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website

Posted on: 2008-09-06
Degree: Master
Type: Thesis
Country: China
Candidate: C L Wang
GTID: 2178360212974252
Subject: Computer application technology
Abstract/Summary:
The Internet is an extremely large information repository whose volume of data is growing at an exponential rate, which makes it a valuable resource for users. However, the information on the Internet is massive, heterogeneous, changeable and lacking in explicit semantics, so it is difficult to retrieve relevant data quickly and accurately from the enormous number of web pages. Robust, flexible and automatic tools that help users retrieve information from the Internet effectively have therefore become a necessity.

This thesis presents a novel mechanism for automatic web information extraction based on the semantic structure of a website. It extracts information using a semantically meaningful logical view of the website, so that the computer can understand the meaning of the information to a certain extent and the extraction process becomes more efficient.

The thesis designs a web information extraction system based on the semantic structure of the website. The system consists of three main components: a website spider, a website semantic structure generator and a web information extractor. The website spider searches the target website, records the link relations needed to generate the website directed graph, and downloads the pages from which relevant information will be extracted. The website semantic structure generator translates the website directed graph (the physical structure of the website) into the website semantic structure. It relies on the classification of web pages that the website designer has already made according to his or her understanding of the page content, and produces a category relationship chart that follows the semantic classification of the website. The web information extractor extracts relevant information based on this classification.

A website spider was implemented in the thesis. The spider can traverse websites, download web pages and generate website directed graphs. Several key implementation issues are discussed in detail. The thesis also proposes a web page classification based on the semantic meaning of the website. Once the website directed graph has been constructed, a topology that reflects the semantic meaning of the website is generated from the web page semantic classification and the directed graph. Once the semantic structure of the website has been constructed, web information extraction can be carried out on the basis of this structure. The thesis presents a web...
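
As an illustration of the spider step described above (traverse a website, follow its links, and record them as a directed graph), the following is a minimal sketch in Python. It is not the thesis's implementation. The names LinkParser, build_site_graph, fetch and PAGES are hypothetical, and a small in-memory dictionary stands in for real page downloads so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_site_graph(start_url, fetch):
    """Breadth-first traversal of one website.

    fetch(url) is a hypothetical callable that returns the page HTML as a
    string, or None if the page cannot be downloaded.
    Returns a directed graph as {page_url: [linked_page_url, ...]}.
    """
    host = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            continue  # page already visited
        html = fetch(url)
        graph[url] = []
        if html is None:
            continue  # keep the node, but it has no outgoing edges
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any "#fragment" part.
            target, _fragment = urldefrag(urljoin(url, href))
            if urlparse(target).netloc != host:
                continue  # stay inside the target website
            if target not in graph[url]:
                graph[url].append(target)
            queue.append(target)
    return graph


# Toy website used in place of real downloads (an assumption of this sketch).
PAGES = {
    "http://example.com/": '<a href="/news/">News</a> <a href="/about.html">About</a>',
    "http://example.com/news/": '<a href="item1.html">Item</a> <a href="/">Home</a>',
    "http://example.com/news/item1.html": '<a href="http://other.org/">External</a>',
    "http://example.com/about.html": "",
}

graph = build_site_graph("http://example.com/", PAGES.get)
for page, links in graph.items():
    print(page, "->", links)
```

In this sketch each key of the returned dictionary is a page and each list holds the pages it links to, which corresponds to the physical structure of the website mentioned in the abstract. A semantic structure generator would take such a graph, together with a page classification, as its input.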
Keywords/Search Tags: web information extraction, web page classification, tag tree, web page noise reduction, spider