Research And Application Of Web Information Extraction And Webpage Summarization

Posted on: 2009-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Liu
Full Text: PDF
GTID: 2178360272470365
Subject: Computer software and theory
Abstract/Summary:
Accurately extracting the primary content of a Web page and summarizing it has become an important research field, because the growth of the Web has created numerous information sources published as HTML pages on the Internet. However, much of the information on a page is intra-page redundancy, such as navigation bars and advertisements, which surrounds the primary content and obscures the theme of the document. As a result, users cannot quickly locate what they really want, and search engines must spend more effort indexing the page.
Based on an analysis of the differences between Web pages and ordinary text, a feasible method of information extraction has been developed. The method rests on three observations: noise blocks tend to have a high node frequency within a given website; primary content blocks usually contain few link words and many text words; and useful links are contained in useful content blocks and are semantically close to the page title. Therefore, after parsing the Web page into an ordinary DOM tree, the method adds relevance and node frequency properties as judging criteria. This not only improves precision but also reduces sensitivity to the entropy threshold. Block node frequency is used instead of the entropy property to make the method more efficient. Experiments on 8 representative websites show that the proposed method identifies primary content blocks with precision and recall both above 0.96.
Building on this work, a human-like method for automatic summarization of Chinese Web documents is proposed, guided by document structure. The method partitions the document into a hierarchical structure by analyzing the semantic distance between adjacent paragraphs, uses statistical approaches and heuristic rules to extract keywords and key sentences, and finally creates the abstract. Experiments show that this method generates abstracts effectively.
Finally, the Web information extractor is applied in a criminal investigation information extraction system, where it extracts criminal investigation information and supplies data to other systems, such as an information comparison system, with good results.
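The three observations behind the extraction method can be illustrated with a minimal sketch. This is not the thesis's implementation: the `Block` structure, the helper names, the whitespace tokenization, the word-overlap stand-in for semantic distance, and all threshold values are assumptions chosen for illustration only (Chinese text would need a proper word segmenter).

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical summary of one DOM block (not from the thesis)."""
    text: str        # all visible text in the block
    link_text: str   # the part of that text that sits inside <a> elements
    pages_seen: int  # number of pages on the site containing an identical block


def node_frequency(block: Block, total_pages: int) -> float:
    """Fraction of the site's pages on which this block appears."""
    return block.pages_seen / total_pages


def link_ratio(block: Block) -> float:
    """Share of the block's words that are link words (1.0 for empty blocks)."""
    total = len(block.text.split())
    if total == 0:
        return 1.0
    return len(block.link_text.split()) / total


def title_overlap(block: Block, title: str) -> float:
    """Crude stand-in for semantic closeness: share of title words found in the block."""
    title_words = set(title.lower().split())
    if not title_words:
        return 0.0
    block_words = set(block.text.lower().split())
    return len(title_words & block_words) / len(title_words)


def is_primary_content(block: Block, title: str, total_pages: int,
                       max_freq: float = 0.5,        # assumed threshold
                       max_link_ratio: float = 0.3,  # assumed threshold
                       min_overlap: float = 0.2) -> bool:  # assumed threshold
    # Observation 1: blocks repeated across the site are likely noise.
    if node_frequency(block, total_pages) > max_freq:
        return False
    # Observation 2: content blocks have few link words and many text words.
    if link_ratio(block) <= max_link_ratio:
        return True
    # Observation 3: a link-heavy block is kept only if it relates to the title.
    return title_overlap(block, title) >= min_overlap


title = "City council approves budget"
nav = Block("Home News Sports Contact", "Home News Sports Contact", 10)
article = Block("The city council approved the new budget after a long "
                "debate on Tuesday see details", "details", 1)
related = Block("Council budget vote Budget debate timeline",
                "Council budget vote Budget debate timeline", 1)
ads = Block("Cheap flights Hot deals", "Cheap flights Hot deals", 1)

results = [is_primary_content(b, title, total_pages=10)
           for b in (nav, article, related, ads)]
```

In this toy example the navigation bar is rejected by its node frequency, the article is kept for its low link ratio, and the two link-heavy blocks are separated by their overlap with the page title.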
Keywords/Search Tags: DOM Tree, Information Extraction, Information Entropy, Automatic Summarization, Document Structure