Research And Application Of Web Information Extraction And Webpage Summarization

Posted on:2009-02-03

Degree:Master

Type:Thesis

Country:China

Candidate:Q S Liu

Full Text:PDF

GTID:2178360272470365

Subject:Computer software and theory

Abstract/Summary:

Accurately extracting the primary content of a Web page and summarizing it has become an important research field, because the growth of the Web has created numerous information sources published as HTML pages on the Internet. However, much of the information on a page is intra-page redundancy, such as navigation bars and advertisements, which surrounds the primary content and obscures the document's theme. As a result, users cannot quickly locate what they really want, and search engines must spend more effort indexing the page.

Based on an analysis of the differences between Web pages and ordinary text, a feasible information extraction method has been developed. The method rests on three observations: noise blocks usually have a high node frequency within a given website; primary content blocks usually contain few link words and many text words; and useful links are contained in useful content blocks and are semantically close to the page title. Therefore, after parsing the Web page into an ordinary DOM tree, the method adds relevance and node frequency properties as judging criteria. This not only improves precision but also reduces sensitivity to the entropy threshold. Block node frequency is used instead of the entropy property to make the method more efficient. Experiments on 8 websites show that the proposed method identifies primary content blocks with precision and recall both above 0.96.

Building on this work, a human-like method for automatic summarization of Chinese Web documents, guided by document structure, is proposed. This method partitions the document into a hierarchical structure by analyzing the semantic distance between adjacent paragraphs, uses statistical approaches and heuristic rules to extract keywords and key sentences, and finally generates the abstract. Experiments show that this method generates abstracts effectively.

Finally, the Web information extractor is applied in a criminal investigation information extraction system, where it extracts criminal investigation information and supplies data to other systems, such as an information comparison system, with good results.
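
The first two observations behind the extraction method (noise blocks recur across a website, and primary content blocks contain few link words relative to text words) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Block` record, the `signature` key, and both thresholds (`max_freq`, `max_link_ratio`) are hypothetical choices made for illustration, and the title-relevance criterion is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One DOM block, reduced to the features used here (hypothetical record)."""
    signature: str   # assumed key identifying the block, e.g. tag path plus content hash
    text_words: int  # words outside <a> elements
    link_words: int  # words inside <a> elements


def node_frequency(pages):
    """Return, per block signature, the fraction of pages on which it occurs."""
    counts = Counter()
    for blocks in pages:
        # Count each signature at most once per page.
        counts.update({b.signature for b in blocks})
    return {sig: n / len(pages) for sig, n in counts.items()}


def link_ratio(block):
    """Share of a block's words that are link words (1.0 for an empty block)."""
    total = block.text_words + block.link_words
    return block.link_words / total if total else 1.0


def primary_blocks(page, freq, max_freq=0.5, max_link_ratio=0.3):
    """Keep blocks that are rare across the site and dominated by plain text.

    Both thresholds are illustrative assumptions, not values from the thesis.
    """
    return [
        b for b in page
        if freq.get(b.signature, 0.0) <= max_freq
        and link_ratio(b) <= max_link_ratio
    ]
```

Given a list of pages from one site, each represented as a list of `Block` records, `node_frequency(pages)` is computed once and then passed to `primary_blocks` for every page. A navigation bar that appears on every page gets a frequency of 1.0 and is dropped, while an article body that appears on a single page and consists mostly of plain text is kept.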

Keywords/Search Tags:

DOM Tree, Information Extraction, Information Entropy, Automatic Summarization, Document Structure

Related items

1	Statistic-based Automatic Keyphrase Extraction And Summarization From Multi-document
2	Automatic Summarization Of Multimedia Information And Related Technology Research
3	Research On Key Techniques Of Multiple Documents Automatic Summarization
4	Information Extraction System For Three Types Of Information Disclosure Announcements Of Listed Companies
5	Study On Methods And Their Applications Of Text Automatic Summarization And Information Extraction
6	Key Technologies Research On Web Products Automatic Extraction Based On Web List
7	Research Of Single Document Automatic Summarization Based On Discourse Structure Theory
8	Research Of Document Summarization Based On Topic Analysis
9	Multi-Document Automatic Summarization Of Chinese
10	Chinese Multi-document Automatic Summarization Extraction Based On The Combination Of LDA And TextRank