A Study of Theme-Based Automatic Summarization of Web Documents

Posted on: 2007-12-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z M Chen
GTID: 2208360185961106
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the network has become a warehouse of data and a potential source of knowledge. How users can obtain and use these resources quickly and effectively has become an urgent problem, and information retrieval and automatic summarization are the key technologies for solving it. Automatic summarization presents document content compactly, and it has grown out of information retrieval techniques as they matured. At present, however, search engines return only the sentences or paragraphs that contain the query keywords as the abstract, which makes it hard for users to grasp the whole content of a Web document. Traditional techniques based on statistical word frequency focus only on the surface characteristics of the text and lack semantic analysis, so they are not fully suitable for summarizing Web documents.
In view of these deficiencies, this thesis proposes a new open-domain summarization technique for Web documents based on topic segmentation. The method adds the semantic analysis of understanding-based summarization on top of mechanical summarization, and makes full use of the Web page structure, which assists summarization. First, we segment the Web page into topics using the structure of the HTML document and extract the summary per topic block, which increases the summary's coverage of the document. Second, when extracting topic features, we use WordNet to extract topic concepts instead of counting raw word frequencies, which removes the influence of synonyms and improves the accuracy of the extracted topic words. Finally, based on sentence weights and the similarity between sentences, we propose an algorithm that extracts summary sentences dynamically, which greatly reduces redundancy.
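The last step above, choosing summary sentences by weight while avoiding near-duplicates, might be sketched as follows. This is a minimal illustration and not the thesis's algorithm: the greedy strategy, the bag-of-words cosine similarity, the `max_sim` threshold, and the names `cosine` and `select_sentences` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_sentences(sentences, weights, k, max_sim=0.5):
    """Greedily pick up to k sentences in order of decreasing weight.

    A candidate is skipped when it is too similar (cosine > max_sim)
    to a sentence already chosen. max_sim is an assumed threshold.
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    by_weight = sorted(range(len(sentences)),
                       key=lambda i: weights[i], reverse=True)
    chosen = []
    for i in by_weight:
        if len(chosen) == k:
            break
        if all(cosine(vectors[i], vectors[j]) <= max_sim for j in chosen):
            chosen.append(i)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]


sentences = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply today",
]
weights = [0.9, 0.8, 0.5]
summary = select_sentences(sentences, weights, k=2)
```

Here the second sentence has a high weight but is skipped because it nearly repeats the first, so the lower-weighted third sentence is chosen instead.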
The main work of this thesis is as follows:
Web page topic segmentation: After building the DOM tree of the HTML document, we preprocess the collected Web page with a two-level filter, and then divide the document into topics using the document's natural segmentation and a semantic similarity comparison between nodes.
Topic concept extraction: Using the synonym and hypernym/hyponym relations in WordNet, we fold synonyms into their common ancestor concept instead of counting each synonymous word separately, which reduces the dimension of the vector space...
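To illustrate why folding synonyms into an ancestor concept shrinks the vector space, here is a small sketch. It assumes a hand-written dictionary, `TOY_CONCEPTS`, as a stand-in for real WordNet lookups; the dictionary contents and the function name `concept_counts` are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: maps a word to an ancestor concept.
# A real system would look these relations up in WordNet instead.
TOY_CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "physician": "doctor",
    "doctor": "doctor",
}


def concept_counts(tokens, concept_of=TOY_CONCEPTS):
    """Count concepts rather than surface words.

    Words with a known ancestor are counted under that ancestor;
    all other words are counted under their lowercased form.
    """
    return Counter(concept_of.get(t.lower(), t.lower()) for t in tokens)


tokens = "The car and the automobile passed a truck".split()
word_counts = Counter(t.lower() for t in tokens)
counts = concept_counts(tokens)
```

In this example the seven distinct words collapse to five dimensions, and "car", "automobile", and "truck" contribute to a single "vehicle" count of 3 instead of three separate counts of 1.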
Keywords/Search Tags: automatic summary, Web segmentation, theme concept, extracting dynamically, Web summary