Research On Automatic Chinese Text Summarization Of Web-oriented Text Mining

Posted on: 2010-12-27
Degree: Master
Type: Thesis
Country: China
Candidate: Q N Xu
GTID: 2178330332998594
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, massive Web data resources have become an important source of knowledge and information. Because Web resources are semi-structured, scattered, real-time and heterogeneous, it is very hard for users to obtain truly valuable information from the Web accurately. At present, the amount of Chinese text data far exceeds the amount of structured data. Recent studies have shown that about 80% of an organization's information is stored as text. As information resources grow, people urgently need to collect and select interesting and useful information from massive amounts of text effectively. Driven by this demand, text mining has become a hot and difficult topic within data mining.

This thesis studies automatic Chinese text summarization for Web-oriented text mining and the design of a corresponding system. Through the independent development of a Chinese text information retrieval system, the work focuses on the core technologies of Chinese Web text mining and automatic text summarization, which can be summarized as follows.

Chinese word segmentation: Considering the characteristics of Chinese data, we adopted an algorithm based on the "meta-word".

Chinese text keyword extraction: Using the segmentation results, we developed a statistical method to extract the keywords.

Building on existing methods for automatic Chinese text summarization, this thesis proposes a new summarization method based on statistical analysis of the text. The summary is extracted through structural analysis of sub-themes and then post-processed with heuristic rules to make it more readable.
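The abstract does not give the statistical formulas, so the following is only a minimal sketch of frequency-based keyword extraction and sentence selection, not the thesis's method. It assumes the text has already been segmented into token lists and uses raw term frequency as the statistic. The names `extract_keywords` and `summarize` and the stopword set are hypothetical.

```python
from collections import Counter


def extract_keywords(tokens, stopwords, top_k):
    """Return the top_k most frequent non-stopword tokens.

    Assumption: raw term frequency stands in for the thesis's
    (unspecified) statistical keyword measure.
    """
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(top_k)]


def summarize(sentences, stopwords, top_k=3, n_sentences=1):
    """Pick the n_sentences highest-scoring sentences, in document order.

    sentences: list of token lists (already segmented).
    A sentence's score is the summed frequency of the keywords it contains.
    """
    all_tokens = [t for sentence in sentences for t in sentence]
    counts = Counter(t for t in all_tokens if t not in stopwords)
    keyword_weight = dict(counts.most_common(top_k))
    scores = [sum(keyword_weight.get(t, 0) for t in s) for s in sentences]
    # Rank sentence indices by score; sorted() is stable, so ties keep order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore original order
    return [sentences[i] for i in chosen]
```

For Chinese input, the token lists would come from a word segmenter. The sketch leaves that step out because the abstract does not describe the "meta-word" algorithm in enough detail to reproduce it.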
The main work and major innovations are: ① proposing a practical method of automatic summarization; ② proposing a new method for selecting the terms of the text vector space, which uses only the several top-weighted keywords rather than all the words and thereby solves the problem of information dispersion; ③ designing a new method for partitioning themes, in which the number of themes varies dynamically with the structure of the text, so that themes are partitioned in a principled way; ④ introducing the concepts of the overall weight, the local weight and the thematic weight of keywords, and proposing an appropriate evaluation method for each weight category, which removes the dependence on a large corpus. Based on the above research results, we design and implement the system in detail.
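Points ② and ③ can be illustrated with a small sketch, offered under stated assumptions rather than as the thesis's actual algorithm. Sentences are mapped onto a vector space whose dimensions are only the top-weighted keywords, and a new theme starts wherever the cosine similarity between adjacent sentences falls below a threshold, so the number of themes depends on the text. The threshold rule, the use of raw frequency as the weight, and all function names are hypothetical.

```python
import math
from collections import Counter


def top_keywords(sentences, top_k):
    """Select the top_k most frequent tokens as vector-space dimensions."""
    counts = Counter(t for sentence in sentences for t in sentence)
    return [word for word, _ in counts.most_common(top_k)]


def to_vector(sentence, keywords):
    """Represent a sentence by its counts over the selected keywords only."""
    counts = Counter(sentence)
    return [counts[k] for k in keywords]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def partition_themes(sentences, top_k=4, threshold=0.1):
    """Group consecutive sentence indices into themes.

    Assumption: a boundary is placed wherever adjacent sentences have
    cosine similarity below `threshold`; the theme count is not fixed.
    """
    if not sentences:
        return []
    keywords = top_keywords(sentences, top_k)
    vectors = [to_vector(s, keywords) for s in sentences]
    themes = [[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            themes.append([i])  # similarity dropped: start a new theme
        else:
            themes[-1].append(i)
    return themes
```

Restricting the dimensions to a few top-weighted keywords keeps the vectors short and concentrates the similarity signal on the terms that matter, which is one plausible reading of the "information dispersion" problem named in point ②.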
Keywords/Search Tags: automatic summarization, web text mining, keywords extraction, statistical method, vector space, structural analysis