Research On The Generation Of Automatic Summarization In Chinese From Web

Posted on: 2013-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Gao
Full Text: PDF
GTID: 2218330371960924
Subject: Computer application technology
Abstract/Summary:
With the continuous development of the Internet, information resources on the network are growing rapidly. This brings convenience but also creates problems, an important one being how to quickly locate and access information of interest within the mass of Web pages. Research on automatic summarization technology arose to improve the efficiency of information retrieval and has received a lot of attention. The technology condenses complicated, lengthy document content into simple, clear sentences, which greatly helps rapid screening of and access to information.
The main research objective of this thesis is to first extract the subject text from news, blog and other pages, then convert the text into a collection of words through Chinese word segmentation, and then, after feature extraction, generate a summary that accurately reflects the full meaning of the text. The design and implementation of an automatic summarization system are given at the end.
Firstly, by comparing and analyzing the shallow text features of Web pages, link density and word density are proposed to distinguish the subject text from "noise" information. These two shallow text features are then used to construct two classifiers based on the decision tree algorithm, which extract the subject text. Next, a Chinese word segmentation tool, IKAnalyser, is introduced, which offers better dictionary (word database) expansion; its word segmentation principle and key algorithms are also described. Then a feature-extraction-based summarization technique is proposed: the weight of each word is calculated from its frequency, location and other factors in order to extract the feature words that accurately reflect the content of the document, and the summary is then generated through similarity calculation over words and, further, sentences.
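The term-weighting step can be illustrated with a minimal sketch. It is written in Python for brevity (the thesis system itself is in Java) and takes sentences that are already segmented into words, for example by IKAnalyser. The function names, the boost factors for title and first-sentence words, and the use of a summed feature-word score in place of the similarity calculation are all assumptions made for illustration; the abstract does not give the actual formulas.

```python
from collections import Counter


def term_weights(sentences, title_words=(), title_boost=2.0, lead_boost=1.5):
    """Weight each word by relative frequency, boosted by location.

    The boost values are hypothetical; the abstract only says that
    frequency, location and other factors are combined.
    """
    freq = Counter(word for sentence in sentences for word in sentence)
    total = sum(freq.values())
    title = set(title_words)
    lead = set(sentences[0]) if sentences else set()
    weights = {}
    for word, count in freq.items():
        weight = count / total
        if word in title:
            weight *= title_boost      # word also appears in the title
        elif word in lead:
            weight *= lead_boost       # word appears in the first sentence
        weights[word] = weight
    return weights


def summarize(sentences, weights, k=1, top_n=5):
    """Pick the k sentences richest in the top_n feature words.

    A simple stand-in for the similarity calculation, not the
    thesis's method. Selected sentences keep their original order.
    """
    features = set(sorted(weights, key=weights.get, reverse=True)[:top_n])

    def score(sentence):
        if not sentence:
            return 0.0
        hit = sum(weights[w] for w in sentence if w in features)
        return hit / len(sentence)     # normalise by sentence length

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Calling `term_weights(sentences, title_words=[...])` and passing the result to `summarize(sentences, weights, k=2)` returns the two highest-scoring sentences in document order. Normalising by sentence length keeps long sentences from winning merely because they contain more words.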
Finally, the design and implementation of the automatic summarization system are given, using the Java language on the Eclipse platform, and the accuracy and efficiency of the system are verified by experiments with Weka. Analysis and experiments show that the proposed system can automatically extract the subject text from news, blogs and other Web pages and then generate and output an accurate summary, which greatly reduces the time needed to locate and access information and improves the efficiency of information retrieval.
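The subject text extraction step, based on the two shallow text features, might look roughly like the sketch below. It is in Python rather than the Java used by the thesis and makes several assumptions. Link density is taken as the share of a block's non-whitespace characters that sit inside `<a>` elements. Word density is approximated by the block's character count. Two fixed thresholds stand in for the trained decision tree classifiers. The class name, function name, block tag set and threshold values are all hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, simplified set).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockFeatures(HTMLParser):
    """Collect text length and link-text length for each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._chars = 0
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        """Close the current block and record it if it has text."""
        text = "".join(self._buf).strip()
        if text:
            self.blocks.append({"text": text,
                                "chars": self._chars,
                                "link_chars": self._link_chars})
        self._buf, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))   # count non-whitespace characters
        self._buf.append(data)
        self._chars += n
        if self._in_link:
            self._link_chars += n


def extract_subject_text(html, max_link_density=0.3, min_chars=10):
    """Keep blocks with little link text and enough text overall."""
    parser = BlockFeatures()
    parser.feed(html)
    parser.close()
    parser.flush()                       # record any trailing block
    kept = []
    for block in parser.blocks:
        link_density = block["link_chars"] / block["chars"]
        if link_density <= max_link_density and block["chars"] >= min_chars:
            kept.append(block["text"])
    return kept
```

Navigation bars and link lists consist almost entirely of anchor text, so they fail the link density test, while very short fragments fail the length test. In the thesis these decisions are made by classifiers learned from data, not by hand-picked thresholds.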
Keywords/Search Tags: shallow text feature, subject text extraction, feature information extraction, term-weighting