
Study on the Text Representation of Extraction-Based Multi-Document Summarization

Posted on: 2014-07-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Gong    Full Text: PDF
GTID: 1268330401471006    Subject: Computer application technology
Abstract/Summary:
Automatic summarization is a branch of Natural Language Processing. This dissertation focuses on extraction-based multi-document summarization, which is the most basic and most widely studied branch of automatic summarization. Given a collection of documents, it extracts a set of important sentences from those documents to constitute the final summary. The target collection generally consists of documents on a certain theme, gathered from different sources. Although collections always have predefined themes, the variety of information sources causes the following problems, which affect the accuracy of text representation:

1. Theme non-uniqueness: Different authors write about different aspects of the same theme for different purposes, so a collection may contain more than one theme. These themes may be related to, or subsumed by, the predefined theme.

2. Word sense complexity: Because of the richness of authors' writing habits and the flexibility of language use, documents produced by different people easily contain synonyms for the same concept. Collections with rich themes may also contain polysemous words that serve different themes at the same time.

3. Noise: Besides content related to the predefined theme, documents may also carry irrelevant content, which can be treated as noise in a theme-oriented collection.

Based on the analysis and understanding of the above problems, this dissertation uses semantic analysis models, semantic knowledge bases, and supervised information to improve text representation in extraction-based multi-document summarization. The main contributions of this dissertation are as follows:

1. We introduce a multi-document summarization method based on topic representation to deal with the theme non-uniqueness problem. It includes three main modules: training the collection's topic structure, representing sentences by topics, and computing sentence importance (a minimal sketch of the resulting pipeline follows this contribution).

(1) We introduce a background training method for the collection's topic structure, addressing the accuracy and stability problems caused by small summarization datasets. By mixing other collections with the target one, we expand the scale of the training set, add extra word-distribution information to assist training, and obtain a topic structure of the form "training set - collection theme - collection topic - word". Experiments verify that this method improves the accuracy and stability of the generated summaries.

(2) Building on the topic structure, we introduce a topic representation for sentences. A sentence's topic vector is built from the topics of the words inside it and reflects the relation between the sentence and the topics. Experiments verify that this representation can accurately distinguish sentences from the different collections mixed together in the training set.

(3) Building on the sentence topic representation, we propose a multi-document summarization method. Since the collection always has a predefined theme, we assume that the more sentences a topic is related to, and the more closely, the more important that topic is. Sentences related to important topics are in turn important and should be selected into the summary. Experiments show that this summarization method obtains summaries of good quality.
2. We introduce a multi-document summarization method based on Wikipedia concept representation to deal with the word sense complexity problem. It includes three main modules: concept form and extraction, sentence concept representation, and sentence feature computing (illustrative sketches of the concept network and the first-paragraph feature appear after this contribution list).

(1) We use Wikipedia concepts and a wikification method for extraction, which keeps the concept representation and the corresponding summarization method robust, easy to extend, and effective in the long term.

(2) We improve the concept weight calculation and obtain the concept representation of sentences. By combining a concept's global information in Wikipedia with its local information in the collection, a concept network can be built. A sentence's concept vector is then built from concept weights computed according to the concepts' connectivity in the network. Experiments confirm that this approach extracts representative concepts from the collection.

(3) We introduce a multi-document summarization method based on the sentence concept representation and the first paragraph of each Wikipedia concept. Since the first paragraph of a Wikipedia concept is a manually written summary of that concept, we can compute related features from it. Sentence importance is then computed by combining these features with other commonly used features. Experiments confirm the validity of the first-paragraph features and the quality of summaries generated using the Wikipedia concept representation.

3. We introduce a supervised learning method for automatic noise filtering in multi-document summarization to deal with the noise problem. It includes three main modules: supervised information extraction, feature extraction, and classifier training (a sketch of the labeling and training steps also appears after this list).

(1) We select semantic units as training objects and use model summaries to extract class labels. Our studies of model summaries show that they contain a certain number of the semantic units in the target collection, so we can label a semantic unit according to whether or not it appears in the model summaries.

(2) We extract features for semantic units of different frequencies. Experiments show that both valid and noise semantic units can have high or low frequencies, so frequency alone is not sufficient to distinguish the two kinds of units. We therefore design features for units with high or low frequencies, for units with similar frequencies, and for co-occurring units with different frequencies.

(3) We use a binary classifier for automatic noise filtering. Because noise distributions differ across datasets, automatically distinguishing valid from noise semantic units is more realistic for noise filtering. Experiments show that, based on the labels and features above, the learned noise filter can be used in different summarization systems and can filter noise over different kinds of semantic units. The filtering improves both the quality of the summaries and the running time of the summarization algorithms.
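For contribution 2, the abstract describes weighting concepts by their connectivity in a network that mixes Wikipedia's global link information with local co-occurrence in the collection. The sketch below assumes concepts have already been extracted by a wikifier; the use of networkx and PageRank is an illustrative stand-in for the unspecified connectivity measure:

```python
# Sketch of connectivity-based concept weighting in a concept network.
# Inputs are assumed: sent_concepts (one concept set per sentence, local
# information) and wiki_links (concept pairs linked in Wikipedia, global
# information). PageRank is one possible connectivity measure, not
# necessarily the dissertation's.
import itertools
import networkx as nx

def build_concept_network(sent_concepts, wiki_links):
    g = nx.Graph()
    for concepts in sent_concepts:               # local co-occurrence edges
        for a, b in itertools.combinations(sorted(concepts), 2):
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)
    for a, b in wiki_links:                      # global Wikipedia link edges
        if g.has_node(a) and g.has_node(b):
            w = g.get_edge_data(a, b, {"weight": 0})["weight"]
            g.add_edge(a, b, weight=w + 1)
    return g

def concept_weights(g):
    """Weight each concept by its connectivity in the network."""
    return nx.pagerank(g, weight="weight")

def sentence_concept_vector(concepts, weights, vocab):
    """Sentence vector over the concept vocabulary, scaled by concept weight."""
    return [weights.get(c, 0.0) if c in concepts else 0.0 for c in vocab]
```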
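The first-paragraph feature of contribution 2(3) exploits the fact that a concept's opening Wikipedia paragraph is a human-written summary of that concept. One hedged reading, computing a sentence's maximum token-level cosine similarity against the first paragraphs of the concepts it mentions, is sketched below; the similarity measure is an assumption:

```python
# Sketch of a first-paragraph feature: how close is a sentence to the
# manually written Wikipedia summaries of its concepts? Cosine over raw
# token counts is an illustrative choice only.
from collections import Counter
from math import sqrt

def cosine(c1, c2):
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def first_paragraph_feature(sentence, concept_first_paragraphs):
    """Max similarity between a sentence and the first paragraphs of the
    concepts it mentions (concept_first_paragraphs: concept -> paragraph)."""
    s_counts = Counter(sentence.lower().split())
    sims = [cosine(s_counts, Counter(p.lower().split()))
            for p in concept_first_paragraphs.values()]
    return max(sims, default=0.0)
```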
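Finally, the noise filter of contribution 3 can be read as a standard supervised pipeline: label semantic units by their presence in model summaries, featurize them, and train a binary classifier. A minimal sketch with scikit-learn follows; the specific features are placeholders, since the dissertation's feature set goes beyond raw frequency:

```python
# Sketch of the supervised noise filter. "Semantic units" (e.g., words or
# concepts) are assumed to be extracted already; model summaries provide
# the labels, as the abstract describes. Features here are illustrative.
from sklearn.linear_model import LogisticRegression

def label_units(units, model_summaries):
    """A unit is valid (1) if it appears in any model summary, else noise (0)."""
    summary_text = " ".join(model_summaries).lower()
    return [1 if u.lower() in summary_text else 0 for u in units]

def unit_features(unit, collection_freq, doc_freq, num_docs):
    return [
        collection_freq.get(unit, 0),        # raw frequency in the collection
        doc_freq.get(unit, 0) / num_docs,    # fraction of documents covered
        len(unit.split()),                   # unit length in tokens
    ]

def train_noise_filter(units, labels, collection_freq, doc_freq, num_docs):
    X = [unit_features(u, collection_freq, doc_freq, num_docs) for u in units]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf  # clf.predict(...) == 0 marks a unit as noise to be filtered
```

Filtering units predicted as noise before sentence scoring is what, per the abstract, improves both summary quality and running time.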
Keywords/Search Tags: Multi-document summarization, Topic modeling, Wikipedia, Wikification, Supervised learning, Automatic noise filtering