Research On Automatic Multi-Document Summarization Based On Deep Learning

Posted on: 2018-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: J S Wang
GTID: 2348330515978278
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the internet, a large amount of data is generated every day, such as text, images, and videos. Text is the most common of these data types. When users want to look up and understand topics that concern them, they have to spend a great deal of time selecting and reading articles. Automatic summarization technology offers a quick way to understand a topic: it condenses documents so that users can grasp the relevant information by reading a short passage of ten to a few dozen sentences. Many summarization methods have therefore emerged to meet this demand, such as topic-based and word-based methods. These models have essentially solved some of the problems of single-document summarization. However, many of them cannot obtain good results because of various problems; for instance, a multi-document set involves many topics, and feature extraction is difficult.

In recent years, deep learning has made great progress in text processing. State-of-the-art neural machine translation models have surpassed traditional algorithms for many languages. This thesis applies deep learning to automatic multi-document summarization. The restricted Boltzmann machine (RBM) is a classical deep learning model that can encode data, and it is widely used for feature dimension reduction and neural network weight initialization. Because feature extraction from text data is difficult, the effect of many individual features is hard to understand. This thesis performs feature selection with a multi-layer network model composed of multiple restricted Boltzmann machines, which makes features easier to obtain and lets them carry more complete text information. Sentences in the documents are then scored with a support vector machine model, and representative sentences are extracted from the multiple documents under sentence redundancy control; the summary set is generated from the sentences with the highest score per unit length. Finally, the summary set is sorted according to the relative order of the summary sentences in their source documents, and sentences on the same topic are gathered together so that the order of the summary is more reasonable. The main process is as follows:

(1) Representation of multi-document information. Research on natural language processing and summarization methods shows that text understanding usually expresses document information at the levels of word, sentence, article, and multi-document set. We represent the text as vectors and extract features from the documents at multiple levels, such as the importance of words in a sentence, sentence content information, the importance of a sentence's position in the document, the similarity between a sentence and the article title, and the similarity to the query words.

(2) Feature dimension reduction. We use a multi-layer network structure consisting of one local feature extraction layer and two restricted Boltzmann machine layers to reduce the dimension of the features. In addition, we obtain more abstract features from the large collection of input features.

(3) Summary generation and ordering. First, we obtain sentence scores with the support vector machine model and combine the higher-scoring sentences into candidate summaries. We then compute a score per unit length for each candidate sentence, which yields a summary set made up of the sentences with the highest scores. We call this an incremental generation summarization scheme; it lets the summary perform well in both coverage and redundancy. Finally, we sort the summary sentences to make the result more reasonable and logical: based on the relative order of the sentences in their source articles, we gather sentences on similar topics together.
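The incremental generation scheme in item (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: sentence scores are assumed to come from an already-trained scorer (the thesis uses a support vector machine), redundancy is measured here by Jaccard word overlap as a stand-in, ordering is simplified to source document and position without topic grouping, and the names `Sentence`, `jaccard`, `build_summary`, `max_words`, and `max_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    text: str
    doc_id: int      # which source document the sentence came from
    position: int    # index of the sentence inside that document
    score: float     # importance score, assumed to come from a trained scorer


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in redundancy measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def build_summary(sentences, max_words=30, max_overlap=0.5):
    """Greedily add the sentence with the best score per word, skipping
    sentences that are too similar to ones already chosen."""
    # Rank candidates by score per unit length (here: per word).
    ranked = sorted(
        sentences,
        key=lambda s: s.score / max(len(s.text.split()), 1),
        reverse=True,
    )
    chosen, used = [], 0
    for cand in ranked:
        n_words = len(cand.text.split())
        if used + n_words > max_words:
            continue  # would exceed the length budget
        if any(jaccard(cand.text, s.text) > max_overlap for s in chosen):
            continue  # redundancy control
        chosen.append(cand)
        used += n_words
    # Simplified ordering: by source document, then original position.
    return sorted(chosen, key=lambda s: (s.doc_id, s.position))


# Toy input with made-up scores, for illustration only.
candidates = [
    Sentence("the storm hit the coast on monday", 0, 0, 0.9),
    Sentence("on monday the storm hit the coast", 1, 0, 0.8),
    Sentence("thousands of residents were evacuated", 0, 2, 0.7),
    Sentence("officials said power may be out for days", 1, 3, 0.6),
]
summary = build_summary(candidates, max_words=20)
for s in summary:
    print(s.text)
```

Ranking by score per word rather than raw score keeps long sentences from crowding out several short, informative ones under a fixed length budget. In the toy input, the second sentence is dropped because it repeats the words of the first.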
Keywords/Search Tags: Multi-document automatic summarization, Boltzmann machine, RBM, feature dimension reduction, abstract, deep learning