Research On Internet Short Text Message Oriented Multi-Document Automatic Summarization

Posted on: 2017-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
Full Text: PDF
GTID: 2428330596959997
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid evolution of the Internet, especially the mobile Internet, the scale of the Internet in China continues to expand. Internet users produce a large number of complicated and redundant short text messages as they communicate, post, and forward information. Traditional multi-document automatic summarization methods are mostly oriented toward long text and cannot handle short text messages, which are sparse. Multi-document automatic summarization oriented toward Internet short text messages can generate a summary that covers the full information on a topic within a limited number of words. Such summaries are badly needed both by the government, to grasp Internet public opinion, and by intelligence organizations, to improve the efficiency of intelligence gathering. This thesis studies Internet short text message oriented multi-document automatic summarization in depth, including sentence retrieval, short text clustering, and summary sentence extraction. The main contributions are as follows:

(1) A sentence retrieval method based on WordNet and word embeddings is proposed to address the vocabulary mismatch problem rooted in the sparsity of sentences and queries. First, the Personalized PageRank algorithm is run over the graph representation of WordNet concepts and relations to obtain concepts related to the queries, which partially alleviates the sparsity of the queries. Second, word embeddings that represent the semantics of the queries and sentences are obtained by training the Continuous Skip-gram Model on a large-scale corpus. Finally, the ranked list of retrieval results is produced by applying Word Mover's Distance to calculate the semantic similarity between query and sentence, which further mitigates the “word mismatch” problem. Evaluation on TREC2003 and TREC2004 shows that the proposed method is significantly superior to the baseline sentence retrieval method. MAP and R-Precision are 13.29% and 13.54% higher than the results of the traditional method, which indicates that the proposed method handles the “word mismatch” problem effectively.

(2) A short text clustering algorithm based on key word extraction and word embeddings is proposed to address the poor clustering of short text caused by its sparse features and rapid updates. First, a formula based on word part-of-speech and length weighting is defined and used to extract key words that represent the short text. Second, word embeddings that represent the semantics of the key words are obtained by training the Continuous Skip-gram Model on a large-scale corpus. Finally, Word Mover's Distance is used to calculate the similarity between short texts, which is an important step in the hierarchical clustering algorithm. Evaluation on four test datasets shows that the proposed algorithm significantly outperforms the traditional method. The mean F is 56.41% higher than the traditional result, which shows that the proposed method makes full use of the semantic information contained in word embeddings and improves short text clustering performance.

(3) Word embedding based methods make use of semantic information that is ignored by traditional methods and have made some progress, but they do not take the order of the words in a sentence into consideration. As a result, different sentences may have the same sentence vector, and the summary will be highly redundant when the training data is not adequate. To address these problems, a summary sentence extraction method based on the PV-DM model is proposed. First, a monotone submodular objective function is formulated. Then, the sentence vectors trained by the PV-DM model are used to compute the semantic similarity between sentences, which is applied to solve the objective function. Finally, the summary of the Opinosis dataset is produced through an optimization algorithm. The summary is evaluated using the ROUGE evaluation measures. ROUGE-1 and ROUGE-2 are 8.67% and 24.95% higher than the results of the traditional method, which indicates that the proposed method can extract the most representative sentences of the topic.
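The extraction step in contribution (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a facility-location coverage objective (a standard monotone submodular function, not necessarily the one formulated in the thesis), plain greedy selection under a word budget, and toy 2-D vectors standing in for PV-DM sentence vectors. The names `cosine`, `coverage`, and `greedy_summary` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, clipped at 0 so the coverage objective stays monotone."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norm) if norm else 0.0


def coverage(selected, sim):
    """Facility-location objective (assumed here): for every sentence, take its
    best similarity to any selected sentence, and sum over all sentences."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_summary(vectors, lengths, budget):
    """Repeatedly add the sentence with the largest coverage gain that still
    fits within the word budget; stop when nothing fits or nothing helps."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    selected, used = [], 0
    while True:
        current = coverage(selected, sim)
        best, best_gain = None, 0.0
        for j in range(n):
            if j in selected or used + lengths[j] > budget:
                continue
            gain = coverage(selected + [j], sim) - current
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            return selected
        selected.append(best)
        used += lengths[best]


# Toy stand-ins for sentence vectors: two pairs of near-duplicate sentences.
vectors = [
    [1.0, 0.0],  # sentence 0
    [0.9, 0.1],  # sentence 1, close to sentence 0
    [0.0, 1.0],  # sentence 2
    [0.1, 0.9],  # sentence 3, close to sentence 2
]
lengths = [5, 5, 5, 5]  # word counts (made up for illustration)
summary = greedy_summary(vectors, lengths, budget=10)
print(summary)
```

With a budget of two sentences, the greedy loop picks one sentence from each near-duplicate pair, because a second sentence from an already covered pair adds almost no coverage. This is how a submodular objective discourages redundancy in the extracted summary.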
Keywords/Search Tags: Word Embedding, Word Mover's Distance, Sentence Retrieval, Short Text Clustering, Distributed Memory Model of Paragraph Vectors, Summary Sentence Extraction