Research On Internet Short Text Message Oriented Multi-Document Automatic Summarization

Posted on:2017-10-06

Degree:Master

Type:Thesis

Country:China

Candidate:X Liu

Full Text:PDF

GTID:2428330596959997

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid evolution of the Internet, especially the mobile Internet, the scale of the Internet in China keeps expanding. Internet users produce a large number of complicated and redundant short text messages as they communicate, post, and forward information. Traditional multi-document automatic summarization methods are almost all oriented toward long text and cannot deal with short text messages, which are sparse. Multi-document automatic summarization oriented toward Internet short text messages can generate a summary that covers the full information on a topic within a limited number of words. Such summaries are badly needed both by the government, to grasp Internet public opinion, and by intelligence organizations, to improve the efficiency of intelligence gathering. This thesis studies Internet short text message oriented multi-document automatic summarization in depth, covering sentence retrieval, short text clustering, and summary sentence extraction. The main contributions are as follows.

(1) A sentence retrieval method based on WordNet and word embeddings is proposed to address the vocabulary mismatch problem rooted in the sparsity of sentences and queries. First, the Personalized PageRank algorithm is run over the graph representation of WordNet concepts and relations to obtain concepts related to the queries, which partially resolves the sparsity of the queries. Second, word embeddings that represent the semantics of the queries and sentences are obtained by training the Continuous Skip-gram model on a large-scale corpus. Finally, the ranked list of retrieval results is obtained by applying Word Mover's Distance to calculate the semantic similarity between query and sentence, which further addresses the "word mismatch" problem. The evaluation on TREC2003 and TREC2004 shows that the proposed method is significantly superior to the baseline sentence retrieval method. MAP and R-Precision are 13.29% and 13.54% higher than the results of the traditional method, which illustrates that the proposed method can effectively handle the "word mismatch" problem.

(2) A short text clustering algorithm based on keyword extraction and word embeddings is proposed to address the poor clustering of short text caused by its sparse features and rapid updates. First, a formula based on word part-of-speech and length weighting is defined and used to extract keywords that represent the short text. Second, word embeddings that represent the semantics of the keywords are obtained by training the Continuous Skip-gram model on a large-scale corpus. Finally, Word Mover's Distance is used to calculate the similarity between short texts, which is an important step in the hierarchical clustering algorithm. The evaluation on four test datasets shows that the proposed algorithm significantly outperforms the traditional method. The mean F-measure is 56.41% higher than the traditional result, which shows that the proposed method can make full use of the semantic information carried by word embeddings and improve short text clustering performance.

(3) The word embedding based method attempts to use semantic information that traditional methods ignore and has made some progress, but it does not take the order of the words in a sentence into account. As a result, different sentences may have the same sentence vector, and the summary will be highly redundant when the training data is inadequate. To address these problems, a summary sentence extraction method based on the PV-DM model is proposed. First, a monotone submodular objective function is formulated. Then, the sentence vectors trained by the PV-DM model are used to obtain the semantic similarity between sentences, which is applied to solve the objective function. Finally, the summary of the Opinosis dataset is produced through an optimization algorithm. The summary is evaluated with the ROUGE evaluation measures. ROUGE-1 and ROUGE-2 are 8.67% and 24.95% higher than the results of the traditional method, which indicates that the proposed method can extract the most representative sentences of the topic.
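
The extraction step in contribution (3) can be illustrated with a minimal sketch. The abstract does not state the objective function or the optimization algorithm, so everything below is an assumption: a facility-location coverage objective (monotone submodular when similarities are non-negative), greedy selection under a sentence-count budget, and hand-written toy vectors standing in for PV-DM sentence vectors. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise similarities, clipped at 0 so the objective stays monotone."""
    return [[max(0.0, cosine(u, v)) for v in vectors] for u in vectors]


def coverage(sim, selected):
    """Assumed objective: each sentence is credited with its similarity
    to the closest selected sentence (facility-location coverage)."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_summary(sim, budget):
    """Repeatedly add the sentence with the largest marginal gain."""
    selected = []
    current = 0.0
    candidates = set(range(len(sim)))
    while candidates and len(selected) < budget:
        best, best_gain = None, 0.0
        for j in sorted(candidates):  # sorted for deterministic tie-breaking
            gain = coverage(sim, selected + [j]) - current
            if gain > best_gain + 1e-12:
                best, best_gain = j, gain
        if best is None:  # no candidate improves the objective
            break
        selected.append(best)
        current += best_gain
        candidates.remove(best)
    return selected


# Toy stand-ins for sentence vectors: two groups of near-duplicates.
vectors = [
    [1.0, 0.0],  # sentence 0, topic A
    [0.9, 0.1],  # sentence 1, close to sentence 0
    [0.0, 1.0],  # sentence 2, topic B
    [0.1, 0.9],  # sentence 3, close to sentence 2
]

print(greedy_summary(similarity_matrix(vectors), budget=2))
```

With a budget of two, the greedy selection picks one sentence from each group of near-duplicates, because a second sentence from an already covered group adds little to the objective. For a monotone submodular objective under a cardinality constraint, greedy selection is guaranteed to reach at least 1 - 1/e of the optimal value, which is one reason this family of objectives is popular for extractive summarization.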

Keywords/Search Tags:

Word Embedding, Word Mover's Distance, Sentence Retrieval, Short Text Clustering, Distributed Memory Model of Paragraph Vectors, Summary Sentence Extraction

Related items

1	Research On Text Summarization Technology Based On Word And Paragraph Vectorization Representation
2	Unsupervised Extractive Text Summarization Using Sentence Embedding
3	Research On Text Representation Model And Similarity Calculation Algorithm
4	Research On Language Modeling Based Sentence Retrieval
5	Computational Methods Of Sentence Distance Based On Multi-modal Word Embedding
6	Research On Automatic Answering Technique Of English Test
7	With Distance From The Automatic Word Clause
8	Sentence-embedding And Similarity Via Hybrid Bidirectional-LSTM And CNN Utilizing Weighted-pooling Attention
9	Research On Sentence Alignment Method Based On Cross-lingual Word Embeddings
10	Research And Application Of Short Text Clustering Algorithm Based On Word Embedding