Research On Microblog Summarization Using Paragraph Vector And Semantic Structure

Posted on: 2017-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yuan
Full Text: PDF
GTID: 2348330566456187
Subject: Electronic and communication engineering
Abstract/Summary:
With the rapid development of the Internet, the amount of text has grown explosively while social networks provide increasingly varied services. This creates a keen demand for computers that can quickly grasp the core information in this ocean of data, and summarization of short texts has therefore become a research hotspot in Natural Language Processing. This thesis aims to generate accurate summaries of microblogs by removing redundancy and increasing the relevance between the selected sentences and the microblog topic. The coverage of sub-topic segments is increased by calculating the degree of correlation between sentences and topics, and the importance of the context is assessed by mining the semantic relationships between sentences and sub-topics. Finally, social text summaries are generated by combining a deep learning model with a sentential semantic structure model. The main contributions are listed below.

(1) A new sentence similarity computing method based on Paragraph Vector and the Chinese Sentential Semantic Model (CSM) is proposed. In recent research, most similarity computing methods for social short texts such as microblogs are based on term frequency and do not take semantic and context information into account, so they cannot truly reflect the semantic association between two sentences. Based on Paragraph Vector and CSM, this thesis proposes a novel sentence similarity computing method, PV-CSM. Paragraph Vector is a deep learning method that obtains distributed representations of texts. CSM mines the semantic associations between words and phrases and generates semantic representations of texts. PV-CSM combines the two kinds of representations to calculate sentence similarities. Experiments were run on the evaluation corpus from the NLP&CC conference. The silhouette coefficient reaches 0.3842 when the compression ratio is 1.5%. The results show that the PV model captures the association between contexts and yields accurate sentence representations, while CSM makes effective use of semantic features to represent sentences and avoids wasting useful information. In summary, by drawing on both semantic and context information, the method calculates sentence similarity more accurately than the state-of-the-art methods and reflects the semantic similarity between sentences.

(2) A microblog summarization framework based on the PV-CSM method is proposed. In microblog summarization, most current methods obtain sub-topics with a sentence similarity computing method that ignores the semantic relations between words and their context, which leads to poor sub-topic discovery. Meanwhile, when extracting sentences from sub-topics, most methods based on graph layout focus solely on global information and ignore the semantic relation between a sentence and the local topic it belongs to. A new microblog summarization framework based on PV-CSM is therefore proposed to provide concise summaries that help users quickly grasp the essence of a collection of microblogs. A Latent Dirichlet Allocation (LDA) topic model is used to calculate the pairwise sentence similarities and construct the similarity matrix, based on the sentential semantic structure obtained by CSM. Sentences are then clustered into several sub-topics according to the similarity matrix, while the sentential semantic features and sentential relationship features of each sentence are extracted by CSM. The most informative sentences are extracted from each sub-topic by combining the sentential semantic features and the relationship features. Experiments were run on the evaluation corpus from the NLP&CC conference. ROUGE-1 reaches 0.42634 at a compression ratio of 0.5%, 0.5018 at 1.0%, and 0.53717 at 1.5%. The results indicate that the framework better captures sentential semantics and that the extracted semantic features strengthen the descriptive power of the sentence representation. Using both sentential semantic features and relationship features enriches the feature representation and reduces information loss, which increases the semantic relevance of similar data. The impact of noise is also reduced. In addition, the proposed method generalizes well and can be applied to various topics.

(3) A summarization system for social text is constructed to perform automatic summarization of social short texts. To extract the sentences that carry the main content of social short texts, a summarization system based on Paragraph Vector and CSM is designed and implemented in C++ and Python on the Windows operating system. The main functions of the system are preprocessing, sentence similarity computation, sub-topic detection, sentence weight computation, sentence extraction, and summarization evaluation. The modules are independent of one another and communicate only by exchanging data, which reduces coupling. The system is reliable and extensible.
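The abstract does not state how PV-CSM fuses the two representations described in contribution (1). As a minimal sketch only, assume each sentence already has a Paragraph Vector embedding and a numeric CSM feature vector. One simple option is then to interpolate the two cosine similarities and collect the results in a pairwise matrix. The function names, the weight `alpha`, and the linear fusion itself are illustrative assumptions, not the formula used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(pv_a, pv_b, sem_a, sem_b, alpha=0.5):
    """Hypothetical fusion: weight alpha on the embedding similarity,
    weight (1 - alpha) on the semantic-feature similarity."""
    return alpha * cosine(pv_a, pv_b) + (1.0 - alpha) * cosine(sem_a, sem_b)


def similarity_matrix(pv_vectors, sem_vectors, alpha=0.5):
    """Symmetric pairwise similarity matrix over all sentences."""
    n = len(pv_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = combined_similarity(
                pv_vectors[i], pv_vectors[j],
                sem_vectors[i], sem_vectors[j],
                alpha,
            )
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix
```

A matrix of this shape could then be passed to a clustering step for sub-topic detection. The choice of `alpha` would have to be tuned on held-out data, which this sketch does not attempt.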
Keywords/Search Tags: Microblog summarization, Sentence similarity computing, Deep learning, Chinese Sentential Semantic Model, Distributed representation, Semantic analysis, Natural Language Processing