Automatic Summarization Algorithm for Chinese Short Text

Posted on: 2018-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cui
Full Text: PDF
GTID: 2348330542468709
Subject: Computer software and theory
Abstract/Summary:
Social media platforms such as Weibo and Twitter attract large numbers of users who publish and share information, because they are easy to use, convenient for interaction, rich in topics, and updated in real time. As a result, they have become one of the main channels through which users obtain information, and they also provide useful data that helps businesses make decisions and seize opportunities. To improve the comprehensiveness and diversity of the information acquired, automatic summarization of short text has become a key technology. This thesis focuses on extractive summarization of Chinese short text. Taking into account both the characteristics of short text and the advantages of clustering-based text summarization, this thesis proposes an automatic summarization algorithm for short text from social networks. The algorithm aims to filter redundant information and content noise effectively and to reflect the key information from all aspects of the whole dataset. The extracted summary can help enterprises make decisions and help governments carry out public opinion control, which gives the work practical significance[1].

First, short text is short, has sparse features, and lacks contextual semantics, so the semantic information of words must be extended. This thesis therefore obtains word embeddings by training a Word2Vec model. More importantly, word embeddings preserve semantic relations under arithmetic operations, so processing a short text can be reduced to operations on the embeddings of the words it contains.

Second, to calculate word weights, this thesis proposes three main influencing factors: the frequency, the left and right entropy, and the coverage of a word. It then constructs an influence transfer matrix and redesigns the word weight calculation based on the idea of TextRank.

Third, a new short text similarity algorithm is proposed that combines the weights and the semantic information of words. To improve accuracy, the similarity between two short texts is recast as the problem of moving all the words in one text to the other with the shortest total distance.

Finally, a density-based clustering algorithm is applied to the short texts. The number of clusters and the cluster centers are obtained by calculating the local density of each short text and its shortest distance to any short text with higher density; every short text is then assigned to the cluster to which it belongs. The clustering completes in a single iteration, which greatly improves efficiency. Lastly, the weight of each short text is calculated from the weights of its words, the short texts in each cluster are sorted, and the most important short texts from each cluster are extracted to form the summary. In this way, the summary is intended to cover all aspects of the information, and its diversity and quality are improved.
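The clustering and extraction steps described in the abstract can be illustrated with a minimal sketch. It follows the general density-peaks idea (local density plus distance to the nearest denser point), but the abstract does not give the thesis's exact formulas, so everything here is an assumption: the function names `density_peak_labels` and `top_text_per_cluster`, the `cutoff` radius, the fixed `n_clusters` (the thesis derives the number of clusters from the data), the toy one-dimensional distances standing in for the short text distance, and the made-up text weights.

```python
from typing import Dict, List, Sequence


def density_peak_labels(dist: Sequence[Sequence[float]],
                        cutoff: float, n_clusters: int) -> List[int]:
    """Cluster items from a symmetric distance matrix in a single pass."""
    n = len(dist)
    # Local density: how many other items lie closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Visit items from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n      # distance to the nearest denser item
    parent = [-1] * n      # index of that nearest denser item
    delta[order[0]] = max(dist[order[0]])  # convention for the densest item
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # Centers: the densest item plus the items with the largest rho * delta.
    rest = sorted(order[1:], key=lambda i: (-rho[i] * delta[i], i))
    centers = [order[0]] + rest[: n_clusters - 1]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Every other item inherits the label of its nearest denser neighbour,
    # which has already been labelled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def top_text_per_cluster(labels: Sequence[int],
                         weights: Sequence[float]) -> Dict[int, int]:
    """Return, for each cluster, the index of its highest-weight text."""
    best: Dict[int, int] = {}
    for i, c in enumerate(labels):
        if c not in best or weights[i] > weights[best[c]]:
            best[c] = i
    return best


# Toy data: six "texts" placed on a line; distance is the absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in points] for a in points]
labels = density_peak_labels(dist, cutoff=0.15, n_clusters=2)
weights = [0.2, 0.9, 0.4, 0.3, 0.5, 0.8]  # hypothetical text weights
summary = top_text_per_cluster(labels, weights)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(summary)  # {0: 1, 1: 5}
```

In a fuller pipeline, `dist` would come from the weighted word-moving distance between short texts and `weights` from the TextRank-style word weights; both are replaced by toy values here.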
Keywords/Search Tags:social network, short text, automatic summarization, Word2Vec model, word weight, short text similarity, density-based clustering algorithm