
Research On Short Text Automatic Summarization Algorithm Based On TextRank And Word2Vec

Posted on: 2019-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: D Chen
Full Text: PDF
GTID: 2428330596465445
Subject: Electronic Science and Technology
Abstract/Summary:
With the rise of social networks and shopping sites, large volumes of short text data are being generated. Obtaining useful information from so much short text is an urgent problem. Automatic summarization, a research hotspot in Natural Language Processing, is an effective way to address it: it helps users quickly grasp the information contained in large collections of short texts. Extracting summaries of short text quickly and accurately in turn depends on research into automatic summarization algorithms. This thesis therefore studies the graph-based TextRank automatic summarization algorithm and improves it to address its shortcomings, taking the characteristics of short text into account.

The research object is event-oriented short text from Sina Weibo (micro-blog). TextRank has three evident problems in this setting: it ignores the event topic, the similarity algorithm used for its edge weights is not ideal, and the extracted summary sentences contain obvious redundancy. The algorithm is improved on these three points in light of the characteristics of short text, and the experimental results show that the improved algorithm is effective. The work covers the following four aspects:

1) As a basis for the summarization task, the thesis focuses on modeling short text and calculating short text similarity. Taking into account both the statistical features and the semantic features of short text, it proposes a weighted vector text representation that combines TF-ICF with Word2Vec. For the similarity measure, theoretical analysis and experiments lead to the choice of cosine similarity for short text. Experiments verify the effectiveness of the proposed modeling method.

2) The thesis examines the principle of TextRank in depth. Because the similarity algorithm behind TextRank's edge weights is not ideal and does not make full use of the statistical and semantic features of short text, the thesis proposes reconstructing the edge weights of TextRank with cosine similarity computed on the text vectors weighted by Word2Vec and TF-ICF. Experiments verify the feasibility and effectiveness of this reconstruction.

3) To address TextRank's neglect of the text topic, the thesis proposes adjusting the weight of each short text according to the similarity between the topic sentence and that short text, and then adjusting the weight further with a text length factor. Experiments verify the validity of the proposed method.

4) To address the redundancy that easily appears among summary sentences when TextRank extracts several of them, the thesis introduces the MMR redundancy control algorithm, applied after the short text weights have been adjusted. Experiments show that this effectively prevents candidate short texts containing too much similar information from all appearing in the final summary.

The main innovations of the thesis are twofold. First, it proposes a short text feature extraction method that combines the TF-ICF model with the Word2Vec model and, on that basis, a method for reconstructing the edge weights of TextRank using cosine similarity. Second, it proposes adjusting the short text weights in TextRank-based summarization by introducing a topic adjustment factor and a text length adjustment factor and by applying MMR. These changes address three weaknesses of TextRank summarization: it ignores the topic, the final weight is affected by text length, and the output is noticeably redundant.
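The pipeline described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the thesis's implementation: the function names, the damping factor of 0.85, the MMR trade-off parameter lam, the clamping of negative similarities to zero, and the normalization of scores before MMR are all assumptions. The term weights stand in for TF-ICF values and the word vectors stand in for a trained Word2Vec model, both supplied by the caller. The topic and text length adjustment factors are omitted because the abstract does not give their formulas.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_text_vector(tokens, word_vectors, term_weights):
    """Weighted average of word vectors.

    word_vectors: token -> list of floats (stand-in for Word2Vec output).
    term_weights: token -> float (stand-in for TF-ICF; defaults to 1.0).
    """
    total, weight_sum = None, 0.0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary tokens
        w = term_weights.get(token, 1.0)
        if total is None:
            total = [0.0] * len(vec)
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if total is None or weight_sum == 0:
        return []
    return [x / weight_sum for x in total]


def similarity_matrix(vectors):
    """Edge weights: pairwise cosine similarity, negatives clamped to 0 (assumption)."""
    n = len(vectors)
    return [[0.0 if i == j else max(0.0, cosine(vectors[i], vectors[j]))
             for j in range(n)] for i in range(n)]


def textrank(sim, damping=0.85, iterations=50):
    """Weighted TextRank scores computed by fixed-count iteration."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_sum = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if j != i and out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def mmr_select(scores, sim, k, lam=0.7):
    """Pick k indices, trading off score against similarity to earlier picks."""
    top = max(scores) if scores and max(scores) > 0 else 1.0
    relevance = [s / top for s in scores]  # scale scores to match similarities
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

A caller would build one vector per short text with weighted_text_vector, pass the vectors to similarity_matrix, score the texts with textrank, and then choose the summary with mmr_select. Lower values of lam penalize redundancy more strongly.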
Keywords/Search Tags: automatic document summarization, short text, Word2Vec, feature extraction, similarity calculation, TextRank