Research On Automatic Summarization Of Chinese Documents Based On Deep Learning

Posted on: 2019-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: X T Jia
Full Text: PDF
GTID: 2428330548476809
Subject: Management Science and Engineering
Abstract/Summary:
With advances in science and technology and the continued development of the Internet, the volume of online information is enormous and grows at an alarming rate every day. Under these circumstances, how to extract knowledge of interest from this mass of information quickly and easily has become one of the most urgent problems of the information age. Automatic summarization is a key technique for addressing this problem. By automatically summarizing large amounts of Internet text, it improves the efficiency with which users browse and acquire knowledge, and helps people quickly find useful information in life and work. In recent years, deep learning has risen and flourished, and deep text representation models have received extensive attention from researchers at home and abroad, which has laid the foundation for further improvement of automatic summarization. Considering the shortcomings of traditional text representation models, such as their inability to fully capture the semantic, contextual, and grammatical information of text, this thesis studies and improves classic automatic text summarization techniques on two typical types of text data. The main contents are as follows:

1) For single-document text from academic papers, an automatic summarization method based on Doc2vec and an improved clustering algorithm is proposed. For paper text obtained from China National Knowledge Infrastructure, the Doc2vec text representation model is introduced to vectorize sentences while taking into account the semantic and grammatical information of each sentence's context. The initial cluster centers of the K-means algorithm are determined by combining two metrics, density and distance, to overcome the instability of clustering results caused by random selection of the initial centers. The sentence with the largest information entropy in each cluster is then extracted as the center sentence of that cluster, which completes the summary extraction process.

2) For multi-document text from Sina Weibo, an automatic summarization method based on a weighted topic distribution representation is proposed. It combines the strength of Word2vec, which can fully capture the semantic and grammatical information of context, with the strong performance of topic models in multi-document text clustering. After the microblog words are trained into word vectors, they are clustered into topic word classes. These classes serve as features for representing each microblog sentence, based on the sentence's degree of membership in each topic word class and on the weight of that class. The K-means algorithm is then used to cluster the microblog vectors, and the sentence with the maximum information entropy is extracted from every cluster to produce the summary.

The experimental results show that the summaries generated by the proposed method effectively represent the main ideas of the source documents and have advantages in accuracy, recall, and F-value over the traditional model. This indicates that the extracted summaries are of relatively high quality and that the method improves, to a certain degree, the effectiveness of automatic summarization of Chinese documents compared with traditional text representation methods.
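The two selection steps named in the abstract, choosing initial K-means centers from density and distance and then picking the maximum-entropy sentence of each cluster, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the formulas, so the density measure (neighbors within a hypothetical `radius`), the combined score (density times distance to the nearest chosen center), and the entropy definition (Shannon entropy over a sentence's tokens) are all assumed here, and every function name is hypothetical. Sentence vectors would come from Doc2vec in the thesis; plain tuples stand in for them.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_initial_centers(vectors, k, radius):
    """Return indices of k initial centers chosen by density and distance.

    Assumed scoring (not taken from the thesis):
      density  = number of other points within `radius`
      distance = distance to the nearest center already chosen
    """
    n = len(vectors)
    dist = [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    density = [sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
               for i in range(n)]

    # First center: the densest point (ties go to the lowest index).
    centers = [max(range(n), key=lambda i: density[i])]

    # Remaining centers: dense points that lie far from the chosen ones.
    while len(centers) < k:
        def score(i):
            return (density[i] + 1) * min(dist[i][c] for c in centers)
        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))
    return centers


def sentence_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution of one sentence."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def summarize(sentences, labels):
    """Return the highest-entropy sentence of each cluster, ordered by label."""
    best = {}
    for tokens, label in zip(sentences, labels):
        h = sentence_entropy(tokens)
        if label not in best or h > best[label][0]:
            best[label] = (h, tokens)
    return [best[label][1] for label in sorted(best)]


# Toy data: two well-separated groups of 2-D "sentence vectors".
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = pick_initial_centers(vectors, k=2, radius=1.5)

# Toy tokenized sentences with cluster labels assigned by some clustering run.
sentences = [["a", "a"], ["a", "b"], ["c"]]
labels = [0, 0, 1]
summary = summarize(sentences, labels)
```

Because the centers are chosen deterministically, repeated runs on the same vectors start K-means from the same points, which is the stability property the abstract attributes to the improved initialization.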
Keywords/Search Tags: automatic summarization, deep learning, Doc2vec, Word2vec, K-means