
Design And Realization Of Text Abstract System Based On Word Embedding

Posted on: 2018-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y D D Wan
GTID: 2428330545996647
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid development of Internet-based information technology, society has entered the information age, and people in all occupations depend on the network. Given the large volume of information collected from the Internet (mostly in the form of documents), information indexing is very important, especially for users who search for content that matches their own needs. They only have to enter a keyword and wait for the results. With this retrieval method, however, users are often confronted with a huge amount of information when reading the returned documents, and it is difficult to grasp the main content of a whole document in a limited time. A document summary expresses a complex document in simple, concise language, giving users a profile of the original document and letting them quickly understand the relevant content.

This thesis constructs an automatic text summary generation system using word embedding, a technique widely used in language modeling and feature learning in natural language processing, in which each word is mapped to a real-valued vector in a low-dimensional space, making text analysis more convenient. The system can not only automatically generate a complete and accurate text summary for the supplied text data, but also meet the requirements of the tester.

The research content mainly includes the following aspects. First, the thesis outlines and discusses the concepts, characteristics, and main content of web crawlers, big data processing, machine learning, word embedding, and the Word2Vec technique for representing the words of a text. Second, a requirements analysis of the system is carried out according to the actual application. Then the data acquisition module, vector generation module, text summary generation module, and system comparison and evaluation module are designed and implemented in detail, establishing a text summary system based on word embedding. Finally, the quality evaluation method is explained, and two text summaries are compared in detail to demonstrate that the word embedding method is superior: one summary is generated with TextRank and the other with word embedding. The two text summary systems are evaluated with ROUGE, a scientific evaluation method.

Compared with current research, this thesis has the following characteristics: (1) Web crawler technology is used to obtain raw material from multiple social websites, and MapReduce is used to improve system efficiency. (2) The collected data undergoes traditional-to-simplified Chinese conversion, character encoding processing, Chinese word segmentation, and a series of other operations; state-of-the-art word embedding technology is then trained on the data to obtain good Chinese word vectors. Combining the currently popular word embedding technology with neural networks and Bayesian algorithms from machine learning, the automatic text summary system is implemented, which facilitates users' information retrieval. (3) The summary generated with word embedding is compared with another summary generated with traditional TextRank under the ROUGE-N standard, demonstrating the superiority of the word embedding technique.
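
To make the ROUGE-N comparison mentioned above concrete, the following is a minimal sketch of ROUGE-N recall, not the thesis's own evaluation code. It assumes both summaries are already split into tokens (for Chinese text, by a word segmenter); the function names `ngrams` and `rouge_n_recall` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram matches divided by reference n-grams.

    Assumption: `candidate` and `reference` are lists of tokens that have
    already been segmented; a single reference summary is used.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        # Reference is shorter than n tokens, so recall is undefined; use 0.
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipped counting).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    candidate = "the cat lay on the mat".split()
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

Under this sketch, a TextRank summary and a word-embedding summary of the same document would each be scored against the same reference summary, and the higher ROUGE-N value indicates greater n-gram overlap with the reference.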
Keywords/Search Tags:Word embedding, Text summary, Word2Vec, Machine learning, Web crawler