Research On Text Classification And Automatic Summarization Based On Distributed Representation

Posted on: 2019-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhu
Full Text: PDF
GTID: 2428330572461857
Subject: Engineering
Abstract/Summary:
How to represent text for machine processing is the first problem in natural language processing, and mapping text to vectors is a good way to solve it. The distributed representation method maps text elements to fixed-length vectors through neural network training, and the distance between vectors characterizes the semantic relatedness of the corresponding text elements. It thereby overcomes two problems of one-hot vectors: their excessively high dimensionality and their inability to express similarity between text elements. Based on the distributed representation of text, this thesis designs a new text classification algorithm and a new multi-document automatic summarization algorithm.

To deal with the high dimensionality and extreme sparsity of the text graph model, this thesis designs a concept directed graph model for text by combining the distributed representation of words with the graph representation of text. First, the words in a text are mapped to word vectors, and words with high semantic relatedness are grouped into concepts by clustering the word vectors. Then the concept directed graph is constructed according to the order of the words in the text, and the adjacency matrix of that graph is saved as a grayscale image. In this way the text is mapped to a grayscale image and the natural language processing task is transformed into an image processing task. Finally, a three-layer convolutional neural network is designed to classify the grayscale images. Compared with three other text classification algorithms, the proposed algorithm achieves better classification results.

To address the redundancy of summary sentences in Chinese multi-document automatic summarization, this thesis combines the distributed representation of sentences with spectral clustering and designs an extractive multi-document summarization algorithm based on spectral clustering. First, the sentences in the documents are mapped to sentence vectors. The sentence vectors are then clustered with the spectral clustering algorithm, which divides the documents into sub-topic documents. Next, a sentence relation graph is built within each sub-topic document, and the sentence weights are computed iteratively with the TextRank algorithm. Finally, the sentence with the largest weight is extracted from each sub-topic as a summary sentence, and the summary is assembled according to the positions of the sentences in the original text.

Intrinsic evaluation methods for summaries often require manual participation and are therefore neither efficient nor objective. This thesis proposes an automatic summary evaluation method based on text information entropy, in which the quality of a summary is measured by the ratio of the information entropy of the summary to the information entropy of the original document. The method does not require humans to write a reference summary. Using this evaluation method, the proposed multi-document summarization algorithm is compared with two other automatic summarization algorithms, and the results show that the proposed algorithm performs better.
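
The entropy-ratio evaluation described above can be illustrated with a minimal sketch. The abstract does not say how text information entropy is computed, so this sketch assumes Shannon entropy over the empirical word-frequency distribution, with lowercased whitespace tokenization; the function names `shannon_entropy` and `entropy_ratio` are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution.

    Assumption: the thesis may define text information entropy differently;
    word-frequency entropy is used here only for illustration.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_ratio(summary, document):
    """Ratio of the summary's entropy to the source document's entropy."""
    # Assumption: lowercased whitespace tokenization stands in for whatever
    # segmentation the thesis applies to its texts.
    doc_entropy = shannon_entropy(document.lower().split())
    if doc_entropy == 0:
        return 0.0  # avoid division by zero for empty or one-word documents
    return shannon_entropy(summary.lower().split()) / doc_entropy


if __name__ == "__main__":
    document = "the cat sat on the mat while the dog slept on the rug"
    summary = "the cat sat on the mat"
    print(round(entropy_ratio(summary, document), 3))
```

Under these assumptions, a summary that repeats the same few words scores lower than one that covers more of the source vocabulary, which is the intuition behind scoring a summary without a human-written reference.
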
Keywords/Search Tags:Distributed Representation Model, Concept Directed Graph Model, Text Classification, Multi-Document Summarization, Text Information Entropy