Research On Text Classification Algorithm Based On Corpus Characteristics

Posted on: 2020-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: B C Hao
GTID: 2428330599460495
Subject: Engineering
Abstract/Summary:
The rapid development of the Internet has led to a large increase in text information. How to accurately classify valuable text is one of the active research topics in natural language processing. Traditional text classification methods tend to ignore the influence of corpus features on classification performance, whereas taking the characteristics of the corpus itself into account can improve it. This thesis uses supervised machine learning and deep learning classification algorithms to study text classification based on corpus features, in three parts.

First, the Term Frequency-Inverse Document Frequency (TF-IDF) feature weighting algorithm is improved to address the problem that the sentiment features of emotional corpora are not prominent. By constructing a corpus-specific sentiment dictionary and matching it against the emotional corpus, sentiment features are enhanced and redundant information is removed, which optimizes the TF-IDF vector space model for emotional corpus classification. The experimental results show that the model improves classification performance on emotional corpora across a variety of classifiers.

Second, to address the imbalance in text length within a corpus, the way classification models based on convolutional neural networks (CNN) and long short-term memory (LSTM) networks process corpus data is improved. At the model input, texts are equalized in length by a sentence self-loop method, so that all neural units of the network take part in feature extraction. The experimental results show that the method accelerates model convergence and improves classification performance on corpora with unequal text lengths.

Finally, for domain-specific corpora with strong contextual semantics, a classification model combining a convolutional neural network with a bidirectional gated recurrent unit (BiGRU) network is designed. The model uses self-trained topic vectors to strengthen the semantic connections between words. By combining the ability of the convolutional neural network to extract local features with the ability of the bidirectional gated recurrent unit network to capture preceding and following context, the model performs feature reduction and contextual semantic extraction on the corpus. Experiments show that the model reduces the dimensionality of the corpus data, saves network computing resources, and improves classification accuracy on the specific corpus.
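The abstract does not say how the sentence self-loop method is carried out. As a rough illustration only, the sketch below assumes one plausible reading: a short text is extended by cycling through its own tokens until it reaches a fixed input length, instead of being filled with padding symbols. The function name `self_loop_pad`, the `<pad>` fallback token, and the truncation of long texts are all assumptions made for this example, not details taken from the thesis.

```python
from typing import List


def self_loop_pad(tokens: List[str], target_len: int) -> List[str]:
    """Repeat a token sequence cyclically until it reaches target_len.

    Assumption: this is one possible interpretation of "sentence self-loop";
    the thesis abstract does not specify the exact procedure.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if not tokens:
        # Nothing to repeat; fall back to a hypothetical padding token.
        return ["<pad>"] * target_len
    # Long texts are truncated; short texts wrap around to their own start.
    return [tokens[i % len(tokens)] for i in range(target_len)]


short_text = "great phone".split()
long_text = "the battery died after two days of light use".split()

print(self_loop_pad(short_text, 5))
print(self_loop_pad(long_text, 5))
```

Under this reading, every input position carries a real token from the text, which matches the stated aim of having all neural units take part in feature extraction rather than processing padding.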
Keywords: corpus characteristics, text classification, term frequency-inverse document frequency, machine learning, word vector, recurrent neural network