
Research On The Effect Of Text Vectorization Method On Text Classification Effect

Posted on: 2019-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z Zhao
Full Text: PDF
GTID: 2429330545970813
Subject: Applied statistics
Abstract/Summary:
As an important research direction in text mining, text classification plays an important role in natural language processing. With the rapid development of the Internet, the amount of information people receive in their daily lives has grown exponentially. How to manage this exponentially growing data so that people can obtain information more easily and quickly has become a key academic issue in the field. Converting texts into structured data and building models on it is a necessary part of text classification. At present, text modeling methods include vector space models and topic models. Both can represent texts effectively, but each has its own drawbacks. The disadvantages of the vector space model are high dimensionality, highly sparse text vectors, and difficulty in handling synonyms and polysemy. Compared with the vector space model, the topic model can reduce the dimensionality and discover hidden semantics, but it brings issues with training samples, long training time, and other problems that affect classification efficiency. After studying the related text vectorization techniques, this thesis does the following work:
(1) Using web crawler technology to collect more than 10,000 news items covering automobiles, finance, real estate, military, science and technology, and society.
(2) Applying the baseline text vectorization methods to the crawled corpus and classifying it with multiple classifiers to obtain the baseline classification accuracy and recall.
(3) Combining the advantages of the topic model and the vector space model, a TextRank-weighted word vector, a topic-TFIDF weighted word vector, and a topic-TextRank weighted word vector are proposed to represent the text. These representations not only emphasize the role of keywords in the text but also add latent topic information to it.
(4) The three algorithms are trained on the news corpus, and multiple groups of comparison experiments are conducted, using accuracy and recall to compare the algorithms. The experiments demonstrate the effectiveness and feasibility of the proposed algorithms.
Keywords/Search Tags:text classification, text vectorization, topic model, weighting
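
The abstract does not give the formulas behind the proposed weighted word vectors, so the following is only a minimal sketch of the general idea of a keyword-weighted word vector: each document is represented as the weighted average of its word vectors, with TF-IDF scores as the weights. The smoothed IDF formula, the function names `tfidf_weights` and `weighted_doc_vector`, and the toy two-dimensional word vectors are all assumptions made for illustration and are not taken from the thesis. TextRank scores could be substituted for the TF-IDF weights in the same way; how the thesis combines topic information with these weights is not specified in the abstract and is not modeled here.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per tokenized document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used here;
    the thesis may use a different variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        })
    return weights


def weighted_doc_vector(doc_weights, word_vectors, dim):
    """Weighted average of word vectors; words without a vector are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in doc_weights.items():
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]


# Toy corpus and hypothetical 2-dimensional word vectors (illustration only).
docs = [["car", "engine", "car"], ["stock", "market", "engine"]]
word_vectors = {
    "car": [1.0, 0.0],
    "engine": [0.5, 0.5],
    "stock": [0.0, 1.0],
    "market": [0.0, 1.0],
}
doc_vectors = [
    weighted_doc_vector(weights, word_vectors, 2)
    for weights in tfidf_weights(docs)
]
```

In this toy setup the first document's vector leans toward the first dimension because "car" is both frequent in that document and absent from the other, so it receives the largest weight. The resulting document vectors could then be passed to any standard classifier.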