Research And Application Of Chinese Text Classification Technology

Posted on: 2020-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: J Z Wang
Full Text: PDF
GTID: 2428330590996423
Subject: Information and Communication Engineering
Abstract/Summary:
The development of the Internet has given wings to the third revolution in science and technology and has led people into an era of information explosion. Every day, the total amount of information in the world rises at an alarming rate, and our brains are actively or passively receiving and processing large amounts of it all the time. In this era of high-speed information, time is increasingly precious, so the ability to accurately find, filter and distinguish the information we need from a mass of complicated information is extremely valuable. Text is an important carrier of information, and the need to distinguish texts quickly gave rise to text classification technology. Since its birth in the 1970s, text classification has become more and more important, so the research and application of related technologies are of great significance.

This thesis first introduces the background of text classification and the current progress made at home and abroad. With the aim of improving the stability and accuracy of text classification, it then summarizes the related techniques. The text pre-processing stage, text feature selection algorithms, text representation models, text weighting algorithms and text classification algorithms are introduced in detail, and feature selection, feature weighting and the application of text classification technology are studied in depth. The main research contents are as follows.

To address the defects of the chi-square test (CHI) feature selection algorithm on low-frequency words, the thesis proposes two improvements. First, a DT (Document & Term) factor is introduced to account for influence within a category; it includes a word frequency factor and a text frequency factor. Second, because the original algorithm does not consider how the influence of a feature word differs across categories, a category deviation factor is introduced. Based on these two improvements, a new improved chi-square test algorithm, ICHI, is proposed and compared with traditional feature selection methods and other improved methods in three sets of experiments. With SVM as the classifier, classification performance improves by 5.6% over traditional CHI and by 2.2% over the existing improved methods. The comparisons verify the effectiveness and superiority of the improved algorithm.

To address the failure of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to account for the influence of feature word categories, the thesis proposes a new concept, inverse class frequency. Based on this concept, the original TF-IDF algorithm is improved and the TF-CF (Term Frequency & Category Frequency) algorithm is proposed. The thesis then proposes a W2V-CF model, which weights word2vec word vectors by their TF-CF values and uses the result as the feature input for classification. Experiments compare the new model with five other models, including traditional methods and methods from the literature. With SVM as the classifier, performance improves by 7.7% over the traditional bag-of-words (BOW) model and by 1.7% over the existing improved model. The comparisons verify the rationality and practicability of the new model.

Combining text classification technology with TCP reverse proxy technology, the thesis designs and implements a system that can isolate and filter sensitive web pages and files (including Word, PDF, etc.), or restrict them according to configured classification rules. The practicability of the system is verified by functional and stress tests. The design and research work on this system can serve as a reference for subsequent research on the control, management and distribution of online content files.
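For readers unfamiliar with the baseline, the following is a minimal sketch of the classic chi-square (CHI) feature score that the abstract takes as its starting point. It is the textbook statistic, not the thesis's ICHI algorithm, whose DT and category deviation factors are not specified in the abstract. The function name `chi_square`, the toy documents and the labels are all hypothetical and serve only as illustration.

```python
def chi_square(docs, labels, term, category):
    """Classic chi-square score of `term` for `category`.

    docs   : list of token lists (hypothetical toy input)
    labels : list of category names, one per document
    """
    # Contingency counts over documents:
    # a: in category, contains term      b: not in category, contains term
    # c: in category, lacks term         d: not in category, lacks term
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in set(tokens)
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1

    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # The term or the category is absent (or universal): no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# Toy corpus (assumed data, for illustration only).
docs = [
    ["ball", "game"],
    ["ball", "team"],
    ["stock", "market"],
    ["stock", "game"],
]
labels = ["sports", "sports", "finance", "finance"]

ball_score = chi_square(docs, labels, "ball", "sports")
game_score = chi_square(docs, labels, "game", "sports")
```

In this toy corpus "ball" appears only in sports documents and scores higher than "game", which is spread evenly across both categories. Note that the score uses only whether a document contains the term, not how often the term occurs, which is one commonly cited reason the plain statistic can overrate rare words.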
Keywords/Search Tags: Text classification, Word2vec, Feature selection, Reverse proxy