
Research And Implement Of Chinese Multi-Selection Text Categorization System Based On Hadoop

Posted on: 2020-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: S T Ding
Full Text: PDF
GTID: 2428330572467222
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of computer technology and the spread of the Internet, the volume of online information has grown explosively. Because this data is mixed with advertisements, harmful content, and spam, it is increasingly difficult for people to find useful information efficiently. Many researchers have worked on speeding up access to useful information, and text categorization is one of the most important techniques to emerge from that work. Most text classification methods in common use today rely on supervised learning algorithms, which suffer from slow classification, low accuracy, and a single mode of classification when applied to massive data sets. To address these problems, this thesis proposes the CTF (Chinese Text Fast) classification algorithm, the HA-SVM (High Accuracy Support Vector Machine) classification algorithm, and the CM-Selection (Chinese Multiple Selection) text classification system. The main work of this thesis is as follows:
(1) Based on the observation that a text's title is representative of its category, the CTF classification algorithm is proposed. It combines word segmentation, stop-word removal, Word2Vec model training, a category queue, and related techniques. CTF is a fast classification algorithm: it completes classification in O(n) time while holding classification accuracy at about 75%.
(2) To address the SVM algorithm's heavy theoretical dependence on the text vector, the HA-SVM classification algorithm is proposed. HA-SVM is a high-accuracy classification algorithm that improves on SVM and raises text classification accuracy. The gain is largest for Chinese texts with little or disorganized content, where accuracy increases by more than 35%.
(3) Building on a study of existing text classification systems and on the CTF and HA-SVM classification algorithms, the CM-Selection text classification system is constructed. The system offers both a fast classification mode and a high-accuracy classification mode.
(4) Drawing on current big data processing technology and its use in text classification, the CTF and HA-SVM classification algorithms are integrated into the Hadoop platform, which significantly improves processing efficiency and reduces processing time for massive text collections.
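The abstract does not give the details of CTF, so the following is only a minimal sketch of the general idea behind item (1): classifying a text from its title by removing stop words, averaging word vectors, and picking the closest category. Everything named here is an assumption made for illustration. `TOY_VECTORS` is a hand-written stand-in for a trained Word2Vec model, whitespace splitting stands in for Chinese word segmentation, `CATEGORY_SEEDS` is a hypothetical way of describing categories, and the thesis's category queue is not modeled.

```python
import math

# Hypothetical toy embeddings standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "football": (0.9, 0.1, 0.0),
    "match": (0.8, 0.2, 0.1),
    "league": (0.85, 0.05, 0.1),
    "stock": (0.1, 0.9, 0.1),
    "market": (0.2, 0.8, 0.1),
    "bank": (0.1, 0.85, 0.2),
    "phone": (0.1, 0.1, 0.9),
    "chip": (0.0, 0.2, 0.9),
}

# Hypothetical stop-word list and per-category seed words (assumptions).
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}
CATEGORY_SEEDS = {
    "sports": ["football", "league"],
    "finance": ["stock", "bank"],
    "tech": ["phone", "chip"],
}


def mean_vector(words):
    """Average the vectors of the known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_title(title):
    """Return the best-matching category for a title, or None if no word is known."""
    # Whitespace splitting stands in for a real Chinese word segmenter.
    tokens = [t for t in title.lower().split() if t not in STOP_WORDS]
    title_vec = mean_vector(tokens)
    if title_vec is None:
        return None
    centroids = {c: mean_vector(seeds) for c, seeds in CATEGORY_SEEDS.items()}
    return max(centroids, key=lambda c: cosine(title_vec, centroids[c]))
```

With a fixed number of categories and a fixed vector dimension, this sketch makes one pass over the title tokens, so its cost grows linearly with title length. That is consistent with the O(n) figure in the abstract, although the thesis may reach that bound by a different route.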
Keywords/Search Tags:CM-Selection, Classification, Hadoop, HA-SVM, Word2Vec, CTF