Research On Text Classification Based On Natural Language Processing And Machine Learning

Posted on: 2019-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: B Xie
GTID: 2428330548973316
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the continuous development of information technology, the volume of electronic text is growing. Helping users find the information they need in this text quickly and accurately, and organizing and managing it effectively, has become a major challenge for information technology. Automatic text classification is a key technology for processing massive amounts of text, and it can largely resolve the problem of disordered, hard-to-navigate information. Categorizing related information resources not only helps users search for the required information accurately, but also enables effective data management.

With the aim of improving the accuracy of automatic text classification, this paper analyzed automatic text classification and its related technologies. Every step from natural language processing through to classification is closely linked; most importantly, word segmentation must be accurate during natural language processing for the subsequent classification to be accurate. For natural language processing, this paper relied on thesaurus matching and used the "Telelogical Survey of Railway Engineering Geology" (TB1002-2007) as the training standard. First, the text documents underwent natural language processing. Because of the limitations of automatic word segmentation and the difficulty of recognizing ambiguous words, both computer segmentation and manual segmentation were used when constructing the corpus. Words whose term frequency exceeded a predetermined threshold were selected from the final segmentation results to form the geological exploration corpus. To prevent a single geological survey term from being split apart, the corpus was added to the Python database for word matching when studying the automatic classification of geological survey texts.

Building on the natural language processing described above, feature reduction was applied to the word segmentation results of the text documents: terms whose frequency exceeded a certain threshold were used as feature words. Word cloud analysis was then performed on the text documents. The size of a word in the word cloud indicated its frequency: the higher the frequency, the larger the font. Finally, machine learning methods were used to classify the segmented text files automatically. Two methods were used: k-nearest-neighbor classification and Bayesian classification. The prediction accuracy of k-nearest-neighbor classification on the test set was significantly higher than that of Bayesian classification. For the k-nearest-neighbor algorithm, the final results showed that prediction accuracy with ten categories was higher than with sixteen categories; when the text length was controlled, the k-nearest-neighbor algorithm performed best and its accuracy reached 100%.
Keywords/Search Tags: Natural language processing, Feature dimension reduction, Word cloud analysis, Text classification