Study On Chinese Text Classification Technology Based On Improved Text Similarity Algorithm

Posted on: 2020-11-09 | Degree: Master | Type: Thesis
Country: China | Candidate: M G Li | Full Text: PDF
GTID: 2428330596487273 | Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, data of all kinds has grown geometrically, and extracting truly valuable information from massive data has become a top priority. For text data, efficient and accurate classification across a wide range of data is critical, which calls for in-depth research into and analysis of text classification technology. As a key technology in natural language processing, text classification is also the foundation of many common applications, such as question answering systems, sentiment analysis, and relation extraction. Although Chinese text classification started relatively late and faces more complicated grammatical analysis problems, it has made great progress thanks to advances in text processing algorithms and improvements in data processing performance. This thesis first compares and analyzes several text classification techniques based on text similarity algorithms, and then introduces the distinctive semantic and grammatical structure of Chinese text: words are the basic unit for expressing sentence meaning, and words with different parts of speech carry widely varying amounts of information. On this basis, the text similarity algorithm used in Chinese text classification is improved, combining statistics and linguistics to calculate Chinese text similarity. In addition, the existing word segmentation method is optimized for the Chinese word segmentation step that follows data preprocessing, and the commonly used evaluation criteria are also improved to suit the characteristics of the experimental data. Finally, in comparative experiments, the Chinese text classification method based on the improved text similarity algorithm achieved 72% accuracy, nearly 20% higher than the Chinese text classification method based on the vector space model similarity algorithm. This shows that the improved Chinese word segmentation method and the text similarity algorithm based on part-of-speech tagging and the Word2Vec model can achieve better classification results in Chinese text classification.
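The abstract does not give the formula for the improved similarity measure, so the following is only a minimal sketch of one way part-of-speech information and word vectors could be combined: each word vector is weighted by a part-of-speech weight, the weighted vectors are averaged into a text vector, and two texts are compared by cosine similarity. The weight table, the tag names, the function names, and the two-dimensional toy embeddings are all assumptions made for illustration; they are not taken from the thesis. In practice the vectors would come from a trained Word2Vec model and the tags from a Chinese segmenter and tagger.

```python
import math

# Hypothetical part-of-speech weights (assumption, not from the thesis):
# nouns and verbs are treated as carrying more information than adverbs.
POS_WEIGHTS = {"n": 1.0, "v": 0.8, "a": 0.5, "d": 0.2}
DEFAULT_WEIGHT = 0.1  # fallback for tags not listed above


def weighted_text_vector(tagged_words, embeddings):
    """Average the word vectors of a text, weighting each by its POS weight.

    tagged_words: list of (word, pos_tag) pairs, already segmented and tagged.
    embeddings:   dict mapping word -> list of floats (all the same length).
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, pos in tagged_words:
        vec = embeddings.get(word)
        if vec is None:  # skip out-of-vocabulary words
            continue
        w = POS_WEIGHTS.get(pos, DEFAULT_WEIGHT)
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector for empty or fully unknown text
    return [x / weight_sum for x in total]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(text_a, text_b, embeddings):
    """POS-weighted embedding similarity between two tagged texts."""
    return cosine_similarity(
        weighted_text_vector(text_a, embeddings),
        weighted_text_vector(text_b, embeddings),
    )


# Toy two-dimensional embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "足球": [1.0, 0.0],
    "比赛": [0.9, 0.1],
    "精彩": [0.6, 0.4],
    "很": [0.5, 0.5],
    "股票": [0.0, 1.0],
}
DOC_SPORT_1 = [("足球", "n"), ("比赛", "n")]
DOC_SPORT_2 = [("比赛", "n"), ("很", "d"), ("精彩", "a")]
DOC_FINANCE = [("股票", "n")]
```

With these toy values, `text_similarity(DOC_SPORT_1, DOC_SPORT_2, TOY_EMBEDDINGS)` comes out higher than `text_similarity(DOC_SPORT_1, DOC_FINANCE, TOY_EMBEDDINGS)`. A classifier built on such a measure could, for instance, assign a text to the category of its most similar labeled examples, although the abstract does not say which decision rule the thesis uses.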
Keywords/Search Tags: Text Classification Technology, Text Similarity Algorithm, Part of Speech Tagging, Word2Vec Model, Chinese Word Segmentation