Design And Implementation Of Kazak Text Categorization System

Posted on: 2015-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: H T Mu
Full Text: PDF
GTID: 2308330473451761
Subject: Software engineering
Abstract/Summary:
With the widespread adoption of computer technology in the minority areas of Xinjiang in recent years, the volume of Kazakh-language electronic documents has grown steadily. Managing these documents effectively, and giving users convenient and efficient information retrieval, has become an important topic in data mining. Text classification is one technology that addresses this problem, and a series of solutions has been proposed. As an artificial-intelligence technique for text information processing, it is mainly used in information filtering, information retrieval, database applications, and digital library construction. Text classification assigns each document in a large collection to one or more categories, so that each category represents a distinct theme.
At present, text classification mainly uses the statistics-based vector space model, which involves text preprocessing, Kazakh word stemming, feature selection, feature weighting, the classification algorithm, and classification performance evaluation. Feature weighting is an important issue in text classification based on the vector space model because it directly affects the final classification results. The basic idea of TFIDF, one feature weighting method, is to take the word frequency in the text as the TF weight and then multiply it by the IDF function to adjust the weight. In this work, feature selection combines word frequency statistics with linguistic information, and the weight calculation considers not only the word frequency of each feature term but also its concentration and dispersion.
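The TF-times-IDF weighting described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `tfidf_vectors`, the raw-count TF, and the `log(N / df)` form of IDF are assumptions, and the concentration and dispersion adjustments mentioned in the abstract are not modeled because their formulas are not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists, assumed to be already stemmed and
    stop-word filtered. Both the raw-count TF and the log(N / df) IDF
    are illustrative choices, not taken from the thesis.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors
```

With this IDF, a term that occurs in every document receives weight 0, which is the intended effect: such a term does not help separate categories.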
Using this information, a weight vector over the feature terms is built for every text in the training set and the test set, so that all training texts form a multi-dimensional vector space, and the K-nearest neighbor method is then used to obtain the classification result for each test text. This method effectively improves the accuracy of Kazakh text classification and achieved good results. This thesis studies Kazakh text classification (mainly of Kazakh newspaper text) based on the K-nearest neighbor method, describes the relevant technologies and text classification algorithms, and applies basic software engineering principles to design and implement a Kazakh text classification system. The system is divided into the following modules: (1) a Kazakh text preprocessing module, which mainly deals with Kazakh word tokenization, stemming, and stop word filtering; (2) a frequency statistics module, which counts how often feature words appear in Kazakh texts, as required by the K-nearest neighbor method and the feature selection algorithm; (3) a feature selection module; (4) a weight calculation module, which computes the feature weights; (5) a classification module, which implements the K-nearest neighbor algorithm for Kazakh text classification; (6) a classifier evaluation module, which evaluates the classifier in terms of recall, precision, and other measures. A certain amount of software testing was also carried out.
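The classification and evaluation modules, items (5) and (6) above, might look roughly like the sketch below. It assumes documents are already represented as sparse `{term: weight}` dicts and uses cosine similarity with simple majority voting; the abstract does not state which similarity measure or voting rule the thesis uses, so both are assumptions, as are the names `cosine`, `knn_classify`, and `precision_recall`.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3):
    """Label a query vector by majority vote among its k most similar
    training vectors. training is a list of (vector, label) pairs."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall for one category over paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The value of `k` is a tuning parameter; the default of 3 here is arbitrary and would normally be chosen by validation on held-out texts.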
Keywords/Search Tags: Kazakh Text, Text Categorization, K-nearest Neighbor Algorithm, Feature Selection, The Weight Calculation