
Chinese Text Automatic Classification

Posted on: 2009-12-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Z Hao
Full Text: PDF
GTID: 1118360245463195
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Automatic text categorization is a core technology of automated information management, and research on it has attracted considerable attention. In the late 1980s, most automatic text classification systems in use were knowledge-based artificial intelligence systems. During this period, the classification rules used to determine a document's type were mainly written and maintained by hand. Constructing an effective text classification system therefore required a great deal of manpower and resources and high research and development costs, and such systems were difficult to adapt to different domains, so porting them was expensive. The situation changed dramatically in the 1990s, when statistical machine learning methods emerged and replaced rule-based systems. Text classification methods based on statistical machine learning overcome the shortcomings of rule-based systems: the classification model is learned automatically from training sets without manual intervention, and the efficiency and accuracy of classification have improved greatly. These methods have since been applied successfully in many areas.

This paper addresses a specific practical problem, the classification of texts from the mayor's public access line project, and the study is based on real data sets. The mayor's public access line project is a "highway" of service for the people and a commitment to understanding society and public opinion. It reflects social dynamics and provides important information. On the one hand, the massive text data from the mayor's public access line project enlarged the researchers' understanding of the data itself, raising it from perceptual knowledge to rational knowledge. Even minor matters raised by the public reflect how major policies are being implemented.
On the other hand, the data cover a wide range of fields and all aspects of people's daily lives; they reflect the state of the community as a whole and serve as a barometer of policy implementation. Mining the rules contained in the text and abstracting the regularities that can inform government decision-making in the information society is of great practical significance and is also extremely challenging. In recent years, with the rapid development of the mayor's public telephone and the annual increase in the volume of complaints, the departments handling the complaints have faced unprecedented difficulties in assigning complaints to the responsible units. Manual classification can no longer meet actual needs, and there is an urgent need for automatic classification of complaint texts with satisfactory accuracy.

In October 2005, I joined the machine learning research group at Northeast Normal University (at its inception). In June 2007 the group assisted the Changchun City mayor's public telephone office in a comprehensive upgrade of its system software and hardware, and developed the "telephone admissibility system", "telephone hand-down system", "automatic unit classification machine", "automatic category machine", "statistical analysis and prediction system", "website for statistical analysis and prediction information", "large-screen information presentation system" and other systems.

To speed up software upgrades of the mayor's public access line system, I developed a comprehensive management information system control platform. Based on the massive text data of the Changchun City mayor's public telephone and using statistical theory, a scientific mayor's public telephone database was established, and the organic links among the telephone complaints were identified.
All telephone complaints are cross-classified by industry, by unit, and by time. On this basis, a large amount of statistical analysis software has been built to raise the analysis to a theoretical level and to mine the periodicity and trends hidden in the data, so as to predict complaint data.

This paper starts from the Chinese text categorization of the mayor's public telephone and conducts an in-depth study, which includes the following:

1. The paper introduces a number of research groups working on automatic text classification at home and abroad, the development of automatic text classification, its basic concepts and related issues, several classifiers used in text classification, and evaluation methodologies. It briefly describes the business processes around the Changchun City mayor's public telephone data, the characteristics of the data itself, the difficulties faced in the business process, and the urgency of automatically distributing complaint texts to units or industries. The paper studies the following specific characteristics of the mayor's public telephone text data:

(1) The documents are short. The curve relating document length to the number of Chinese characters is shown in Figure 1. The average document length is 65 characters, and the standard deviation is 34 characters.

(2) The training documents are unevenly distributed (skewed) across complaint categories; that is, the numbers of samples per category may differ by orders of magnitude. The Public Security Bureau has the largest number of samples, nearly one ninth of all documents, while the Monitor has the fewest, with only 11 samples. Skewness of the data set is an important cause of imperfect classification. When data are skewed, the samples cannot accurately reflect the distribution over the entire data space.
With skewed data, a classifier is easily dominated by the major categories and ignores the minor ones.

(3) The documents are written mainly in everyday language, and most of the problems reported concern the chores of daily life. They contain many place names, personal names, names of units and buildings, polite platitudes, dialect, and sayings, as well as many typos and unconventional grammar and word order, and they cover almost every aspect of people's daily life.

2. The paper introduces the machine encoding standards commonly used for Chinese characters and briefly introduces the Difficult Chinese Fast Facts software developed by the author, which completely resolves the problem of inputting difficult characters. It then introduces several commonly used Chinese word segmentation methods, the representation of text in the vector space model, and several commonly used term weighting schemes. To handle the large number of unit, place, and building names, a method for acquiring unknown words is put forward. To handle the large number of clichés and platitudes, which carry no useful information, the paper presents a method for extracting stereotyped words to reduce the impact of noise words on text classification.

3. The paper briefly introduces several common feature selection and feature extraction methods, such as the removal of stop words and low-frequency words, mutual information, odds ratio, chi-square statistics, latent semantic indexing, and term clustering. The extraction of stop words focuses on the telephone text data from the mayor's public access line system. A criterion for identifying stop words is given from a statistical perspective.
That is, a stop word satisfies two conditions: (1) it has a high document frequency; and (2) it has a weak statistical correlation with all categories. Using a 2×p contingency table, the statistical correlation between each word and all categories is calculated, and the words are ordered by a word-ordering formula. Experiments compare classifiers with and without stop-word removal. The results show that the stop-word extraction method is very effective. For instance, a stop list containing only 500 words reduces the total word count by 43% and increases the classification accuracy by 5 percentage points. Stop-word removal clearly has a strong effect on the classification accuracy; see Table 1, Table 2 and Table 3, in which Rmicro is the micro-averaged recall of the Naive Bayesian classifier after the corresponding stop words are deleted, and "percentage" is the share of the stop words' frequency in the total word frequency. As the tables show, whether or not the stop words are deleted has a large influence on the classification accuracy. Furthermore, the relatively low macro-averaged values show that the large number of categories and the skewness of the data have a strong negative impact on the classifier.

Removing low-frequency words is a common feature selection method for reducing the dimension of the vector space.
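The stop-word criterion above (high document frequency, weak association with the categories) can be illustrated with a minimal sketch. The abstract does not give the word-ordering formula, so the version below is an assumption: it scores each word with a Pearson chi-square statistic on its 2×p table (word present or absent, by category) and ranks sufficiently frequent words by ascending score. The function names, the `min_doc_ratio` cutoff, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square_2xp(present, totals):
    """Pearson chi-square for a 2 x p table.

    present[i]: documents of category i that contain the word
    totals[i]:  all documents of category i
    """
    n = sum(totals)
    n_present = sum(present)
    n_absent = n - n_present
    stat = 0.0
    for a, t in zip(present, totals):
        # Row 1: word present; row 2: word absent.
        for observed, row_total in ((a, n_present), (t - a, n_absent)):
            expected = row_total * t / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def stop_word_candidates(docs, labels, min_doc_ratio=0.5, top_k=10):
    """Return frequent words with the weakest association to the categories.

    Assumed ordering rule: ascending chi-square among words whose
    document frequency ratio is at least min_doc_ratio.
    """
    categories = sorted(set(labels))
    totals = [labels.count(c) for c in categories]
    doc_freq = defaultdict(Counter)  # word -> category -> document count
    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            doc_freq[word][label] += 1
    scored = []
    for word, per_cat in doc_freq.items():
        present = [per_cat[c] for c in categories]
        if sum(present) / len(docs) >= min_doc_ratio:
            scored.append((chi_square_2xp(present, totals), word))
    scored.sort()
    return [word for _, word in scored[:top_k]]


# Toy, already-segmented documents (hypothetical data).
docs = [
    ["please", "water", "pipe"],
    ["please", "water", "leak"],
    ["please", "noise", "night"],
    ["please", "noise", "bar"],
]
labels = ["water", "water", "noise", "noise"]
print(stop_word_candidates(docs, labels, min_doc_ratio=0.5, top_k=1))  # ['please']
```

In this toy set the polite word "please" appears in every document of every category, so its chi-square score is zero and it is ranked first, while "water" and "noise" are equally frequent within their categories but strongly category-specific.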
A document frequency threshold is often used to delete low-frequency words. However, the threshold m is usually chosen by experience and is usually relatively small. Clearly, the choice of m has nothing to do with the classification categories. At the same time, deleting a large number of low-frequency words affects categories differently: the proportion of words deleted from small categories (categories that contain fewer samples) is often much higher than that from large categories. Because the mayor's public telephone data sets are seriously skewed, it is more reasonable to delete a certain proportion of low-frequency words from each category than to apply a fixed document frequency threshold.

An intuitive idea comes from the Naive Bayesian classifier. Consider each conditional probability P(Wj|Ci) in its decomposition: it is the probability of the word Wj in the category Ci, and it can be estimated from the frequency of Wj in Ci and the total word frequency of Ci. A low-frequency word has a very small probability in a large category but a larger one in a minor category, so it is possible to delete the terms with smaller conditional probability P(Wj|Ci) and preserve those with larger probability. The aim is to delete more low-frequency words in large categories and fewer in small categories. Tests were run on the mayor's public access line project data sets; the effect of different cumulative frequency percentages on classification is shown in Table 4. In Table 4, 'C.percentage frequence' denotes the cumulative percentage frequency, and 'RWN' denotes the number of retained words.
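One way to read the per-category pruning idea above is sketched below. It is an assumption, not necessarily the author's exact procedure: within each category, words are sorted by frequency (equivalently by the estimate P(Wj|Ci) = count of Wj in Ci divided by the total word count of Ci), and words are kept until their cumulative share of the category's word frequency reaches a chosen ratio. The function name, the `ratio` parameter, and the toy data are hypothetical.

```python
from collections import Counter


def keep_by_cumulative_share(docs, labels, ratio=0.7):
    """Keep, per category, the most frequent words up to a cumulative share.

    A word is kept while the cumulative frequency share of the words
    already kept in that category is still below `ratio`. The final
    vocabulary is the union over categories.
    """
    per_category = {}
    for tokens, label in zip(docs, labels):
        per_category.setdefault(label, Counter()).update(tokens)

    kept = set()
    for counts in per_category.values():
        total = sum(counts.values())
        cumulative = 0
        # Descending frequency; ties broken alphabetically for determinism.
        for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if cumulative / total >= ratio:
                break
            kept.add(word)
            cumulative += freq
    return kept


# Toy data (hypothetical): one large category and one small category.
docs = [["a"] * 6 + ["b"] * 3 + ["c"], ["x", "x", "y"]]
labels = ["big", "small"]
print(sorted(keep_by_cumulative_share(docs, labels, ratio=0.7)))  # ['a', 'b', 'x', 'y']
```

The rare word "c" is dropped from the large category, while the small category keeps both of its words, which matches the stated aim of deleting more low-frequency words in large categories than in small ones.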
Practice shows that a selection ratio of 70% is reasonable for the industry data: the dimension of the vector space model is reduced from 13,909 to 3,445, and the accuracy increases by one percentage point. This also shows that the data contain a great deal of redundancy and that well-chosen feature selection can effectively reduce the dimension of the vector space.

For word selection, the paper introduces a feature selection method that combines chi-square statistics and odds ratio, which effectively selects words that are positively correlated with a category.

In text classification by unit, using a feature-weighted Naive Bayesian classifier increases the accuracy by 2 percentage points. However, the model parameters are difficult to choose, and a better model is still being sought.

4. Extensive tests on the mayor's public access line project data showed that the Naive Bayesian classifier has its own characteristics: the algorithm is simple and fast and is well suited to classifying data under real-time requirements, but it is less precise. To improve it, I combine feature selection and feature weighting and develop a feature-weighted Naive Bayesian classifier model based on multiple hypothesis testing.

Based on the actual characteristics of the text data in the mayor's public access line project, a hierarchical text classification model based on geographical information is put forward. Its test results are compared with those of the Naive Bayesian classifier. The results show that the model is practical and has a low misclassification rate.
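Returning to the combined chi-square and odds-ratio selection mentioned above: the abstract does not say how the two statistics are combined, so the sketch below makes an assumption. The odds ratio acts as a sign filter (keep only words whose smoothed log odds ratio is positive, meaning the word is more likely inside the category than outside it), and the chi-square statistic ranks the words that pass. All function names, the smoothing constant, and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for one word and one category.

    a: documents in the category containing the word
    b: documents outside the category containing the word
    c: documents in the category without the word
    d: documents outside the category without the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def log_odds_ratio(a, b, c, d, smooth=0.5):
    """Smoothed log odds ratio; positive means positive association."""
    return math.log(((a + smooth) * (d + smooth)) / ((b + smooth) * (c + smooth)))


def select_positive_features(tables, top_k):
    """tables maps word -> (a, b, c, d) for a single category."""
    scored = [
        (chi_square_2x2(*counts), word)
        for word, counts in tables.items()
        if log_odds_ratio(*counts) > 0
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:top_k]]


# Hypothetical counts for a "water supply" category with 10 documents
# inside the category and 10 outside it.
tables = {
    "leak": (8, 2, 2, 8),   # mostly inside the category
    "the": (5, 5, 5, 5),    # evenly spread
    "noise": (1, 9, 9, 1),  # mostly outside the category
}
print(select_positive_features(tables, top_k=2))  # ['leak']
```

Chi-square alone would rank "noise" highest here (12.8 against 7.2 for "leak"), because it measures the strength of association regardless of direction; the odds-ratio filter removes it because the association is negative.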
Mayor's public telephone classification is a multi-class problem. The test results show that only the accuracy for industry level one is low while the others are high, so the industry level-one classification is unreasonable. In addition, the cumulative error has a large impact: accuracy falls from 73.21% at level one to 43.31% at level five. In Table 7, 'No' denotes the index of the industry; 'Actual' denotes the number of test samples; 'Correct' denotes the number of samples assigned to the correct class by the classifier; and 'Conditional correct rate' denotes the micro-averaged recall given that industry level one was classified correctly. Further analysis of the industry data indicates that misclassified data are distributed extremely unevenly across industries and that the design of the categories is unreasonable in some respects. This unevenness suggests clustering similar texts in the industry data to make the category settings more reasonable. Tests on the industry data showed that the mayor's public telephone text data are standardized, that testing on 10,000 samples is sufficient, and that the Naive Bayesian classifier has good stability.

Based on the practical need to distribute complaint forms, I put forward an automatic unit classifier built from a text classification model that combines several classifiers. Automatic classification greatly reduces the workload of complaint intake, and the number of complaints handled has increased significantly. For unit categories, the direct distribution rate is 80.76% and the accuracy rate is 81.04%. In Table 8, 'DDR' denotes the direct distribution rate and 'DAR' denotes the distribution accuracy rate.

5. Software development today is already moving toward large-scale integration, and well-known corporations do not hesitate to spend huge sums to establish their own dedicated domain platforms, in order to extend the software life cycle and lay the foundation for secondary development.
In the long term, a team without its own large-scale software development kit can sustain early-stage development but finds it hard to pursue deeper study. This paper presents the idea of platform-oriented programming: use the platform's rapid module structure, linked into an organic whole by an index tree; regard each module as an object; and use the platform's language to define the object's events, properties, and methods and to implement its functions. The author developed a comprehensive management information system control platform, explained in detail the key steps in developing the platform, and pointed out the following differences between platform-oriented programming and object-oriented programming:

(1) A platform system already has an overall framework, and its security and access control systems are fully defined. The software system is highly integrated. A large number of modules are predefined; many modules are powerful in their own right and can be used alone, but the user's design freedom falls significantly. Because the basic pattern of platform-oriented programming is already set, users need to think less about the overall structure, and programming is fast.

Object-oriented programming requires considering the implementation of the whole project first and then how to combine the various objects. Individual objects have relatively simple functions, and there is no overall framework. Users must design the overall framework themselves, without ready-made security and access control systems. They must design each part separately, with a low level of integration, and rely on their own accumulated programming tools and research experience. In object-oriented programming, objects can be composed organically to design a software system with a unique and aesthetic style.
Design is flexible but programming is slow, and training and maintenance costs increase.

(2) A platform is domain-oriented from the start of its development: it is not all-purpose software but software for a specific domain. Frequently used functions are designed as components or modules, laying a solid foundation for secondary development. From the perspective of sustained development, platform-oriented programming is the trend for software systems and an effective way to extend the life cycle of software tools.

Each object in object-oriented programming is all-purpose, with no domain orientation, and can be applied to any field. The development of each software system starts from zero, so there is little reuse, and redundant low-level development wastes tremendous resources. In the long run, application software systems built with object-oriented programming have short life cycles and high maintenance costs and reserve few interfaces for secondary development, so they will need to be replaced by reusable software with a high degree of professional integration.

The Integrated Management Information System Control Platform, which combines development and application, has its own programming language. I use the internal control language to control modules during execution and, further, to control all components and modules in the process. All functional modules are organized by an index tree. In constructing the mathematical model of the system platform, the paper points out that any management software can be divided by function into two parts: the event transactions that all management software has in common, and its specific business. I design the event transactions as modules, construct a software programming environment for the specific business, and implement the specific functions by program. Many systems claim to be platforms, but they have no programming language, and their further development is difficult.
Secondary development on such systems is hard, so their functions are extremely restricted, and they must be extended in an external development environment. This paper points out that data and control processes can be ordered, so a tree can be used to express functions, and the function list can be organized by an index tree. All the data are represented in a relational database, so the work consists of manipulating the database. Each node of the index tree corresponds to a function, a section of program code, and a unique identifier. From the object-oriented perspective, each node is an object with its own properties and methods.

Human-machine interface design is very time-consuming. The paper presents six human-machine interfaces whose default functions meet most needs. The properties and methods of these interfaces can also be defined, so the functions and the software life cycle can be extended. The paper introduces the design principles of the platform system in detail and shows the core key technologies. Drawing on the strengths of many software packages, it briefly introduces 10 kinds of instruction sets and their specific uses. A detailed algorithm is presented for the platform, and the instruction set has been optimized through practice. Finally, the security of the platform system is presented; the audit trail and access control procedures run through the operation from start to finish.

The mayor's public telephone system is built on this platform. Using the platform's many capabilities reduces the cost of software development and lays a good foundation for future software upgrades and maintenance.

As more members join the team and the toolbox gradually improves, theoretical research and software development are gradually moving into the fast lane. Practical work raises many practical problems. Although we have consulted a great deal of literature, few of the algorithms can really be applied in practical work.
We must improve the algorithms in light of specific practical tasks and put forward our own methods.
Keywords/Search Tags: text classification, mayor's public telephone access project, Integrated Management Information System Control Platform