Automatic Classification Research On Chinese Web Document Orientation

Posted on: 2004-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: R Hu
Full Text: PDF
GTID: 2168360095953124
Subject: Computer applications
Abstract/Summary:
Since the 1990s, as the volume of information available on the Internet has continued to grow, there has been an increasing demand for tools that help people find, filter, and manage these resources more efficiently. Text categorization, the assignment of free-text documents to one or more predefined categories based on their content, is an important component of many information management tasks. Because Chinese text classification depends on features specific to the context and semantics of the Chinese language, it has become a distinct research field with its own difficulties and controversies, among which Chinese text orientation analysis is especially new and challenging.

With the development of modern network techniques, the network has become an essential tool for communication. To help maintain network security, we started a laboratory project on orientation text classifiers for Chinese Web documents. With previous classifiers, classification is very time-consuming and costly, which limits their applicability. Our classifiers aim to meet the requirements of real-time operation and high accuracy.

In this thesis, we give a survey of the state of the art in Chinese text categorization, from the building of the corpus, the word segmentation system for Chinese Web documents, the selection of index terms, and the design of term weights to the structure of SCUSCTC (SCU Smart Chinese Text Classifier) and its implementation in Java.
Finally, we give a thorough analysis of the experimental results and identify the main advantages and features of SCUSCTC as follows: 1) intelligent and accurate classification; 2) high speed and real-time operation; 3) XML as a standard and universal output format.

The main contributions of this thesis are: 1) research on the methodology and technology of Web text classification in modern networks, together with a practicable system prototype; 2) a collection of related papers and development documents for further research; 3) a practical study of Web text classification on a gateway; 4) the design of performance requirements and the evaluation of related parameters for Web text classification; 5) the implementation of a real-time Web text classification system (SCUSCTC) that meets certain speed and accuracy requirements.

In further research, the following issues must be considered: 1) standardizing the corpus; 2) improving the accuracy of the Chinese word segmentation system, handling words with multiple meanings, and recognizing words that do not appear in the dictionary; 3) performing semantic analysis; 4) dynamically updating the training sets with user feedback; 5) quantitatively analyzing how different factors influence system performance, and using an appropriate model to compare and evaluate Web text classification systems; 6) natural language processing; 7) detecting disguised sensitive words.

This thesis is divided into seven chapters, with Chapter 1 as the introduction. In Chapter 2 we formally define text categorization (TC) and introduce performance measures and thresholding strategies for TC. Chapter 3 describes the steps needed to transform raw text into a representation suitable for the classification task. Feature selection methods are surveyed in Chapter 4. In Chapter 5 we describe four methods that have been successfully applied to text categorization: kNN, Naive Bayes, decision trees, and SVMs.
In Chapter 6 we describe our own work using the "Korean and World Cup Corpus", while Chapter 7 concludes the whole thesis and discusses open issues and possible avenues of further research for TC.
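
As a rough illustration of the pipeline summarized above (segmented text, a vector-space representation, then a classifier such as kNN), the following Java sketch classifies a token list by cosine similarity to labeled examples. It is not taken from SCUSCTC: the class name `KnnTextClassifierSketch`, its methods, the raw term-frequency weighting, and the similarity-weighted vote are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not SCUSCTC): kNN over term-frequency vectors. */
class KnnTextClassifierSketch {

    /** A labeled training document stored as a sparse term-frequency vector. */
    static final class Example {
        final Map<String, Integer> vector;
        final String label;

        Example(Map<String, Integer> vector, String label) {
            this.vector = vector;
            this.label = label;
        }
    }

    private final List<Example> training = new ArrayList<>();

    /** Builds a term-frequency vector from tokens that are already segmented. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    /** Adds one labeled, segmented document to the training set. */
    void add(List<String> tokens, String label) {
        training.add(new Example(termFrequencies(tokens), label));
    }

    /**
     * Returns the label with the largest summed similarity among the k
     * nearest training documents, or null if there is no training data.
     */
    String classify(List<String> tokens, int k) {
        Map<String, Integer> query = termFrequencies(tokens);
        List<Example> sorted = new ArrayList<>(training);
        // Sort by descending similarity to the query.
        sorted.sort(Comparator.comparingDouble((Example ex) -> -cosine(query, ex.vector)));

        Map<String, Double> votes = new HashMap<>();
        for (Example ex : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(ex.label, cosine(query, ex.vector), Double::sum);
        }

        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

A caller would add labeled token lists with `add` and then call `classify(tokens, k)`. A real system would presumably replace the raw counts with a weighting scheme and apply feature selection before this step, as the chapter outline suggests.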
Keywords/Search Tags: Chinese Words Divided Syncopation, Maximum Method (MM), Vector Space Model (VSM), Latent Semantic Indexing (LSI), Feature Selection, Support Vector Machine, Decision Trees (DT), C4.5, k-Nearest Neighbor (kNN), Chinese Text Classification (CTC)