Research on Chinese Word Segmentation and Text Classification in Distributed Text Knowledge Management

Posted on: 2009-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z G Li
GTID: 1118360272975332
Subject: Computer software and theory
Abstract/Summary:
We are in the era of a knowledge-based economy. Traditional factors such as land, natural resources, capital and labour have been replaced by knowledge as the major force promoting social progress and development. Management models, theories and techniques must adapt to the knowledge-based economy. To meet this challenge, this dissertation studies Chinese word segmentation and text classification, and also presents a distributed knowledge management architecture. Specifically, the main contributions are as follows:

(1) An adaptive Chinese word segmentation algorithm is presented. New word recognition and ambiguity resolution are the key problems in Chinese word segmentation. The results of traditional dictionary-based matching algorithms depend largely on how representative the dictionary is, so they cannot recognize new words effectively, especially in specialized domains. The algorithm in this dissertation is based on a 2-gram statistical model and meets application requirements in both accuracy and efficiency. Long sentences and long terms are handled with a divide-and-conquer approach, while partial probability and overall probability are used to identify new words.

(2) A classification algorithm based on proximal support vector machines (PSVM) is proposed. The main difference between PSVM and the standard SVM lies in the constraints of the optimization problem: SVM treats classification as a quadratic programming problem with linear inequality constraints, whereas PSVM treats it as a quadratic programming problem with linear equality constraints only. This dissertation describes a new PSVM training algorithm based on dimensionality reduction, which trains faster and requires less memory. Experiments on several data sets show that, in time-sensitive settings, the new algorithm performs better than SVM at the cost of a modest loss of accuracy.

(3) A new ontology-based hierarchical text classification algorithm is presented. Text classification generally refers to flat text classification, whereas hierarchical text classification focuses on classification across multiple classes. Text knowledge management systems usually serve specific fields and involve a degree of ambiguity, so they exhibit multi-class characteristics. Users demand both text relevance and multiple granularities of concepts, so better means of organizing text hierarchically are needed. Multiple concept granularities are implemented in hierarchical classification by using the knowledge ontology and controlled keywords. The algorithm can also handle flat classification.

(4) A distributed knowledge management model based on Super-P2P is presented to address the problems of centralized knowledge management. To support the development of distributed organizations, effective distributed knowledge management has become a trend in knowledge management.

Based on the above research, eKnow, a suite of Super-P2P-based text knowledge management software with integrated workflow, has been developed with the support of Shanghai Pudong SD Funds and Baosight Co. Ltd. Its design ideas, system architecture and technical framework are summarized. The software has been used in several cases with substantial economic benefits.
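
As an illustration of contribution (1), the following is a minimal sketch of how a 2-gram model can resolve a segmentation ambiguity. It is not the dissertation's algorithm: the function names, the toy corpus, the add-one smoothing and the `max_len` limit are all assumptions made for this example, and the new-word recognition and divide-and-conquer steps described in the abstract are not implemented.

```python
import math
from collections import defaultdict


def train_bigram(corpus):
    """Count unigrams and bigrams from already segmented sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in corpus:
        prev = "<s>"  # sentence-start marker
        unigram[prev] += 1
        for word in sentence:
            unigram[word] += 1
            bigram[(prev, word)] += 1
            prev = word
    return unigram, bigram


def segment(text, unigram, bigram, max_len=4):
    """Return the segmentation with the highest smoothed bigram score."""
    if not text:
        return []
    vocab_size = len(unigram)

    def logp(prev, word):
        # Add-one smoothing (a crude assumption, chosen only for brevity).
        return math.log((bigram.get((prev, word), 0) + 1)
                        / (unigram.get(prev, 0) + vocab_size))

    # best[i][w] = (score, (start, previous word)) for the best path
    # that covers text[:i] and ends with word w.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            # Multi-character candidates must be known words;
            # single characters are always allowed as a fallback.
            if len(word) > 1 and word not in unigram:
                continue
            for prev, (score, _) in best[start].items():
                cand = score + logp(prev, word)
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, (start, prev))

    # Backtrack from the best-scoring final word.
    word = max(best[-1], key=lambda w: best[-1][w][0])
    pos, words = len(text), []
    while pos > 0:
        words.append(word)
        _, (pos, word) = best[pos][word]
    return words[::-1]


# Toy training data (hypothetical, for illustration only).
corpus = [["研究", "生命", "起源"], ["研究生", "学习"],
          ["研究", "生命"], ["生命", "起源"]]
unigram, bigram = train_bigram(corpus)
print(" / ".join(segment("研究生命起源", unigram, bigram)))
```

The string 研究生命起源 can be read as 研究 / 生命 / 起源 or as 研究生 / 命 / 起源; a pure longest-match dictionary lookup would start with 研究生, whereas the bigram scores let the context decide.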
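
To make the equality-constrained formulation in contribution (2) concrete, here is a small sketch of the standard linear PSVM of Fung and Mangasarian, not the dissertation's own training algorithm. With data matrix A, diagonal label matrix D, a vector of ones e and slack y, it minimizes (nu/2)*||y||^2 + (1/2)*(w.w + gamma^2) subject to D(Aw - e*gamma) + y = e. Because the constraint is an equality, the solution reduces to one linear system, [w; gamma] = (I/nu + E^T E)^(-1) E^T D e with E = [A, -e]. The function names, the value of `nu` and the toy data are assumptions for this example.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def train_psvm(points, labels, nu=10.0):
    """Return (w, gamma) for the linear PSVM; labels must be +1 or -1."""
    # E = [A, -e]: each sample with -1 appended for the offset gamma.
    E = [list(p) + [-1.0] for p in points]
    k = len(E[0])
    # M = I/nu + E^T E is symmetric positive definite, so it is solvable.
    M = [[sum(row[i] * row[j] for row in E) + (1.0 / nu if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    # Right-hand side E^T D e; D e is simply the label vector.
    rhs = [sum(row[i] * y for row, y in zip(E, labels)) for i in range(k)]
    z = solve(M, rhs)
    return z[:-1], z[-1]


def predict(w, gamma, x):
    """Classify x by the sign of x.w - gamma."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - gamma
    return 1 if score >= 0 else -1


# Toy, linearly separable data (hypothetical, for illustration only).
points = [(2, 2), (3, 3), (2, 3), (0, 0), (-1, 0), (0, -1)]
labels = [1, 1, 1, -1, -1, -1]
w, gamma = train_psvm(points, labels)
print([predict(w, gamma, p) for p in points])
```

The linear system has only as many unknowns as there are features plus one, regardless of the number of training samples, which is why PSVM training is cheap compared with solving the inequality-constrained SVM problem.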
Keywords/Search Tags: Knowledge Management, Chinese Word Segmentation, Text Classification, Hierarchical Text Classification, Ontology