
Research On Rough Sets Based Chinese Language Modeling And Its Applications

Posted on: 2004-09-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q C Chen
Full Text: PDF
GTID: 1118360155476373
Subject: Computer application technology
Abstract/Summary:
Natural language modeling provides the foundation for processing and applying language information by computer. Although statistical language models have been applied successfully in natural language processing (NLP), the field still faces the problem of making linguistic knowledge mining and redundant-information pruning more efficient and accurate. With their advantages in handling redundant, contradictory, and vague information, rough sets techniques have been applied successfully in Knowledge Discovery in Databases (KDD). By introducing rough sets techniques, this paper concentrates on methods and models for mining linguistic knowledge from large-scale corpora, and on applying the constructed models in natural language processing. The paper is composed of four parts.

First, to address problems in Chinese pinyin-to-character (PTC) conversion, this paper provides a method for structuralizing textual information. On this basis, a linguistic knowledge discovery model is constructed for mining Chinese PTC rules from a large-scale corpus. An implementation method for constructing the model is also provided, and the model's performance is evaluated by experiments. Because the rule base is reduced according to the characteristics of the application, the mined rules are application-dependent; nevertheless, since all rules are mined automatically, the model can still be ported easily to other NLP applications.

Second, to handle long-distance constraints efficiently, the combination of rough rules with classical statistical language models is studied. Considering the characteristics of storage-limited applications, rough rules are first combined with character-based n-gram models, and their performance is evaluated experimentally. Then, under the maximum entropy (ME) framework, rough rules are combined with word-based trigrams for general applications. The experimental results show a good performance gain from this combination.
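As a rough illustration of the rough-sets machinery underlying the PTC rule mining described above, the sketch below computes lower and upper approximations over a toy decision table mapping conversion contexts to chosen characters. The attribute names and all rows are invented for illustration and are not taken from the dissertation.

```python
from collections import defaultdict

# Toy decision table in the spirit of PTC rule mining: each row maps
# condition attributes (preceding character, pinyin syllable to convert)
# to a decision (the chosen Chinese character). Rows are illustrative only.
table = [
    ({"prev": "打", "pinyin": "ji"}, "击"),
    ({"prev": "打", "pinyin": "ji"}, "机"),  # conflicts with row 0 -> boundary region
    ({"prev": "飞", "pinyin": "ji"}, "机"),
    ({"prev": "游", "pinyin": "ji"}, "击"),
]

def indiscernibility(table):
    """Partition row indices into blocks with identical condition attributes."""
    blocks = defaultdict(set)
    for i, (cond, _) in enumerate(table):
        blocks[tuple(sorted(cond.items()))].add(i)
    return list(blocks.values())

def approximations(table, decision):
    """Rough-set lower/upper approximation of the rows carrying `decision`."""
    target = {i for i, (_, d) in enumerate(table) if d == decision}
    lower, upper = set(), set()
    for block in indiscernibility(table):
        if block <= target:
            lower |= block   # block certainly belongs to the class
        if block & target:
            upper |= block   # block possibly belongs to the class
    return lower, upper

lower, upper = approximations(table, "机")
# Rows in `lower` yield certain (deterministic) conversion rules; rows in
# `upper - lower` fall in the boundary region, where a statistical model
# must resolve the remaining ambiguity.
```

The split between certain rules and the boundary region is one way to see why combining mined rough rules with an n-gram or ME model is natural: the rules decide the unambiguous contexts, and the statistical model handles the rest.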
Third, a word sense quantization model is the foundation of word sense disambiguation and sense similarity computation. In this paper, a word space is first constructed from corpus statistics; the quantization model is then built by mapping feature words into this word space. To reduce the time complexity of sense similarity computation, an attribute reduction algorithm is introduced to reduce the word space and select axis words. This part also provides a method for discretizing attribute values. Finally, the model is evaluated automatically by constructing pseudowords, and the results show good performance on the word sense disambiguation task.

Fourth, the popularization of the Internet has increased the demand for efficient and accurate methods of information acquisition. As an important component of information acquisition, high-quality automatic summarization systems are becoming ever more urgent. To meet this requirement, an adapted Dotplot method based on the word sense quantization model is first provided to address subtopic segmentation. Then a multi-knowledge-sources-integrating (MKSI) model is constructed, which combines the results of rhetorical structure analysis, text content structure analysis, and subtopic segmentation, and provides clues for extracting abstract sentences. Finally, an automatic evaluation system is provided for measuring the performance of a text summarization system; based on it, the model parameters are optimized by a genetic algorithm. Our experimental results show that the MKSI model preserves the logical structure of the original text and generates abstracts of good quality. In addition, the evaluation also shows that model performance does not depend strongly on the scale of the training corpus, which is very helpful for the auto-summarization field, since training corpora must be constructed manually.
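The word-space idea behind the sense quantization model in Part 3 can be sketched as representing each word by its co-occurrence counts with a small set of axis words and comparing words by cosine similarity. The tiny English corpus and the axis-word list below are assumptions made for illustration; the dissertation builds the space from large-scale corpus statistics and selects axis words by rough-set attribute reduction.

```python
import math

# Illustrative corpus and axis words (not from the dissertation).
corpus = [
    "the bank approved the loan",
    "the bank raised interest rates",
    "the river bank was muddy",
    "fish swim near the river bank",
]
axis_words = ["loan", "interest", "river", "fish"]

def quantize(word):
    """Co-occurrence vector of `word` with the axis words, counted per sentence."""
    vec = [0] * len(axis_words)
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            for j, axis in enumerate(axis_words):
                vec[j] += tokens.count(axis)
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Words from the financial and riverside contexts end up with
# orthogonal vectors in this toy space:
sim = cosine(quantize("loan"), quantize("river"))
```

Attribute reduction enters exactly here: dropping axis words that do not change the indiscernibility of the vectors shrinks the space, which is what lowers the cost of each similarity computation.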
Keywords/Search Tags: Rough Sets, Linguistic Knowledge Discovery, Rough Rules Based Statistical Language Model, Word Sense Quantization Model, Automatic Text Summarization