The Research On A Few Of Key Issues In Chinese Information Processing

Posted on: 2005-07-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J H Wang
Full Text: PDF
GTID: 1118360125967577
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of science and technology and the continuous growth of all kinds of resources, information processing (IP) has become one of the most important research fields in the pursuit of efficiency. IP covers feature selection, information retrieval (IR), information extraction, natural language processing (NLP), automatic clustering and classification, automatic summarization, automatic annotation and topic identification, information structure analysis, and text generation. Among these, feature selection is the groundwork: it is the foundation and precondition for the others, which in turn retrieve useful information and mine new knowledge efficiently and accurately and speed up access to large amounts of useful information.

To meet these requirements, this dissertation studies several key issues in statistics-based information processing and makes the following contributions:

1. An improved statistics-based feature selection algorithm. Feature selection is the groundwork of information processing research, and many algorithms have been proposed so far. Because compiling a thesaurus takes a large amount of time and effort, most existing algorithms are statistical rather than thesaurus-based.
However, most of them fall short in the following respects: (1) their evaluation strategy relies either on traditional TF/IDF or only on the distribution of a feature between classes, and does not adequately account for the distribution both between classes and within classes; (2) the efficiency of the existing N-gram method needs to be improved; (3) they do not address how to treat phrases that overlap with each other; (4) they ignore the effect of a phrase's location on its importance. To solve these problems, this dissertation presents a statistics-based feature selection algorithm that extends and improves the existing algorithms, and then applies it to Chinese feature selection. The application results confirmed that the new algorithm is very accurate and can improve the performance of information processing.

2. Quantitative identification of the dependency relationships between words. A new statistics-based algorithm is proposed to identify the dependency relationships between words, in order to remedy the shortcomings of existing algorithms, improve identification accuracy, and further improve the efficiency and performance of information processing, natural language processing, and so on. To this end, this dissertation makes the following contributions: (1) the new algorithm makes full use of the distribution characteristics between words, and can identify dependency relationships not only between neighboring words but also between distant words, as well as some latent relationships; (2) it explicitly distinguishes the collocation, coordinate, and affiliation relationships between words and identifies each with a different strategy; (3) it presents a new string-matching model and identifies the affiliation relationship based on it; (4) it presents a new model of dependency strength between words that uses the distribution of the relative distance and location between words, and constructs a tree of dependency relationships between words based on it; (5) it offers an updating algorithm that prunes the constructed dependency tree and identifies dependency relationships between distant words and some latent relationships. The application results confirmed that the new algorithm identifies the dependency relationships between words very accurately and can improve the performance of information processing, natural language processing, and so on.
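
As an illustration of contribution 2 (4), the idea of scoring dependency strength from the distribution of relative distances between words might be sketched as follows. This is a minimal sketch under stated assumptions, not the dissertation's actual model: the function names `offset_histograms` and `dependency_strength`, the window size `max_dist`, and the scoring formula (share of the most frequent offset, multiplied by a logarithm of the co-occurrence count) are all hypothetical, and the input is assumed to be already word-segmented.

```python
import math
from collections import Counter, defaultdict


def offset_histograms(sentences, max_dist=4):
    """For each ordered word pair, count how often each relative offset occurs.

    `sentences` is a list of token lists (assumed to be pre-segmented words).
    Offset = position of the second word minus position of the first word.
    """
    hist = defaultdict(Counter)
    for tokens in sentences:
        for i, head in enumerate(tokens):
            lo = max(0, i - max_dist)
            hi = min(len(tokens), i + max_dist + 1)
            for j in range(lo, hi):
                if j != i:
                    hist[(head, tokens[j])][j - i] += 1
    return hist


def dependency_strength(hist, pair):
    """Hypothetical score: offset concentration times log(1 + co-occurrences).

    A pair scores high when the two words co-occur often AND tend to appear
    at the same relative distance. This formula is an assumption made for
    illustration only.
    """
    counts = hist.get(pair)
    if not counts:
        return 0.0
    total = sum(counts.values())
    peak = max(counts.values())
    return (peak / total) * math.log1p(total)


# Toy corpus of segmented Chinese sentences (illustrative data only).
sentences = [
    ["信息", "处理", "技术"],
    ["信息", "处理", "系统"],
    ["处理", "大量", "信息"],
]
hist = offset_histograms(sentences)
for pair in [("信息", "处理"), ("信息", "技术")]:
    print(pair, round(dependency_strength(hist, pair), 3))
```

On this toy input the pair (信息, 处理) scores higher than (信息, 技术), because it co-occurs more often and mostly at the same offset. A dependency tree could then be built by linking each word to its highest-scoring partner, although the pruning and updating steps described in (5) are not attempted here.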
Keywords/Search Tags: Information processing, Information retrieval, Feature selecting, Clustering, Classification, Subspace module, Automatic summarization, Automatic annotation, Dependent relationship, Pattern recognition