
Study on Key Technologies of Text Information Organization for Information Retrieval

Posted on: 2010-12-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
GTID: 1118360278956555
Subject: Management Science and Engineering
Abstract/Summary:
With the development of computers and the Internet, information resources are growing rapidly, and the volume of text in particular is growing explosively. Faced with such a large and rapidly expanding ocean of information, how to handle and manage it efficiently, and how to retrieve the information a user needs accurately and comprehensively, is a major challenge for information science. As an important means of addressing these problems, text information management has very broad application prospects. This dissertation is committed to the integrated use of document classification/clustering technology and document indexing technology to improve the performance and degree of automation of text organization systems. However, these key technologies and methods still have many deficiencies in practical applications, mainly: (1) Existing document clustering algorithms focus on improving accuracy and efficiency but often neglect effectiveness: parameters are difficult to determine, or the algorithm works only on specific data distributions, so the algorithms cannot meet the demands of document topic mining; (2) Document classification requires a large number of labeled samples for training, but labeled samples are difficult to obtain, which lowers the classifier's generalization ability and leaves classification accuracy short of what is needed; (3) The vector space model (VSM) produces high-dimensional document vectors, which seriously harms the efficiency and accuracy of document classification; (4) Existing indexing models are designed for Western languages and cannot build an ideal index for Chinese documents because of the differences between Chinese and Western languages. To solve these problems, this dissertation focuses on these key technologies, models, and algorithms and gives corresponding solutions through theoretical analysis and experimental research.
The research results are as follows:
(1) For the effectiveness problem of clustering algorithms, a parameter-free local density clustering algorithm based on a dynamic threshold selection model, DTSLD, is proposed. Inspired by the layered filtering idea in wavelet denoising, the algorithm first establishes a tiered dynamic threshold selection model that selects the algorithm's parameters automatically. Second, building on the RDBKNN algorithm, it uses the dynamic threshold selection model to choose suitable neighbors for each data point instead of a single global neighbor parameter k, which avoids the impact of a global parameter. The relative density threshold parameter δ is also set with the dynamic threshold selection model, but with a different strategy. Finally, for document topic mining, a polynomial kernel function is used to improve the document similarity calculation. Experiments show that the algorithm is easy to use, is effective on a variety of cloud-like and manifold data distributions, and meets the demands of document topic mining.
(2) For the small-sample problem, a new transductive inference classification algorithm is proposed. The algorithm first uses the Tri-training algorithm to exploit the unlabeled samples, gradually expanding the training set by transferring the knowledge of the data distribution implied by the unlabeled samples to the classifier. It then uses pairwise label exchange, an idea from transductive support vector machines, to maximize the sample margin, which improves the handling of boundary samples.
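The Tri-training expansion step described above can be sketched roughly as follows. This is a simplified illustration, not the dissertation's algorithm. The nearest-centroid base learner and the names `train_centroids`, `predict`, `tri_training`, and `vote` are assumptions made for the example. The sketch also omits the error-rate checks that standard Tri-training applies before accepting pseudo-labels, as well as the label exchange and data editing steps.

```python
import random
from collections import Counter

def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )

def tri_training(labeled, unlabeled, rounds=5, seed=0):
    """Simplified Tri-training: each model learns from the other two's agreement."""
    rng = random.Random(seed)
    # Three classifiers start from bootstrap resamples of the labeled set.
    models = [
        train_centroids([rng.choice(labeled) for _ in labeled])
        for _ in range(3)
    ]
    for _ in range(rounds):
        new_models = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Pseudo-label the unlabeled points on which the other two agree.
            extra = [
                (x, predict(models[j], x))
                for x in unlabeled
                if predict(models[j], x) == predict(models[k], x)
            ]
            new_models.append(train_centroids(labeled + extra))
        if new_models == models:  # nothing changed, so stop early
            break
        models = new_models
    return models

def vote(models, x):
    """Final prediction by majority vote of the three classifiers."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]
```

With a handful of labeled points and a larger unlabeled pool, `tri_training(labeled, unlabeled)` returns three models and `vote(models, x)` classifies a new point. Because pseudo-labels are accepted without any error check in this sketch, wrong labels can enter the training set, which is the noise problem the data editing step is meant to address.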
In addition, because the initial training set is small, Tri-training introduces a lot of noisy data and wrong labels as it expands the training set. To counter this negative effect, a data editing technique based on a nearest-neighbor consistency rule is introduced; it purifies the expanded training set by removing mislabeled and noisy samples during learning, which raises the set's quality.
(3) For the "curse of dimensionality" problem, feature selection algorithms for text classification are studied. A new feature selection algorithm based on the Fisher linear discriminant model is proposed; it converts the solution process into a feature optimization problem and avoids complex matrix operations. In addition, the poorly performing mutual information (MI) method is improved by using the variance of a feature's contribution across categories as the final score instead of its highest contribution. Relevant experiments also show that the conclusion in Yang's paper, that a feature's frequency is correlated with its classification capability, is in error.
(4) An integrated mixed-term full-text indexing model is proposed. The model is based on the IRST indexing model and exploits its property of preserving the relations between characters; word information is added to the nodes by extending the structure. Search efficiency is a weakness of the IRST model: the original model can only search sequentially and cannot use fast positioning techniques such as hash tables. Extending the "root node - leaf nodes" structure to a "root node - branch nodes - leaf nodes" structure overcomes this shortcoming and greatly improves retrieval speed. Experiments show that the new model successfully combines character and word indexing, with higher recall than the word indexing model and higher precision and faster retrieval than the character indexing model.
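The trade-off between character and word indexing in result (4) can be illustrated with a toy example. The sketch below is a generic hash-table positional index, not the IRST structure or the dissertation's extended model. The class `MixedIndex`, its methods, the tiny dictionary, and the forward maximum matching segmenter are all assumptions made for illustration.

```python
from collections import defaultdict

class MixedIndex:
    """Toy index holding both character postings and word postings."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)   # hypothetical word list
        self.max_len = max(map(len, self.dictionary), default=1)
        self.chars = defaultdict(list)      # char -> [(doc_id, position)]
        self.words = defaultdict(set)       # word -> {doc_id}

    def segment(self, text):
        """Forward maximum matching; unknown characters become 1-char tokens."""
        i, tokens = 0, []
        while i < len(text):
            for n in range(min(self.max_len, len(text) - i), 0, -1):
                if n == 1 or text[i:i + n] in self.dictionary:
                    tokens.append(text[i:i + n])
                    i += n
                    break
        return tokens

    def add(self, doc_id, text):
        for pos, ch in enumerate(text):
            self.chars[ch].append((doc_id, pos))
        for token in self.segment(text):
            if token in self.dictionary:
                self.words[token].add(doc_id)

    def search_chars(self, query):
        """Documents containing the query as a contiguous character string."""
        if not query:
            return set()
        hits = set(self.chars.get(query[0], []))
        for offset, ch in enumerate(query[1:], start=1):
            following = set(self.chars.get(ch, []))
            hits = {(d, p) for d, p in hits if (d, p + offset) in following}
        return {d for d, _ in hits}

    def search_words(self, query):
        """Documents in which the query was segmented as a whole word."""
        return set(self.words.get(query, set()))

    def search(self, query):
        """Word postings for dictionary terms, character postings otherwise."""
        if query in self.dictionary:
            return self.search_words(query)
        return self.search_chars(query)

index = MixedIndex({"中国", "人民", "国人", "关注", "技术"})
index.add(1, "国人关注")
index.add(2, "中国人民")
index.add(3, "区块链技术")
print(index.search_words("国人"))   # {1}: only the document where it is a word
print(index.search_chars("国人"))   # {1, 2}: document 2 matches across a word boundary
print(index.search("区块链"))        # {3}: out-of-dictionary term found via characters
```

The character postings return document 2 for "国人" even though the characters straddle two words there, which is the precision problem of pure character indexing. The word postings miss the out-of-dictionary term "区块链" entirely, which is the recall problem of pure word indexing. Keeping both kinds of postings lets a query use whichever applies.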
Keywords/Search Tags: Information Retrieval, Information Organization, Document Clustering, Document Classification, Feature Selection, Full-text Indexing Model