
Research Of NLP Technologies Based On Statistics And Its Application In Chinese Information Retrieval

Posted on: 2006-01-31  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y H Sun
GTID: 1118360182975501  Subject: Computer application technology
Abstract/Summary:
Chinese Information Retrieval (CIR) is an important branch of Information Retrieval and has developed rapidly in recent years. However, some issues still need further study to improve the effectiveness and efficiency of today's CIR systems. This paper uses NLP technologies based on statistics and algebra, studies processing methods for documents at the word level and the document level, and presents solutions to several key problems in CIR.

This paper first provides a detailed theoretical analysis of the choice of indexing unit in CIR, and improves the traditional Chinese segmentation algorithm based on maximum matching, which solves the segmentation ambiguity problem to a certain degree. In addition, a statistics-based window moving and expanding method is introduced into this segmentation algorithm, which simply and effectively improves unknown-word identification.

Information Extraction (IE) has been a bottleneck restricting the performance of IR systems, and keyword extraction is one of its important factors. This paper presents a single-document keyword extraction algorithm based on the χ² statistic. This algorithm uses the co-occurrence information between words to compute a χ² statistic that measures their relation. This paper also improves the traditional KEA algorithm, extends the features used in identifying keywords, and implements a multi-document keyword extraction model based on Naive Bayes theory.

Text classification is a key technique for organizing document sets in IR. This paper first studies text classification algorithms, discusses how to extract feature terms, and implements a new feature extraction algorithm.
In addition, this paper provides a word co-occurrence model based on the Vector Space Model (VSM), applies the word co-occurrence resources obtained by this model to text classification, and improves the performance of the text classification system. Finally, this paper applies the idea of classification to reducing users' query ambiguity in IR, and implements a classification search system, which enables users to get their required information quickly and accurately.

To reduce the high memory and time cost of processing the high-dimensional term-document matrix, this paper introduces linear (LSI) and nonlinear (Isomap, SIE) dimension reduction algorithms into the processing of high-dimensional document data, and compares their performance in document clustering. Experimental results show that the SIE algorithm, which adopts local embedding, achieves performance comparable to LSI and is better than the Isomap algorithm, which uses global optimization.

Finally, this paper implements an IR system based on an N-level VSM on the Windows platform. This system uses a hierarchical scheme in processing Web documents, and mainly improves the weight computation for key information in Web documents.
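
As an illustration of the traditional maximum-matching segmentation that the abstract takes as its starting point, the following is a minimal sketch of the plain forward variant only. It does not reproduce the dissertation's improved algorithm or its window moving and expanding method, which the abstract does not specify. The function name `forward_max_match`, the `max_len` parameter, and the tiny dictionary are assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (baseline sketch, not the thesis's method).

    At each position, take the longest substring (up to max_len characters)
    found in the dictionary; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, chosen to expose a segmentation ambiguity.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
tokens = forward_max_match("研究生命起源", toy_dictionary)
print(tokens)  # ['研究生', '命', '起源']
```

The greedy match picks 研究生 ("graduate student") and strands 命, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). This is the kind of segmentation ambiguity that the abstract says the improved algorithm addresses to a certain degree.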
Keywords/Search Tags: Chinese Information Retrieval, NLP Technology, Statistics, Chinese Segmentation, Keyword Extraction, Text Classification/Document Clustering