Research Of NLP Technologies Based On Statistics And Its Application In Chinese Information Retrieval

Posted on:2006-01-31

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y H Sun

Full Text:PDF

GTID:1118360182975501

Subject:Computer application technology

Abstract/Summary:

Chinese Information Retrieval (CIR) is an important branch of Information Retrieval (IR) and has developed rapidly in recent years. However, several issues still need further study to improve the effectiveness and efficiency of today's CIR systems. This paper uses NLP technologies based on statistics and algebra to study document processing methods at the word level and the document level, and presents solutions to several key problems in CIR.

This paper first provides a detailed theoretical analysis of the choice of indexing unit in CIR, and improves the traditional Chinese segmentation algorithm based on maximum matching, which resolves the segmentation ambiguity problem to a certain degree. In addition, a statistics-based window moving and expanding method is introduced into this segmentation algorithm, which simply and effectively improves unknown word identification.

Information Extraction (IE) has been a bottleneck restricting the performance of IR systems, and keyword extraction is one of its important factors. This paper presents a single-document keyword extraction algorithm based on the χ² statistic. The algorithm uses co-occurrence information between words to compute a χ² statistic that measures their relation. This paper also improves the traditional KEA algorithm, extends the features used to identify keywords, and implements a multi-document keyword extraction model based on Naive Bayes theory.

Text classification is a key technique for organizing document sets in IR. This paper first studies text classification algorithms, discusses how to extract feature terms, and implements a new feature extraction algorithm. In addition, this paper provides a word co-occurrence model based on the Vector Space Model (VSM) and applies the word co-occurrence resources obtained from this model to text classification, which improves the performance of the text classification system. Finally, this paper applies the idea of classification to reducing users' query ambiguity in IR, and implements a classification search system that enables users to get the information they require quickly and accurately.

To reduce the high memory and time cost of processing a high-dimensional term-document matrix, this paper introduces linear (LSI) and nonlinear (Isomap, SIE) dimension reduction algorithms into the processing of high-dimensional document data, and compares their performance in document clustering. Experimental results show that the SIE algorithm, which adopts local embedding technology, achieves performance comparable to LSI and is better than the Isomap algorithm, which uses global optimization technology.

Finally, this paper implements an IR system based on N-level VSM on the Windows platform. This system uses a hierarchical scheme for processing Web documents, and primarily improves the weight computation for key information in Web documents.
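
For readers unfamiliar with the maximum matching segmentation that the abstract takes as its starting point, the following is a minimal sketch of plain forward maximum matching. It is not the improved algorithm of the dissertation, whose details are not given here. The function name `forward_max_match`, the `max_word_len` parameter, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
segments = forward_max_match("研究生命起源", toy_dictionary)
print(segments)  # ['研究生', '命', '起源']
```

The greedy match takes 研究生 ("graduate student") first and leaves 命 stranded, although 研究 / 生命 / 起源 ("study the origin of life") is the intended reading. This is the kind of segmentation ambiguity that the abstract says the improved algorithm addresses.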

Keywords/Search Tags:

Chinese Information Retrieval, NLP Technology, Statistics, Chinese Segmentation, Keyword Extraction, Text Classification/Document Clustering

Related items

1	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
2	Research On Keyword Extraction Technology Oriented To Conversational Text
3	A Study Of Some Issues In Chinese Text Information Retrieval
4	Research And Implementation Of A Chinese Full-Text Information Retrieval Technology Based-on Lucene Search Engine
5	Research On Keyword Extraction Algorithm For Chinese Texts And Cluster Center Point Selection Algorithm In Text Clustering
6	Research And Implementation Of Chinese Automatic Text Classification System Based On SVM
7	Research On NLP Technologies And Application In Chinese Information Retrieval
8	Research And Implementation Of Text Categorization System Based On VSM
9	Web Text Classification System For Chinese Pretreatment Technology
10	A Research On Chinese Word Segmentation Based On The Combination Of Dictionary And Statistics And Full-Text Retrieval System Design