Applications Of Spelling Correction Techniques In Information Retrieval And Text Processing

Posted on: 2008-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2178360245991810
Subject: Computer software and theory
Abstract/Summary:
Spelling correction is an active topic in recent natural language processing research. With the pervasive use of information retrieval and text processing applications, spelling errors are unavoidable in human-typed documents, and handling misspellings wastes time and money.

After a thorough survey of state-of-the-art spelling correction techniques, we compared how they apply to web search and to text processing, and implemented a system for each of the two fields. Analyzing a large volume of query log data, we found that a misspelling shares the most similar context with its intended correction, whereas its context is less similar to those of the other candidates. Based on this finding, we first employed the noisy channel model and improved its error model component using distributional similarity. Next, we used distributional similarity as a feature in a discriminative maximum entropy model, alongside edit distance, phonetic similarity, and a language model. Both models are evaluated in the experimental results.

To correct misspellings in text processing applications, we proposed a novel method based on a discriminative reranking framework. For the first time, we formulated spelling correction as a ranking problem rather than the traditional classification problem. The method uses Ranking SVM to rerank the output of an existing spelling corrector, Aspell. With cutting-edge spelling correction techniques as features, it greatly improved Aspell's performance and also outperformed several off-the-shelf spelling correctors, such as the one used in Microsoft Word 2003. To reduce the high cost of human annotation in acquiring training pairs, we also presented a new method that automatically extracts training pairs from web query log chains. The performance of the model trained on query chain pairs is comparable to that of the model trained on human-annotated pairs.

In the last section, we gave some suggestions on testing spelling correction systems and raised some problems that need further research.
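The idea of scoring candidates with a noisy channel model whose error model draws on distributional similarity can be sketched as follows. This is only an illustrative sketch, not the thesis's implementation: the toy query log, the function names (`context_vector`, `cosine`, `rank_candidates`), the use of co-occurring query terms as context, and the use of raw cosine similarity as a stand-in for the error model P(w|c) are all assumptions made for the example.

```python
import math
from collections import Counter


def context_vector(word, queries):
    """Count the terms that co-occur with `word` in the same query."""
    ctx = Counter()
    for query in queries:
        tokens = query.split()
        if word in tokens:
            ctx.update(t for t in tokens if t != word)
    return ctx


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(misspelling, candidates, queries):
    """Rank candidates by P(c) * sim(w, c), a noisy-channel-style score.

    P(c) is an add-one smoothed unigram prior from the query log.
    sim(w, c) is an assumed stand-in for the error model P(w | c).
    """
    freq = Counter(t for q in queries for t in q.split())
    total = sum(freq.values())
    miss_ctx = context_vector(misspelling, queries)
    scored = []
    for cand in candidates:
        prior = (freq[cand] + 1) / (total + len(freq))
        channel = cosine(miss_ctx, context_vector(cand, queries))
        scored.append((cand, prior * channel))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy query log, invented for illustration only.
queries = [
    "britney spears lyrics",
    "britny spears lyrics",
    "britney spears songs",
    "britny spears songs",
    "brittany france map",
    "brittany france travel",
]

ranking = rank_candidates("britny", ["britney", "brittany"], queries)
print(ranking)
```

In this toy log the misspelling "britny" appears with the same neighbouring terms as "britney" and shares none with "brittany", so the context-based score separates two candidates that are both a small edit distance away. A real system would presumably combine this signal with edit distance and a stronger language model, as the abstract describes.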
Keywords/Search Tags: Spelling correction, machine learning, distributional similarity, Ranking SVM, query log chain