Research on Recognizing Unknown Words in a BBS Corpus Based on Dictionaries and Word Frequency Analysis

Posted on: 2013-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: M Q Zhu
GTID: 2248330371971466
Subject: Computer software and theory
Abstract/Summary:
Chinese automatic word segmentation is a basic problem in Chinese information processing: text must first be segmented into words before any deeper applied research can proceed. With the rapid development of information technology, the fast-growing volume of Chinese text on the Internet places higher demands on segmentation accuracy. Unknown word recognition has long been a bottleneck that limits the effectiveness of Chinese word segmentation. To address the low efficiency of unknown word recognition, this thesis focuses on a strategy for recognizing unknown words in a forum corpus that combines dictionaries with word frequency analysis. The work consists of the following parts.

(1) Building a dynamic corpus from Tianya Forum data. The web spider WebLech crawls the Tianya Forum and downloads its pages to the local hard disk; the Java-based parser Jsoup then parses the downloaded HTML and other web files into clean plain-text (TXT) files, from which the corpus is built. A new statistic, CT, is constructed as a linear superposition of the two-word coupling function and the t-test function, and is used to identify candidate unknown words in the corpus.
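The abstract does not give the formulas behind CT, so the following Python sketch only illustrates the idea of a linear combination of a coupling score and a t-test score over adjacent character pairs. Everything concrete in it is an assumption: the coupling score is a simple Jaccard-style stand-in, the t-score is the common collocation form (observed pair probability minus the probability expected under independence, divided by an estimate of the standard error), and the weights `alpha` and `beta` are hypothetical. The thesis may define all of these differently.

```python
import math
from collections import Counter


def count_chars_and_pairs(text):
    """Count single characters and adjacent character pairs in the text."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return chars, pairs


def coupling(pair, chars, pairs):
    """Assumed coupling score: the share of occurrences of either
    character that fall inside this pair (Jaccard-style stand-in)."""
    x, y = pair
    f_xy = pairs[pair]
    denom = chars[x] + chars[y] - f_xy
    return f_xy / denom if denom > 0 else 0.0


def t_score(pair, chars, pairs):
    """Collocation-style t-score: (P(xy) - P(x)P(y)) / sqrt(P(xy) / N)."""
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    f_xy = pairs[pair]
    if f_xy == 0:
        return 0.0
    p_xy = f_xy / n_pairs
    p_x = chars[pair[0]] / n_chars
    p_y = chars[pair[1]] / n_chars
    return (p_xy - p_x * p_y) / math.sqrt(p_xy / n_pairs)


def ct(pair, chars, pairs, alpha=0.5, beta=0.5):
    """Linear combination of the two scores; the weights are hypothetical."""
    return alpha * coupling(pair, chars, pairs) + beta * t_score(pair, chars, pairs)
```

In use, one would compute `chars, pairs = count_chars_and_pairs(corpus_text)` once and rank every observed pair by `ct(pair, chars, pairs)`; pairs scoring above some cut-off become candidate unknown words. On real data the two scores have different ranges, so the weights would need tuning.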
As the basis for judging candidate unknown words, the CT algorithm is an important part of the learning and training module of the unknown word recognition prototype system. The candidates it identifies are added to a temporary dictionary, and the word frequency of each candidate in the temporary dictionary is counted. Candidates whose frequency exceeds a predetermined threshold are accepted as unknown words and added to the core dictionary; the rest are treated as high-frequency non-word strings and moved to the stop word dictionary.

(2) Designing the dictionary structure. The dictionary is divided into a core dictionary and an extended dictionary. The core dictionary is the basis for segmentation: entries are hashed on their first character to support fast lookup, and the entries under each first character are arranged by word length from long to short, which greatly reduces the number of matching attempts and improves search efficiency. The extended dictionary is divided into a temporary dictionary and a stop word dictionary. The temporary dictionary, combined with the statistical strategy, is the basis for learning and training unknown words; the stop word dictionary stores high-frequency non-word strings, which reduces the burden on the temporary dictionary. The word matching algorithm is optimized as well: the forward matching algorithm is improved in combination with the structure of the core segmentation dictionary, so that the matching length is set dynamically according to the lengths of the stored words. This improves matching efficiency and avoids repeated invalid matches and the long-word segmentation problem.

(3) Designing and implementing a prototype system for unknown word recognition. Integrating the research above, we design a prototype system that comprises four modules: corpus acquisition, document analysis, learning and training, and segmentation.
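The core dictionary layout and the improved forward matching described in part (2) can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis implementation: the function names `build_core_dictionary` and `forward_match` are hypothetical, a Python dict stands in for the first-character hash, and unmatched characters are simply emitted as single-character tokens.

```python
from collections import defaultdict


def build_core_dictionary(words):
    """Index words by first character; each bucket is sorted longest first."""
    index = defaultdict(list)
    for word in set(words):
        if word:
            index[word[0]].append(word)
    for bucket in index.values():
        bucket.sort(key=len, reverse=True)
    return dict(index)


def forward_match(text, core):
    """Forward matching: at each position, try only the entries that share
    the current first character, from longest to shortest."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fall back to a single character
        for word in core.get(text[pos], ()):
            if text.startswith(word, pos):
                match = word
                break
        tokens.append(match)
        pos += len(match)
    return tokens
```

Because each bucket already lists its words from longest to shortest, the longest candidate at any position is determined by the bucket itself. No fixed maximum window has to be shrunk one character at a time, which is one way to read the "dynamic" matching length in the abstract.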
Of these modules, the first two are used to build the corpus; the third adds new words to the core dictionary; the last performs automatic Chinese word segmentation. Finally, comparing results at system initialization with those after learning and training shows that the system is feasible and achieves some improvement in new-word recall, accuracy, and other measures.
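The step in which the learning and training module moves new words into the core dictionary can be sketched as a frequency filter. The function name `promote_candidates`, the plain substring counting, and the threshold value are all assumptions made for illustration; the abstract does not state how frequencies are counted or what threshold is used.

```python
def promote_candidates(candidates, corpus_text, threshold):
    """Split candidate strings by corpus frequency.

    Candidates occurring more than `threshold` times are returned as
    additions to the core dictionary; the rest are returned as entries
    for the stop word dictionary, following the abstract's description.
    """
    # Temporary dictionary: candidate -> frequency (non-overlapping counts).
    temporary = {c: corpus_text.count(c) for c in set(candidates)}
    core_additions = {w for w, n in temporary.items() if n > threshold}
    stop_strings = set(temporary) - core_additions
    return core_additions, stop_strings
```

For example, with a threshold of 2, a candidate seen three times in the corpus would be promoted to the core dictionary, while one seen only once would go to the stop word dictionary.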
Keywords/Search Tags: Out-of-vocabulary words, Chinese word segmentation, Word frequency statistics, Core dictionary