Personalized User Dictionary Updating Method Based On Network Information

Posted on: 2014-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Lu
Full Text: PDF
GTID: 2268330392469572
Subject: Computer technology
Abstract/Summary:
Chinese character input is one of the central problems in Chinese information processing and an important part of intelligent human-machine interfaces. Among input techniques, Pinyin input, which best matches users' habits, has now entered its third generation: cloud input methods. Today's mainstream input methods emphasize personalization, which takes two forms: frequency adjustment and automatic lexicon expansion. Frequency adjustment segments the user's input as it is typed, collects statistics on it, and adjusts the word frequencies in the lexicon so that the most suitable candidates are offered to the user. Automatic lexicon expansion grows the lexicon by crawling search engines or the Internet to obtain a very large training corpus (terabyte scale), so that a wide variety of words can be added to the dictionary without restriction. This thesis focuses on improving the input method by expanding the lexicon. New word identification is the most important part of lexicon expansion and is the core of this thesis. The work covers the following:

(1) Extraction and processing of network information: a web crawler fetches pages from the Sina website and extracts their content. Because web pages contain a lot of junk (advertisements, copyright notices, and similar material), the content has to be purified: the original page is parsed and filtered, the valid information is extracted, and the important information is tagged. After purification, each original page becomes a purified page with a clear structure and compact content.

(2) Design and implementation of new word extraction: new words are extracted from the purified pages with an ordinary repeated-string statistical method. First, the text is segmented at punctuation marks and stop words. Next, the occurrences of every two-, three-, and four-character string are counted, and strings whose count is higher than a threshold become new word candidates. Then a repeated-string search algorithm removes duplicate substrings, and word formation rules are used to remove garbage strings. Finally, the candidates are compared with the input method's lexicon to form a new word lexicon.

(3) Classification of new words and personalized lexicon loading: an examination of the original pages shows that the header field also contains the text category. The category is extracted by matching, which allows the new words to be classified. One or more categories of the new word lexicon are then selected and loaded according to the user's habits.

Finally, to obtain a realistic and objective evaluation of the system, precision, recall, and F-measure are used to assess the new word extraction, and character accuracy and sentence accuracy are used to compare the Pinyin input method with and without the new word lexicon. The new word extraction scores well on each measure, and adding the new word lexicon further improves the performance of the input method.
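The candidate extraction in step (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The stop-word list, the threshold value, all function names, and the rule used to drop duplicate substrings (a shorter string is discarded when a longer candidate contains it with the same count) are assumptions made for illustration. The word formation rules are not modeled.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a much larger one.
STOP_WORDS = ("的", "了", "是", "在", "和")

# Anything that is not a CJK ideograph (punctuation, spaces, Latin text, digits)
# is treated as a segment boundary.
_NON_CJK = re.compile(r"[^\u4e00-\u9fff]+")


def split_segments(text, stop_words=STOP_WORDS):
    """Cut text at punctuation and stop words; return the remaining character runs."""
    for word in stop_words:
        text = text.replace(word, " ")
    return [seg for seg in _NON_CJK.split(text) if seg]


def count_ngrams(segments, sizes=(2, 3, 4)):
    """Count every 2-, 3- and 4-character string inside each segment."""
    counts = Counter()
    for seg in segments:
        for n in sizes:
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def drop_covered_substrings(candidates):
    """Drop a string when a longer candidate contains it with the same count.

    Assumption: equal counts suggest the shorter string only occurs inside
    the longer one, so it is treated as a fragment rather than a word.
    """
    kept = {}
    for s, c in candidates.items():
        covered = any(
            len(t) > len(s) and s in t and candidates[t] == c
            for t in candidates
        )
        if not covered:
            kept[s] = c
    return kept


def extract_new_words(text, lexicon, threshold=3):
    """Return frequent strings (count > threshold) that the lexicon lacks."""
    counts = count_ngrams(split_segments(text))
    frequent = {s: c for s, c in counts.items() if c > threshold}
    reduced = drop_covered_substrings(frequent)
    return {s: c for s, c in reduced.items() if s not in lexicon}


if __name__ == "__main__":
    sample = "云输入法很好用。我喜欢云输入法,他也用云输入法!"
    # Hypothetical existing lexicon that already contains one shorter word.
    print(extract_new_words(sample, lexicon={"输入法"}, threshold=2))
    # prints {'云输入法': 3}
```

In the sample, the fragments 云输, 输入, 入法, 云输入, and 输入法 all occur three times, the same count as 云输入法, so only the longest string survives as a candidate. A corpus at the scale described above would need a suffix-array or similar index instead of the quadratic substring check used here.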
Keywords/Search Tags: Network information extraction, New words identification, New words classification, Personalized loading, Pinyin input method