| A pinyin input method is a tool for converting pinyin input into character strings. Its conversion accuracy depends on whether the input method dictionary contains an entry corresponding to the input phonetic string. Given a pinyin string, the input method first uses an efficient search algorithm to find all entries whose phonetic string matches; second, for pinyin strings that cannot be found directly in the dictionary, the required words or phrases can be obtained through the decoding algorithm of the language model. If the word the user wants still cannot be produced, we regard the entry as a new word for the input method, and a dedicated new-word mining process is needed to discover it and add it to the lexicon, supplementing the input method's inadequate vocabulary. This paper summarizes the significance of this research, the definition of new words, and the results of previous research, and then proposes a space-based new-word mining algorithm for mining new input method words. The mining process takes the input method logs of all users as its input corpus, uses an eigenspace-based similarity calculation to separate the principal components of sentences, and obtains new-word candidates through low-frequency filtering. The candidates are then filtered further using each new word's heat, burst, and acceptance, together with other related features. Finally, the results are compiled into a new dictionary and pushed to users daily. The experimental results show that the accuracy of the mining algorithm based on this idea is 83%, with a long-term accuracy of 91%. Moreover, because a Hadoop cluster is used for data processing, the mining process also performs well.
The entries discovered by the mining process are pushed to users every day as part of the pinyin input method lexicon, which addresses the lexicon's missing words, reduces users' input cost, and improves the user experience. |
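
The candidate-filtering stage described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define heat or burst, so the definitions below (heat as the total count across all days, burst as the latest day's count relative to a smoothed mean of earlier days), the thresholds, and the names `Candidate` and `mine_candidates` are all assumptions made for illustration. The eigenspace-based similarity step and the Hadoop processing are omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    heat: int     # assumed definition: total occurrences across all days
    burst: float  # assumed definition: latest-day count / (mean earlier daily count + 1)


def mine_candidates(daily_logs, known_words, min_heat=3, min_burst=2.0):
    """Return new-word candidates from per-day lists of committed strings.

    daily_logs: list of days (oldest first), each a list of strings.
    known_words: set of entries already in the dictionary.
    min_heat, min_burst: hypothetical thresholds, not taken from the source.
    """
    per_day = [Counter(day) for day in daily_logs]
    if not per_day:
        return []

    total = Counter()
    for counts in per_day:
        total.update(counts)

    earlier_days = per_day[:-1]
    results = []
    for word, heat in total.items():
        # Skip entries already in the dictionary and low-frequency strings.
        if word in known_words or heat < min_heat:
            continue
        latest = per_day[-1][word]
        if earlier_days:
            earlier_mean = sum(c[word] for c in earlier_days) / len(earlier_days)
        else:
            earlier_mean = 0.0
        # The +1 smoothing avoids division by zero for brand-new strings.
        burst = latest / (earlier_mean + 1.0)
        if burst >= min_burst:
            results.append(Candidate(word, heat, burst))

    # Most bursty first; ties broken alphabetically for a stable order.
    return sorted(results, key=lambda c: (-c.burst, c.word))
```

For example, calling `mine_candidates(logs, known_words)` with three days of logs in which an unknown string appears once, once, and then five times would keep that string as a candidate under these assumed thresholds, while a string seen only once overall would be dropped by the low-frequency filter.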