
Research On Bilingual Word Pairs Extraction Based On Machine Learning

Posted on: 2012-09-19 | Degree: Master | Type: Thesis
Country: China | Candidate: H Cao | Full Text: PDF
GTID: 2178330335455726 | Subject: Computer software and theory
Abstract/Summary:
Traditional bilingual dictionaries are compiled by hand. They are authoritative and of high quality, but they cost a great deal of manpower and time. Meanwhile, with the development of the internet, new knowledge and topics emerge every day, which makes it hard to add all the new words to a dictionary manually, so hand-edited dictionaries lack timeliness. The large amount of information on the internet can be put to use: bilingual word pairs can be extracted effectively from web pages. If this extraction could be automated, the web would be a very abundant resource. However, information on the internet is unstructured and disorderly, and its quality is not guaranteed. How to extract bilingual word pairs from unstructured content is therefore a problem, and it is not feasible to add all extracted pairs to a dictionary without filtering, because their quality is not guaranteed. There are thus three main problems in extracting bilingual word pairs automatically. First, how to extract bilingual word pairs from unstructured text on the internet. Second, how to determine whether an extracted bilingual word pair is of high quality.
Third, how to measure the quality of bilingual word pairs and recover high-quality pairs from extracted pairs that are of low quality. To solve the problems listed above, this paper proposes several methods, based on machine learning models, for extracting bilingual word pairs with high precision and recall. The main contributions are as follows. First, traditional methods extract bilingual word pairs from unstructured internet data with fixed patterns; they are restricted by human prior knowledge of the data and generalize poorly. This paper proposes a pattern mining method instead. As in the traditional approach, fixed patterns are first used as seed patterns to extract bilingual word pairs. The extracted pairs are then used as seeds to find more patterns, and the new patterns are used to extract more bilingual word pairs. This process continues until convergence. It overcomes the limitations of fixed patterns and improves recall. Experiments show that the method increases the recall of bilingual word pair extraction and that it is not affected by the choice of initial seed: the iteration always converges to a stable state. Second, to evaluate the quality of the extracted bilingual word pairs, a method is proposed that uses an SVM model to fuse multiple factors, overcoming the limitation of traditional methods that each attend to only one aspect. Experimental results show that the algorithm improves the accuracy of the extracted bilingual word pairs. Finally, it is found that, although some word pairs are judged to be of low quality, a considerable number of high-quality bilingual word pairs can be recalled if one word of the pair is truncated correctly, which further improves the recall of the extraction.
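As a rough illustration of the iterative pattern mining in the first contribution, the loop can be sketched as below. This is a minimal sketch under stated assumptions, not the method as implemented in the thesis: a pattern is assumed to be just the punctuation between and after the Chinese and English terms, and the names `CORPUS`, `extract_pairs`, `induce_patterns`, and `bootstrap` are hypothetical.

```python
import re

# Assumed term shapes: a run of CJK characters and a run of space-separated Latin words.
ZH = r"[\u4e00-\u9fff]+"
EN = r"[A-Za-z]+(?: [A-Za-z]+)*"

# Toy corpus (invented for illustration only).
CORPUS = [
    "机器学习(machine learning)是一个研究方向",
    "机器学习:machine learning;",
    "词典:dictionary;",
    "词典 - dictionary.",
    "排序 - ranking.",
]

def extract_pairs(texts, patterns):
    """Apply every (middle, right) pattern to every text and collect word pairs."""
    pairs = set()
    for middle, right in patterns:
        regex = re.compile(
            "(" + ZH + ")" + re.escape(middle) + "(" + EN + ")" + re.escape(right)
        )
        for text in texts:
            pairs.update(regex.findall(text))
    return pairs

def induce_patterns(texts, pairs):
    """Use known pairs as seeds: record the punctuation that surrounds them."""
    patterns = set()
    for zh, en in pairs:
        regex = re.compile(re.escape(zh) + r"(\W{1,3})" + re.escape(en) + r"(\W)")
        for text in texts:
            patterns.update(regex.findall(text))
    return patterns

def bootstrap(texts, seed_patterns, max_rounds=10):
    """Alternate extraction and pattern induction until nothing new is found."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(max_rounds):
        new_pairs = extract_pairs(texts, patterns)
        new_patterns = induce_patterns(texts, new_pairs)
        if new_pairs <= pairs and new_patterns <= patterns:
            break  # converged: this round added no pair and no pattern
        pairs |= new_pairs
        patterns |= new_patterns
    return pairs, patterns

pairs, patterns = bootstrap(CORPUS, {("(", ")")})
print(sorted(pairs))
print(sorted(patterns))
```

Starting from the single seed pattern `("(", ")")`, the toy run picks up the `:` … `;` and ` - ` … `.` patterns in later rounds, and with them the pairs that the seed pattern alone would miss. A real system would need pattern scoring to keep noisy patterns from spreading, which this sketch omits.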
In this paper, truncation is modeled as a ranking problem: the candidate truncation positions for a word pair form a ranking group, and the top-ranked candidate is output. A learning-to-rank method is used to learn the ranking model from labeled data. Experiments show that the improved algorithm increases the recall of bilingual word pair extraction.
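The idea of treating truncation as ranking can be illustrated with a small sketch. The abstract does not say which features or which learning-to-rank algorithm the thesis uses, so everything here is an assumption: the two features (`length_fit`, `after_stop`), the `STOP_CHARS` set, the toy training data, and the pairwise perceptron, which stands in for whatever ranking model was actually trained.

```python
STOP_CHARS = set("的用是")  # hypothetical boundary cues, not taken from the thesis

def features(zh, start, en):
    """Two illustrative features for cutting the Chinese string `zh` at `start`."""
    suffix_len = len(zh) - start
    expected_len = 2 * len(en.split())  # crude guess: about 2 characters per English word
    length_fit = -abs(suffix_len - expected_len)
    after_stop = 1.0 if start > 0 and zh[start - 1] in STOP_CHARS else 0.0
    return [float(length_fit), after_stop]

def score(w, f):
    """Linear ranking score."""
    return sum(wi * fi for wi, fi in zip(w, f))

def train(groups, epochs=10):
    """Pairwise perceptron: push the labeled cut above every other cut in its group."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for zh, en, gold in groups:
            f_gold = features(zh, gold, en)
            for start in range(len(zh)):
                if start == gold:
                    continue
                diff = [g - o for g, o in zip(f_gold, features(zh, start, en))]
                if score(w, diff) <= 0:  # gold cut not ranked above this cut
                    w = [wi + di for wi, di in zip(w, diff)]
    return w

def truncate(w, zh, en):
    """Rank all candidate start positions and keep the top-ranked suffix."""
    best = max(range(len(zh)), key=lambda s: score(w, features(zh, s, en)))
    return zh[best:]

# Labeled groups (invented): over-long Chinese side, English side, correct start index.
TRAIN = [
    ("我们使用机器学习", "machine learning", 4),
    ("使用支持向量机", "support vector machine", 2),
]

w = train(TRAIN)
print(truncate(w, "我们的双语词典", "bilingual dictionary"))
```

Each over-long candidate forms one ranking group whose members are its possible cut positions, which matches the group structure described above. Only the top-ranked cut is kept, so a low-quality pair such as ("我们的双语词典", "bilingual dictionary") can be recovered as a clean pair.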
Keywords/Search Tags: Bilingual word pair extraction, Machine learning, Pattern mining, Learning to rank, Multi-factor fusion