Research On Bilingual Word Pairs Extraction Based On Machine Learning

Posted on:2012-09-19

Degree:Master

Type:Thesis

Country:China

Candidate:H Cao


GTID:2178330335455726

Subject:Computer software and theory

Abstract/Summary:


Traditional bilingual dictionaries are compiled by hand. They are authoritative and of high quality, but compiling them costs a great deal of manpower and time. Meanwhile, as the internet develops, new knowledge and topics emerge every day, and it is hard to add all of the resulting new words to a dictionary manually, so hand-edited dictionaries lack timeliness. The large amount of information on the internet can help: web pages contain bilingual word pairs, and if these pairs could be extracted automatically, the web would be a very abundant resource. However, information on the internet is unstructured and disorderly, and its quality is not guaranteed. Extracting bilingual word pairs from unstructured content is itself a problem, and because the quality of the extracted pairs is not guaranteed, it is not feasible to add them all to a dictionary without filtering. Automatic extraction of bilingual word pairs therefore faces three main problems. First, how to extract bilingual word pairs from unstructured text on the internet. Second, how to determine whether an extracted bilingual word pair is of high quality. Third, how to measure the quality of bilingual word pairs and recover high-quality pairs from those extracted pairs that are of low quality.

To address these problems, this thesis proposes several machine-learning-based methods for extracting bilingual word pairs with high precision and recall. The main contributions are as follows. First, traditional methods extract bilingual word pairs from unstructured web data with fixed patterns; they are limited by the designer's prior knowledge of the data and generalize poorly. This thesis proposes a pattern-mining method instead. As in the traditional approach, fixed patterns are first used as seed patterns to extract bilingual word pairs. The extracted pairs are then used as seeds to discover more patterns, and the new patterns are used to extract more pairs. This process repeats until convergence. It overcomes the limitations of fixed patterns and improves recall. Experiments show that the method increases the recall of bilingual word pair extraction, is not affected by the choice of initial seeds, and always converges to a stable state. Second, to evaluate the quality of the extracted pairs, an SVM model is used to fuse multiple factors, overcoming the limitation of traditional methods that each consider only one aspect. Experimental results show that the algorithm improves the accuracy of the extracted bilingual word pairs. Third, it is found that although some word pairs are judged to be of low quality, a considerable number of high-quality pairs can be recovered if one side of the pair is truncated correctly, which further improves recall. The truncation problem is modeled as a ranking problem: the candidate truncation positions for a pair form a ranking group, and the top-ranked candidate is output. A learning-to-rank method is used to learn the ranking model from labeled data. Experiments show that this improvement increases the recall of bilingual word pair extraction.
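
The iterative pattern-mining loop described in the first contribution can be illustrated with a minimal sketch. It is not the thesis's implementation: the pattern representation (a separator between the two terms plus an optional closing character), the `ZH` and `EN` regular expressions, the function names, and the toy corpus are all assumptions chosen for illustration.

```python
import re

# Assumed term shapes: a run of CJK characters and a run of Latin words.
ZH = r"[\u4e00-\u9fff]+"
EN = r"[A-Za-z][A-Za-z ]*[A-Za-z]"


def extract_pairs(pattern, corpus):
    """Apply one (middle, right) pattern to every text; return (zh, en) pairs."""
    middle, right = pattern
    regex = re.compile(f"({ZH}){re.escape(middle)}({EN}){re.escape(right)}")
    return {m.groups() for text in corpus for m in regex.finditer(text)}


def induce_patterns(pairs, corpus):
    """Find known pairs in the corpus and record the separators around them."""
    patterns = set()
    for zh, en in pairs:
        # Middle: 1 to 3 non-word characters; right: at most one punctuation mark.
        regex = re.compile(
            re.escape(zh) + r"([^\w]{1,3})" + re.escape(en) + r"([^\w\s]?)"
        )
        for text in corpus:
            for m in regex.finditer(text):
                patterns.add((m.group(1), m.group(2)))
    return patterns


def bootstrap(seed_patterns, corpus, max_iters=10):
    """Alternate pair extraction and pattern induction until nothing new appears."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(max_iters):
        new_pairs = set()
        for p in patterns:
            new_pairs |= extract_pairs(p, corpus)
        new_patterns = patterns | induce_patterns(new_pairs, corpus)
        if new_pairs <= pairs and new_patterns <= patterns:
            break  # converged: no new pairs and no new patterns
        pairs |= new_pairs
        patterns = new_patterns
    return patterns, pairs


# Toy corpus (hypothetical): each text uses a different way of writing a pair.
corpus = [
    "机器学习(machine learning)是一个研究领域",
    "机器学习: machine learning, 神经网络: neural network,",
    "神经网络 - neural network. 决策树 - decision tree.",
]

# Seed with the single fixed pattern "zh(en)".
patterns, pairs = bootstrap({("(", ")")}, corpus)
print(sorted(pairs))
print(sorted(patterns))
```

Starting from the single parenthesis pattern, the loop reaches pairs that only appear with other separators, because each newly extracted pair exposes the separators used around it elsewhere in the corpus. A real system would also need to score or filter the induced patterns, since noisy pairs can induce noisy patterns; the sketch omits that step.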

Keywords/Search Tags:

Bilingual word pair extraction, Machine Learning, Pattern Mining, Learning to rank, Multi factor fusion


Related items

1	Research And Implement On Mining Parallel Bilingual Translation Pair From The Web
2	Rank Optimization For Person Re-identification Through Intelligent Machine Learning Techniques
3	Word Pair Extraction And Web-based Mining Of OOV Translations
4	Research On Multimodal Learning To Rank Based On Deep Semantic Features
5	Research On Large-Scale Bilingual Parallel Corpus Extraction From The Web
6	Research On Multi-View Classification Algorithms Based On Dictionary Pair Learning
7	Multi-word Expression Extraction Based On Chinese-English Bilingual Corpus
8	Study On Emotion Cause Pair Extraction Based On Fusion Word Vectors
9	Bilingual Word Representation Learning From Non-parallel Corpora
10	Research On Cross-language Document Sorting Learning Method Based On Bilingual Document Similarity