Research On Key Technology In Mining Web Bilingual Corpora

Posted on: 2015-02-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z D Zhu
Full Text: PDF
GTID: 1268330425494713
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the development of statistical techniques, large-scale bilingual corpora have become indispensable fundamental resources for cross-language processing research. Bilingual corpora are used to mine fine-grained translation equivalents, such as bilingual terminology, named entities and bilingual lexicons, in support of statistical machine translation and cross-language information retrieval. However, existing bilingual corpora are scarce in practice, especially for low-density languages. In recent years, original bilingual resources have been growing rapidly on the web, which offers fresh content and a vast range of sources, and mining bilingual corpora from the web has become a focus of attention.

To study the mining of bilingual corpora, this thesis designs two systems, one for mining parallel corpora and one for mining comparable corpora, together with four key technologies: parallel webpage identification, content extraction, keyphrase extraction and cross-language document similarity. The main work includes:

1) Parallel Webpage Identification Based on New Heuristic Information. To cope with the heterogeneous structure of web pages when mining parallel corpora from the web, this thesis introduces two new heuristics: tag structure alignment, computed with an improved edit distance, and the similarity of co-occurrence number sequences, computed with maximal common subsequences. A support vector machine then combines these heuristics to classify pages as parallel or not.
This approach reduces dependence on page structure information, which improves its adaptability to low-density languages.

2) Web Content Extraction Based on a Text Density Model. To avoid misjudging content boundaries and to obtain useful content from webpages with different layouts, this thesis proposes a web content extraction approach based on a text density model. It integrates page structure features with language features to convert the text lines of a page into a sequence of positive and negative density values. Gaussian smoothing is then applied to revise the density sequence, so that the content continuity of adjacent lines is taken into account. Finally, an improved maximum sequence segmentation splits the sequence and extracts the web content. Without any human intervention or repeated training, this approach maintains the integrity of the content and eliminates noise.

3) Keyphrase Extraction Based on the LDA Model. Existing methods lack a comprehensive analysis of the significance, readability and coverage of document topics. To address this, a new keyphrase extraction algorithm, TFITF, based on a latent topic model is presented. The algorithm uses a large-scale corpus to train the latent topic model, calculates the TFITF weight of each word on each topic, and from these derives the weight of each word in the document. Adjacent words are then selected as keyphrases based on co-occurrence information. Lastly, redundant phrases are excluded according to the similarity of their word topics.
The method can effectively improve the precision and recall of keyphrase extraction.

4) Cross-language Document Similarity Based on the Bi-LDA Model. Existing methods, which rely on mutually translated words and related features, cannot evaluate the topical relation between cross-language document pairs. To address this, this thesis adopts the Bi-LDA model to analyze document topic structure and measures the similarity of cross-language documents by the KL divergence between document-topic distributions, the cosine similarity between Topic Frequency-Inverse Document Frequency values, and the conditional probability between documents, in order to construct comparable corpora. This method improves the understanding of document semantics, overcomes superficial vocabulary matching and obtains similar documents with consistent topics.

The system for mining parallel corpora mainly uses parallel webpage identification and content extraction. The system for mining comparable corpora mainly uses content extraction, keyphrase extraction and cross-language document similarity. The experimental results show that the methods of this thesis can effectively improve the utilization of web resources and the quality of bilingual corpora.
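
As an illustration of the similarity step in 4), the sketch below compares two documents through their document-topic distributions. It is a minimal Python sketch under stated assumptions, not the implementation used in the thesis: the topic distributions are assumed to come from an already trained Bi-LDA model with a shared topic space, the KL divergence is symmetrised and mapped to a score with exp(-KL), and the mixing weight `alpha` and all function names are hypothetical. The conditional-probability measure mentioned in the abstract is left out because the abstract does not define it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    eps guards against log(0); it is an assumption of this sketch.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p); symmetrising is an assumption."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity(theta_src, theta_tgt, weights_src, weights_tgt, alpha=0.5):
    """Hypothetical combined score in [0, 1] for non-negative weights.

    theta_*   : document-topic distributions (assumed to come from Bi-LDA)
    weights_* : per-topic weight vectors, e.g. TF-IDF-style topic weights
    alpha     : illustrative mixing weight, not taken from the thesis
    """
    kl_score = math.exp(-symmetric_kl(theta_src, theta_tgt))
    cos_score = cosine_similarity(weights_src, weights_tgt)
    return alpha * kl_score + (1.0 - alpha) * cos_score
```

Under these assumptions, a document pair whose topic distributions are close gets a score near 1, and a pair dominated by different topics gets a lower score, so candidate pairs can be ranked or thresholded when building a comparable corpus.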
Keywords/Search Tags: web mining, bilingual corpora, parallel corpora, comparable corpora, parallel webpage identification, content extraction, keyphrase extraction, cross-language document similarity
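
The content extraction pipeline in 2) of the abstract (signed density sequence, Gaussian smoothing, maximum sequence segmentation) can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the method of the thesis. The density formula (visible text length minus a per-tag penalty and a bias) is a toy stand-in for the structure and language features the thesis combines. A plain maximum-sum contiguous segment (Kadane's algorithm) stands in for the "improved" segmentation, whose details the abstract does not give. All function names and parameter values are hypothetical.

```python
import math
import re

TAG_RE = re.compile(r"<[^>]*>")


def line_density(line, tag_penalty=5.0, bias=15.0):
    """Toy signed density (assumption): text-heavy lines score positive,
    tag-heavy or very short lines score negative."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) - tag_penalty * len(tags) - bias


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a truncated, normalised Gaussian kernel.
    Indices past either end are clamped to the nearest valid line."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)
            acc += w * values[j]
        smoothed.append(acc)
    return smoothed


def max_sum_segment(values):
    """Kadane's algorithm: (start, end) inclusive indices of the contiguous
    run with the largest sum, or None for an empty sequence."""
    best_sum = float("-inf")
    best = None
    cur_sum = 0.0
    cur_start = 0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart here.
            cur_sum = v
            cur_start = i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum = cur_sum
            best = (cur_start, i)
    return best


def extract_content(lines):
    """Return the visible text of the lines inside the best-scoring segment."""
    densities = [line_density(line) for line in lines]
    segment = max_sum_segment(gaussian_smooth(densities))
    if segment is None:
        return []
    start, end = segment
    return [TAG_RE.sub("", line).strip() for line in lines[start:end + 1]]
```

Calling `extract_content` on the lines of an HTML page returns the text of one contiguous block of lines. Because neighbouring densities are smoothed before segmentation, a short line surrounded by long content lines tends to stay inside the extracted block, which is the continuity effect the abstract attributes to the smoothing step.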