Research Of Uighur, Kazakh And Kirgiz Webpage Language Identification Based On N-gram

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: D D Li
GTID: 2308330476450401
Subject: Computer application technology
Abstract/Summary:
Language identification (LID) is the process of determining which of a predefined set of written languages a document is in. It is often the first step in machine translation, classification, search, information retrieval, and text processing systems: before related work such as generating dictionaries, configuration files, and stop-word lists can begin, the language of a given text must be known. Uighur, Kazakh, and Kirgiz are the three most commonly used languages of the ethnic minorities in Xinjiang. All three belong to the Turkic branch of the Altaic family and are agglutinative languages whose words undergo very rich morphological changes, which makes it difficult for users to avoid spelling and grammatical errors. For this reason, this thesis applies three N-gram-based methods to Uighur, Kazakh, and Kirgiz webpage language identification, because N-gram-based methods are reliable, are highly tolerant of misspellings, grammatical errors, and other textual errors, and require no prior knowledge of the language.

The author collected 2512 Uighur, 2137 Kazakh, and 1274 Kirgiz webpage documents from the Internet and saved them all as plain-text .txt files, which formed the original data set. This corpus was divided into training and test sets at a ratio of roughly 2:1. N-gram feature libraries were then constructed for each language by frequency statistics, with N = 2, 3, 4, 5. Three methods were used: the ONG method, based on distance vectors; the MNG method, based on Boolean matching; and the ING method, which uses N-gram frequency and N-gram position. For the identification experiments, the first 100, 200, 300, 400, and 500 N-gram features in the feature library of each of the three languages were selected. Precision, recall, and F1 were used to assess the effectiveness of the three methods.

The experimental results showed that the MNG method performed best at identifying the three languages, followed by the ING method, with the ONG method performing worst. Overall, all three methods achieved their best identification results at N = 2, and all three performed best on Uighur, followed by Kazakh, with Kirgiz identified worst. Based on this work, the author designed and implemented an N-gram-based Uighur, Kazakh, and Kirgiz webpage language identification system.
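
The abstract does not give the formulas behind the three methods, so the following is only a minimal Python sketch of the general idea, not the thesis's implementation. It builds a top-K character N-gram profile per language by frequency counting, scores a text by Boolean matching against each profile, and computes per-language precision, recall, and F1. The function names, the scoring rule (share of the text's N-grams found in the profile), and the toy strings with the labels `lang_a` and `lang_b` are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(documents, n, top_k):
    """Count n-grams over training documents and keep the top_k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def boolean_match_score(text, profile, n):
    """Share of the text's n-grams present in the profile (assumed scoring rule)."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def identify(text, profiles, n):
    """Return the label whose profile matches the largest share of n-grams."""
    return max(profiles, key=lambda label: boolean_match_score(text, profiles[label], n))


def precision_recall_f1(gold, predicted, label):
    """Per-label precision, recall, and F1 from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy placeholder data (hypothetical; not real Uighur, Kazakh, or Kirgiz text).
N, TOP_K = 2, 100
profiles = {
    "lang_a": build_profile(["abababab"], N, TOP_K),
    "lang_b": build_profile(["cdcdcdcd"], N, TOP_K),
}
predicted = [identify(t, profiles, N) for t in ["abab", "dcdc"]]
print(predicted)
print(precision_recall_f1(["lang_a", "lang_b"], predicted, "lang_a"))
```

In a setup like the one described, `N` would range over 2 to 5 and `TOP_K` over 100 to 500, with one profile per language built from the training portion of the corpus.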
Keywords/Search Tags: Uighur, Kazakh and Kirgiz language, webpage language identification, N-gram method