Research Of Uighur, Kazakh And Kirgiz Webpage Language Identification Based On N-gram

Posted on:2016-11-04

Degree:Master

Type:Thesis

Country:China

Candidate:D D Li

GTID:2308330476450401

Subject:Computer application technology

Abstract/Summary:

Language identification (LID) is the process of determining which of a predefined set of written languages a document is in, and it is often the first step in machine translation, classification, search, information retrieval, and text processing systems. Before related work such as generating dictionaries, configuration files, and stop-word lists can begin, the language of a given text must be known. Uighur, Kazakh and Kirgiz are the three most commonly used ethnic minority languages in Xinjiang. All three belong to the Turkic group of the Altaic languages and are agglutinative languages whose words undergo very rich morphological changes, which makes spelling and grammatical errors hard for users to avoid. For this reason, this paper uses three N-gram-based methods to study Uighur, Kazakh and Kirgiz webpage language identification, because N-gram-based methods are reliable, are highly tolerant of misspellings, grammatical errors, and other textual errors, and require no prior knowledge of the language. The author collected 2512 Uighur, 2137 Kazakh and 1274 Kirgiz webpage documents from the Internet, all saved as .txt text files, which formed the original data set. This corpus was divided into training and test sets at a ratio of roughly 2:1. Then, for N = 2, 3, 4, 5, frequency statistics were used to construct an N-gram feature library for each language. Three methods were applied: the ONG method, based on a distance vector; the MNG method, based on Boolean matching; and the ING method, which uses N-gram frequency and N-gram position. For each method, the first 100, 200, 300, 400, and 500 N-gram features in the feature library of each of the three languages were selected for the Uighur, Kazakh and Kirgiz webpage language identification experiments. Precision, recall, and F1 were used to assess the effectiveness of the three methods. The experimental results showed that the MNG method performed best at identifying the three languages, followed by the ING method, with the ONG method performing worst. Overall, all three methods gave their best identification results when N = 2, and all three performed best on Uighur, followed by Kazakh, with Kirgiz identified worst. Based on this work, the author designed and implemented an N-gram-based Uighur, Kazakh and Kirgiz webpage language identification system.
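
The pipeline described in the abstract (frequency-ranked N-gram feature library per language, matching a document against each library, scoring with precision, recall, and F1) can be illustrated with a minimal Python sketch. This is not the thesis implementation: it assumes character-level N-grams, and the Boolean-match score used here (the fraction of a document's N-grams found in a language's feature library) is only one plausible reading of "Boolean matching". All function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (assumed unit: characters)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(documents, n, top_k):
    """Count n-grams over training documents and keep the top_k most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(top_k)]


def boolean_match_score(text, profile, n):
    """Fraction of the text's n-grams that occur in the profile (assumed scoring rule)."""
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    features = set(profile)
    return sum(1 for g in grams if g in features) / len(grams)


def identify(text, profiles, n):
    """Return the language whose profile matches the text best."""
    return max(profiles, key=lambda lang: boolean_match_score(text, profiles[lang], n))


def precision_recall_f1(gold, predicted, lang):
    """Per-language precision, recall, and F1 from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == lang and p == lang)
    fp = sum(1 for g, p in zip(gold, predicted) if g != lang and p == lang)
    fn = sum(1 for g, p in zip(gold, predicted) if g == lang and p != lang)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In the setting of the abstract, `profiles` would map each of the three languages to a feature library built from its training set with N in {2, 3, 4, 5} and `top_k` in {100, 200, 300, 400, 500}, and `identify` would be run on each test document before computing the per-language scores.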

Keywords/Search Tags:

Uighur, Kazakh and Kirgiz languages, webpage language identification, N-gram method

Related items

1	Study On The Kazakh Named Entity Recognition Method Based On N-gram Model
2	Research On Spelling Checker/Corrector For Kazakh Corpora
3	Synthesis Of Sign Language Animation Based On Uighur Text
4	Research On Language Identification Of Social Media Short Text Based On N-Gram Vector Feature
5	Research Of A Kazakh Sentence Similarity Computing
6	Kazakh Character Encoding And The Input Method Of The Design And Implementation
7	Ontology-based Hazard Information Extraction From Kazakh Food Complaint Documents
8	Acoustical Analysis Of Vowels And Fricatives In Mandarin Chinese Pronounced By Kazakh College Students
9	Research On Xin Jiang Kazakh Websites
10	Language-independent text learning with statistical n-gram language models