Research On N-gram Based Hierarchical Text Language Identification

Posted on:2019-06-07

Degree:Doctor

Type:Dissertation

Country:China

Candidate:M T Y M H S M Mai

Full Text:PDF

GTID:1528305651465764

Subject:Computer application technology

Abstract/Summary:

Language identification (LI) is a technique for automatically assigning electronic text to one of a set of pre-designated languages. LI is the first step in natural language processing systems such as information retrieval, search engines, speech synthesis, automated question answering, and machine translation. Because LI has received little research attention in China, and because open-source LI tools do not take minority languages into account, the following research was carried out.

Uyghur, Kazakh, and Kirgiz characters are placed in the Arabic script area of the Unicode encoding scheme, so these three languages share a code area with other languages written in Arabic script. This complicates the retrieval, classification, and organization of Uyghur, Kazakh, and Kirgiz text. To solve this problem, this dissertation first analyzes the character features of the three languages and counts the characters unique to each. Based on this, an LI method for Uyghur, Kazakh, and Kirgiz texts based on statistically unique characters is designed. Experimental results show that when the text is longer than 70 words, the identification rate of this algorithm exceeds 96.67%, which makes it suitable for paragraph-level LI.

Since Uyghur, Kazakh, and Kirgiz are written in Arabic script, text in these three languages collected from the Internet must be distinguished from text in other Arabic-script languages. However, the unique-character method above is not suitable for sentence-level identification or for identification among a larger number of languages. Moreover, most Arabic-script languages are agglutinative, and word-feature models are not well suited to identifying agglutinative languages. To solve this problem, an LI technique based on character-level N-grams is proposed for Arabic-script languages. Experimental results show that in sentence-level LI, apart from the two similar languages Persian and Western Persian, the recognition rate for every other language is above 97%. The method is therefore suitable for LI of short texts in Arabic-script languages. LI experiments were also conducted on Arabic-script languages belonging to different language groups. Analysis of the results shows that identification errors occur mainly among languages in the same language group, especially between similar languages. In addition, when the number of candidate languages is reduced, the identification rate improves.

Different scripts occupy different code areas in the Unicode encoding scheme, and this feature can be used to identify the different script portions of a text. In addition, different languages belong to different language families, and a language group can be further divided into finer language groups. Languages in the same language group have similar vocabulary and grammatical structure, which makes it possible to identify the language groups within a single script. Most LI errors occur between similar languages, which share a high degree of similarity in vocabulary and grammatical structure. This feature can be used to identify similar language groups within a script or within a language group. On this basis, this dissertation proposes several types of hierarchical text LI algorithms and compares them experimentally one by one. The results show that LI performance improves step by step as script identification, language group identification, similar language group identification, and identification of similar languages within a similar language group are applied. In sentence-level LI, the identification rate reaches 98% with the four-stage LI algorithm, a significant improvement over the open-source LI tool langid.py. This dissertation also analyzes how foreign words written in the same script affect LI.

According to recent research by some scholars, the languages in some language groups posited by historical linguists are not highly similar. If the similarity between the languages in a language group is not high enough, the accuracy of language group identification drops. In addition, to determine similar language groups, LI is first performed over all languages in the LI system, and similar language groups are then selected according to the recognition results. This approach is inconvenient when new languages are added. To optimize language group identification in hierarchical LI, this dissertation proposes an algorithm that automatically forms language groups and similar language groups based on similarities between languages. Experimental results show that the algorithm can form language groups from languages sharing a script in a short time. In hierarchical LI, using the automatically formed language groups yields higher accuracy than using the language groups posited by historical linguists. The results also show that the similar language groups formed automatically, whether within a language group or among all languages sharing a script, are the same as the similar language groups detected in the LI experiments. Another advantage of the automatic grouping algorithm is that when a language is added, language groups and similar language groups are first formed automatically, and the group to which the new language belongs is then determined. Only the classifier for that language group or similar language group is updated, without updating the classifiers of the entire LI system. This not only saves time but also leaves LI efficiency for the other languages unaffected.
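
The first and last stages of the hierarchy described in the abstract (script identification from Unicode code areas, then character-level N-gram identification within that script) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the names `SCRIPT_RANGES`, `detect_script`, `char_ngrams`, and `HierarchicalLI` are hypothetical, the add-one-smoothed log-likelihood scoring is a stand-in for whatever classifier the author used, and the intermediate language-group and similar-language-group stages are omitted.

```python
import math
from collections import Counter

# Assumed, simplified code areas; real Unicode scripts span more blocks than these.
SCRIPT_RANGES = {
    "Latin": (0x0041, 0x024F),
    "Cyrillic": (0x0400, 0x04FF),
    "Arabic": (0x0600, 0x06FF),
}


def detect_script(text):
    """Stage 1: majority vote over the code area of each alphabetic character."""
    votes = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                votes[script] += 1
                break
    return votes.most_common(1)[0][0] if votes else None


def char_ngrams(text, n_max=3):
    """Count character n-grams for n = 1..n_max, padding the text with spaces."""
    padded = f" {text.strip()} "
    grams = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


class HierarchicalLI:
    """Two-stage identifier: script first, then an n-gram model per script."""

    def __init__(self, n_max=3):
        self.n_max = n_max
        self.profiles = {}  # script -> {language -> Counter of n-grams}

    def train(self, language, sentences):
        # File the language under the script its training text is written in.
        script = detect_script(" ".join(sentences))
        by_language = self.profiles.setdefault(script, {})
        profile = by_language.setdefault(language, Counter())
        for sentence in sentences:
            profile.update(char_ngrams(sentence, self.n_max))

    def identify(self, text):
        script = detect_script(text)
        candidates = self.profiles.get(script, {})
        if not candidates:
            return script, None
        grams = char_ngrams(text, self.n_max)
        vocabulary = set()
        for profile in candidates.values():
            vocabulary.update(profile)

        def log_likelihood(profile):
            # Add-one smoothing so unseen n-grams do not zero out the score.
            total = sum(profile.values()) + len(vocabulary) + 1
            return sum(count * math.log((profile[gram] + 1) / total)
                       for gram, count in grams.items())

        best = max(candidates, key=lambda lang: log_likelihood(candidates[lang]))
        return script, best
```

Because each script keeps its own set of language profiles, adding a language only touches the profiles under that script, which loosely mirrors the abstract's point that only the classifier of the affected group needs updating.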

Keywords/Search Tags:

language identification, N-gram, language group, similar language, similar language group, script recognition

Related items

1	Application Research On Statistical Language Model Of Large Vocabulary Continuous Speech Recognition System
2	Research On Language Features Of Entertainment News Show Host Of Provincial Television Terrestrial Channels
3	Research Of Uighur, Kazakh And Kirgiz Webpage Language Identification Based On N-gram
4	The Optimization And Implementation Of The Efficiency And Performance Of Chinese Language Model Based On Recurrent Neural Network
5	Researching And Building Of The Mongolian Large Vocabulary Independent Continuous Speech Recognition System
6	Study On Segmentation Technology Based On Group Of Similar Images
7	Design And Implementation Of The Fee Calculate Engine Based On Script Language
8	Language-independent text learning with statistical n-gram language models
9	Mining Of Semantic Similar Items Based On Cross-Language Mapping
10	Telephone Voice-based Minority Language Recognition Research