Research on the Extraction of Uyghur Domain Terms

Posted on: 2015-06-29  Degree: Master  Type: Thesis
Country: China  Candidate: J Zhong  Full Text: PDF
GTID: 2298330431491891  Subject: Computer application technology
Abstract/Summary:
With the advent of new technologies, new phenomena, new concepts and new things, more and more domain terms have entered everyday language use, greatly enriching the vocabulary of our language. In addition, information technology is developing rapidly and the social sciences have risen to a new level, bringing enormous changes to people's daily life and production. These changes also push domain terms toward diversification. Because domain terms keep expanding and changing at an accelerating pace, their automatic identification and extraction has become an important and difficult problem. With the widespread use and popularity of the Internet in the Xinjiang Uyghur Autonomous Region, it is now easier than in the past for people to obtain information from the network and to comment freely on hot topics and news. Because comments of all kinds are mixed together, many experts and scholars have focused their work on processing this information. To extract Uyghur domain terms automatically, we carried out various analyses and in-depth research, drawing on our subject study and the advantages of the local platform.
Combining the characteristics of Uyghur domain terms with related statistical features, we designed and implemented an automatic extraction algorithm, with which the automatic extraction of Uyghur domain terms is achieved. The details of Uyghur domain term extraction are as follows:
(1) Single-word domain term extraction: Uyghur single-word domain terms are difficult to obtain, and collecting them manually involves a heavy workload, relatively low efficiency, and accuracy that is difficult to guarantee. Through in-depth research and analysis, and combining Uyghur language features and pragmatic habits, this thesis applies existing lexical principles to a variety of special Uyghur conjunctions and relies on statistical algorithms such as mutual information as a screening tool, in order to accumulate an initial set of Uyghur single-word domain terms more quickly and, on this basis, to expand and build a Uyghur single-word domain term database. Using the established database, and after thorough validation and comparison, we designed and fixed a feature template for Uyghur single-word domain terms, and finally applied CRFs to Uyghur text collected and cleaned from the network. This basically achieves automatic extraction of single-word domain terms and lays a foundation for related information mining and automatic summarization work.
(2) Multi-word domain term extraction: based on Uyghur language features and pragmatic habits of domain terms, and after studying and analysing current Arabic term extraction methods, this thesis uses linguistic knowledge and statistical algorithms to design and implement an automatic extraction method for Uyghur multi-word domain terms that combines rules with statistical algorithms.
The method is divided into four stages:
① Preprocessing of corpora from a variety of sources, which mainly includes coarse segmentation of the corpus, Uyghur stop-word filtering and part-of-speech tagging;
② Taking N-gram substrings as candidate strings, using an improved mutual information algorithm and the log-likelihood ratio to calculate the internal association strength of each substring, and combining these with part-of-speech composition rules for Uyghur multi-word domain terms to build an initial set of multi-word domain term candidates;
③ Using differences in relative word frequency statistics to screen the multi-word domain term candidates a second time, while retaining as many multi-word domain terms as possible;
④ Screening the multi-word domain term candidates again using the C-value, and then post-processing them with the Uyghur affix library we built, to obtain the final domain terms.
Experiments show that the above method extracts Uyghur multi-word domain terms well.
Uyghur is an important part of our language; it has a large number of speakers and is used across wide regions. Uyghur is an effective vehicle for inheriting and developing Chinese civilization, and a necessary carrier for transmitting our culture. To some extent, the extraction of Uyghur terminology will help to standardize Uyghur domain terms in our country, and it can also promote the development of Uyghur cultural undertakings as a whole.
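The statistical measures named in stages ② and ④ can be illustrated with a minimal sketch. It assumes the textbook definitions: pointwise mutual information, Dunning's log-likelihood ratio over a 2x2 contingency table, and the standard C-value formula. The abstract does not specify the thesis's "improved" mutual information variant, so it is not reproduced here, and the function names `bigram_scores`, `llr` and `c_value` are hypothetical.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of bigram counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for observed, i, j in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            g2 += observed * math.log(observed * total / (rows[i] * cols[j]))
    return 2.0 * g2


def bigram_scores(tokens):
    """Map each adjacent pair (w1, w2) to (PMI, LLR) as a measure of internal cohesion."""
    n = len(tokens) - 1                      # number of bigram positions
    left = Counter(tokens[:-1])              # counts of words in first position
    right = Counter(tokens[1:])              # counts of words in second position
    pairs = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), k11 in pairs.items():
        k12 = left[w1] - k11                 # w1 followed by another word
        k21 = right[w2] - k11                # another word followed by w2
        k22 = n - k11 - k12 - k21            # neither w1 first nor w2 second
        pmi = math.log2(k11 * n / (left[w1] * right[w2]))
        scores[(w1, w2)] = (pmi, llr(k11, k12, k21, k22))
    return scores


def _contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    size = len(shorter)
    return any(longer[i:i + size] == shorter for i in range(len(longer) - size + 1))


def c_value(freq):
    """Standard C-value: freq maps a candidate term (tuple of words) to its frequency."""
    result = {}
    for a, fa in freq.items():
        nesting = [fb for b, fb in freq.items() if len(b) > len(a) and _contains(b, a)]
        penalty = sum(nesting) / len(nesting) if nesting else 0.0
        result[a] = math.log2(len(a)) * (fa - penalty)
    return result
```

In a pipeline like the one described, candidates whose scores fall below chosen thresholds would be dropped before the part-of-speech rules are applied. The thresholds themselves are not given in the abstract.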
Keywords/Search Tags: Uyghur, domain term, conditional random fields, mutual information, log-likelihood ratio