| Constructing high-quality tagged corpora is a fundamental task in Uyghur natural language processing. Larger and higher-quality corpora are currently needed for machine translation (MT), information retrieval (IR), web text mining, and related applications. Automatic stem segmentation and part-of-speech (POS) tagging are basic steps in building such corpora. This thesis addresses stem segmentation by combining the Bidirectional Matching algorithm with the Omni-word Segmentation algorithm. Compared with the Maximum Matching algorithm, this combined method improves the precision of stem segmentation. An improved binary-seek-by-character dictionary query mechanism is applied to Uyghur stem segmentation to improve query efficiency. Furthermore, POS tagging methods are explored, and the merits and demerits of rule-based and statistical methods are analyzed. Uyghur POS tagging is then studied with a probabilistic approach that adopts the unigram Hidden Markov Model (HMM). The model parameters are estimated with the Relative Frequency Training (RFT) method, and data sparseness is handled with a back-off smoothing algorithm. Finally, the words in each sentence are tagged using the Viterbi algorithm. The unigram HMM combined with the Viterbi algorithm proves effective for Uyghur POS tagging. |
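
The tagging pipeline summarized in the abstract (relative-frequency parameter estimation followed by Viterbi decoding) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a standard HMM with tag-to-tag transition and tag-to-word emission probabilities, which may differ from the thesis's "unigram HMM" parameterization. It also swaps in a constant probability floor for unseen events in place of the back-off smoothing the thesis uses. The function names (`train`, `log_prob`, `viterbi`), the tag set, and the three-word romanized corpus are hypothetical and serve only as toy input.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol


def train(tagged_sentences):
    """Collect the counts needed for relative-frequency estimates."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = START
        tag_count[START] += 1
        for word, tag in sentence:
            trans[(prev, tag)] += 1   # count(prev_tag -> tag)
            emit[(tag, word)] += 1    # count(tag emits word)
            tag_count[tag] += 1
            prev = tag
    return trans, emit, tag_count


def log_prob(count, total, floor=1e-6):
    """Relative frequency in log space.

    Assumption: unseen events get a constant floor. This is a stand-in
    for back-off smoothing, not an implementation of it.
    """
    return math.log(count / total) if count > 0 else math.log(floor)


def viterbi(words, tags, trans, emit, tag_count):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][t] = (log score of the best path ending in tag t at i, back-pointer)
    best = [{}]
    for t in tags:
        score = (log_prob(trans[(START, t)], tag_count[START])
                 + log_prob(emit[(t, words[0])], tag_count[t]))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[(t, words[i])], tag_count[t])
            prev, score = max(
                ((p, best[i - 1][p][0]
                  + log_prob(trans[(p, t)], tag_count[p]) + e)
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score, prev)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy training data (hypothetical, for illustration only).
corpus = [
    [("men", "PRON"), ("kitab", "N"), ("oqudum", "V")],
    [("men", "PRON"), ("oqudum", "V")],
]
trans, emit, tag_count = train(corpus)
tags = ["PRON", "N", "V"]
print(viterbi(["men", "kitab", "oqudum"], tags, trans, emit, tag_count))
```

Scores are kept in log space so that long sentences do not underflow. A real tagger would replace `log_prob` with a proper smoothing scheme and would handle out-of-vocabulary words explicitly.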