The Design And Implementation Of A Segmentation Algorithm Based On Semi-Supervised Machine Learning Method

Posted on: 2005-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Guan
Full Text: PDF
GTID: 2168360152468745
Subject: Software engineering
Abstract/Summary:
The word is the smallest unit of language that can be used independently, yet Chinese and many other East Asian languages place no delimiters between words. Word segmentation is therefore a key sub-problem of Chinese information processing tasks such as machine translation, information retrieval and text classification.

This research has the following characteristics. First, it uses an unsupervised machine learning framework built on an unlabeled corpus. It does not rely on a hand-built dictionary to set up the language model. Instead, it uses the Expectation Maximization (EM) algorithm to train the model and optimize its parameters. To improve model performance further and to reduce the few-chunk and local-maximum problems that occur with the traditional EM algorithm, a dictionary pruning algorithm based on mutual information is studied. Because mutual information captures the coupling between two characters better than other measures, we use the mutual information between adjacent characters, instead of maximum likelihood, to split character strings. This locates the position in a word where the dependence between two characters is weakest, and it effectively improves pruning accuracy. Second, a word segmentation algorithm based on active learning is proposed. To exploit both labeled and unlabeled data, the algorithm is trained not only on the raw corpus but also on the most informative labeled samples, which are chosen automatically by the active learning procedure.

Following the methods presented above, a word segmentation system based on semi-supervised machine learning is designed and implemented. The experimental results indicate that the method achieves comparatively satisfactory performance while keeping its dependence on language resources low. It achieves acceptable results by actively choosing only a few samples to label (word recall 79.6%, word precision 77.8%)...
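
The idea of splitting a string at the weakest point of dependence between two characters can be illustrated with a small sketch. This is not the thesis's implementation: it assumes pointwise mutual information estimated from raw unigram and adjacent-pair counts, and the names `train_counts`, `pmi` and `weakest_split`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in raw text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log(p(ab) / (p(a) p(b))).

    Assumption: plain relative-frequency estimates, no smoothing.
    An unseen pair gets -inf, i.e. the weakest possible coupling.
    """
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0 or bigrams[(a, b)] == 0:
        return float("-inf")
    p_ab = bigrams[(a, b)] / n_bi
    p_a = unigrams[a] / n_uni
    p_b = unigrams[b] / n_uni
    return math.log(p_ab / (p_a * p_b))


def weakest_split(word, unigrams, bigrams):
    """Return (index, score) so that word[:index] | word[index:] cuts
    between the adjacent pair with the lowest mutual information."""
    if len(word) < 2:
        return None
    scores = [pmi(word[i], word[i + 1], unigrams, bigrams)
              for i in range(len(word) - 1)]
    i = min(range(len(scores)), key=scores.__getitem__)
    return i + 1, scores[i]


# Toy, made-up corpus: "ab" and "cd" co-occur often, "bc" rarely.
toy_corpus = ["abab", "abcd", "cdcd", "ab", "cd"]
uni, bi = train_counts(toy_corpus)
index, score = weakest_split("abcd", uni, bi)
print("abcd"[:index], "abcd"[index:])
```

In a pruning step of the kind the abstract describes, a candidate dictionary entry whose weakest internal score falls below some threshold could be split at that position; the threshold and the exact pruning rule are not given in the abstract and would have to be chosen separately.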
Keywords/Search Tags:Word Segmentation, Unsupervised Machine Learning, Semi-supervised Machine Learning, Active Learning, EM Algorithm