Chinese Word Segmentation Based On Active Learning

Posted on: 2016-03-12    Degree: Master    Type: Thesis
Country: China    Candidate: X T Liang    Full Text: PDF
GTID: 2308330473465468    Subject: Computer application technology
Abstract/Summary:
Chinese word segmentation (CWS) is an important task in Chinese language processing. Most traditional segmentation methods are based on dictionaries and statistical models, but they all require a large number of labeled samples. Active learning uses a selection strategy to choose the most valuable samples from abundant unlabeled data during training, and uses these chosen samples to improve the performance of Chinese word segmentation. This dissertation therefore studies active learning and proposes several Chinese word segmentation algorithms.

First, the research background and methods of CWS are introduced. Second, several active learning schemes are reviewed, and theoretical research issues and applications at home and abroad are discussed. Finally, our research work is presented in detail. The main contributions of this dissertation are summarized as follows:

1. An active learning method based on query by committee is proposed. The algorithm uses ensemble learning to construct the committee and selects the most useful unlabeled samples for manual annotation. The method is tested on corpora and compared with the existing method.

2. To address the shortage of training samples and the laborious effort of obtaining a large number of labeled samples, a new active learning method based on a stratified sampling strategy is proposed. Proper names are separated from other characters for sample selection. To further reduce the annotation effort, a diversity measure among the instances is used to avoid duplicate annotation.

3. Building on a further study of uncertainty sampling, an active learning algorithm based on near neighbors is proposed. The scheme estimates the near-neighbor entropy of each unlabeled sample and labels the sample with the highest value. To increase diversity, the Euclidean distance between an unlabeled sample and the training set is used to reduce the selection of similar samples.
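
The query-by-committee selection described in the first contribution can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not state how committee disagreement is measured, so vote entropy over per-character tags is assumed here, along with a B/M/E/S tagging scheme. The function names `vote_entropy` and `select_queries` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(committee_tags):
    """Mean per-character vote entropy for one sentence.

    committee_tags holds one tag sequence per committee member
    (assumed B/M/E/S tags), all of the same length.
    """
    n_members = len(committee_tags)
    length = len(committee_tags[0])
    total = 0.0
    for i in range(length):
        # Count how many members voted for each tag at position i.
        votes = Counter(tags[i] for tags in committee_tags)
        total -= sum(
            (c / n_members) * math.log(c / n_members) for c in votes.values()
        )
    return total / length


def select_queries(pool, k):
    """Return the k sentences on which the committee disagrees most.

    pool maps each unlabeled sentence to its committee tag sequences.
    """
    ranked = sorted(pool, key=lambda s: vote_entropy(pool[s]), reverse=True)
    return ranked[:k]


# Toy pool: three hypothetical committee members tag two sentences.
pool = {
    "我爱北京": [list("SSBE"), list("SSBE"), list("SSBE")],  # full agreement
    "研究生命": [list("BEBE"), list("BMES"), list("BEBE")],  # ambiguous split
}
to_annotate = select_queries(pool, 1)
```

In this toy pool the committee agrees on every character of the first sentence, so its score is zero, while the second sentence draws conflicting votes and would be sent for manual annotation first.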
Keywords/Search Tags: Natural language processing, Chinese word segmentation, Active learning, Selection strategy