Chinese Word Segmentation Based On Active Learning

Posted on:2016-03-12

Degree:Master

Type:Thesis

Country:China

Candidate:X T Liang

Full Text:PDF

GTID:2308330473465468

Subject:Computer application technology

Abstract/Summary:

Chinese word segmentation (CWS) is an important task in Chinese language processing. Most traditional segmentation methods are based on dictionaries and statistical models, and they all require a large number of labeled samples. Active learning uses a selection strategy to choose the most valuable samples from an abundant pool of unlabeled samples during training, and uses the chosen samples to improve segmentation performance. This dissertation therefore studies active learning and proposes several Chinese word segmentation algorithms based on it.

The dissertation first introduces the research background and methods of CWS. It then reviews active learning schemes and discusses related theoretical issues and applications, both in China and abroad. Finally, it presents our research work in detail. The main contributions are summarized as follows:

1. An active learning method based on query by committee is proposed. The algorithm uses ensemble learning to construct the committee and selects the most useful unlabeled samples for manual annotation. The method is evaluated on corpora and compared with an existing method.

2. To address the shortage of training samples and the laborious effort of obtaining large numbers of labeled samples, a new active learning method based on a stratified sampling strategy is proposed. Proper names are separated from other characters for sample selection. To further reduce the annotation effort, a diversity measure among instances is used to avoid duplicate annotation.

3. Building on a further study of uncertainty sampling, an active learning algorithm based on near neighbors is proposed. The scheme estimates the near-neighbor entropy of each unlabeled sample and labels the sample with the highest value. To increase diversity, the Euclidean distance between an unlabeled sample and the training set is used to reduce the selection of redundant samples.
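
The selection strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract gives no formulas, so the vote-entropy disagreement measure, the function names (`vote_entropy`, `neighbor_entropy`, `select_query`), and the parameters `k` and `min_dist` are all assumptions chosen for illustration. The sketch scores committee disagreement for query by committee, scores an unlabeled sample by the entropy of the labels of its nearest labeled neighbors, and uses Euclidean distance to skip samples that lie too close to the training set.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of a list of labels (assumed disagreement measure).

    votes: one label per committee member, e.g. B/M/E/S character tags.
    Returns 0.0 when all labels agree; larger values mean more disagreement.
    """
    total = len(votes)
    probs = [count / total for count in Counter(votes).values()]
    return sum(p * math.log(1 / p) for p in probs)


def neighbor_entropy(sample, labeled, k=3):
    """Entropy of the labels of the k labeled points nearest to sample.

    labeled: list of (feature_vector, label) pairs (hypothetical layout).
    """
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    return vote_entropy([label for _, label in nearest])


def select_query(pool, labeled, k=3, min_dist=0.0):
    """Pick the unlabeled sample with the highest neighbor entropy.

    Samples closer than min_dist (Euclidean) to any labeled point are
    skipped, as a simple stand-in for a diversity criterion.
    Returns None if every candidate is filtered out.
    """
    candidates = [
        x for x in pool
        if all(math.dist(x, feats) >= min_dist for feats, _ in labeled)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda x: neighbor_entropy(x, labeled, k))


# Toy data: two "B" points on the left, two "E" points on the right.
labeled = [((0, 0), "B"), ((1, 0), "B"), ((4, 0), "E"), ((5, 0), "E")]
pool = [(0.5, 0), (2.4, 0)]
chosen = select_query(pool, labeled, k=2)
```

In this toy example, `(2.4, 0)` sits between the two label groups, so its two nearest neighbors disagree and it is the sample chosen for annotation. Raising `min_dist` trades informativeness for diversity by excluding candidates that are near already labeled data.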

Keywords/Search Tags:

Natural language processing, Chinese word segmentation, Active learning, Selection strategy

Related items

1	Research On Chinese Word Segmentation Based On Text And Audio
2	Study On Chinese Word Segmentation Based On Recurrent Neural Network Language Model
3	Research On Chinese Word Segmentation Based On Deep Learning
4	Research On Chinese Word Segmentation Methods Based On Deep Learning
5	Research On Chinese Word Segmentation Based On Deep Learning
6	Applied Study On Chinese Word Segmentation Based On Deep Learning
7	Research Of Chinese Word Segmentation Based On Deep Learning
8	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
9	The Methodology And Implementation Of Chinese Natural Language Query In Databases
10	Based On The Statistics Of Open Chinese Word Segmentation