Sequence Labeling: Supervised Learning And Applications

Posted on: 2012-07-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Z Tang
Full Text: PDF
GTID: 1488303383997739
Subject: Computer application technology
Abstract/Summary:
With the development of machine learning theory, machine learning algorithms are gradually being applied to complex learning tasks. Supervised learning algorithms are no longer limited to classification problems, and more complex problems have attracted growing attention from researchers. Among them, sequence labeling problems, which arise in many research fields, are among the most widely studied. In this thesis, we discuss supervised learning algorithms for sequence labeling problems and apply them to several problems in natural language processing and bioinformatics. The thesis consists of the following parts:

Firstly, we present several applications of large margin based sequence labeling algorithms. Statistical language models are commonly used to tackle sequence labeling problems and have achieved good performance in many applications. However, all of them suffer from over-fitting to some degree. Large margin based sequence labeling algorithms, which introduce large margin theory into sequence labeling, not only achieve good performance but also come with theoretical guarantees on their generalization ability. For these reasons, we apply them to English chunking, Chinese word segmentation, biomedical named entity recognition and protein secondary structure prediction. When tested on the respective public datasets, the large margin based sequence labeling algorithms achieve better results than related algorithms.

Secondly, confidence-weighted online sequence labeling algorithms are presented to address data sparsity. To model the data sparsity of sequence labeling problems in natural language processing, we introduce novel linear discriminant online learning algorithms, the confidence-weighted online sequence labeling algorithms. They follow the idea of confidence-weighted classification algorithms, which provide a confidence probability for each feature weight. We apply these algorithms to English chunking, Chinese word segmentation, Chinese named entity recognition and biomedical named entity recognition. The experiments show that the proposed algorithms outperform related online algorithms and are comparable with state-of-the-art offline algorithms, such as conditional random fields.

Thirdly, frequency-based online adaptive N-gram models are presented for sequence labeling problems. The N-gram model is the most basic model for sequence labeling problems and is widely adopted in practical application systems for its simplicity and high efficiency. In N-gram based application systems, different users generally require different models, and even for a single user the N-gram model should change over time. To solve these problems, frequency-based online adaptive N-gram learning algorithms are introduced to adjust the parameters of the N-gram model automatically as the system is used. Experiments conducted on the Syllable-to-Character task show that the proposed models achieve good performance.

Fourthly, we present a reranking-based Stacking ensemble learning algorithm. Ensemble learning algorithms can usually improve performance by combining several individual models. The reranking-based Stacking algorithm presented here is able to find the best linear combination of the base classifiers on the training samples. An extended algorithm for sequence labeling problems is also introduced. The Stacking algorithm consists of the following steps: 1. train the individual models; 2. collect the predictive scores that the base models return for each possible class label or label sequence; 3. rerank all possible class labels according to the scores collected in step 2. In effect, this process finds an optimal linear combination of all base models. For classification problems, this algorithm outperforms related ensemble learning algorithms. For sequence labeling problems, we evaluate its performance on biomedical named entity recognition and find that it achieves better performance than the individual models and the baseline.

Fifthly, a cascade method is presented for detecting hedges and their scope in natural language text. In practical applications, there is a class of problems that require labeling observation sequences at different levels. We call these multi-task sequence labeling problems, and cascade learning algorithms are usually used to tackle them. Here, we treat detecting hedges and their scope as a multi-task sequence labeling problem and present a two-layer cascade system to deal with it. A hedge is a word or word sequence marking information that cannot be presented as factual. Hedge detection is an active research topic: "Detecting hedges and their scope in natural language text" is the main topic of the international public shared task CoNLL-2010. The proposed cascade method achieves competitive performance on this shared task. For the hedge detection subtask, our system achieves the best result on the biomedical corpus.
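
The three Stacking steps listed in the abstract can be illustrated with a minimal sketch. The abstract does not say how the linear combination is learned, so the perceptron-style update below is an assumption chosen for brevity, not the thesis's method. The names `combine`, `rerank` and `train_reranker` are hypothetical, and the base models are assumed to be already trained (step 1) and to expose one score per candidate label (step 2).

```python
from typing import List, Sequence, Tuple

# scores[m][y] is the score that base model m assigns to candidate label y.
Scores = List[List[float]]


def combine(weights: Sequence[float], scores: Scores) -> List[float]:
    """Linearly combine the base-model scores for every candidate label."""
    n_labels = len(scores[0])
    return [sum(w * s[y] for w, s in zip(weights, scores))
            for y in range(n_labels)]


def rerank(weights: Sequence[float], scores: Scores) -> int:
    """Step 3: return the candidate label with the highest combined score."""
    combined = combine(weights, scores)
    return max(range(len(combined)), key=combined.__getitem__)


def train_reranker(samples: List[Tuple[Scores, int]], n_models: int,
                   epochs: int = 10, lr: float = 1.0) -> List[float]:
    """Learn one weight per base model (assumed perceptron-style update)."""
    weights = [1.0] * n_models  # start from an unweighted sum
    for _ in range(epochs):
        for scores, gold in samples:
            pred = rerank(weights, scores)
            if pred != gold:
                # Move each weight toward the model's score for the gold
                # label and away from its score for the wrong prediction.
                for m in range(n_models):
                    weights[m] += lr * (scores[m][gold] - scores[m][pred])
    return weights
```

For sequence labeling, one plausible extension is to treat each candidate label sequence (for example, an entry in an n-best list) as one "label" index, so that the same reranking step applies unchanged. This is an illustrative reading, not a detail confirmed by the abstract.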
Keywords/Search Tags: Supervised learning algorithms, Sequence labeling problem, Natural language processing, Bioinformatics, Stacking ensemble learning