Research On Chinese Parallel Structure Recognition Based On Semi-Supervised Learning

Posted on:2022-09-10

Degree:Master

Type:Thesis

Country:China

Candidate:D Yang

GTID:2518306524451854

Subject:Electronics and Communications Engineering

Abstract/Summary:

The coordinate structure is a common constituent structure in natural language. Recognizing it correctly can greatly improve the performance and efficiency of automatic syntactic parsers and can support the construction of Chinese treebanks. The recognition results can also be applied directly to machine translation, information extraction, and other fields. Because Chinese is complex and highly variable, coordinate structure recognition remains a difficult problem in Chinese information processing. In current research, rule-based methods require rule templates to be constructed manually for a specific language syntax and domain, so they are costly to apply and port poorly. Statistical methods have achieved good results, but they depend heavily on annotated corpora and do not use the rich semantic information in unlabeled corpora. At the same time, existing annotated corpora cannot meet the needs of language models in the era of big data. In view of these shortcomings, this thesis explores effective methods for identifying coordinate structures and proposes a recognition method based on semi-supervised learning, aiming to address both the shortage of corpora annotated with coordinate structures and the failure to exploit the semantic information in unlabeled corpora.

First, research on Chinese coordinate structures currently relies heavily on annotated data, does not use the semantic information in unannotated data, and has not introduced semi-supervised learning. A coordinate structure recognition method based on semi-supervised learning is therefore proposed within the framework of conditional random fields (CRF). Word embeddings are trained on the unlabeled data, and unsupervised features are extracted from them. Linguistic features are then introduced in comparative experiments to examine the effect of different features on coordinate structure recognition. Experimental results show that the unsupervised features improve the recognition of coordinate structures, with the F-score reaching 85.71%, and 85.72% when linguistic and unsupervised features are combined. The unsupervised features reduce the workload of selecting features manually and incorporate semantic information into the recognition model in a more concise way.

Second, research on coordinate structure recognition is limited by the small amount of labeled data. Semi-supervised learning and active learning are effective ways to improve on supervised learning by using unlabeled data at the cost of only a small amount of labeled data. To this end, this thesis proposes a recognition method that combines the Tri-training algorithm, a co-training-style semi-supervised algorithm, with active learning. Samples are first labeled automatically by Tri-training, and then an active learning algorithm extracts some samples with high uncertainty for manual correction. Before model training, in order to balance the sample distribution of the training set, a rule-based undersampling data-editing method suited to coordinate structures is proposed, and the unlabeled data is undersampled according to these rules, which lays a good foundation for subsequent model training. To measure sample uncertainty, an enhanced least-confidence sample selection strategy (the En LC selection strategy) is proposed, under which unlabeled samples with high uncertainty scores are labeled and corrected to improve the efficiency and quality of model labeling. The experimental results show that the enhanced Tri-training method combined with active learning can effectively expand the scale of the annotated corpus, and that the En LC selection strategy yields better model performance.
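As a rough illustration of the two mechanisms the abstract combines, the sketch below shows the standard Tri-training agreement rule (a sample on which two classifiers agree becomes pseudo-labeled data for the third) and plain least-confidence selection (the samples whose most likely label has the lowest probability are sent for manual correction). It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, sample ids, and labels are hypothetical, each sample is treated as one classification decision rather than a CRF label sequence, and the thesis's enhanced En LC strategy is not reproduced because the abstract does not define it.

```python
from typing import Dict, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Plain least-confidence score: 1 minus the probability of the top label.

    Assumption: this is the textbook measure, not the thesis's En LC variant.
    """
    return 1.0 - max(probs)


def select_for_annotation(pool: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k most uncertain samples, most uncertain first."""
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


def agreed_pseudo_labels(preds_a: Dict[str, str], preds_b: Dict[str, str]) -> Dict[str, str]:
    """Tri-training agreement rule: keep samples on which two classifiers agree.

    The returned pseudo-labeled samples would be added to the training set
    of the third classifier.
    """
    return {sid: label for sid, label in preds_a.items() if preds_b.get(sid) == label}


if __name__ == "__main__":
    # Hypothetical predicted label distributions for three unlabeled samples.
    pool = {"s1": [0.90, 0.10], "s2": [0.55, 0.45], "s3": [0.70, 0.30]}
    print(select_for_annotation(pool, k=2))

    # Hypothetical predictions from two of the three classifiers.
    preds_a = {"s1": "COORD", "s2": "OTHER", "s3": "COORD"}
    preds_b = {"s1": "COORD", "s2": "COORD", "s3": "COORD"}
    print(agreed_pseudo_labels(preds_a, preds_b))
```

In a pipeline of the kind the abstract describes, the agreed samples would extend the training set automatically, while the samples returned by `select_for_annotation` would go to a human annotator for correction.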

Keywords/Search Tags:

coordinate structure, semi-supervised learning, conditional random field, tri-training, active learning

Related items

1	Study On Text Emotion Analysis Based On Supervised Learning
2	Research On Semi-supervised Learning-based Automatic Speech Annotation
3	Research On Partially Labeled Problem Based On Active Learning And Semi-supervised Mechanism
4	Research On The Application Of Geometric Information In The Semi-supervised Learning
5	Research On Semi-supervised Learning Algorithm Based On Tri-training Algorithm
6	Several Theoretical Issues On Semi-supervised Learning
7	Research On The Application Of Semi-supervised Learning In Natural Language Processing
8	Key Information Extraction Of Sequence Data Based On Deep Neural Network
9	Incorporating Self-training And Active Learning For Intention Detection
10	Research On Optimization Of Semi-supervised Classification Algorithm Combining With Active Learning