Research On Chinese Parallel Structure Recognition Based On Semi-Supervised Learning

Posted on: 2022-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: D Yang
GTID: 2518306524451854
Subject: Electronics and Communications Engineering
Abstract/Summary:
Coordinate structures are a common constituent type in natural language. Recognizing them correctly can substantially improve the accuracy and efficiency of automatic syntactic parsers and can support the construction of Chinese treebanks. The recognition results can also be applied directly to machine translation, information extraction, and other fields. Because Chinese is complex and variable, coordinate structure recognition remains a difficult problem in Chinese information processing. In current research, rule-based methods require rule templates to be constructed manually for a specific language's syntax and domain, which makes them costly to apply and hard to port. Statistical methods have achieved good results, but they depend heavily on annotated corpora and do not use the rich semantic information in unlabeled corpora. At the same time, existing annotated corpora cannot meet the needs of language models in the era of big data. In view of these shortcomings, this thesis explores effective methods for identifying coordinate structures and proposes a recognition method based on semi-supervised learning, aiming to address both the shortage of corpora annotated with coordinate structures and the unused semantic information in unlabeled corpora.

First, current research on Chinese coordinate structures relies heavily on annotated data, does not use the semantic information in unannotated data, and has not introduced semi-supervised learning. A semi-supervised coordinate structure recognition method is therefore proposed within the framework of conditional random fields (CRF). Word embeddings are trained on the unlabeled data, and unsupervised features are extracted from them. Linguistic features are then introduced in comparative experiments to examine how different features affect coordinate structure recognition. Experimental results show that the unsupervised features improve recognition, with an F-score of 85.71%, and an F-score of 85.72% when linguistic and unsupervised features are combined. The unsupervised features reduce the workload of selecting features manually and incorporate semantic information into the recognition model in a more concise way.

Second, research on coordinate structure recognition is limited by the small amount of labeled data. Semi-supervised learning and active learning are effective ways to improve the performance of supervised learning by using unlabeled data at the cost of a small amount of labeled data. To this end, this thesis proposes a recognition method that combines Tri-training, a co-training-style semi-supervised learning algorithm, with active learning. Samples are first labeled automatically through co-training, and then samples with high uncertainty are selected by an active learning algorithm for manual correction. Before model training, to balance the sample distribution of the training set, a rule-based undersampling data-editing method suited to coordinate (parallel) structures is proposed and applied to the unlabeled data, which lays a good foundation for subsequent model training. To measure sample uncertainty, an enhanced least confidence sample selection strategy (En LC selection strategy) is proposed: unlabeled samples with high uncertainty values are labeled and corrected to improve the efficiency and quality of model labeling. The experimental results show that the enhanced Tri-training method combined with active learning can effectively expand the annotated corpus, and that the model using the En LC selection strategy performs better.
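
The two selection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes the textbook forms of both ideas: in Tri-training, a sample is pseudo-labeled for one classifier when the other two classifiers agree on it, and in least confidence sampling, uncertainty is one minus the probability of the most likely label. The abstract does not specify the enhanced (En LC) variant, so it is not reproduced here. The function names, sample ids, and B/I/O labels are hypothetical, and each sample is treated as a single labeled unit rather than a CRF label sequence.

```python
def least_confidence(probs):
    """Plain least-confidence uncertainty: 1 - P(most likely label)."""
    return 1.0 - max(probs)


def select_for_annotation(pool, k):
    """Return the ids of the k most uncertain samples, most uncertain first.

    pool maps a (hypothetical) sample id to its list of label probabilities.
    """
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


def agreed_pseudo_labels(preds_a, preds_b):
    """Tri-training style filter: keep samples on which two peer classifiers agree.

    preds_a and preds_b map sample id -> predicted label. The returned
    pseudo-labels would be added to the third classifier's training set.
    """
    return {
        sid: label
        for sid, label in preds_a.items()
        if preds_b.get(sid) == label
    }


if __name__ == "__main__":
    # Hypothetical model outputs for three unlabeled samples.
    pool = {"s1": [0.9, 0.1], "s2": [0.55, 0.45], "s3": [0.6, 0.4]}
    print(select_for_annotation(pool, 2))

    preds_a = {"s1": "B", "s2": "O", "s3": "I"}
    preds_b = {"s1": "B", "s2": "I", "s3": "I"}
    print(agreed_pseudo_labels(preds_a, preds_b))
```

In a combined loop of the kind the abstract describes, the agreed samples would extend the training data automatically, while the samples returned by `select_for_annotation` would go to a human annotator for correction.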
Keywords: coordinate structure, semi-supervised learning, conditional random field, tri-training, active learning