Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery

Posted on: 2022-05-23 | Degree: Master | Type: Thesis
Country: China | Candidate: X Li
GTID: 2518306569460344 | Subject: Communications and information processing
Abstract/Summary:
Unlike English and many other languages, Chinese has no explicit delimiter between words, which makes Chinese text harder to process. For a computer to interpret Chinese, the text must first be split into words; this is the Chinese word segmentation task. In recent years, neural network based Chinese word segmentation methods have far outperformed traditional methods and have become the mainstream approach. Training neural network models, however, requires a large manually labeled corpus. Manual labeling consumes a great deal of manpower and material resources, and labeling a corpus for every domain is unrealistic. The labeled corpora currently available come almost entirely from the news domain, and a model trained on news text performs sharply worse when it is used to segment text from other domains. This drop is caused by differences in expression between domains and by out-of-vocabulary (OOV) words, and it constitutes the domain adaptation problem of Chinese word segmentation. To address this problem, this thesis proposes a cross-domain Chinese word segmentation system combined with a new word discovery algorithm, which exploits the unlabeled corpus of the target domain through automatic labeling. The main work of this thesis is as follows:

(1) Building on traditional new word discovery and Chinese word segmentation algorithms, this thesis constructs a cross-domain Chinese word segmentation system that incorporates new word discovery. The system first uses the new word discovery algorithm to extract a vocabulary of new words from the target-domain corpus, then automatically labels the unlabeled target-domain corpus based on that vocabulary, and finally trains a model on the automatically labeled corpus and uses it to segment target-domain text.

(2) Existing new word discovery algorithms produce vocabularies that contain many junk words and have poor domain relevance. To address these shortcomings, this thesis takes into account both the statistical information and the semantic information of words in the corpus, and proposes an unsupervised new word discovery algorithm based on vector-enhanced mutual information. New word discovery experiments on the corpus of each target domain show that this method significantly improves the accuracy and domain relevance of the new word vocabulary.

(3) The automatically labeled corpus contains a small number of noisy samples. To address this, this thesis proposes a Chinese word segmentation algorithm based on adversarial training. Segmentation experiments on corpora from several domains show that this method significantly improves the accuracy, robustness, and generalization of the model.
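The first two stages of the pipeline described above (new word discovery, then automatic labeling) can be illustrated with a minimal sketch. It is not the thesis's implementation: it scores two-character candidates with plain pointwise mutual information rather than the vector-enhanced variant, and it assumes forward maximum matching with BMES tags as the automatic labeling step. The function names, the thresholds `min_count` and `min_pmi`, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def discover_new_words(corpus, known_words, min_count=2, min_pmi=1.0):
    """Return {bigram: PMI} for frequent, cohesive bigrams not already known.

    Plain PMI only; the thesis's vector-enhanced scoring is not reproduced.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        bigram_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    total_chars = sum(char_counts.values())
    total_bigrams = sum(bigram_counts.values())

    new_words = {}
    for bigram, count in bigram_counts.items():
        if count < min_count or bigram in known_words:
            continue
        p_xy = count / total_bigrams
        p_x = char_counts[bigram[0]] / total_chars
        p_y = char_counts[bigram[1]] / total_chars
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            new_words[bigram] = pmi
    return new_words


def auto_label(sentence, vocabulary, max_len=4):
    """Label a sentence with BMES tags using forward maximum matching.

    Assumed labeling strategy: at each position take the longest vocabulary
    word; fall back to a single character (tag S) when nothing matches.
    """
    tags = []
    i = 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in vocabulary:
                break
        if length == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (length - 2) + ["E"])
        i += length
    return tags


# Toy target-domain corpus (hypothetical) and a small source-domain vocabulary.
corpus = ["抗体检测", "抗体水平", "产生抗体"]
known_words = {"检测", "水平", "产生"}

new_words = discover_new_words(corpus, known_words)
vocabulary = known_words | set(new_words)
labeled = [(sentence, auto_label(sentence, vocabulary)) for sentence in corpus]
```

In this sketch, `labeled` plays the role of the automatically labeled corpus on which a segmentation model would then be trained. That training step, including the adversarial training used to cope with labeling noise, is outside the scope of the example.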
Keywords/Search Tags: Chinese Word Segmentation, New Word Discovery, Vector-Enhanced Mutual Information, Adversarial Training