Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery

Posted on:2022-05-23

Degree:Master

Type:Thesis

Country:China

Candidate:X Li

GTID:2518306569460344

Subject:Communications and information processing

Abstract/Summary:

Unlike English and many other languages, Chinese text has no explicit delimiter between words, which makes it harder to process automatically. For a computer to process Chinese, the text must first be split into words; this is the Chinese word segmentation task. In recent years, neural-network-based segmentation methods have substantially outperformed traditional methods and have become the mainstream approach. Training neural models, however, requires large manually labeled corpora, and manual labeling is labor-intensive and costly, so labeling a corpus for every domain is unrealistic. Most currently available labeled corpora come from the news domain, and when a model trained on news corpora is used to segment text from other domains, its performance drops sharply. The drop is caused by differences in expression between domains and by out-of-vocabulary (OOV) words; this is the domain adaptation problem of Chinese word segmentation. To address it, this thesis proposes a cross-domain Chinese word segmentation system combined with a new word discovery algorithm, which exploits unlabeled target-domain corpora through automatic labeling. The main work of the thesis is as follows:

(1) Building on traditional new word discovery and Chinese word segmentation algorithms, the thesis constructs a cross-domain Chinese word segmentation system that incorporates new word discovery. The system first uses the new word discovery algorithm to extract a vocabulary of new words from the target-domain corpus, then automatically annotates the unlabeled target-domain corpus using that vocabulary, and finally trains a model on the automatically annotated corpus and uses it to segment target-domain text.

(2) Existing new word discovery algorithms produce vocabularies that contain many junk words and are only weakly domain-specific. To address this, the thesis takes into account both the statistical and the semantic information of words in the corpus and proposes an unsupervised new word discovery algorithm based on vector-enhanced mutual information. New word discovery experiments on the corpus of each target domain show that the method significantly improves the accuracy and domain specificity of the new-word vocabulary.

(3) Because the automatically annotated corpus contains a small number of noisy samples, the thesis proposes a Chinese word segmentation algorithm based on adversarial training. Segmentation experiments on corpora from several domains show that the method significantly improves the accuracy, robustness, and generalization of the model.
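
The first two stages of the pipeline in item (1), extracting a new-word vocabulary and then automatically annotating unlabeled text with it, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the thesis: plain pointwise mutual information over adjacent character pairs stands in for the vector-enhanced mutual information (whose formula the abstract does not give), forward maximum matching stands in for the annotation step, and the labels follow the common B/M/E/S tagging scheme. The function names, the threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Assumption: plain PMI is a simplified stand-in, not the thesis's
    vector-enhanced mutual information.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for s in sentences:
        char_counts.update(s)
        pair_counts.update(s[i:i + 2] for i in range(len(s) - 1))
    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        scores[pair] = math.log(p_xy / (p_x * p_y))
    return scores


def discover_new_words(sentences, known_words, threshold=1.0, min_count=2):
    """Keep high-PMI pairs that are not already in the known vocabulary."""
    scores = pmi_bigrams(sentences, min_count)
    return {w for w, s in scores.items()
            if s >= threshold and w not in known_words}


def auto_label(sentence, lexicon, max_len=4):
    """Annotate a sentence with B/M/E/S tags by forward maximum matching."""
    tags = []
    i = 0
    while i < len(sentence):
        # Try the longest lexicon word starting at i; fall back to one character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in lexicon:
                break
        if length == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (length - 2) + ["E"])
        i += length
    return list(zip(sentence, tags))


if __name__ == "__main__":
    # Hypothetical toy target-domain corpus.
    corpus = ["区块链技术", "区块链应用", "区块链平台"]
    print(discover_new_words(corpus, known_words=set()))
    print(auto_label("区块链技术", {"区块链", "技术"}))
```

Because this sketch scores only two-character candidates, it would surface fragments such as 区块 rather than the full word 区块链; a practical system would score longer n-grams as well. The tagged pairs returned by `auto_label` are the kind of automatically annotated data on which the final segmentation model in item (1) would be trained.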

Keywords/Search Tags:

Chinese Word Segmentation, New Word Discovery, Vector-Enhanced Mutual Information, Adversarial Training

Related items

1	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
2	Research On Chinese New Word Discovery Technology Based On Large Scale Network Corpus
3	Research And Implementation Of Chinese Word Segmentation Algorithm
4	The Research On Chinese Word Segmentation System Based On SVM
5	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
6	Research Of Chinese Word Segmentation In BERSE
7	Comparative Research On Open-Source Chinese Word Segmentation Machines
8	Research On Word-vector-representation-based New Word Discovery And Name Entity Recognition
9	Research On Chinese Named Entity Recognition Based On Feature Enhancement
10	Based On The Understanding Of The Chinese Word System Design And Realization