Research On Chinese Address Segmentation Method For A Small Amount Of Labeled Data

Posted on: 2021-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: F Z Wang
GTID: 2428330623469179
Subject: Computer Science and Technology
Abstract/Summary:
Address encoding refers to the process of mapping a Chinese address described in text to spatial coordinates. It consists of address standardization, address segmentation, address matching, address location, and other steps. Among these, Chinese address segmentation is the foundation of address encoding and strongly affects the performance of address matching, address location, and other downstream steps. Chinese address segmentation can be regarded as an application of Chinese word segmentation to the field of address encoding, and this specific field has so far received little research attention. To address the shortcomings of existing Chinese address segmentation models and the high cost of labeled data, this thesis proposes a Chinese address segmentation method for a small amount of labeled data. The method uses active learning: it combines the existing model's predictions with a sample selection function to choose the most valuable samples from the unlabeled data, labels them, and adds them to the training set for the next iteration. The method improves both the address segmentation model and the sample selection strategy, and consists of the following two parts.

1. The current mainstream Chinese address segmentation model, BiLSTM+CRF, suffers from the long-term dependency problem and is hard to parallelize because of the structure of the LSTM. The proposed Chinese address segmentation model, based on an improved Transformer+CRF, replaces the BiLSTM with an improved Transformer for feature extraction. On the one hand, the attention mechanism lets each character obtain global information and interact with the other characters. On the other hand, the Transformer's ability to learn position information is improved by adding multiple position information matrices.

2. Because current active learning sample selection strategies are too simple and cannot make full use of the textual information in a sample, the proposed address sample selection strategy designs a sample selection function suited to the characteristics of Chinese addresses. It jointly considers the uncertainty and diversity of an address and the importance of its characters, in order to pick out the address samples that are most valuable to the address segmentation model.

Real address data, including both standard structured addresses and nonstandard addresses, was collected, cleaned, screened, and labeled to form the experimental dataset. The experimental results show that the proposed method can effectively reduce the cost of address annotation and achieve better segmentation with less labeled data.
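The abstract does not give the actual sample selection function, so the following is only a minimal sketch of the general idea: score each unlabeled address by a weighted sum of uncertainty, diversity, and character importance, then send the top-scoring addresses for labeling. All function names, the specific formulas (mean tag entropy, character-level Jaccard similarity, rarity of characters in the labeled pool), and the equal weights are assumptions made for illustration, not the thesis's definitions. The `predict` argument stands in for the segmentation model and is assumed to return one tag probability distribution per character.

```python
import math
from collections import Counter


def uncertainty(tag_probs):
    """Mean per-character entropy of the predicted tag distributions (assumed measure)."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in tag_probs]
    return sum(entropies) / len(entropies)


def diversity(address, labeled):
    """1 minus the highest character-level Jaccard similarity to any labeled address."""
    chars = set(address)
    best = max(
        (len(chars & set(s)) / len(chars | set(s)) for s in labeled),
        default=0.0,
    )
    return 1.0 - best


def char_importance(address, labeled):
    """Share of characters in the address seen at most once in the labeled pool."""
    counts = Counter(ch for s in labeled for ch in s)
    return sum(1 for ch in address if counts[ch] <= 1) / len(address)


def select(unlabeled, labeled, predict, k=1, weights=(1.0, 1.0, 1.0)):
    """Return the k unlabeled addresses with the highest combined score."""
    a, b, c = weights

    def score(addr):
        return (
            a * uncertainty(predict(addr))
            + b * diversity(addr, labeled)
            + c * char_importance(addr, labeled)
        )

    return sorted(unlabeled, key=score, reverse=True)[:k]
```

In an active learning loop, the addresses returned by `select` would be labeled by an annotator, moved from the unlabeled pool to the labeled pool, and the model retrained before the next round. The sketch assumes non-empty address strings.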
Keywords/Search Tags: Chinese address segmentation, Transformer, active learning