Chinese Address Analysis Based on Conditional Random Field

Posted on: 2019-01-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Zhao | Full Text: PDF
GTID: 2428330578472838 | Subject: Cartography and Geographic Information System
Abstract/Summary:
Large volumes of address descriptions are now available, but they usually have to be converted into spatial coordinates before they can support analysis and research. Geocoding is the technology that converts an address description into a point in space. It generally comprises two main processes: address resolution and address matching. The performance of address resolution directly affects matching accuracy and, ultimately, the quality of the whole geocoding system.

Address resolution mainly refers to segmenting an address into words and recognizing the type of each address component. Address segmentation methods are either rule-based or statistical. Rule-based segmentation relies on an address segmentation dictionary, which is queried with retrieval algorithms, so its accuracy depends on the accuracy and completeness of the dictionary. Statistical segmentation trains a model on a tagged address corpus, treating the address as the observation sequence and the segmentation tag set as the label sequence; the model then labels an address, and the labels can be converted into a segmentation result. Although statistical segmentation takes a certain amount of time, it can resolve some of the ambiguity in address segmentation, and its labels can encode not only segmentation tags but also address component types. A statistical method can therefore solve address segmentation and address component type recognition at the same time. The conditional random field (CRF) is currently the most commonly used statistical model for this kind of labeling.

This thesis therefore proposes an address resolution method based on conditional random fields. The method recasts address segmentation and address component recognition as a labeling problem, combining segmentation tags and component-type tags into a single label set for the output sequence. Features that help address resolution are designed by taking full account of how addresses are composed and customarily written. By constructing an address corpus and defining the corresponding feature templates, a CRF model for non-standard Chinese addresses is trained. The model labels untagged Chinese addresses and yields the optimal output label sequence, which is finally converted into segmentation and component-classification results.

CRF++ is used as the modeling tool. A corpus was built by labeling 282,257 address records for Ji'nan extracted from Gaode. The address model was trained on this corpus, and 80,000 labeled addresses were used as a test set to verify its performance. The experimental results show that the accuracy of the conditional random field basically meets the requirements of address matching, with accuracy above 80%. The best result, an accuracy of 89.02%, was obtained with the context window set to [-2, 2] and a feature template combining lexical features and address features.
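The joint labeling idea and the [-2, 2] context window can be illustrated with a minimal Python sketch. It does not call CRF++ and is not the thesis's implementation. The B/M/E/S position tags, the component types CITY and DIST, the `<PAD>` marker, and the function names `decode_tags` and `window_features` are all assumptions made for illustration, since the abstract does not list the actual tag set or feature template.

```python
from typing import Dict, List, Tuple


def decode_tags(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Convert per-character joint tags into (segment, component_type) pairs.

    Each tag is assumed to look like "B-CITY": a position tag (B, M, E or S)
    joined by "-" to a hypothetical component type.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    segments: List[Tuple[str, str]] = []
    current, current_type = "", ""
    for ch, tag in zip(chars, tags):
        position, _, comp_type = tag.partition("-")
        if position in ("B", "S"):
            # A new segment starts; flush any unfinished one first.
            if current:
                segments.append((current, current_type))
            current, current_type = ch, comp_type
        else:
            # "M" or "E": extend the segment that is being built.
            current += ch
            current_type = current_type or comp_type
        if position in ("E", "S"):
            # The segment is complete.
            segments.append((current, current_type))
            current, current_type = "", ""
    if current:
        # Flush a trailing segment that never received an "E" tag.
        segments.append((current, current_type))
    return segments


def window_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Collect the characters at offsets -window..window around position i."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the address are filled with a padding marker.
        feats[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


chars = list("济南市历下区")
tags = ["B-CITY", "M-CITY", "E-CITY", "B-DIST", "M-DIST", "E-DIST"]
print(decode_tags(chars, tags))   # [('济南市', 'CITY'), ('历下区', 'DIST')]
print(window_features(chars, 0))  # offsets -2 and -1 are '<PAD>' at the start
```

Because the position and the component type share one label, a single decoding pass yields both the segmentation and the component classification, which is the property the abstract attributes to the statistical approach.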
Keywords/Search Tags: geocoding, address resolution, corpus, CRF