The Research Of Address Standardization Algorithm Based On AC Automaton And Address Probability Model

Posted on: 2020-08-19; Degree: Master; Type: Thesis
Country: China; Candidate: J H Zhang
GTID: 2428330572467214; Subject: Signal and Information Processing
Abstract/Summary:
Addresses are a key bridge connecting people, objects and events. Address analysis plays an indispensable role in business competition, public opinion management and smart city construction. Address standardization transforms unstructured and spatialized addresses into standard Chinese address structures, enabling computers to understand and recognize address information. Traditional Chinese address standardization is mainly approached from three directions: dictionaries, statistics and semantics. This thesis presents an address standardization algorithm that combines a dictionary with probability statistics. It first establishes a 12-level configurable address hierarchy model derived from a large number of instance addresses. The algorithm relies on a five-level lightweight address base. First, the AC (Aho-Corasick) automaton algorithm is used to quickly tag the administrative words, keywords and auxiliary words in a Chinese address, yielding a set of Chinese address elements from which the address vector model (AVSM) is built. Second, cosine similarity and an address tree are used to determine the first five levels of administrative address elements of the AVSM. Third, the subsequent non-administrative address elements are determined according to the keywords and the address probability model. Finally, each individual standardized address is checked against the address rank rules, and for every address that passes the check, missing address rank elements are filled in from the completion dictionary. The algorithm combines the fast segmentation of dictionary-based methods with the ability of probability statistics to resolve address ambiguity, and it can standardize a large volume of address data in a short time. The administrative database and the completion dictionary maximize the effect of address completion, while the keywords and the probability model effectively identify unlisted words. The algorithm takes into account both the performance and the maintainability of word segmentation.
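
The tagging step described in the abstract can be illustrated with a minimal Aho-Corasick sketch. This is not the thesis's implementation: the dictionary entries, the tag names ("administrative", "keyword") and the function names build_automaton and tag_address are all assumptions made for illustration. It shows only how a single pass over an address string can report every dictionary term together with its tag and start position.

```python
from collections import deque

def build_automaton(dictionary):
    """Build an Aho-Corasick automaton from a {term: tag} mapping.

    The dictionary contents and tag names are hypothetical examples.
    """
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> (term, tag) pairs ending at this state

    # Phase 1: insert every term into the trie.
    for term, tag in dictionary.items():
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((term, tag))

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def tag_address(text, automaton):
    """Return (start_index, term, tag) for every dictionary hit in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term, tag in out[state]:
            hits.append((i - len(term) + 1, term, tag))
    return hits

# Hypothetical dictionary: administrative names plus address keywords.
example_dictionary = {
    "浙江省": "administrative",
    "杭州市": "administrative",
    "西湖区": "administrative",
    "路": "keyword",
    "号": "keyword",
}
automaton = build_automaton(example_dictionary)
for hit in tag_address("浙江省杭州市西湖区文三路90号", automaton):
    print(hit)
```

The text between tagged elements (here the road name and house number) is left untagged by this sketch; in the approach the abstract describes, such non-administrative elements are resolved afterwards using the keywords and the probability model.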
Keywords/Search Tags: Chinese address, standardization, Aho-Corasick automaton, NLP