
Research On Chinese Address Matching Based On Word2Vec

Posted on: 2021-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: A N Zhong
Full Text: PDF
GTID: 2370330629485246
Subject: Cartography and Geographic Information System
Abstract/Summary:
In the Internet era, address data vary in quality and often contain errors or incomplete information, and raw address text cannot take part directly in a model's numerical computation. Traditional address matching methods interpret an address only literally from its text, which leads to low computational efficiency and poor matching accuracy, so the matching of non-standard address data is a problem that urgently needs solving. Considering the nature and structural characteristics of Chinese addresses, and drawing on recent advances in text understanding in natural language processing, this thesis applies Word2Vec-based address vectorization and similarity measures to the address matching problem. It then discusses these unsupervised address matching methods for non-standard address data and their matching ability on different types of data. The main research contents are as follows:

(1) Using Shenzhen address datasets as experimental data, an address dictionary collected from the Internet serves as auxiliary data for the word segmentation tool, improving the accuracy of Chinese address word segmentation.

(2) The address corpus is fed into Word2Vec to obtain trained word vectors. On this basis, taking into account the constituent words and structure of addresses, the factor mean, power mean, TF-IDF weighted average, and SIF embedding methods are each used to obtain a vector representation of the address sentence.

(3) Cosine similarity, the Jaccard similarity coefficient, and word-level Word Mover's Distance (WMD) similarity are used to measure the similarity between each non-standard address and its corresponding correct standard address, and evaluation metrics for address retrieval matching are designed to analyze and verify the matching quality of the different methods.

The final experimental results show that the word-level WMD similarity method matches non-standard address data with high accuracy and reliability, and that the combination of factor mean and cosine similarity offers good computational efficiency and matching ability. Address matching based on Word2Vec is an efficient matching method that takes the semantics of Chinese address expressions into account.
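As an illustration of steps (2) and (3), the following minimal sketch averages word vectors into an address vector and ranks standard addresses by cosine similarity, with the Jaccard coefficient as a token-level comparison. It is not the thesis's implementation: the three-dimensional vectors in `TOY_VECTORS` are invented for the example (real ones would come from a trained Word2Vec model), the addresses are assumed to be already segmented, the plain unweighted mean is only a stand-in for the abstract's "factor mean", and all function names are hypothetical.

```python
import math

# Hypothetical toy word vectors, invented for this example only.
# In practice these would be looked up in a trained Word2Vec model.
TOY_VECTORS = {
    "深圳市": [0.9, 0.1, 0.0],
    "南山区": [0.1, 0.8, 0.1],
    "科技园": [0.0, 0.2, 0.9],
    "福田区": [0.2, 0.7, -0.3],
}


def mean_vector(tokens, vectors):
    """Unweighted mean of the vectors of all in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(query_tokens, standard_addresses, vectors):
    """Return the standard address whose mean vector is closest to the query."""
    query_vec = mean_vector(query_tokens, vectors)
    return max(
        standard_addresses,
        key=lambda toks: cosine_similarity(query_vec, mean_vector(toks, vectors)),
    )


# A non-standard address that omits the district, and two standard candidates.
query = ["深圳市", "科技园"]
standards = [["深圳市", "南山区", "科技园"], ["深圳市", "福田区"]]
matched = best_match(query, standards, TOY_VECTORS)
```

With these toy vectors the query is matched to the first candidate. Mean pooling followed by cosine similarity needs only one vector comparison per candidate, which is consistent with the abstract's remark about computational efficiency; WMD compares addresses word by word and is not shown here.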
Keywords/Search Tags: natural language processing, address vectorization, text similarity, unsupervised address matching