Chinese Location Recognition Based On Statistics And CRF

Posted on: 2019-12-04
Degree: Master
Type: Thesis
Country: China
Candidate: W Teng
Full Text: PDF
GTID: 2428330572955292
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the network has become one of the most important information carriers in human production and daily life, and it contains a great deal of valuable geographical location information. Most of this information, however, exists only as web text, so extracting unstructured geographic information from web text has become a key problem. Chinese location recognition is the task of extracting geospatial entities from Chinese digital text. The characters used in Chinese place names often have strong word-formation ability and diverse features, which makes it difficult to locate geographical names and their boundaries accurately in text. This thesis analyzes the characteristics of Chinese geographical names in depth, casts geographical name identification as a sequence labeling problem, and trains a conditional random field (CRF) model to recognize locations. It also designs an algorithm for recognizing complex geographical names and uses it to revise the output of the trained CRF model. The main contributions are as follows:

(1) Existing models have low recognition accuracy for complex geographical names. To address this, the thesis designs an algorithm based on information entropy and pointwise mutual information (PMI). The algorithm uses a location database to generate a relevance dictionary. From this dictionary it calculates the correlation between adjacent words in the text, determines the boundaries of complex location names and their contexts, and thereby recognizes complex location names.

(2) A rule-based window detection algorithm for location recognition is proposed. In existing research, rule-based methods combined with CRF models serve mainly as a supplement to the CRF recognition result, performing correction, disambiguation, and recall. However, because the rules act only on the recognition results of the upstream layer, they cannot recover other place names in the original text that went unrecognized, so their impact is limited. Purely rule-based recognition, for its part, must apply every rule in the rule set to each sentence in turn, which is very inefficient. This thesis addresses both shortcomings. It applies rule-based recognition directly to the original text, uses geographical name feature words to coarsely locate suspected place names, and then confirms or excludes each candidate using a detection window together with the rule sets. In the experimental results, this method makes effective use of existing rule sets for geographical name identification, coordinates better with the CRF model, and improves recall.

(3) By crawling article title data from the authoritative NGAC website, the thesis builds a corpus of complex geographical names, following the rules given in The Principle and Application of Chinese Information Extraction. The corpus provides reliable training and verification data for the identification of complex geographical names.
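The PMI part of contribution (1) can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything below is an assumption made for illustration: the function names `build_pmi_table` and `merge_by_pmi`, the pre-segmented toy place-name list standing in for the location database, the maximum-likelihood probability estimates, and the fixed threshold. The information-entropy component of the thesis's method is not shown.

```python
import math
from collections import Counter


def build_pmi_table(segmented_names):
    """Estimate PMI for adjacent token pairs from pre-segmented place names.

    Hypothetical stand-in for the thesis's "relevance dictionary":
    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from raw unigram and adjacent-bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in segmented_names:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def merge_by_pmi(tokens, pmi, threshold=0.0):
    """Greedily join adjacent tokens whose PMI exceeds the threshold.

    Pairs never seen in the dictionary are treated as unrelated,
    so they always start a new span.
    """
    if not tokens:
        return []
    spans = [[tokens[0]]]
    for prev, cur in zip(tokens, tokens[1:]):
        if pmi.get((prev, cur), float("-inf")) > threshold:
            spans[-1].append(cur)   # strong association: extend current span
        else:
            spans.append([cur])     # weak or unknown association: boundary
    return ["".join(span) for span in spans]


# Toy dictionary of segmented place names (illustrative data only).
names = [
    ["北京", "大学"],
    ["北京", "西站"],
    ["南京", "大学"],
    ["南京", "西路"],
]
pmi_table = build_pmi_table(names)
merged = merge_by_pmi(["我", "在", "北京", "大学", "读书"], pmi_table)
print(merged)
```

In this sketch a boundary is placed wherever the association between neighbouring tokens is weak or unknown, which is one simple way to read "the correlation between adjacent words determines the boundary". A real system would need smoothing for rare pairs and a threshold tuned on held-out data.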
Keywords/Search Tags: Chinese Location Recognition, CRF, Information Entropy, PMI