Automatic Chinese Address Recognition and Normalization

Posted on: 2011-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: H Sun
GTID: 2178330338981793
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, both the volume of data on the network and the number of users have grown exponentially. Locating accurate information and handling the different formats produced by different users are two key problems to be solved. These problems are especially serious in the field of emergency management, so we used address data from that field as our experimental object and focused on automatic Chinese address recognition and standardization.

Automatic address recognition is a sub-task of Named Entity Recognition, which belongs to Natural Language Processing. Existing research usually employs rule-based methods or statistical learning methods. We used the latter, applying a Maximum Entropy model to identify address data in plain text. The work included:
1. Characteristic analysis: word frequency features of Chinese addresses and their contexts.
2. Feature selection and modeling: we defined the features used in the maximum entropy model and applied the model to address recognition.
3. Experimenting: we validated our methods based on experimental results. As it turned out, the improvement was notable.

The other part of our work was Chinese address standardization, which involved address labeling and normalization. Chinese address labeling separates a long address into parts based on their semantic roles and attaches labels to them. We used a Conditional Random Field model in our experiments. The work included:
1. Chinese long address segmentation: we used a heuristic method and a statistical language model to improve the tokenization results of an existing tool.
2. Chinese address structured labeling: we used a Conditional Random Field model to label the address elements and validated our results. We also built a corpus containing 6000 labeled long addresses. The experimental results showed that our method achieved a great improvement.
3. Chinese address normalization: we used a rule-based method to solve problems including missing words, misspellings, and duplicated names.
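
The rule-based normalization step could look roughly like the following Python sketch. It is only an illustration of the three problem types named above (missing words, misspellings, duplicated names), not the rules used in the thesis: the rule tables, the label names ("city", "district", "area", "number") and the function normalize are all hypothetical, and the input is assumed to be the labeled elements produced by the labeling step.

```python
# Hypothetical rule tables; the thesis does not list its actual rules.
MISSPELLINGS = {"中关邨": "中关村"}  # wrong form -> assumed correct form
KNOWN_SUFFIXES = {"city": ("市",), "district": ("区", "县")}
DEFAULT_SUFFIX = {"city": "市", "district": "区"}  # restored when omitted


def normalize(elements):
    """Normalize a labeled address given as a list of (label, text) pairs."""
    result = []
    for label, text in elements:
        # Rule 1 (misspelling): replace known wrong forms.
        text = MISSPELLINGS.get(text, text)
        # Rule 2 (missing word): restore an omitted administrative suffix.
        suffixes = KNOWN_SUFFIXES.get(label)
        if suffixes and not text.endswith(suffixes):
            text += DEFAULT_SUFFIX[label]
        # Rule 3 (duplication): skip an element identical to the previous one.
        if result and result[-1] == (label, text):
            continue
        result.append((label, text))
    return result


# Example input: assumed output of the labeling step for a noisy address.
labeled = [
    ("city", "北京"),
    ("city", "北京市"),
    ("district", "海淀区"),
    ("area", "中关邨"),
    ("number", "1号"),
]
print("".join(text for _, text in normalize(labeled)))
```

Applying the rules after labeling, rather than on the raw string, keeps each rule tied to one semantic role, so a suffix rule for cities cannot accidentally fire on a road or building name.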
Keywords/Search Tags: Named Entity Recognition, Statistical Learning Methods, Maximum Entropy Model, Conditional Random Field Model, Feature Weight