The Design And Implementation Of A Chinese Organization Names Retrieval System

Posted on: 2014-09-20 | Degree: Master | Type: Thesis
Country: China | Candidate: Y S Lian | Full Text: PDF
GTID: 2308330503952563 | Subject: Control Engineering
Abstract/Summary:
An “organization” here refers to state organs, public organizations, and enterprises. Organization names are widely used in daily life and are characterized by large quantity, wide coverage, and complex composition. Moreover, new names are generated continually as society and the economy develop. Organizations with long names usually have abbreviations, and users of a retrieval system are more likely to enter these short forms, or other kinds of unknown words, when searching for an organization name. Such irregular inputs create many difficulties for natural language processing, so how to search for and match them accurately has become one of the focuses of the information processing field.

Building on an analysis of related research in China and abroad, this thesis studies in depth how to retrieve Chinese organization names efficiently, and designs and implements a retrieval system based on several theories. First, it systematically studies the segmentation rules for organization names, summarizes the existing Chinese word segmentation specifications, and proposes a segmentation specification suited to Chinese organization names. Second, to study how abbreviations of organization names are generated, it summarizes the structural features of names by analyzing a large number of organization names, gives a formal description of the full-name structure, and derives a set of abbreviation generation rules based on users’ habits.
To implement the retrieval system, this thesis customizes a Chinese word segmentation tool based on conditional random fields (CRFs) to address the segmentation of organization names. Following the abbreviation generation rules, it also designs a CRF-based abbreviation generator, which solves the problem of collecting abbreviations for the database. To match an input string against the full name of an organization, it combines a string matching algorithm with a field matching algorithm to build a fuzzy matching algorithm based on edit distance, which improves the accuracy of matching user inputs to organization names. To rank the candidates, it uses the vector space model: it sets a weight for each candidate word and sorts the candidates by cosine similarity, which worked well. The experimental results show that the retrieval system reaches an accuracy of 92.1%, which meets the basic standard for practical use.
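The matching and ranking steps can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the distance threshold, the term weighting, or how string and field matching are combined. The sketch assumes a plain Levenshtein distance normalized by the longer string, raw character counts as vector weights, and a hypothetical cutoff `max_ratio`. The function names are also illustrative.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def char_vector(text: str) -> Counter:
    """Character counts used as vector weights (an assumed weighting)."""
    return Counter(text)


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, names: list[str], max_ratio: float = 0.7) -> list[str]:
    """Filter by normalized edit distance, then rank by cosine similarity.

    max_ratio is a hypothetical cutoff, not a value from the thesis.
    """
    q_vec = char_vector(query)
    scored = []
    for name in names:
        ratio = edit_distance(query, name) / max(len(query), len(name), 1)
        if ratio <= max_ratio:
            scored.append((cosine(q_vec, char_vector(name)), name))
    # Highest similarity first; ties keep their input order.
    return [name for _, name in sorted(scored, key=lambda t: -t[0])]
```

With the names `["北京大学", "清华大学", "北京师范大学"]` and the query `"北大"`, this sketch drops `清华大学` at the distance filter and ranks `北京大学` ahead of `北京师范大学`. A real system would presumably match against the generated abbreviations as well as the full names.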
Keywords/Search Tags: Chinese Organization Name, Chinese Abbreviation, Chinese Segmentation, Conditional Random Field, Edit Distance, Vector Space Model