
Research On Theory And Application Of Aggregation Model For Short Texts

Posted on: 2018-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
GTID: 2348330512483303
Subject: Computer software and theory
Abstract/Summary:
By providing credit card payment services to merchants, China UnionPay collects a large amount of merchant information, called "internal information", which contains merchants' names, types, locations and transactions. Merchants also have additional information on the Internet, such as commercial circle, ratings and popularity, called "external information". If a merchant's internal information can be associated with its external information, UnionPay gains extra information about that merchant and can view it from multiple perspectives. The process of associating a merchant's internal information with its external information is known as "the aggregation of merchants". The challenge in our study is to link pairs of similar names that belong to the same merchant across two merchant data sets, while avoiding linking similar names that belong to different merchants. This problem is also known as named entity normalization, which is of theoretical and practical interest for data resource integration across different fields.

In this thesis, we carry out the aggregation task for large amounts of Chinese short texts using text mining technology. The research has two parts, namely basic algorithm research and applied research:

(1) Based on text mining technology, we have studied traditional similarity algorithms for short text and proposed a new similarity algorithm called generalized Jaro-Winkler. The algorithm takes the prefix, the text length, the order of the shared characters and the interval between the shared characters into consideration, whereas traditional similarity algorithms take only one or two of these factors into account. We have chosen 6 similarity algorithms for comparison. Experimental results show that the new algorithm performs better in precision and stability than the other algorithms.

(2) We have proposed an effective aggregation model for large amounts of Chinese short texts. To ensure matching efficiency, we have devised a filter framework named fast filtering, which sharply decreases the volume of candidate pairs. To ensure matching accuracy, we have analyzed the merchants' information and devised an improved framework named refined matching, which applies the new algorithm within the model and significantly improves the accuracy of aggregation for Chinese short texts.

In conclusion, this thesis not only contributes a theoretical innovation in similarity algorithms, but also applies the proposed algorithm in a computational framework that allows us to fulfill the task of short text aggregation for UnionPay. Experiments show that our new similarity algorithm and aggregation model are usable and reliable. The research in this thesis will enrich text mining technology and provide a meaningful reference for named entity normalization.
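
The two-stage idea described above (a cheap filter that shrinks the candidate set, followed by a more careful similarity score) can be illustrated with a minimal Python sketch. The abstract does not give the formula of the generalized Jaro-Winkler algorithm, so the sketch substitutes the classic Jaro-Winkler similarity as a stand-in. The character-level inverted index, the function names, the `min_shared` and `threshold` parameters, and the merchant names are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import defaultdict


def jaro(s: str, t: str) -> float:
    """Classic Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Classic Jaro-Winkler: Jaro plus a bonus for a shared prefix."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def build_index(names):
    """Inverted index (assumed design): character -> ids of names containing it."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for ch in set(name):
            index[ch].add(i)
    return index


def candidates(query, index, min_shared=2):
    """Filtering step: keep names sharing at least min_shared distinct characters."""
    counts = defaultdict(int)
    for ch in set(query):
        for i in index.get(ch, ()):
            counts[i] += 1
    return [i for i, c in counts.items() if c >= min_shared]


def aggregate(internal, external, threshold=0.85):
    """Matching step: best-scoring candidate above threshold, else None."""
    index = build_index(external)
    result = {}
    for name in internal:
        scored = [(jaro_winkler(name, external[i]), external[i])
                  for i in candidates(name, index)]
        best = max(scored, default=(0.0, None))
        result[name] = best[1] if best[0] >= threshold else None
    return result


if __name__ == "__main__":
    # Hypothetical merchant names, invented for this example.
    internal = ["星巴克咖啡南京西路店", "老王面馆"]
    external = ["星巴克咖啡(南京西路店)", "星巴克咖啡人民广场店", "肯德基"]
    for name, match in aggregate(internal, external).items():
        print(name, "->", match)
```

In this sketch the index lookup touches only names that share characters with the query, so most of the external set is never scored. The threshold decides whether the best remaining candidate is accepted or the name is left unmatched; a suitable value would have to be tuned on real data.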
Keywords/Search Tags:big data, text mining, named entity normalization, similarity matching, generalized Jaro-Winkler, inverted index, fast matching, refined matching, aggregation for short text