Research On Parameters Correlation And Optimization In Text Similarity Measurement

Posted on: 2011-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Xu
Full Text: PDF
GTID: 2178360305994207
Subject: Information and Communication Engineering
Abstract/Summary:
With the development of computer networks and application technologies, the Internet has become the primary channel for storing and exchanging information, but it has also brought an explosion in the volume of information. Information processing technologies such as data mining, information retrieval and text classification have emerged in response. As the basis of these technologies, text similarity measurement has significant research value and broad application prospects.

The parameters in text similarity measurement, such as similarity threshold, precision, recall rate, size of the moving window, shingle measure coefficient threshold, extraction rate and text length, are interrelated in complicated ways. Following the implementation process of text similarity measurement, the thesis first analyzes the key technologies: mathematical representation of text, feature generation, feature selection and similarity calculation. On this basis, it implements and compares two of the most typical algorithms. It then studies the correlation among the parameters through experiments with the shingling algorithm. Finally, it offers suggestions for optimizing the parameters, and proposes and analyzes a text similarity measurement algorithm in which parameters such as the similarity threshold are adaptive.

The algorithm is applied in a text similarity measurement system for a fund that received 7378 proposals in 2009. The results show that the algorithm performs well in practical use, with both precision and recall rate exceeding 95% regardless of whether the texts are long or short.
Keywords/Search Tags:text similarity measurement, algorithm, shingle, parameters correlation, recall rate
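
The abstract names the shingling algorithm, the size of the moving window and the similarity threshold without giving an implementation. As a rough illustration of how these three parameters interact, the following is a minimal sketch of generic word-level shingling with Jaccard resemblance. It is not the thesis's own algorithm: the function names, the word-level tokenization, the default window size of 3 and the default threshold of 0.5 are all assumptions made for this example.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text.

    w plays the role of the moving window size; word-level tokenization
    by whitespace is an assumption of this sketch.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Text shorter than the window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard resemblance of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    if not union:
        # Two empty texts: returning 0.0 is a convention chosen here.
        return 0.0
    return len(sa & sb) / len(union)


def is_similar(a, b, w=3, threshold=0.5):
    """Flag two texts as similar when resemblance reaches the threshold.

    The default threshold is a hypothetical value, not one from the thesis.
    """
    return resemblance(a, b, w) >= threshold


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox leaps over the lazy dog"
    print(resemblance(a, b, w=3))
    print(resemblance(a, b, w=1))
    print(is_similar(a, b, w=3, threshold=0.5))
```

For the two sample sentences, which differ in a single word, a window of 3 gives a resemblance of 0.4 (4 shared shingles out of 10 distinct ones), while a window of 1 gives 7/9, about 0.78. The same pair therefore falls below a 0.5 threshold with the larger window and above it with the smaller one, which is one simple way to see why window size and similarity threshold cannot be tuned independently.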