Chinese Webpage Feature Extraction In Learning To Rank Algorithms

Posted on: 2010-05-02  Degree: Master  Type: Thesis
Country: China  Candidate: J Liu
GTID: 2178360332457875  Subject: Computer Science and Technology
Abstract/Summary:
Ranking technology in search engines has evolved over two generations. First-generation search engines, such as Infoseek, Excite, and Lycos, rank results statistically by word frequency and position. This method has disadvantages: it does not use properties specific to web pages, such as hyperlinks and anchor text. Moreover, many webpage editors stuff their pages with keywords to manipulate search engines and gain a higher position in the results. Second-generation search engines rank by link analysis, such as Hyperlink Analysis from Baidu and PageRank from Google. To appear on the first page of results, websites often add links, exchange links with one another, or set up spam links. As a result, websites with excellent content but few links are very hard to find through search engines.

Learning to rank, a newer method of webpage ranking, can compensate for the shortcomings of both approaches. However, existing learning-to-rank methods have only been applied to English webpages, and learning to rank for Chinese webpages has received little research.

To address this gap, the thesis designs and implements a Chinese webpage feature extraction system for learning to rank, taking the distinctive features of Chinese webpages into account. In addition to traditional word-frequency statistics such as TF, IDF, and DL, the system applies classic document relevance models such as BM25, LMIR_ABS, LMIR_DIR, and LMIR_JM. The thesis also applies Edit Length to Chinese webpage feature extraction for learning to rank.

We then set up a learning-to-rank platform and implemented the classical RankNet and RankSVM algorithms on the extracted Chinese webpage features.
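The relevance features named above (TF, IDF, DL, BM25, and the three LMIR smoothing variants) can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes documents and queries have already been segmented into token lists (Chinese text needs a word segmenter, which the abstract does not name), it uses one common BM25 IDF variant, and the function names and default parameters (`k1`, `b`, `mu`, `lam`, `delta`) are illustrative choices.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed inverse document frequency (one common BM25 variant)."""
    n = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)


def bm25(query, doc, docs, k1=1.2, b=0.75):
    """BM25 score of a tokenized doc for a tokenized query."""
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    dl = len(doc)                                  # DL feature
    score = 0.0
    for term in query:
        f = doc.count(term)                        # TF feature
        norm = f + k1 * (1 - b + b * dl / avgdl)
        score += idf(term, docs) * f * (k1 + 1) / norm
    return score


def lmir_score(query, doc, docs, method="dir", mu=2000.0, lam=0.1, delta=0.7):
    """Query log-likelihood under a smoothed document language model.

    method: "dir" (Dirichlet), "jm" (Jelinek-Mercer), "abs" (absolute
    discounting). Assumes doc is non-empty.
    """
    coll = Counter(t for d in docs for t in d)     # collection term counts
    coll_len = sum(coll.values())
    tfs = Counter(doc)
    dl = len(doc)
    log_p = 0.0
    for term in query:
        p_c = coll[term] / coll_len                # collection model
        if p_c == 0:
            continue                               # skip terms unseen anywhere
        if method == "dir":
            p = (tfs[term] + mu * p_c) / (dl + mu)
        elif method == "jm":
            p = (1 - lam) * tfs[term] / dl + lam * p_c
        elif method == "abs":
            p = max(tfs[term] - delta, 0) / dl + delta * len(tfs) / dl * p_c
        else:
            raise ValueError(f"unknown smoothing method: {method}")
        log_p += math.log(p)
    return log_p
```

Each function returns one scalar per query-document pair, so a feature vector for a learning-to-rank model could be assembled by calling them on different page fields, for example the title and the body text.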
We compared the performance of the features in Chinese webpage ranking experimentally.

Finally, we fed the feature sets with and without Edit Length (EL) into the RankNet and RankSVM systems and compared the error rates. The experimental results show that, across the RankNet and RankSVM systems, adding EL reduced the error rate by between 3% and 10% compared with the feature set without EL, which confirms the contribution of the additional feature.
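The abstract does not define Edit Length or the error rate precisely. As a sketch only, the code below assumes that EL corresponds to the Levenshtein edit distance between two strings (for instance a query and a page title) and that the error rate is the fraction of mis-ordered document pairs. Both readings, and the function names, are assumptions rather than the thesis's definitions.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (assumed meaning of EL)."""
    prev = list(range(len(b) + 1))       # distances from empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete ca
                cur[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb) # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def pairwise_error_rate(scores, labels):
    """Fraction of pairs where the more relevant doc is not scored higher.

    scores: model outputs for the documents of one query.
    labels: graded relevance judgments for the same documents.
    Ties in score between differently labeled documents count as errors.
    """
    wrong = total = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:    # doc i should outrank doc j
                total += 1
                if scores[i] <= scores[j]:
                    wrong += 1
    return wrong / total if total else 0.0
```

Because Python strings are sequences of code points, `edit_distance` works on Chinese text character by character without segmentation. Comparing `pairwise_error_rate` for a ranker trained with the EL feature against one trained without it mirrors the comparison described above.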
Keywords/Search Tags: learning to rank, feature extraction, Edit Length, RankNet, RankSVM