HTML Tags and Chinese Segmentation-Based Web Index and Implementation

Posted on: 2004-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: J S Dong
Full Text: PDF
GTID: 2208360095960505
Subject: Communication and Information System
Abstract/Summary:
This thesis investigates the role of the HTML tags used to mark up web page content, drawing on an analysis of a large number of web page sources. We survey indexing and weighting strategies, Chinese word segmentation techniques, and English stemming algorithms. The system adopts full-text indexing. Starting from tf*idf, the term-weighting formula of traditional information retrieval, we exploit the characteristics of HTML tags to better represent web page content, and we design and implement web page analysis and weighting strategies based on those tags. We also refine the dictionary-based Chinese word segmentation algorithm. The system is built with object-oriented programming, database technology, JDBC, and Java multithreading, among other technologies. In trial runs it segmented Chinese words with a high degree of accuracy. With continued training, the segmentation dictionary can be further refined and the accuracy of Chinese word segmentation improved. By combining Chinese word segmentation with HTML tag analysis, the indexing strategy represents web page content better and lays a foundation for similarity calculation in the vector space model.
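As a rough illustration of how dictionary-based segmentation and tag-aware tf*idf weighting can fit together, the sketch below may help. It is not the thesis's implementation: forward maximum matching is assumed here as a common dictionary-based method (the abstract does not name the algorithm), and the tag weights, the function names, and the `(tag, text)` input format are all hypothetical. Python is used for brevity, although the thesis's system was written in Java.

```python
import math
from collections import Counter

# Hypothetical tag weights; the abstract does not give the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "body": 1.0}


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def weighted_tf(fields, dictionary):
    """fields is a list of (tag, text) pairs extracted from one page.
    Each term occurrence counts as the weight of its enclosing tag."""
    tf = Counter()
    for tag, text in fields:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term in segment(text, dictionary):
            tf[term] += weight
    return tf


def tf_idf(docs_tf):
    """Multiply each tag-weighted tf by idf = log(N / df)."""
    n = len(docs_tf)
    df = Counter(term for tf in docs_tf for term in tf)
    return [{t: f * math.log(n / df[t]) for t, f in tf.items()}
            for tf in docs_tf]


if __name__ == "__main__":
    words = {"搜索", "引擎", "搜索引擎", "中文", "分词"}
    page_a = weighted_tf([("title", "搜索引擎"), ("body", "中文分词")], words)
    page_b = weighted_tf([("body", "中文")], words)
    print(tf_idf([page_a, page_b]))
```

Under these assumptions, a term in a page's title contributes three times as much to its frequency as the same term in body text, and the resulting weight vectors could serve as input to a vector space model similarity measure.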
Keywords/Search Tags: Search Engine, Stemming, Indexing, Weighting, Chinese Word Segmentation, Vector Space Model