
Improved TF-IDF Algorithm Based On Keyword Search Of The Forum And Its Application

Posted on: 2016-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: F Sun
GTID: 2308330470463590
Subject: Computer technology
Abstract/Summary:
As it has developed, the Internet has become increasingly important in people's lives. Forums emerged as a product of the Internet age and have grown quickly. They cover almost every topic that touches people's lives, so everyone can find forums that interest them, and many sites set up forums of their own. These let a site communicate with its users and let users interact with one another, while also adding to the site's content. A forum is a broad category containing many boards, and with a large and varied user base the number of published posts is huge. The usual way to find a particular post quickly is to enter a keyword in the search box. The accuracy and speed of keyword search in a forum depend closely on text segmentation and on how keyword weights are calculated.

The original TF-IDF keyword weighting method is simple and has low time complexity, but the keywords it extracts are inaccurate, especially for forum posts. The extracted keywords often fail to capture what a post is about, which directly reduces search efficiency.

Forum posts are often short and written in everyday language, and replies can be so brief and uninformative that they are hard to tell apart. To improve the performance of the keyword search system under these conditions, this thesis describes how to improve the TF-IDF algorithm used for keyword extraction. The main methods are as follows:

(1) For post classification, we calculate the cosine similarity between selected posts from the Yao Lake Tribune. Based on existing theory, posts whose relevance score is higher than 0.18 show some correlation, so we set 0.18 as the threshold for post classification.
(2) To improve the tokenizer, we add a keyword dictionary and a stop-word dictionary to the system; both dictionaries are modifiable. After segmentation, comparing the resulting words against the two dictionaries shows how many are unnecessary and how many are representative.
(3) We take the structure of posts fully into account, considering both the title and the content of each post, and we introduce the concept of key factors to modify the TF-IDF formula.
(4) To make the keywords more representative, we set a threshold on the resulting words. Only words whose TF-IDF value is higher than the threshold are identified as keywords; the rest are ignored.

On this basis, we design and implement a keyword search system for the Yao Lake Tribune. The system is divided into three parts: a basic data management subsystem, a word segmentation and keyword extraction subsystem, and a user search subsystem. The experimental results show that the efficiency of searching forum posts improves with the new method.
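
The four steps above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the modified formula, so the title boost (`TITLE_FACTOR`), the keyword cutoff (`KEYWORD_THRESHOLD`), the tiny stop-word set, and the whitespace tokenizer are all assumptions made for illustration. A real forum system would use a proper word segmenter and the dictionaries described in step (2). Only the 0.18 similarity threshold comes from the abstract.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.18   # threshold stated in the abstract
STOP_WORDS = {"the", "a", "is", "to", "of", "in", "and"}  # hypothetical stop-word dictionary
TITLE_FACTOR = 2.0            # hypothetical "key factor" giving title words extra weight
KEYWORD_THRESHOLD = 0.1       # hypothetical cutoff on the final TF-IDF value


def tokenize(text):
    """Whitespace tokenizer that drops stop words (stand-in for a real segmenter)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def are_related(tokens_a, tokens_b):
    """Two posts count as related when their similarity exceeds the threshold."""
    return cosine_similarity(tokens_a, tokens_b) > SIMILARITY_THRESHOLD


def extract_keywords(posts, index):
    """Score the words of posts[index] with a title-boosted TF-IDF.

    posts is a list of (title, body) strings. Returns {word: score} for the
    words whose score is above KEYWORD_THRESHOLD.
    """
    docs = [(tokenize(title), tokenize(body)) for title, body in posts]
    title, body = docs[index]

    # Term counts: each body occurrence counts 1, each title occurrence adds TITLE_FACTOR.
    counts = Counter(body)
    for word in title:
        counts[word] += TITLE_FACTOR
    total = sum(counts.values())

    scores = {}
    for word, count in counts.items():
        # Document frequency is at least 1 because the word occurs in this post.
        df = sum(1 for t, b in docs if word in t or word in b)
        idf = math.log(len(docs) / df)
        scores[word] = (count / total) * idf

    # Keep only words whose score clears the keyword threshold.
    return {w: s for w, s in scores.items() if s > KEYWORD_THRESHOLD}
```

Calling `extract_keywords(posts, i)` on a list of `(title, body)` pairs returns the surviving keywords for post `i`. Words that appear in every post get an IDF of zero and are dropped by the threshold, which is the filtering behavior step (4) describes.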
Keywords/Search Tags: Forum, Keyword, TF-IDF, Post