Web-page Summarization By Tag Classification Content Sharing System

Posted on: 2008-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: H Yang
Full Text: PDF
GTID: 2178360215490917
Subject: Computer system architecture
Abstract/Summary:
In recent years, with the development of the Internet and the growth of network bandwidth, the Internet has brought more and more benefits to people's lives. For example, with the development of e-commerce, the Internet is becoming a new economic model. However, as the volume of content on the Internet keeps growing, people cannot use Internet data effectively. Although search engines help people find useful information, they still have to overcome some difficult problems to improve the user experience. How to process Internet data and use it more effectively has therefore become a hot topic in natural language processing research.

With the emergence of Web 2.0, people demand major improvements in natural language processing: Web 2.0 applications built on natural language processing are very successful today, and their developers want to sustain that success. A typical Web 2.0 example is the Tag Classification Content Sharing System. In this system, users manage and classify web pages with tags, and they can search for and view what they need. Developers can build better applications on top of the tag system.

At present, natural language processing is used to classify and summarize web pages, which makes the use of web resources more efficient. Classification organizes web resources so that users can find them easily; summarization concentrates on the main content of a web page so that users can grasp that content easily.

In this paper, we extract additional knowledge from the tags in the Tag Classification Content Sharing System to improve web-page summarization. Because tag information is strongly related to the main content of a web page, tags can be used in web-page summarization.
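The idea of letting tags guide sentence selection can be illustrated with a minimal Python sketch. It is not the model from the thesis: it assumes that a sentence is scored by the average term frequency of its words and that words matching one of the page's tags are multiplied by a hypothetical `tag_boost` factor. The function names, the naive sentence splitter, and the boost value are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphanumeric tokens; no stemming or stop-word removal.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def summarize(text, tags, n_sentences=1, tag_boost=3.0):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Term frequency over the whole page.
    tf = Counter(w for s in sentences for w in tokenize(s))
    # Words that appear in any user-supplied tag.
    tag_words = {w for t in tags for w in tokenize(t)}

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Hypothetical weighting: tag words count tag_boost times as much.
        total = sum(tf[w] * (tag_boost if w in tag_words else 1.0) for w in words)
        return total / len(words)  # normalise by sentence length

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

For the two-sentence page "The cat sat on the mat near the door. Tagging helps users organise bookmarks.", plain term frequency prefers the first sentence because "the" is frequent, while supplying the tags `["tagging", "bookmarks"]` shifts the choice to the second sentence. This is the kind of external knowledge that the page text alone does not provide.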
After implementing a basic model based on this idea, we use the concept of related tags to improve it.

We survey the current state of automated summarization and analyze the characteristics of web pages; we then introduce the concepts of Web 2.0, in particular tags. On this basis, we propose our summarization models.

First, we build a basic TF (Term Frequency) model to generate summaries. This model uses the TF/IDF method to measure the importance of each word and then applies Luhn's method to select the best sentences as the summary. We then improve this model by exploiting special characteristics of web pages, such as italic, bold, and underlined text, which carry significant information. We define four features for web pages and apply a Naïve Bayesian classifier to train a summarizer that selects the best sentences as the summary.

The TF model and the improved model both analyze only the web page itself and select its significant sentences as the summary. Drawing on the Web 2.0 concept of tags, we instead use external tag information to generate summaries. We first propose this tag-based model and then improve it by considering related tags and semantically overlapping sentences: related tags are used to estimate word weights, and cosine similarity is used to consolidate semantically overlapping sentences.

This paper evaluates the summarization models with the classic recall rate, precision rate, F1, and ROUGE. First, experiments on an existing tag classification content sharing system show that tags reflect the main content of a web page well. Then, experiments are performed on a tag classification sharing system and on the Open Directory Project, respectively. The experiments show that the model based on the tag classification content sharing system generates better summaries than the model that uses word frequency alone.
The improved model, however, performs about as well as the basic model based on word frequency and tags. This is due to the uncertainty of the tags. With a tag classification system that holds a large amount of tag data, the model would be able to generate better summaries.
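The evaluation measures named in the abstract can be sketched as follows. The abstract does not state the exact setup, so this sketch assumes that precision, recall, and F1 are computed over sets of selected sentence indices, and that ROUGE is approximated by single-reference ROUGE-1 recall on whitespace tokens. Both function names are hypothetical.

```python
from collections import Counter


def precision_recall_f1(selected, reference):
    """Compare selected sentence indices with a reference (gold) selection."""
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rouge1_recall(candidate, reference):
    """Unigram recall: share of reference words also found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clip each word's match count at its count in the candidate.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

For example, selecting sentences `[0, 2]` when the reference summary consists of sentences `[0, 1, 2, 3]` gives precision 1.0 and recall 0.5, so F1 is 2/3. The sentence-level measures reward exact sentence matches, whereas ROUGE-1 recall also credits a summary that covers the same words with different sentences.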
Keywords/Search Tags: Automated summarization, Web2.0, Tag, Term Frequency