Research On Some Key Technologies In Web Site Summarization

Posted on: 2018-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: S A Li
Full Text: PDF
GTID: 2348330512987253
Subject: Computer Science and Technology
Abstract/Summary:
With the expansion of the Internet, the volume of network data is growing rapidly, and the Internet has gradually become the main way for people to acquire knowledge. Search engines help users find information, but their precision is low. To better filter the results that search engines return, automatic summarization of network text has become necessary. Among the information providers on the Internet, websites are one of the main sources, but as sites grow more complex it becomes harder to find information on them. Website summaries can help solve this problem. Manually written website summaries produced by volunteers, such as those in DMOZ, already exist and are widely used in various fields. However, manually generated summaries take a great deal of manpower and time to maintain, and they are subjective. This paper presents methods for automatically generating summaries of academic websites and comprehensive websites.

So far there has been little research on website summarization; most work has focused on single web pages. Methods designed for web page summarization are not suitable for whole websites. Automatically generating a website summary raises several key problems:

1) Extracting the text content of a website. A website usually contains multiple pages, and the difference between web page summarization and website summarization lies in how to extract the content of multiple pages in the site. In addition, different pages have different text structures and contain many links, navigation bars, advertising banners and other non-content elements. How to extract text from such complicated pages is the first problem to be solved.

2) Existing methods for automatic multi-document summarization are based on statistical features, association graphs and so on. These methods are not suitable for websites, because they do not consider the characteristics of websites or the special setting in which the summary is generated.

3) Large-scale comprehensive websites have complex structure and varied content. How to obtain a description of such sites and generate a summary based on that description is a key issue.

This paper analyzes the advantages and disadvantages of existing methods for automatic single-document and multi-document summarization and explains why they are not suitable for website summarization. Starting from site content extraction, it works step by step toward generating website summaries automatically. The specific work and contributions are as follows.

First, this paper presents an algorithm for extracting the text of a website, since obtaining the content of the site is the prerequisite for summarizing it. The algorithm uses a breadth-first search strategy to collect the pages of the site, parses the source code of each page into a DOM tree, and then applies a statistical method to extract the information. This approach overcomes the shortcoming of traditional wrapper methods, which require rules to be designed in advance. Experimental analysis shows that the method can extract the comprehensive text of a website and that this text is suitable for generating the website summary automatically.

Second, on the basis of the synthesized text of the website, a method (H-LDA) is proposed that summarizes a website automatically using the hierarchical structure of the site and Latent Dirichlet Allocation. This method makes full use of the "site" feature of sentences: it combines the statistical features used for traditional documents with the site-structure features of sentences. The algorithm is suited to the websites of academic institutions, because the hierarchical structure of such sites is clearer. Experiments show that the summary generated by this method provides more information than can be obtained from the home page alone, and that using the site hierarchy works better than using LDA alone.

Finally, a method (SE-LDA) is proposed that summarizes comprehensive websites automatically. The algorithm uses a search engine to obtain descriptive information about this kind of website and uses the "search engine sort" feature, together with statistical features and semantic understanding, to generate the website summary automatically. Experiments demonstrate the feasibility of the method, which provides more information than the home page of the website. Comparison experiments further verify that SE-LDA is more suitable than H-LDA for comprehensive websites.
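The extraction step described in the abstract (breadth-first collection of pages, parsing of each page's markup, statistical filtering of non-content blocks) could be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's algorithm: the abstract does not say which statistic is used, so link density (share of a block's characters that sit inside anchors) is substituted here, and the names `BlockParser`, `extract_text`, `crawl_site`, the `fetch` callback and the thresholds are all hypothetical. Python's event-based `html.parser` stands in for a full DOM tree.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class BlockParser(HTMLParser):
    """Collects links and, per block-level element, text and anchor-text size."""

    BLOCK_TAGS = {"p", "div", "li", "td", "ul", "nav", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.blocks = [{"text": [], "anchor_chars": 0}]
        self._anchor_depth = 0
        self._skip_depth = 0

    def _new_block(self):
        self.blocks.append({"text": [], "anchor_chars": 0})

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in self.BLOCK_TAGS:
            self._new_block()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._new_block()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        block = self.blocks[-1]
        block["text"].append(text)
        if self._anchor_depth:
            block["anchor_chars"] += len(text)


def extract_text(html, max_link_density=0.5, min_chars=25):
    """Return (content blocks, raw hrefs); thresholds are illustrative guesses."""
    parser = BlockParser()
    parser.feed(html)
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"])
        if len(text) < min_chars:
            continue  # too short to be body text (menus, buttons, labels)
        if block["anchor_chars"] / len(text) <= max_link_density:
            kept.append(text)
    return kept, parser.links


def crawl_site(start_url, fetch, max_pages=50):
    """Breadth-first traversal restricted to the host of start_url.

    `fetch` is a caller-supplied function mapping a URL to HTML (or None).
    """
    site = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url], links = extract_text(html)
        for href in links:
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; an in-memory dictionary's `.get` method is enough to try it out, and a real crawler would also need politeness delays and robots.txt handling, which are omitted here.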
Keywords/Search Tags:Web Site, Web Site Summarization, Latent Dirichlet Allocation, Hierarchy of Web Site, breadth-first-search, search engine sorting, DOM tree