Research On Technology Of Software Component Obtaining From The Internet

Posted on: 2011-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: D K Xu
GTID: 2178360302999162
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, it has become a platform for sharing resources. Most resources on the network are presented as Web pages, and this includes component library resources. The purpose of this research is to find component resources on the network, analyze the Web pages that may contain them, and extract the component information to local disk. To achieve these objectives, the thesis processes Web data in five steps.

First, the components to be crawled from the Web are described in BNF. A component storage model and a baseline document are generated from this description; both serve as groundwork for the subsequent chapters.

Second, a Bayesian TF-IDF algorithm identifies whether a Web page is on the component topic, using four aspects of the page: webpage content, virtual text, title text and keywords text. Pages relevant to the component topic are stored.

Third, the URLs waiting to be processed are ranked with a crawling strategy that combines PageRank and shark search. With this comprehensive strategy, the crawler fetches high-priority URLs first and avoids topic drift during crawling.

Fourth, a page segmentation algorithm based on relevance and on the visual characteristics of the page analyzes Web pages that carry component information and identifies their topic blocks.

Fifth, four matrices are built from the adjacency constraints, feature constraints, location constraints and relevance between entities, and the entities are then clustered with an improved transitive closure method. Finally, based on the baseline document and the storage model, the clustered entities are matched to the attributes of the component storage model, and an XML document is generated to store the extracted component information.

Together, these steps form an exploratory implementation of the technology for obtaining components from the Internet.
The summary of each chapter also notes how each of the four technologies could be further improved; these notes indicate directions for continued research.
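
The entity clustering step can be illustrated with the standard max-min transitive closure method for fuzzy clustering. The sketch below is not the improved variant developed in the thesis, which the abstract does not specify. It assumes the four constraint matrices have already been merged into a single reflexive, symmetric similarity matrix; the function names, the sample matrix and the threshold are all hypothetical.

```python
def maxmin_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Repeatedly square the relation until it stops changing.

    Taking the element-wise max with the previous matrix keeps the
    sequence non-decreasing, so the loop always terminates.
    """
    while True:
        squared = maxmin_compose(r, r)
        nxt = [[max(x, y) for x, y in zip(row_r, row_s)]
               for row_r, row_s in zip(r, squared)]
        if nxt == r:
            return r
        r = nxt


def lambda_cut_clusters(closure, lam):
    """Group entities whose closure similarity is at least lam."""
    n = len(closure)
    assigned = [False] * n
    clusters = []
    for i in range(n):
        if assigned[i]:
            continue
        group = [j for j in range(n) if closure[i][j] >= lam]
        for j in group:
            assigned[j] = True
        clusters.append(group)
    return clusters


# Hypothetical similarity matrix for four extracted entities
# (assumed to be the merged result of the four constraint matrices).
similarity = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]

closure = transitive_closure(similarity)
print(lambda_cut_clusters(closure, 0.5))  # [[0, 1], [2, 3]]
```

Lowering the threshold merges clusters: with this matrix, a cut at 0.3 puts all four entities in one group, because the closure raises every cross-group similarity to 0.3.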
Keywords/Search Tags: Component Obtaining, Component Description Model, Topic Page-Recognition, Comprehensive Crawling Strategy, Fuzzy Clustering