
Identification And Extraction Of Multiple Topic Content Blocks In A Web Page

Posted on: 2006-10-08    Degree: Master    Type: Thesis
Country: China    Candidate: G M Luo    Full Text: PDF
GTID: 2168360155953039    Subject: Computer software and theory
Abstract/Summary:
This world is full of information. Especially on the Web, there are so many pages that traditional search engines can no longer satisfy our demands: even the most famous search engine, Google, cannot index 10% of all web pages. Facing this situation, people have turned to focused crawling. By analysing the content of web pages, a focused crawler can retrieve the information a user needs without a huge database. This thesis is about a key technology in the field of focused crawling: the identification and extraction of multiple topic content blocks in a web page.

Web page processing involves many technologies concerned with a page's structure; we use the DOM, which was proposed by the W3C. The DOM represents an HTML document as a tree of node objects. A node represents not only a document element but also the document's other content, for example attributes, comments, and character data. Every node has its own interface that maps to the node's content, and the node itself is an interface too; in object-oriented terms, every DOM object inherits from Node. Nodes are used to traverse the DOM tree, to add new nodes, and to modify the document's structure. The DOM behaves the same across operating systems and programming languages: it allows programs and scripts to access and modify a document's content, structure, and style dynamically, and it defines many objects and methods that can perform arbitrary operations on a document.

Besides the page's structure, we also use the page's visual features.
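As a minimal illustration of the DOM view described above, the following Python sketch uses the standard-library `xml.dom.minidom` as a stand-in for any W3C DOM implementation; the tiny document and the `element_tags` helper are invented for illustration. It parses a well-formed fragment and walks the node tree through the shared Node interface:

```python
from xml.dom.minidom import parseString

# A tiny, hypothetical two-block document (well-formed so minidom accepts it).
html = ("<html><body>"
        "<div id='a'><p>Apples are red.</p></div>"
        "<div id='b'><p>Bananas are yellow.</p></div>"
        "</body></html>")
doc = parseString(html)

def element_tags(node):
    """Collect element tag names by walking the tree via the Node interface."""
    tags = []
    if node.nodeType == node.ELEMENT_NODE:
        tags.append(node.tagName)
    for child in node.childNodes:
        tags.extend(element_tags(child))
    return tags

print(element_tags(doc.documentElement))
# ['html', 'body', 'div', 'p', 'div', 'p']
```

Every object visited here, whether an element or a text node, exposes the same `nodeType`/`childNodes` interface, which is what makes uniform tree traversal possible.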
For a web page author, in order to organize the page well and let readers read it easily, contents on the same topic are placed together, and their visual features are kept consistent. In other words, blocks on the same topic have the same color, the same font, the same character size, and so on.

In the field of focused crawling there is a very tough problem, the tunnel problem: some outlying pages lie between the page being crawled and our target page, yet the pages are connected to each other by links, just as a car must pass through a tunnel to reach its destination. This is a little puzzling: why do the outlying pages carry links that point to our target pages? In fact, the so-called outlying pages are relevant to our crawling topic; they are considered outlying because of a wrong classification, and the cause of this mistake is multi-topic documents. Pages become multi-topic because the writer's view differs from the reader's: a page that has a single topic in the writer's view may have multiple topics in the reader's view. For example, suppose an author writes a page about fruit. In the author's view the page has a single topic, since everything on it is about fruit. But if a search engine user wants information about apples, the page is a multi-topic page, because in the reader's view bananas have the same status as apples. This situation is reinforced when we represent a document as a single text vector: we will overlook the apples if the author writes only a little about them, and if the author links to another page about apples, we get a tunnel problem. Our work aims to help solve the tunnel problem. Returning to the example: if we analyse the content by block rather than by page, we obtain the link we really need.
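The visual-consistency heuristic above can be sketched as follows. The block records and their `(color, font, size)` style tuples are invented for illustration; consecutive blocks that share a style are treated as one topic group:

```python
from itertools import groupby

# Hypothetical leaf blocks; blocks on the same topic tend to share
# color, font family, and font size.
blocks = [
    {"text": "Apple prices rise",   "style": ("black", "serif", 12)},
    {"text": "Apple harvest news",  "style": ("black", "serif", 12)},
    {"text": "Banana imports fall", "style": ("blue", "sans", 10)},
]

def group_by_style(blocks):
    """Merge consecutive blocks whose visual features match exactly."""
    return [[b["text"] for b in run]
            for _, run in groupby(blocks, key=lambda b: b["style"])]

print(group_by_style(blocks))
# [['Apple prices rise', 'Apple harvest news'], ['Banana imports fall']]
```

A real system would compare styles approximately rather than by exact tuple equality, but the grouping principle is the same.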
Our concrete work first filters the page, which greatly helps the later stages by sparing us from handling too many different cases. In the content extraction work itself, we use the DOM tree and the page's visual features to divide the page, and we define a value that a block's size may not fall below; this threshold ensures that a text feature vector can still be extracted from each block.

Our work is influenced by the work of Microsoft Research Asia and of Liu Bing. We draw on their analysis of visual features for reference, but we do not agree with the view that a web page's main idea is the writer's idea, a view that overlooks multiple topics and creates the tunnel problem. We believe we should analyse the page neither from the writer's view alone nor from the reader's view alone; we should be the bridge that connects the writer and the reader. So we aim our target at the reasonable content block...
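A minimal sketch of the size-bounded segmentation idea, assuming a DOM tree and a purely size-based stopping rule; the threshold value and the sample document are invented, and the visual-feature criteria the thesis also uses are omitted here:

```python
from xml.dom.minidom import parseString

MIN_BLOCK_CHARS = 30  # assumed lower bound on block size

def text_of(node):
    """Concatenate all character data under a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

def segment(node, min_chars=MIN_BLOCK_CHARS):
    """Split a subtree into blocks, stopping where children become too
    small to yield a usable text feature vector."""
    children = [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]
    if children and all(len(text_of(c)) >= min_chars for c in children):
        return [b for c in children for b in segment(c, min_chars)]
    return [text_of(node).strip()]

html = ("<html><body>"
        "<div>Apples are a popular fruit grown worldwide.</div>"
        "<div>Bananas are rich in potassium and easy to peel.</div>"
        "</body></html>")
blocks = segment(parseString(html).documentElement)
print(blocks)
# ['Apples are a popular fruit grown worldwide.',
#  'Bananas are rich in potassium and easy to peel.']
```

Each resulting block is large enough to be represented by its own term vector, so a link found inside the apple block can be followed even if the page as a whole is classified as being about fruit in general.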
Keywords/Search Tags: Identification