A Study On The Auto-Indexing For Web Information

Posted on: 2015-01-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 1108330470484808
Subject: Information resource management
Abstract/Summary:
With the development of the Internet and the promotion of informatization projects, the Web has become a vast resource space that gives us a way to exchange and share information and has a profound influence on every area of life. Web information is massive, heterogeneous, disordered and constantly updated, so to find the resources they need, people have begun to recognize the importance of Web information management, including auto-indexing of Web information.

This research takes auto-indexing of Web information as its entry point and examines the web page coordinate system, the organization of web pages, and the reading habits of people browsing web pages, in order to explore the factors that influence the results of web page indexing. Based on a literature review and a summary of previous work, the author sets out the research hypothesis: according to the web page coordinate system and the type of website, suitable dissection ratios can be used to divide a web page into several regions; by determining which region each information block belongs to, the importance of the different page regions for auto-indexing can be established. A program was written to test this hypothesis. The work is described in detail below.

First, web page gathering was studied and implemented.
We implemented batch collection and manual collection, and solved the problems of web page character-encoding conversion and HTML-to-XML conversion that arise when collecting web pages.

Second, based on the web page coordinate system and the reading habits of people browsing web pages, we divide the web page into nine regions using suitable dissection ratios. The information blocks in the same region are treated as one information cluster and processed with the same auto-indexing weight.

Third, we search for suitable dissection ratios for different types of websites. Different websites publish information in different ways. On news websites, for example, most news is reported as text with few photos, and some websites let readers post comments on the news page, so page height varies considerably. We tested news, science and sports websites with different dissection ratios to find the most suitable ratio.

Fourth, we search for the auto-indexing weights of the regions. When people visit a web page, characteristics such as visual focus and reading habits affect how they read, so page design is key to how web page makers organize information.
If the importance of the different page regions can be determined, the accuracy of web page auto-indexing will improve, so sample tests were run on news, science and sports websites to obtain the auto-indexing weights of the different page regions.

Last, a program was written to test and verify the author's assumption. Auto-indexing experiments were carried out on the different types of websites, taking into account information noise, the regional characteristics of web pages and other factors, and the results were good. After completing this work, we discuss how page height and page width relate to recall and precision, and hope the results will be useful for later research.

In summary, we explore the key techniques in each stage of web page auto-indexing, such as suitable dissection ratios for different types of websites and the relationship between the web page coordinate system and auto-indexing, and complete the whole workflow of web page auto-indexing. This research should be useful and instructive for Web information management. The research still has many deficiencies that need to be addressed in future work.
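
The region-based weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's program: the function names, the block dictionary layout, the dissection ratios (0.25, 0.75) and the nine region weights are all hypothetical placeholders, since the abstract does not state the values found in the experiments. The sketch assumes each information block is assigned to a region by the position of its center point.

```python
from bisect import bisect_right
from collections import Counter


def region_index(x, y, page_width, page_height,
                 col_ratios=(0.25, 0.75), row_ratios=(0.25, 0.75)):
    """Return the region number 0..8 (row-major) containing point (x, y).

    col_ratios and row_ratios are the dissection ratios: the fractions of
    page width and height at which the page is cut. The defaults are
    placeholder values, not results from the dissertation.
    """
    col = bisect_right(col_ratios, x / page_width)   # 0, 1 or 2
    row = bisect_right(row_ratios, y / page_height)  # 0, 1 or 2
    return row * 3 + col


def weighted_term_scores(blocks, page_width, page_height, region_weights):
    """Score terms by summing the weight of the region each occurrence is in.

    Each block is a dict with keys x, y, width, height (pixels, origin at
    the top-left corner of the page) and text. All blocks whose center
    falls in the same region share that region's weight.
    """
    scores = Counter()
    for block in blocks:
        center_x = block["x"] + block["width"] / 2
        center_y = block["y"] + block["height"] / 2
        region = region_index(center_x, center_y, page_width, page_height)
        weight = region_weights[region]
        for term in block["text"].lower().split():
            scores[term] += weight
    return scores


# Hypothetical weights for the nine regions, listed row by row.
example_weights = [1.0, 2.0, 1.0,
                   1.0, 3.0, 1.0,
                   0.5, 0.5, 0.5]

example_blocks = [
    {"x": 400, "y": 0, "width": 200, "height": 200, "text": "Web indexing"},
    {"x": 0, "y": 1800, "width": 200, "height": 200, "text": "web footer"},
]

example_scores = weighted_term_scores(example_blocks, 1000, 2000, example_weights)
```

In this sketch a term in the top-center region counts four times as much as the same term in the bottom row. Under the approach the abstract describes, both the ratios and the weights would be chosen separately for each type of website.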
Keywords/Search Tags: Web information, Auto-indexing, Web page coordinate system, Dissection ratios of web page, Weight