
Study On Topic-Specific Web Information Collection And Analysis Technology

Posted on: 2007-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Tang
Full Text: PDF
GTID: 2178360185474494
Subject: Computer software and theory
Abstract/Summary:
Search engines have become people's main means of gathering information on the Web. A traditional general-purpose search engine uses a program called a crawler to collect information from the whole Web. This approach has several disadvantages: information collection is not topic-specific, a high proportion of pages is missed, and the needs of specific professional groups cannot be met. What is needed is a focused search engine that is well classified, contains deep and comprehensive data, and is updated in a timely manner.

We designed a focused search engine and studied the Web information collection and analysis technology of topic-driven crawlers. We classified search strategies according to the methodology used to assess the value of links, and analyzed and compared the characteristics, advantages and disadvantages of the various strategies. We also analyzed several common Web community structures and point out that existing topic-driven Web information collection techniques, which are based on partial information, have two problems: at the technical level, the conflict between "local optimum" and "topic drift"; and, in the results, the conflict between "Recall" rate and "Precision" rate. We therefore propose using a Genetic Algorithm, which is highly interoperable, adaptive, global, and based on probabilistic selection, to solve these issues.

The main work is as follows:
① Based on the differences in goals and methods between traditional general-purpose search engines and focused search engines, we designed a focused search engine and introduced the function of each of its parts.
② We studied the technologies for information collection, analysis and retrieval, mainly topic-specific Web information collection and analysis technologies. Through comparison and analysis, we identified the advantages and disadvantages of the existing technologies.
③ We studied the concepts, characteristics, methods and mathematical mechanisms of the genetic algorithm, and propose applying it to topic-driven Web information collection to improve the performance of the information collection system.
④ By analyzing the differences and similarities between the genetic algorithm and Web information collection technologies, we discussed the feasibility of using the genetic algorithm in a Web information collection system and some noteworthy issues in doing so. We designed...
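The idea of choosing which links to crawl by probability-based selection, rather than always following the single best-scored link, can be illustrated with a small sketch. This is not the thesis's implementation: the function names `relevance` and `roulette_select`, the keyword-overlap score, and the example URLs are all hypothetical, chosen only to show how fitness-proportional (roulette-wheel) selection from a genetic algorithm might be applied to a crawl frontier.

```python
import random


def relevance(anchor_text, topic_terms):
    """Hypothetical link score: fraction of anchor-text words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in topic_terms)
    return hits / len(words)


def roulette_select(scored_links, k, rng):
    """Pick up to k distinct links, each draw weighted by the link's score.

    Low-scoring links keep a small chance of being chosen, so the crawl
    is not forced to follow only the locally best links.
    """
    pool = dict(scored_links)
    chosen = []
    for _ in range(min(k, len(pool))):
        links = list(pool)
        # Floor the weights so that a pool of all-zero scores is still valid.
        weights = [max(pool[u], 1e-9) for u in links]
        pick = rng.choices(links, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


if __name__ == "__main__":
    topic = {"genetic", "algorithm", "crawler"}
    # Assumed example frontier: URL -> anchor text of the link pointing to it.
    anchors = {
        "http://example.com/a": "genetic algorithm tutorial",
        "http://example.com/b": "holiday photos",
        "http://example.com/c": "focused crawler design",
    }
    frontier = {url: relevance(text, topic) for url, text in anchors.items()}
    print(roulette_select(frontier, 2, random.Random(0)))
```

Sampling in proportion to score, instead of sorting and taking the top entries, is one way to trade a little precision for broader exploration; how the thesis actually balances the two is not stated in this abstract.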
Keywords/Search Tags: Focused search engine, information collection, crawling strategy, search strategy, Genetic Algorithm