| Rapidly growing Web resources have become an important source of information for enterprises. These resources are semi-structured, discrete, real-time, and heterogeneous. How to extract information on a particular topic from Web resources and deliver it promptly to companies as valuable intelligence has become a significant research area. The subject of this work is Web-based Topic Search oriented toward Enterprise Competitive Information. It focuses on the design and implementation of the topic Web crawler, which is the core module of Topic Search. The main work is as follows. Topic Web Crawler: after a comprehensive analysis of existing search algorithms, a genetic algorithm based on a non-greedy strategy is adopted to improve the global convergence of information collection. Web Document Analysis: a Web document is converted into a corresponding document tree, and relevant information is accessed quickly and effectively by traversing this tree; after content refinement and text extraction, a text eigenvector is built using an improved calculation of feature-item weights. Topic Degree Evaluation: building on the topic degree evaluation of the Web document text, the topic degree of each Web link is computed by combining its anchor text, URL string, and surrounding context. On this basis, the overall design of the CI system and the implementation of Topic Search are described in detail. |
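
The topic degree evaluation of Web links summarized above can be illustrated with a minimal sketch. It is not the implementation described in this work: it assumes plain term-frequency vectors and cosine similarity in place of the improved feature-item weighting, which the abstract does not specify, and the function names (`text_topic_degree`, `link_topic_degree`) and the weights are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def text_topic_degree(text, topic_vector):
    """Topic degree of a text: similarity of its term vector to the topic.

    Assumption: raw term frequency stands in for the feature-item weights.
    """
    return cosine(Counter(tokenize(text)), topic_vector)


def link_topic_degree(page_text, anchor, url, context, topic_vector,
                      weights=(0.4, 0.3, 0.1, 0.2)):
    """Combine the evidence for one link into a single score.

    The weights (page text, anchor text, URL string, context) are
    hypothetical placeholders, not values taken from the source.
    """
    w_page, w_anchor, w_url, w_context = weights
    return (w_page * text_topic_degree(page_text, topic_vector)
            + w_anchor * text_topic_degree(anchor, topic_vector)
            + w_url * text_topic_degree(url, topic_vector)
            + w_context * text_topic_degree(context, topic_vector))


# Illustrative topic description and two candidate links on the same page.
topic = Counter(tokenize("enterprise competitive information"))
page = "A survey of competitive information sources for the enterprise."

on_topic = link_topic_degree(
    page,
    anchor="competitive information report",
    url="http://example.com/competitive-information/report.html",
    context="Download the full enterprise report here",
    topic_vector=topic,
)
off_topic = link_topic_degree(
    page,
    anchor="holiday photos",
    url="http://example.com/gallery/photos.html",
    context="See pictures from our trip",
    topic_vector=topic,
)
```

In a crawler, a score of this kind would typically be used to order the queue of unvisited links, so that links whose anchor text, URL, and context match the topic are fetched first.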