Technology Of Acquiring Semantic Feature Of Web Information

Posted on: 2009-01-06  Degree: Master  Type: Thesis
Country: China  Candidate: H R Wei  Full Text: PDF
GTID: 2178360308478333  Subject: Computer Software and Theory disciplines
Abstract/Summary:
With the rapid development of the Internet and the spread of personal computers, more and more information is published online. Web resources are enormous in volume and broad in coverage, and nearly any information can be found there. Yet within this sea of information, only a small part is useful to any given user. It is therefore important to provide effective tools, such as search engines or information integration systems, that help people find the information they are interested in accurately and quickly. Web information is usually organized into Websites, each of which establishes its own classification categories for organizing and publishing pages. However, classification criteria are not unified across Websites, and there is no standard for naming category items, which leads to semantic differences. Because of these differences, multi-source Web information cannot readily be made compatible or merged, and so cannot be integrated effectively. A way of representing the semantic features of Web information is urgently needed to resolve the semantic differences between Website classifications.
To resolve the semantic differences among Website category items, this thesis presents semantic feature representations based on the vector space model and on repeating patterns, together with a semantic updating strategy for both representations.
In a single classification system, the atomic nodes are classification concepts. The standard semantic features of these concepts express the underlying semantics of Web category items, which addresses the problem of understanding and describing category information from different Websites in a uniform way.
The main work of the thesis centers on this goal of uniform understanding and standard description of category semantics, in order to standardize Web information semantics. It consists of three parts: techniques for acquiring Web page information, the representation of Web information semantic features, and the time validity and updating strategy of those features. First, text is downloaded from Web pages, HTML tags are analyzed, and useful information that can represent semantic features is extracted from the pages and their structure; the existing TF-IDF weighting algorithm is improved to make feature item weights more accurate. Second, two standard methods are given for representing the semantic features of Web information. The method based on the vector space model represents category concepts as feature vectors through segmentation of Web pages, data cleansing, feature weight calculation, feature extraction, and feature vector construction. The method based on repeating patterns represents semantic features by the repeating patterns of each category concept and their repetition counts: a correlative matrix algorithm finds the repeating patterns in Web pages, and a y approximate matching algorithm unifies similar repeating patterns across different categories. Third, a new updating strategy refreshes the semantics of category concepts in a timely way, according to changes in Web information and its time validity, so that the semantic features remain accurate.
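The vector space representation described above can be illustrated with a minimal sketch. It uses the textbook TF-IDF weighting, not the improved variant developed in the thesis, whose details the abstract does not give. The function names (`tokenize`, `tf_idf_vectors`, `cosine`) and the sample category texts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens.
    A stand-in for the thesis's segmentation and data-cleansing steps."""
    return [w for w in text.lower().split() if w.isalpha()]


def tf_idf_vectors(category_tokens):
    """Build one sparse feature vector per category concept.

    category_tokens maps a category name to the list of tokens collected
    from its pages. Weight = (term count / total tokens) * ln(N / df),
    where N is the number of categories and df the number of categories
    containing the term. This is standard TF-IDF, an assumption here.
    """
    n = len(category_tokens)
    df = Counter()
    for tokens in category_tokens.values():
        df.update(set(tokens))

    vectors = {}
    for category, tokens in category_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[category] = {
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical category items from different Websites.
categories = {
    "sports": tokenize("football match football score"),
    "finance": tokenize("stock market score"),
    "athletics": tokenize("football race"),
}
vectors = tf_idf_vectors(categories)
print(cosine(vectors["sports"], vectors["athletics"]))
print(cosine(vectors["sports"], vectors["finance"]))
```

Under these assumptions, category items from different Websites that share distinctive terms get a higher cosine similarity, which is one way differently named categories could be matched to the same classification concept.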
Keywords/Search Tags: Web information semantic feature, vector space model, repeating pattern, time validity, updating