Research On Rough Set For Application In Web Mining

Posted on: 2007-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G X Yi
GTID: 1118360242461967
Subject: Computer application technology
Abstract/Summary:
Web mining is generally defined as the discovery of potentially useful patterns or knowledge on the Internet by means of data mining techniques. With the help of Web mining, a search engine can find high-quality pages, and a Web server can be made more intelligent, by analyzing semantic structure and click information. Current Web mining technology, especially the core algorithms for Web document classification and clustering, is based on the Vector Space Model (VSM) built from statistical word frequencies. The key to these algorithms is the term selection strategy and the similarity measure. To improve results, many researchers have worked on these two techniques, for example by adopting different term weighting schemes and similarity formulas. The relationships between terms, however, have rarely been studied. Describing these relationships precisely and exploiting the associations they reveal offers a new way to improve conventional Web mining algorithms. Rough set theory is a powerful mathematical tool for dealing with uncertain relationships, and extended rough set models are better suited to practical applications.

The central ideas of rough set theory were examined from the standpoint of knowledge classification. Because rough set theory must be extended to cover more applications, research on Web applications based on extended rough set theory was carried out. With Web information search as the target, extended rough set theory as the tool, and knowledge discovery as the aim, a personalized, interest-based Web page ranking algorithm built on fuzzy rough sets was developed. Methods for query term expansion and for Web page classification and clustering based on tolerance rough sets were also developed systematically.

Classical rough sets have several limitations. Their ability to predict new objects is low because they fit the data too closely. They cannot handle fuzzy data directly. Their description of the boundary region is also coarse: for example, classical rough sets have no notion of partial inclusion or partial membership.
Accordingly, several extensions of the classical rough set concept were discussed, namely variable precision rough sets, fuzzy rough sets, and tolerance rough sets, and some properties relating these models were analyzed. It should be pointed out that these models can in essence be unified as a generalized rough set, differing only in the relation or membership function used. This view helps explain rough set theory intuitively and may inspire better data mining algorithms.

One important cause of low search precision was identified: inaccurate expression of the query. A new method of automatic query expansion based on tolerance rough sets was therefore proposed. In the algorithm, the uncertain connection between query terms and retrieved documents was described by term tolerance classes. The upper approximation of the query was taken as the expanded query, and the newly added terms were also assigned weights. Experiments on a standard data collection showed that the approach was effective for query expansion and achieved high search precision.

To overcome the "topic drift" problem in HITS and PageRank, a new personalized page ranking algorithm based on fuzzy rough sets was discussed. In this algorithm, historical query terms represented the user's interests, and the connection between user interests and documents was described by means of fuzzy rough sets. A combination of the upper and lower approximations was used to measure the similarity between an interest and a document. Experimental results showed that the method was feasible.

Several rough set based approaches to Web document classification were summarized. In most methods, classes are treated as mutually exclusive and the connections between classes are little used. A Web document classification method based on tolerance rough sets was presented.
In the algorithm, the overlap of concepts among categories was described using term tolerance rough sets, and the performance of Web classification was greatly improved.

Several clustering strategies were discussed, and the essence of clustering was characterized as "holding together". A new algorithm for clustering search results based on tolerance rough sets was given, and a cluster labeling algorithm was implemented. Comparative experiments demonstrated that the proposed method outperformed the general K-means clustering algorithm.
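
The query expansion step described in the abstract can be illustrated with a minimal sketch. It assumes the common tolerance rough set formulation in which two terms are tolerant when they co-occur in at least `theta` documents; the dissertation's exact relation, threshold, and term weighting are not given here, so the function names, the `theta` parameter, and the document representation below are all illustrative assumptions rather than the author's implementation.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class.

    Assumption: two terms are tolerant if they co-occur in at least
    `theta` documents. Every term is tolerant with itself.
    """
    cooccurrence = defaultdict(int)
    vocabulary = set()
    for doc in docs:
        terms = set(doc)
        vocabulary |= terms
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1

    classes = {term: {term} for term in vocabulary}
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(query, classes):
    """Terms whose tolerance class shares at least one term with the query."""
    q = set(query)
    return {term for term, cls in classes.items() if cls & q}


def lower_approximation(query, classes):
    """Terms whose tolerance class lies entirely inside the query."""
    q = set(query)
    return {term for term, cls in classes.items() if cls <= q}
```

Under these assumptions the expanded query is `upper_approximation(query, classes)`. Query terms that never appear in the collection have no tolerance class and are therefore not returned, so a caller would typically take the union with the original query. Weighting of the added terms, which the abstract mentions, is left out of the sketch.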
Keywords/Search Tags: Web mining, information search, classification, clustering, rough set, extended rough set, tolerance rough set