People search is one of the most important search activities. People search engines, social networks, and related applications have gradually become research hotspots, and person attribute extraction is an important basis for these studies. This paper focuses on extracting person attributes from Wikipedia and then constructs a similarity network using those attributes and other information in the Wikipedia text.

The infobox in a person's article summarizes the person's main attributes in tabular form, which makes it an important resource for person attribute extraction. However, fewer than forty percent of Wikipedia articles contain an infobox, and some infoboxes are missing attributes. Automatically generating infoboxes and filling in missing attributes is therefore one focus of our study.

Wikipedia uses different kinds of infobox templates, and different templates may contain different kinds of attributes, so the template type must be determined before attributes are extracted. We treat this as a typical text classification task in which the category labels are infobox template types. For feature selection, we propose a method based on hyperlink words, text categories, and entity words. Experiments show that the proposed method has an advantage in classification performance over using all words as features.

For attribute extraction, we use "person-attribute-value" triples extracted from existing infoboxes. For a given attribute, our system marks the person name and the attribute value in the corresponding sentences of Wikipedia free text, and in this way automatically acquires a labeled data set. Patterns for each attribute are then generated automatically by machine learning algorithms. More attributes can be acquired by pattern matching, and the extracted attributes can be used to generate infoboxes or fill in missing attributes. We conduct experiments on five commonly used attributes.
The results show that our method extracts person attributes effectively.

Afterwards, we mine a similarity network using the extracted person attributes and other information in the Wikipedia text. First, the information about a person is divided into different dimensions, and a Person Model is proposed. Then, a different similarity calculation method is used for each dimension. Finally, to compute the overall similarity between person models, we treat each person entity as a system, so that a systematic similarity measure can be applied. Moreover, we define four types of relations. For two given persons, the system outputs not only their similarity but also their relation and the values they have in common. Through experiments on real person data from Wikipedia, we analyze the distribution characteristics of the resulting social network and show that the method is feasible.
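
The per-dimension similarity computation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Person` fields, the choice of Jaccard overlap and a year-distance score as per-dimension measures, the `WEIGHTS` values, and the weighted sum that stands in for the systematic similarity measure, whose exact form is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Person:
    """Hypothetical person model; the dimensions are illustrative only."""
    name: str
    birth_year: Optional[int] = None
    occupations: Set[str] = field(default_factory=set)
    links: Set[str] = field(default_factory=set)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap for set-valued dimensions (0.0 when both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def year_similarity(a: Optional[int], b: Optional[int], span: float = 100.0) -> float:
    """Numeric closeness for years; decreases linearly to 0 over `span` years."""
    if a is None or b is None:
        return 0.0
    return max(0.0, 1.0 - abs(a - b) / span)


# Assumed dimension weights; a stand-in for the systematic similarity measure.
WEIGHTS = {"birth_year": 0.2, "occupations": 0.5, "links": 0.3}


def person_similarity(p: Person, q: Person) -> Tuple[float, Dict[str, Set[str]]]:
    """Return the weighted overall similarity and the values both persons share."""
    scores = {
        "birth_year": year_similarity(p.birth_year, q.birth_year),
        "occupations": jaccard(p.occupations, q.occupations),
        "links": jaccard(p.links, q.links),
    }
    total = sum(WEIGHTS[dim] * score for dim, score in scores.items())
    common = {
        "occupations": p.occupations & q.occupations,
        "links": p.links & q.links,
    }
    return total, common


if __name__ == "__main__":
    a = Person("Person A", 1900, {"physicist"}, {"X", "Y"})
    b = Person("Person B", 1910, {"physicist", "chemist"}, {"Y", "Z"})
    similarity, shared = person_similarity(a, b)
    print(round(similarity, 3), shared)
```

Returning the shared values alongside the score mirrors the idea that, for two given persons, both the similarity and their common values are reported. Edges of a similarity network could then be kept for pairs whose score exceeds a chosen threshold.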