
Abandoning manual annotation: AutoAlign automates knowledge graph alignment with large language models

2024-07-26




AIxiv is a column where Synced publishes academic and technical content. In the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

This work was jointly completed by scholars from Tsinghua University, the University of Melbourne, the Chinese University of Hong Kong, and the University of Chinese Academy of Sciences: Rui Zhang, Yixin Su, Bayu Distiawan Trisedya, Xiaoyan Zhao, Min Yang, Hong Cheng, and Jianzhong Qi. The team focuses on research in large models, knowledge graphs, recommendation and search, natural language processing, and big data.

As an important carrier of structured knowledge, knowledge graphs are widely used in fields such as information retrieval, e-commerce, and decision reasoning. However, knowledge graphs built by different institutions or methods differ in representation and coverage, so effectively integrating them into a more comprehensive and richer knowledge system has become a key problem for improving knowledge graph coverage and accuracy. This is the core challenge that the knowledge graph alignment task aims to solve.

Traditional knowledge graph alignment methods rely on manual annotation to align some entities and predicates as seed pairs. This is expensive, inefficient, and yields poor alignment quality. Scholars from Tsinghua University, the University of Melbourne, the Chinese University of Hong Kong, and the University of Chinese Academy of Sciences jointly proposed a fully automatic knowledge graph alignment method based on large language models: AutoAlign. AutoAlign completely eliminates the need for manually annotated seed entity or predicate pairs, instead performing alignment entirely through the algorithm's understanding of entity semantics and structure, significantly improving both efficiency and accuracy.



Paper: AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment enabled by Large Language Models. IEEE TKDE, 36 (6), 2024

Paper link: https://arxiv.org/abs/2307.11772

Code link: https://github.com/ruizhang-ai/AutoAlign

Model Introduction

AutoAlign consists of two main parts:

A Predicate Embedding Module, used to align predicates.

An entity embedding learning part, used to align entities, which consists of two modules: an Attribute Embedding Module and a Structure Embedding Module.

The overall process is shown in the figure below:



Predicate Embedding Module: The predicate embedding module aims to align predicates that express the same meaning in the two knowledge graphs, for example "is_in" and "located_in". To achieve this, the research team built a predicate proximity graph: the two knowledge graphs are merged into one graph, and each entity is replaced with its corresponding entity types. This approach rests on the following assumption: the same (or similar) predicates should connect similar entity types (for example, the tail entity types of both "is_in" and "located_in" are likely to be locations or cities). A large language model's semantic understanding of these types is then used to align them further, which improves the accuracy of triple learning. Finally, by encoding the predicate proximity graph with a graph embedding method (such as TransE), the same (or similar) predicates obtain similar embeddings, thereby achieving predicate alignment.

In terms of implementation, the research team first built the predicate proximity graph, which describes relationships between entity types. Entity types represent broad categories of entities and can automatically link different entities: even when the surface forms of two predicates differ (such as "lgd:is_in" and "dbp:located_in"), learning over the predicate proximity graph can effectively identify their similarity. The graph is built in the following steps:

Entity type extraction: The research team extracted entity types by reading the value of the rdfs:type predicate for each entity in the knowledge graph. Each entity usually has multiple types; for example, the entity "Germany" may carry types such as "thing", "place", "location", and "country". In the predicate proximity graph, the head and tail entity of each triple are replaced with their sets of entity types.

Type Alignment: Since entity types in different knowledge graphs may use different surface forms (for example, "person" and "people"), these types need to be aligned. To this end, the research team uses the latest large language models (such as ChatGPT and Claude) to align them automatically. For example, Claude2 can be used to identify similar type pairs across the two knowledge graphs, after which all similar types are merged into a unified representation. The team designed a set of automatic prompts that obtain the aligned terms for different knowledge graphs without manual intervention.
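The two steps above can be sketched as follows, assuming relation triples are given as (head, predicate, tail) tuples and an LLM-produced type-alignment map is already available (function and variable names are illustrative, not from the paper's code):

```python
def build_predicate_proximity_graph(triples, entity_types, type_map):
    """Replace the head and tail entity of each triple with its set of
    entity types, normalizing types through an alignment map (e.g. one
    produced by prompting an LLM such as ChatGPT or Claude)."""
    edges = []
    for head, predicate, tail in triples:
        head_types = frozenset(type_map.get(t, t) for t in entity_types[head])
        tail_types = frozenset(type_map.get(t, t) for t in entity_types[tail])
        edges.append((head_types, predicate, tail_types))
    return edges

# Toy input: one triple from each knowledge graph.
triples = [
    ("lgd:Berlin", "lgd:is_in", "lgd:Germany"),
    ("dbp:Munich", "dbp:located_in", "dbp:Germany"),
]
entity_types = {
    "lgd:Berlin": {"place", "city"},
    "dbp:Munich": {"place", "town"},
    "lgd:Germany": {"place", "country"},
    "dbp:Germany": {"place", "nation"},
}
type_map = {"town": "city", "nation": "country"}  # mock LLM type alignment
edges = build_predicate_proximity_graph(triples, entity_types, type_map)
# After type normalization, "is_in" and "located_in" connect identical
# type sets, so a graph encoder such as TransE can learn similar
# embeddings for them.
```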

To capture predicate similarity, the multiple entity types of each entity need to be aggregated. The research team proposed two aggregation functions: a weighted function and an attention-based function, and found in experiments that the attention-based function works better. Specifically, it computes an attention weight for each entity type and obtains the final pseudo-type embedding by weighted summation. The predicate embeddings are then trained by minimizing an objective function so that similar predicates obtain similar vector representations.
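A minimal sketch of the attention-based aggregation, assuming dot-product attention between a predicate embedding and an entity's type embeddings (the paper's exact parameterization may differ):

```python
import numpy as np

def pseudo_type_embedding(type_embs, predicate_emb):
    """Aggregate an entity's type embeddings into one pseudo-type
    vector: softmax attention weights, then a weighted sum."""
    scores = type_embs @ predicate_emb          # one score per type
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ type_embs                  # attention-weighted sum

rng = np.random.default_rng(42)
type_embs = rng.normal(size=(4, 16))   # 4 type embeddings, 16 dimensions
predicate_emb = rng.normal(size=16)
pseudo = pseudo_type_embedding(type_embs, predicate_emb)  # shape (16,)
```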

Attribute embedding module and structure embedding module: Both modules are used for entity alignment. Their idea mirrors that of predicate embedding: for the same (or similar) entities, the predicates and the other entities in the corresponding triples should also be similar. Therefore, once predicates are aligned (through the predicate embedding module) and attributes are aligned (through the attribute character embedding method), TransE can be used so that similar entities learn similar embeddings. Specifically:

Attribute Embedding Learning: The attribute embedding module relates a head entity to an attribute value by encoding the attribute value's character sequence. The research team proposed three combination functions to encode attribute values: a sum combination function, an LSTM-based combination function, and an N-gram-based combination function. These functions capture the similarity between attribute values, so that entity attributes in the two knowledge graphs can be aligned.
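The sum and N-gram combination functions can be sketched as follows, with a deterministic toy embedding standing in for the learned character/n-gram embeddings (all names are illustrative):

```python
import numpy as np

def toy_embedding(token, dim=8):
    """Deterministic stand-in for a learned character/n-gram embedding."""
    rng = np.random.default_rng(abs(hash(token)) % (2**32))
    return rng.normal(size=dim)

def sum_compose(value, dim=8):
    """Sum combination function: add up single-character embeddings."""
    return np.sum([toy_embedding(c, dim) for c in value], axis=0)

def ngram_compose(value, dim=8, max_n=3):
    """N-gram combination function: average the embeddings of all
    character n-grams with n = 1..max_n."""
    grams = [value[i:i + n] for n in range(1, max_n + 1)
             for i in range(len(value) - n + 1)]
    return np.mean([toy_embedding(g, dim) for g in grams], axis=0)

v_kg1 = ngram_compose("Barack Obama")
v_kg2 = ngram_compose("Barack Obama")  # same attribute value in the other KG
# Identical attribute values map to identical embeddings, and nearly
# identical strings share most n-grams, so their embeddings stay close.
```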

Structural Embedding Learning: The structural embedding module builds on TransE and learns entity embeddings by assigning different weights to different neighbors: aligned predicates and implicitly aligned predicates receive higher weights, while unaligned predicates are treated as noise. In this way, the structural embedding module learns more effectively from aligned triples.
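The weighting idea can be sketched on top of the plain TransE score, assuming a per-triple weight that is high for aligned predicates and low for unaligned ones (AutoAlign's full objective additionally uses margin-based ranking with negative samples):

```python
import numpy as np

def weighted_transe_score(h, r, t, weight=1.0):
    """TransE plausibility ||h + r - t||, scaled by a predicate weight
    so that aligned predicates contribute more to learning while
    unaligned (noisy) ones are down-weighted."""
    return weight * np.linalg.norm(h + r - t)

h = np.array([1.0, 0.0])   # head entity embedding
r = np.array([0.0, 1.0])   # aligned predicate embedding
t = np.array([1.0, 1.0])   # tail entity embedding

aligned = weighted_transe_score(h, r, t, weight=1.0)      # 0.0: h + r == t
noisy = weighted_transe_score(h, r, t + 1.0, weight=0.1)  # down-weighted
```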

Joint training: The predicate embedding, attribute embedding, and structure embedding modules are trained alternately, influencing one another so that the embeddings are jointly optimized across all three views. After training, the research team obtains embedding representations of entities, predicates, attributes, and types. Finally, entity similarity (e.g., cosine similarity) is computed across the two knowledge graphs, and entity pairs whose similarity exceeds a threshold are aligned.
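The final matching step can be sketched as a cosine-similarity search with a threshold (the threshold value and all names here are illustrative):

```python
import numpy as np

def align_entities(emb1, emb2, names1, names2, threshold=0.9):
    """Match each KG1 entity to its most cosine-similar KG2 entity,
    keeping only pairs whose similarity exceeds the threshold."""
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = a @ b.T                      # pairwise cosine similarities
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((names1[i], names2[j], float(row[j])))
    return pairs

# Toy learned embeddings for two entities per knowledge graph.
emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])
emb2 = np.array([[0.99, 0.05], [0.7, 0.7]])
pairs = align_entities(emb1, emb2,
                       ["lgd:Berlin", "lgd:Germany"],
                       ["dbp:Berlin", "dbp:Germany"])
# Only the high-similarity pair survives the threshold.
```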

Experimental Results

The research team conducted experiments on the latest benchmark dataset DWY-NB (Zhang et al., 2022). The main results are shown in the table below.



AutoAlign significantly improves knowledge graph alignment performance, especially when no manually annotated seeds are available. Existing models can barely perform effective alignment without manual annotations, whereas AutoAlign still achieves excellent performance under these conditions. On both datasets, AutoAlign without manually annotated seeds significantly outperforms the best existing baselines, even when those baselines use manual annotations. These results show that AutoAlign not only surpasses existing methods in alignment accuracy but also holds a strong advantage in fully automated alignment tasks.

References:

Rui Zhang, Bayu D. Trisedya, Miao Li, Yong Jiang, and Jianzhong Qi (2022). A Benchmark and Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning. VLDB Journal, 31 (5), 1143–1168.