Event Details
| Location | Room 4 (Virtual) |
| Date | May 2, 2022 |
| Time | 1:00 PM to 4:30 PM |
Today’s KGs are typically deployed via SPARQL endpoints, distributed as large files, or exposed through web APIs. Endpoints and APIs are effective for query-based use cases that aggregate data or operate on a limited number of nodes or edges at a time. They do not, however, support use cases that need access to large portions of a KG in order to obtain, manipulate, and custom-tailor its data; such use cases require a host of different approaches, tools, and formats. For instance, entity linking with Wikidata or DBpedia requires building indices over all the text literals in the graph, computing statistics about the ambiguity and variance of entity labels, using global graph statistics such as PageRank to resolve ambiguous candidates, and computing various graph embeddings. Composing such a pipeline today requires knowledge of multiple tools that are not designed to work together, making these pipelines costly to implement for large KGs.
User-friendly creation and exploitation of knowledge graphs requires the consolidation of existing tools into a toolkit that is accessible to entry-level semantic web and AI researchers and developers. We believe that, just as an NLP researcher reaches for spaCy and an ML researcher for scikit-learn, AI researchers and developers should be able to exploit the wealth of knowledge available in KGs with minimal effort. For this reason, we have developed KGTK: an integrated knowledge graph toolkit that provides a wide range of knowledge graph manipulation and analysis functionality and supports common use cases such as large-scale network analysis, data enrichment, and quality estimation. Other available tools, such as kglab and graphy, address use cases similar to KGTK’s, but cover only a small subset of KGTK’s functionality and provide no support for working with hypergraphs (edges with qualifiers) such as Wikidata. A tutorial on KGTK at The Knowledge Graph Conference would be a notable contribution to the goal of democratizing knowledge graphs for a broader set of KG and AI researchers, allowing them to become familiar with KGTK and allowing us to discover directions for further improvement of the toolkit.
KGTK is a comprehensive open-source framework for the creation and exploitation of large KGs, designed for simplicity, scalability, and interoperability. KGTK represents KGs in tab-separated (TSV) files with four columns: edge identifier, head, edge label, and tail. All KGTK commands consume and produce KGs in this simple format, so they can be composed into pipelines that perform complex transformations on KGs. The simplicity of the data model also allows KGTK operations to be easily integrated with existing tools, such as Pandas or graph-tool. KGTK provides a suite of commands to import Wikidata, RDF (e.g., DBpedia), and popular graph representations into the KGTK format. A rich collection of transformation commands makes it easy to clean, union, filter, and sort KGs, while the KGTK graph combination commands support efficient intersection, subtraction, and joining of large KGs. Advanced functionality includes Kypher, a query language optimized for querying KGs stored on disk with minimal indexing overhead; graph analytics commands for scalable computation of centrality metrics such as PageRank, degrees, connected components, and shortest paths; lexicalization of graph nodes; and computation of multiple variants of text and graph embeddings over the whole graph. In addition, a suite of export commands transforms KGTK KGs into commonly used formats, including the Wikidata JSON format, RDF triples, JSON documents for ElasticSearch indexing, and graph-tool. Finally, KGTK allows browsing locally stored KGs through a fast, user-friendly browser, a text search interface, and a similarity interface. KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop.
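To make the data model concrete, the sketch below builds a tiny KG in the KGTK file format (whose four columns are named id, node1, label, and node2 in the file header) and runs two KGTK commands over it: a filter that keeps only P31 (instance of) edges, and a Kypher query. The entities, the qualifier edge, and the file names are illustrative; the command-line flags follow the KGTK documentation, though exact options may vary across versions.

```python
# Minimal sketch of the KGTK data model and command pipeline.
# Assumes the kgtk CLI is installed (e.g., `pip install kgtk`).
import subprocess
from pathlib import Path

# KGTK's data model: one edge per row of a TSV file with four columns:
# id (edge identifier), node1 (head), label (edge label), node2 (tail).
edges = [
    ("e1", "Q42", "P31", "Q5"),                        # Douglas Adams -> instance of -> human
    ("e2", "Q42", "label", "'Douglas Adams'@en"),      # language-qualified string literal
    ("e3", "Q5",  "label", "'human'@en"),
    ("e4", "e1",  "P585", "^2001-01-01T00:00:00Z/9"),  # illustrative qualifier: its head is edge e1
]
graph = Path("example.tsv")
with graph.open("w") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    for row in edges:
        f.write("\t".join(row) + "\n")

# Keep only "instance of" (P31) edges; -p takes a "node1 ; label ; node2" pattern.
subprocess.run(
    ["kgtk", "filter", "-i", str(graph), "-p", "; P31 ;", "-o", "p31_edges.tsv"],
    check=True,
)

# Kypher query: find every entity that is an instance of (P31) some class.
subprocess.run(
    ["kgtk", "query", "-i", str(graph),
     "--match", "(entity)-[:P31]->(cls)",
     "--return", "entity, cls"],
    check=True,
)
```

Because every command reads and writes the same four-column TSV format, the output of the filter could equally be streamed into any other KGTK command, which is what makes long transformation pipelines practical.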
We have used KGTK in multiple settings, including the construction of subgraphs of Wikidata, an analysis of over 300 Wikidata dumps spanning the project’s history, linking tables to Wikidata, the construction of a consolidated commonsense KG that combines multiple existing sources, and the creation of extensions of Wikidata for food security and for the pharmaceutical industry.
Our tutorial will be organized as follows. In the first half, we will introduce KGTK’s data format and its wide range of import, curation, transformation, analysis, and export commands, which can be flexibly chained into streaming pipelines on the command line. In the second half, we will demonstrate the utility of KGTK in several common and diverse KG use cases. The tutorial will introduce AI researchers and practitioners to effective tools for a wide range of KG creation and exploitation use cases, and will inform us about how to bring KGTK closer to its users. KGTK is publicly available under the MIT license.