The Case for KGs: Private vs. Public Graphs
By Paco Nathan
It’s an interesting time to be working with knowledge graphs (KGs), a time of stark contrasts in perception. On the one hand, I talk with many people who are otherwise quite expert in machine learning use cases and who say, “Yeah, well, no one really uses KGs in production.” On the other hand, there are so many instances of large, private KGs used in industry.
We could tour through KG use cases at almost all of the well-known tech companies, especially within ecommerce and social networks: Google – yes, yes, we’ve all heard about that one. Bing – ok, that makes sense too. Both of those companies use KGs for search, we get it. Amazon and eBay – um, sure, more product-oriented search. Microsoft/LinkedIn – obviously. Facebook, Twitter, Pinterest. IBM. Hey Siri. See also Uber, Lyft, Airbnb, Netflix, etc. For comparisons among some of these larger practices, refer to “Industry-scale Knowledge Graphs: Lessons and Challenges” by Natasha Noy et al., in ACM Queue 17:2 (2019). Typical use cases include discovery, recommendations, data governance, compliance, conversational agents, and so on.
There’s a whole other realm of private KGs in scholarly infrastructure used for research, such as Scopus and Dimensions. Arguably, reference tools such as Wolfram Alpha fit here too, even when they are free to use.
Throughout private industry, there is also quite a range of KG use cases among the different business sectors … for example, Refinitiv in FinTech and AstraZeneca in Pharma.
While almost all of the examples mentioned above are built and maintained by private firms for commercial uses, there are other large KGs based on public information, intended for public use. Often these result from community projects or government agencies.
Common Crawl maintains an open repository of the world’s web pages and Wikidata serves as the central storage for structured data in Wikipedia and its sister projects. While not quite Google or Bing search, they provide analogous data+metadata at scale.
In terms of research tools and scholarly infrastructure, Semantic Scholar, ResearchGate, PubMed, OpenAIRE, Crossref, Unpaywall, and similar services are all free to use, with open APIs – and quite good. Plus, ORCID has some analogies to LinkedIn among researchers and has been growing. There’s an open-source Python library, richcontext.scholapi, that federates searches and API integration across most of these services.
Last month here we looked at knowledge graphs in government, especially for geospatial data. While the following do not provide one-to-one comparisons with sophisticated financial services such as Refinitiv, there are open/community projects such as RePEc (economic research) and GDELT (worldwide news in 100+ languages).
The point of drawing those contrasts is that much of the knowledge work mentioned above is driven by commercial needs. While there are public/government/community analogs in many cases, the private KGs tend to be larger and better-funded. This is the point where we should discuss the Underlay at MIT, a project within the Knowledge Futures Group led by Danny Hillis, Joel Gustafson, Samuel Klein, et al.
The Underlay is a global, distributed graph of public knowledge. Initial hosts will include universities and individuals, such that no single group controls the content. This is an attempt to replicate the richness of private knowledge graphs in a public, decentralized manner.
Underlay.org
The naming pun is that an “underlay” is the opposite of an “overlay”. Think of it as a counterbalance for the large private KGs driven by commercial entities. Underlay’s premise is that a KG can be constructed from distributed transactions which the project calls assertions: immutable statements that specify provenance, cryptographically signed for trust- or context-based filtering. These assertions are combined through a process called reduction: modifying the graph to produce a consistent state. The results are curated into groupings called collections: containers for useful scopes of sub-graphs. This “ARC” protocol is not quite a blockchain, although it is not far from blockchain’s foundational concept of a distributed ledger.
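To make the assertion, reduction, and collection ideas a bit more concrete, here is a minimal Python sketch. It is not the Underlay's actual protocol or data model: the class and function names, the example triples, and the use of a shared-key HMAC in place of real public-key signatures are all assumptions made for illustration, and the "reduction" shown is only a simple filter-and-merge.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)  # frozen: an assertion is immutable once created
class Assertion:
    subject: str
    predicate: str
    obj: str
    source: str      # provenance: who made the statement
    signature: str   # hypothetical stand-in for a cryptographic signature


def _payload(subject: str, predicate: str, obj: str, source: str) -> bytes:
    # Canonical byte form of the statement, used for signing and verifying.
    return json.dumps([subject, predicate, obj, source]).encode("utf-8")


def sign(subject: str, predicate: str, obj: str, source: str, key: bytes) -> Assertion:
    # Assumption: HMAC with a per-source key instead of public-key signatures.
    sig = hmac.new(key, _payload(subject, predicate, obj, source), hashlib.sha256).hexdigest()
    return Assertion(subject, predicate, obj, source, sig)


def verify(a: Assertion, trusted_keys: Dict[str, bytes]) -> bool:
    key = trusted_keys.get(a.source)
    if key is None:
        return False  # unknown source: filtered out on trust grounds
    expected = hmac.new(key, _payload(a.subject, a.predicate, a.obj, a.source),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, a.signature)


def reduce_assertions(assertions: Iterable[Assertion],
                      trusted_keys: Dict[str, bytes]) -> Dict[Triple, Set[str]]:
    # Simplified "reduction": keep verified assertions from trusted sources,
    # merge duplicate triples, and remember which sources asserted each one.
    graph: Dict[Triple, Set[str]] = {}
    for a in assertions:
        if verify(a, trusted_keys):
            graph.setdefault((a.subject, a.predicate, a.obj), set()).add(a.source)
    return graph


def collection(graph: Dict[Triple, Set[str]], predicate: str) -> Dict[Triple, Set[str]]:
    # Simplified "collection": a sub-graph scoped to a single predicate.
    return {t: srcs for t, srcs in graph.items() if t[1] == predicate}
```

In this toy version, two hosts that sign the same triple produce a single graph entry with both sources attached, while an assertion from an unknown host or with a bad signature is dropped during reduction. A real system would need proper signatures, conflict resolution, and versioning, none of which is attempted here.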
Imagine if many universities or other community projects around the world collaborated to host a distributed Underlay of the world’s knowledge. Then we wouldn’t necessarily rely on the good graces of advertising companies, e.g., Google, Facebook, etc., for representing unbiased knowledge resources to the public. Also, the governance for Underlay has been engineered to be much more sophisticated than earlier crowd-sourced resources such as Wikipedia – by using RFCs, much like how the Interwebs were originally built.
Overall, check out the full project repos at https://github.com/underlay/