Of nodes and edges
Mayank Kejriwal, HLF15 participant. The best advice that I’ve ever been given as a Ph.D. student was by my advisor (of course!), who told me in my first year, “Science is a social process.” At the time, like many students probably, I did not fully understand the implication of those words. After all, which profession does not involve a social process?
As I approach the final year of my PhD, I’ve come to appreciate that wisdom more deeply. In computer science (CS), we primarily publish our work in conferences rather than in journals. For CS researchers, the conference-based publication model has proven to be enduring, mainly because of the fast-paced nature of CS research, but also because we place a special emphasis on sharing feedback, code and datasets.
At conferences, I’ve seen actual cases where some of my peers have gone from being total strangers to planning collaborations that eventually led to co-authored papers. In my own case, I recently landed a cross-continental opportunity simply by discussing my research in an accessible manner.
At this point, I’d like to talk about my research briefly. I work on a large-scale version of a half-century-old Artificial Intelligence problem called Entity Resolution (ER). In layman’s terms, the abstract version is to get a computer to automatically determine when two entity descriptions refer to the same underlying entity. As with so many concepts in life, ER is best demonstrated through an example:
The syntactic details in this diagram are not important here. At a high level, the nodes in this directed labeled graph refer to entities, which are (usually real-world) objects of interest over one or several domains. It is common to refer to the domain as a class, with the entity itself an instance of the class. For example, Microsoft is an instance of the class Company (not shown). Edges in this data model typically denote relationships between entities. In this framework, the goal of an ER system is to automatically detect the symmetric :sameAs relationships (shown in red in the diagram) between equivalent entity pairs.
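To make the goal concrete, here is a minimal sketch of the idea in Python. It is not the method from my thesis: the two toy graphs, the entity identifiers, the use of label string similarity and the 0.6 threshold are all assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

# Two toy graphs stored as (subject, predicate, object) triples.
# All identifiers and labels below are invented for illustration.
graph_a = [
    ("a:Microsoft", "rdfs:label", "Microsoft"),
    ("a:Microsoft", "a:foundedBy", "a:Bill_Gates"),
    ("a:Bill_Gates", "rdfs:label", "Bill Gates"),
]
graph_b = [
    ("b:m1", "rdfs:label", "Microsoft Corp."),
    ("b:p7", "rdfs:label", "William Gates"),
    ("b:m1", "b:founder", "b:p7"),
]


def labels(graph):
    """Map each entity (node) to its human-readable label."""
    return {s: o for s, p, o in graph if p == "rdfs:label"}


def similarity(x, y):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def resolve(ga, gb, threshold=0.6):
    """Propose :sameAs edges between entities whose labels look alike.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    links = []
    for (ea, name_a), (eb, name_b) in product(labels(ga).items(), labels(gb).items()):
        if similarity(name_a, name_b) >= threshold:
            links.append((ea, ":sameAs", eb))
    return links


for link in resolve(graph_a, graph_b):
    print(link)
```

This naive version compares every entity in one graph with every entity in the other, so its cost grows with the product of the two graph sizes. That quadratic blow-up is one reason the large-scale version of the problem is hard.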
The problem, which on the face of it seems simple enough for a human being (at least for common-sense domains of discourse), has befuddled AI researchers for decades. Versions of it have shown up not just in graph databases, but also in tabular (relational) databases and even in Natural Language Processing applications. In recent times, its importance has only increased. The reason lies in a cliché, namely Big Data. To convince you of that, let me describe three brief applications of ER research:
- Data Integration: Data integration, namely the process of querying multiple source databases under a single target interface, has long been the traditional use-case of ER. Data integration has numerous applications of its own, including linking gene ontologies and biological datasets, customer databases, and company databases in the event of mergers and acquisitions. ER is necessary because source databases often contain equivalent entities. Not resolving these equivalent entities is known to create both theoretical and practical problems.
- Knowledge graphs, Linked Data and all things Semantic Web: Tim Berners-Lee, inventor of the World Wide Web, described the linked data movement (www.linkeddata.org) as an effort to make the Web machine-readable. A huge part of this effort is to represent data on the Web as a knowledge graph, similar to the example I showed above (from two real-world knowledge graphs). One of the principles of Linked Data is to ensure that datasets do not exist in silos, but are linked to one another. Given that the current Web of Linked Data already contains over 30 billion edges and continues to grow, the need of the day is to build large-scale ER systems that link newly published data to the existing ecosystem. Industry is taking note of these efforts: Google acquired the company behind Freebase, one of the world’s largest knowledge bases, to build a proprietary knowledge graph that adds semantics to its search results.
- Social Media: Social media is typically concerned less with ER than with link prediction, which is a more general problem. Both have become mainstream AI problems and often involve similar algorithms. ER nonetheless continues to be important in social media, especially when it comes to linking profiles from different social media websites. An interesting application, for which commercial products are already being patented, is in human resources and recruiting. For example, at least one company offers products that attempt to link user profiles from myriad information sources (e.g. GitHub, LinkedIn) to present an interested recruiter with a unified profile of a candidate. More importantly, similar efforts are also being explored by law enforcement and anti-terrorism organizations. On the flip side, this application of ER has invited concerns about privacy and security.
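The applications above share a final step: once pairwise matches have been found, equivalent records have to be grouped and merged into one unified view. Below is a minimal sketch of that step, assuming the matches are treated as transitive as well as symmetric. The profile identifiers, the attribute names and the helper functions `cluster` and `unify` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical pairwise matches produced by some ER system.
links = [
    ("github:jdoe", "linkedin:jane-doe"),
    ("linkedin:jane-doe", "blog:jd"),
    ("github:asmith", "linkedin:a-smith"),
]

# Hypothetical source records, keyed by profile identifier.
records = {
    "github:jdoe": {"name": "J. Doe", "language": "Python"},
    "linkedin:jane-doe": {"name": "Jane Doe", "employer": "Acme"},
    "blog:jd": {"name": "Jane Doe"},
}


def cluster(pairs):
    """Group identifiers into equivalence classes using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


def unify(cluster_ids, source_records):
    """Merge the attributes of all records in one cluster.

    Conflicting values are kept side by side rather than resolved.
    """
    merged = defaultdict(set)
    for rid in cluster_ids:
        for key, value in source_records.get(rid, {}).items():
            merged[key].add(value)
    return dict(merged)


for group in cluster(links):
    print(sorted(group), unify(group, records))
```

Keeping conflicting values side by side, instead of picking a winner, is a deliberate simplification here. Deciding which value to trust is a separate problem that this sketch does not attempt.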
Stepping back from my research, my advisor’s words are why I’m so excited about the Heidelberg Laureate Forum (HLF), for which I left yesterday. I remember first hearing about the HLF through an email from my graduate department. Curious about the event, I ended up browsing through some of the archives of the previous HLF in 2014. I was stunned by both its depth (evidenced by the many luminaries who had attended) and its overall mission of fostering scientific interaction in a unique, high-energy setting.
On the one hand, the HLF brings together a group of scientists and mathematicians who have reached the pinnacle of their fields, and whose achievements are an inspiration to all of us. Equally exciting is the opportunity to interact with peers from all over the world, many of whom are still working on their dissertations. I’m also very impressed by the diversity of the experience that HLF has planned for us this year, ranging from workshops, Hot Topic events, a poster session and keynotes to outings to major industrial and scientific centers in the Heidelberg region and, most importantly, ample opportunities for everyone to interact. The potential that the HLF has to spark new ideas and connections is, without loss of generality, dizzying.
As the legendary Sir Isaac Newton said, “If I have seen further than others, it is by standing upon the shoulders of giants.” Those of us who are fortunate enough to be attending the HLF next week have a golden opportunity to learn from not one, but several giants. I, for one, can hardly wait.
Mayank Kejriwal is currently pursuing his Master’s and PhD in Computer Science at the University of Texas at Austin under the supervision of Daniel P. Miranker. His thesis is titled “Populating a Linked Data Entity Name System”, and concerns a 56-year-old Artificial Intelligence problem called Entity Resolution (ER) that has recently also emerged as a Big Data problem. His research is funded by a grant from the US National Science Foundation, with cloud-infrastructure support from Microsoft Azure, and has been published in the International Conference on Data Mining (ICDM), the International Semantic Web Conference (ISWC) and the Journal of Web Semantics. The author would like to thank the US National Science Foundation for funding this trip and the author’s own research.