GARUDA: A System for Large-Scale Mining of Statistically Significant Connected Subgraphs

Special Interest Group in Data (SIGDATA)
Indian Institute of Technology Kanpur


Abstract
The steady growth of graph data in various applications has resulted in wide-spread research in finding significant sub-structures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. The labels may be either discrete where they assume values from a pre-defined set, or continuous where they assume values from a real domain and can be multi-dimensional. We motivate the problem citing applications in spatial co-location rule mining and outlier detection. We use the chi-square statistic as a measure for quantifying the statistical significance. Since the number of connected subgraphs in a general graph is exponential, the naive algorithm is impractical. We introduce the notion of contracting edges that merge vertices together to form a super-graph. We show that if the graph is dense enough to start with, the number of super-vertices is quite low, and therefore, running the naive algorithm on the super-graph is feasible. If the graph is not dense, we provide an algorithm to reduce the number of super-vertices further, thereby providing a trade-off between accuracy and time. Empirically, the chi-square value obtained by this reduction is always within 96% of the optimal value, while the time spent is only a fraction of that for the optimal. In addition, we also show that our algorithm is scalable and it significantly enhances the ability to analyze real datasets.

Related Publications
GARUDA: A System for Large-scale Mining of Statistically Significant Connected Subgraphs
Satyajit Bhadange, Akhil Arora* and, Arnab Bhattacharya
Proc. of the 2016 VLDB International Conference on Very Large Data Bases (Demonstration Track).

Mining Statistically Significant Connected Subgraphs in Vertex Labeled Graphs
Akhil Arora, Mayank Sachan and, Arnab Bhattacharya
Proc. of the 2014 ACM SIGMOD International Conference on Management of Data.

Video
GARUDA: A System for Large-Scale Mining of Statistically Significant Connected Subgraphs [6 mins 30 secs]

Demo
GARUDA: A System for Large-Scale Mining of Statistically Significant Connected Subgraphs
*Disclaimer: All the features have been tested on Google Chrome and Apple Safari. On other browsers, some of the features might not work as expected!

Primary Contributors