Analysis of clustering algorithms for large dynamic graphs
Large graphs can be encountered in many application domains. One specific example that drives this project is malware detection, where similarity graphs of malware samples can be used to represent the knowledge of antivirus providers regarding samples collected from the wild. These malware similarity graphs are typically very large, because there are tens of millions of samples captured in the wild, and in addition, they also continuously grow in size, as new samples are captured on an everyday basis. These large dynamic graphs need to be clustered for various purposes, for instance, to identify malware families or to increase search and storage efficiency of the samples.
The task of the student is to study the problem of clustering of large dynamic graphs. This involves a thorough review of the literature, identifying existing proposals for clustering algorithms in this context, understanding their operation, and comparing their merits and potential weaknesses. This also involves the identification of various metrics for measuring the quality of a clustering. Finally, the student should also try to propose enhancements to existing algorithms or new algorithms for clustering aiming at maintaining a good quality clustering of a dynamically changing large graph while being sufficiently efficient for practical use. New proposals should be validated by real measurements on large graphs with node number in the order of millions.