Improving storage efficiency in large malware repositories
Large malware repositories aim to store as many malware variants as possible. When processing incoming malware feeds, they typically do not filter out samples that resemble ones already stored; instead, each similar sample is stored as a separate instance. As a consequence, identical byte sequences that occur in different variants of the same malware are stored multiple times, which wastes storage resources. Previous analysis has shown that this redundant storage of the same byte sequences can account for more than half of the storage needs of large malware repositories.
The task of the student is to design and implement a storage scheme that removes redundant storage of the same byte sequences as much as possible, thereby improving the storage efficiency of large malware repositories. This task involves studying how the degree of similarity between malware samples can be measured and quantified at the byte sequence level, how the causes of similar and dissimilar byte sequences in otherwise similar samples can be identified, how similarity metrics can help find clusters of similar samples in a large malware repository, and how such clusters can be stored efficiently using a differential storage method. The student should survey existing approaches to differential storage and select the one that best fits the needs of large malware repositories. Finally, building on these studies and on the survey of the literature and existing solutions, the student should design and implement a prototype differential storage system suitable for large malware repositories, and evaluate the efficiency gains it provides compared to the traditional, redundant storage of samples.
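
As a purely illustrative starting point, the following Python sketch shows one possible way to quantify byte-level similarity and to store shared byte sequences only once. It is not a prescribed design: fixed-size chunking, SHA-256 chunk identifiers, Jaccard similarity over chunk sets, and the names CHUNK_SIZE, split_chunks, jaccard_similarity and ChunkStore are all assumptions made for this example. Fixed-size chunking in particular loses alignment when bytes are inserted or deleted, so a real prototype would likely evaluate content-defined chunking or delta encoding against a cluster representative instead.

```python
import hashlib

# Hypothetical parameter: tiny chunks keep the example readable;
# a real system would use chunks of kilobytes, not bytes.
CHUNK_SIZE = 4


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive fixed-size chunks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def jaccard_similarity(a: bytes, b: bytes, size: int = CHUNK_SIZE) -> float:
    """Jaccard index of the two samples' sets of chunk digests."""
    set_a = {hashlib.sha256(c).digest() for c in split_chunks(a, size)}
    set_b = {hashlib.sha256(c).digest() for c in split_chunks(b, size)}
    if not set_a and not set_b:
        return 1.0  # two empty samples are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


class ChunkStore:
    """Toy deduplicating store: each distinct chunk is kept exactly once."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.blobs: dict[bytes, bytes] = {}       # chunk digest -> chunk bytes
        self.recipes: dict[str, list[bytes]] = {}  # sample id -> ordered digests

    def put(self, data: bytes) -> str:
        """Store a sample and return its identifier (SHA-256 hex digest)."""
        sample_id = hashlib.sha256(data).hexdigest()
        recipe = []
        for chunk in split_chunks(data, self.chunk_size):
            digest = hashlib.sha256(chunk).digest()
            self.blobs.setdefault(digest, chunk)  # no-op if already stored
            recipe.append(digest)
        self.recipes[sample_id] = recipe
        return sample_id

    def get(self, sample_id: str) -> bytes:
        """Reassemble the original sample from its chunk recipe."""
        return b"".join(self.blobs[d] for d in self.recipes[sample_id])

    def stored_bytes(self) -> int:
        """Bytes occupied by unique chunks (recipe overhead is ignored)."""
        return sum(len(chunk) for chunk in self.blobs.values())
```

For example, with the 16-byte inputs b"AAAABBBBCCCCDDDD" and b"AAAABBBBXXXXDDDD" and 4-byte chunks, three of the five distinct chunks are shared, so jaccard_similarity returns 0.6, and a ChunkStore holding both samples keeps 20 bytes of chunk data instead of 32. An evaluation of a real prototype would also have to account for the metadata overhead of the recipes, which this sketch ignores.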