Creation of a Metadata Database for IoT Malware Datasets
The Internet of Things (IoT) consists of embedded computers connected to each other and to the Internet, and it forms the basis of new applications in many domains. Alongside its advantages, it also poses security risks. One specific problem is that embedded IoT devices can be infected by malware, which endangers the trustworthiness of IoT systems and the availability of Internet-based services. Intensive research is therefore under way on new, efficient algorithms for IoT malware detection. This research needs malware sample datasets for evaluating the proposed algorithms. Several such datasets have been made available, but they either contain raw malware samples with insufficient metadata, or they contain only the metadata selected by the publisher of the dataset, without any access to the malware samples themselves.
The goal of this project is to build a metadata database for some existing IoT malware datasets. The database can then be used to easily create so-called collections with various features on VirusTotal, the largest available malware repository in the world. The metadata database would thus provide access to the metadata, while the samples themselves would be accessible via collections in VirusTotal. Such a combined approach would be extremely valuable to the IoT malware research community.
The task of the student is to design and implement a framework for the automated processing of existing malware sample datasets, with the purpose of extracting metadata from them and building a metadata database. Similarity searches in the database should be easy to perform, so the use of a graph-based database, in which records of similar samples are connected, can be considered. In addition, requesting information (e.g., AV labels, submission dates, and other analysis results) from VirusTotal consumes quota, so the framework should minimize the number of VirusTotal requests. Part of the task is to demonstrate the usage of the framework and to evaluate its operation.
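As an illustration only, the following Python sketch shows one possible shape for three pieces of such a framework: extracting metadata that needs no remote lookup, caching remote reports so that each sample costs quota at most once, and deriving similarity edges that could later be loaded into a graph database. All function names, the SQLite cache schema, the Jaccard threshold, and the `fetch` callable are assumptions made for this sketch; `fetch` is a hypothetical placeholder for whatever VirusTotal client the student chooses, and no real VirusTotal API call is shown.

```python
import hashlib
import json
import sqlite3
import struct
from itertools import combinations
from pathlib import Path
from typing import Callable, Dict, List, Set, Tuple

# e_machine values from the ELF specification (subset common on IoT devices).
ELF_MACHINES = {3: "x86", 8: "MIPS", 20: "PowerPC", 40: "ARM",
                62: "x86-64", 183: "AArch64"}


def extract_local_metadata(path: Path) -> dict:
    """Compute metadata that can be derived offline, without spending quota."""
    data = path.read_bytes()
    meta = {
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "arch": None,  # stays None for non-ELF files
    }
    # ELF header: magic at bytes 0..3, byte order at 5, e_machine at 18..19.
    if len(data) >= 20 and data[:4] == b"\x7fELF":
        endian = "<" if data[5] == 1 else ">"
        (machine,) = struct.unpack_from(endian + "H", data, 18)
        meta["arch"] = ELF_MACHINES.get(machine, f"unknown({machine})")
    return meta


def open_cache(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local report cache; the schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports "
        "(sha256 TEXT PRIMARY KEY, report TEXT NOT NULL)"
    )
    return conn


def cached_report(conn: sqlite3.Connection, sha256: str,
                  fetch: Callable[[str], dict]) -> dict:
    """Return the report for a sample, calling `fetch` only on a cache miss."""
    row = conn.execute(
        "SELECT report FROM reports WHERE sha256 = ?", (sha256,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    report = fetch(sha256)  # the only place where quota would be consumed
    conn.execute("INSERT INTO reports VALUES (?, ?)",
                 (sha256, json.dumps(report)))
    conn.commit()
    return report


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature sets (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def similarity_edges(features: Dict[str, Set[str]],
                     threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Connect every pair of samples whose feature sets are similar enough."""
    edges = []
    for h1, h2 in combinations(sorted(features), 2):
        score = jaccard(features[h1], features[h2])
        if score >= threshold:
            edges.append((h1, h2, score))
    return edges
```

In this sketch, a dataset would be processed by calling `extract_local_metadata` on every sample first and consulting `cached_report` only for fields that cannot be computed locally. The pairwise comparison in `similarity_edges` is quadratic in the number of samples, so a real framework would likely need an indexing or hashing scheme for larger datasets.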