Airflow-based implementation of IoT malware metadata database maintainer
The rapid growth of the Internet of Things (IoT) has enabled many new and creative uses, but it has also made IoT devices frequent targets for malicious actors. As a consequence, thousands of new malware samples appear every day, including entirely new species and variants of known families. To keep pace with this rate of malware creation, security professionals must develop novel, automated methods for malware detection. To evaluate these methods, they need to collect samples with honeypots and/or build malware datasets; however, the quality of such feeds and datasets is often unclear. To address this, the authors of [1] propose a pipeline whose goal is to effectively filter out and discard benign files, extract metadata from likely malicious samples, and create graph-based databases containing only metadata from verified malware.
The student's task is to implement the approach described in [1] within the Apache Airflow framework and to extend it with additional features. While working on the assignment, the student will become familiar with technologies and concepts such as graph databases (Neo4j), scheduling and monitoring platforms (Airflow), similarity digest schemes (TLSH), and malware analysis platforms (VirusTotal).
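To make the three stages named above concrete (discard benign files, extract metadata, build a graph), the following is a minimal plain-Python sketch, not the method of [1]. All names here (Sample, filter_benign, extract_metadata, build_graph, run_pipeline), the detection-count threshold, and the BELONGS_TO relationship are hypothetical and chosen only for illustration. The graph is kept as in-memory dictionaries and tuples rather than written to Neo4j, and no VirusTotal or TLSH calls are made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical record for one collected file."""
    sha256: str
    detections: int                # assumed: number of engines flagging the file
    family: Optional[str] = None   # assumed: family label, if one is known


def filter_benign(samples, min_detections=5):
    """Keep only samples at or above an assumed detection threshold."""
    return [s for s in samples if s.detections >= min_detections]


def extract_metadata(sample):
    """Turn a sample into a flat metadata record."""
    return {
        "sha256": sample.sha256,
        "family": sample.family or "unknown",
        "detections": sample.detections,
    }


def build_graph(records):
    """Build an in-memory graph: sample nodes linked to family nodes."""
    nodes = {}
    edges = []
    for record in records:
        nodes[record["sha256"]] = record
        family_id = "family:" + record["family"]
        nodes.setdefault(family_id, {"name": record["family"]})
        edges.append((record["sha256"], "BELONGS_TO", family_id))
    return nodes, edges


def run_pipeline(samples, min_detections=5):
    """Chain the three stages in order."""
    malicious = filter_benign(samples, min_detections)
    records = [extract_metadata(s) for s in malicious]
    return build_graph(records)
```

In an Airflow-based implementation, each of these functions would plausibly become its own task so that the stages can be scheduled, retried, and monitored independently; how the stages are actually split is a design decision left to the assignment.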
[1] D. Maliga, R. Nagy and L. Buttyán, "A pipeline for processing large datasets of potentially malicious binaries with rate-limited access to a cloud-based malware analysis platform," 2024 32nd International Conference on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Krakow, Poland, 2024, pp. 1-6, doi: 10.1109/MASCOTS64422.2024.10786551.