Malware Classification Using Machine Learning
Malware is a persistent problem on almost every mobile phone, laptop, memory card, and similar device. One of the most common techniques malware uses to evade detection is binary obfuscation, either through encryption (polymorphism) or through metamorphism (different code for the same functionality). To detect malware quickly and efficiently, samples should be grouped according to their family. This creates a growing need for automated, self-learning, fast, and efficient techniques that are robust against these evasion methods. In this article, we aim only to classify malware into their respective families, not to detect it (that is, to decide whether a file is malware or not). For the feature dataset used by our machine learning algorithms, we apply a selection criterion of 500 counts of an observed value. We focus on novel data visualization techniques, such as malware image representation, and on classification based on artificial neural networks and K-Nearest Neighbor.
Malware analysis is usually carried out as “static analysis”, “dynamic analysis”, or “signature-based analysis”. During static analysis, disassembled code files are analyzed for malicious system calls, and a model must be built for the control flow graphs. In dynamic analysis, by contrast, the sample is analyzed in a controlled environment and its activity is traced (system logs). This process is extremely slow and consumes considerable time and resources. Both techniques work well, but static code analysis suffers from differences in malware implementation, while dynamic analysis is limited by the environment and by the conditions that trigger the malware, and is therefore not a scalable option.
To analyze the malware signature, the signature must be constructed using N-Gram techniques. The malware disassembly is analyzed for the most frequently repeated operation codes (opcodes), and N-Grams are built on top of them. To visualize the data, we use malware visualization techniques: each malware byte code is converted into a grayscale image. The basic principle we follow is that malware samples from the same family resemble one another in visual appearance. These images are then used for image-based classification. Opcode counts are calculated from the disassembled code.
The objective of this article is to implement machine learning algorithms that classify malware into their respective families. The data comes from www.kaggle.com, where it was provided by Microsoft, and contains 10,868 malware samples belonging to nine classes, that is, nine malware families: Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY, and Gatak. The objective here is to analyze and visualize the malware and to analyze the data a priori. The goal is therefore to develop a new integrated model that takes advantage of all models.
Problem Definition: Extensive work has been done on malware analysis. Static, dynamic, and signature-based analysis techniques have been widely researched. A paper on image-based malware visualization, one of the preferred approaches [1], explains how to form an image from malware binaries and how to visualize these images. In an alternative approach that extracts data from the disassembly code for use in classification [2], the accuracy was not optimal. This article suggests a way to extract new features based on N-Grams, code sections, opcode sequences, and DLL calls. But before signatures for malware can even be developed, certain tasks must be performed in malware detection and classification.
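As a rough illustration of the opcode N-Gram features discussed in this section, the following Python sketch counts opcode N-Grams in disassembly text. It is not the pipeline used in this article: the tiny `OPCODES` vocabulary, the function names, and the reading of the 500-count criterion as a minimum total count across all files are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately small opcode vocabulary (assumption for this sketch).
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub",
           "xor", "cmp", "ret", "lea", "test", "jz", "jnz", "nop"}


def extract_opcodes(asm_text):
    """Return the known opcodes in the order they appear in the disassembly text."""
    # Whole lowercase words only, so labels such as sub_401020 are not split.
    tokens = re.findall(r"\b[a-z]+\b", asm_text.lower())
    return [tok for tok in tokens if tok in OPCODES]


def opcode_ngrams(opcodes, n=2):
    """Count every run of n consecutive opcodes."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def frequent_ngrams(per_file_counts, min_count=500):
    """Keep N-Grams whose total count over all files reaches min_count.

    Using 500 as a corpus-wide minimum is one possible reading of the
    selection criterion, not a confirmed detail of the original method.
    """
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)
    return sorted(gram for gram, count in total.items() if count >= min_count)
```

Each file's counts for the retained N-Grams could then serve as one row of the feature matrix passed to a classifier.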
Related work: Extensive work has been done on malware analysis, and many articles have been published on static, dynamic, and signature-based analysis techniques. One publication presents image-based malware visualization as one of the preferred methods [1]. It explains how to create an image from malicious binary files and how to view these images, which are then used for image-based classification. We also referred to a paper that describes how to extract data from the disassembly code for use in classification [2]. It suggests a way to extract new features based on N-Grams, code sections, opcode sequences, and DLL calls. But before signatures for malware can even be developed, there are tasks that need to be done in malware detection and classification.
Analysis: We studied a few articles that use the same principles as ours to classify malware into families. It has been observed that, when data is missing, the multilayer perceptron model and logistic regression perform well. Image visualization techniques were used, resulting in an average predictive accuracy of 95% using a deep neural network. We also found that this methodology gives the best results compared with the other available techniques, whereas machine-learning-based malware classification for Android apps using multimodal image representations [3] is somewhat slow in terms of data processing.
Proposed methodology: To analyze the signature, we construct it using N-Gram techniques. The malware disassembly is analyzed for the most frequently repeated opcodes, and N-Grams are constructed from them. We also propose to use malware visualization techniques: our goal is to convert each malware byte code into a grayscale image. During research and analysis, it has been observed that malware samples from the same family resemble one another in visual appearance, which gives us an opportunity to exploit this weakness.
These malware images will be used for image-based classification. From the disassembled code we will calculate the number of opcodes, the number of DLLs, and the number of sections.
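The byte-to-grayscale conversion proposed above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the exact procedure of this article: it assumes a hex-dump input where each line holds an address followed by two-digit hex bytes, treats unknown bytes written as `??` as zero, and uses a fixed image width of 256 pixels. The function names are hypothetical.

```python
def parse_hex_dump(text):
    """Turn hex-dump text into raw bytes, skipping the address in column one."""
    values = bytearray()
    for line in text.splitlines():
        for token in line.split()[1:]:
            # Assumption: '??' marks an unknown byte and is mapped to 0.
            values.append(0 if token == "??" else int(token, 16))
    return bytes(values)


def bytes_to_rows(data, width=256):
    """Lay the bytes out row by row; each byte is one pixel intensity (0-255)."""
    # Zero-pad the tail so the last row is complete.
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def rows_to_pgm(rows):
    """Encode the pixel rows as a binary PGM (P5) grayscale image."""
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {len(rows)}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in rows)
```

Writing the result of `rows_to_pgm` to a file opened in binary mode gives an image that common viewers can open. The pixel rows themselves can also be fed directly to an image-based classifier such as K-Nearest Neighbor or a neural network after resizing to a common shape.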