Random Forest Classifier On Malware (English version)



(Copyright 2020 by YI SHA. If you want to re-post this, please send me an email: shayi1983end@gmail.com)


Overview


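The random forest classifier is a machine learning algorithm that has recently become popular for identifying malware, and here it is implemented in Python. The traditional signature-based and heuristic detection used by anti-virus software can no longer fully catch the large number of malware variants, so an efficient and accurate method is needed. Fortunately, we can rely on the open-source sklearn library: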

In this tutorial, I'll show you how to use the random forest classifier machine learning algorithm to detect malware with the Python programming language.

The traditional, now largely obsolete signature-based or heuristic approach used by the majority of anti-virus software is no longer suitable for detecting the huge number of malware variants emerging today. For these billions of variants, we need a fast, automatic, and accurate way to judge whether an unknown software binary is malicious or benign.


The Python sklearn library provides a random forest classifier class that does this job very well. Note that the simplest way to use the random forest algorithm is in a binary (two-class) scenario: classifying an unknown object into one of two possible categories. This means any task that involves a two-way decision, not merely telling malware from benign software, can take advantage of the random forest classifier.


So let's get to our topic. From a high-level perspective, I'll extract every printable string longer than five characters from the two training data sets, malware and benign software, respectively, then compress this data with the hashing trick to save memory and speed up analysis. Next, I use this data, along with a label vector, to train our random forest classifier so that it forms a general concept of what malware and benign software look like. Finally, I pass a previously unseen Windows PE binary file to this classifier and let it make a prediction; the resulting value is the probability that the file is malicious, which can be fed to the logic of other components inside an anti-virus product.

(Don't worry too much about the terminology above; I will explain it as I walk you through the code line by line.)


Implementation and Execution

We import the first three prerequisite Python libraries:

✔ re (regular expressions);

✔ numpy;

✔ the FeatureHasher class (performs string hashing):



The function get_string_features() takes an absolute file path as its first argument and an instance of the FeatureHasher class as its second argument.

The "front-end" of this function opens the PE binary file specified by the caller and runs a regular expression match over its contents, collecting all matched strings into a list (the variable strings).


For example, if we extract strings from a malware binary this way, the findall() method returns a list containing all candidate strings:


The "back-end" of this function iterates over the strings list, using every string as a key and 1 as its value to build a feature dictionary that records which strings exist in the binary. It then calls the transform() method of the FeatureHasher class to compress this dictionary, converts the resulting sparse matrix to a dense, standard numpy array, and returns the first element (the only row) to the caller:
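Since the original listing is not reproduced here, the following is a minimal sketch of what such a function could look like. The regular expression, the six-character minimum (read from "longer than five characters"), and the helper names extract_strings() and hash_strings() are assumptions made for illustration, not the author's exact code.

```python
import re

import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Assumption: a "printable string" is a run of printable ASCII bytes that is
# longer than five characters, i.e. at least six.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def extract_strings(data):
    """Front-end: return every printable string found in the raw bytes."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def hash_strings(strings, hasher):
    """Back-end: mark each string as present, hash, and densify."""
    string_features = {s: 1 for s in strings}
    hashed = hasher.transform([string_features])  # sparse, shape (1, n_features)
    return hashed.toarray()[0]  # first (and only) row as a 1-D numpy array


def get_string_features(path, hasher):
    """Read a binary from disk and return its hashed string features."""
    with open(path, "rb") as f:
        return hash_strings(extract_strings(f.read()), hasher)
```

Splitting the front-end and back-end into two helpers is only a convenience for experimenting on in-memory bytes; a call such as get_string_features(some_path, FeatureHasher(n_features=20000)) would combine them.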


To make this point clearer, I ran a small experiment to show the inner workings of that code chunk:



As the experiment shows, compared with the original list that stored the raw strings, this function returns a very large numpy array, but most of its entries are zero; only 256 / 20000 ≈ 1.3% are occupied by 1.
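A rough way to reproduce this kind of experiment is sketched below. The three strings are hypothetical stand-ins for strings pulled from a binary, and n_features=20000 is taken from the ratio quoted above. Note that with FeatureHasher's default settings a slot can also hold -1, because hashed values carry an alternating sign.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=20000)

# Hypothetical strings, standing in for the output of the string extraction.
example_strings = ["GetProcAddress", "LoadLibraryA", "kernel32.dll"]

sparse_row = hasher.transform([{s: 1 for s in example_strings}])
dense_row = sparse_row.toarray()[0]

nonzero = int(np.count_nonzero(dense_row))
print("sparse shape:", sparse_row.shape)
print("dense shape:", dense_row.shape)
print("non-zero slots:", nonzero, "fraction:", nonzero / dense_row.size)
```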


Next, I collect the full absolute path of every file in the two training data set directories with the following code:

Basically, this constructs two lists of complete file paths, one for the malware and one for the benign software stored on the hard disk drive; the execution output is shown in the following figure:
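One possible way to build such path lists is sketched here. The helper name list_full_paths() and the directory names in the comments are hypothetical; only the list names full_file_path_malware and full_file_path_benign come from the tutorial.

```python
import os


def list_full_paths(directory):
    """Return the absolute path of every regular file directly inside directory."""
    directory = os.path.abspath(directory)
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


# Hypothetical directory names; point these at your own training data.
# full_file_path_malware = list_full_paths("data/malware")
# full_file_path_benign = list_full_paths("data/benign")
```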


Now we can invoke get_string_features() on every path in the full_file_path_malware and full_file_path_benign lists to extract hashed string-based features for each binary; I do this with a compact list comprehension. We also need the label vector mentioned earlier, which tells the machine learning detector which binaries to treat as malicious and which as benign:


By machine learning and mathematical convention, a capital "X" represents a matrix and a lower-case "y" represents a single vector. Because get_string_features() returns one feature vector per binary, calling it repeatedly produces a list of vectors, so "X" is a two-dimensional matrix. "y" has the same length as "X": it holds 1 for every malware row of "X" and 0 for every benign row:
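The shape of "X" and "y" can be illustrated with a small sketch. To keep it runnable without any sample binaries, the string lists below are hypothetical stand-ins for what would be extracted from each file, and hash_one() is an illustrative helper rather than the tutorial's function.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=2000)

# Hypothetical stand-ins for the strings extracted from each binary.
malware_strings = [
    ["IsDebuggerPresent", "cmd.exe /c del"],
    ["keylog.txt", "IsDebuggerPresent"],
]
benign_strings = [
    ["Copyright Example Corp", "File not found"],
    ["Usage: tool [options]"],
    ["Help and Support"],
]


def hash_one(strings):
    """Hash one binary's strings into a dense feature vector."""
    return hasher.transform([{s: 1 for s in strings}]).toarray()[0]


# One row per binary: malware rows first, then benign rows.
X = np.array([hash_one(s) for s in malware_strings + benign_strings])
# Matching labels: 1 for each malware row, 0 for each benign row.
y = np.array([1] * len(malware_strings) + [0] * len(benign_strings))

print(X.shape, y)
```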



After data preparation and pre-processing, we use the random forest classifier provided by the sklearn library to fit (or "train") this machine learning malware detector on the training data "X" and "y":




In the final step, I extract hashed string-based features from an unknown, real-world Windows PE binary file (a launcher for a popular MMORPG client ^^), then use our classifier to probe it:


The predict_proba() method returns two values for that binary: the first element is the probability that it is benign and the second is the probability that it is malicious, so the two are complementary and add up to 100%:
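The fit and predict_proba() steps can be tried out on synthetic data, as in the sketch below. The feature layout, the number of trees (n_estimators=64), and the fixed random_state are assumptions chosen so the example is small and repeatable; they are not values taken from the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_per_class, n_features = 40, 20

# Synthetic stand-in for hashed features: columns 0-4 are set only in "malware"
# rows, columns 5-9 only in "benign" rows, and columns 10-19 are random noise.
X = rng.integers(0, 2, size=(2 * n_per_class, n_features)).astype(float)
X[:n_per_class, :5] = 1.0
X[:n_per_class, 5:10] = 0.0
X[n_per_class:, :5] = 0.0
X[n_per_class:, 5:10] = 1.0
y = np.array([1] * n_per_class + [0] * n_per_class)

classifier = RandomForestClassifier(n_estimators=64, random_state=0)
classifier.fit(X, y)

# An "unknown" sample that resembles the malware rows.
unknown = np.zeros((1, n_features))
unknown[0, :5] = 1.0

# predict_proba() returns one row per sample; columns follow classifier.classes_.
probabilities = classifier.predict_proba(unknown)[0]
p_benign, p_malicious = probabilities
print(f"benign={p_benign:.2f} malicious={p_malicious:.2f}")
```

The column order is given by classifier.classes_, which is [0, 1] for these labels, so the second column is the maliciousness score.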



As you can see, the sklearn library handles most of the heavy lifting: creating different decision trees at random (so that they form a forest), running the mathematical decision process behind each of these trees, and taking a majority vote to determine whether the unknown binary is malicious.

Taking advantage of it to solve artificial intelligence-related problems therefore requires only a few lines of code.


Looking carefully at the output above, you may wonder why this custom machine learning detector treats a legitimate online game client as malware.

Several reasons can explain this apparent "false positive". For example, strings related to anti-debugging and anti-reverse-engineering techniques, which malware authors also use frequently, may appear inside such launchers. More importantly, we can change the threshold value defined in our if clause as a simple way to reduce false positives; keep in mind that a stricter threshold may also let some real malware through.
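The effect of moving the threshold can be sketched in a few lines. The function name verdict(), the default cut-off of 0.5, and the score 0.62 are all hypothetical; the tutorial's actual threshold and output value are not reproduced here.

```python
def verdict(p_malicious, threshold=0.5):
    """Turn a maliciousness probability into a label using a tunable cut-off."""
    return "malicious" if p_malicious > threshold else "benign"


score = 0.62  # hypothetical probability reported for the game launcher

print(verdict(score))                 # judged with the default cut-off
print(verdict(score, threshold=0.8))  # judged with a stricter cut-off
```

With a stricter cut-off, a borderline score like this one is no longer flagged, which is the simple false positive reduction described above.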



Evaluate Performance

To evaluate the accuracy of this machine learning detector further, we can set up an optional experimental procedure called "cross validation", which involves these steps:

① Randomly divide our training data into subsets: several training sets, used to train the classifier, and a test set, which plays the role of previously unseen binaries used to test the classifier.

② Let the classifier predict maliciousness scores for the test set. Use those scores together with the test label vector (obtained by dividing the original label vector into training and test parts in the same random way; it represents the "official" categories that we know in advance) to compute the "Receiver Operating Characteristic (ROC) curve" of this detector.

The ROC curve measures how a classifier's true positive rate (TPR) changes against its false positive rate (FPR); we can use the roc_curve() function of the metrics module from the sklearn library for this task.


③ Then we record the TPR and FPR values by passing them to the semilogx() function of the pyplot module from matplotlib, the de facto data visualization library, which plots them on a logarithmic x-axis. We then swap the roles of the training and testing subsets and repeat the above process until all subsets are covered, which is why it is called "cross validation".

④ Finally, we draw all the ROC curves computed during these passes using a series of pyplot plotting functions and display the result.
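The ROC computation in step ② can be illustrated on a handful of values. The labels and scores below are made up for the illustration; only roc_curve() itself comes from sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical test labels (1 = malware, 0 = benign) and predicted scores.
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.95, 0.80, 0.62, 0.55, 0.30, 0.10])

# Each threshold yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f} FPR={f:.2f} TPR={t:.2f}")
```

The fpr and tpr arrays are what would then be handed to pyplot's semilogx() in step ③.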


So that you don't get confused by all the steps involved in "cross validation", the following figure shows the overall logic:


Now that you have the general concept of "cross validation", let's walk through the code:

Here, I wrapped all the logic into a cv_evaluate() function that takes the "X" training data matrix and the "y" label vector as its first two arguments, and a FeatureHasher instance as its last argument. The function imports three essential libraries and modules, converts "X" and "y" to numpy arrays, and initializes a counter variable used for the final chart plotting.


The KFold instance is actually an iterator that yields a different training/test split on each iteration. Here I set the number of passes to two and randomly separate the training and testing sets by setting the third argument, shuffle=True. Thus, on each iteration we get different training and test sets to train and test a different random forest classifier (notice that the classifier is instantiated inside the for loop, which guarantees that each new classifier CANNOT see or remember the previous experiments and produces its outcome independently).
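The loop described above might look roughly like the following sketch. It is not the author's cv_evaluate(): the name cv_roc_curves() is hypothetical, it omits the FeatureHasher argument and the plotting, it runs on synthetic data, and it uses sklearn.model_selection.KFold, where the class lives in current scikit-learn releases and where the constructor takes n_splits as a keyword.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import KFold


def cv_roc_curves(X, y, n_splits=3, seed=0):
    """Return one (fpr, tpr) pair per cross-validation pass."""
    X, y = np.asarray(X), np.asarray(y)
    curves = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(X):
        # A fresh classifier per pass, so no pass can remember earlier data.
        clf = RandomForestClassifier(n_estimators=64, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]  # maliciousness scores
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        curves.append((fpr, tpr))
    return curves


# Synthetic, easily separable data: 30 "malware" rows and 30 "benign" rows.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 10))
X_demo[:30, 0] += 1.0
y_demo = np.array([1] * 30 + [0] * 30)

curves = cv_roc_curves(X_demo, y_demo, n_splits=3)
print(len(curves), "ROC curves computed")
```

Each (fpr, tpr) pair could then be drawn with matplotlib's pyplot.semilogx(fpr, tpr) to obtain the kind of chart discussed in this section.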


The following figure demonstrates the process when I told KFold() to perform three passes of "cross validation"; as you can clearly see, a random forest classifier and a matplotlib Line2D object were generated three times:

The final figure shows the three ROC curves that were drawn. We can read it like this: at a false positive rate of about 1% (10^-2), the true positive rate averages roughly 80% at most; and as the true positive rate of this machine learning detector climbs from 80% to 100%, its false positive rate increases from 1% to 100% at the same time!



Summary

In this tutorial, I showed you how to extract and prepare training and testing data sets, then train and test a specific machine learning model for malware detection. You also know how to evaluate the general trend of its detection accuracy. What this tutorial has not covered is how to improve its accuracy and reduce its false positive rate. To achieve that, you will need to train and test on at least tens of thousands of samples (you can get them from virustotal.com). You can also redesign the feature extraction logic to include import address table (IAT) analysis or assembly instruction N-gram analysis of a PE file. Alternatively, you can explore other machine learning algorithms provided by sklearn, such as logistic regression and decision trees, which I will leave to you as exercises ^^



Appendix A

This section will help you understand the internal behavior of the iterator that KFold() returns.

Suppose we have a list of dictionaries that stores the correspondence between movie names and their box office takings (measured in USD), in ascending order:


Now suppose one requirement is to extract a sub-collection containing some specific films. The usual indexing syntax fails when given several indices, because a plain Python list does not support specifying multiple indices at once:


One workaround is to use numpy's array() function to convert the whole movies and box office list to an array (say, A), convert the indices to another array (say, B), and then use B to index into A, retrieving several members at once:
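Both the failure and the workaround can be sketched as follows. The movie titles and figures are placeholders, not real box office data, and the index values chosen are arbitrary.

```python
import numpy as np

# Placeholder titles and figures, in ascending order of box office (USD).
movies_box_office = [
    {f"Movie {chr(ord('A') + i)}": (i + 1) * 100_000_000} for i in range(9)
]

try:
    movies_box_office[0, 2, 4]  # a plain list rejects several indices at once
except TypeError as error:
    print("list indexing failed:", error)

np_MoviesBoxOffice = np.array(movies_box_office)  # A: object array of shape (9,)
wanted = np.array([0, 2, 4])                      # B: the indices to pull out
selected = np_MoviesBoxOffice[wanted]             # index A with B
print(selected)
```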




This seems pretty cool, but what if we now have another requirement: randomly dividing this movies and box office array into two parts with different elements in each?

This is where KFold() from sklearn's cross_validation module comes into play (in current scikit-learn releases, KFold lives in sklearn.model_selection instead). The following code shows how easily this is accomplished with only a handful of lines:


execution outputs:




The second argument of KFold() specifies the number of iteration passes; it must be less than or equal to the number of elements in the array we want to split.

As you can see, each iteration divides the array into two separate parts, each holding randomly chosen members. KFold() returns randomly arranged indices as two sub-arrays of indices into the parent array: in the above case, the array "np_MoviesBoxOffice" has the complete set of indices [0-8], while indices_A and indices_B each contain only a random part of them. This is why we can use them to index into the original parent array and split it into our training and testing sets!
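The behavior of the index arrays can be checked with a short sketch. It uses sklearn.model_selection.KFold from current scikit-learn, so the number of passes is given as n_splits rather than as a second positional argument, and the nine array elements are placeholders for the movie entries.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder stand-in for the nine-element movies array.
np_MoviesBoxOffice = np.array([f"movie_{i}" for i in range(9)])

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
splits = list(kfold.split(np_MoviesBoxOffice))

for indices_A, indices_B in splits:
    # indices_A and indices_B are disjoint and together cover indices 0..8.
    print(np_MoviesBoxOffice[indices_A], np_MoviesBoxOffice[indices_B])
```

Across the three passes, every element lands in the second (test) part exactly once, which is what lets each subset take its turn as unseen data.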

 
