(11-3-01) Detecting Illicit Accounts on the Ethereum Blockchain

This article is not yet complete; please check back for updates.

The source code and dataset downloads are at the end of this article.

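Ethereum is an open-source platform built on blockchain technology, and its native cryptocurrency is Ether (ETH). Launched in 2015 by developers including Vitalik Buterin and Gavin Wood, it has become one of the most popular cryptocurrencies after Bitcoin. Beyond cryptocurrency transactions, Ethereum gives developers and businesses powerful tools for building decentralized applications. This section implements a complete machine learning project that detects illicit accounts on the Ethereum blockchain, covering problem definition, model building and evaluation, and a final summary with recommendations. The example highlights the importance of handling class imbalance and shows how several machine learning algorithms can be applied to a real problem. Data visualization and performance metrics make the results easier to interpret and act on.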

Example 11-1: Using a model to detect illicit accounts on the Ethereum blockchain (source path: daima/11/illicit-account-detection.ipynb)

11.3.1 Dataset Overview

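The dataset used in this project is intended for fraud detection research on the Ethereum blockchain. It contains records of known fraudulent and valid transactions and can be used for data analysis, machine learning, and the development and testing of fraud detection algorithms. A summary of the dataset follows: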

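Source: The data comes from the Ethereum blockchain and contains information about Ethereum accounts and transactions.
Purpose: The dataset gives researchers and data scientists sample data for fraud detection. It can be used to train machine learning models that identify potentially fraudulent transactions.
Columns: The dataset contains many columns, including account address, transaction type, time intervals between transactions, transaction counts, and Ether values. A "FLAG" column indicates whether a record is labeled as fraudulent.
Uses: The dataset supports fraud detection, data mining, feature engineering, and other analysis of transaction behavior on the Ethereum blockchain.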

11.3.2 Data Preprocessing

(1) Read the data file "transaction_dataset.csv" and display the first few rows to get an initial look at the data. The code is as follows.

import pandas as pd

# Load the transaction dataset and preview the first rows
dataset = pd.read_csv("../input/ethereum-frauddetection-dataset/transaction_dataset.csv")
dataset.head()

Running this outputs:

	Unnamed: 0	Index	Address	FLAG	Avg min between sent tnx	Avg min between received tnx	Time Diff between first and last (Mins)	Sent tnx	Received Tnx	Number of Created Contracts	...	ERC20 min val sent	ERC20 max val sent	ERC20 avg val sent	ERC20 min val sent contract	ERC20 max val sent contract	ERC20 avg val sent contract	ERC20 uniq sent token name	ERC20 uniq rec token name	ERC20 most sent token type	ERC20_most_rec_token_type
0	0	1	0x00009277775ac7d0d59eaad8fee3d10ac6c805e8	0	844.26	1093.71	704785.63	721	89	0	...	0.000000	1.683100e+07	271779.920000	0.0	0.0	0.0	39.0	57.0	Cofoundit	Numeraire
1	1	2	0x0002b44ddb1476db43c868bd494422ee4c136fed	0	12709.07	2958.44	1218216.73	94	8	0	...	2.260809	2.260809e+00	2.260809	0.0	0.0	0.0	1.0	7.0	Livepeer Token	Livepeer Token
2	2	3	0x0002bda54cb772d040f779e88eb453cac0daa244	0	246194.54	2434.02	516729.30	2	10	0	...	0.000000	0.000000e+00	0.000000	0.0	0.0	0.0	0.0	8.0	None	XENON
3	3	4	0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e	0	10219.60	15785.09	397555.90	25	9	0	...	100.000000	9.029231e+03	3804.076893	0.0	0.0	0.0	1.0	11.0	Raiden	XENON
4	4	5	0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89	0	36.61	10707.77	382472.42	4598	20	1	...	0.000000	4.500000e+04	13726.659220	0.0	0.0	0.0	6.0	27.0	StatusNetwork	EOS
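The `Unnamed: 0` column in the preview above is what pandas typically produces when a CSV was written together with its row index and then read back without `index_col`. The sketch below uses a tiny in-memory CSV with made-up rows (not the real file) to show the effect and how `index_col=0` avoids it. That the original file was produced this way is an assumption.

```python
import io
import pandas as pd

# A tiny, made-up CSV whose first header cell is empty, as happens when a
# DataFrame is saved with its index. These rows are illustrative only.
csv_text = ",Index,Address,FLAG\n0,1,0xaaa,0\n1,2,0xbbb,1\n"

# Default read: the unnamed first column becomes "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))
print(list(default_df.columns))

# Treat the first column as the row index instead of a data column.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_df.columns))
```

Reading with `index_col=0` would make the later step that drops `Unnamed: 0` unnecessary, at the cost of changing the column counts reported in this section.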

(2) Get the dimensions of the dataset. dataset.shape returns a tuple of two values: the number of rows, then the number of columns. For example, (1000, 20) would mean the dataset has 1000 rows and 20 columns. The code is as follows.

dataset.shape

Running this outputs:

(9841, 51)

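(3) Get detailed information about the dataset. This command is useful for quickly checking the dataset's structure and data types, and for spotting missing values. The code is as follows.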

dataset.info()

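Running this line prints the following information about the dataset:

The name of each column.
The number of non-missing values in each column.
The data type of each column (for example, integer, float, or object).
The total number of rows in the dataset.

The output is: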

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9841 entries, 0 to 9840
Data columns (total 51 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   Unnamed: 0                                            9841 non-null   int64  
 1   Index                                                 9841 non-null   int64  
 2   Address                                               9841 non-null   object 
 3   FLAG                                                  9841 non-null   int64  
 4   Avg min between sent tnx                              9841 non-nul
######## (part of the output omitted)
 46   ERC20 avg val sent contract                          9012 non-null   float64
 47   ERC20 uniq sent token name                           9012 non-null   float64
 48   ERC20 uniq rec token name                            9012 non-null   float64
 49   ERC20 most sent token type                           9000 non-null   object 
 50   ERC20_most_rec_token_type                            8990 non-null   object 
dtypes: float64(39), int64(9), object(3)
memory usage: 3.8+ MB
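The `info()` summary reports three `object` columns among mostly numeric ones. A minimal sketch of separating numeric from non-numeric columns with `select_dtypes` is shown below. The toy frame borrows three column names from the dataset, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame: column names come from the dataset, values are illustrative.
toy = pd.DataFrame({
    "Address": ["0xaaa", "0xbbb", "0xccc"],
    "FLAG": [0, 0, 1],
    "Sent tnx": [721, 94, 2],
})

# Columns with a numeric dtype (integers and floats).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here, the address strings).
non_numeric_cols = [c for c in toy.columns if c not in numeric_cols]

print(numeric_cols)
print(non_numeric_cols)
```

Knowing which columns are non-numeric matters later, because most machine learning algorithms need such columns encoded or dropped first.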

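(4) Check for duplicate rows and delete them, then delete the "Unnamed: 0" column because it is not needed for further analysis. Finally, output the dimensions of the processed dataset. The code is as follows.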

# Check for and remove duplicate rows
dataset.drop_duplicates(subset=None, inplace=True)

# Drop the "Unnamed: 0" column (not needed for further analysis)
dataset.drop(['Unnamed: 0'], axis=1, inplace=True)

# Show the dimensions of the processed dataset
dataset.shape

Running this outputs:

(9841, 50)
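One point about the order of operations in step (4): `drop_duplicates` runs while `Unnamed: 0` is still present. If that column is a unique row counter, as the preview suggests (an assumption), no two rows can be fully identical, which is consistent with the row count staying at 9841. The toy example below, with made-up rows, sketches the difference. Whether the real data contains rows that would match once the counter is removed is not shown here.

```python
import pandas as pd

# Made-up rows: the first two are identical except for the row counter.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Address": ["0xaaa", "0xaaa", "0xbbb"],
    "FLAG": [0, 0, 1],
})

# With the unique row counter present, no two rows are identical.
before = toy.drop_duplicates()

# After removing the counter, the repeated row is detected and dropped.
after = toy.drop(columns=["Unnamed: 0"]).drop_duplicates()

print(len(before), len(after))
```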

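(5) Generate descriptive statistics for the numeric columns: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. These statistics give a first impression of how the data is distributed. The code is as follows.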

dataset.describe()

Running this outputs:

	Index	FLAG	Avg min between sent tnx	Avg min between received tnx	Time Diff between first and last (Mins)	Sent tnx	Received Tnx	Number of Created Contracts	Unique Received From Addresses	Unique Sent To Addresses	...	ERC20 max val rec	ERC20 avg val rec	ERC20 min val sent	ERC20 max val sent	ERC20 avg val sent	ERC20 min val sent contract	ERC20 max val sent contract	ERC20 avg val sent contract	ERC20 uniq sent token name	ERC20 uniq rec token name
count	9841.000000	9841.000000	9841.000000	9841.000000	9.841000e+03	9841.000000	9841.000000	9841.000000	9841.000000	9841.000000	...	9.012000e+03	9.012000e+03	9.012000e+03	9.012000e+03	9.012000e+03	9012.0	9012.0	9012.0	9012.000000	9012.000000
mean	1815.049893	0.221421	5086.878721	8004.851184	2.183333e+05	115.931714	163.700945	3.729702	30.360939	25.840159	...	1.252524e+08	4.346203e+06	1.174126e+04	1.303594e+07	6.318389e+06	0.0	0.0	0.0	1.384931	4.826676
std	1222.621830	0.415224	21486.549974	23081.714801	3.229379e+05	757.226361	940.836550	141.445583	298.621112	263.820410	...	1.053741e+10	2.141192e+08	1.053567e+06	1.179905e+09	5.914764e+08	0.0	0.0	0.0	6.735121	16.678607
min	1.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.0	0.0	0.0	0.000000	0.000000
25%	821.000000	0.000000	0.000000	0.000000	3.169300e+02	1.000000	1.000000	0.000000	1.000000	1.000000	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.0	0.0	0.0	0.000000	0.000000
50%	1641.000000	0.000000	17.340000	509.770000	4.663703e+04	3.000000	4.000000	0.000000	2.000000	2.000000	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.0	0.0	0.0	0.000000	1.000000
75%	2601.000000	0.000000	565.470000	5480.390000	3.040710e+05	11.000000	27.000000	0.000000	5.000000	3.000000	...	9.900000e+01	2.946467e+01	0.000000e+00	0.000000e+00	0.000000e+00	0.0	0.0	0.0	0.000000	2.000000
max	4729.000000	1.000000	430287.670000	482175.490000	1.954861e+06	10000.000000	10000.000000	9995.000000	9999.000000	9287.000000	...	1.000000e+12	1.724181e+10	1.000000e+08	1.120000e+11	5.614756e+10	0.0	0.0	0.0	213.000000	737.000000
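In the summary above, several columns have a mean far above the median. For "Sent tnx", for instance, the mean is about 115.9 while the median is 3, which points to a heavily right-skewed distribution. The sketch below shows how the two values can be read from `describe()` and compared. The series values are made up for illustration.

```python
import pandas as pd

# Illustrative values only: mostly small counts plus one very large one.
sent = pd.Series([1, 1, 3, 11, 10000], name="Sent tnx")

summary = sent.describe()
mean_value = summary["mean"]    # arithmetic mean
median_value = summary["50%"]   # median (50th percentile)

# A mean well above the median suggests a long right tail.
is_right_skewed = mean_value > median_value
print(mean_value, median_value, is_right_skewed)
```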

(6) Get the names of all columns in the dataset. The column names help in understanding the dataset's structure and identifying its features. The code is as follows.

# Get the column names of the dataset
column=dataset.columns
column

Running this outputs:

Index(['Index', 'Address', 'FLAG', 'Avg min between sent tnx',
       'Avg min between received tnx',
       'Time Diff between first and last (Mins)', 'Sent tnx', 'Received Tnx',
       'Number of Created Contracts', 'Unique Received From Addresses',
       'Unique Sent To Addresses', 'min value received', 'max value received ',
       'avg val received', 'min val sent', 'max val sent', 'avg val sent',
       'min value sent to contract', 'max val sent to contract',
       'avg value sent to contract',
       'total transactions (including tnx to create contract',
       'total Ether sent', 'total ether received',
       'total ether sent contracts', 'total ether balance',
       ' Total ERC20 tnxs', ' ERC20 total Ether received',
       ' ERC20 total ether sent', ' ERC20 total Ether sent contract',
       ' ERC20 uniq sent addr', ' ERC20 uniq rec addr',
       ' ERC20 uniq sent addr.1', ' ERC20 uniq rec contract addr',
       ' ERC20 avg time between sent tnx', ' ERC20 avg time between rec tnx',
       ' ERC20 avg time between rec 2 tnx',
       ' ERC20 avg time between contract tnx', ' ERC20 min val rec',
       ' ERC20 max val rec', ' ERC20 avg val rec', ' ERC20 min val sent',
       ' ERC20 max val sent', ' ERC20 avg val sent',
       ' ERC20 min val sent contract', ' ERC20 max val sent contract',
       ' ERC20 avg val sent contract', ' ERC20 uniq sent token name',
       ' ERC20 uniq rec token name', ' ERC20 most sent token type',
       ' ERC20_most_rec_token_type'],
      dtype='object')
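Several names in the list above carry leading or trailing spaces (for example ' Total ERC20 tnxs' and 'max value received '), which makes them easy to mistype. A minimal sketch of normalizing them with `Index.str.strip()` follows. The toy frame uses three of the names with made-up values, and this cleanup is a suggestion rather than a step from the original notebook.

```python
import pandas as pd

# Toy frame with padded column names copied from the dataset; values are made up.
toy = pd.DataFrame(
    [[10, 2.5, 0]],
    columns=[" Total ERC20 tnxs", "max value received ", "FLAG"],
)

# Remove leading and trailing whitespace from every column name.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

If the names are stripped like this, any later code that refers to a padded name, such as ' ERC20 most sent token type', has to use the stripped form instead.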

(7) Check the dataset for missing values and count how many each column contains. The code is as follows.

dataset.isnull().sum()

Running this outputs:

Index                                                     0
Address                                                   0
FLAG                                                      0
Avg min between sent tnx                                  0
Avg min between received tnx                              0
Time Diff between first and last (Mins)                   0
##### (part of the output omitted)
 ERC20 avg val sent contract                            829
 ERC20 uniq sent token name                             829
 ERC20 uniq rec token name                              829
 ERC20 most sent token type                             841
 ERC20_most_rec_token_type                              851
dtype: int64
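Raw counts are easier to judge as a share of the rows: 829 missing values out of 9841 rows is roughly 8.4%. The sketch below computes both the count and the percentage per column on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; values are illustrative only.
toy = pd.DataFrame({
    "FLAG": [0, 0, 1, 1],
    "ERC20 uniq rec token name": [57.0, 7.0, np.nan, 8.0],
})

# Number of missing values per column.
missing_count = toy.isnull().sum()

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```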

(8) Display the first few rows of the dataset again to review its content and structure after the changes. The code is as follows.

dataset.head()

Running this now outputs:

	Index	Address	FLAG	Avg min between sent tnx	Avg min between received tnx	Time Diff between first and last (Mins)	Sent tnx	Received Tnx	Number of Created Contracts	Unique Received From Addresses	...	ERC20 min val sent	ERC20 max val sent	ERC20 avg val sent	ERC20 min val sent contract	ERC20 max val sent contract	ERC20 avg val sent contract	ERC20 uniq sent token name	ERC20 uniq rec token name	ERC20 most sent token type	ERC20_most_rec_token_type
0	1	0x00009277775ac7d0d59eaad8fee3d10ac6c805e8	0	844.26	1093.71	704785.63	721	89	0	40	...	0.000000	1.683100e+07	271779.920000	0.0	0.0	0.0	39.0	57.0	Cofoundit	Numeraire
1	2	0x0002b44ddb1476db43c868bd494422ee4c136fed	0	12709.07	2958.44	1218216.73	94	8	0	5	...	2.260809	2.260809e+00	2.260809	0.0	0.0	0.0	1.0	7.0	Livepeer Token	Livepeer Token
2	3	0x0002bda54cb772d040f779e88eb453cac0daa244	0	246194.54	2434.02	516729.30	2	10	0	10	...	0.000000	0.000000e+00	0.000000	0.0	0.0	0.0	0.0	8.0	None	XENON
3	4	0x00038e6ba2fd5c09aedb96697c8d7b8fa6632e5e	0	10219.60	15785.09	397555.90	25	9	0	7	...	100.000000	9.029231e+03	3804.076893	0.0	0.0	0.0	1.0	11.0	Raiden	XENON
4	5	0x00062d1dd1afb6fb02540ddad9cdebfe568e0d89	0	36.61	10707.77	382472.42	4598	20	1	7	...	0.000000	4.500000e+04	13726.659220	0.0	0.0	0.0	6.0	27.0	StatusNetwork	EOS

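(9) Get the column names again, then count the occurrences of each value in the ' ERC20 most sent token type' column. This is useful for seeing how the values of a particular column are distributed. The code is as follows.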

column=dataset.columns
column
dataset[' ERC20 most sent token type'].value_counts()

This shows how many times each token name appears in the column, sorted from most to least frequent.
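One detail to keep in mind with `value_counts()`: by default it skips missing values, so for a column with missing entries the counts do not add up to the number of rows. Passing `dropna=False` includes them. The sketch below uses a short made-up series of token names.

```python
import numpy as np
import pandas as pd

# Made-up token names with two missing entries.
tokens = pd.Series(["Cofoundit", "Raiden", np.nan, "Raiden", np.nan])

# Default behaviour: missing values are left out of the counts.
counts_default = tokens.value_counts()

# Include missing values as their own category.
counts_with_nan = tokens.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```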

(10) Iterate over every column of the dataset and count the occurrences of each distinct value. The code is as follows.

# Iterate over every column and count the occurrences of each value
for col in column:
    print(dataset[col].value_counts())

This code helps in understanding the distribution of each feature or attribute. Running it outputs:

1       3
1458    3
1452    3
1453    3
1454    3
       ..
3527    1
3526    1
3525    1
3524    1
4729    1
Name: Index, Length: 4729, dtype: int64
0x4cd526aa2db72eb1fd557b37c6b0394acd35b212    2
0x4cd3bb2110eda1805dc63abc1959a5ee2d386e9f    2
0x4c1da8781f6ca312bc11217b3f61e5dfdf428de1    2
0x4c24af967901ec87a6644eb1ef42b680f58e67f5    2
0x4c268c7b1d51b369153d6f1f28c61b15f0e17746    2
                                             ..
0x57b417366e5681ad493a03492d9b61ecd0d3d247    1
0x57bb2d6426fed243c633d0b16d4297d12bc20638    1
0x57c0cf70020f0af5073c24cb272e93e7529c6a40    1
0x57ccf2b7ffe5e4497a7e04ac174646f5f16e24ce    1
0xd624d046edbdef805c5e4140dce5fb5ec1b39a3c    1
Name: Address, Length: 9816, dtype: int64
0    7662
1    2179
Name: FLAG, dtype: int64
0.00        3522
2.11          14
########## (part of the output omitted)
Blockwell say NOTSAFU     779
DATAcoin                  358
Livepeer Token            207
                         ... 
BCDN                        1
Egretia                     1
UG Coin                     1
Yun Planet                  1
INS Promo1                  1
Name:  ERC20_most_rec_token_type, Length: 467, dtype: int64
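The FLAG counts in this output (7662 rows with 0 and 2179 rows with 1) quantify the class imbalance mentioned at the start of the section. The sketch below turns those two counts into class shares and a ratio. Only the counts come from the output above; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the FLAG value_counts() output above.
flag_counts = pd.Series({0: 7662, 1: 2179}, name="FLAG")

# Share of each class among all rows.
flag_share = flag_counts / flag_counts.sum()

# Number of rows flagged 0 for every row flagged 1.
imbalance_ratio = flag_counts.loc[0] / flag_counts.loc[1]

print(flag_share.round(4))
print(round(imbalance_ratio, 2))
```

The share of rows with FLAG equal to 1, about 22%, agrees with the FLAG mean of 0.221421 in the descriptive statistics. On the full dataset the same shares can be obtained directly with `dataset['FLAG'].value_counts(normalize=True)`.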