哈希表(底层结构剖析--下)-Toy模板网

这篇具有很好参考价值的文章主要介绍了哈希表(底层结构剖析--下)。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

开散列哈希桶的模拟实现

哈希桶的基本框架

由于哈希桶的本质就是一个存有结点指针的数组.所以哈希桶存储的数据类型便是结点指针类型.
哈希结点中包括两个内置成员:
1:K,V模型组成键值对pair<K,V>.(也可以是K模型)

2: 指向下一个结点的指针.

template <class K, class V>
	struct HashNode                      //哈希结点
	{
		pair<K, V> _kv;
		HashNode<K, V>* _next;

		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			, _next(nullptr)
		{}
	};

哈希桶的本质是一个指针数组.
其内置成员有两个:
1: 装有结点指针的vector.

2: 记录哈希桶的有效数据个数_size.

template < class K, class V>
struct HashTable
{
    public:
      //...
    private:
     vector<Node*> _table;   //哈希桶.
	 size_t _size = 0;       //记录哈希桶有效数据个数.
}

增加仿函数将数据类型转换为整型

此处仿函数与闭散列的仿函数一致并且接下来的插入,查找,删除等函数都调用仿函数,以下为仿函数代码:

template < class K >                   
struct HashFunc
{
	size_t operator()(const K& key)
	{
		return (size_t)key;
	}
};

template< >                           //仿函数的默认值,如果不显示传就会默认调用.      
struct HashFunc <string>                   
{
	size_t operator()(const string& key)
	{
		size_t  val = 0;
		for ( auto& ch : key ) //遍历string,将一个个字母替换成ASCI码进行相加.
		{ 
		    val *= 131;
			val += ch;
		}
		return val;
	}
};

template < class K, class V,class Hash = HashFunc<K>>   //模板参数增加一个仿函数.
struct HashTable
{
    public:
      //...
    private:
     vector<Node*> _table;   //哈希桶.
	 size_t _size = 0;       //记录哈希桶有效数据个数.
}

哈希桶的插入函数

哈希桶的插入主要分为3个步骤:
1: 去重.

2: 扩容

3: 插入.

以下就这3个步骤展开详谈:
1: 去重
(1)调用find函数,如果找到了,就说明哈希表中已经有了该数据,不需要插入,返回false.

(2)如果没找到,就将该数据插入.

2: 扩容
(1):计算新表的长度,创建新表.

(2): 遍历旧表,将旧表的数据按照头插的方式插入到哈希桶中.

(3): 调用vector中的swap函数,新表与旧表交换.

3: 插入
(1): 通过哈希函数计算映射位置.

(2) 将旧表数据头插移入到新表中.

(3) 更新_size.
哈希表(底层结构剖析--下)

以下为插入函数完整代码:

bool Insert( const pair<K, V>& kv )
		{
			if (Find(kv.first))                         //去重
			{
				return false;
		 	}
            //方法一:                               调用Insert函数
           //HashTable<K, V> newHT; 
          //newHT._table.resize(_table.size());
          //for (auto cur : _table)
          //{
          //    while (cur)
         //    {
         //        newHT.Insert(cur->_kv);
         //        cur = cur->_next;
        // }
                  //}

           // _table.swap(newHT._table);
            //方法二:                                //直接将旧表数据移入到旧表中.
			if (_size == _table.size())                                //扩容.
			{
			//	size_t newSize = _table.size() == 0 ? 10 : 2 * _table.size();
				vector<Node*> newTable;
				newTable.resize(GetNextPrime(_table.size()), nullptr);    //扩容

				Hash hash;
				//从旧表结点移动映射新表.
				for (size_t i = 0; i < _table.size(); ++i)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;

						size_t hashi = hash( cur->_kv.first) % newTable.size();
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}
					_table[i] = nullptr;


				}
				_table.swap(newTable);


			}
			Hash hash1;
			size_t hashi = hash1(kv.first) % _table.size();
			Node* newNode = new Node(kv);
			newNode->_next = _table[hashi];
			_table[hashi] = newNode;
			++_size;
			return true;
		}

注意:
方法一调用了Insert函数这样虽然可以解决创建新的哈希表导致每个数据的位置打乱,导致要重新计算插入位置的问题.可是,通过insert复用的办法,是生成了新的结点插入,函数栈帧结束后,旧表的数据结点通过调用析构函数销毁,可是,这样就会造成老的结点没有被充分运用的问题.

哈希桶的删除函数

能不能传pair键值对,调用find找到该结点进行删除呢?

不能,因为我们所设计的哈希桶为单链表(节省内存),find只能找到目标结点的位置,不能找到前一个结点的位置.因此,删除后删除结点的前后结点不能顺利连接.

删除函数主要分为3个步骤:
(1): 通过哈希函数确定删除目标结点哈希桶位置.

(2): 对哈希桶进行遍历,寻找删除目标结点.

(3)如果找到就删除该结点,如果没找到,就保留当前结点的位置,查找下一个结点.

哈希表(底层结构剖析--下)

	bool  Erase(const K& key)
		{
			if (_table.size() == 0)
			{
				return true;
			}
			Hash hash;
			size_t hashi = hash(key) % _table.size();
			Node* prev = nullptr;
			Node* cur = _table[hashi];
			while (cur)                      //找到哈希桶的位置,开始寻找目标结点.
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}
					delete cur;
					--_size;
					return true;
				}

				prev = cur;                //没找到就保存这个结点
				cur = cur->_next;         
			}
			return false;
		}

哈希桶的查找函数

哈希桶的查找步骤如下:
(1): 如果表为空,表明找不到了,返回空.

(2): 通过哈希函数计算目标结点哈希桶的位置.

(3): 遍历哈希桶,如果找到,返回当前结点指针,没找到返回空.

Node* Find(const K& key)
		{
			if (_table.size() == 0)        //表为空,就返回nullptr.
			{
				return nullptr;
			}
			Hash hash;
			size_t hashi = hash(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if ( cur->_kv.first == key )
				{
					return cur;
				}
				cur = cur->_next;

			}
			return nullptr;

		}

哈希桶的析构函数

当函数栈帧结束时,由于哈希表会调用它自己的析构函数,所以我们就不用对它额外显示写了.
但是,哈希桶中的结点没有析构函数无法析构,此时则需要我们额外对哈希桶显示写析构函数.
析构函数步骤如下:
(1): 遍历哈希表,找到每个哈希桶的头指针依次对哈希桶进行遍历删除(在删除时要提前保留下一个节点地址),直到整个表中的哈希桶全部删除.
(2) 删除完一个哈希桶就将该哈希桶的头指针设为nullptr.


		~HashTable()         //vvector会调用析构函数,但是哈希桶必须自己写.
		{
			for (size_t i = 0; i < _table.size(); ++i)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}
				_table[i] = nullptr;
			}
		//	cout << "~HushTable()" << endl;

		}

建议哈希表的大小为素数

根据实验显示,使用除留余数法时,哈希表的大小最好是素数,这样能够降低哈希冲突概率. 可是怎么才能快速取一个类似两倍关系的素数?

STL中特意创建了一个素数数组,当插入函数需要扩容时,则遍历素数数组找到第一个大于哈希表中的数据个数的素数,这个素数就为新表的大小.

size_t GetNextPrime(size_t prime)
		{
			const int PRIMECOUNT = 28;
			//素数序列
			const size_t primeList[PRIMECOUNT] =
			{
					53, 97, 193, 389, 769,
				1543, 3079, 6151, 12289, 24593,
				49157, 98317, 196613, 393241, 786433,
				1572869, 3145739, 6291469, 12582917, 25165843,
				50331653, 100663319, 201326611, 402653189, 805306457,
				1610612741, 3221225473, 4294967291
			};
			size_t i = 0;
			for (i = 0; i < PRIMECOUNT; i++)
			{
				if (primeList[i] > prime)         //从这个数组里面取第一个大于prime的值.
					return primeList[i];
			}
			return primeList[i];                 //虽然数据不可能那么大,但是也有可能不会走if判断,
			                                     // 所以从语法上来说还要考虑所给值prime大于素数数组整个数据的情况.
		}

注意:
一般来讲,因为内存限制,我们不太可能需要一个最大素数大小的哈希表,但是有可能我们传的值会比最大素数还大.但是,从语法上考虑,当这种情况发生,我们应该也要返回一个值,这里返回素数数组中最大素数.

开散列与闭散列比较

应用链地址法(开散列)处理溢出，需要增设链接指针，似乎增加了存储开销。事实上：由于闭散列必须保持大量的空闲空间以确保搜索效率，如二次探查法要求装载因子a <= 0.7，而表项所占空间又比指针大的多(闭散列存储数据类型为pair键值对与状态构成的对象所以比指针大的多)，所以使用链地址法(开散列)反而比闭散列节省存储空间

哈希桶的时间复杂度及其测试

我们知道,在最坏情况下,所给的数据都发生哈希冲突,此时的时间复杂度为O(N),但是这种情况出现的概率极低,那么哈希桶的时间复杂度是多少呢?

哈希桶随机数测试:
但测试数据为100000个时,此时负载因子为0.677594,其中在62681个桶中最长的桶的长度才为2,平均每个桶的长度才为1.06283.
哈希表(底层结构剖析--下)
当测试数据为200000个时,此时负载因子为0.660307可见和上次相比这次测试已经扩了容,但是最长的桶的长度也仅为2,平均桶的长度为1.03515.

综合以上情况: 在平均情况下,哈希表中每个位置的哈希桶的数据个数大概为常数个,所以时间复杂度一般为O(1).文章来源地址https://www.toymoban.com/news/detail-427361.html

开散列哈希桶的模拟实现完整代码

#include <map>
#include <vector>
#include <string>
#include <iostream>
using namespace std;                     
namespace HashBucket
{
	template < class K >                   //仿函数的默认值,如果不显示传就会默认调用.
	struct HashFunc
	{
		size_t operator()(const K& key)
		{
			return (size_t)key;
		}
	};

	template< >                                 //1:对于常见类型,为了方便,我们可以对类模板进行特化.
	struct HashFunc <string>                    //并且根据实参所传类型,优先走特化版本.	                                     
	{
		size_t operator()(const string& key)
		{
			size_t  val = 0;
			for (auto& ch : key) //遍历string,将一个个字母替换成ASCI码进行相加.
			{
				val += ch;
			}
			return val;
		}
	};
	template <class K, class V>
	struct HashNode
	{
		pair<K, V> _kv;
		HashNode<K, V>* _next;

		HashNode(const pair<K, V>& kv)
			:_kv(kv)
			, _next(nullptr)
		{}
	};
	template < class K, class V, class Hash = HashFunc<K> >
	struct HashTable
	{
		typedef HashNode<K, V> Node;
	public:
		size_t GetNextPrime(size_t prime)
		{
			const int PRIMECOUNT = 28;
			//素数序列
			const size_t primeList[PRIMECOUNT] =
			{
				53, 97, 193, 389, 769,
				1543, 3079, 6151, 12289, 24593,
				49157, 98317, 196613, 393241, 786433,
				1572869, 3145739, 6291469, 12582917, 25165843,
				50331653, 100663319, 201326611, 402653189, 805306457,
				1610612741, 3221225473, 4294967291
			};
			size_t i = 0;
			for (i = 0; i < PRIMECOUNT; i++)
			{
				if (primeList[i] > prime)         //从这个数组里面取第一个大于prime的值.
					return primeList[i];
			}
			return primeList[i];                 //虽然数据不可能那么大,但是也有可能不会走if判断,
			                                     // 所以从语法上来说还要考虑所给值prime大于素数数组整个数据的情况.
		}

		~HashTable()         //vvector会调用析构函数,但是哈希桶必须自己写.
		{
			for (size_t i = 0; i < _table.size(); ++i)
			{
				Node* cur = _table[i];
				while (cur)
				{
					Node* next = cur->_next;
					delete cur;
					cur = next;
				}
				_table[i] = nullptr;
			}
		//	cout << "~HushTable()" << endl;

		}

		bool  Erase(const K& key)
		{
			if (_table.size() == 0)
			{
				return true;
			}
			Hash hash;
			size_t hashi = hash(key) % _table.size();
			Node* prev = nullptr;
			Node* cur = _table[hashi];
			while (cur)
			{
				if (cur->_kv.first == key)
				{
					if (prev == nullptr)
					{
						_table[hashi] = cur->_next;
					}
					else
					{
						prev->_next = cur->_next;
					}
					delete cur;
					--_size;
					return true;
				}

				prev = cur;
				cur = cur->_next;
			}
			return false;
		}

		bool Insert(const pair<K, V>& kv)
		{
			if (Find(kv.first))                         //去重
			{
				return false;
			}

			if (_size == _table.size())                                //扩容.
			{
			//	size_t newSize = _table.size() == 0 ? 10 : 2 * _table.size();
				vector<Node*> newTable;
				newTable.resize(GetNextPrime(_table.size()), nullptr);    //扩容

				Hash hash;
				//从旧表结点移动映射新表.
				for (size_t i = 0; i < _table.size(); ++i)
				{
					Node* cur = _table[i];
					while (cur)
					{
						Node* next = cur->_next;

						size_t hashi = hash( cur->_kv.first) % newTable.size();
						cur->_next = newTable[hashi];
						newTable[hashi] = cur;

						cur = next;
					}
					_table[i] = nullptr;


				}
				_table.swap(newTable);


			}
			Hash hash1;
			size_t hashi = hash1(kv.first) % _table.size();
			Node* newNode = new Node(kv);
			newNode->_next = _table[hashi];
			_table[hashi] = newNode;
			++_size;
			return true;
		}

		Node* Find(const K& key)
		{
			if (_table.size() == 0)        //表为空,就返回nullptr.
			{
				return nullptr;
			}
			Hash hash;
			size_t hashi = hash(key) % _table.size();
			Node* cur = _table[hashi];
			while (cur)
			{
				if ( cur->_kv.first == key )
				{
					return cur;
				}
				cur = cur->_next;

			}
			return nullptr;

		}

		size_t size()                  //哈希表的数据个数
		{
			return _size;
		}
		size_t TablesSize()          //表的长度
		{
			return  _table.size();
		}
		
		size_t BucketNum()           //有多少个桶被用了.
		{
			size_t Num = 0;
			for (size_t i = 0; i < _table.size(); ++i)
			{
				if (_table[i])
				{
					++Num;
				}
			}
			return Num;
		}
		size_t MaxBucketLenth()          //哈希桶的最大桶长
		{
			size_t maxLen = 0;
			for (size_t i = 0; i < _table.size(); ++i)
			{
				size_t len = 0;
				Node* cur = _table[i];
				while (cur)
				{
					++len;
					cur = cur->_next;
				}
				if ( len > maxLen )
				{
					maxLen = len;
				}
			}
			return maxLen;
		}
	private:
		vector<Node*> _table;
		size_t _size = 0;
	};
	void TestHT3()                          //哈希桶时间复杂度测试.
	{

		int n = 200000;
		vector<int> v;
		v.reserve(n);
		srand(time(0));
		for (int i = 0; i < n; ++i)
		{
			//v.push_back(i);
			v.push_back(rand() + i);  // 重复少
			//v.push_back(rand());  // 重复多
		}

		size_t begin1 = clock();
		HashTable<int, int> ht;
		for (auto e : v)
		{
			ht.Insert(make_pair(e, e));
		}
		size_t end1 = clock();

		cout << "数据个数:" << ht.size() << endl;
		cout << "表的长度:" << ht.TablesSize() << endl;
		cout << "桶的个数:" << ht.BucketNum() << endl;
		cout << "平均每个桶的长度:" << (double)ht.size() / (double)ht.BucketNum() << endl;
		cout << "最长的桶的长度:" << ht.MaxBucketLenth() << endl;
		cout << "负载因子:" << (double)ht.size() / (double)ht.TablesSize() << endl;
	}
}