One Article to Fully Understand HashMap

一啖焗捞丁 · 2021-05-01

If you have a grounding in data structures, HashMap is not hard to understand.

Introduction

Before dissecting the source code, let's first read the class-level Javadoc, which already mentions several key points.

/**
 * Hash table based implementation of the <tt>Map</tt> interface.  This
 * implementation provides all of the optional map operations, and permits
 * <tt>null</tt> values and the <tt>null</tt> key.  (The <tt>HashMap</tt>
 * class is roughly equivalent to <tt>Hashtable</tt>, except that it is
 * unsynchronized and permits nulls.)  This class makes no guarantees as to
 * the order of the map; in particular, it does not guarantee that the order
 * will remain constant over time.
 *
 * <p>This implementation provides constant-time performance for the basic
 * operations (<tt>get</tt> and <tt>put</tt>), assuming the hash function
 * disperses the elements properly among the buckets.  Iteration over
 * collection views requires time proportional to the "capacity" of the
 * <tt>HashMap</tt> instance (the number of buckets) plus its size (the number
 * of key-value mappings).  Thus, it's very important not to set the initial
 * capacity too high (or the load factor too low) if iteration performance is
 * important.
 *
 * <p>An instance of <tt>HashMap</tt> has two parameters that affect its
 * performance: <i>initial capacity</i> and <i>load factor</i>.  The
 * <i>capacity</i> is the number of buckets in the hash table, and the initial
 * capacity is simply the capacity at the time the hash table is created.  The
 * <i>load factor</i> is a measure of how full the hash table is allowed to
 * get before its capacity is automatically increased.  When the number of
 * entries in the hash table exceeds the product of the load factor and the
 * current capacity, the hash table is <i>rehashed</i> (that is, internal data
 * structures are rebuilt) so that the hash table has approximately twice the
 * number of buckets.
 *
 * <p>As a general rule, the default load factor (.75) offers a good
 * tradeoff between time and space costs.  Higher values decrease the
 * space overhead but increase the lookup cost (reflected in most of
 * the operations of the <tt>HashMap</tt> class, including
 * <tt>get</tt> and <tt>put</tt>).  The expected number of entries in
 * the map and its load factor should be taken into account when
 * setting its initial capacity, so as to minimize the number of
 * rehash operations.  If the initial capacity is greater than the
 * maximum number of entries divided by the load factor, no rehash
 * operations will ever occur.
 *
 * <p>If many mappings are to be stored in a <tt>HashMap</tt>
 * instance, creating it with a sufficiently large capacity will allow
 * the mappings to be stored more efficiently than letting it perform
 * automatic rehashing as needed to grow the table.  Note that using
 * many keys with the same {@code hashCode()} is a sure way to slow
 * down performance of any hash table. To ameliorate impact, when keys
 * are {@link Comparable}, this class may use comparison order among
 * keys to help break ties.
 *
 * <p><strong>Note that this implementation is not synchronized.</strong>
 * If multiple threads access a hash map concurrently, and at least one of
 * the threads modifies the map structurally, it <i>must</i> be
 * synchronized externally.  (A structural modification is any operation
 * that adds or deletes one or more mappings; merely changing the value
 * associated with a key that an instance already contains is not a
 * structural modification.)  This is typically accomplished by
 * synchronizing on some object that naturally encapsulates the map.
 *
 * If no such object exists, the map should be "wrapped" using the
 * {@link Collections#synchronizedMap Collections.synchronizedMap}
 * method.  This is best done at creation time, to prevent accidental
 * unsynchronized access to the map:<pre>
 *   Map m = Collections.synchronizedMap(new HashMap(...));</pre>
 *
 * <p>The iterators returned by all of this class's "collection view methods"
 * are <i>fail-fast</i>: if the map is structurally modified at any time after
 * the iterator is created, in any way except through the iterator's own
 * <tt>remove</tt> method, the iterator will throw a
 * {@link ConcurrentModificationException}.  Thus, in the face of concurrent
 * modification, the iterator fails quickly and cleanly, rather than risking
 * arbitrary, non-deterministic behavior at an undetermined time in the
 * future.
 *
 * <p>Note that the fail-fast behavior of an iterator cannot be guaranteed
 * as it is, generally speaking, impossible to make any hard guarantees in the
 * presence of unsynchronized concurrent modification.  Fail-fast iterators
 * throw <tt>ConcurrentModificationException</tt> on a best-effort basis.
 * Therefore, it would be wrong to write a program that depended on this
 * exception for its correctness: <i>the fail-fast behavior of iterators
 * should be used only to detect bugs.</i>
 */
 public class HashMap<K,V> extends AbstractMap<K,V> implements Map<K,V>, Cloneable, Serializable {
        //...
 }

Reading the documentation above should give you a rough picture of HashMap. Here are the points I consider most important:

  1. HashMap is implemented on top of a hash table
  2. Two parameters affect HashMap's performance: capacity and load factor
  3. HashMap is not thread-safe

Points 1 and 3 are easy to understand, but what exactly are the performance parameters in point 2? The Javadoc puts it this way:

The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

Roughly, capacity is the number of buckets (slots) in the hash table, and the load factor determines when the number of buckets should be increased.

But how precisely is the load factor defined, and how is it measured? Let's first step back and look at hash tables themselves.

Here is the definition I copied from Wikipedia:

In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.

In short, a hash table computes an index from the key via a hash function and then looks the value up in the backing array; in the ideal case lookups run in constant time, i.e. O(1). A hash table also has one very important parameter, the load factor:

Given a hash table T with m slots storing n elements, the load factor of T is defined as α = n/m, i.e. the average number of elements stored per slot.

Mathematically, then, the load factor is simply the average number of elements per slot; for example, a table with m = 16 slots holding n = 12 entries has α = 12/16 = 0.75. As the load factor grows, the table's performance degrades, i.e. operations get slower. To keep performance acceptable, we keep the load factor below some threshold, and the usual way to do that is to increase the number of slots, i.e. to resize.

However, two different keys can produce the same index under the same hash function; this is called a collision. There are two main ways to resolve collisions:

  • The first stores all colliding keys in the same slot; this is known as separate chaining
  • The second, on a collision, probes onward for another free slot; this is known as open addressing

So how should we choose between them? Chaining is built from an array plus pointers, whereas open addressing uses nothing but an array. By that comparison, open addressing needs less space, so within the same memory budget it can offer more slots, indirectly reducing collisions and speeding up lookups; choosing open addressing might seem inevitable. From the hardware's point of view, however, an array occupies contiguous memory: with a large data set it claims a great deal of contiguous space, and if most of it sits empty, the CPU cache hit rate drops sharply.

Weighing all this, my view is that for small data sets open addressing is the better first choice: its small footprint and freedom from extra allocations genuinely beat chaining; for large data sets, prefer chaining. The concrete choice must be weighed against your workload, and certain mechanisms deserve attention in that weighing. Take deletion under open addressing: to keep later lookups working, a deleted element is usually replaced with a marker (a tombstone) rather than cleared, which means the load factor can no longer be held under a fixed threshold. That may well be one of the reasons HashMap resolves hash collisions with chaining. A minimal sketch of the tombstone mechanism follows.
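To make the tombstone point concrete, here is a minimal, illustrative open-addressing sketch with linear probing (the class and all names are mine, not JDK code; it assumes the 16-slot table never fills up). Deletion marks the slot instead of clearing it, so probes for other keys still walk past it, but the marked slots keep counting toward the table's fullness:

// Illustrative sketch only: a tiny open-addressing table with linear probing.
class OpenAddressingSketch {
    private static final Object TOMBSTONE = new Object(); // marks deleted slots
    private final Object[] keys = new Object[16];          // capacity fixed for brevity
    private final Object[] vals = new Object[16];

    void put(Object key, Object val) {
        int i = key.hashCode() & (keys.length - 1);
        int firstFree = -1;
        // Probe for the key itself, remembering the first reusable slot on the way
        while (keys[i] != null) {
            if (keys[i] == TOMBSTONE) {
                if (firstFree < 0) firstFree = i;
            } else if (keys[i].equals(key)) {
                vals[i] = val; // key already present: update in place
                return;
            }
            i = (i + 1) & (keys.length - 1);
        }
        int slot = (firstFree >= 0) ? firstFree : i;
        keys[slot] = key;
        vals[slot] = val;
    }

    Object get(Object key) {
        int i = key.hashCode() & (keys.length - 1);
        // Tombstones do NOT stop the probe; that is exactly their purpose
        while (keys[i] != null) {
            if (keys[i] != TOMBSTONE && keys[i].equals(key))
                return vals[i];
            i = (i + 1) & (keys.length - 1);
        }
        return null;
    }

    void remove(Object key) {
        int i = key.hashCode() & (keys.length - 1);
        while (keys[i] != null) {
            if (keys[i] != TOMBSTONE && keys[i].equals(key)) {
                keys[i] = TOMBSTONE; // mark rather than clear, so later probes
                vals[i] = null;      // for other keys still walk past this slot
                return;
            }
            i = (i + 1) & (keys.length - 1);
        }
    }
}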

Source Code

With the analysis above you should have a preliminary understanding of HashMap; let's now analyze it from the source code.

Storage Structure

First we locate the array that stores the data; it corresponds to the hash table's slots.

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

/**
 * The maximum capacity, used if a higher value is implicitly specified
 * by either of the constructors with arguments.
 * MUST be a power of two <= 1<<30.
 */
static final int MAXIMUM_CAPACITY = 1 << 30;

/**
 * The table, initialized on first use, and resized as
 * necessary. When allocated, length is always a power of two.
 * (We also tolerate length zero in some operations to allow
 * bootstrapping mechanics that are currently not needed.)
 */
transient Node<K,V>[] table;

In HashMap, the default and maximum sizes of this storage array are 16 and 2^30 respectively.

Note the sentence in the comment: length is always a power of two. The reason is that HashMap maps a hash value to an index with a modulo operation, and when the divisor is a power of two the modulo can be replaced by a bitwise operation, i.e. a mod b ==> a & (b-1). Compared with an actual remainder instruction, the bitwise version is noticeably faster.
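As a quick sanity check of the equivalence (the class name and sample values are mine): for non-negative a and power-of-two b, a % b and a & (b - 1) agree.

public class ModByMask {
    public static void main(String[] args) {
        int b = 16; // a power of two, like HashMap's table length
        for (int a : new int[]{3, 35, 97, Integer.MAX_VALUE}) {
            // Both expressions keep only the low log2(b) bits of a
            System.out.printf("%d %% %d = %d,  %d & (%d - 1) = %d%n",
                    a, b, a % b, a, b, a & (b - 1));
        }
    }
}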

To keep the table size a power of two, HashMap normalizes any user-supplied capacity with tableSizeFor() (resizing itself simply doubles, which preserves the property):

/**
 * Returns a power of two size for the given target capacity.
 */
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

tableSizeFor() returns the smallest power of two greater than or equal to its argument. It works by setting every bit below n's highest set bit to 1, then adding 1 to reach the next power of two; the initial cap - 1 prevents a value that is already a power of two from being doubled. For example:

  n               001x xxxx xxxx xxxx
  n |= n >>> 1;   0011 xxxx xxxx xxxx
  n |= n >>> 2;   0011 11xx xxxx xxxx
  n |= n >>> 4;   0011 1111 11xx xxxx
  n |= n >>> 8;   0011 1111 1111 1111
  n |= n >>> 16;  0011 1111 1111 1111
  n+1             0100 0000 0000 0000

Since the table capacity is a 32-bit int (capped at 2^30), cascading shifts up to 16 bits (1 + 2 + 4 + 8 + 16 = 31 bit positions in total) are enough to smear the highest set bit across all lower positions.
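A few sample inputs make the behavior easy to see; this demo (class name mine) simply copies the method above into a standalone class:

public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Verbatim copy of HashMap.tableSizeFor for experimentation
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16: already a power of two, not doubled
        System.out.println(tableSizeFor(17)); // 32
    }
}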

HashMap creates and initializes its storage lazily, as the constructors show:

/**
 * The load factor for the hash table.
 */
final float loadFactor;

/**
 * The next size value at which to resize (capacity * load factor).
 */
// (The javadoc description is true upon serialization.
// Additionally, if the table array has not been allocated, this
// field holds the initial array capacity, or zero signifying
// DEFAULT_INITIAL_CAPACITY.)
int threshold;

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("...");
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("...");
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

public HashMap(Map<? extends K, ? extends V> m) {
    this.loadFactor = DEFAULT_LOAD_FACTOR;
    putMapEntries(m, false);
}

Notice that none of the constructors create or initialize the storage array; they only set the loadFactor and threshold fields. Initialization is deferred until data is first inserted.

loadFactor is the load factor, and threshold is the resize threshold for the table (rather than recomputing the threshold on every modification, HashMap stores the precomputed value in threshold).
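As a minimal sketch of that relationship (not the JDK's internal bookkeeping; as the comment above notes, before the table is allocated threshold temporarily holds the initial capacity):

public class ThresholdDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f;  // DEFAULT_LOAD_FACTOR
        int capacity = 16;         // DEFAULT_INITIAL_CAPACITY
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12: the 13th insertion triggers a resize
    }
}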

In HashMap, the table's default load factor is 0.75. As for why, the official documentation passes over it briefly:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost.

In other words, the default of 0.75 is a good compromise between time and space: a higher load factor reduces memory overhead but increases lookup cost. The official explanation feels rather perfunctory; after searching around for a long time, I eventually found a fairly good answer worth reading: 《HashMap的负载因子为什么是0.75》

Hash Function

The other core piece of a hash table is the hash function itself. Here is HashMap's, with its explanatory comment:

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

As you can see, HashMap's hash function builds on Object's hashCode(). We won't discuss how hashCode() itself is computed; instead, let's explore why the result is XORed with its own upper 16 bits.

The comment explains it. HashMap implements modulo via the bitwise trick mentioned earlier (a mod b ==> a & (b-1); here (b-1) is b's mask, which for convenience I'll call b's power-of-two mask below). If two keys' hash values differ only in bits above the table's power-of-two mask, masking yields the same index, raising the collision probability. So the implementation spreads the upper 16 bits into the lower 16 (XORing the high half into the low half) to reduce collisions. A quick example:

Table mask (length 16 minus 1): lenV = 00000000 00000000 00000000 00001111

key1's hash: key1 = 00000000 00000010 00000000 00000011
key2's hash: key2 = 00000000 00000001 00000000 00000011

With plain masking: key1 & lenV == key2 & lenV (both give index 3)

With the official spreading step, the collision disappears:
 ((key1>>>16) ^ key1) & lenV != ((key2>>>16) ^ key2) & lenV (indexes 1 and 2)

Hash Collisions

On the other hand, because HashMap resolves collisions with chaining, there is a weakness: once a single bucket accumulates enough nodes, lookups degrade to O(n). JDK 1.8 added an optimization for this case, described in HashMap's implementation notes:

/*
 * Implementation notes.
 *
 * This map usually acts as a binned (bucketed) hash table, but
 * when bins get too large, they are transformed into bins of
 * TreeNodes, each structured similarly to those in
 * java.util.TreeMap. Most methods try to use normal bins, but
 * relay to TreeNode methods when applicable (simply by checking
 * instanceof a node).  Bins of TreeNodes may be traversed and
 * used like any others, but additionally support faster lookup
 * when overpopulated. However, since the vast majority of bins in
 * normal use are not overpopulated, checking for existence of
 * tree bins may be delayed in the course of table methods.
 *
 * Tree bins (i.e., bins whose elements are all TreeNodes) are
 * ordered primarily by hashCode, but in the case of ties, if two
 * elements are of the same "class C implements Comparable<C>",
 * type then their compareTo method is used for ordering. (We
 * conservatively check generic types via reflection to validate
 * this -- see method comparableClassFor).  The added complexity
 * of tree bins is worthwhile in providing worst-case O(log n)
 * operations when keys either have distinct hashes or are
 * orderable, Thus, performance degrades gracefully under
 * accidental or malicious usages in which hashCode() methods
 * return values that are poorly distributed, as well as those in
 * which many keys share a hashCode, so long as they are also
 * Comparable. (If neither of these apply, we may waste about a
 * factor of two in time and space compared to taking no
 * precautions. But the only known cases stem from poor user
 * programming practices that are already so slow that this makes
 * little difference.)
 *
 * Because TreeNodes are about twice the size of regular nodes, we
 * use them only when bins contain enough nodes to warrant use
 * (see TREEIFY_THRESHOLD). And when they become too small (due to
 * removal or resizing) they are converted back to plain bins.  In
 * usages with well-distributed user hashCodes, tree bins are
 * rarely used.  Ideally, under random hashCodes, the frequency of
 * nodes in bins follows a Poisson distribution
 * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 * parameter of about 0.5 on average for the default resizing
 * threshold of 0.75, although with a large variance because of
 * resizing granularity. Ignoring variance, the expected
 * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 * factorial(k)). The first values are:
 *
 * 0:    0.60653066
 * 1:    0.30326533
 * 2:    0.07581633
 * 3:    0.01263606
 * 4:    0.00157952
 * 5:    0.00015795
 * 6:    0.00001316
 * 7:    0.00000094
 * 8:    0.00000006
 * more: less than 1 in ten million
 *
 * The root of a tree bin is normally its first node.  However,
 * sometimes (currently only upon Iterator.remove), the root might
 * be elsewhere, but can be recovered following parent links
 * (method TreeNode.root()).
 *
 * All applicable internal methods accept a hash code as an
 * argument (as normally supplied from a public method), allowing
 * them to call each other without recomputing user hashCodes.
 * Most internal methods also accept a "tab" argument, that is
 * normally the current table, but may be a new or old one when
 * resizing or converting.
 *
 * When bin lists are treeified, split, or untreeified, we keep
 * them in the same relative access/traversal order (i.e., field
 * Node.next) to better preserve locality, and to slightly
 * simplify handling of splits and traversals that invoke
 * iterator.remove. When using comparators on insertion, to keep a
 * total ordering (or as close as is required here) across
 * rebalancings, we compare classes and identityHashCodes as
 * tie-breakers.
 *
 * The use and transitions among plain vs tree modes is
 * complicated by the existence of subclass LinkedHashMap. See
 * below for hook methods defined to be invoked upon insertion,
 * removal and access that allow LinkedHashMap internals to
 * otherwise remain independent of these mechanics. (This also
 * requires that a map instance be passed to some utility methods
 * that may create new nodes.)
 *
 * The concurrent-programming-like SSA-based coding style helps
 * avoid aliasing errors amid all of the twisty pointer operations.
 */

These notes give us the answer: treeification. When a bucket holds enough nodes, its linked list is converted into a tree (structured like TreeMap's, i.e. a red-black tree), bringing lookups down to O(log n).

If the complexity can be reduced to O(log n), why not use trees from the start?

Because a TreeNode is about twice the size of a regular node, and most buckets never hold many nodes, HashMap defers treeification.

A bucket is treeified when its node count grows past one threshold, and converted back to a list when it shrinks past another. For these two thresholds, HashMap uses 8 and 6 respectively:

/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage.
 */
static final int TREEIFY_THRESHOLD = 8;

/**
 * The bin count threshold for untreeifying a (split) bin during a
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at
 * most 6 to mesh with shrinkage detection under removal.
 */
static final int UNTREEIFY_THRESHOLD = 6;

The rationale for these two values is covered in the JDK 8 notes quoted above; here I'll lay out the calculation behind the threshold on its own.

Under well-distributed hash codes, the number of nodes per bucket in a HashMap follows a Poisson distribution.

Poisson distribution: In probability theory and statistics, the Poisson distribution (/ˈpwɑːsɒn/; French pronunciation: ​[pwasɔ̃]), named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.[1] The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

With a load factor of 0.75, the average number of entries per bucket works out to about 0.5; plugging the parameter 0.5 into the Poisson formula gives the probability of a bucket holding exactly k elements:

 0:    0.60653066
 1:    0.30326533
 2:    0.07581633
 3:    0.01263606
 4:    0.00157952
 5:    0.00015795
 6:    0.00001316
 7:    0.00000094
 8:    0.00000006
 more: less than 1 in ten million

The numbers show that the probability of a bucket receiving 8 entries is already down to about one in ten million, effectively a non-event, so the treeify threshold was set to 8. As for why the treeify threshold (8) differs from the untreeify threshold (6): the gap prevents buckets from flip-flopping between list and tree. The table can be reproduced directly from the formula, as shown below.
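The table above follows from the Poisson probability mass function P(k) = e^(-λ) · λ^k / k! with λ = 0.5; this small program (class name mine) reproduces it:

public class PoissonBins {
    public static void main(String[] args) {
        double lambda = 0.5;
        double p = Math.exp(-lambda);      // P(0) = e^(-lambda)
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, p);
            p = p * lambda / (k + 1);      // P(k+1) = P(k) * lambda / (k+1)
        }
    }
}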

An open question for the reader: with a load factor of 0.75, how is the Poisson mean of 0.5 derived?

Reaching the node-count threshold alone is not enough: HashMap also checks the table size, and if it is below MIN_TREEIFY_CAPACITY it resizes instead of treeifying. See the code:

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;   

/**
 * Replaces all linked nodes in bin at index for given hash unless
 * table is too small, in which case resizes instead.
 */
final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // Decide: resize, or treeify this bin
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        TreeNode<K,V> hd = null, tl = null;
        // Convert every list node into a tree node
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        // Build the red-black tree from the converted nodes
        if ((tab[index] = hd) != null)
            hd.treeify(tab);
    }
}

A bucket must pass all these checks before it is treeified, which means that in most cases treeification never happens.

Core Operations

With the background covered, let's turn to HashMap's lookup, insertion, and deletion. For readability I have reshaped some of the code, keeping the logic unchanged while hiding most of the details.

// Lookup
final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null 
            && (n = tab.length) > 0 
            && (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash) {
            if (first instanceof TreeNode) {
                // Tree bin: binary search down the tree
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            } else {
                // List bin: linear traversal
                return ((LinkedNode<K,V>)first).getLinkedNode(hash, key);
            }
        }
    }
    return null;
}
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // The lazy creation/initialization mentioned above happens here
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // Insert the entry
    if ((p = tab[i = (n - 1) & hash]) == null) {
        // Empty bucket: insert the new node directly
        tab[i] = newNode(hash, key, value, null);
    } else {
        Node<K,V> e; K k;
        // Fetch the existing node e, or create a new one
        if (p instanceof TreeNode)
            // Tree bin
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            // List bin; treeifies if the bucket's node count exceeds the threshold
            e = ((LinkedNode<K,V>)p).putLinkedVal(this, tab, hash, key, value);
        }
        // Write the value into node e
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            // Post-access hook
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // Resize if the size exceeds the threshold
    if (++size > threshold)
        resize();
    // Post-insertion hook
    afterNodeInsertion(evict);
    return null;
}
final Node<K,V> removeNode(int hash, Object key, Object value,
                            boolean matchValue, boolean movable) {
    Node<K,V>[] tab; Node<K,V> p; int n, index;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (p = tab[index = (n - 1) & hash]) != null) {
        Node<K,V> node = null, e; K k; V v;
        if (p.hash == hash) {
            if (p instanceof TreeNode)
                // Tree bin
                node = ((TreeNode<K,V>)p).getTreeNode(hash, key);
            else {
                // List bin
                node = ((LinkedNode<K,V>)p).getLinkedNode(hash, key);
            }
        }
        if (node != null && (!matchValue || (v = node.value) == value ||
                                (value != null && value.equals(v)))) {
            if (node instanceof TreeNode)
                // Tree bin; untreeifies if the node count drops below the threshold
                ((TreeNode<K,V>)node).removeTreeNode(this, tab, movable);
            else
                // List bin
                ((LinkedNode<K,V>)node).removeLinkedNode(this, tab, movable);
            ++modCount;
            --size;
            // Post-removal hook
            afterNodeRemoval(node);
            return node;
        }
    }
    return null;
}

In short, for lookup, insertion, and deletion alike, HashMap first locates the bucket through the hash function, then finds the matching element within that bucket and performs the query, insert, or delete there.
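From the caller's side these internals sit behind the familiar public API; a small usage example:

import java.util.HashMap;
import java.util.Map;

public class HashMapUsage {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>(); // table not allocated yet (lazy)
        map.put("a", 1);                            // first put allocates via resize()
        map.put("a", 2);                            // same key: value replaced
        System.out.println(map.get("a"));           // 2
        System.out.println(map.remove("a"));        // 2: removeNode unlinks and returns the node
        System.out.println(map.get("a"));           // null
    }
}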

One thing I find less than ideal: HashMap puts the tree operations inside TreeNode but leaves the list operations at the outer level, which makes the code style inconsistent; the list operations could equally live in Node or a LinkedNode, though that is just my opinion. I've also noticed the JDK sources love folding assignments into conditional expressions, which costs readers extra effort.

Resizing

Finally, let's take a close look at resize(). It is arguably one of HashMap's most important methods, and probably the most popular interview topic around it.

/**
 * Initializes or doubles table size.  If null, allocates in
 * accord with initial capacity target held in field threshold.
 * Otherwise, because we are using power-of-two expansion, the
 * elements from each bin must either stay at same index, or move
 * with a power of two offset in the new table.
 *
 * @return the table
 */
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    // Old table size
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    // Old resize threshold
    int oldThr = threshold;
    // New table size
    int newCap;
    // New resize threshold
    int newThr = 0;
    
    // Compute the new table size
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                    oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }

    // Compute the new resize threshold
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                    (int)ft : Integer.MAX_VALUE);
    }

    threshold = newThr;
    // Allocate the enlarged table
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    // Swap in the new table
    table = newTab;
    
    // If the old table is non-empty, migrate its entries
    if (oldTab != null) {
        // Migrate bucket by bucket
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                // Single node in the bucket: move it directly
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                // Tree bin: split it between the two tables
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                // List bin: migrate the chain
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

/**
 * Entry for Tree bins. Extends LinkedHashMap.Entry (which in turn
 * extends Node) so can be used as extension of either regular or
 * linked node.
 */
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {

    /**
     * Splits nodes in a tree bin into lower and upper tree bins,
     * or untreeifies if now too small. Called only from resize;
     * see above discussion about split bits and indices.
     *
     * @param map the map
     * @param tab the table for recording bin heads
     * @param index the index of the table being split
     * @param bit the bit of hash to split on
     */
    final void split(HashMap<K,V> map, Node<K,V>[] tab, int index, int bit) {
        TreeNode<K,V> b = this;
        // Relink into lo and hi lists, preserving order
        TreeNode<K,V> loHead = null, loTail = null;
        TreeNode<K,V> hiHead = null, hiTail = null;
        int lc = 0, hc = 0;
        // Walk the bin as a linked list, splitting into lo and hi lists
        for (TreeNode<K,V> e = b, next; e != null; e = next) {
            next = (TreeNode<K,V>)e.next;
            e.next = null;
            if ((e.hash & bit) == 0) {
                if ((e.prev = loTail) == null)
                    loHead = e;
                else
                    loTail.next = e;
                loTail = e;
                ++lc;
            }
            else {
                if ((e.prev = hiTail) == null)
                    hiHead = e;
                else
                    hiTail.next = e;
                hiTail = e;
                ++hc;
            }
        }

        // Convert back to a list, or keep/rebuild the tree
        if (loHead != null) {
            // Small enough to turn back into a list?
            if (lc <= UNTREEIFY_THRESHOLD)
                tab[index] = loHead.untreeify(map);
            else {
                tab[index] = loHead;
                if (hiHead != null) // (else is already treeified)
                    loHead.treeify(tab);
            }
        }
        // Convert back to a list, or keep/rebuild the tree
        if (hiHead != null) {
            // Small enough to turn back into a list?
            if (hc <= UNTREEIFY_THRESHOLD)
                tab[index + bit] = hiHead.untreeify(map);
            else {
                tab[index + bit] = hiHead;
                if (loHead != null)
                    hiHead.treeify(tab);
            }
        }
    }
}

Reading the source, you'll find that resize() already covers every growth scenario you'll encounter in practice. It proceeds in 3 steps:

  1. Compute the post-resize capacity and threshold
  2. Create the new storage array from those results
  3. Migrate the data from the old storage array into the new one

There is a subtle trick here: AND-ing the hash with the old table length oldCap decides whether an entry needs a new bucket after the rehash. If hash & oldCap == 0 the bucket index is unchanged; otherwise the entry moves to a new bucket. This works because masking with the power-of-two mask truncates the high bits, and after the resize the mask gains exactly one extra bit, the bit at the old length's position; so AND-ing the hash with the old length checks whether that bit is 1 and hence whether the bucket index changes. The example below spells it out (a runnable check follows it):

    Before the resize:
            length= 0001 0000   ==> power-of-two mask   0000 1111
            
            hash1 = 0000 1000   ==>    hash1&(length-1)==index1 ==>  0000 1000    ===> bucket 1    
            hash2 = 0001 1000   ==>    hash2&(length-1)==index1 ==>  0000 1000    ===> bucket 1    

    After the resize:
            length= 0010 0000   ==> power-of-two mask   0001 1111

            hash1 = 0000 1000   ==>    hash1&(length-1)==index1 ==>  0000 1000    ===> bucket 1    
                                       hash1&old_length ==> 0000 0000 == 0

            hash2 = 0001 1000   ==>    hash2&(length-1)==index2 ==>  0001 1000    ===> bucket 2    
                                       hash2&old_length ==> 0000 1000 != 0

    The code also computes the new bucket with a neat identity: new index = old index + old length
    (an OR would work here too). In bit terms:

            index1    0000 1000
      + old_length    0001 0000 
            index2    0001 1000
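Here is a runnable check of the split rule under a few arbitrary sample hashes (class name mine):

public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16, newCap = oldCap << 1; // a doubling resize
        for (int hash : new int[]{0x08, 0x18, 0x25, 0x3F}) {
            int oldIdx = hash & (oldCap - 1);
            int newIdx = hash & (newCap - 1);
            // Predicted without recomputing against the full new mask
            int predicted = (hash & oldCap) == 0 ? oldIdx : oldIdx + oldCap;
            System.out.printf("hash=0x%02X  old=%2d  new=%2d  predicted=%2d%n",
                    hash, oldIdx, newIdx, predicted);
        }
    }
}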

Another common resize() question concerns JDK 1.7, where a concurrent resize could drive the map into an infinite loop (often loosely described as a deadlock). Here is the key code that triggers it:

// Migrate the old table into the new table
void transfer(Entry[] newTable){
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;    // 1
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];    // 2
                newTable[i] = e;    
                e = next;
            } while (e != null);
        }
    }
}

The two statements that cause the infinite loop are marked with comments 1 and 2. Let me walk through them in detail:

In a single-threaded run, the migration here uses head insertion, as shown in the diagram below.

Single-threaded:
                                                              +
                                                              |
        +----+                                                |       +----+
        |    |                                                |       |    |
        |    |                                                |       |    |
        +----+     +-------+     +--------+     +--------+    |       +----+      +--------+     +--------+
        |    +---->+   e   +---->+  next  +---->+  last  |    |       |    +----->+  next  +---->+  last  |
        |    |     +-------+     +--------+     +--------+    |       |    |      +--------+     +--------+
        +----+                                                |       +----+
        |    |                                                |       |    |
        |    |                                                |       |    |
        +----+                                                |       +----+
        |    |                                              X |       |    |
        |    |                                              XXX       |    |     +-------+
        +----+                                     XXXXXXXXXXXXXX     +----+     |   e   |
                                                   XXXXXXXXXXXXXXX               +-+---+-+
                                                   XXXXXXXXXXXXXX                  ^   |
        +----+                                              XXX       +----+       |   |
        |    |                                              X |       |    |       |   |
        |    |                                                |       |    |       |   |
        +----+     +---------+                                |       +----+       |   |    +---------+
        |    +---->+  first  |                                |       |    +-------+   +--->+  first  |
        |    |     +---------+                                |       |    |                +---------+
        +----+                                                |       +----+
        |    |                                                |       |    |
        |    |                                                |       |    |
        +----+                                                |       +----+
        |    |                                                |       |    |
        |    |                                                |       |    |
        +----+                                                |       +----+
                                                              |
                                                              |
                                                              +

Under multiple threads, suppose thread 1 blocks just after reaching statement 1. Thread 2 then runs on and migrates nodes e and next into the new table; because of head insertion, the links at this point are newTable[i] -> next -> e, as the diagram below shows.

Multithreaded / thread 2:

        +----+
        |    |
        |    |
        +----+      +--------+
        |    +----->+  last  |
        |    |      +--------+
        +----+
        |    |
        |    |
        +----+
        |    |
        |    |
        +----+


        +----+
        |    |
        |    |
        +----+      +--------+     +-------+     +---------+
        |    +----->+  next  +---->+   e   +---->+  first  |
        |    |      +--------+     +-------+     +---------+
        +----+
        |    |
        |    |
        +----+
        |    |
        |    |
        +----+

Thread 1 now wakes up, resumes from statement 1, and by the time it executes statement 2 the following relationships hold:

  1. e.next is set to newTable[i], i.e. e -> newTable[i]
  2. Thread 2 had already pointed newTable[i] at next, i.e. newTable[i] -> next
  3. Combining 1 and 2 gives e -> next, while thread 2 has already made next -> e, so the references form a cycle

Multithreaded / thread 1:

        +----+
        |    |
        |    |
        +----+      +--------+
        |    +----->+  last  |
        |    |      +--------+
        +----+
        |    |
        |    |
        +----+
        |    |
        |    |
        +----+        
        

        +----+           +--------------------+
        |    |           |                    |
        |    |           v                    |
        +----+      +----+---+     +-------+  |  +---------+
        |    +----->+  next  +---->+   e   +--+  |  first  |
        |    |      +--------+     +-------+     +---------+
        +----+
        |    |
        |    |
        +----+
        |    |
        |    |
        +----+

Because thread 1 saved e.next (the next node) into its local variable before thread 2 ran, thread 1's following loop iteration comes back to node next, and the traversal loops forever.

The root cause is JDK 1.7 migrating rehashed elements with head insertion; JDK 1.8 switched to tail insertion, which fixed this particular problem. Still, putting the non-thread-safe HashMap into a multithreaded environment is the wrong approach to begin with; even with this bug fixed, other problems remain (and experience has borne this out). A sketch of a safer alternative follows.
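For concurrent writes, a minimal sketch using ConcurrentHashMap (the Javadoc quoted earlier also suggests wrapping with Collections.synchronizedMap at creation time):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentUsage {
    public static void main(String[] args) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        counts.merge("key", 1, Integer::sum);   // atomic per-key read-modify-write
        System.out.println(counts.get("key"));  // 1
    }
}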

Q&A

That basically wraps things up. To finish, a few soul-searching HashMap questions:

  1. Why is HashMap's capacity a power of two?
  2. Why does HashMap use a red-black tree here rather than a skip list?
  3. Why can HashMap's keys and values be null?
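As a hint for question 3: the hash() method shown earlier maps a null key to 0, so it lands in bucket 0. A quick check (class name mine):

import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(null, "v");                       // hash(null) == 0, so bucket 0
        System.out.println(map.get(null));        // v
        map.put("k", null);                       // null values are permitted too
        System.out.println(map.containsKey("k")); // true, even though get("k") is null
    }
}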

Further Reading
