JVM Anatomy Quark #10: String.intern()

About, Disclaimers, Contacts

"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.

The post should take about 5-10 minutes to read. As such, it goes deep for only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, not actually reviewed for errors, consistency, writing 'tyle, syntaxtic and semantically errors, duplicates, or also consistency. Use and/or trust this at your own risk.

Aleksey Shipilëv, JVM/Performance Geek
Shout out at Twitter: @shipilev; Questions, comments, suggestions: aleksey@shipilev.net

Question

How exactly String.intern() works? Should I avoid it?

Theory

If you have ever studied String Javadocs, you would know there is an interesting method in the public API:

public String intern()
Returns a canonical representation for the string object. A pool of strings, initially empty, is maintained privately by the class String.

When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

— JDK Javadoc
java.lang.String

This reads as if String provides the user-accessible entry to String pool, and we can use it to optimize for memory, right? However, that comes with a drawback: in OpenJDK, String.intern() is native, and it actually calls into JVM, to intern the String in the native JVM String pool. This is due the to the fact that String interning is a part of JDK-VM interface when both VM native and JDK code have to agree on identity of particular String objects.

There are implications for having the implementation like that:

You need to cross the JDK-JVM interface on every intern(), which wastes cycles.
The performance is at the mercy of the native HashTable implementation, which may lag behind what is available in high-performance Java world, especially under concurrent access.
Since Java Strings are references from the native VM structures, they become the part of GC rootset. In many cases, that requires additional work during the GC pauses to process.

Does this matter?

Experiment: Throughput

Once again, we can construct the simple experiment. Both deduplication and interning and trivally implementable with HashMap and ConcurrentHashMap, which gives us a very nice JMH benchmark:

@State(Scope.Benchmark)
public class StringIntern {

    @Param({"1", "100", "10000", "1000000"})
    private int size;

    private StringInterner str;
    private CHMInterner chm;
    private HMInterner hm;

    @Setup
    public void setup() {
        str = new StringInterner();
        chm = new CHMInterner();
        hm = new HMInterner();
    }

    public static class StringInterner {
        public String intern(String s) {
            return s.intern();
        }
    }

    @Benchmark
    public void intern(Blackhole bh) {
        for (int c = 0; c < size; c++) {
            bh.consume(str.intern("String" + c));
        }
    }

    public static class CHMInterner {
        private final Map<String, String> map;

        public CHMInterner() {
            map = new ConcurrentHashMap<>();
        }

        public String intern(String s) {
            String exist = map.putIfAbsent(s, s);
            return (exist == null) ? s : exist;
        }
    }

    @Benchmark
    public void chm(Blackhole bh) {
        for (int c = 0; c < size; c++) {
            bh.consume(chm.intern("String" + c));
        }
    }

    public static class HMInterner {
        private final Map<String, String> map;

        public HMInterner() {
            map = new HashMap<>();
        }

        public String intern(String s) {
            String exist = map.putIfAbsent(s, s);
            return (exist == null) ? s : exist;
        }
    }

    @Benchmark
    public void hm(Blackhole bh) {
        for (int c = 0; c < size; c++) {
            bh.consume(hm.intern("String" + c));
        }
    }
}

The test tries to intern lots of Strings, but the actual interning happens only for the first walk through the loop, and then we only checking the String after the existing mappings. size parameter controls the number of Strings we intern, thus limiting the String table size we are dealing with. This is the usual case with interners like that.

Running this with JDK 8u131:

Benchmark             (size)  Mode  Cnt       Score       Error  Units

StringIntern.chm           1  avgt   25       0.038 ±     0.001  us/op
StringIntern.chm         100  avgt   25       4.030 ±     0.013  us/op
StringIntern.chm       10000  avgt   25     516.483 ±     3.638  us/op
StringIntern.chm     1000000  avgt   25   93588.623 ±  4838.265  us/op

StringIntern.hm            1  avgt   25       0.028 ±     0.001  us/op
StringIntern.hm          100  avgt   25       2.982 ±     0.073  us/op
StringIntern.hm        10000  avgt   25     422.782 ±     1.960  us/op
StringIntern.hm      1000000  avgt   25   81194.779 ±  4905.934  us/op

StringIntern.intern        1  avgt   25       0.089 ±     0.001  us/op
StringIntern.intern      100  avgt   25       9.324 ±     0.096  us/op
StringIntern.intern    10000  avgt   25    1196.700 ±   141.915  us/op
StringIntern.intern  1000000  avgt   25  650243.474 ± 36680.057  us/op

Oops, what gives? String.intern() is significantly slower! The answer lies somewhere in the native implementation ("native" does not equal "better", folks), which is clearly visible in with perf record -g:

-    6.63%     0.00%  java     [unknown]           [k] 0x00000006f8000041
   - 0x6f8000041
      - 6.41% 0x7faedd1ee354
         - 6.41% 0x7faedd170426
            - JVM_InternString
               - 5.82% StringTable::intern
                  - 4.85% StringTable::intern
                       0.39% java_lang_String::equals
                       0.19% Monitor::lock
                     + 0.00% StringTable::basic_add
                  - 0.97% java_lang_String::as_unicode_string
                       resource_allocate_bytes
                 0.19% JNIHandleBlock::allocate_handle
                 0.19% JNIHandles::make_local

While the JNI transition costs quite a bit on itself, we seem to spend quite some time in StringTable implementation. Poking around it, you will eventually discover -XX:+PrintStringTableStatistics, which will print something like:

StringTable statistics:
Number of buckets       :     60013 =    480104 bytes, avg   8.000
Number of entries       :   1002714 =  24065136 bytes, avg  24.000
Number of literals      :   1002714 =  64192616 bytes, avg  64.019
Total footprint         :           =  88737856 bytes
Average bucket size     :    16.708  ; <---- !!!!!!

16 elements per bucket in a chained hash table speaks "overload, overload, overload". What is worse, that string table is not resizeable — although there was experimental work to make them resizable, that was shot down for "reasons" . It might be alleviated with setting larger -XX:StringTableSize, for example to 10M:

Benchmark             (size)  Mode  Cnt       Score       Error  Units

# Default, copied from above
StringIntern.chm           1  avgt   25       0.038 ±     0.001  us/op
StringIntern.chm         100  avgt   25       4.030 ±     0.013  us/op
StringIntern.chm       10000  avgt   25     516.483 ±     3.638  us/op
StringIntern.chm     1000000  avgt   25   93588.623 ±  4838.265  us/op

# Default, copied from above
StringIntern.intern        1  avgt   25       0.089 ±     0.001  us/op
StringIntern.intern      100  avgt   25       9.324 ±     0.096  us/op
StringIntern.intern    10000  avgt   25    1196.700 ±   141.915  us/op
StringIntern.intern  1000000  avgt   25  650243.474 ± 36680.057  us/op

# StringTableSize = 10M
StringIntern.intern        1  avgt    5       0.097 ±     0.041  us/op
StringIntern.intern      100  avgt    5      10.174 ±     5.026  us/op
StringIntern.intern    10000  avgt    5    1152.387 ±   558.044  us/op
StringIntern.intern  1000000  avgt    5  130862.190 ± 61200.783  us/op

…but this is only a palliative measure, because you have to plan this in advance. You will waste memory if you blindly set String table size to large value, and do not use it. Even with large StringTable that you fully use, the native call costs are still eating away cycles.

Experiment: GC pauses

But what would trigger the most dramatic consequence of native String table is that it is the part of GC roots! Which means, it should be scanned/updated by the garbage collector specially. In OpenJDK, that means doing hard work during the pause. Indeed, for Shenandoah where pauses depend mostly on GC root set size, having just 1M records in String table yields this:

$ ... StringIntern -p size=1000000 --jvmArgs "-XX:+UseShenandoahGC -Xlog:gc+stats -Xmx1g -Xms1g"
...
Initial Mark Pauses (G)    = 0.03 s (a = 15667 us) (n = 2) (lvls, us = 15039, 15039, 15039, 15039, 16260)
Initial Mark Pauses (N)    = 0.03 s (a = 15516 us) (n = 2) (lvls, us = 14844, 14844, 14844, 14844, 16088)
  Scan Roots               = 0.03 s (a = 15448 us) (n = 2) (lvls, us = 14844, 14844, 14844, 14844, 16018)
    S: Thread Roots        = 0.00 s (a =    64 us) (n = 2) (lvls, us =    41,    41,    41,    41,    87)
    S: String Table Roots  = 0.03 s (a = 13210 us) (n = 2) (lvls, us = 12695, 12695, 12695, 12695, 13544)
    S: Universe Roots      = 0.00 s (a =     2 us) (n = 2) (lvls, us =     2,     2,     2,     2,     2)
    S: JNI Roots           = 0.00 s (a =     3 us) (n = 2) (lvls, us =     2,     2,     2,     2,     4)
    S: JNI Weak Roots      = 0.00 s (a =    35 us) (n = 2) (lvls, us =    29,    29,    29,    29,    42)
    S: Synchronizer Roots  = 0.00 s (a =     0 us) (n = 2) (lvls, us =     0,     0,     0,     0,     0)
    S: Flat Profiler Roots = 0.00 s (a =     0 us) (n = 2) (lvls, us =     0,     0,     0,     0,     0)
    S: Management Roots    = 0.00 s (a =     1 us) (n = 2) (lvls, us =     1,     1,     1,     1,     1)
    S: System Dict Roots   = 0.00 s (a =     9 us) (n = 2) (lvls, us =     8,     8,     8,     8,    11)
    S: CLDG Roots          = 0.00 s (a =    75 us) (n = 2) (lvls, us =    68,    68,    68,    68,    81)
    S: JVMTI Roots         = 0.00 s (a =     0 us) (n = 2) (lvls, us =     0,     0,     0,     0,     1)

So, you have +13 ms per pause just because we decided to put more stuff in the root set.

This prompts some GC implementations to only do the String table cleanups when something heavy is also done. For example, it makes little sense from JVM perspective to clean String table if classes were not unloaded — because loaded classes are the major sources of interned Strings. So, this workload would exhibit interesting behaviors at least in G1 and CMS:

public class InternMuch {
  public static void main(String... args) {
    for (int c = 0; c < 1_000_000_000; c++) {
      String s = "" + c + "root";
      s.intern();
    }
  }
}

Running with CMS:

$ java -XX:+UseConcMarkSweepGC -Xmx2g -Xms2g -verbose:gc -XX:StringTableSize=6661443 InternMuch

GC(7) Pause Young (Allocation Failure) 349M->349M(989M) 357.485ms
GC(8) Pause Initial Mark 354M->354M(989M) 3.605ms
GC(8) Concurrent Mark
GC(8) Concurrent Mark 1.711ms
GC(8) Concurrent Preclean
GC(8) Concurrent Preclean 0.523ms
GC(8) Concurrent Abortable Preclean
GC(8) Concurrent Abortable Preclean 935.176ms
GC(8) Pause Remark 512M->512M(989M) 512.290ms
GC(8) Concurrent Sweep
GC(8) Concurrent Sweep 310.167ms
GC(8) Concurrent Reset
GC(8) Concurrent Reset 0.404ms
GC(9) Pause Young (Allocation Failure) 349M->349M(989M) 369.925ms

So far so relatively good. Walking the overloaded String table takes a while. But the most damning thing would be to disable class unloading with -XX:-ClassUnloading. This effectively disables String table cleanup in regular GC cycles! You can guess what happens next:

$ java -XX:+UseConcMarkSweepGC -Xmx2g -Xms2g -verbose:gc -XX:-ClassUnloading -XX:StringTableSize=6661443 InternMuch

GC(11) Pause Young (Allocation Failure) 273M->308M(989M) 338.999ms
GC(12) Pause Initial Mark 314M->314M(989M) 66.586ms
GC(12) Concurrent Mark
GC(12) Concurrent Mark 175.625ms
GC(12) Concurrent Preclean
GC(12) Concurrent Preclean 0.539ms
GC(12) Concurrent Abortable Preclean
GC(12) Concurrent Abortable Preclean 2549.523ms
GC(12) Pause Remark 696M->696M(989M) 133.920ms
GC(12) Concurrent Sweep
GC(12) Concurrent Sweep 175.949ms
GC(12) Concurrent Reset
GC(12) Concurrent Reset 0.463ms
GC(14) Pause Full (Allocation Failure) 859M->0M(989M) 1541.465ms  <---- !!!
GC(13) Pause Young (Allocation Failure) 859M->0M(989M) 1541.515ms

Full STW GC, my old friend. For CMS, there is ExplicitGCInvokesConcurrentAndUnloadsClasses that kinda alleviates that, assuming user will call System.gc() sometimes.

Observations

We are only discussing the ways one can achieve interning/deduplication, under the presumption it is needed for either memory footprint improvements, or low-level == optimization, or some other obscure need. Those needs can be accepted or challenged separately. For more details about Java Strings, I’d plug my own talk, "java.lang.String Catechism".

For OpenJDK, String.intern() is the gateway to native JVM String table, and it comes with caveats: throughput, memory footprint, pause time problems will await the users. It is very easy to underestimate the impact of these caveats. Hand-rolled deduplicators/interners are working much more reliably, because they are working on Java side, are just the regular Java objects, generally better sized/resized, and also can be thrown away completely when not needed anymore. GC-assisted String deduplication does alleviate things even more.

In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization. Do not use String.intern() without thinking very hard about it, okay?