"JVM Anatomy Quarks" is the on-going mini-post series, where every post is describing some elementary piece of knowledge about JVM. The name underlines the fact that the single post cannot be taken in isolation, and most pieces described here are going to readily interact with each other.
The post should take about 5-10 minutes to read. As such, it goes deep on only a single topic, a single test, a single benchmark, a single observation. The evidence and discussion here might be anecdotal, and not actually reviewed for errors, consistency, writing style, syntactic and semantic mistakes, or duplicates. Use and/or trust this at your own risk.
The JVM uses memory for different purposes: it stores its internal VM state in native memory, and it provides the storage for Java objects ("Java heap"). We have seen the native memory part of the story in "Native Memory Tracking", but in many applications the major contender is the Java heap itself.
The Java heap is normally managed by an automatic memory manager, sometimes called a garbage collector. A naive GC would allocate a large block of memory from the underlying OS memory manager, and slice it up itself to accept allocations. This immediately means that even if there are only a few Java objects in the heap, from the perspective of the OS the JVM process has acquired all the possible memory for the Java heap.
So, if we want to have unused parts of the Java heap returned to the OS, we need cooperation from the GC.
There are two ways to achieve this cooperation: do more frequent GCs instead of "expanding" the Java heap towards -Xmx; or explicitly uncommit unused parts of the Java heap, even after the heap has inflated to -Xmx. The first way helps only so much, and usually only in the earlier phases of application lifetime: eventually, applications would like to allocate a lot. In this piece, we will concentrate on the second part: what to do when the heap is already inflated.
What do modern GCs do on this front?
Footprint measurement is tricky, because we have to define what footprint actually is. Since we are talking about footprint from the perspective of the OS, it makes the most sense to measure the RSS (resident set size) of the entire JVM process, which includes both the native VM memory and the Java heap.
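As a concrete way to sample RSS, one can ask `ps` or read `/proc` directly. A minimal sketch, using the shell's own PID as a stand-in target; for a real measurement, substitute the JVM's PID (the "petclinic" match pattern mentioned in the comment is an assumption about your command line):

```shell
#!/bin/sh
# Sample the resident set size (RSS) of a process, in kilobytes.
# We use the shell's own PID ($$) as a stand-in; for the JVM, substitute
# the java process PID, e.g. from: pgrep -f petclinic
PID=$$
RSS_KB=$(ps -o rss= -p "$PID" | tr -d ' ')
echo "RSS of process $PID: ${RSS_KB} KB"
```

Sampling this in a loop over the application lifecycle produces exactly the kind of RSS chart discussed below.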
The other significant question is when to measure the footprint. It stands to reason that the amount of application data differs across the phases of the application lifecycle. That is especially true when the application deliberately optimizes for footprint, with lazy/delayed operations that only happen when the actual work comes along. The easiest mistake to make while capacity-planning for footprint is to start such an application, snapshot its footprint, and then blow all the estimates when the actual work comes in.
Automatic memory managers usually react to what is happening in the application: they trigger GCs based on allocation pressure, free space availability, idleness, etc. Measuring the footprint only in the active phase is probably not very telling either. This is further exacerbated by the observation that most applications in the world (outside high-load servers) are idle most of the time, or run on a low duty cycle.
All this means we need the application to go through different lifecycle phases to see the different faces of the memory footprint story. Let us take the simple spring-boot-petclinic project and run it with different GCs. These are the configurations we use:
Serial GC: the go-to GC for small-heap applications. It has low native overhead, a bit more aggressive GC policies, etc;
G1 GC: the workhorse of OpenJDK, default since JDK 9;
Shenandoah GC: the concurrent GC from Red Hat. We include it here to show some behaviors a footprint-savvy GC would have. For the purposes of this experiment, Shenandoah runs in two modes: the default mode, and the compact mode that tunes the collector for the lowest footprint.
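For reference, these configurations map to JVM flags roughly like this (a sketch: the jar name is a placeholder, and Shenandoah availability and flag spelling depend on the JDK build, per the Shenandoah documentation):

```shell
# Serial GC
java -XX:+UseSerialGC -jar petclinic.jar
# G1 GC (default since JDK 9, so the flag is redundant there)
java -XX:+UseG1GC -jar petclinic.jar
# Shenandoah GC, default mode
java -XX:+UseShenandoahGC -jar petclinic.jar
# Shenandoah GC, compact mode
java -XX:+UseShenandoahGC -XX:ShenandoahGCHeuristics=compact -jar petclinic.jar
```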
The experiment is driven by this simple script. We use OpenJDK 11 as a decently recent JDK, but the same can be demonstrated with OpenJDK 8, as GC behaviors are not significantly different between 8 and 11 in this test.
Let us digest the RSS charts. What can we see here?
During startup, all GCs try to cope with the small initial heap, and many do frequent GCs. This keeps them from inflating the heap too much. After the initial active phase is done, workloads stabilize at some particular footprint level. In the absence of any GC triggers, this level is largely defined by the heuristics used for triggering the GC during startup, even if the amount of data stored in the heap is the same. This gets especially quirky when the heuristics have to guess what the user wanted from the acupuncture of 100+ GC options.
Same RSS chart as above, repeated for convenience:
When load comes, GC heuristics again have a few things to decide. Depending on the GC, its implementation and configuration, it has to choose whether to expand the heap or run more aggressive GC cycles.
Here, Serial GC decided to perform more cycles. G1 inflated to around 3/4 of the max heap, and started doing moderately frequent cycles to cope with the allocation pressure. Shenandoah in default mode, being a concurrent GC running in a dense heap, opted to inflate the heap as much as possible to maintain application concurrency without too-frequent cycles. Shenandoah in compact mode, being instructed to maintain a low footprint, opted for much more aggressive cycles.
This is corroborated by the actual GC frequency logs:
More frequent GC cycles also mean more CPU needed to deal with GC work:
While most of the lines are noisy here, we can clearly see "Shenandoah (compact)" taking quite some additional time to work. That is the price we pay for the denser footprint. In other words, this is a manifestation of the throughput-latency-footprint tradeoff. There are, of course, tunable settings to say how much we want to trade, and this experiment only shows the difference between two rather polar defaults: prefer throughput, and prefer footprint. Since Shenandoah is a concurrent GC, even performing effectively back-to-back GCs does not stall the application all that much.
Same RSS chart as above, repeated for convenience:
When the application goes idle, GCs may decide to return some resources. The obvious thing to do is uncommitting parts of the empty heap. This is rather simple to do if the heap is already sliced into independent chunks, for example with a regionalized collector like G1 or Shenandoah. Still, the GC has to decide if/when to do it.
Many OpenJDK GCs perform GC-related actions only in conjunction with the actual GC cycles. But an interesting thing happens here. Most OpenJDK GCs are allocation-triggered, which means they start the cycle when a particular heap occupancy has been reached. If the application goes into an idle state abruptly, it also stops allocating, so whatever occupancy level it is at right now would linger until something happens. This makes some sense for stop-the-world GCs, because we do not really want to start a long-ish GC pause just because we feel like it.
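To make the "allocation-triggered" point concrete, take G1 as an example: its concurrent cycle starts when heap occupancy crosses a threshold, controlled by a flag (a sketch; 45 is the documented default, and the jar name is a placeholder):

```shell
# G1 starts a concurrent cycle when heap occupancy crosses this
# percentage (45 is the default). If the application goes idle while
# below the threshold, no cycle triggers, and occupancy lingers.
java -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=45 -jar petclinic.jar
```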
There is no particular need to hook uncommit up to the GC cycle to begin with. In the case of Shenandoah, there is an asynchronous periodic uncommit, and we can see it in action as the first large drop in the idle phase. For this experiment, the uncommit delay was deliberately set to 5 seconds, and we can see it indeed happened after a few seconds of idling. This uncommitted the regions that were emptied by the last GC cycle and have not been allocated into since.
But there is another significant part of the story: since the application went idle abruptly, there is some floating garbage that we would like to collect. This provides the motivation for a periodic GC that knocks the lingering garbage out. The periodic GC is responsible for the second big step down in the idle phase. It frees up new regions for the periodic uncommit to deal with later.
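In Shenandoah, both behaviors can be controlled by flags (names per the Shenandoah documentation; values in milliseconds, and the particular interval values here are illustrative assumptions). The 5-second uncommit delay used in this experiment would look like:

```shell
# ShenandoahUncommitDelay: uncommit heap regions that have stayed empty
#   longer than this many milliseconds (5 s, as in this experiment).
# ShenandoahGuaranteedGCInterval: force a GC cycle if none happened
#   within this interval, knocking out lingering garbage while idle.
java -XX:+UseShenandoahGC \
     -XX:ShenandoahUncommitDelay=5000 \
     -XX:ShenandoahGuaranteedGCInterval=30000 \
     -jar petclinic.jar
```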
If GC cycles were frequent enough already (see "Shenandoah (compact)"), the effect of all this is largely irrelevant, as the footprint is already quite low, and nothing excessive has been committed on top.
Same RSS chart as above, repeated for convenience:
Again, doing periodic GCs is less intrusive with a concurrent GC implementation: if load comes back up while we are mid-GC-cycle, nothing bad is going to happen. That is in contrast to a STW GC, which would have to guess whether performing a major GC cycle is a good idea or not. In the worst case, we would have to explicitly tell the JVM to perform it, and at least G1 reacts to this request reliably. Note how the footprint for most collectors is down to the same level after a Full GC, and how periodic GC and uncommit got there much earlier without user intervention.
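Explicitly telling the JVM to perform a GC can be done from the outside with jcmd, which ships with the JDK. A small wrapper might look like this (a sketch; the process-matching pattern is an assumption about your application's command line):

```shell
#!/bin/sh
# Request an explicit GC in every JVM whose command line matches $1.
# "jcmd <pid> GC.run" is equivalent to calling System.gc() in-process,
# and is honored unless -XX:+DisableExplicitGC is set.
request_gc() {
  for PID in $(pgrep -f "$1"); do
    jcmd "$PID" GC.run
  done
}
request_gc 'petclinic'   # hypothetical pattern; matching nothing is harmless
```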
Periodic GCs. Performing periodic GC cycles helps to knock out lingering garbage. Concurrent GCs routinely perform periodic GC cycles: Shenandoah and ZGC are known to do it. G1 is supposed to gain this feature in JDK 12 with JEP 346. Otherwise, one can employ an external or internal agent to call for GC periodically when the time is right, with the hard part being defining when that time is. See, for example, the Jelastic GC Agent.
Heap uncommit. Many GCs already uncommit heap memory when they think it is a good idea: Shenandoah does it asynchronously even without GC requests, G1 surely does it on explicit GC requests, and Serial and Parallel likely do it under some conditions as well. ZGC is going to do it soon too, hopefully in JDK 12. G1 is supposed to deal with the synchronicity problem by performing periodic GC cycles with JEP 346 in JDK 12. Of course, there is a trade-off: committing memory back in may take a while, so practical implementations impose some timeouts before uncommitting.
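Once JEP 346 lands, G1's periodic behavior is expected to be controlled by flags along these lines (names per the JEP text; not available before JDK 12, and the interval value is an illustrative assumption):

```shell
# G1PeriodicGCInterval: trigger a GC if none happened for this many
#   milliseconds; 0 disables the periodic behavior.
# G1PeriodicGCInvokesConcurrent: do a concurrent cycle rather than a Full GC.
java -XX:+UseG1GC \
     -XX:G1PeriodicGCInterval=30000 \
     -XX:+G1PeriodicGCInvokesConcurrent \
     -jar petclinic.jar
```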
Footprint-targeted GCs. Many GCs provide flexible options to make GC cycles more frequent to optimize for footprint. Even something like increasing the frequency of periodic GCs would help to knock the garbage out earlier. Some GCs give you pre-canned configuration packages that instruct the implementation to make footprint-savvy choices, including more frequent/periodic GC cycles and uncommits, like Shenandoah's "compact" mode.
Every time you see that switching to some GC implementation made the footprint happy, do understand why and how it did so. This will help you clearly see what you paid for it, and whether you can achieve the same without any migration.