
Please create CachedMemoryMeterStrategy that caches shallow sizes of classes #65

Open
daniel-rusu opened this issue Oct 20, 2023 · 9 comments · May be fixed by #68

Comments

@daniel-rusu
Contributor

I worked on a JVM memory measurement library that used reflection to analyze deep object graphs and found a very significant performance improvement from simply caching shallow sizes. I don't have references, but I believe it was on the order of 10 times faster for some common scenarios.

The implementation would be quite simple: create a new CachedMemoryMeterStrategy class that contains:

  • a HashMap cache from class to shallow size
  • a regular MemoryMeterStrategy to be used as a fallback when measuring a new type of object that hasn't been encountered before

Scenarios where this is extremely beneficial:

  1. Collections and arrays usually store objects of the same type (or a small number of subtypes)
  2. Collections themselves can wrap every item, ending up with many wrappers of the same type (such as Node in HashMap)
  3. Tree structures usually contain a very small number of types (e.g. subtree and leaf nodes)

Even if an object graph contains no repeated classes, this allows us to reuse the same MemoryMeter instance to measure a large number of similar objects, so the cache populated by the first measurement speeds up subsequent measurements.

@daniel-rusu
Contributor Author

The MemoryMeter.Builder could have a flag that controls caching, defaulting to false. When enabled, it would select the MemoryMeterStrategy as usual and wrap it in the fairly trivial CachedMemoryMeterStrategy.

The new CachedMemoryMeterStrategy only needs to override the measure method and delegate everything else to the backing strategy.
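
A minimal sketch of how that could look from the caller's side (withCaching is the hypothetical new flag; the other builder calls reflect my understanding of the Jamm 0.4 API):

// Hypothetical: withCaching(true) would wrap the selected strategy
// in the proposed CachedMemoryMeterStrategy.
val meter = MemoryMeter.builder()
    .withGuessing(MemoryMeter.Guess.INSTRUMENTATION)
    .withCaching(true) // hypothetical flag, defaulting to false
    .build()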

@blerer
Collaborator

blerer commented Oct 20, 2023

As Jamm supports multithreading, you cannot use a simple HashMap and have to use a ConcurrentHashMap or some other concurrent Map. I did some experiments with caching in the past. When running Jamm with Java 17 and the Instrumentation strategy, the JMH benchmarks showed that the caching approach was actually slowing things down. For the other strategies it was effectively beneficial, but you then ended up with the issue that you cannot control how much memory Jamm is using. Considering that if you are serious about performance and the quality of the measurements you would go for Java 17 and the Instrumentation strategy, I decided not to implement that functionality. Of course, I could have missed something and I am willing to reconsider that choice if you have numbers that invalidate my findings. :-)
In the meantime, we designed Jamm such that anybody should be able to use their own custom strategy by creating the MemoryMeter through the MemoryMeter(MemoryMeterStrategy, FieldAndClassFilter, FieldFilter, boolean, MemoryMeterListener.Factory) constructor.

@daniel-rusu
Contributor Author

daniel-rusu commented Oct 20, 2023

Thanks, I tried it out in my project (using Kotlin):

import java.util.concurrent.ConcurrentHashMap

class CachedMemoryMeterStrategy(
    val backingStrategy: MemoryMeterStrategy
) : MemoryMeterStrategy by backingStrategy {
    // Shallow sizes are identical for all instances of a (non-array) class,
    // so they can safely be cached per class.
    private val cache = ConcurrentHashMap<Class<Any>, Long>()

    override fun measure(entity: Any): Long {
        return cache.getOrPut(entity.javaClass) {
            backingStrategy.measure(entity)
        }
    }
}
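
Two notes on this sketch: Kotlin's getOrPut on a ConcurrentMap is backed by putIfAbsent, so under contention the backing strategy may occasionally measure the same class twice, but only one value is ever stored. It also assumes measure is never called for arrays, since array shallow sizes vary with length even though all arrays of a given component type share one class.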

Then I created the MemoryMeter with this function:

private fun createMemoryMeter(strategyType: Guess, cacheShallowSizes: Boolean): MemoryMeter {
    val backingStrategy = MemoryMeterStrategies.getInstance().getStrategy(listOf(strategyType))
    val strategy = when (cacheShallowSizes) {
        true -> CachedMemoryMeterStrategy(backingStrategy)
        else -> backingStrategy
    }
    return MemoryMeter(
        strategy,
        Filters.getClassFilters(/* ignoreKnownSingletons = */ true),
        Filters.getFieldFilters(
            /* ignoreKnownSingletons = */ true,
            /* ignoreOuterClassReference = */ false,
            /* ignoreNonStrongReferences = */ true,
        ),
        NoopMemoryMeterListener.FACTORY,
    )
}
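
For completeness, a usage sketch of the helper above (measureDeep is Jamm's deep-measurement entry point; Guess is its strategy-selection enum):

val meter = createMemoryMeter(Guess.INSTRUMENTATION, cacheShallowSizes = true)
val bytes = meter.measureDeep(listOf("a", "b", "c")) // deep size in bytes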

Here are some benchmarks on JDK 17 using Jamm 0.4.0 (ran each test 3 times and reported the duration in seconds):

Strategy                      Run 1   Run 2   Run 3
SPECIFICATION                 30.8    29.9    31.0
SPECIFICATION with caching    24.2    24.5    23.5
UNSAFE                        30.1    30.4    28.7
UNSAFE with caching           24.2    24.0    23.8
INSTRUMENTATION               24.1    23.7    24.2
INSTRUMENTATION with caching  23.9    23.6    23.5

So instrumentation shows a tiny improvement that's within the margin of error, but the other strategies show a healthy 20% improvement. This suggests that some other overhead is dominating the runtime. I suspect that performance would improve much more if each array element weren't enqueued and were instead processed directly. Either way, a 20% improvement is a good start.

@daniel-rusu
Contributor Author

daniel-rusu commented Oct 20, 2023

Created a separate topic for measuring array elements much more efficiently:
#66

Once that's implemented, I expect the improvements from caching to become more pronounced as the array overhead is probably the main bottleneck when measuring collections.
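
As a rough illustration of the on-the-fly idea (hypothetical code, not Jamm's actual traversal):

// Hypothetical sketch: instead of pushing all N array elements onto the
// measurement stack up front, keep one small cursor per array and pull
// elements from it one at a time.
class ArrayCursor(private val array: Array<Any?>) {
    private var index = 0

    // Returns the next non-null element, or null once the array is exhausted.
    fun next(): Any? {
        while (index < array.size) {
            val element = array[index++]
            if (element != null) return element
        }
        return null
    }
}

This keeps the traversal overhead at one cursor per array instead of one stack entry per element.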

The memory saved by #66 should be greater than the cache overhead when dealing with medium-sized or larger collections. The rest can be left to documentation: when caching is enabled, the shallow size of every class encountered is cached for the lifetime of the MemoryMeter instance, so the cache will keep growing as new types are encountered.

@daniel-rusu
Contributor Author

daniel-rusu commented Oct 21, 2023

@blerer, I just remembered that the caching I added before also included the relevant fields per class, in addition to the shallow size, to avoid re-traversing the inheritance hierarchy and re-filtering the fields each time. Thinking about it some more, I also realized that the strategy level is not the best place for that caching. Setting up a temporary cache each time measureDeep is called makes it easier to cache both the shallow size and the fields without changing the current architecture, and it also addresses the concern about memory usage growing unbounded across multiple usages.
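
A minimal sketch of that class-metadata idea (the names are hypothetical, not Jamm's actual internals), assuming the cache is created at the start of each measureDeep call and discarded when it returns:

import java.lang.reflect.Field
import java.lang.reflect.Modifier

// Hypothetical: everything measureDeep needs to know about a class, computed once per call.
class ClassMetadata(val shallowSize: Long, val referenceFields: List<Field>)

fun metadataFor(
    clazz: Class<*>,
    cache: MutableMap<Class<*>, ClassMetadata>,
    shallowSizeOf: (Class<*>) -> Long, // delegate to the chosen MemoryMeterStrategy
): ClassMetadata = cache.getOrPut(clazz) {
    // Walk the inheritance hierarchy once, keeping only fields that can hold references.
    val fields = generateSequence(clazz) { it.superclass }
        .flatMap { it.declaredFields.asSequence() }
        .filter { !Modifier.isStatic(it.modifiers) && !it.type.isPrimitive }
        .toList()
    ClassMetadata(shallowSizeOf(clazz), fields)
}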

Since I didn't have an easy way to try this, as a hack I temporarily copied the MemoryMeter class, added a local cache inside the measureDeep method, and found that it was now measurably faster than all of the current strategies (about 20% faster than instrumentation and 35% faster than unsafe or spec).

The extra memory from the temporary cache can be offset by the memory savings of handling arrays more efficiently (I could submit that separately). The impact of a temporary cache would become even larger once arrays are handled more efficiently.

Since my hack is in Kotlin, I wanted to get your go-ahead before putting in the effort to create a pull request in Java. Does this sound good to you?

@blerer
Collaborator

blerer commented Oct 24, 2023

@daniel-rusu Your proposal consists of 3 parts:

  1. a new way to measure array elements
  2. a measurement cache
  3. a field cache

Regarding 1, I answered in the #66 ticket.
Regarding 2 and 3, we did experiment with them and chose to stay away from them.
For 2, the main reason is that it is not needed on Java 17 (in practice, our JMH benchmarks showed worse performance) unless you use the Unsafe or Specification strategy. Do you have a use case where you are forced to use those strategies and get better performance?
For 3, it is pretty hard to evaluate the amount of data that will be loaded in memory.
I am fine revisiting those choices in case we have users with real-life performance problems with the library that need to be solved.
Our goal was to create a library that is easy to use, without nasty side effects that could hit you by surprise, and to document it as well as possible so that the special cases are clear from the start.
Running Jamm with Java 17 and Instrumentation should give you the most correct results and pretty good performance without having to add any kind of caching.

Do you have a use case where Jamm's performance is not good enough for you? If yes, could you describe your use case and share numbers?

@daniel-rusu
Contributor Author

@blerer, a few clarifications, and then I'll include my use case at the bottom:

  • The proposal doesn't require all 3 parts, as any of them can provide improvements on their own.
  • I measured a meaningful improvement for all current strategies, as the caching (2 & 3) also makes the instrumentation strategy meaningfully faster when measuring a collection of similar objects, which should be common.
  • 2 & 3 could be a separate measurement cache and field cache; however, it's more efficient in time and space to bundle them together in a single class-metadata cache.
  • This doesn't require sacrificing correctness, as it also meaningfully improves the performance of the instrumentation strategy, so you would never need to switch away from instrumentation; it remains the fastest option when caching is enabled.
  • This doesn't introduce any "nasty" side effects, as the cache would be temporary memory, similar to how Jamm already requires some temporary memory for the measurement stack. I updated ticket Cut memory usage in half by iterating arrays on the fly #66 to clarify that it does actually cut the memory usage of analyzing arrays in half, including a reference to the snippet of code responsible for that. Worst case, this could sit behind a flag, although most people would probably want it enabled by default.
  • You might be surprised to find that combining all 3 improvements would result in a reduction (not an increase) of memory usage for many common use cases involving collections, even after accounting for the cache.
  • Regarding whether performance is "good enough": I'm sure everyone would welcome performance improvements. In my case, performance affects the accuracy of my results, which sounds wrong at first (see below).

My use-case:

I'm using Jamm to analyze the memory efficiency of custom algorithms to see how they scale, as most of these algorithms need to pre-process the data into a custom data structure. To accurately evaluate the memory efficiency of an algorithm, I need to run Monte Carlo simulations with randomized datasets over many variations of parameters that affect the shape and distribution of the dataset. The more simulations I can run, the lower the variance of the results, so I can present higher-accuracy results with lower margins of error. Since I have so many variations to consider in a limited amount of time, performance currently limits the number of runs I can perform, resulting in lower accuracy due to higher variability and wider margins of error in the distributions.

@blerer
Collaborator

blerer commented Oct 26, 2023

We use a single MemoryMeter instance for different measurements inside Apache Cassandra, so the application can run for months and, due to some bug, can sometimes end up measuring a significant number of classes. I am not so much worried by short-lived memory usage; that is fine for us. The issue is with more and more memory being used over time and never released.
Feel free to propose a patch. It is always easier to measure and discuss things with a concrete proposal.

@daniel-rusu daniel-rusu linked a pull request Oct 31, 2023 that will close this issue
@daniel-rusu
Contributor Author

Thanks, submitted PR:
#68

The cache is temporarily created for each invocation of measureDeep so that should address the issue with my initial proposal about it growing unbounded.

I also included another benchmark with collections that contain somewhat average-looking data, along with the results.
