
Insert iTableWalk in _interfaceSlotsUnavailable #22171


Merged
1 commit merged from p-itablewalk into eclipse-openj9:master on Jul 21, 2025

Conversation

@rmnattas (Contributor) commented Jul 2, 2025

This PR inserts an iTableWalk for POWER into the IPIC _interfaceSlotsUnavailable helper, performed before calling the VM helper, to improve interface call performance on a cache miss.

_interfaceSlotsUnavailable was chosen because it is the target called by the snippet once the two cache slots in the compilation snippet have been populated and a cache miss still occurs. Interface calls therefore keep using the existing paths to find targets and populate the caches (_interfaceCallHelper and _interfaceCompeteSlot2), and the snippet call is then patched to _interfaceSlotsUnavailable, where the iTableWalk is performed.

New implementation path, after the caches are populated (a conceptual Java sketch of this path follows the layout below):

  1. Mainline cache check for the 2 snippet cache slots
  2. IPIC lastITableCheck (checks the J9Class.lastITable cache)
  3. IPIC iTableWalk (checks J9Class.iTable[0..N-1]) NEW
  4. Call VM helper

_interfaceSlotsUnavailable layout:

<lastItable check
    b hitCall if hit>
<iTableWalk
    b vmhelperCall if all-miss>
<hitCall>

<vmhelperCall>
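
For context, below is a conceptual sketch of the full lookup order after this change, written as plain Java standing in for the generated IPIC code and PicBuilder glue. All type, field, and constant names in the sketch are illustrative stand-ins, not the actual OpenJ9 structures.

// Conceptual sketch only: plain Java standing in for the generated IPIC code on POWER.
// Type/field names (ReceiverClass, ITableEntry, SnippetCache, ...) are illustrative.
final class InterfaceDispatchSketch {
    static final int N = 4;                 // iTable entries walked before giving up
    static final long CALL_VM_HELPER = -1L; // sentinel: fall back to the VM helper

    static final class ITableEntry {        // stands in for a J9ITable entry
        Class<?> interfaceClass;
        long targetAddress;                 // resolved target for this call site
        ITableEntry next;
    }

    static final class ReceiverClass {      // stands in for J9Class
        ITableEntry lastITable;             // J9Class.lastITable cache
        ITableEntry iTableHead;             // head of the iTable chain
    }

    static final class SnippetCache {       // the two POWER snippet cache slots
        final ReceiverClass[] clazz = new ReceiverClass[2];
        final long[] target = new long[2];
    }

    static long lookup(ReceiverClass recv, Class<?> iface, SnippetCache cache) {
        // 1. Mainline check of the two snippet cache slots populated by
        //    _interfaceCallHelper / _interfaceCompeteSlot2.
        for (int slot = 0; slot < 2; slot++) {
            if (cache.clazz[slot] == recv) {
                return cache.target[slot];
            }
        }
        // 2. IPIC lastITableCheck: the most recently used iTable entry.
        ITableEntry it = recv.lastITable;
        if (it != null && it.interfaceClass == iface) {
            return it.targetAddress;
        }
        // 3. NEW: bounded iTableWalk over the first N entries of the iTable chain.
        it = recv.iTableHead;
        for (int i = 0; i < N && it != null; i++, it = it.next) {
            if (it.interfaceClass == iface) {
                return it.targetAddress;
            }
        }
        // 4. Everything missed within N entries: call the VM helper
        //    (expensive on POWER, since many registers must be saved/restored).
        return CALL_VM_HELPER;
    }
}

Since steps 1-3 all run without the VM helper's register save/restore, a hit in any of them avoids the expensive helper call.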

WIP:

  • SPECjbb2015 runs pass; running CI tests and measuring performance.
    • CI tests pass
    • Performance measurements done
  • The current implementation uses N=4 iTableWalk iterations; the final value of N is TBD for best performance.
    • N=4 is better than N=6.
  • iTableWalk currently depends on lastITableCheck being enabled; still determining whether separating them (reordering the code) so iTableWalk can work without lastITableCheck is worth it.

@rmnattas (Contributor, Author) commented Jul 4, 2025

Performance shows no significant effect on SPECjbb2015 and a 3.4% improvement on a microbenchmark.

SPECjbb2015, 7x runs (within noise range):

Driver           max-jOPS    Diff
Base             61610.00    100.00%
iTableWalk N=4   61311.71     99.52%
iTableWalk N=6   61522.83     99.86%

No critical-jOPS numbers, as I ran SPECjbb starting from 70% of the high-bound with 1% steps for faster execution time.

Microbenchmark, 4x runs:
Created a benchmark that exercises the code by having multiple implementors of an interface and calling through different receiver objects [1].

Driver           Throughput      Diff
Base             232259444.00    100.00%
iTableWalk N=4   240349604.00    103.48%
iTableWalk N=6   240162936.00    103.40%

Looking at the perf profile for iTableWalk N=6:

  75.62%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_scorching                                                                   
  17.98%  main             libj9jit29.so       [.] _interfaceSlotsUnavailable                                                                                                               
   1.72%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_warm                                                                        
   1.31%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_very-hot                                                                    
   0.84%  main             libj9vm29.so        [.] bytecodeLoopCompressed                                                                                                                   
   0.34%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/C4.play()V_warm                                                                                   
   0.26%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/C5.play()V_warm                                                                                    
[1] Benchmark source (the fields and methods below belong to the benchmark class, EmptyBench in the profile above):

interface I1 { public void play(); }

class C1 implements I1 { public void play() {} }
class C2 implements I1 { public void play() {} }
class C3 implements I1 { public void play() {} }
class C4 implements I1 { public void play() {} }
class C5 implements I1 { public void play() {} }
class C6 implements I1 { public void play() {} }

    public static C1 c1 = new C1();
    public static C2 c2 = new C2();
    public static C3 c3 = new C3();
    public static C4 c4 = new C4();
    public static C5 c5 = new C5();
    public static C6 c6 = new C6();
    public static I1[] arr = { c1, c2, c3, c4, c5, c6 };

    // Rotate through the six implementors so the receiver type changes every call.
    I1 getI(long i) { return arr[(int) (i % 6)]; }

    protected long doBatch(long numIterations) throws InterruptedException {
        for (long i = 0; i < numIterations; i++) {
            I1 receiver = getI(i);
            receiver.play();   // interface call that exercises the IPIC path
        }
        return numIterations;
    }

@rmnattas (Contributor, Author) commented Jul 4, 2025

fyi @zl-wang @vijaysun-omr

@zl-wang (Contributor) commented Jul 15, 2025

In order to differentiate the performance of this work, the test case needs to be somewhat more complicated. Think about it this way: as it is, each class implements only the single existing interface, so there is little benefit in walking 4 or 6 iTable entries, since the existing lastITableCheck already catches it all. Making each class implement multiple interfaces (chained super-/sub-interfaces or separate ones are both fine, I think) will lead to some difference. Also, making calls through different interfaces will certainly lead to lastITableCheck not being able to catch everything; e.g. with calls through 4 different interfaces, lastITableCheck at best covers 25% of the slot-cache misses. You can think of lastITable as another layer of cache.
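
If it helps, here is a rough sketch of the benchmark shape being suggested (untested; the interface and class names are made up): each class implements four interfaces, and the loop rotates calls across all four interface types, so lastITable can cover at most roughly a quarter of the slot-cache misses.

// Sketch of the suggested benchmark shape (untested, illustrative names).
interface J1 { void play1(); }
interface J2 { void play2(); }
interface J3 { void play3(); }
interface J4 { void play4(); }

class D1 implements J1, J2, J3, J4 {
    public void play1() {} public void play2() {}
    public void play3() {} public void play4() {}
}
class D2 implements J1, J2, J3, J4 {
    public void play1() {} public void play2() {}
    public void play3() {} public void play4() {}
}
// ... more implementors (D3..D6) in the same shape

class MultiInterfaceBench {
    static final Object[] receivers = { new D1(), new D2() };

    protected long doBatch(long numIterations) {
        for (long i = 0; i < numIterations; i++) {
            Object r = receivers[(int) (i % receivers.length)];
            // Rotate the interface used for the call, so lastITable misses often.
            switch ((int) (i % 4)) {
                case 0: ((J1) r).play1(); break;
                case 1: ((J2) r).play2(); break;
                case 2: ((J3) r).play3(); break;
                case 3: ((J4) r).play4(); break;
            }
        }
        return numIterations;
    }
}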

@vijaysun-omr (Contributor) commented Jul 15, 2025

At least in @a7ehuo's implementation (which @BradleyWood is now shepherding) and, I believe, @ehsankianifar's implementation, we have logic that I proposed originally: choose the number of cached iTable entries dynamically, i.e. on the basis of the CHTable and the expected/seen numbers of interfaces/classes.
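
For illustration only, a toy version of such a heuristic might look like the sketch below; this is a hypothetical stand-in, not the actual logic in the x86 or Z implementations, and the inputs (profiled receiver count, interfaces per receiver class) are assumptions about what a CHTable-driven heuristic could consume.

// Hypothetical sketch of a "choose the cached iTable depth dynamically" heuristic.
// It is NOT the logic used in the implementations referenced above.
final class ITableCacheDepthHeuristic {
    static final int MAX_DEPTH = 8;

    // interfacesPerReceiver: how many interfaces the profiled receiver classes
    // implement (e.g. derived from the CHTable); receiverTypesSeen: distinct
    // receiver types observed/expected at this call site.
    static int chooseDepth(int interfacesPerReceiver, int receiverTypesSeen) {
        if (receiverTypesSeen <= 2) {
            return 0;   // the existing slot caches likely cover this call site
        }
        // Check just enough entries to cover a typical receiver's interfaces.
        return Math.min(MAX_DEPTH, interfacesPerReceiver);
    }
}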

@rmnattas (Contributor, Author) commented:

POWER uses the snippet cache slots as another layer that x86 doesn't have, so the iTableWalk is only reached, with lower probability, when both a) the snippet caches and b) lastITable miss.
Also, having it as IPIC code means it doesn't affect code-cache usage, even though the number of iterations is not dynamic.
If it were added in the compilation body, it would sit before the lastITable check, given the current POWER implementation.

pmd (DaCapo) results, 5 runs each with outliers removed:

Driver   Result (ms)   Diff
Base     4361.20       100.00%
N=4      3883.51        89.05%
N=6      4681.68       107.35%

iTableWalk with N=4 is 11% better and with N=6 is 7% worse than base.

@rmnattas marked this pull request as ready for review on July 17, 2025 15:44
@rmnattas (Contributor, Author) commented:

ready for merge

@rmnattas changed the title from "WIP: Insert iTableWalk in _interfaceSlotsUnavailable" to "Insert iTableWalk in _interfaceSlotsUnavailable" on Jul 17, 2025
@zl-wang (Contributor) commented Jul 17, 2025

Impressive performance gain, @vijaysun-omr.

@zl-wang (Contributor) commented Jul 17, 2025

@vijaysun-omr I don't remember how much the improvement was on x86. Maybe comparable to 11%?

@vijaysun-omr (Contributor) commented:

Impressive gains for sure.

I believe @BradleyWood saw ~6% on x86, and @ehsankianifar saw ~20% on IBM Z.

@BradleyWood (Member) commented:

I think ~10% was first seen on PMD (unknown machine configuration), but I think Annabelle used only 3 runs. I ran with 10 runs for each trial and saw ~6%. I had to limit to 4 or 6 cores to observe that. The effect was not detectable if all cores were enabled, likely from a significant increase in standard deviation.

@vijaysun-omr (Contributor) commented:

I also feel that the original x86 overhead was partly related to the whole AVX/SSE transition cost. Once that overhead was reduced via other fixes, the interface lookup helpers were slightly less expensive (about 10%) than they were before (about 20%). So the iTable walk helped gain back some of that performance overhead of 10% spent in the helpers.

I'm not sure why the helpers might be more expensive on Power and IBM Z (based on the larger gains seen). Maybe there are more registers to save/restore on those platforms? Also mentioning @0xdaryl, @knn-k, and @r30shah for their awareness.

@r30shah (Contributor) commented Jul 17, 2025

I believe @ehsankianifar's improvement of 20% comes from walking all slots; it is also done in the JIT-compiled code, out of line (OOL). @ehsankianifar, correct me if I am wrong. Also, did we do runs on Z with your change similar to the ones Abdul has done?

@zl-wang (Contributor) commented Jul 17, 2025

@vijaysun-omr VM helpers tended to be more heavy-weight on Power, since close to 30 registers have to be saved/restored.
@r30shah Was @ehsankianifar's 20% obtained running some micro-benchmark, rather than DaCapo's pmd sub-benchmark?

@r30shah (Contributor) commented Jul 17, 2025

It was PMD with the large dataset size. @ehsankianifar, can you share the results from your run here?

@rmnattas force-pushed the p-itablewalk branch 2 times, most recently from 253a957 to ddccf30, on July 18, 2025 13:20
@rmnattas (Contributor, Author) commented:

CI testing passes; good to merge.

@zl-wang (Contributor) left a comment

LGTM

@zl-wang (Contributor) commented Jul 21, 2025

Jenkins test sanity aix,plinux jdk8,jdk21

@zl-wang merged commit f6940e2 into eclipse-openj9:master on Jul 21, 2025
15 checks passed