
Insert iTableWalk in _interfaceSlotsUnavailable #22171


Merged
1 commit merged from p-itablewalk into eclipse-openj9:master on Jul 21, 2025

Conversation

@rmnattas (Contributor) commented Jul 2, 2025

This PR inserts an iTableWalk for POWER into the IPIC _interfaceSlotsUnavailable helper, performed before calling the VM helper, to improve interface call performance on a cache miss.

_interfaceSlotsUnavailable was chosen because it is the target called by the snippet once the two cache slots in the compilation snippet have been populated and a cache miss still occurs. Interface calls therefore keep using the existing paths to find targets and populate the caches (_interfaceCallHelper and _interfaceCompeteSlot2), and the snippet call is then patched to _interfaceSlotsUnavailable, where the iTableWalk is performed.

New implementation path, after the caches are populated (a conceptual Java sketch of this path follows the layout below):

  1. Mainline cache check for the 2 snippet cache slots
  2. IPIC lastITableCheck (checks the J9Class.lastITable cache)
  3. IPIC iTableWalk (checks J9Class.iTable[0..N-1]) NEW
  4. Call VM helper

_interfaceSlotsUnavailable layout:

<lastItable check
    b hitCall if hit>
<iTableWalk
    b vmhelperCall if all-miss>
<hitCall>

<vmhelperCall>
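
For context, below is a conceptual sketch of the full lookup order after this change, written as plain Java standing in for the generated IPIC code and PicBuilder glue. All type, field, and constant names in the sketch are illustrative stand-ins, not the actual OpenJ9 structures.

// Conceptual sketch only: plain Java standing in for the generated IPIC code on POWER.
// Type/field names (ReceiverClass, ITableEntry, SnippetCache, ...) are illustrative.
final class InterfaceDispatchSketch {
    static final int N = 4;                 // iTable entries walked before giving up
    static final long CALL_VM_HELPER = -1L; // sentinel: fall back to the VM helper

    static final class ITableEntry {        // stands in for a J9ITable entry
        Class<?> interfaceClass;
        long targetAddress;                 // resolved target for this call site
        ITableEntry next;
    }

    static final class ReceiverClass {      // stands in for J9Class
        ITableEntry lastITable;             // J9Class.lastITable cache
        ITableEntry iTableHead;             // head of the iTable chain
    }

    static final class SnippetCache {       // the two POWER snippet cache slots
        final ReceiverClass[] clazz = new ReceiverClass[2];
        final long[] target = new long[2];
    }

    static long lookup(ReceiverClass recv, Class<?> iface, SnippetCache cache) {
        // 1. Mainline check of the two snippet cache slots populated by
        //    _interfaceCallHelper / _interfaceCompeteSlot2.
        for (int slot = 0; slot < 2; slot++) {
            if (cache.clazz[slot] == recv) {
                return cache.target[slot];
            }
        }
        // 2. IPIC lastITableCheck: the most recently used iTable entry.
        ITableEntry it = recv.lastITable;
        if (it != null && it.interfaceClass == iface) {
            return it.targetAddress;
        }
        // 3. NEW: bounded iTableWalk over the first N entries of the iTable chain.
        it = recv.iTableHead;
        for (int i = 0; i < N && it != null; i++, it = it.next) {
            if (it.interfaceClass == iface) {
                return it.targetAddress;
            }
        }
        // 4. Everything missed within N entries: call the VM helper
        //    (expensive on POWER, since many registers must be saved/restored).
        return CALL_VM_HELPER;
    }
}

Since steps 1-3 all run without the VM helper's register save/restore, a hit in any of them avoids the expensive helper call.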

WIP:

  • SPECjbb2015 runs pass; running CI tests and measuring performance.
    • CI tests pass
    • Performance measurements done
  • The current implementation uses N=4 iTableWalk iterations; the final value of N is TBD for best performance.
    • N=4 is better than N=6.
  • iTableWalk currently depends on lastITableCheck being enabled; still determining whether separating them (reordering the code) so iTableWalk can work without lastITableCheck is worth it.

@rmnattas (Contributor, Author) commented Jul 4, 2025

Performance shows no significant effect on SPECjbb2015 and a 3.4% improvement on a microbenchmark.

SPECjbb2015, 7x runs (within noise range):

Driver           max-jOPS    Diff
Base             61610.00    100.00%
iTableWalk N=4   61311.71     99.52%
iTableWalk N=6   61522.83     99.86%

No critical-jOPS numbers, as I ran SPECjbb starting from 70% of the high-bound with 1% steps for faster execution time.

Microbenchmark, 4x runs:
Created a benchmark that exercises the code by having multiple implementors of an interface and calling through different receiver objects [1].

Driver           Throughput      Diff
Base             232259444.00    100.00%
iTableWalk N=4   240349604.00    103.48%
iTableWalk N=6   240162936.00    103.40%

Looking at the perf profile for iTableWalk N=6:

  75.62%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_scorching                                                                   
  17.98%  main             libj9jit29.so       [.] _interfaceSlotsUnavailable                                                                                                               
   1.72%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_warm                                                                        
   1.31%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/EmptyBench.doBatch(J)J_very-hot                                                                    
   0.84%  main             libj9vm29.so        [.] bytecodeLoopCompressed                                                                                                                   
   0.34%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/C4.play()V_warm                                                                                   
   0.26%  main             [JIT] tid 373038    [.] net/adoptopenjdk/bumblebench/examples/C5.play()V_warm                                                                                    
[1] Benchmark source (the fields and methods below belong to the benchmark class, EmptyBench in the profile above):

interface I1 { public void play(); }

class C1 implements I1 { public void play() {} }
class C2 implements I1 { public void play() {} }
class C3 implements I1 { public void play() {} }
class C4 implements I1 { public void play() {} }
class C5 implements I1 { public void play() {} }
class C6 implements I1 { public void play() {} }

    public static C1 c1 = new C1();
    public static C2 c2 = new C2();
    public static C3 c3 = new C3();
    public static C4 c4 = new C4();
    public static C5 c5 = new C5();
    public static C6 c6 = new C6();
    public static I1[] arr = { c1, c2, c3, c4, c5, c6 };

    // Rotate through the six implementors so the receiver type changes every call.
    I1 getI(long i) { return arr[(int) (i % 6)]; }

    protected long doBatch(long numIterations) throws InterruptedException {
        for (long i = 0; i < numIterations; i++) {
            I1 receiver = getI(i);
            receiver.play();   // interface call that exercises the IPIC path
        }
        return numIterations;
    }

@rmnattas (Contributor, Author) commented Jul 4, 2025

fyi @zl-wang @vijaysun-omr

@zl-wang (Contributor) commented Jul 15, 2025

In order to differentiate the performance of this work, the test case needs to be somewhat more complicated. Think about it this way: as it is, each class implements only the single existing interface, so there is little benefit in walking 4 or 6 iTable entries, since the existing lastITableCheck already catches it all. Making each class implement multiple interfaces (chained super-/sub-interfaces or separate ones are both fine, I think) will lead to some difference. Also, making calls through different interfaces will certainly lead to lastITableCheck not being able to catch everything; e.g. with calls through 4 different interfaces, lastITableCheck at best covers 25% of the slot-cache misses. You can think of lastITable as another layer of cache.
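
If it helps, here is a rough sketch of the benchmark shape being suggested (untested; the interface and class names are made up): each class implements four interfaces, and the loop rotates calls across all four interface types, so lastITable can cover at most roughly a quarter of the slot-cache misses.

// Sketch of the suggested benchmark shape (untested, illustrative names).
interface J1 { void play1(); }
interface J2 { void play2(); }
interface J3 { void play3(); }
interface J4 { void play4(); }

class D1 implements J1, J2, J3, J4 {
    public void play1() {} public void play2() {}
    public void play3() {} public void play4() {}
}
class D2 implements J1, J2, J3, J4 {
    public void play1() {} public void play2() {}
    public void play3() {} public void play4() {}
}
// ... more implementors (D3..D6) in the same shape

class MultiInterfaceBench {
    static final Object[] receivers = { new D1(), new D2() };

    protected long doBatch(long numIterations) {
        for (long i = 0; i < numIterations; i++) {
            Object r = receivers[(int) (i % receivers.length)];
            // Rotate the interface used for the call, so lastITable misses often.
            switch ((int) (i % 4)) {
                case 0: ((J1) r).play1(); break;
                case 1: ((J2) r).play2(); break;
                case 2: ((J3) r).play3(); break;
                case 3: ((J4) r).play4(); break;
            }
        }
        return numIterations;
    }
}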

@vijaysun-omr (Contributor) commented Jul 15, 2025

At least in @a7ehuo's implementation (which @BradleyWood is now shepherding) and, I believe, @ehsankianifar's implementation, we have logic that I proposed originally: choose the number of cached iTable entries dynamically, i.e. on the basis of the CHTable and the expected/seen numbers of interfaces/classes.
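
For illustration only, a toy version of such a heuristic might look like the sketch below; this is a hypothetical stand-in, not the actual logic in the x86 or Z implementations, and the inputs (profiled receiver count, interfaces per receiver class) are assumptions about what a CHTable-driven heuristic could consume.

// Hypothetical sketch of a "choose the cached iTable depth dynamically" heuristic.
// It is NOT the logic used in the implementations referenced above.
final class ITableCacheDepthHeuristic {
    static final int MAX_DEPTH = 8;

    // interfacesPerReceiver: how many interfaces the profiled receiver classes
    // implement (e.g. derived from the CHTable); receiverTypesSeen: distinct
    // receiver types observed/expected at this call site.
    static int chooseDepth(int interfacesPerReceiver, int receiverTypesSeen) {
        if (receiverTypesSeen <= 2) {
            return 0;   // the existing slot caches likely cover this call site
        }
        // Check just enough entries to cover a typical receiver's interfaces.
        return Math.min(MAX_DEPTH, interfacesPerReceiver);
    }
}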

@rmnattas (Contributor, Author) commented:

POWER uses the snippet cache slots as another layer that x86 doesn't have, so the iTableWalk is only reached, with lower probability, when both a) the snippet caches and b) lastITable miss.
Also, having it as IPIC code means it doesn't affect code-cache usage, even though the number of iterations is not dynamic.
If it were added in the compilation body, it would sit before the lastITable check, given the current POWER implementation.

pmd (DaCapo) results, 5 runs each with outliers removed:

Driver   Result (ms)   Diff
Base     4361.20       100.00%
N=4      3883.51        89.05%
N=6      4681.68       107.35%

iTableWalk with N=4 is 11% better and with N=6 is 7% worse than base.

@rmnattas marked this pull request as ready for review on July 17, 2025 15:44
@rmnattas (Contributor, Author) commented:

ready for merge

@rmnattas changed the title from "WIP: Insert iTableWalk in _interfaceSlotsUnavailable" to "Insert iTableWalk in _interfaceSlotsUnavailable" on Jul 17, 2025
@zl-wang (Contributor) commented Jul 17, 2025

Impressive performance gain, @vijaysun-omr.

@zl-wang (Contributor) commented Jul 17, 2025

@vijaysun-omr I don't remember how much the improvement was on x86. Maybe comparable to 11%?

@vijaysun-omr (Contributor) commented:

Impressive gains for sure.

I believe @BradleyWood saw ~6% on x86, and @ehsankianifar saw ~20% on IBM Z.

@BradleyWood (Member) commented:

I think ~10% was first seen on PMD (unknown machine configuration), but I think Annabelle used only 3 runs. I ran with 10 runs for each trial and saw ~6%. I had to limit to 4 or 6 cores to observe that. The effect was not detectable if all cores were enabled, likely from a significant increase in standard deviation.

@vijaysun-omr (Contributor) commented:

I also feel that the original x86 overhead was partly related to the whole AVX/SSE transition cost. Once that overhead was reduced via other fixes, the interface lookup helpers were slightly less expensive (about 10%) than they were before (about 20%). So the iTable walk helped gain back some of that performance overhead of 10% spent in the helpers.

I'm not sure why the helpers might be more expensive on Power and IBM Z (based on the larger gains seen). Maybe there are more registers to save/restore on those platforms? Also mentioning @0xdaryl, @knn-k, and @r30shah for their awareness.

@r30shah (Contributor) commented Jul 17, 2025

I believe @ehsankianifar's improvement of 20% comes from walking all slots; it is also done in the JIT-compiled code, out of line (OOL). @ehsankianifar, correct me if I am wrong. Also, did we do runs on Z with your change similar to the ones Abdul has done?

@zl-wang (Contributor) commented Jul 17, 2025

@vijaysun-omr VM helpers tended to be more heavy-weight on Power, since close to 30 registers have to be saved/restored.
@r30shah Was @ehsankianifar's 20% obtained running some micro-benchmark, rather than DaCapo's pmd sub-benchmark?

@r30shah (Contributor) commented Jul 17, 2025

It was PMD with the large dataset size. @ehsankianifar, can you share the results from your run here?

@rmnattas force-pushed the p-itablewalk branch 2 times, most recently from 253a957 to ddccf30, on July 18, 2025 13:20
@rmnattas (Contributor, Author) commented:

CI testing passes; good to merge.

@zl-wang (Contributor) left a comment

LGTM

@zl-wang (Contributor) commented Jul 21, 2025

Jenkins test sanity aix,plinux jdk8,jdk21

@zl-wang merged commit f6940e2 into eclipse-openj9:master on Jul 21, 2025
15 checks passed