Insert iTableWalk in _interfaceSlotsUnavailable #22171
Performance shows no significant effect on SPECjbb2015 and a 3.4% improvement on the microbenchmark. SPECjbb2015, 7 runs:
critical-jOPS is not reported because I ran SPECjbb2015 starting from 70% of the high-bound with 1% steps for faster execution. Microbenchmark, 4 runs:
Looking at perf profile for
[1]
In order to differentiate the performance with this work, the test case needs to be somewhat more complicated. Think about it this way: as it is, each class implements only the single existing interface, so there is little benefit in walking 4 or 6 iTable entries, since the existing lastITableCheck already catches them all. Making each class implement multiple interfaces (chained super/sub-interfaces or separate ones are both fine, I think) will lead to some difference. Also, making calls through different interfaces will surely leave lastITableCheck unable to catch them all, e.g. with calls through 4 different interfaces, lastITableCheck will at best be good for 25% of the slot-cache misses. You can think of lastITable as another layer of cache.
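To make the "another layer of cache" point concrete, here is a minimal sketch that treats lastITable as a one-entry cache on the receiver class and measures how often it hits for different call patterns. It is a toy model, not the OpenJ9 code: the function name `last_itable_hit_rate`, the interface labels, and the call patterns are all made up for illustration.

```python
import random
from itertools import cycle, islice

def last_itable_hit_rate(calls):
    """Fraction of calls served by a one-entry 'last interface' cache (toy model)."""
    last = None   # models J9Class.lastITable for a single receiver class
    hits = 0
    total = 0
    for iface in calls:
        total += 1
        if iface == last:
            hits += 1
        else:
            last = iface  # miss: the slower path runs, then the cache is refreshed
    return hits / total if total else 0.0

INTERFACES = ["I0", "I1", "I2", "I3"]  # hypothetical interface labels

# One interface only: everything after the first call hits.
single = ["I0"] * 1000
# Strict rotation through 4 interfaces: the cached entry is never the one needed.
rotating = list(islice(cycle(INTERFACES), 1000))
# Uniformly random choice among 4 interfaces: roughly one call in four hits.
rng = random.Random(1)
uniform = [rng.choice(INTERFACES) for _ in range(10000)]

print(last_itable_hit_rate(single))
print(last_itable_hit_rate(rotating))
print(last_itable_hit_rate(uniform))
```

With a single interface the one-entry cache absorbs nearly every call, which matches the observation that the current test case leaves little for the iTable walk to do. Once calls are spread over 4 interfaces, the hit rate drops to about a quarter or below, and that is where walking the first N iTable entries can start to matter.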
At least in @a7ehuo implementation (that @BradleyWood is now shepherding) and @ehsankianifar implementation (I believe), we have logic that I proposed originally, which was to choose the number of cached iTable entries dynamically, i.e. on the basis of the CHTable and the expected/seen numbers of interfaces/classes. |
POWER uses snippet cache slots as another layer that x86 doesn't have, so the iTableWalk is reached with lower probability: only when both a) the snippet caches and b) lastITable miss.
Configuration | Result (ms) | Relative to base |
---|---|---|
Base | 4361.20 | 100.00% |
N=4 | 3883.51 | 89.05% |
N=6 | 4681.68 | 107.35% |
iTableWalk with N=4 is 11% better and with N=6 is 7% worse than base.
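For reference, the percentages can be reproduced from the raw timings with a few lines of Python. The timings are the ones reported in this comment; the helper name `relative_to_base` is only illustrative.

```python
base_ms = 4361.20
results_ms = {"N=4": 3883.51, "N=6": 4681.68}

def relative_to_base(value_ms, base_ms):
    """Elapsed time as a percentage of the base elapsed time, rounded to 2 decimals."""
    return round(100.0 * value_ms / base_ms, 2)

for name, ms in results_ms.items():
    pct = relative_to_base(ms, base_ms)
    # Lower is better here, since the metric is elapsed time in ms.
    print(f"{name}: {pct:.2f}% of base, {pct - 100.0:+.2f}% change in time")
```

This gives 89.05% for N=4 (about 11% less time than base) and 107.35% for N=6 (about 7% more time).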
ready for merge |
impressive performance gain @vijaysun-omr |
@vijaysun-omr i don't remember how much improvement was on x86. maybe comparable to 11%? |
Impressive gains for sure. I believe @BradleyWood saw ~6% on x86 and @ehsankianifar saw ~20% on IBM Z. |
I think ~10% was first seen on PMD (unknown machine configuration), but I think Annabelle used only 3 runs. I ran with 10 runs for each trial and saw ~6%. I had to limit to 4 or 6 cores to observe that. The effect was not detectable if all cores were enabled, likely from a significant increase in standard deviation. |
I also feel that the original x86 overhead was partly related to the whole AVX/SSE transition cost. Once that overhead was reduced via other fixes, the interface lookup helpers were slightly less expensive (about 10%) than they were before (about 20%). So the iTable walk helped gain back some of that performance overhead of 10% spent in the helpers. I'm not sure why the helpers might be more expensive on Power and IBM Z (based on the larger gains seen). Maybe more registers on those other platforms to save/restore ? Also mentioning @0xdaryl @knn-k and @r30shah for their awareness. |
I believe @ehsankianifar's 20% improvement comes from walking all slots. It is also done in the JIT-compiled code in OOL. @ehsankianifar, correct me if I am wrong. Also, did we do runs on Z with your change similar to the ones Abdul has done? |
@vijaysun-omr VM helpers tended to be more heavy-weight on Power, since close to 30 registers have to be saved/restored. |
It was PMD large dataset size. @ehsankianifar Can you share the results from your run here? |
CI Testing passes, good to merge. |
LGTM
Jenkins test sanity aix,plinux jdk8,jdk21 |
This PR inserts an iTableWalk for POWER in IPIC _interfaceSlotsUnavailable, performed before calling the VM helper, to improve interface call performance on a cache miss.

_interfaceSlotsUnavailable was chosen because it is the target called by the snippet once the 2 cache slots in the compilation snippet have been populated and the call missed both. Interface calls therefore use the already existing paths to find and populate the caches (_interfaceCallHelper and _interfaceCompeteSlot2), and then the snippet call is patched to _interfaceSlotsUnavailable, where the iTableWalk is performed.

New implementation path (after cache is populated):
J9Class.lastITable cache)
J9Class.iTable[0..N-1]) NEW

_interfaceSlotsUnavailable layout:

WIP:
running CI tests and measuring performance.
N=4 iTableWalk iterations. Final value of N is TBD for best performance.
iTableWalk depends on lastITableCheck being enabled; determining whether separating them (reordering code) so that iTableWalk can run without lastITableCheck is worth it.
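The lookup order described above can be sketched as a small Python model. This is only an illustration of the tiering (snippet cache slots, then lastITable, then the first N iTable entries, then the VM helper), not the PicBuilder assembly. `ModelClass`, `resolve`, and the class and interface names are hypothetical. The model also assumes that the VM helper refreshes lastITable and that the walk leaves it untouched.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelClass:
    """Toy stand-in for a receiver class (hypothetical, not J9Class)."""
    name: str
    itable: List[str]                  # interfaces in iTable order
    last_itable: Optional[str] = None  # models the lastITable cache

def resolve(receiver, iface, snippet_slots, n_walk=4):
    """Return the name of the tier that served this interface call."""
    # Tier 1: the 2 cache slots in the snippet, keyed by receiver class.
    if receiver.name in snippet_slots:
        return "snippet cache"
    # Tier 2: lastITableCheck.
    if receiver.last_itable == iface:
        return "lastITable"
    # Tier 3: walk only the first n_walk iTable entries.
    for entry in receiver.itable[:n_walk]:
        if entry == iface:
            return "iTableWalk"  # assumption: the walk does not update lastITable
    # Tier 4: fall back to the VM helper.
    if iface in receiver.itable:
        receiver.last_itable = iface  # assumption: the helper refreshes lastITable
    return "VM helper"

slots = ["A", "B"]  # receiver classes already cached in the snippet
c = ModelClass("C", ["I0", "I1", "I2", "I3", "I4", "I5"], last_itable="I0")

print(resolve(ModelClass("A", ["I0"]), "I0", slots))  # snippet cache
print(resolve(c, "I0", slots))                        # lastITable
print(resolve(c, "I2", slots))                        # iTableWalk
print(resolve(c, "I5", slots))                        # VM helper (entry 5 is beyond N=4)
print(resolve(c, "I5", slots))                        # lastITable (refreshed by the helper)
```

In this model, raising `n_walk` from 4 to 6 moves the first `I5` lookup from the VM helper to the walk, at the cost of a longer walk on every lookup that still ends up in the helper. That trade-off is one reason the final value of N has to be settled by measurement.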