Known Issues
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- `bool` vectors generated from operations on non-`(s)byte` vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- `float8` `min()` and `max()` functions don't handle NaNs the same way Unity.Mathematics does
Fixes
- Fixed XML documentation not showing descriptions for valid `Promise` flags
- Fixed `cminmax` documentation
- `bitmask64` with `numBits` equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1, i.e. AVX2
- Fixed `uint8` to `float8` type conversion if compiling for AVX2
- Fixed incorrect `mod` implementations
- (ISSUE #16) Fixed `float` and `double` `(r)cbrt` edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector `float` implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid `Promise` flags allow for call-site selection of faster code paths
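
The `bitmask64` fix above touches a pitfall that is easy to reproduce in plain C#. The sketch below shows one common way such a bug can arise; the release notes do not state the actual cause, so treat this as an assumption. `BitmaskSketch`, `NaiveMask` and `SafeMask` are hypothetical names, not MaxMath API.

```csharp
using System;

// Hypothetical illustration; not MaxMath's bitmask64 implementation.
static class BitmaskSketch
{
    // C# masks the shift count of a 64 bit operand to its low 6 bits,
    // so (1UL << 64) is evaluated as (1UL << 0) and the mask becomes 0.
    public static ulong NaiveMask(int numBits) => (1UL << numBits) - 1;

    // Guarded variant, valid for numBits in [0, 64].
    public static ulong SafeMask(int numBits) =>
        numBits >= 64 ? ulong.MaxValue : (1UL << numBits) - 1;
}
```

With these helpers, `NaiveMask(64)` yields `0` while `SafeMask(64)` yields `ulong.MaxValue`; both agree for every `numBits` below 64.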
Additions
Divider<T>
`Divider<T>` is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type `T`, a `Divider<T>` instance replaces division operations with multiplication-, shift- and rounding operations, utilizing the more suitable of 2 algorithms that compilers typically use for compile time constant divisors.
`Divider<T>` was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
`Divider<T>` is NOT meant to replace division operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable `i` is the divisor in a `for`-loop).
Numerous `Promise` flags allow for faster operations, provided that the `Divider<T>` instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Burst's inability to compile-time evaluate `typeof(T)` often requires explicit initialization (example: `new Divider<byte>((byte)42)`). DEBUG only validity checks ensure correct initialization and usage.
The current `Divider` API consists of:
- `/` and `%` operators:
  - LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of that type
  - LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of that type
  - LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
  - LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- `DivRem` member methods
- `EvenlyDivides` member methods
- `T Divisor` as a readonly property
- `public const Promise`s within `Divider<T>`, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
- `Get/SetInnerDivider<U>` methods: get or set a scalar- or vector `Divider<U>` within a `Divider<T>`
- Component shuffles: `Divider<T>.wzxy` swizzle "operators" as properties.
NOTE: `Get/SetInnerDivider<U>` methods and `Divider<T>.[a][b][c][d]` properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a `Divider<int2>` from a `Divider<int4>`. `Get/SetInnerDivider<U>` are placeholders both for these operations as well as for the `v[a]_[b]` properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.
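
To make the "division by multiplication and shift" idea behind `Divider<T>` concrete, here is a minimal, self-contained sketch for `uint` only. It uses one well-known variant (a precomputed 64 bit reciprocal `ceil(2^64 / d)`), which is not necessarily either of the 2 algorithms MaxMath selects from. `MagicDividerU32` and its members are hypothetical names chosen for illustration.

```csharp
using System;

// Hypothetical illustration type; NOT MaxMath's Divider<T>.
readonly struct MagicDividerU32
{
    readonly ulong magic; // ceil(2^64 / Divisor); unused when Divisor == 1

    public uint Divisor { get; }

    public MagicDividerU32(uint divisor)
    {
        if (divisor == 0) throw new DivideByZeroException();
        Divisor = divisor;
        // floor((2^64 - 1) / d) + 1 == ceil(2^64 / d); wraps to 0 for d == 1.
        magic = unchecked(ulong.MaxValue / divisor + 1);
    }

    public uint Divide(uint n)
    {
        if (Divisor == 1) return n; // the reciprocal 2^64 does not fit in 64 bits

        // floor(magic * n / 2^64), assembled from 32 bit halves of 'magic'
        // so that no 128 bit multiplication is needed.
        ulong hi = (magic >> 32) * n;
        ulong lo = (magic & 0xFFFFFFFFUL) * n;
        return (uint)((hi + (lo >> 32)) >> 32);
    }

    public uint Remainder(uint n) => n - Divide(n) * Divisor;

    public bool EvenlyDivides(uint n) => Remainder(n) == 0;
}
```

The expensive part (one real division) happens once in the constructor; every later `Divide` call costs two multiplications, one addition and two shifts. This is why a gain is only to be expected when the same divisor is reused.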
quadruple (PREVIEW)
Analogous to `(U)Int128`, this library now supports 128 bit floating point operations with a software-implemented type. It is fully IEEE 754 compliant and uses the typical format of 1 sign bit, 15 exponent bits and 112 mantissa bits.
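
For readers unfamiliar with the 1/15/112 layout, the following sketch decodes the fields from the upper 64 bits of an IEEE 754 binary128 encoding. It is a generic illustration of the format; `Binary128Layout` and its members are hypothetical and say nothing about how `quadruple` stores its bits internally.

```csharp
using System;

// Hypothetical helper for illustration; not part of MaxMath.
static class Binary128Layout
{
    public const int ExponentBias = 16383; // 2^(15 - 1) - 1

    // 'hi' holds bits 127..64 of the encoding, 'lo' holds bits 63..0.
    public static bool SignBit(ulong hi) => (hi >> 63) != 0;

    public static int BiasedExponent(ulong hi) => (int)((hi >> 48) & 0x7FFF);

    // Upper 48 of the 112 mantissa bits; the lower 64 are all of 'lo'.
    public static ulong MantissaHigh48(ulong hi) => hi & 0x0000FFFFFFFFFFFFUL;

    // NaN: all exponent bits set and a non-zero mantissa.
    public static bool IsNaN(ulong hi, ulong lo) =>
        BiasedExponent(hi) == 0x7FFF && (MantissaHigh48(hi) | lo) != 0;
}
```

For example, 1.0 is encoded with `hi == 0x3FFF000000000000` and `lo == 0`, so `BiasedExponent(hi) - ExponentBias` is 0.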
NOTE: `quadruple` is in preview for an unforeseeable amount of time. This means that it is neither completely optimized, nor are all `maxmath` functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, -(unary), +(binary), -(binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign
Functions
- Added `isnormal` and `issubnormal` functions for floating point types
- Added `hypot` and `inthypot` functions for calculating `[int]sqrt(a * a + b * b)` without overflow, unless an optional `Promise` parameter with its `NoOverflow` flag set is passed as a compile time constant argument
- Added `roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128`. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
- Added `cor` and `cxor`. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
- Split `approx` into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
- Added `roundmultiple(x, m)`, `floormultiple(x, m)`, `ceilmultiple(x, m)` and `truncmultiple(x, m)` for all types, rounding `x` to the nearest multiple of any positive `m` with the selected rounding mode (for example: `ceilmultiple` rounds `x` to the nearest greater multiple of `m`)
- Added a whole stack of bit manipulation functions for all scalar- and vector integer types: `parityodd`, `parityeven`, `countzerobits`, `l1cnt`, `t1cnt`, `lzmask`, `tzmask`, `l1mask`, `t1mask`, `bits_extractlowest0`, `bits_masktolowest`, `bits_masktolowest0`, `bits_maskfromlowest`, `bits_maskfromlowest0`, `bits_setlowest`, `bits_surroundlowest` and `bits_surroundlowest0`
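
As a rough guide to what some of these additions compute, here is a scalar sketch in plain C#. It illustrates semantics only and is not MaxMath code: all names are hypothetical, arrays stand in for vectors, and two details are assumptions, namely that `ceilmultiple` returns `x` unchanged when it already is a multiple of `m`, and that scaling by the larger operand is an acceptable way to avoid overflow in `hypot`.

```csharp
using System;

// Hypothetical scalar stand-ins; not the MaxMath implementations.
static class FunctionSketches
{
    // sqrt(a * a + b * b) without overflowing in the intermediate squares:
    // factor out the larger magnitude so the squared ratio is at most 1.
    public static double Hypot(double a, double b)
    {
        a = Math.Abs(a);
        b = Math.Abs(b);
        double max = Math.Max(a, b);
        double min = Math.Min(a, b);
        if (max == 0.0 || double.IsInfinity(max)) return max;
        double r = min / max;
        return max * Math.Sqrt(1.0 + r * r);
    }

    // Largest multiple of m that is <= x (m must be positive).
    public static uint FloorMultiple(uint x, uint m) => x - x % m;

    // Smallest multiple of m that is >= x (assumed semantics, see lead-in).
    // May wrap around if the result exceeds uint.MaxValue.
    public static uint CeilMultiple(uint x, uint m)
    {
        uint rem = x % m;
        return rem == 0 ? x : x - rem + m;
    }

    // Reduce all elements with bitwise OR.
    public static uint COr(uint[] v)
    {
        uint result = 0;
        foreach (uint e in v) result |= e;
        return result;
    }

    // Reduce all elements with bitwise XOR.
    public static uint CXor(uint[] v)
    {
        uint result = 0;
        foreach (uint e in v) result ^= e;
        return result;
    }
}
```

Note that `Hypot(1e200, 1e200)` stays finite here, whereas the naive `Math.Sqrt(a * a + b * b)` overflows to infinity.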
Global Compilation Options
- Added Global Compilation Options for `OptimizeFor`, `FloatMode` and `FloatPrecision`. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to `OptimizeFor.Performance`, `FloatMode.Default` and `FloatPrecision.Standard`, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs
Improvements
Meta
- This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes SSE2NEON and SIMDe to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. As a result, generated ARM code will sometimes remain slightly unoptimized, since the author is unable to verify the correctness of ARM specific optimizations with unit tests in most cases.
Performance
- Implemented optimized `(u)long` vector to `float` vector type conversion operators
- Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
- Implemented an `AssumeRangeAttribute` equivalent for all vectorized functions with known return value ranges
- Implemented more optimal `(U)Int128` comparison operators
- Implemented optimal `(U)Int128` multiplication operations with- and division and modulo operations by compile time constants
- Implemented optimal `(U)Int128` division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any variation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing `Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()`
- Increased valid `Promise.Unsafe0` range for `(u)long` `intcbrt` from [0, 2^46] to [0, 2^48]
- Added an optional `Promise` parameter to `gamma`
- Added an optional `Promise` parameter to `erf(c)`
- Added an optional `Promise` parameter to `gcd` and `lcm`
- Added `quarter` and `half` scalar- and vector function overloads for `min`, `max`, `minmax`, `clamp`, `saturate`, `isinrange`, `trunc`, `round`, `ceil`, `floor` and `sign`
- Removed the only non-optimizing branch in vector code in the entire library within the `long2/3/4` `>>` operator if the shift amount is not a compile time constant for a ~30% performance gain
- Reduced branch predictor penalty of `gcd` by moving the loop condition to an earlier part of the loop's code
- Reduced latency of internal `byte` to `uint`/`float` and `ushort` to `ulong`/`double` vector zero-extension conversions (and back) if an entire SIMD register is converted, by ~6 cycles. This operation is especially relevant in `byte` vector `shl` and `shr` operations, which are very common throughout the library, including within loops. Consequently, this improvement should yield significant performance gains
- Reduced latency of `gcd` within the loop by 1 to 8 clock cycles at the cost of 0 to 9 clock cycles outside the loop if compiling for SSE4 or higher, by adding 1 to 6 instructions.
- Reduced latency of `comb` for all scalar and vector types.
  - i. `(s)byte`, `(u)short` and scalar `(u)int` overloads now make use of an algorithm that only requires one division per loop iteration instead of two, if the type fits into a single hardware register when cast to the next wider integer type. This algorithm is more than twice as fast.
  - ii. Implemented `Divider<T>` into both algorithms. This yields a massive performance gain of 2.3x to 20x (!), at the cost of an immense amount of code size
  - iii. The first eight loop iterations of both algorithms have been extracted from the loop and optimized by hand. This yields a performance gain between ~15 (16 bit) and ~600 (64 bit) clock cycles.

  Both ii. and iii. are disabled if the global compilation tag `OptimizeFor` is set to `OptimizeFor.Size`.
- Reduced latency of `half{X}` to `(s)byte{X}`, `(u)short{X}`, `(u)long{X}`, and `half8` to `(u)int8` type conversion from 10 down to 4 or 5 cycles, also affecting managed C# fallback performance. Unity.Mathematics' implementation remains suboptimal (for now... hint)
- Reduced latency of SSE2 fallback code for `ulong` vector `<` and `>` operators by 1 cycle and reduced code size by removing 1 instruction
- Reduced latency of `(s)byte` vector `floorlog2`/`ceillog2` by 1 cycle and reduced code size by removing 1 to 2 instructions, as well as 1 constant read from RAM
- Reduced latency of vector `long` `(n)abs` by 0 to 2 cycles (highly CPU specific) and reduced code size by removing 1 to 2 instructions
- Reduced latency of `all_dif` overloads for all `(s)byte` vectors and `short16` by 10% to 20%; the `(s)byte` overloads now use ~35% less constant data read from RAM
- Reduced latency of `(s)byte16/32` division and modulo operations by 6 cycles by adding 6 instructions
- Reduced latency of vector `float` to `(u)long` type conversion by 5 to 10 cycles
- Reduced latency of `quarter`/`half` to integer type conversion operations by 6 or more cycles
- Reduced latency, code size and constant data read from RAM of SSE2 fallback code for `(s)byte` vector `shrl`, `shra` and `shl` drastically when the shift amount is constant
- Reduced latency of vector `rol`/`ror` drastically when the rotation value is constant and a multiple of 8
- Reduced latency of `(u)int8` and `(u)long` vector `l/tzcnt` by up to 2 cycles and removed 1 instruction
- Reduced latency of `float8` to `uint8` conversion operator by ~6 cycles if compiling for AVX2
- Reduced latency of scalar- and vector `double` `fastrsqrt` by 4 cycles and increased accuracy of the result
- Reduced latency of `(u)long` `intsqrt` by reducing the latency of each loop iteration by 1 (signed) or 2 (unsigned) cycles, at the cost of an extra 5 (unsigned) or 6 (signed) instructions with a latency of 3 cycles outside the loop
- Reduced latency of scalar `(u)int` `intlog10` by 1 cycle and removed 1 instruction
- Reduced latency of vector `double` to `(u)long` type conversion operators by 1 cycle and removed 5 instructions
- Reduced latency of previously internal `roundto(u)long(double x)` by ~10 cycles, also reducing `(u)long` vector division latency by ~20 cycles down to ~94 cycles
- Reduced latency of squaring `(u)long` vectors by 2 to 3 cycles and removed 2 instructions, also reducing `(u)long` vector `intpow` latency, as squaring is part of the loop
- Reduced latency of all floating point `nexttoward` functions by 1 cycle
- Reduced latency of `(u)long` vector multiplication by scalar- or vector integer types with bit-width less than or equal to 32 by 2 cycles and removed 1 instruction
- Reduced latency of signed 8 bit integer division by 1 cycle if compiling for SSSE3 or higher
- Reduced latency of managed- and SSE2 fallback code paths for integer vector `isinrange(x, min, max)` function overloads by ~2 cycles
- Reduced latency of floating point `isinrange(x, min, max)` function overloads by ~7 cycles
- Reduced latency of `float`/`double` to `(U)Int128` type conversion drastically (hundreds of cycles...!)
- Reduced code size of vector `bits_depositparallel` and `bits_extractparallel` by removing 1 instruction if compiling for SSE4 or higher
- Reduced code size of vectorized `perm` by removing 3 instructions
- Reduced code size of vectorized `intpow` by removing 5 or more instructions
- Reduced code size of `ushort8` to `float8` type conversion by removing 1 instruction
- Added SSE2 fallback code for `(u)long2/3/4` to `(s)byte2/3/4` type conversion
- Added SSE2 fallback code for `countbits`
- Added SSE2 fallback code for `mulsaturated` for `int2/3/4`
- Added AVX fallback code for `float8` and `double3/4` `gamma`
- Added AVX fallback code for `float8` and `double3/4` `erf(c)`
- Added AVX fallback code for `double3/4` `(r)cbrt`
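
The "one division per loop iteration" remark in the `comb` entry above can be pictured with the following scalar sketch. It is my reading of the idea, not the library's code: the running product is kept in the next wider integer type so that multiplying before dividing cannot overflow. `CombSketch.Comb` is a hypothetical name, and the result is only meaningful when the binomial coefficient itself fits in a `uint`.

```csharp
using System;

// Hypothetical illustration; not MaxMath's comb implementation.
static class CombSketch
{
    // Binomial coefficient C(n, k) with a single division per iteration.
    public static uint Comb(uint n, uint k)
    {
        if (k > n) return 0;
        if (k > n - k) k = n - k; // C(n, k) == C(n, n - k); keeps partial results small

        ulong c = 1; // next wider type: c * (n - i) stays below 2^64
        for (uint i = 0; i < k; i++)
        {
            // Invariant: c == C(n, i). Since C(n, i) * (n - i) == C(n, i + 1) * (i + 1),
            // the division below is always exact.
            c = c * (n - i) / (i + 1);
        }

        return (uint)c;
    }
}
```

Because the divisor `i + 1` runs through the same small values on every call, this loop is also a natural place for precomputed dividers, which is presumably why `Divider<T>` pays off so well here.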
Changes
- Bumped C# Dev Tools dependency to version 1.0.9
- Bumped Unity.Mathematics dependency to version 1.3.1 and implemented `square` for all types, `chgsign` for floating point types and the constants `TAU` (present since 2020, updated documentation), `PI2`, `PIHALF`, `TODEGREES` and `TORADIANS`
- Bumped Unity.Burst dependency to version 1.8.18
- Moved `Promise` type declaration from `MaxMath.maxmath` (class) to `MaxMath` (namespace)
- Two-parameter `all_dif` overloads are no longer a wrapper for `math.all(a != b)`; these now return true if and only if both vectors do not share any components with each other
- `double` to `(u)long` type conversion outside of the `(u)long` domain is now undefined behavior (just like in the C standard) - the Mono runtime changed its behavior with an upgrade to some Unity 2022 version and is more performance intensive to follow "correctly" now. Reduced latency by 2 to 4 cycles, removed up to 5 instructions and 3 pieces of constant data read from RAM
- `rol`/`ror` return values are now undefined if the rotation amount is outside [0, BITS) (compliant with Unity.Mathematics). Reduced latency by up to 2 cycles if the rotation amount is not a compile time constant
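
The behavioral change of the two-parameter `all_dif` overloads is easy to miss, so here is a scalar sketch contrasting the old and new semantics as described above. Arrays of equal length stand in for vectors, and both method names are hypothetical.

```csharp
using System;

// Hypothetical illustration of the described semantics; not MaxMath code.
static class AllDifSketch
{
    // Old behavior: equivalent to math.all(a != b), a lane-by-lane comparison.
    public static bool ComponentwiseAllDifferent(int[] a, int[] b)
    {
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] == b[i]) return false;
        }
        return true;
    }

    // New behavior: true only if no value of a occurs anywhere in b.
    public static bool NoSharedComponents(int[] a, int[] b)
    {
        foreach (int x in a)
        {
            foreach (int y in b)
            {
                if (x == y) return false;
            }
        }
        return true;
    }
}
```

For `a = (1, 2, 3, 4)` and `b = (2, 3, 4, 1)` the old check passes because no lane matches, while the new check fails because every value is shared. Code that relied on the old result should call `math.all(a != b)` directly.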
Fixed Oversights
- Added `asbyte(sbyte{X})`, `assbyte(byte{X})`, `asshort(ushort{X})`, `asushort(short{X})`, `asint(uint8)`, `asuint(int8)`, `aslong(ulong{X})`, `asulong(long{X})`
- Added `minmax(a, b, out min, out max)` for scalar types; also improved managed fallback performance
- Added `minmaxmag(a, b, out minmag, out maxmag)` for scalar types; also improved managed fallback performance
- Added `bitmask` overloads for `bool2` and `bool3`, an oversight by the Unity.Mathematics team
- Added missing `(u)long` matrix from/to `(u)int` matrix conversion operators
- Added C# unboxing casts to `(U)Int128` `Equals(object other)`
- `countbits(uint8 x)` now returns an `int8`
Upcoming
The next release, 3.0, is going to be a major deviation from the previous premise of this library being an extension to Unity.Mathematics. It is going to become a wrapper library instead, replacing Unity.Mathematics entirely, while retaining the things that work well and cannot be done with MaxMath and Burst alone.
The most important guideline for development of said wrapper library is an easy transition from either Unity.Mathematics alone or in combination with MaxMath. The maxmath class will be renamed to math; a simple case sensitive find&replace ("Unity.Mathematics" -> "MaxMath" and "maxmath" -> "math") will be enough to migrate to this new version.
Reasons for this are:
- (Performance:) Integer division/mod is not vectorized in Unity.Mathematics - just as an example. More dramatically, Unity.Mathematics' `bool` vectors perform poorly and a solution is at hand.
- (Performance:) Some Unity.Mathematics vector-functions are naively auto-vectorized instead of hand optimized to utilize specialized CPU instructions. An example would be `sign(int4)`, which is about three times slower than it needs to be on x86.
- Unity.Mathematics is not really updated anymore - at least not as much as it was around 2020, when feedback was still taken into consideration. This is not the case anymore (I alone have submitted multiple now year-old bug reports which have not been addressed), thus we cannot expect certain fixes, like `half` types not being IEEE 754 compliant at all, or the performance bugs mentioned above. We also cannot expect API updates like implementing `int4 >> int4` once Unity supports C# 10
Release 3.0 is going to be released as both an extension- and as a wrapper library. It is only going to focus on extra test coverage and possible bugfixes, i.e. there are not going to be any improvements or additions.
Questions are welcome. Please use the issue tracker or seek contact on Discord