Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does

Fixes

Fixed XML documentation not showing descriptions for valid Promise flags
Fixed cminmax documentation
bitmask64 with numBits equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
Fixed uint8 to float8 type conversion if compiling for AVX2
Fixed incorrect mod implementations
(ISSUE #16) Fixed float and double (r)cbrt edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector float implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid Promise flags allow for call-site selection of faster code paths

Additions

`Divider<T>`

Divider<T> is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type T, a Divider<T> instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms, typically used by compilers for compile time constant divisors.
Divider<T> was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
Divider<T> is NOT meant to replace divison operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable i is the divisor in a for-loop).
Numerous Promise flags allow for faster operations, provided that the Divider<T> instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Bursts inabilty to compile-time evaluate typeof(T) often requires explicit initialization (example: new Divider<byte>((byte)42)). DEBUG only validity checks ensure correct initialization and usage.
The current Divider API consists of...:

/ and % operators:
- LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of the that type
- LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of the that type
- LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
DivRem member methods
EvenlyDivides member methods
T Divisor as a readonly property
public const Promises within Divider<T>, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
Get/SetInnerDivider<U> methods: get or set a scalar- or vector Divider<U> within a Divider<T>
Component shuffles: Divider<T>.wzxy swizzle "operators" as properties.

NOTE: Get/SetInnerDivider<U> methods and Divider<T>.[a][b][c][d] properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a Divider<int2> from a Divider<int4>. Get/SetInnerDivider<U> are placeholderholders both for these operations as well as for the v[a]_[b] properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.

`quadruple` (PREVIEW)

Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE754 compliant and in the typical 1 sign bit, 15 exponent bits, 112 mantissa bits format.
NOTE: quadruple is in preview for an unforseeable amount of time. This means that it is neither completely optimized, nor are all maxmath functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, -(unary), +(binary), -(binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign

Functions

Added isnormal and issubnormal functions for floating point types
Added hypot and inthypot functions for calculating [int]sqrt(a * a + b * b) without overflow, unless an optional Promise parameter with its NoOverflow flag set is passed as a compile time constant argument
Added roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
Added cor and cxor. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
Split approx into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
Added roundmultiple(x, m), floormultiple(x, m), ceilmultiple(x, m) and truncmultiple(x, m) for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x to the nearest greater multiple of m)
Added a whole stack of bit manipulation functions for all scalar- and vector integer types: parityodd, parityeven, countzerobits, l1cnt, t1cnt, lzmask, tzmask, l1mask, t1mask, bits_extractlowest0, bits_masktolowest, bits_masktolowest0, bits_maskfromlowest, bits_maskfromlowest0, bits_setlowest, bits_surroundlowest and bits_surroundlowest0

Global Compilation Options

Added Global Compilation Options for OptimizeFor, FloatMode and FloatPrecision. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to OptimizeFor.Performance, FloatMode.Default and FloatPrecision.Standard, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs

Improvements

Performance

Implemented optimized (u)long vector to float vector type convesion operators
Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
Implemented an AssumeRangeAttribute equivalent for all vectorized functions with known return value ranges
Implemented more optimal (U)Int128 comparison operators
Implemented optimal (U)Int128 multiplication operations with- and division and modulo operations by compile time constants
Implemented optimal (U)Int128 division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any varation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()
Increased valid Promise.Unsafe0 range for (u)long intcbrt from [0, 2^46} to [0, 2^48]
Added an optional Promise parameter to gamma
Added an optional Promise parameter to erf(c)
Added an optional Promise parameter to gcd and lcm
Added quarter and half scalar- and vector function overloads for min, max, minmax, clamp, saturate, isinrange, trunc, round, ceil, floor and sign
Removed the only non-optimizing branch in vector code in the entire library within the long2/3/4 >> operator if the shift amount is not a compile time constant for a ~30% performance gain
Reduced branch predictor penalty of gcd by moving the loop condition to an earlier part of the loops code
Reduced latency of internal byte to uint/float and ushort to ulong/double vector zero-extension conversions (and back) if an entire SIMD register is converted, by ~6 cycles. This operation is especially relevant in byte vector shl and shr operations, which are very common throughout the library, including within loops. Consequently, this improvement should yield significant performance gains
Reduced latency of gcd within the loop by 1 to 8 clock cycles at the cost of 0 to 9 clock cycles outside the loop if compiling for SSE4 or higher, by adding 1 to 6 instructions.
Reduced latency of comb for all scalar and vector types.
1. (s)byte, (u)short and scalar (u)int overloads now make use of an algorithm that only requires one division per loop iteration instead of two, if the type fits into a single hardware register when cast to the next wider integer type. This algorithm is more than twice as fast.
2. Implemented Divider<T> into both algorithms. This yields a massive performance gain of 2.3x to 20x (!), at the cost of an immense amount of code size
3. The first eight loop iterations of both algorithms have been extracted from the loop and optimized by hand. This yields a performance gain between ~15 (16 bit) and ~600 (64 bit) clock cycles.

Both ii. and iii. are disabled if the global compilation tag OptimizeFor is set to OptimizeFor.Size.

Reduced latency of half{X} to (s)byte[X}, (u)short{X}, (u)long{X}, and half8 to (u)int8 type conversion from 10 down to 4 or 5 cycles, also affecting managed C# fallback performance. Unity.Mathematics' implementation remains suboptimal (for now... hint)
Reduced latency of SSE2 fallback code for ulong vector < and > operators by 1 cycle and reduced code size by removing 1 instruction
Reduced latency of (s)byte vector floorlog2/ceillog2 by 1 cycle and reduced code size by removing 1 to 2 instructions, as well as 1 constant read from RAM
Reduced latency of vector long (n)abs by 0 to 2 cycles (highly CPU specific) and reduced code size by removing 1 to 2 instructions
Reduced latency of all_dif overloads for all (s)byte vectors and short16 by 10% to 20%; the (s)byte overloads now use ~35% less constant data read from RAM
Reduced latency of (s)byte16/32 division and modulo operations by 6 cycles by adding 6 instructions
Reduced latency of vector float to (u)long type conversion by 5 to 10 cycles
Reduced latency of quarter/half to integer type conversion operations by 6 or more cycles
Reduced latency, code size and constant data read from RAM of SSE2 fallback code for (s)byte vector shrl, shra and shl drastically when the shift amount is constant
Reduced latency of vector rol/ror drastically when the rotation value is constant and a multiple of 8
Reduced latency of (u)int8 and (u)long vector l/tzcnt by up to 2 cycles and removed 1 instruction
Reduced latency of float8 to uint8 conversion operator by ~6 cycles if compiling for AVX2
Reduced latency of scalar- and vector double fastrsqrt by 4 cycles and increased accuracy of the result
Reduced latency of (u)long intsqrt by reducing the latency of each loop interation by 1 (signed) or 2 (unsigned) cycles, at the cost of an extra 5 (unsigned) or 6 (signed) instructions with a latency of 3 cycles outside the loop
Reduced latency of scalar (u)int intlog10 by 1 cycle and removed 1 instruction
Reduced latency of vector double to (u)long type conversion operators by 1 cycle and removed 5 instructions
Reduced latency of previously internal roundto(u)long(double x) by ~10 cycles, also reducing (u)long vector division latency by ~20 cycles down to ~94 cycles
Reduced latency of squaring (u)long vectors by 2 to 3 cycles and removed 2 instructions, also reducing (u)long vector inpow latency, as squaring is part of the loop
Reduced latency of all floating point nexttoward functions by 1 cycle
Reduced latency of (u)long vector multiplication by scalar- or vector integer types with bit-width less than or equal to 32 by 2 cycles and removed 1 instruction
Reduced latency of signed 8 bit integer division by 1 cycle if compiling for SSSE3 or higher
Reduced latency of managed- and SSE2 fallback code paths for integer vector isinrange(x, min, max) function overloads by ~2 cycles
Reduced latency of floating point isinrange(x, min, max) function overloads by ~7 cycles
Reduced latency of float/double to (U)Int128 type conversion drastically (hundreds of cycles...!)
Reduced code size of vector bits_depositparallel and bits_extractparallel by removing 1 instruction if compiling for SSE4 or higher
Reduced code size of vectorized perm by removing 3 instructions
Reduced code size of vectorized intpow by removing 5 or more instructions
Reduced code size of ushort8 to float8 type conversion by removing 1 instruction
Added SSE2 fallback code for (u)long2/3/4 to (s)byte2/3/4 type conversion
Added SSE2 fallback code for countbits
Added SSE2 fallback code for mulsaturated for int2/3/4
Added AVX fallback code for float8 and double3/4 gamma
Added AVX fallback code for float8 and double3/4 erf(c)
Added AVX fallback code for double3/4 (r)cbrt

Changes

Bumped C# Dev Tools dependency to version 1.0.9
Bumped Unity.Mathematics dependency to version 1.3.1 and implemented square for all types, chgsign for floating point types and the constants TAU (present since 2020, updated documentation), PI2, PIHALF, TODEGREES and TORADIANS
Bumped Unity.Burst dependency to version 1.8.18
Moved Promise type declaration from MaxMath.maxmath (class) to MaxMath (namespace)
Two-parameter all_dif overloads are no longer a wrapper for math.all(a != b); these now return true if and only if both vectors do not share any components with each other
double to (u)long type conversion outside of the (u)long domain is now undefined behavior (just like in the C standard) - the Mono runtime changed its behavior with an upgrade to some Unity 2022 version and is more performance intensive to follow "correctly" now. Reduced latency by 2 to 4 cycles, removed up to 5 instructions and 3 pieces of constant data read from RAM
rol/ror return values are now undefined if the rotation amount is outside [0, BITS) (compliant with Unity.Mathematics). Reduced latency by up to 2 cycles if the rotation amount is not a compile time constant

Fixed Oversights

Added asbyte(sbyte{X}), assbyte(byte{X}), asshort(ushort{X}), asushort(short{X}), asint(uint8), asuint(int8), aslong(ulong{X}), asulong(long{X})
Added minmax(a, b, out min, out max) for scalar types; Improved managed fallback performance also
Added minmaxmag(a, b, out minmag, out maxmag) for scalar types; Improved managed fallback performance also
Added bitmask overloads for bool2 and bool3, an oversight by the Unity.Mathematics team
Added missing (u)long matrix from/to (u)int matrix conversion operators
Added C# unboxing casts to (U)Int128 Equals(object other)
countbits(uint8 x) now returns an int8

Upcoming

The next release, 3.0, is going to be a major deviation from the previous premise of this library being an extension to Unity.Mathematics. It is going to become a wrapper library instead, replacing Unity.Mathematics entirely, while retaining the things that work well and cannot be done with MaxMath and Burst alone.
The most important guideline for development of said wrapper library is an easy transition from either Unity.Mathematics alone or in combination with MaxMath. The maxmath class will be renamed to math; a simple case sensitive find&replace ("Unity.Mathematics" -> "MaxMath" and "maxmath" -> "math") will be enough to migrate to this new version.
Reasons for this are:

(Performance:) Integer division/mod is not vectorized in Unity.Mathematics - just as an example. More dramatically Unity.Mathematics' bool vectors perform poorly and a solution is at hand.
(Performance:) Some Unity.Mathematics vector-functions are naively auto-vectorized instead of hand optimized to utilize specialized CPU instructions. An example would be sign(int4), which is about three times slower than it needs to be on X86.
Unity.Mathematics is not really updated anymore - at least not as much as it was around 2020, when feedback was still taken into consideration. This is not the case anymore (I alone have submitted multiple now year-old bug reports which have not been addressed) thus we cannot expect certain fixes, like half types not being IEEE754 compliant at all, or the performance bugs mentioned above. We can also not expect API updates like implementing int4 >> int4 once Unity supports C#10

Release 3.0 is going to be released as both an extension- and as a wrapper library. It is only going to focus on extra test coverage and possible bugfixes, i.e. there are not going to be any improvements or additions.
Questions are welcome. Please use the issue tracker or seek contact on Discord

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v2.9.0

Known Issues

Fixes

Additions

`Divider<T>`

`quadruple` (PREVIEW)

Functions

Global Compilation Options

Improvements

Meta

Performance

Changes

Fixed Oversights

Upcoming

Uh oh!

v2.9.0

Known Issues

Fixes

Additions

Divider<T>

quadruple (PREVIEW)

Functions

Global Compilation Options

Improvements

Meta

Performance

Changes

Fixed Oversights

Upcoming

Uh oh!

`Divider<T>`

`quadruple` (PREVIEW)