
Use LoopVectorization.@turbo in dynamic expression evaluation scheme #9


Merged
MilesCranmer merged 7 commits into master from loop-vectorization on Oct 27, 2022

Conversation

MilesCranmer
Member

With JuliaSIMD/LoopVectorization.jl#431, this should potentially work for arbitrary user-defined operators, since it will fall back to @inbounds @simd when an operator cannot be SIMD-ified. It makes evaluation even faster for me: evaluation is now about 30% faster than a simple handwritten function, and even slightly faster than a manually-written SIMD loop (one that does not use @turbo).
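To make the comparison concrete, here is a minimal sketch of the two kernel styles being compared (cos and the array size are illustrative placeholders, not the actual benchmark):

using LoopVectorization, BenchmarkTools

# The @turbo kernel: what this PR emits when the operator can be SIMD-ified.
function turbo_kernel!(y, x)
    @turbo for i in indices(y)
        y[i] = cos(x[i])
    end
    return y
end

# The manually-written SIMD loop: the fallback path (no @turbo).
function simd_kernel!(y, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = cos(x[i])
    end
    return y
end

x = randn(Float32, 10_000);
y = similar(x);
@btime turbo_kernel!($y, $x);
@btime simd_kernel!($y, $x);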

Let's see if this works.

cc @chriselrod

@MilesCranmer
Member Author

Hm, it's getting stack overflows again on this test, even with the safe mode:

Test tree construction and scoring: Error During Test at /home/runner/.julia/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
  Got exception outside of a @test
  LoadError: StackOverflowError:
  Stacktrace:
       [1] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29
       [2] abs(a::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/base_defs.jl:46
  --- the last 2 lines are repeated 39990 more times ---
   [79983] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29

Maybe the safe operators break some of the assumptions?

safe_log(x::T) where {T} = (x <= 0) ? T(NaN) : log(x)
Or maybe it's an issue with Float16 operations?

@chriselrod

  1. Define safe_log(x::AbstractSIMD) = log(x); these are already safe.
  2. Yeah, Float16 isn't well supported or heavily tested.
  3. can_turbo checks assuming Int, rather than the element type.

Feel free to either define LoopVectorization.check_type(::Type{Float16}) = false, or make PRs to VectorizationBase/wherever is appropriate to fix things like abs that are broken.
Probably wouldn't require too much work to get a lot of this working:
https://github.com/JuliaSIMD/VectorizationBase.jl/blob/d39b4a04bc00b5485d7f95d490ce5433cf921cde/src/llvm_intrin/intrin_funcs.jl#L138-L139
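Putting those suggestions into code, a sketch (assuming VectorizationBase is available as a direct dependency):

using LoopVectorization
using VectorizationBase: AbstractSIMD

# The scalar method keeps the NaN guard; the SIMD method just forwards,
# since the vectorized log used inside @turbo is already safe:
safe_log(x::T) where {T} = (x <= 0) ? T(NaN) : log(x)
safe_log(x::AbstractSIMD) = log(x)

# Or opt Float16 out of @turbo entirely, as suggested above:
LoopVectorization.check_type(::Type{Float16}) = false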

@MilesCranmer
Member Author

Thanks! I can certainly also make it so that @turbo is only attempted for Float32 and Float64, with all other types using @inbounds @simd.

How hard do you think it would be for me to build a macro in this package that does a test vectorization on the given type and operator and, if it fails for whatever reason, falls back to @inbounds @simd? I'm hoping to make this as robust as possible for all user-defined operators.

Maybe a simpler but less general solution would be to have a pre-defined set of operators that are known to work, and only use @turbo on those.
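A sketch of that allow-list idea, with hypothetical names (not part of the package):

using LoopVectorization

# Operators pre-verified to work under @turbo (hypothetical list):
const TURBO_SAFE_OPERATORS = (+, -, *, /, cos, sin, exp, abs)

is_turbo_safe(op::F) where {F} = any(op === g for g in TURBO_SAFE_OPERATORS)

function unary_kernel!(y::AbstractVector, x::AbstractVector, op::F) where {F}
    if is_turbo_safe(op)
        @turbo for i in indices(y)
            y[i] = op(x[i])
        end
    else
        @inbounds @simd for i in eachindex(y, x)
            y[i] = op(x[i])
        end
    end
    return y
end

One caveat: the @turbo branch still has to compile even when it is never taken, so a generated function (as in the next comment) can avoid instantiating it for unsafe operators.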

@MilesCranmer
Member Author

MilesCranmer commented Oct 25, 2022

Here's an example. Could something like this be done in a macro instead of a generated function? Maybe even as a new @turbo_with_fallback macro in LoopVectorization.jl?

using DynamicExpressions
using LoopVectorization

@generated function unary_operator_kernel!(
    x::AbstractArray{T}, y::AbstractArray{T}, ::Val{op_idx}, operators::OperatorEnum
) where {T,op_idx}
    # Inside a @generated function body, the argument names refer to *types*,
    # so all of this runs once, at compile time.
    # First, we try to @turbo-evaluate a small example array:
    num_samples = 32
    _x = similar(x, num_samples);
    _y = similar(x, num_samples);
    # Get the operator from the OperatorEnum's type parameters:
    unaops = operators.parameters[2]
    op = unaops.parameters[op_idx].instance
    can_turbo = try
        @turbo for i in indices(_x)
            _y[i] = op(_x[i])
        end
        true
    catch  # Catch ALL errors, including the StackOverflowError seen above.
        false
    end
    # Emit the fast kernel if the trial succeeded, otherwise the safe fallback:
    if can_turbo
        quote
            @turbo for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    else
        quote
            @inbounds @simd for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    end
end

x = randn(Float16, 100);
y = similar(x);
operators = OperatorEnum(;unary_operators=[abs])
unary_operator_kernel!(x, y, Val(1), operators)

Even though @turbo's abs breaks for Float16 (the stack overflow above), this will still work: the trial @turbo hits the stack overflow, which is caught, so @inbounds @simd is used in the generated function instead.

@chriselrod

chriselrod commented Oct 25, 2022

> Thanks! I can certainly also make it so that @turbo is only attempted for Float32 and Float64, with all other types using @inbounds @simd.

My suggestion was to PR LoopVectorization to disable it for Float16 in general, or (better) fix it so it works.

Another option is to actually forward and promote type information to the can_turbo calls.

@MilesCranmer
Member Author

Thanks, will consider! Would that be the one remaining issue or do you foresee others?

I am wondering whether to simply add an opt-in “use_turbo” argument to the evaluation function, so that users are not surprised by any errors that come out of evaluation.
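For example, something along these lines, with a hypothetical kernel standing in for the real evaluation functions; the flag is combined with an element-type check so @turbo is only attempted where it is known to work:

using LoopVectorization

# Hypothetical helper: only use @turbo when the user opts in AND the
# element type is one that vectorizes reliably.
use_turbo(::Type{T}, turbo::Bool) where {T} = turbo && (T === Float32 || T === Float64)

function square_kernel!(y::AbstractVector{T}, x::AbstractVector{T}; turbo::Bool=false) where {T}
    if use_turbo(T, turbo)
        @turbo for i in indices(y)
            y[i] = x[i] * x[i]
        end
    else
        @inbounds @simd for i in eachindex(y, x)
            y[i] = x[i] * x[i]
        end
    end
    return y
end

xs = randn(Float32, 1_000);
square_kernel!(similar(xs), xs)               # safe default
square_kernel!(similar(xs), xs; turbo=true)   # explicit opt-in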

@MilesCranmer
Member Author

Weird: I'm also seeing this for the derivative ops with Float32 variables. Just to be robust, maybe it's best to leave it optional for now, with a turbo::Bool kwarg to the evaluation functions.

Testing derivatives with respect to variables, with type=Float32.
ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context
Stacktrace:
  [1] _pow_grad_x(x::VectorizationBase.Vec{4, Float32}, p::VectorizationBase.Vec{4, Float32}, y::VectorizationBase.Vec{4, Float32})
    @ ChainRules ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:323
  [2] power_pullback
    @ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:201 [inlined]
 ...
 [20] #eval_diff_tree_array#33
    @ ~/Documents/DynamicExpressions.jl/src/EvaluateEquationDerivative.jl:49 [inlined]
 [21] #307
    @ ./none:0 [inlined]
 [22] iterate
    @ ./generator.jl:47 [inlined]
 [23] collect(itr::Base.Generator{UnitRange{Int64}, var"#307#313"{Matrix{Float32}, OperatorEnum{Tuple{typeof(+), typeof(*), typeof(-), typeof(/), typeof(pow_abs2)}, Tuple{typeof(custom_cos), typeof(exp), typeof(sin)}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(+)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(*)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(-)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(/)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(pow_abs2)}}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(custom_cos)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(exp)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(sin)}}}, Bool, Node{Float32}}})
    @ Base ./array.jl:787
 [24] top-level scope
    @ ~/Documents/DynamicExpressions.jl/test/test_derivatives.jl:85
 [25] include(fname::String)
    @ Base.MainInclude ./client.jl:476
 [26] top-level scope
    @ REPL[21]:1
in expression starting at /Users/mcranmer/Documents/DynamicExpressions.jl/test/test_derivatives.jl:45

@MilesCranmer MilesCranmer requested a review from kazewong October 26, 2022 18:23
@MilesCranmer
Member Author

Okay, with turbo set to optional, I can get the tests to pass. I'm seeing a really nice speed boost!!

using DynamicExpressions
using BenchmarkTools

X = randn(Float32, 3, 5_000);

# Define the operator set first (constructing the OperatorEnum also defines
# the operators on Node, so the tree below can be built with them):
operators = OperatorEnum(; binary_operators=[+, -, *, /], unary_operators=[cos, sin])

# Feature nodes:
x1, x2 = Node(;feature=1), Node(;feature=2)
# Dynamically construct the expression:
tree = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)

# Evaluate:
@btime tree(X)
# 128 us
@btime tree(X; turbo=true)
# 57 us

Wow that is fast!

@coveralls

Pull Request Test Coverage Report for Build 3331534317

  • 61 of 68 (89.71%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.6%) to 92.451%

Changes missing coverage:

File         | Covered Lines | Changed/Added Lines | %
------------ | ------------- | ------------------- | -----
src/Utils.jl | 3             | 10                  | 30.0%

Totals:

  • Change from base Build 3308712281: +0.6%
  • Covered Lines: 894
  • Relevant Lines: 967

💛 - Coveralls

@MilesCranmer
Copy link
Member Author

Still not as fast as a handwritten kernel, though (the more complex the expression, the larger the advantage of a single fused kernel over the recursive evaluation):

using LoopVectorization, BenchmarkTools

function f(X)
    y = Array{Float32}(undef, size(X, 2))
    @turbo for i in indices(y)
        x1 = X[1, i]
        x2 = X[2, i]
        y[i] = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)
    end
    return y
end

@btime f(X)
# 5.2 us

@MilesCranmer MilesCranmer merged commit f3d6be0 into master Oct 27, 2022
@MilesCranmer MilesCranmer deleted the loop-vectorization branch October 27, 2022 16:35