
Use LoopVectorization.@turbo in dynamic expression evaluation scheme #9


Merged
MilesCranmer merged 7 commits into master from loop-vectorization on Oct 27, 2022

Conversation

MilesCranmer
Member

With JuliaSIMD/LoopVectorization.jl#431, this should potentially work for arbitrary user-defined operators, since it will fall back to @inbounds @simd when an operator cannot be SIMD-ified. It makes evaluation even faster for me: evaluation is now about 30% faster than a simple handwritten function, and even slightly faster than a manually-written SIMD loop (one that does not use @turbo).
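To make the comparison concrete, here is a minimal sketch of the two kernel styles being compared (cos and the array size are illustrative placeholders, not the actual benchmark):

using LoopVectorization, BenchmarkTools

# The @turbo kernel: what this PR emits when the operator can be SIMD-ified.
function turbo_kernel!(y, x)
    @turbo for i in indices(y)
        y[i] = cos(x[i])
    end
    return y
end

# The manually-written SIMD loop: the fallback path (no @turbo).
function simd_kernel!(y, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = cos(x[i])
    end
    return y
end

x = randn(Float32, 10_000);
y = similar(x);
@btime turbo_kernel!($y, $x);
@btime simd_kernel!($y, $x);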

Let's see if this works.

cc @chriselrod

@MilesCranmer
Member Author

Hm, it's getting stack overflows again on this test, even with the safe mode:

Test tree construction and scoring: Error During Test at /home/runner/.julia/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
  Got exception outside of a @test
  LoadError: StackOverflowError:
  Stacktrace:
       [1] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29
       [2] abs(a::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/base_defs.jl:46
  --- the last 2 lines are repeated 39990 more times ---
   [79983] vabs(v::VectorizationBase.Vec{8, Float16})
         @ VectorizationBase ~/.julia/packages/VectorizationBase/MOxkH/src/llvm_intrin/unary_ops.jl:29

Maybe the safe operators break some of the assumptions?

safe_log(x::T) where {T} = (x <= 0) ? T(NaN) : log(x)
Or maybe it's an issue with Float16 operations?

@chriselrod

  1. Define safe_log(x::AbstractSIMD) = log(x); these are already safe.
  2. Yeah, Float16 isn't well supported or heavily tested.
  3. can_turbo checks assuming Int, rather than the element type.

Feel free to either define LoopVectorization.check_type(::Type{Float16}) = false, or make PRs to VectorizationBase/wherever is appropriate to fix things like abs that are broken.
Probably wouldn't require too much work to get a lot of this working:
https://github.com/JuliaSIMD/VectorizationBase.jl/blob/d39b4a04bc00b5485d7f95d490ce5433cf921cde/src/llvm_intrin/intrin_funcs.jl#L138-L139
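Putting those suggestions into code, a sketch (assuming VectorizationBase is available as a direct dependency):

using LoopVectorization
using VectorizationBase: AbstractSIMD

# The scalar method keeps the NaN guard; the SIMD method just forwards,
# since the vectorized log used inside @turbo is already safe:
safe_log(x::T) where {T} = (x <= 0) ? T(NaN) : log(x)
safe_log(x::AbstractSIMD) = log(x)

# Or opt Float16 out of @turbo entirely, as suggested above:
LoopVectorization.check_type(::Type{Float16}) = false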

@MilesCranmer
Member Author

Thanks! I can certainly also make it so that @turbo is only attempted for Float32 and Float64, with all other types using @inbounds @simd.

How hard do you think it would be for me to build a macro in this package that does a test vectorization on the given type and operator and, if it fails for whatever reason, falls back to @inbounds @simd? I'm hoping to make this as robust as possible for all user-defined operators.

Maybe a simpler but less general solution would be to have a pre-defined set of operators that are known to work, and only use @turbo on those.
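A sketch of that allow-list idea, with hypothetical names (not part of the package):

using LoopVectorization

# Operators pre-verified to work under @turbo (hypothetical list):
const TURBO_SAFE_OPERATORS = (+, -, *, /, cos, sin, exp, abs)

is_turbo_safe(op::F) where {F} = any(op === g for g in TURBO_SAFE_OPERATORS)

function unary_kernel!(y::AbstractVector, x::AbstractVector, op::F) where {F}
    if is_turbo_safe(op)
        @turbo for i in indices(y)
            y[i] = op(x[i])
        end
    else
        @inbounds @simd for i in eachindex(y, x)
            y[i] = op(x[i])
        end
    end
    return y
end

One caveat: the @turbo branch still has to compile even when it is never taken, so a generated function (as in the next comment) can avoid instantiating it for unsafe operators.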

@MilesCranmer
Member Author

MilesCranmer commented Oct 25, 2022

Here's an example. Could something like this be done in a macro instead of a generated function? Maybe even as a new @turbo_with_fallback macro in LoopVectorization.jl?

using DynamicExpressions
using LoopVectorization

@generated function unary_operator_kernel!(
    x::AbstractArray{T}, y::AbstractArray{T}, ::Val{op_idx}, operators::OperatorEnum
) where {T,op_idx}
    # Inside a @generated function body, the argument names refer to *types*,
    # so all of this runs once, at compile time.
    # First, we try to @turbo-evaluate a small example array:
    num_samples = 32
    _x = similar(x, num_samples);
    _y = similar(x, num_samples);
    # Get the operator from the OperatorEnum's type parameters:
    unaops = operators.parameters[2]
    op = unaops.parameters[op_idx].instance
    can_turbo = try
        @turbo for i in indices(_x)
            _y[i] = op(_x[i])
        end
        true
    catch  # Catch ALL errors, including the StackOverflowError seen above.
        false
    end
    # Emit the fast kernel if the trial succeeded, otherwise the safe fallback:
    if can_turbo
        quote
            @turbo for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    else
        quote
            @inbounds @simd for i in indices(x)
                y[i] = $op(x[i])
            end
        end
    end
end

x = randn(Float16, 100);
y = similar(x);
operators = OperatorEnum(;unary_operators=[abs])
unary_operator_kernel!(x, y, Val(1), operators)

Even though @turbo's abs breaks for Float16 (the stack overflow above), this will still work: the trial @turbo hits the stack overflow, which is caught, so @inbounds @simd is used in the generated function instead.

@chriselrod

chriselrod commented Oct 25, 2022

> Thanks! I can certainly also make it so that @turbo is only attempted for Float32 and Float64, with all other types using @inbounds @simd.

My suggestion was to PR LoopVectorization to disable it for Float16 in general, or (better) fix it so it works.

Another option is to actually forward and promote type information to the can_turbo calls.

@MilesCranmer
Member Author

Thanks, will consider! Would that be the one remaining issue or do you foresee others?

I am wondering whether to simply add an opt-in “use_turbo” argument to the evaluation function, so that users are not surprised by any errors that come out of evaluation.
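For example, something along these lines, with a hypothetical kernel standing in for the real evaluation functions; the flag is combined with an element-type check so @turbo is only attempted where it is known to work:

using LoopVectorization

# Hypothetical helper: only use @turbo when the user opts in AND the
# element type is one that vectorizes reliably.
use_turbo(::Type{T}, turbo::Bool) where {T} = turbo && (T === Float32 || T === Float64)

function square_kernel!(y::AbstractVector{T}, x::AbstractVector{T}; turbo::Bool=false) where {T}
    if use_turbo(T, turbo)
        @turbo for i in indices(y)
            y[i] = x[i] * x[i]
        end
    else
        @inbounds @simd for i in eachindex(y, x)
            y[i] = x[i] * x[i]
        end
    end
    return y
end

xs = randn(Float32, 1_000);
square_kernel!(similar(xs), xs)               # safe default
square_kernel!(similar(xs), xs; turbo=true)   # explicit opt-in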

@MilesCranmer
Member Author

Weird: I'm also seeing this for the derivative ops with Float32 variables. Just to be robust, maybe it's best to leave it optional for now, with a turbo::Bool kwarg to the evaluation functions.

Testing derivatives with respect to variables, with type=Float32.
ERROR: LoadError: TypeError: non-boolean (VectorizationBase.Mask{4, UInt8}) used in boolean context
Stacktrace:
  [1] _pow_grad_x(x::VectorizationBase.Vec{4, Float32}, p::VectorizationBase.Vec{4, Float32}, y::VectorizationBase.Vec{4, Float32})
    @ ChainRules ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:323
  [2] power_pullback
    @ ~/.julia/packages/ChainRules/2ql0h/src/rulesets/Base/fastmath_able.jl:201 [inlined]
 ...
 [20] #eval_diff_tree_array#33
    @ ~/Documents/DynamicExpressions.jl/src/EvaluateEquationDerivative.jl:49 [inlined]
 [21] #307
    @ ./none:0 [inlined]
 [22] iterate
    @ ./generator.jl:47 [inlined]
 [23] collect(itr::Base.Generator{UnitRange{Int64}, var"#307#313"{Matrix{Float32}, OperatorEnum{Tuple{typeof(+), typeof(*), typeof(-), typeof(/), typeof(pow_abs2)}, Tuple{typeof(custom_cos), typeof(exp), typeof(sin)}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(+)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(*)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(-)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(/)}, DynamicExpressions.OperatorEnumConstructionModule.var"#diff_op#11"{typeof(pow_abs2)}}, Tuple{DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(custom_cos)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(exp)}, DynamicExpressions.OperatorEnumConstructionModule.var"#735#diff_op#13"{typeof(sin)}}}, Bool, Node{Float32}}})
    @ Base ./array.jl:787
 [24] top-level scope
    @ ~/Documents/DynamicExpressions.jl/test/test_derivatives.jl:85
 [25] include(fname::String)
    @ Base.MainInclude ./client.jl:476
 [26] top-level scope
    @ REPL[21]:1
in expression starting at /Users/mcranmer/Documents/DynamicExpressions.jl/test/test_derivatives.jl:45

@MilesCranmer MilesCranmer requested a review from kazewong October 26, 2022 18:23
@MilesCranmer
Member Author

Okay, with turbo set to optional, I can get the tests to pass. I'm seeing a really nice speed boost!!

using DynamicExpressions
using BenchmarkTools

X = randn(Float32, 3, 5_000);

# Define the operator set first (constructing the OperatorEnum also defines
# the operators on Node, so the tree below can be built with them):
operators = OperatorEnum(; binary_operators=[+, -, *, /], unary_operators=[cos, sin])

# Feature nodes:
x1, x2 = Node(;feature=1), Node(;feature=2)
# Dynamically construct the expression:
tree = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)

# Evaluate:
@btime tree(X)
# 128 us
@btime tree(X; turbo=true)
# 57 us

Wow that is fast!

@coveralls

Pull Request Test Coverage Report for Build 3331534317

  • 61 of 68 (89.71%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.6%) to 92.451%

Changes missing coverage:

File         | Covered Lines | Changed/Added Lines | %
------------ | ------------- | ------------------- | -----
src/Utils.jl | 3             | 10                  | 30.0%

Totals:

  • Change from base Build 3308712281: +0.6%
  • Covered Lines: 894
  • Relevant Lines: 967

💛 - Coveralls

@MilesCranmer
Copy link
Member Author

Still not as fast as a handwritten kernel, though (the more complex the expression, the larger the advantage of a single fused kernel over the recursive evaluation):

using LoopVectorization, BenchmarkTools

function f(X)
    y = Array{Float32}(undef, size(X, 2))
    @turbo for i in indices(y)
        x1 = X[1, i]
        x2 = X[2, i]
        y[i] = cos(x1 - 3.2) * 1.5 - 2.0 / (sin(x2) * sin(x2) + 0.01)
    end
    return y
end

@btime f(X)
# 5.2 us

@MilesCranmer MilesCranmer merged commit f3d6be0 into master Oct 27, 2022
@MilesCranmer MilesCranmer deleted the loop-vectorization branch October 27, 2022 16:35