EnsembleGPUKernel + Texture Memory Support #224
Comments
This looks great! |
|
Inspecting this kernel with Cthulhu (
• %419 = invoke ODEFunction(::SVector{6, Float32},::Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64},::Float32)::Union{}
That immediately shows the problem: your kernel still contains CPU data structures (CuTexture, CuTextureArray), which first need to be converted to their GPU counterpart (CuDeviceTexture). Normally this conversion happens automatically when passing such objects to a kernel. For structs containing GPU objects, you need to define Adapt rules. It seems that ODEProblem already does so, because the
However, the problematic atsit5 invocation below passes a GPU vector of problems, and the CPU-to-GPU object conversion does not happen automatically for array elements. Doing so would require downloading the array from GPU to CPU, performing the conversion there, allocating a new array, and uploading it again, which would make GPU kernel launches unacceptably expensive:
The simplest solution here would be to pass a tuple of ODEProblems, because for tuples we can do the conversion efficiently. I'm not sure where that change would be needed, but @utkarsh530 or @ChrisRackauckas probably know where this comes from. |
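The difference between tuples and arrays described above can be reproduced on the CPU with Adapt.jl alone. The following is a minimal sketch, not the CUDA.jl conversion itself: `ToFloat32` is a stand-in adaptor and `WeatherParams` is a hypothetical parameter struct, both introduced here only for illustration. The adaptor converts floating-point arrays to `Float32` storage, which plays the role of the CuTexture-to-CuDeviceTexture conversion.

```julia
using Adapt

# Stand-in adaptor (an assumption for illustration, not CUDA's adaptor):
# converts floating-point arrays to Float32 storage and leaves everything else alone.
struct ToFloat32 end
Adapt.adapt_storage(::ToFloat32, xs::AbstractArray{<:AbstractFloat}) = Float32.(xs)

# Hypothetical parameter container; not a DiffEqGPU or SciMLBase type.
struct WeatherParams{T, A}
    scale::T
    table::A
end

# Generates an adapt_structure method that rebuilds the struct from its adapted fields.
Adapt.@adapt_structure WeatherParams

p = WeatherParams(2.0f0, [1.0, 2.0, 3.0])

# A struct with an Adapt rule is converted field by field.
q = adapt(ToFloat32(), p)

# A tuple is recursed into, so the contained struct is converted as well.
in_tuple = adapt(ToFloat32(), (p,))

# An array of structs is treated as a leaf: its elements are left untouched.
in_array = adapt(ToFloat32(), [p])

println(typeof(q.table))
println(typeof(in_tuple[1].table))
println(typeof(in_array[1].table))
```

The last line shows the array element still holding its original `Vector{Float64}`, which mirrors why a GPU vector of problems reaches the kernel with unconverted texture objects inside.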
Hmm, passing these ODE problems as a tuple isn't going to work because of their size:
using Adapt
# HACK: force a GPU array of ODE problems to be passed as a tuple
Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem}) =
    tuple(adapt.(Ref(to), Array(x))...)
... yields
function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
    # first convert the contained ODE problems
    y = CuArray(adapt.(Ref(to), Array(x)))
    # continue doing what the default method does
    Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end
And with that, the kernel launches and runs 🙂 |
The array of problems is built here: Lines 598 to 606 in 70b0820
The cu(probs) dispatch is here: Lines 690 to 696 in 70b0820
But yes, I understand your point about why that wasn't working. We can definitely try to figure out how "expensive" it is, but it works fine currently |
Here's the comparison with Script
Benchmarking:
|
@utkarsh530 is that the right script? There is no |
Sorry, I just updated it. |
I am trying to test this on an A100 and I get the following on all the solves.
julia> esol_gpu = solve(eprob_interp, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
ERROR: UndefKeywordError: keyword argument dt not assigned
Stacktrace:
[1] batch_solve_up_kernel(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, DataInterpolations.LinearInterpolation{SMatrix{4, 64, Float32, 256}, LinRange{Float32, Int64}, true, Float32}}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_gpu), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:382
[2] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#13#19", LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:345
[3] macro expansion
@ ./timing.jl:382 [inlined]
[4] __solve(ensembleprob::EnsembleProblem{ODEProblem{SVector{6, Float32}, Tuple{Float32, Float32}, false, Tuple{SVector{3, Float32}, CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}}, Float32, Int64}, ODEFunction{false, SciMLBase.AutoSpecialize, typeof(ballistic_t), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#15#16", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::GPUTsit5, ensemblealg::EnsembleGPUKernel; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, LinRange{Float32, Int64}, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{LinRange{Float32, Int64}}}})
@ DiffEqGPU ~/.julia/packages/DiffEqGPU/CiiCq/src/DiffEqGPU.jl:254
[5] #solve#33
@ ~/.julia/packages/DiffEqBase/Lq1gG/src/solve.jl:851 [inlined]
[6] top-level scope
@ ~/GPUODEBenchmarks/GPU_ODE_SciML/Texture/wind.jl:132 |
nvm, |
A100 results for comparison.
|
@utkarsh530 I am trying to benchmark the texture memory for different numbers of trajectories (the above only uses 100 trajectories). However, I am noticing that when running GPUTsit5() for large numbers, e.g. 8388608, I have essentially 0% GPU usage according to |
Generally, that's the case with the |
script
using Pkg; cd(@__DIR__); Pkg.activate(".")
using CUDA, DiffEqGPU, OrdinaryDiffEq, Plots, Serialization, StaticArrays, Distributions, LinearAlgebra, Adapt
import DataInterpolations
const DI = DataInterpolations
function ballistic(u, p, t)
CdS, mass, g = p[1]
interp = p[2]
zmax = p[3]
N = p[4]
vel = @view u[4:6]
wind, ρ = get_weather(interp, u[3], zmax, N)
airvelocity = vel - wind
airspeed = norm(airvelocity)
accel = -(ρ * CdS * airspeed) / (2 * mass) * airvelocity - mass*SVector{3}(0f0, 0f0, g)
return SVector{6}(vel..., accel...)
end
@inline function get_weather(tex, z, zmax, N)
idx = (1f0-1f0/N)*z/zmax + 0.5f0/N # normalized input for table lookup based on https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#table-lookup
weather = tex[idx]
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
@inline function get_weather(itp::DI.LinearInterpolation, z, zmax, N)
weather = itp(z)
wind = SVector{3}(weather[2], weather[3], 0f0)
ρ = weather[4]
wind, ρ
end
function Adapt.adapt_structure(to::CUDA.Adaptor, x::CuArray{<:ODEProblem})
# first convert the contained ODE problems
y = CuArray(adapt.(Ref(to), Array(x)))
# continue doing what the default method does
Base.unsafe_convert(CuDeviceArray{eltype(y),ndims(y),CUDA.AS.Global}, y)
end
# build interpolants
data = deserialize(joinpath(@__DIR__,"data","forecast.txt"))
N = length(data.altitude)
def_zmax = data.altitude[end]
weather_sa = map(data.altitude, data.windx, data.windy, data.density) do alt, wx, wy, ρ
SVector{4}(alt, wx, wy, ρ)
end
weather_sa = SVector{length(weather_sa)}(weather_sa)
interp = DI.LinearInterpolation{true}(hcat(weather_sa...),data.altitude)
weather = map(weather_sa) do w
(w...,)
end
weather_TA = CuTextureArray(weather)
texture = CuTexture(weather_TA; address_mode = CUDA.ADDRESS_MODE_CLAMP, normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
### Simulation parameters
trajectories = 10_000
u0 = @SVector [0.0f0, 0.0f0, 10000.0f0, 0f0, 0f0, 0f0]
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)
p = @SVector [25f0, 225f0, 9.807f0]
p_tx = (p, texture, def_zmax, N)
p_di = (p, interp, def_zmax, N)
CdS_dist = Normal(0f0, 1f0)
prob_func = (prob, i, repeat) -> remake(prob, p = (p + SVector{3}(rand(CdS_dist), 0f0, 0f0), prob.p[2:end]...))
### Texture Solve Test
# High Level
prob_tx = ODEProblem(ballistic, u0, tspan, p_tx)
eprob_tx = EnsembleProblem(prob_tx; prob_func, safetycopy = false)
@time esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
# Low Level
@time begin
probs_tex = map(1:trajectories) do i
prob_func(prob_tx, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); saveat)
end
### DI Solve Test
# High Level
prob_di = ODEProblem(ballistic, u0, tspan, p_di)
eprob_di = EnsembleProblem(prob_di; prob_func, safetycopy = false)
@time esol_di = solve(eprob_di, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories, saveat)
# Low Level
@time begin
probs_di = map(1:trajectories) do i
prob_func(prob_di, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); saveat)
end
# trajs = 8*4 .^(0:9)
using BenchmarkTools
BenchmarkTools.DEFAULT_PARAMETERS.samples = 3
trajs = 8*4 .^(0:8)
times = map(trajs[end]) do traj
@show traj
tx_lo = @benchmark @CUDA.sync begin
probs_tex = map(1:$traj) do i
prob_func(prob_tx, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_tex, prob_tx, GPUTsit5(); $saveat)
end
display(tx_lo)
di_lo = @benchmark @CUDA.sync begin
probs_di = map(1:$traj) do i
prob_func(prob_di, i, false)
end |> cu
ts,us = DiffEqGPU.vectorized_asolve(probs_di, prob_di, GPUTsit5(); $saveat)
end
display(di_lo)
tx = @benchmark @CUDA.sync esol_gpu = solve(eprob_tx, GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
display(tx)
di = @benchmark @CUDA.sync esol_gpu = solve(eprob_di , GPUTsit5(), EnsembleGPUKernel(0.0); trajectories = $traj, saveat = $saveat)
display(di)
dicpu = @benchmark esol_cpu = solve(eprob_di , Tsit5(), EnsembleThreads(); trajectories = $traj, saveat = $saveat)
display(dicpu)
(tx = minimum(tx.times) / 1e6,
di = minimum(di.times) / 1e6,
dicpu = minimum(dicpu.times) / 1e6,
tx_lo = minimum(tx_lo.times) / 1e6,
di_lo = minimum(di_lo.times) / 1e6)
end
using Plots
begin
plt = plot(trajs, getindex.(times, :tx), label = "Texture", marker = :utriangle, legend = :topleft, yaxis = :log, xaxis=:log)
plot!(trajs, getindex.(times, :di), label = "Software", marker = :ltriangle)
plot!(trajs, getindex.(times, :dicpu), label = "CPU", marker = :square)
plot!(trajs, getindex.(times, :tx_lo), label = "Texture Low-Level", marker = :dtriangle, legend = :topleft)
plot!(trajs, getindex.(times, :di_lo), label = "Software Low-Level", marker = :rtriangle)
xlabel!("Trajectories")
ylabel!("(ms)")
plt
end
The low-level times include the time to create and copy the probs. CPU times is using
For, say, 524288 trajectories, if I time just the solve with the low-level interface, I get
During this low-level solve I get 100% CPU usage on a single core and 0% usage on the GPU until right before the solve completes where it blips to ~30%. |
I am not sure, but the 100% CPU usage could be due to expensive GPU kernel launches requiring multiple uploads to GPU and CPU due to the reason Tim pointed out earlier. Even the scaling of the plot with trajectories seems a bit off, compared to plain ODE solves benchmarks. IMHO, we should only compare the time spent on solving the ODE rather than setup (which is generating GPU arrays of |
I think @maleadt might be able to comment better here. |
After reviewing @maleadt's earlier comments, I wonder if it makes sense to have a way to pass some parameters that are "global" to all ensembles, or as Refs, so that the conversion only needs to happen once. |
So, I've been experimenting with different options for passing the
NOTE: I am trying to avoid using the Adapt rule above due to the huge conversion overhead.
Here is a MWE demonstrating the issue.
First, create a texture that is just 0 everywhere and verify that it can be used in a closure w/ proper conversions. This works as expected.
using CUDA, DiffEqGPU, OrdinaryDiffEq, StaticArrays, Adapt
data = map(1:5000) do w
(zeros(Float32, 4)...,)
end
texture = CuTexture( CuTextureArray(data); address_mode = CUDA.ADDRESS_MODE_CLAMP,
normalized_coordinates = true, interpolation = CUDA.LinearInterpolation())
idx_gpu = CuArray(LinRange(0f0, 1f0, 4000))
dst_gpu = CuArray{NTuple{4, Float32}}(undef, size(idx_gpu))
function interp!(dst, idx, tex)
dst .= getindex.(Ref(tex), idx)
end
cl_let = let tex = texture
(d,i) -> interp!(d, i, tex)
end
cl_let(dst_gpu, idx_gpu);
Next, let's define a simple ODE that accepts a texture as an argument but does nothing with it and solves over a closure of
function eom(u, p, t, tex)
return @SVector zeros(Float32, 4)
end
trajectories = 8192
u0 = @SVector zeros(Float32, 4)
tspan = (0.0f0, 40.0f0)
saveat = LinRange(tspan..., 100)
prob_func = (prob, i, repeat) -> remake(prob, u0 = @SVector randn(Float32, 4))
cl = let tex = texture
(x,p,t)->eom(x, p, t, tex)
end
prob = ODEProblem(cl, u0, tspan)
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)
esol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
This solves with no issue. Note: no Adapt rule was defined.
Next, let's solve similarly but with an ODE that uses the texture
function eom_tex(u, p, t, tex)
@inbounds w = tex[0.5] #NTuple of zeros
return SVector(w...)
end
cl_tex = let tex = texture
(x,p,t)->eom_tex(x, p, t, tex)
end
prob_tex = ODEProblem(cl_tex, u0, tspan)
eprob_tex = EnsembleProblem(prob_tex; prob_func, safetycopy = false)
esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
leading to
julia> esol_tex = solve(eprob_tex, GPUTsit5(), EnsembleGPUKernel(0.0); adaptive= true, trajectories, saveat)
ERROR: InvalidIRError: compiling kernel #atsit5_kernel(CuDeviceVector{ODEProblem{false,SVector{4, Float32},Tuple{Float32, Float32},…}, 1}, CuDeviceMatrix{SVector{4, Float32}, 1}, CuDeviceMatrix{Float32, 1}, Float32, CallbackSet{Tuple{}, Tuple{}}, Nothing, Float32, Float32, CuDeviceVector{Float32, 1}, Val{false}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to (f::ODEFunction)(args...) in SciMLBase
On inspection I see we still have a
• %2 = invoke eom_tex(::SVector{4, Float32},::SciMLBase.NullParameters,::Float32,::CuTexture{NTuple{4, Float32}, 1, CuTextureArray{NTuple{4, Float32}, 1}})::Union{}
Why? It seems like the proper conversion occurred in the other examples using a closure over the texture. If I add the Adapt rule, then it works. However, it spends almost all the time converting. I don't understand why the rule is needed, though, if the texture is not an explicit parameter and is used in a closure instead. |
Sorry for the slow response, I needed some time to catch up with my GH notifications 🙂
That's because in those cases you were invoking a closure directly, and Adapt.jl has rules (
Here, however, a kernel is invoked with an EnsembleProblem argument, which contains an ODEProblem, which contains an ODEFunction, which contains a closure that captures a CuTexture and a CuTextureArray. Although CUDA will use Adapt to try and convert such an argument to a device-compatible representation, there are no Adapt rules defined for these types of objects, so the conversion is a no-op.
So basically, there would need to be Adapt rules for each of these types so that kernel conversion recurses into the objects when the EnsembleProblem is passed to a kernel. Alternatively, the code constructing an EnsembleProblem could manually call
One simple way to add Adapt rules is to use
julia> Adapt.@adapt_structure EnsembleProblem
julia> Adapt.@adapt_structure ODEProblem
julia> Adapt.@adapt_structure ODEFunction
julia> cudaconvert(eprob_tex)
ERROR: MethodError: no method matching ODEFunction(::var"#14#15"{CuDeviceTexture{NTuple{4, Float32}, 1, CUDA.ArrayMemory, true, CUDA.LinearInterpolation}}, ::LinearAlgebra.UniformScaling{Bool}, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::typeof(SciMLBase.DEFAULT_OBSERVED), ::Nothing, ::Nothing) |
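The MethodError above is what happens when a generated rule tries to rebuild a type through a positional constructor that does not exist. A hand-written `Adapt.adapt_structure` method can call whichever constructor the type does provide. The sketch below illustrates the pattern on the CPU only: `KeywordWrapper` and `ToFloat32` are hypothetical names invented for this example, and the real `ODEFunction` constructors are not reproduced here.

```julia
using Adapt

# Stand-in adaptor (an assumption for illustration, not CUDA's adaptor):
# converts floating-point arrays to Float32 storage.
struct ToFloat32 end
Adapt.adapt_storage(::ToFloat32, xs::AbstractArray{<:AbstractFloat}) = Float32.(xs)

# Hypothetical type that only offers a keyword constructor,
# so it cannot be rebuilt from positional field values.
struct KeywordWrapper{A}
    table::A
    KeywordWrapper(; table) = new{typeof(table)}(table)
end

# A rule generated by `Adapt.@adapt_structure KeywordWrapper` would call
# `KeywordWrapper(adapted_table)`, for which no method exists.
# Writing the rule by hand lets it use the keyword constructor instead.
function Adapt.adapt_structure(to, w::KeywordWrapper)
    return KeywordWrapper(; table = adapt(to, w.table))
end

w = KeywordWrapper(; table = [1.0, 2.0, 3.0])
w32 = adapt(ToFloat32(), w)

println(typeof(w32.table))
```

Whether a rule of this shape is the right fix for `ODEFunction` depends on how its constructors handle the converted closure type, which this sketch does not address.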
@ChrisRackauckas @utkarsh530
Per our conversation, here is an example demonstrating how texture memory interpolation could be used.
I would like to be able to leverage CUDA.jl's texture memory support for interpolation of data in the EOM and/or in a callback. A use case could be dropping a ball in a wind field with ground impact termination for a non-flat terrain. Here, one would want to interpolate the wind field as a function of state in the eom as a forcing term and an elevation map as a function of altitude.
Below is an initial prototype. This includes a CPU implementation that leverages DataInterpolations.jl to demonstrate the functionality desired using this data forecast.txt I also included an initial non-working prototype using texture memory.
No interpolation
Working model for CPU and GPU w/o interpolation
DataInterpolations.jl CPU Example
This demonstrates the basic capability I would like to replicate w/ EnsembleGPUKernel using CUDA.CuTexture
GPU Texture Interpolation Validation
Demonstrate usage of CuTexture for interpolation. Note, here I index into the texture memory by broadcasting over a CuArray{Float32} of indices via dst_gpu .= getindex.(Ref(texture), idx_tlu)
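For checking texture results without a GPU, a CPU reference of the lookup can help. The sketch below assumes the 1D linear-filtering rule with normalized coordinates and clamp addressing from the CUDA programming guide, and an evenly spaced table covering altitudes from 0 to `zmax`. `lookup_linear` and `normalized_index` are hypothetical helper names; `normalized_index` uses the same expression as the `get_weather` texture method in the script posted in this thread. Texture hardware uses reduced-precision interpolation weights, so GPU results should only be expected to match this approximately.

```julia
# CPU reference for a 1D, clamped, linearly filtered table lookup
# with a normalized coordinate x in [0, 1].
function lookup_linear(table::AbstractVector, x::Real)
    N = length(table)
    xb = N * x - 0.5                 # texel-space coordinate; texel i is centered at i + 0.5
    i = floor(Int, xb)               # zero-based index of the left neighbor
    α = xb - i                       # interpolation weight of the right neighbor
    lo = clamp(i, 0, N - 1) + 1      # clamp addressing, shifted to Julia's 1-based indexing
    hi = clamp(i + 1, 0, N - 1) + 1
    return (1 - α) * table[lo] + α * table[hi]
end

# Maps an altitude z in [0, zmax] so that z = 0 hits the center of the first
# texel and z = zmax hits the center of the last one.
normalized_index(z, zmax, N) = (1 - 1 / N) * z / zmax + 0.5 / N

table = [10.0, 20.0, 30.0, 40.0]     # made-up values at altitudes 0, 100, 200, 300
zmax = 300.0
N = length(table)

at_bottom = lookup_linear(table, normalized_index(0.0, zmax, N))
at_middle = lookup_linear(table, normalized_index(150.0, zmax, N))
at_top    = lookup_linear(table, normalized_index(300.0, zmax, N))

println((at_bottom, at_middle, at_top))
```

Without the `0.5 / N` offset and the `(1 - 1 / N)` scaling, the first and last half-texel of the coordinate range would be clamped, and the end points of the table would not line up with `z = 0` and `z = zmax`.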
EnsembleGPUKernel + CuTexture prototype
Non-working prototype. Note here the get_weather function is indexing the texture at a single index for a single trajectory, which isn't supported by CUDA.jl. Although this is scalar indexing, it should actually be occurring for each trajectory in the ensemble.
ContinuousCallback Prototype
The above example only does interpolation in the eom. However, interpolation could also occur in evaluating a callback. e.g. something like this