Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DifferentiationInterface for the jacobian #258

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

ErikQQY
Copy link
Member

@ErikQQY ErikQQY commented Nov 21, 2024

Fix: #218

MIRK, FIRK, MIRKN, and Ascher methods are using DifferentiationInterface.jl to support sparse Jacobian exploitation, shooting methods are working in progress.

Changing SparseDiffTools.jl to DifferentiationInterfce.jl doesn't cost much pain, but the performance drops comparing with the current master branch, I suspect this maybe caused by some inappropriate matrix coloring and sparsity detection usage of DifferentiationInterface.jl and SparseMatrixColorings.jl in this implementation, for example, the jac_prototype only need the basic structure of jacobian, but didn't find similar functionalities in DI.jl, the construction of ColoringProblem and **Algorithm in SparseMatrixColorings.jl introduce allocations, so cc Guillaume for some advice.

@gdalle
Copy link

gdalle commented Nov 21, 2024

Hi @ErikQQY, thanks for pinging me!

the performance drops comparing with the current master branch

Could you maybe give me a standalone benchmark MWE, even if it requires checking out the package from this branch?

for example, the jac_prototype only need the basic structure of jacobian, but didn't find similar functionalities in DI.jl

If you already know the Jacobian structure, you want to use KnownJacobianSparsityDetector from ADTypes.jl as the sparsity_detector in ADTypes.AutoSparse.

the construction of ColoringProblem and **Algorithm in SparseMatrixColorings.jl introduce allocations

That is unavoidable, but luckily you should be able to do it outside of the hot loop since it is encapsulated in the preparation step. The mantra of DI is "prepare once, reuse plenty of times". I'll check out your code but I suspect you may not be taking advantage of preparation.

@gdalle
Copy link

gdalle commented Nov 21, 2024

Review in progress, but I think you may have forgotten to specify a sparsity_detector in AutoSparse.
Possible choices include:

@gdalle
Copy link

gdalle commented Nov 21, 2024

Also I'm not sure why you perform differentiation with get_dense_ad(diffmode) instead of just diffmode?
To use sparse AD, you must pass an object of type AutoSparse to DI's preparation and differentiation operators. See the tutorial and the explanation in the docs.

Copy link
Contributor

github-actions bot commented Nov 21, 2024

Benchmark Results

master 1880155... master/1880155715ce48...
Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK2() 6.45 ± 0.16 ms 6.44 ± 0.17 ms 1
Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK3() 2.43 ± 0.12 ms 2.44 ± 0.12 ms 0.999
Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK4() 0.857 ± 0.061 ms 0.864 ± 0.057 ms 0.991
Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK5() 0.903 ± 0.061 ms 0.903 ± 0.072 ms 1
Simple Pendulum/IIP/BoundaryValueDiffEqMIRK.MIRK6() 1.01 ± 0.088 ms 1.01 ± 0.091 ms 1
Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = false) 1.97 ± 0.57 ms 1.99 ± 0.56 ms 0.989
Simple Pendulum/IIP/MultipleShooting(10, Tsit5; grid_coarsening = true) 3.65 ± 0.85 ms 3.66 ± 0.87 ms 0.997
Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = false) 0.0555 ± 0.0094 s 0.0555 ± 0.007 s 0.999
Simple Pendulum/IIP/MultipleShooting(100, Tsit5; grid_coarsening = true) 0.0709 ± 0.009 s 0.0711 ± 0.0087 s 0.997
Simple Pendulum/IIP/Shooting(Tsit5()) 0.24 ± 0.078 ms 0.257 ± 0.08 ms 0.934
Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK2() 0.0389 ± 0.0029 s 0.0389 ± 0.003 s 0.998
Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK3() 11.4 ± 0.81 ms 11.4 ± 0.77 ms 0.999
Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK4() 3.28 ± 0.21 ms 3.31 ± 0.2 ms 0.991
Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK5() 3.42 ± 0.25 ms 3.48 ± 0.27 ms 0.983
Simple Pendulum/OOP/BoundaryValueDiffEqMIRK.MIRK6() 3.48 ± 0.26 ms 3.49 ± 0.29 ms 0.997
Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = false) 3.52 ± 2.6 ms 3.52 ± 2.6 ms 1
Simple Pendulum/OOP/MultipleShooting(10, Tsit5; grid_coarsening = true) 6.29 ± 5.1 ms 6.33 ± 4.7 ms 0.994
Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = false) 0.107 ± 0.0042 s 0.108 ± 0.0036 s 0.992
Simple Pendulum/OOP/MultipleShooting(100, Tsit5; grid_coarsening = true) 0.129 ± 0.0098 s 0.128 ± 0.0088 s 1
Simple Pendulum/OOP/Shooting(Tsit5()) 0.612 ± 0.063 ms 0.607 ± 0.062 ms 1.01
time_to_load 7.35 ± 0.044 s 7.34 ± 0.068 s 1

Benchmark Plots

A plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

Copy link

@gdalle gdalle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the first few files because I think there are a few misunderstandings to clarify before going further

lib/BoundaryValueDiffEqAscher/Project.toml Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqAscher/src/ascher.jl Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqCore/src/utils.jl Show resolved Hide resolved
lib/BoundaryValueDiffEqAscher/src/ascher.jl Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqCore/src/BoundaryValueDiffEqCore.jl Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqFIRK/src/firk.jl Outdated Show resolved Hide resolved
Comment on lines 462 to 464
colored_matrix = __generate_sparse_jacobian_prototype(
cache, cache.problem_type, y, y, cache.M,
N, get_dense_ad(jac_alg.nonbc_diffmode))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you use this for?

@ChrisRackauckas
Copy link
Member

There is a banded matrix with a known bandsize, and a potentially sparse set of rows on top. The current algorithm gets the color vector by knowing the bandsize analytical solution and does only sparsity on the extra rows. Then it does batched forward mode for the banded matrix and choice for the top (forward, reverse, etc).

I think the difficulty here is that the representation of the coloring in DI makes it hard to do this by hand. The question is then how to get this mixed mode automatically or semi-automatically.

@gdalle
Copy link

gdalle commented Nov 21, 2024

So if I try to rephrase, you have

  • a function f1(x) whose Jacobian J1 is sparse
  • a function f2(x) whose Jacobian J2 is banded

and you're interested in the Jacobian of f(x) = vcat(f1(x), f2(x))?

Perhaps the easiest way would be to have backend1 and backend2 if you want to keep the flexibility? Although I think both should be instances of AutoSparse, possibly with different settings.
For instance, if you know the optimal coloring for f2 you can indeed use ConstantColoringAlgorithm from ADTypes.
When you say "batched forward mode for the banded matrix", do you mean it's a dense Jacobian computation?

Note that this would also be a good opportunity to try out AutoSparse(DI.MixedMode(...)) (but at the moment it works best with a random coloring order, because we're missing a way to handle neutral colors). Still rather experimental, and the docs are pretty barebones.

@adrhill
Copy link

adrhill commented Nov 21, 2024

Let me know if there are any bottlenecks in SCT's performance. We can most likely add additional methods for speed-ups.

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 21, 2024

and you're interested in the Jacobian of f(x) = vcat(f1(x), f2(x))?

Yes

Note that this would also be a good opportunity to try out AutoSparse(DI.MixedMode(...)) (but at the moment it works best with a random coloring order, because we're missing a way to gdalle/SparseMatrixColorings.jl#122). Still rather experimental, and the docs are pretty barebones.

Now we are using BVPJacobianAlgorithm with bc_diffmode and nonbc_diffmode to compute the Jacobian separately on the bc part(sparse) and collocation part(dense) in MRIK and FIRK algorithms, for these algorithms don't specify this we can just use diffmode for the Jacobian, I can try out DI.MixedMode for the almost banded situation

@gdalle
Copy link

gdalle commented Nov 21, 2024

I don't understand, is the banded part dense or sparse?

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 21, 2024

Sorry, the banded part is dense, the bc part is sparse

@gdalle
Copy link

gdalle commented Nov 22, 2024

What kind of storage do you use for this stacked Jacobian? Do you store both parts separately? Or do you put them both inside a Matrix? Or a SparseMatrixCSC?

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 22, 2024

Now we just store both parts in one matrix(it only matters in the multi-points case, since in two-points case the Jacobian is just a banded matrix which doesn’t need this kind of storage), for example the first several rows of our Jacobian are the sparse Jacobian of bc part(sparse because the evaluation of boundary conditions is only on a few points), and the other rows of the Jacobian have a banded structure. We store this special structure as almost banded matrix using FastAlmostBandedMatrices.jl.

@gdalle
Copy link

gdalle commented Nov 22, 2024

Okay, we'll see how fast decompression is, but there might be some tweaks necessary there

@gdalle
Copy link

gdalle commented Nov 25, 2024

How are we doing around here @ErikQQY? Need any more guidance?

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 25, 2024

@gallery Thanks a lot! Now I first need to ensure the correct syntax for the sparse Jacobian and the refactored solvers don’t break CI. Still have some subpackages to do, but should be ready very soon. After that, some reviews on the improvements and would be nice😄.

@gdalle
Copy link

gdalle commented Nov 25, 2024

I wonder why this PR ends up adding more lines than it removes? Did we fail with our grand design of DI making life easier for package developers?

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 26, 2024

I wonder why this PR ends up adding more lines than it removes? Did we fail with our grand design of DI making life easier for package developers?

This PR is kind of still working in progress, I haven't removed the sparse Jacobian stuff from SparseDiffTool.jl since the shooting methods exploit this in a special way(but will delete them once correctly configured!), and this is also mixed with some refactors so the git diff may not reflect the real status.

@gdalle
Copy link

gdalle commented Nov 26, 2024

Okay thanks! But do keep it in mind while you make the changes: if DI is not making your life easier, there is probably something wrong that we can fix 😊

@gdalle
Copy link

gdalle commented Nov 30, 2024

The failure for Ascher seems to be caused by the cache object not adapting to the element type of the input. During sparsity detection, SparseConnectivityTracer.jl passes a custom number type (not unlike ForwardDiff.jl), and there is apparently a forced conversion to Float64 in your code somewhere?

@ErikQQY
Copy link
Member Author

ErikQQY commented Nov 30, 2024

The failure for Ascher seems to be caused by the cache object not adapting to the element type of the input

Yeah, I am not quite satisfied with the current implementation of the collocation part of Ascher methods, and I am doing the overhaul these days. But I believe other methods like MIRK, FIRK, MIRKN, and Shooting methods are surely done now

@gdalle
Copy link

gdalle commented Nov 30, 2024

Any feedback on things we could do in DI to make your life easier?

@ErikQQY
Copy link
Member Author

ErikQQY commented Dec 1, 2024

Any feedback on things we could do in DI to make your life easier?

It is quite delightful to refactor from the SparseDiffTools.jl to DI, and the refactor does make the code simpler, but I guess DI is still a young package compared to others and I stumbled over several bugs(speaking of which, a new issue coming😅). But on the bright side, encountering problems and reporting them are also a way to help DI grow better🤠.

@gdalle
Copy link

gdalle commented Dec 1, 2024

Oh there are definitely a lot of bugs and edge cases to hunt down! The present application to BoundaryValueDiffEq is the most demanding one so far, which means your feedback is very valuable.

@ErikQQY
Copy link
Member Author

ErikQQY commented Dec 5, 2024

@gdalle @adrhill OK, the MIRK, FIRK, Shooting, and MIRKN methods are done now, the other methods need further optimization(need a redesign and don't involve DI, leave it out for now), so could you please give some advice on the current refactorization of changing SparseDiffTools.jl to DI? You can only focus on the changes in files: mirk.jl, firk.jl, mirkn.jl, single_shooting.jl, multiple_shooting.jl and their corresponding sparse_jacobians.jl where the matrix coloring structure is predetermined. Thanks in advance!!!

@adrhill
Copy link

adrhill commented Dec 5, 2024

I have some big picture feedback that might be more relevant to the design of ADTypes.jl:

It looks like you are internally overwriting the sparsity_detector in provided AutoSparse backends, e.g. in mirk.jl L315-347. Maybe ADTypes should provide a detector/trait that can be passed to packages like BoundaryValueDiffEq.jl such that they can provide their own patterns? Basically an empty ADTypes.KnownJacobianSparsityDetector.

On the other hand, I understand that your choice here is the right one and that users most likely always want to use known patterns when available, regardless of what they specify.

lib/BoundaryValueDiffEqFIRK/src/firk.jl Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqFIRK/src/firk.jl Outdated Show resolved Hide resolved
lib/BoundaryValueDiffEqFIRK/src/firk.jl Outdated Show resolved Hide resolved
Comment on lines +804 to +807
DI.jacobian!(
loss_bc, resid_bc, @view(J[1:L, :]), bc_diffcache, bc_diffmode, x, Constant(p))
DI.jacobian!(loss_collocation, resid_collocation, @view(J[(L + 1):end, :]),
nonbc_diffcache, nonbc_diffmode, x, Constant(p))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if you ever work with SparseMatrixCSC formats (I think it is not the case here), views of sparse matrices are very inefficient

AutoSparse(get_dense_ad(jac_alg.nonbc_diffmode);
sparsity_detector = ADTypes.KnownJacobianSparsityDetector(colored_result.A),
coloring_algorithm = ConstantColoringAlgorithm{ifelse(
ADTypes.mode(jac_alg.nonbc_diffmode) isa ADTypes.ReverseMode,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there are other modes than ForwardMode and ReverseMode: ADTypes.jl also defines SymbolicMode and ForwardOrReverseMode. I don't know whether that is relevant here.

# for underdetermined systems we don't have banded qr implemented. use sparse
J₁ < J₂ && return ColoredMatrix(sparse(J), matrix_colors(J'), matrix_colors(J))
return ColoredMatrix(J, matrix_colors(J'), matrix_colors(J))
J₁ < J₂ && return coloring(sparse(J), problem, algo)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function coloring converts the matrix to SparseMatrixCSC anyway, so the sparse here is superfluous unless you need to recover a SparseMatrixCSC when you do sparsity_pattern(coloring_result) (aka coloring_result.A)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we do need to recover a SparseMatrixCSC for the KnownJacobianSparsityDetector, so I think the sparse here is fine

jac_prototype = init_jacobian(jac_cache)
diffmode = if alg.jac_alg.diffmode isa AutoSparse
AutoSparse(get_dense_ad(alg.jac_alg.diffmode);
# Local because iszero require primal values
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the sparsity pattern will not necessarily generalize across input points. To increase robustness, you probably want to evaluate it at x = rand(n) rather than, say, x = zeros(n).

ensemblealg, prob, jac_cache, ::DiffCacheNeeded,
diffmode, alg, nshoots, u; kwargs...)
if diffmode isa AutoSparse
xduals = reshape(jac_cache.pushforward_prep.xdual_tmp[1:length(u)], size(u))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is private API of DifferentiationInterface, it can change at any time without notice, so you shouldn't touch it. The details of preparation objects are not public-facing.

Why do you need to manipulate dual numbers directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Shooting methods, the loss function is the evaluation of an ODE integrator in boundary conditions, for example, after initializing an ODE problem, we would have an ODE integrator, and we use Newton methods to find the point where the residual satisfies our tolerance, in this process, we have to re-initialize the integrator(re-initializing points are iteration points) until Newton methods achieve convergence. OK, this procedure is basically the shooting methods, but we want to use AD to utilize the Jacobian, so the answer to the Jacobian of this kind of loss function is to build another integrator whose u is Dual number, and in the process of re-initialization, we need Dual number iteration points. Here is an example illustrating how shooting methods work:

using OrdinaryDiffEq, SciMLBase, ForwardDiff, DifferentiationInterface, ADTypes
const DI = DifferentiationInterface

backend = AutoForwardDiff()
function lorenz!(du, u, p, t)
    du[1] = 10.0(u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end
u1 = [1.0; 0.0; 0.0]
tspan = (0.0, 100.0)
prob = ODEProblem(lorenz!, u1, tspan)
ode_cache = SciMLBase.__init(prob, Tsit5())
function loss(du, u, p)
    SciMLBase.reinit!(ode_cache, u)
    sol = solve!(ode_cache)
    du .= sol[end]
end
u1 = [2.0; 2.0; 3.0]
jac_cache = DI.prepare_jacobian(nothing, similar(u1), backend, u1)#ForwardDiff.JacobianConfig(nothing, u1, Chunk(2))
xduals = jac_cache.config.duals[2]
fill!(xduals, 0)
ode_cache_jac = SciMLBase.__init(remake(prob, u0 = xduals, tspan = eltype(xduals).(prob.tspan)), Tsit5())
function loss_jac(du, u)
    SciMLBase.reinit!(ode_cache_jac, u)
    sol = solve!(ode_cache_jac)
    du .= sol[end]
end

jac(J, y, p) = DI.jacobian!(loss_jac, y, J, jac_cache, backend, y)
A = zeros(Float64, 3, 3)
b = [1.0, 2.0, 3.0]
jac(A, b, nothing)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand the issue. What would be the right abstraction to add to DI here? Something like DI.overloaded_inputs for operator-overloading backends?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't fully understand that adding operator-overloading in DI has something to do here, could you please elaborate on this? The Dual numbers are required for the initialization of the loss function so that we can build Jacobian no matter whether we are using AutoForwardDiff or AutoSparse(AutoForwardDiff), that's why we use jac_cache.config.duals and jac_cache.pushforward_prep.xdual_tmp to obtain the Dual type and enable AD. So I wonder if it's possible to have an easy way to get the Dual number in Jacobian preparation?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, and I'm looking for a way to put this in DI's API so that it becomes more than a ForwardDiff-specific functionality. For most AD backends based on operator overloading, the functions are called on a different input type (like Dual), so I thought returning that input type could make sense in general

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, returning the input type could be so much more convenient to build hacky AD-compatible functions(at least in our ForwardDiff use case :P), that should be easy to implement for operator overloading AD backends, I can add that to DI, and I am thinking we can just provide a simple function like DI.overloaded_inputs you previously proposed to do this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened JuliaDiff/DifferentiationInterface.jl#668 to keep track

if diffmode isa AutoSparse
xduals = reshape(jac_cache.pushforward_prep.xdual_tmp[1:length(u)], size(u))
elseif diffmode isa AutoForwardDiff
xduals = reshape(jac_cache.config.duals[2][1:length(u)], size(u))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here except that it is private API of ForwardDiff (less likely to break though)

@gdalle
Copy link

gdalle commented Dec 5, 2024

Maybe ADTypes should provide a detector/trait that can be passed to packages like BoundaryValueDiffEq.jl such that they can provide their own patterns? Basically an empty ADTypes.KnownJacobianSparsityDetector.

What do you mean by that? Packages are already free to subtype AbstractSparsityDetector if they want

@adrhill
Copy link

adrhill commented Dec 5, 2024

Maybe ADTypes should provide a detector/trait that can be passed to packages like BoundaryValueDiffEq.jl such that they can provide their own patterns? Basically an empty ADTypes.KnownJacobianSparsityDetector.

What do you mean by that?

Maybe users should be provided the choice between:

  • UserKnownJacobianSparsityDetector1: The user defines the pattern. This essentially is the current KnownJacobianSparsityDetector, a struct that holds the known pattern.
  • PackageKnownJacobianSparsityDetector1: The user wants to let the package define the pattern. This would be an empty struct to dispatch on. This is what this PR does for any AutoSparse backend.

Packages are already free to subtype AbstractSparsityDetector if they want

Of course, every package could their own version of PackageKnownJacobianSparsityDetector, just like they could all define their own AutoEnzyme variant, but providing a common interface for (sparse) AD it is the entire purpose of ADTypes.jl.

Footnotes

  1. Disregard the awful naming of the two types, I just wanted to make clear that the current KnownJacobianSparsityDetector is ambiguous as to whether (a) the user or (b) the package knows the pattern. 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Start using DifferentiationInterface
4 participants