Replies: 4 comments 13 replies
-
Hi @jalvesz You might like to take a look at the code here, which is a more 'practical' example (benchmarking inference on multiple columns of an atmospheric model). We have a setup routine, run once at the start of the program, that imports the model and creates the tensors, and a destruction routine, run at the end, that cleans up. This reduces the overhead incurred at each iteration when running the model. The net in the example I linked operates on a single 'atmospheric column' of the numerical model, but we pass in multiple columns as a batch input for inference, though you may already be doing this. Let me know if that is useful, and if you have further questions.
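Schematically, the pattern is the following (a minimal sketch rather than the actual benchmark code; the names, array shapes, and batch layout are illustrative and depend on your net):

```fortran
program setup_once_inference
   use, intrinsic :: iso_fortran_env, only: sp => real32
   use ftorch
   implicit none

   integer, parameter :: n_features = 5, n_batch = 1000  ! illustrative sizes
   type(torch_module) :: model
   type(torch_tensor) :: in_tensors(1), out_tensor
   ! Persistent arrays holding all 'columns' as one batch
   real(sp), target :: inputs(n_features, n_batch)
   real(sp), target :: outputs(n_features, n_batch)
   integer :: layout(2) = [1, 2]
   integer :: step

   ! Setup once: load the model and wrap the persistent arrays as tensors
   model = torch_module_load('saved_simplenet_model_cpu.pt')
   in_tensors(1) = torch_tensor_from_array(inputs, layout, torch_kCPU)
   out_tensor = torch_tensor_from_array(outputs, layout, torch_kCPU)

   do step = 1, 100
      ! ... fill inputs in place ...
      ! One forward call infers on all columns of the batch at once
      call torch_module_forward(model, in_tensors, 1, out_tensor)
      ! ... consume outputs ...
   end do

   ! Destruction once at the end
   call torch_tensor_delete(in_tensors(1))
   call torch_tensor_delete(out_tensor)
   call torch_module_delete(model)
end program setup_once_inference
```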
-
Hi @jatkinson1000 thanks for the advice! So, the main difference I see is that in the benchmark you used "torch_tensor_from_blob" instead of "torch_tensor_from_array", right? But you also create and delete the tensors within the loop. I tried to simplify my test as follows, using your saved_simplenet_model_cpu.pt test model:

```fortran
program main
   use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64
   use, intrinsic :: iso_c_binding, only: c_int, c_int64_t, c_loc
   use ftorch
   implicit none

   type(torch_module) :: model ! Object to hold the Torch model

   integer, parameter :: wp = sp

   ! Set up Fortran data structures
   real(wp), target :: in_data(5)
   real(wp), target :: out_data(5)
   integer, parameter :: n_inputs = 1
   integer :: tensor_layout(1) = [1]

   ! Set up Torch data structures
   type(torch_tensor) :: in_tensor(1)
   type(torch_tensor) :: out_tensor

   real(dp), allocatable :: array_dp(:,:)
   integer :: i, j
   real(dp) :: time_start, time_finish
   !===========================================================================
   ! Initialize the model from a TorchScript (.pt) file
   model = torch_module_load('saved_simplenet_model_cpu.pt')

   ! Create Torch input/output tensors from the above arrays
   !in_tensor(1) = torch_tensor_from_array(in_data, tensor_layout, torch_kCPU)
   !out_tensor = torch_tensor_from_array(out_data, tensor_layout, torch_kCPU)

   !> First dummy loop to mimic a heavy iterative inference loop,
   !> wrapping the tensors with torch_tensor_from_array at each call
   allocate( array_dp(5,10000), source = 0._dp )
   call CPU_TIME(time_start)
   do j = 1, 100
      array_dp(1:5,1) = [0.0, 1.0, 2.0, 3.0, 4.0]
      do i = 2, 10000
         in_data = real( array_dp(1:5,i-1) )
         call torchscript_eval( in_data, out_data )
         array_dp(1:5,i) = dble( out_data ) - array_dp(1:5,i-1)
      end do
   end do
   call CPU_TIME(time_finish)
   print *, 'time 1: ', time_finish - time_start
   deallocate( array_dp )

   !> Second dummy loop to mimic a heavy iterative inference loop,
   !> wrapping the tensors with torch_tensor_from_blob at each call
   allocate( array_dp(5,10000), source = 0._dp )
   call CPU_TIME(time_start)
   do j = 1, 100
      array_dp(1:5,1) = [0.0, 1.0, 2.0, 3.0, 4.0]
      do i = 2, 10000
         in_data = real( array_dp(1:5,i-1) )
         call torchscript_eval_blob( in_data, out_data )
         array_dp(1:5,i) = dble( out_data ) - array_dp(1:5,i-1)
      end do
   end do
   call CPU_TIME(time_finish)
   print *, 'time 2: ', time_finish - time_start
   deallocate( array_dp )

   ! Clean up
   call torch_module_delete(model)

contains

   subroutine torchscript_eval(tensor_in, tensor_out)
      ! -- External Variables
      real(wp), target :: tensor_in(:)
      real(wp), target :: tensor_out(:)
      !-------------------------------------------------
      !> Wrap tensors
      in_tensor(1) = torch_tensor_from_array(tensor_in, tensor_layout, torch_kCPU)
      out_tensor = torch_tensor_from_array(tensor_out, tensor_layout, torch_kCPU)
      !> Run model and infer
      call torch_module_forward(model, in_tensor, n_inputs, out_tensor)
      call torch_tensor_delete(in_tensor(1))
      call torch_tensor_delete(out_tensor)
   end subroutine

   subroutine torchscript_eval_blob(tensor_in, tensor_out)
      ! -- External Variables
      real(wp), target :: tensor_in(:)
      real(wp), target :: tensor_out(:)
      ! -- Internal Variables
      integer(c_int) :: ndims, layout(1)
      integer(c_int64_t) :: tensor_shape(1)
      !-------------------------------------------------
      !> Wrap tensors
      ndims = 1
      tensor_shape = [5]
      layout = [1]
      in_tensor(1) = torch_tensor_from_blob(c_loc(tensor_in), ndims, tensor_shape, layout, torch_kFloat32, torch_kCPU)
      out_tensor = torch_tensor_from_blob(c_loc(tensor_out), ndims, tensor_shape, layout, torch_kFloat32, torch_kCPU)
      !> Run model and infer
      call torch_module_forward(model, in_tensor, n_inputs, out_tensor)
      call torch_tensor_delete(in_tensor(1))
      call torch_tensor_delete(out_tensor)
   end subroutine
end program main
```

Measuring the times of the two loops (one using torch_tensor_from_array and the other torch_tensor_from_blob) gives the same result: around 25.5 seconds each. I also tried to analyze it with VTune; I can see large times and memory being wasted somewhere, but it has trouble pinpointing where, as I'm linking a "RelWithDebugInfo" build against the release binaries of libtorch. My intuition tells me that, ideally, a call to torch_module_forward() should be able to infer on the data without creating new tensors internally. This does not seem to be the case, as I see the memory ramping up with:

```fortran
! Create Torch input/output tensors from the above arrays
in_tensor(1) = torch_tensor_from_array(in_data, tensor_layout, torch_kCPU)
out_tensor = torch_tensor_from_array(out_data, tensor_layout, torch_kCPU)
```

I'm trying to look more closely at what is happening inside, but for the moment it is not clear to me.
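For reference, the pattern I would ideally like to use is something like the following (a sketch only, slotting into the first benchmark loop of the program above; whether reusing tensors across forward calls like this is supported may depend on the FTorch version):

```fortran
! Wrap the tensors once, outside the loop, around the persistent arrays
in_tensor(1) = torch_tensor_from_array(in_data, tensor_layout, torch_kCPU)
out_tensor = torch_tensor_from_array(out_data, tensor_layout, torch_kCPU)
do i = 2, 10000
   in_data = real( array_dp(1:5,i-1) )   ! update the wrapped array in place
   call torch_module_forward(model, in_tensor, n_inputs, out_tensor)
   array_dp(1:5,i) = dble( out_data ) - array_dp(1:5,i-1)
end do
! Delete the tensors once, after the loop
call torch_tensor_delete(in_tensor(1))
call torch_tensor_delete(out_tensor)
```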
-
@jalvesz do you have an MWE, or is your code open source? If so, I could take a look to see what's going on with the memory.
-
Hi @jalvesz, we haven't heard anything back and are unable to reproduce this, so I'm closing it as outdated.
-
Hi,
I would like to ask for guidance on how I could improve an interface for calling a model.
The Fortran code is an FEM solver, and I'm calling the model to update fields at the Gauss points. The issue I'm facing is that, in the current implementation, I call the inference for each point individually instead of assembling the inputs for all points and running a single inference on a batched tensor.
My workflow is more or less the following:
I load the model, and I have built an interface for inference along these lines: a derived type to contain the model and tensors.
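Schematically (a simplified sketch, not my exact code; names are illustrative):

```fortran
module ml_interface
   use, intrinsic :: iso_fortran_env, only: sp => real32
   use ftorch
   implicit none

   type :: ml_model
      type(torch_module) :: net            ! the loaded TorchScript model
      type(torch_tensor) :: in_tensors(1)  ! input tensor handle(s)
      type(torch_tensor) :: out_tensor     ! output tensor handle
   end type

contains

   subroutine eval(self, x, y)
      type(ml_model), intent(inout) :: self
      real(sp), target :: x(:)   ! inputs at one Gauss point
      real(sp), target :: y(:)   ! outputs at one Gauss point
      integer :: layout(1) = [1]
      ! Wrap, infer, and destroy at every call
      self%in_tensors(1) = torch_tensor_from_array(x, layout, torch_kCPU)
      self%out_tensor = torch_tensor_from_array(y, layout, torch_kCPU)
      call torch_module_forward(self%net, self%in_tensors, 1, self%out_tensor)
      call torch_tensor_delete(self%in_tensors(1))
      call torch_tensor_delete(self%out_tensor)
   end subroutine

end module
```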
This function is then called within a do loop over all the Gauss points.
I did notice that wrapping and destroying the tensors with torch_tensor_from_array/torch_tensor_delete at every call looks quite bad inside such a do loop... But when I pre-wrapped the tensors once and only called the delete at the end, I saw the memory growing wildly, so I put the per-call wrapping back in. The issue is that this runs very slowly.
This is my initial design and I still have to iterate on it, but I wanted your feedback, if possible, on how to use the interface to get the best performance.
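What I'd like to move towards is assembling the inputs for all the Gauss points and doing a single batched forward call, roughly like this (a sketch with illustrative names; n_features, n_outputs, and n_gauss are hypothetical problem sizes, and this assumes the net accepts a batch dimension):

```fortran
! Assumed declared elsewhere: model, in_tensor(1), out_tensor, as above
real(sp), target :: x_all(n_features, n_gauss)  ! inputs for all Gauss points
real(sp), target :: y_all(n_outputs, n_gauss)   ! outputs for all Gauss points
integer :: layout2(2) = [1, 2]

! 1) gather the inputs of every Gauss point into x_all
! 2) wrap once and run a single batched inference
in_tensor(1) = torch_tensor_from_array(x_all, layout2, torch_kCPU)
out_tensor = torch_tensor_from_array(y_all, layout2, torch_kCPU)
call torch_module_forward(model, in_tensor, 1, out_tensor)
call torch_tensor_delete(in_tensor(1))
call torch_tensor_delete(out_tensor)
! 3) scatter y_all back to the Gauss-point fields
```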
Thanks,