Invalid NSID #26
Comments
I'm not sure to be honest. My initial suspicion is that there may be some issue with peer-to-peer. Does this only happen with the CUDA benchmark? Does the |
So when I run nvm-latency-bench like: So I guess it only happens with the cuda benchmark. |
I just tested on a completely different system with a different SSD and it still doesn't work. |
It seems that you ran the As for the calculations of chunks, pages and threads: Yes, it definitely is a bit iffy. I haven't tested it properly, and I suspect that there may be some bugs/issues with offsets there. I've usually tested with powers of two for the number of threads; setting the number of pages and chunks to 1 should be okay. Is there anything showing up in the system log ( |
So when I run: I get the following output:
If I remove the verify and the infile options, I get the following output:
No messages in dmesg. |
Have you been able to replicate the issue? |
Are you using these for the creation of the queue memory? For the PRP list memory? By the way, I am using the code from the master branch. I have also tried the other branch, and it gives the same result. |
Hi, I am at my cabin right now so I don't have any access to hardware at the moment. I will be back at the office at the beginning of next week.
It's good that using a GPU with the verify option works; that rules out any issues with PCIe peer-to-peer. Just to explain what's going on here:
So I'm fairly sure that at least that works.
As for the weird percentiles output, that's a known bug. It's caused by an arithmetic overflow because the number of repetitions is too low. I'll fix this at some point, but it's just an annoyance so it's not a high priority.
Yes. There might be something buggy going on there. But with chunks and pages set to 1, the calculation should be fairly straightforward, which leaves little room for offset errors. Have you tried setting the number of threads to 1? Out of interest, what GPU and disk are you using? Maybe I can try to reproduce when I get back. |
Is this issue by any chance related to your experience from #25 ? |
I have tried 1 thread 1 chunk and 1 page and that's fine. I guess it's the same issue as #25 |
I see. I'll try to reproduce the issue later this week then. Thank you for reporting it. At this point, I believe some offset calculation is very likely the culprit. |
I believe I have confirmed that there is some issue with illegal memory accesses for some parameters. I will try to look into it as soon as I am able to. |
If/when you know where this is happening (or with what structure) could you please let me know? Thanks. |
Hi, Any updates on this? |
I'm currently unable to reproduce this, I've tried with the different combinations applied in #25 but it seems to work. However, I see that in my SISCI branch, I've made a restriction to only allow thread count as power of two. The bug I thought I was able to reproduce was something unrelated. |
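The power-of-two restriction mentioned above can be illustrated with a short check. This is a minimal sketch in Python, not code from either branch; the function names `is_power_of_two` and `check_thread_count` are hypothetical and only show the kind of argument validation being described.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two (1, 2, 4, 8, ...)."""
    # A power of two has exactly one bit set, so clearing the lowest
    # set bit with n & (n - 1) must leave zero.
    return n > 0 and (n & (n - 1)) == 0


def check_thread_count(threads: int) -> int:
    """Hypothetical argument check: reject thread counts that are not powers of two."""
    if not is_power_of_two(threads):
        raise ValueError(f"thread count must be a power of two, got {threads}")
    return threads
```

Rejecting other thread counts up front sidesteps any offset arithmetic that has only been exercised with powers of two, at the cost of hiding the underlying bug rather than fixing it.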
Ok thank you for looking into it. |
Other details that may be of use: |
I'll look into it some more. I haven't ruled out that there is some form of alignment/overlap issue that doesn't happen on my system but may happen on other systems. I'm not aware of any BIOS setting that might affect it. In the past I've tested with Samsung Evo 960 and 970, some non-Optane Intel disks whose model names I don't recall at the moment, Intel Optane 900P and Intel Optane 4800X. I've only used the two Optane disks when trying to reproduce this issue though, so I can try using one of the other disks. I'll see what I can do in order to try to reproduce it, but I'm pretty swamped with other stuff for the next couple of weeks. Just out of curiosity, have you tried both branches? You may have to run make clean and even cmake again after switching to the other branch. |
Yes I have tried both branches and I get the same result. |
One more question, what distro, kernel version, cuda version, and nvidia driver version do you use? |
I've tested CentOS and Fedora in the past and with different CUDA versions (and it has worked), but I've tried replicating this issue using Ubuntu 18.04.2 with CUDA 10.1. The driver version is 418.40.04, as reported by To clarify, it's just some combinations of arguments that hang, right? Or do all of them hang now? |
Yes it is just some combinations, many others work. But I am concerned that since some combinations don't work, there is something fundamentally wrong going on (like some overlap or miscalculated indices) |
I totally understand, and I will try to look into it when I have the time. For those combinations that appear to work it should be possible to verify output from disk by using the While this does not guarantee that ranges aren't overlapping, it should at least provide some sort of confirmation that the entire range is covered and that all chunks are read at least once. Maybe not very reassuring, but it at least confirms that data is read from the disk. |
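The overlap and coverage concern discussed here can also be checked offline, independently of the hardware. The sketch below is in Python and rests on a stated assumption: `thread_ranges` uses a made-up layout in which, for chunk `c`, thread `t` reads `pages` consecutive pages starting at page `(c * threads + t) * pages`. That layout is not taken from the benchmark; substitute the real offset formula to test it. `find_problems` is likewise a hypothetical helper name.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (first page, number of pages)


def thread_ranges(threads: int, chunks: int, pages: int) -> List[Range]:
    """ASSUMED layout (not the benchmark's): one contiguous range per thread per chunk."""
    return [((c * threads + t) * pages, pages)
            for c in range(chunks)
            for t in range(threads)]


def find_problems(ranges: List[Range], total_pages: int):
    """Count how often each page is touched and report anomalies.

    Returns (overlapping, missing, out_of_range) as lists of page numbers.
    """
    hits = [0] * total_pages
    out_of_range = []
    for start, length in ranges:
        for page in range(start, start + length):
            if 0 <= page < total_pages:
                hits[page] += 1
            else:
                out_of_range.append(page)
    overlapping = [p for p, h in enumerate(hits) if h > 1]
    missing = [p for p, h in enumerate(hits) if h == 0]
    return overlapping, missing, out_of_range
```

Running `find_problems(thread_ranges(threads, chunks, pages), threads * chunks * pages)` over the argument combinations that hang would show whether the offset formula itself produces overlapping, skipped, or out-of-range pages for those inputs.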
So that is what I have been doing: I write data to the disk and then use the CUDA benchmark with the output flag, and I get correct values, at least as far as I remember, for the combinations that work. The data read is correct. I will do some more extensive testing over the weekend. |
I tested the output for configs that work and the output seems to be correct. Is there a limit on how many DMA mappings can be created using GPUDirect? |
There's no limitation on the number of mappings, but most GPUs have a limitation on how much memory can be "pinned" and exposed for third-party devices. This limitation is usually around 128-256 MB. Depending on the GPU, there is also the memory alignment requirement (that pinned memory must be 64 K aligned). |
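The two constraints described above (a 64 KiB alignment requirement and a cap on how much memory can be pinned) reduce to simple arithmetic. The following Python sketch is illustrative only: the helper names are hypothetical, and the limit passed to `fits_in_pinned_window` is whatever your GPU actually reports, not a value taken from this thread.

```python
GPU_PAGE_SIZE = 64 * 1024  # 64 KiB alignment, as mentioned in the thread


def align_up(value: int, alignment: int = GPU_PAGE_SIZE) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def is_aligned(address: int, alignment: int = GPU_PAGE_SIZE) -> bool:
    """True if address is a multiple of alignment (a power of two)."""
    return (address & (alignment - 1)) == 0


def fits_in_pinned_window(size: int, limit_bytes: int) -> bool:
    """Hypothetical check: does a buffer, once padded to 64 KiB, fit under the limit?"""
    return align_up(size) <= limit_bytes
```

For example, `fits_in_pinned_window(200 * 1024 * 1024, 256 * 1024 * 1024)` is true, while a request one byte over 256 MiB is padded to the next 64 KiB boundary and no longer fits.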
The limitation should be what is reported by the nvidia-smi -q command, according to the following link, right?: |
That's correct. You can also see the BAR size using |
I am using a Volta GPU for my testing, if that matters. Also, when the SSD does a DMA into GPU memory, does it invalidate any cached lines (for the region being written to) in the GPU's caches? |
I am running the cuda benchmark from your codebase, with the following output for the controller and command line configuration:
The problem is the thread never finishes polling for the first chunk. So I exit out, reload the regular nvme driver and check the device's error log.
When I check the device's error log, I see the following entry for each time I try to run the benchmark:
The NVMe SSD has only 1 namespace (NSID: 1) and it's the one being used for all commands in the codebase. So what could be the issue? Any help in this matter will be appreciated.
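For context on where the NSID lives: in a 64-byte NVMe submission queue entry, the namespace ID is the second dword (bytes 4 to 7), directly after the dword holding the opcode and command identifier. The Python sketch below packs a read command and reads the NSID back, which is one way to sanity-check a dumped queue entry. The helper names `build_read_command` and `nsid_of` are hypothetical and are not part of the library used in this thread; whether a stale or misplaced queue entry is what the controller saw here is only a possibility, not something the thread confirms.

```python
import struct

NVM_CMD_READ = 0x02  # NVM command set: Read opcode

# Little-endian layout of a 64-byte submission queue entry:
# CDW0, NSID, 8 reserved bytes, MPTR, PRP1, PRP2, CDW10..CDW15
SQE_FORMAT = "<II8xQQQ6I"


def build_read_command(cid: int, nsid: int, prp1: int,
                       slba: int, nblocks: int, prp2: int = 0) -> bytes:
    """Pack a read command; nblocks is the actual block count (>= 1)."""
    cdw0 = NVM_CMD_READ | (cid << 16)   # opcode in bits 7:0, CID in bits 31:16
    cdw10 = slba & 0xFFFFFFFF           # starting LBA, low 32 bits
    cdw11 = (slba >> 32) & 0xFFFFFFFF   # starting LBA, high 32 bits
    cdw12 = (nblocks - 1) & 0xFFFF      # number of blocks is zero-based
    return struct.pack(SQE_FORMAT, cdw0, nsid, 0, prp1, prp2,
                       cdw10, cdw11, cdw12, 0, 0, 0)


def nsid_of(sqe: bytes) -> int:
    """Extract the namespace ID (dword 1) from a raw 64-byte entry."""
    return struct.unpack_from("<I", sqe, 4)[0]
```

If the bytes a thread actually wrote to its queue slot decode to an NSID other than 1, the command was built or placed incorrectly before it ever reached the controller.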