wafer posted on 2023-1-20 19:55:35

Error when using cuda-aware-mpi-example: bandwidth was wrong (Issue #41)

Thanks. I can't spot anything wrong with your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect there is an issue with the GPU affinity handling, i.e. ENV_LOCAL_RANK is defined the wrong way (but you seem to have that right) or CUDA_VISIBLE_DEVICES is set in an unusual way on the system you are using. Since this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but with a simpler code that I regularly use in tutorials)?
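
For reference, here is a minimal sketch of the usual per-rank GPU affinity pattern such codes rely on. The environment-variable name and the modulo mapping are assumptions for illustration, not the exact code from the sample:

/* Minimal sketch: pin each MPI rank to one GPU based on its local rank.
 * Assumes the launcher exports a local-rank variable such as
 * OMPI_COMM_WORLD_LOCAL_RANK (the sample maps this to ENV_LOCAL_RANK). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *local_rank_str = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = local_rank_str ? atoi(local_rank_str) : 0;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    /* If CUDA_VISIBLE_DEVICES already restricts the visible GPUs, this
     * modulo maps local ranks onto the filtered set; a mismatch here can
     * put several ranks on the same GPU and distort the measurements. */
    cudaSetDevice(local_rank % num_devices);

    printf("local rank %d -> device %d of %d\n",
           local_rank, local_rank % num_devices, num_devices);
    return 0;
}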
I also checked the math for the bandwidth: the formula used does not account for caches (see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291), which explains why you are seeing memory bandwidths that are too large.
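
To illustrate the cache effect, here is a hedged sketch (not the actual code at Host.c#L291) of a naive bandwidth formula that counts every logical stencil load and store as DRAM traffic. Because the neighbor loads of a Jacobi stencil mostly hit caches, the real DRAM traffic is closer to one read and one write per point, so a formula like this can report bandwidths above what the hardware actually delivers:

/* Illustrative only: naive bandwidth estimate for a 5-point Jacobi stencil
 * that ignores caches. Function name and counting are assumptions. */
#include <stdio.h>

double naive_bandwidth_gbs(long nx, long ny, long iterations,
                           double elapsed_seconds)
{
    /* Counts 5 reads (center + 4 neighbors) + 1 write per grid point per
     * iteration as memory traffic; with caches, actual DRAM traffic is
     * roughly 1 read + 1 write per point, so this overstates bandwidth. */
    const double bytes_per_point = (5.0 + 1.0) * sizeof(double);
    double total_bytes = (double)nx * ny * iterations * bytes_per_point;
    return total_bytes / elapsed_seconds / 1e9; /* GB/s */
}

int main(void)
{
    /* Example numbers, purely illustrative. */
    printf("naive estimate: %.1f GB/s\n",
           naive_bandwidth_gbs(4096, 4096, 1000, 10.0));
    return 0;
}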

