NVIDIA Enhances cuML Accessibility by Reducing CUDA Binary Size for PyPI Distribution



Timothy Morano
Dec 15, 2025 18:01

NVIDIA introduces pip-installable cuML wheels on PyPI, simplifying installation and broadening accessibility by reducing CUDA binary sizes.




NVIDIA has announced a significant improvement for users of its cuML GPU-accelerated machine learning library: by reducing the size of its CUDA binaries, the company can now distribute cuML directly on PyPI. This marks a pivotal step in making cuML more accessible, especially for users in corporate environments who rely on internal PyPI mirrors, according to NVIDIA’s blog.

Streamlined Installation Process

With the release of version 25.10, cuML wheels are pip-installable directly from PyPI, eliminating the need for complex installation steps or Conda environment management. Users can now install cuML with a single pip command, just like any other Python package.

Challenges in Binary Size Reduction

The primary hurdle NVIDIA faced was the size of its CUDA C++ libraries, which previously exceeded PyPI’s file size limits. To address this, NVIDIA reduced the binary size sufficiently for PyPI hosting, working with the Python Software Foundation (PSF), which operates the index. As a result, users can install cuML directly from PyPI, improving both accessibility and user experience.

Installation Guidance

For users installing cuML, NVIDIA has provided specific pip commands based on the CUDA version:

  • For CUDA 13: pip install cuml-cu13 (Wheel size: ~250 MB)
  • For CUDA 12: pip install cuml-cu12 (Wheel size: ~470 MB)

Binary Size Optimization Techniques

To cut the binary size by approximately 30%, NVIDIA employed several optimization techniques, including identifying and eliminating redundant compiled code in the CUDA C++ codebase. This reduced the CUDA 12 libcuml dynamic shared object from 690 MB to 490 MB. Smaller binaries mean faster downloads, less storage use, lower bandwidth costs, and quicker container builds for deployment.
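
As an illustration of where such redundancy comes from, consider a minimal, hypothetical sketch (the file and kernel names are invented, not cuML’s actual source): when a kernel template is defined in a header, every translation unit that includes the header and uses a given type compiles its own copy of the device code, once per target GPU architecture.

    // kernels.cuh: the pre-optimization pattern. The kernel template is
    // DEFINED in the header, so every including TU instantiates it anew.
    #pragma once

    template <typename T>
    __global__ void axpy_kernel(T a, const T* x, T* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] += a * x[i];
        }
    }

    // If a.cu and b.cu both include kernels.cuh and launch axpy_kernel<float>,
    // each TU embeds its own compiled copy of the kernel, and it does so for
    // every architecture requested via --generate-code, multiplying the size.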

Understanding CUDA Compilation

CUDA binaries are inherently large because they bundle many compiled kernels: roughly the cross product of template parameter combinations and supported GPU architectures. NVIDIA’s approach separates kernel function definitions from their declarations so that each kernel is compiled in exactly one translation unit (TU), reducing duplication and, with it, binary size.
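
The sketch below shows this pattern using the same hypothetical names as above: the header now carries only the declaration, while the definition and explicit instantiations live in a single .cu file, so each kernel variant is compiled exactly once.

    // axpy.cuh: declaration only; safe to include from any number of TUs.
    #pragma once

    template <typename T>
    __global__ void axpy_kernel(T a, const T* x, T* y, int n);

    // axpy.cu: the ONE translation unit that defines and instantiates it.
    #include "axpy.cuh"

    template <typename T>
    __global__ void axpy_kernel(T a, const T* x, T* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] += a * x[i];
        }
    }

    // Explicit instantiations: device code for these types is generated here
    // exactly once, instead of implicitly in every TU that uses the kernel.
    template __global__ void axpy_kernel<float>(float, const float*, float*, int);
    template __global__ void axpy_kernel<double>(double, const double*, double*, int);

Callers include axpy.cuh and launch axpy_kernel<float><<<grid, block>>>(...) as before; building with nvcc’s relocatable device code option (-rdc=true) lets the device linker resolve the single definition across translation units.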

Future Prospects

By sharing these improvements, NVIDIA aims to help other developers working with CUDA C++ libraries manage binary sizes effectively. The initiative not only benefits cuML users but also encourages broader adoption of CUDA C++ libraries by making them easier to distribute and install.

For further insights on CUDA programming and optimization techniques, developers can refer to NVIDIA’s CUDA Programming Guide.



Source: https://blockchain.news/news/nvidia-enhances-cuml-accessibility-reducing-cuda-binary-size-pypi-distribution