GTC 2024

Science

Earth-2: Updates on kilometer-scale visualization, simulation, digital twinning, and AI super-resolution

Karthik Kashinath (NVIDIA), Mike Pritchard (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1695928571617001yi1j 

Diffusion models for high-resolution emulation.

Earth-2 Mission #1 - Next-gen weather and climate prediction

Earth-2 Mission #2 - Interact with predictions at low latency, e.g. Q&A

3 miracles (Huang at the Berlin Summit for climate simulation, 2023):

Currently partnering with companies to run climate models on GPUs. See the ALPS talk.

Best resolution is around 25 km e.g. ERA5.

Mining the data assimilation states from ERA5, like training ML to make 1080p video

WeatherBench 2.0 shows AI models' skill over traditional NWP

Models capture idealized physics

FourCastNet does multi-month rollouts; e.g. AFNO -> SFNO (spherical harmonics)

Shows stability e.g. can run for 10 years and matches expected signal

Need to measure metrics carefully, e.g. they differ for each variable

Modulus offers different models

Earth2MIP - Open source (https://github.com/NVIDIA/earth2mip)

Hard to compare against full ECMWF members.

Create lagged ensembles with different initial conditions (ICs).
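A lagged ensemble can be sketched in a few lines (toy model and helper names are hypothetical): members are forecasts initialized at successive earlier times, all valid at the same target time, so spread comes for free without explicit IC perturbations.

```python
def lagged_ensemble(forecast, init_times, target_time):
    """forecast(init_time, lead) -> value; lead = target_time - init_time."""
    return [forecast(t0, target_time - t0) for t0 in init_times]

# Toy forecast: value drifts with both init time and lead time,
# so members initialized at different times disagree.
def toy_forecast(init_time, lead):
    return 10.0 + 0.1 * init_time + 0.5 * lead

members = lagged_ensemble(toy_forecast, init_times=[0, 6, 12, 18], target_time=24)
mean = sum(members) / len(members)
```

Older initializations carry longer lead times (and larger errors), which is why lagged ensembles are a cheap stand-in for a full perturbed-IC ensemble.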

Need higher res for impacts e.g. wildfires

Sub-km data is sparse, e.g. radar, HRRR, WRF, etc.

GenAI (diffusion models) for super resolution

CorrDiff, e.g. on the Taiwan regional model, 25 km -> 2 km. Does better than other models.

Fine-tune to run anywhere. Scales globally. Fast.

Partner with The Weather Company to generate km-scale training datasets, e.g. weather station networks, wind sensors along power lines

Apply to future climate

Services available see build.nvidia.com 

Sub-seasonal and Seasonal Forecasting with a Deep Learning Earth-System Model

Dale Durran

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694185449933001XH7K 

DLWP-HPX

CNN using 2D spherical shells; HEALPix (Hierarchical Equal Area isoLatitude Pixelization) mesh (common in atmospheric science, E-to-R mesh)

Novel ConvNet - inverted channel depth, recurrence in the latent space, dilation gives a large receptive field.

Loss function is the sum of MSE over 24 hrs
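A minimal sketch of a summed multi-step MSE loss of this kind (the model and step size here are toys, not DLWP-HPX details): the model is rolled out autoregressively and each step's MSE against the matching target is accumulated, so training optimizes multi-step rather than single-step accuracy.

```python
def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def rollout_loss(step_fn, state, targets):
    """Roll the model forward once per target and sum the per-step MSEs."""
    total = 0.0
    for target in targets:       # e.g. 4 steps of 6 h = 24 h
        state = step_fn(state)   # autoregressive step
        total += mse(state, target)
    return total

# Toy "model": each step adds 1 to every grid value.
step = lambda s: [v + 1.0 for v in s]
loss = rollout_loss(step, [0.0, 0.0], targets=[[1.0, 1.0], [2.0, 2.5]])
```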

https://arxiv.org/abs/2401.15305 - A Practical Probabilistic Benchmark for AI Weather Models

Avoids physical parameterization

No geo-specific NN weights (CNN is translation invariant)

Determines precipitation from other physical fields - works well

Run in coupled atmosphere-ocean

Captures ETCs and ENSO; amplitude is too low, but it is a coarse model

https://essopenarchive.org/doi/full/10.22541/essoar.169603505.58030377 - Advancing Parsimonious Deep Learning Weather Prediction using the HEALPix Mesh

Huge Ensembles of Weather Extremes using NVIDIA's Fourier Forecasting Neural Network (FourCastNet)

William Collins (Berkeley)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694024798083001ttCU 

Low-Likelihood High-Impact extremes (LLHIs) - heatwaves, atmospheric rivers, TCs, ETCs, flooding, snow events, wildfires

Data-Driven Weather Prediction (DDWP)

Dueben & Bauer (2018) 6 deg, MLP

Rasp et al (2020) 5 deg, CNN

Weyn et al (2019) 2.5 deg, ConvLSTM

Weyn et al (2020) 2 deg, CNN

Keisler et al (2022) 1 deg, GNN - https://github.com/openclimatefix/graph_weather https://arxiv.org/abs/2202.07575 

Pathak et al (2022) 0.25 deg, VIT+AFNO (FourCastNet) - https://arxiv.org/abs/2202.11214 

FourCastNet - Full Atmosphere AI Surrogate, Fourier Neural Operator, trained on 100 GPU hours, inference in 3 seconds for a 2-week forecast. 10k-100k speed up

Fourier transform for global convolution. Learns the solution operator; mesh- and resolution-invariant. Trained on ERA5 1975-2015, validated on 2016-2017, and held out 2018 onwards
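The "Fourier transform for global convolution" idea rests on the convolution theorem: pointwise multiplication of DFTs equals circular convolution in real space, giving every output point a global receptive field in one operation. A pure-Python sketch (naive O(n²) DFT, just to verify the identity):

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def circular_conv_via_fourier(x, w):
    # Multiply spectra pointwise, then transform back.
    X, W = dft(x), dft(w)
    return [c.real for c in idft([a * b for a, b in zip(X, W)])]

def circular_conv_direct(x, w):
    n = len(x)
    return [sum(x[(k - m) % n] * w[m] for m in range(n)) for k in range(n)]

x, w = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 1.0]
a = circular_conv_via_fourier(x, w)
b = circular_conv_direct(x, w)
```

FourCastNet-style operators learn the Fourier-space multipliers instead of fixing a kernel, but the mechanism for global mixing is this same identity.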

Medium-range forecast skill is comparable to IFS

Modulus-Makani (massively parallel training of ML based weather and climate models)

Earth2-MIP (Experimentation with AI models for weather and climate)

ECMWF AI Lab

Able to predict extremes such as TCs

Predicts moisture variables

Speed of inference enables massive ensembles (>10k) - capture long-tail events

How? Perturb IC, Perturb the model, Inject noise during model integration

The spread of an ensemble prediction should grow at the same rate as its error.
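A toy illustration of that spread-error relation (entirely synthetic, not the FourCastNet setup): for a well-calibrated ensemble, the ensemble standard deviation (spread) should track the RMSE of the members about truth.

```python
import math
import random

def run_member(truth, sigma, steps, rng):
    # Perturbed "model": truth plus noise whose std grows like sqrt(t),
    # mimicking error growth during model integration.
    return [truth[t] + rng.gauss(0.0, sigma * math.sqrt(t + 1))
            for t in range(steps)]

rng = random.Random(0)
steps, n_members = 20, 500
truth = [0.0] * steps
members = [run_member(truth, 1.0, steps, rng) for _ in range(n_members)]

# Spread and error at the final step.
final = [m[-1] for m in members]
mean = sum(final) / n_members
spread = math.sqrt(sum((v - mean) ** 2 for v in final) / n_members)
rmse = math.sqrt(sum((v - truth[-1]) ** 2 for v in final) / n_members)
```

With calibrated noise, spread and RMSE agree closely; an under-dispersive ensemble (spread << error) is overconfident, which is exactly what huge ensembles help diagnose in the tails.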

Scales efficiently up to ~400 GPUs and peak performance is 140.8 petaFLOPS (https://arxiv.org/abs/2208.05419 - FourCastNet: Accelerating Global High-Resolution Weather Forecasting using Adaptive Fourier Neural Operators)

Spherical harmonics - https://github.com/NVIDIA/torch-harmonics - WIP

Can DDWP fill gap between weather and climate?

https://arxiv.org/abs/2310.02074 - ACE: A fast, skillful learned global atmospheric model for climate prediction

Toward km-scale emulation - FourCastNet ~2.5B-parameter FCN (5 km resolution)

Next steps - data-driven climate prediction in the IPCC; these can supplement traditional forecast center models

How AI and Accelerated Computing are Revolutionizing Oceanographic Data Processing

Jann Wendt (North.io), Shilpa Kolhatkar (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693928059937001X6CY 

https://north.io/en/ 

Geophysical data, e.g. scans, sonar; work with point clouds. Sound velocity under water, munition detection

Harnessing GPUs for Accelerated Air Quality Simulations in the NASA Earth System Model

Peter Ivatt (University of Maryland)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694114300182001zI1v 

VIDEO NOT WORKING.

Old data:

https://developer.nvidia.com/blog/nasa-and-nvidia-collaborate-to-accelerate-scientific-data-science-use-cases-part-1/ Use XGBoost for chemical solver

https://github.com/christophkeller/gc-xgb 

A ROMS-Compatible, Ocean Numerical Model for the GPU

Jose Ondina (University of Florida), Oden Green (NVIDIA), Zoe Ryan (NVIDIA), Ron Fick (University of Florida), Maitane Olabarrieta (University of Florida)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1698700376998001S4vw 

Sailfish. Uses CuPy. 35-70x faster.

A New Generation of Global Climate Models Augmented by AI

Laure Zanna (New York University)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1695654260050001pd2L 

NOT RECORDED. https://m2lines.github.io https://m2lines.github.io/code/ 

Global Strategies: Startups, Venture Capital, and Climate Change Solutions

Karthik Kashinath (NVIDIA), Thomas Debass (U.S. Department of State), Shimon Elkabetz (tomorrow.io), Joanna Lichter (Emerson Collective)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696445353484001C0A2 

Emerson Collective - material investments, energy efficient systems

Department of State - Climate change, fund start-ups. 

Tomorrow.io - nowcasting, global simulated radar. Climate security is going to be like cyber security.

Bridging the Compute Divide to Mitigate Climate Risk

Kate Kallot (Amini), Geoffrey Levene (NVIDIA), Bob Pette (NVIDIA), Jim Nottingham (HP), Jonathan Reid (Barbados government)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1704846663493001yfgV 

Predicting CO2 Plume Migration in Carbon Storage Projects using Graph Neural Networks

Chung Shih (National Energy Technology Lab), Paul Holcomb (National Energy Technology Lab),

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694206291662001AN14 

Building a Lower-Carbon Future With HPC and AI in Energy

Marc Spieler (NVIDIA), Charlie Fazzino (ExxonMobil), Shashi Menon (Schlumberger), Otavio Cirbelli Borges (Petrobras), Vibhor Aggarwal (Shell)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694550684312001N8Iy 

Energy-Efficient GPU Computing With Mixed-Precision Modeling for Climate/Weather Applications

Sameh Abdulah (KAUST), Hatem Ltaief (KAUST)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693907011017001KgVA 


Early Science with Grace Hopper at Scale on Alps

Thomas Schulthess (ETH Zurich)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog?search=S62157&tab.allsessions=1700692987788001F1cG 

Genomic Analysis at Scale: Mapping Irregular Computations to Advanced Architectures

Kathy Yelick (Berkeley)

https://www.nvidia.com/gtc/session-catalog/?search=data%20science&regcode=no-ncid&ncid=no-ncid&tab.allsessions=1700692987788001F1cG&search=data+science#/session/1696000163801001Lprz 

Used https://graphblas.org/ 

ML

Enterprise MLOps 101

William Benton (NVIDIA), Michael Balint (NVIDIA)

https://www.nvidia.com/gtc/session-catalog/?search=data%20science&regcode=no-ncid&ncid=no-ncid&tab.day=20240318&search=data+science#/session/1694206314067001Ccvd 

NVIDIA AI Workbench ; TensorRT-LLM

XGBoost is All You Need

Bojan Tunguz (NVIDIA)

https://www.nvidia.com/gtc/session-catalog/?search=S62960&tab.allsessions=1700692987788001F1cG#/session/1701818485299001aBaz

NNs for tabular data are still unsolved. Tree-based methods work out of the box: they handle missing values and are not sensitive to outliers

Not good at predicting values outside the range of the training data.
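A tiny illustration of that extrapolation limit (a single-split regression stump, the simplest tree): leaves predict constant means, so any input beyond the training range just falls into the nearest leaf and cannot follow a trend the way a linear model would.

```python
def fit_stump(xs, ys, split):
    """Fit a one-split regression tree: each leaf predicts its mean."""
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    lmean, rmean = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lmean if x <= split else rmean

# Train on y = 2x over x in [0, 3].
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x for x in xs]
stump = fit_stump(xs, ys, split=1.5)

inside = stump(3.0)     # right leaf mean = (4 + 6) / 2 = 5.0
outside = stump(100.0)  # same leaf: still 5.0, nowhere near 200
```

Ensembles of such trees (XGBoost) sharpen the fit inside the data range but inherit the same flat behavior outside it.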

What approach on what dataset?

https://github.com/tunguz/TabularBenchmarks/tree/main/datasets/Porto_Seguro 

GPUTreeShap

Use XGBoost for unsupervision e.g. tSNE

LLMOps: The New Frontier of Machine Learning Operations

Nik Spirin (NVIDIA), Michael Balint (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696265081919001S25C 

Navigating the Large Language Models Frontier: Practical Strategies for Building Enterprise Applications Powered by LLMs

Harrison Chase (LangChain), Jerry Liu (LlamaIndex), Arvind Jain (Glean), Farshad Saberi Movahed (NVIDIA), Joey Conway (NVIDIA), Jane Polak Scowcroft (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1697845950296001Mi6T 

Insights from Kaggle Grandmasters and Experts on Competitive AI and LLM Frontiers

David Austin (NVIDIA), Jiwei Liu (NVIDIA), Kazuki Onodera (NVIDIA), Chris Deotte (NVIDIA), Laura Leal-Taixe (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1698426184394001zwjl 

Leveraging GPU-Efficient Vector Search for Large-Scale Ads Pipelines

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696024161282001Lbey 

Benjamin Karsin (NVIDIA) Arpan Jain (Microsoft)

Lightning Fast with Thunder, a New Extensible Deep Learning Compiler for PyTorch

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696294424486001JD3i 

Luca Antiga (Lightning AI), Thomas Viehmann (Lightning AI), Mike Ruberry (NVIDIA)

Large-Scale Production Deployment of RAG Pipelines

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1702668294942001QSt8 

Mariem Bendris (NVIDIA)

Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1705547418542001z5AR 

Zhiyu Cheng (NVIDIA), Asma Kuriparambil Thekkempate (NVIDIA)

Accelerated LLM Model Alignment and Deployment in NeMo, TensorRT-LLM, and Triton Inference Server

Bharat Giddwani (NVIDIA), Utkarsh Uppal (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694182045858001bG5z 

RAPIDS

RAPIDS in 2024: Accelerated Data Science Everywhere

Dante Gama Dessavre (NVIDIA), Nick Becker (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1697766189600001T2p3 

NVIDIA AI Enterprise (developer tools, Cloud Native Management and Orchestration, Infrastructure Optimization)

Accelerated computing swim lanes (higher is easier to use and lower is maximum performance):

cudf.pandas (see above)

https://colab.research.google.com/drive/12tCzP94zFG2BRduACucn5Q_OcX1TUKY3#scrollTo=KP0oc3PboQDv 

Can accelerate stuff like ibis.set_backend("pandas")

nx-cugraph (see above)

https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.betweenness_centrality.html#networkx.algorithms.centrality.betweenness_centrality has a note that says:

Additional backends implement this function

cugraph - GPU-accelerated backend.

weight parameter is not yet supported.
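The dispatch mechanism that note describes can be exercised directly (here with the default pure-Python backend so the example is self-contained; with nx-cugraph installed you would pass backend="cugraph" or set NETWORKX_BACKEND_PRIORITY):

```python
import networkx as nx

# Same call, pluggable backend: NetworkX routes the algorithm to an
# accelerated implementation when one is registered for it.
G = nx.karate_club_graph()
bc = nx.betweenness_centrality(G)  # default backend
# bc = nx.betweenness_centrality(G, backend="cugraph")  # GPU, if installed

top = max(bc, key=bc.get)  # most "between" node in the karate club graph
```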

Numba CUDA

Shared memory across the ecosystem. Prevents unnecessary memory contention across ecosystem.

DLPack and the CUDA Array Interface. Zero copy between ecosystems, e.g. cudf and torch:

import torch
import cudf

s = cudf.Series([0, 1, 1, 2, 3])
tensor = torch.from_dlpack(s.to_dlpack())
# tensor([0, 1, 1, 2, 3], device='cuda:0')
cudf.Series(tensor)

For dask you can do

dask.config.set({"dataframe.backend": "cudf"}), which lets you use dask.dataframe like dask_cudf

dask-expr to cudf to speed and reduce memory (WIP?)

Rapids spark is easy to use - Add jar to classpath and set spark.plugins config

from spark_rapids_ml.clustering import KMeans

PySpark MLlib API, cuML MNMG classes, RAFT primitives, and NCCL comms.

Offers speed up and cost savings

XGBoost on the rapidsai channel offers an RMM-enabled build for efficient memory pool sharing. Column-based split for federated learning (train next to the data and don't share the data) with GPU NVFlare. Multi-target trees with vector-leaf outputs. Device parameter for GPU config. The approx tree method is now GPU-accelerated. Improved learning-to-rank. Quantile regression, improved memory support. Improved PySpark: significant GPU speedup over multi-CPU nodes. UCX networking speedups.

GNN libraries - cuGraph-DGL (extends https://docs.dgl.ai/index.html). cuGraph-PyG (extends https://pyg.org/ ). Model and Storage Backends (cuGraph-ops/pylibcuGraph-ops - accelerated GNN fwd/bwd layer kernels) WholeGraph (distributed feature/kv store). CuGraph equivariant 

Vector Search - cuVS. Algorithms support - CAGRA. Built on top of RAPIDS RAFT. Distances (e.g. pairwise distances), Cluster (e.g. K-means)

NeMo Curator - data mining modules for training LLMs

Installation and packaging - Devcontainers for devs wanting to contribute. 

NVDashboard - like dask dashboard for GPU

NVIDIA Nsight Systems - Profile GPU code

NVIDIA Tools Extension (NVTX) - use @nvtx.annotate to get GPU profiling

Get started with NVIDIA AI workbench - Streamlined Setup, start locally and scale to a data center, 

NVIDIA LaunchPad - for enterprises

RAPIDS (ecosystem) vs legate (distributed run time).

Accelerating Pandas with Zero Code Change using RAPIDS cuDF

Ashwin Srinath (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog?search=S62168&tab.allsessions=1700692987788001F1cG 

Existing pandas code can run orders of magnitude faster on the GPU with zero effort

10-100 x faster than pandas.

Built using the libcudf C++/CUDA library

%load_ext cudf.pandas or python -m cudf.pandas script.py

Pandas is everywhere. ChatGPT defaults to returning pandas code when asked a Python data question.

Alternatives: Polars, DuckDB, Modin, spark, dask, rapids, xorbits

Supports 100% of the pandas API, falling back to CPU where needed

Helps when passing to third-party libraries, e.g. seaborn

Import a proxy module with proxy types and functions which are dispatched to cuDF or pandas

>100x faster join, 40x faster groupby on the H2O.ai benchmark

Use tools like LangChain and PandasAI to help generate pandas code. %load_ext cudf.pandas can accelerate this 4-10x with agent evaluation.

cpu env:

 uv pip install ipython langchain langchain-experimental langchain-openai pandas tabulate

gpu env: 

uv pip install --extra-index-url=https://pypi.nvidia.com cudf-cu12==24.2.* ipython langchain langchain-experimental langchain-openai tabulate


#%load_ext cudf.pandas


import pandas as pd

from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

from langchain_openai import OpenAI


df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")


agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)

expected = int(agent.invoke("how many rows are there?")["output"])

actual = len(df)


df1 = df.copy()

df1["Age"] = df1["Age"].fillna(df1["Age"].mean())

agent = create_pandas_dataframe_agent(OpenAI(temperature=0), [df, df1], verbose=True)

agent.invoke("how many rows in the age column are different?")

When to use cudf.pandas vs. cudf?

cudf.pandas if you have pandas code and want to run it on GPU now

cudf if you don't want the (expensive) CPU fallback; it has faster algorithms and extends the API compared to pandas.

Every PR on cudf runs against the pandas test suite (latest release, e.g. 2.2.1), passing 94%. Can also turn on cudf.set_option('mode.pandas_compatible', True), which e.g. makes ordering match pandas and slows things down a little.

Use %%cudf.pandas.profile (or from cudf.pandas import Profiler; with Profiler() as p: pass) - there is also %%cudf.pandas.line_profile - to see what's running on the GPU and what's running on the CPU, e.g. indexer_between_time is not supported in cudf.

GPU memory is limited; use it sparingly.

Where possible use built-in API over a custom UDF.

Accelerating NetworkX: The Future of Easy Graph Analytics

Mridul Seth (NetworkX), Rick Ratzel (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog?search=S61674&tab.allsessions=1700692987788001F1cG 

All data can be put into a graph if you try hard enough :)

NetworkX is everywhere. It's THE graph library, and ChatGPT will suggest code for it.

NetworkX is mostly a dictionary of dictionaries. Struggles at million-node scale.

User-facing API + pluggable backends (dispatching and conversions):

nx-cugraph code runs against NetworkX tests on every PR and passes.

Customizable pluggable back-ends. Reminds me of xarray's open_dataset.

nx-cugraph has 60 graph algorithms, 42 accelerated graph generators

U.S. patent dataset has 3.7M nodes and 16.5M edges.

Next steps:

Large-Scale Graph GNN Training Accelerated With cuGraph

Joe Eaton (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1691687925299001TO0h 

Used at:

Dense embedding vector representation of sparse data

100TB graphs

Property Graph Model

Graph-as-a-service (cuGraph)

Multi-threads multiple-GPUs

DGL (https://www.dgl.ai/) and PyG (https://www.pyg.org/)

See structure of a GNN at https://www.cse.ust.hk/~yqsong/papers/2018-WWW-Text-GraphCNN.pdf 

WholeGraph helps with SubGraph sampling (cuGraph-ops, distributed Pytorch) - host and device storage. Efficient memory

What's next? - easier to deploy, customer features, move code to C++ layer, 

Integrated containers https://catalog.ngc.nvidia.com/orgs/nvidia/containers/dgl https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg 

More cuGraph Features

Getting Started with Large-Scale GNNs using cuGraph Packages for DGL and PyG

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693241962663001BjYm 

Alexandria Barghi (NVIDIA), Vibhu Jawa (NVIDIA)

WORKSHOP.

Accelerate and Scale your Graph Analytics with RAPIDS CuGraph and HPC SDKs

Oded Green (NVIDIA), Chuck Hastings (NVIDIA), Seunghwa Kang (NVIDIA), Rick Ratzel (NVIDIA), Brad Rees (NVIDIA), Erik Welch (NVIDIA)

IN PERSON.

Reducing the Cost of your Data Science Workloads on the Cloud

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog?search=S62211&tab.allsessions=1700692987788001F1cG 

Jacob Tomlinson (NVIDIA)

cudf.pandas. Pandas is single-threaded. Not a query engine.

Alternatives: Faster underlying implementation (C++, Rust, CUDA), query engines, SQL-inspired, distributed computing, hardware accelerated (GPUs).

cudf.pandas covers 100% of the pandas API. Falls back to CPU if the GPU path doesn't work.

Deploy methods: shared node, e.g. Triton Inference Server and the Forest Inference Library; single node, e.g. a GPU VM; multi-node, e.g. Dask and Spark.

Deploy on a managed notebook platform, e.g. SageMaker

Deploy on compute pipelines e.g. Amazon EMR, Cloud Dataproc, Azure Databricks

Virtual machines: EC2

Why reduce cost? $, competitive advantage, reduced context switching, faster development, environmental impact, improved accuracy

20x cheaper on GPU than CPU. https://blogs.nvidia.com/blog/spark-rapids-energy-efficiency/

Autoscaling can save costs https://docs.rapids.ai/deployment/stable/examples/rapids-autoscaling-multi-tenant-kubernetes/notebook/ 

How to deploy GPU Data Science Workloads on the Cloud

Taurean Dyer (NVIDIA), Sheilah Kirui (NVIDIA), Mike McCarty (NVIDIA), Jacob Tomlinson (NVIDIA)

IN PERSON.

More Data, Faster: GPU Memory Management Best Practices in Python and C++

Mark Harris (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696309108571001PiKQ 

88% of cuDF time is spent in CUDA memory management (cudaMalloc/cudaFree)

RMM - a common interface for customizing device/host memory allocation, a collection of implementations of the interface, and data containers that use the interface for memory allocation

NVIDIA Morpheus Digital Fingerprinting Inference

Perform High-Efficiency Search, Improve Data Freshness, and Increase Recall With GPU-Accelerated Vector Search and RAG Workflows

Charles Xie (Zilliz), Corey Nolet (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696524106641001YwNd 

What is a Vector Database? Stores embedding and enables semantic search across different types of unstructured data.

Milvus is an example

Retrieval-Augmented Generation (RAG) - a technique that combines the strengths of retrieval-based and generative models (improved accuracy and relevance, provides private/domain-specific knowledge, reduces hallucination). Has challenges of data freshness and throughput.

Indexes with high throughput often require more time to construct.

GPU powered Milvus. Many X speed up.

cuVS - Vector Search (Brute force, CAGRA); Distance, Cluster built on top of RAFT (High Performance Machine Learning Primitives)
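The brute-force path cuVS offers can be sketched in pure Python (cuVS does this on GPU at scale): exact search scores the query against every stored vector, which is the baseline that graph indexes like CAGRA approximate to trade a little recall for much higher throughput.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(query, vectors, k=2):
    """Exact top-k by cosine similarity: score every vector, sort, slice."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return order[:k]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hits = brute_force_search([1.0, 0.05], corpus, k=2)  # nearest two vectors
```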

Unlock the Full Potential of your AI Workflows With Dataiku and NVIDIA RAPIDS

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1703709015383001iPNu 

Not much to say. You can run RAPIDS on Dataiku (a workbench tool).

Accelerating Data Analytics on GPUs with the RAPIDS Accelerator for Apache Spark

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1691665908379001PqAU 

Matt Ahrens (NVIDIA)

IN PERSON.

RAPIDS Accelerator for Apache Spark Propels Data Center Efficiency and Cost Savings

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1694629646223001YWmJ 

Eyal Hirsch (Taboola)

Disk IO bottlenecks

Create issues along the way 

Accelerate ETL and Machine Learning in Apache Spark

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1695842163670001l0PO 

Sameer Raheja (NVIDIA), Erik Ordentlich (NVIDIA)

221 ZB of data by 2026

RAPIDS for Spark requires zero code changes: add the jar to the classpath and set the spark.plugins config (com.nvidia.spark.SQLPlugin)
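The zero-code-change setup amounts to a launch-time config; a sketch (the jar name/version here is a placeholder - use the release you downloaded):

```shell
# Put the RAPIDS Accelerator jar on the classpath and enable the plugin;
# existing Spark SQL/DataFrame code runs unchanged.
spark-shell \
  --jars rapids-4-spark_2.12-24.02.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true
```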

https://github.com/nvidia/spark-rapids-benchmarks 

https://github.com/NVIDIA/spark-rapids-tools 

Improved reliability with spill framework, improved IO, JSON handling, scaling to 100s of TB

Roadmap: Support Apache Iceberg, IO from cloud storage

https://github.com/NVIDIA/spark-rapids-ml 

Package import change. Uses cuML MNMG classes / RAFT NCCL communication

Can use on databricks https://github.com/NVIDIA/spark-rapids-ml/tree/main/python/benchmark 

Training/fit time is 6 - 100x faster

https://github.com/NVIDIA/spark-rapids-examples/blob/main/examples/XGBoost-Examples/mortgage/notebooks/python/MortgageETL.ipynb 

How PayPal Reduced Cloud Costs by up to 70% With Spark RAPIDS

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696282170527001JYom 

Ilaay Chen (PayPal)

PayPal has 430+ million users, 25+ billion transactions every year

Fraud Detection; Recommendation systems, Risk, Credit, Customer Support

Find sweet spot of spark.rapids.sql.concurrentGpuTasks

Reduced machine count 140 -> 30, increased disk count per node 4 -> 8

Everything, All at Once: Processing Spatial Transcriptomics Data using Accelerated Computing

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693572452467001bleP 

Jonny Hancox (NVIDIA)

IN PERSON

Compute

Breaking Down The Wall: Accelerator-Native Now (Presented by Voltron Data)

Rodrigo Aramburu (Voltron Data)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1707789858569001Es8i 

Data systems hit a wall: it takes 10x more effort to preprocess data for AI/ML than to train. Issues with interoperability, speed, and scale.

TPC-H 10TB benchmark - the speed increase flattens after 100 nodes

Theseus is 72x faster, 71x cheaper and requires 100x fewer servers.

Lots of spilling

Accelerator-native: GPU as a co-processor (CPU and GPU work together, but shipping data over PCIe is slow)

GPU as a core processor: multiple processes on the same data, but transfers between nodes are slow.

Accelerate the full system: heterogeneous compute, shared memory, IO, networking, GPU Direct Storage

Advances in Optimization AI

Alex Fender (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696279428061001YozJ 

cuOpt - vehicle routing optimization. >15k tasks at once. Objectives, constraints, and variants.

Available as an API. Beats Best Known Solutions (BKS).

Linear programming.

cuOpt agent e.g. use in an LLM.

CUDA: New Features and Beyond

Stephen Jones (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696033648682001S1DC 

How To Write A CUDA Program: The Ninja Edition

Stephen Jones (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696034019866001kPGq 

Multi GPU Programming Models for HPC and AI

Jiri Kraus (NVIDIA)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693575305645001kW6X 

https://github.com/NVIDIA/multi-gpu-programming-models 

MPI, NVSHMEM, NCCL.

Product

From Netflix Recommendations to Conversational Multi-agents: The (R)Evolution of AI-Driven Product Innovation

Xavier Amatriain (Google)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1693530020305001xWen 

The past: data- and algorithm-driven product innovation

Yesterday: ML -> Deep Learning

Today: Product innovation in the Age of Gen AI

Explanations (and UX) matters

Multi-stage systems

GPT = Generative (generates new content based on a natural language input) + Pretrained (trained on huge internet-size datasets, can learn on the fly and be fine-tuned) + Transformer (DL architecture using attention)
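The attention mechanism the "T" refers to can be sketched minimally (single head, no learned projections - just the core score/softmax/weighted-average step):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Each query scores every key; the softmax weights average the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in keys])
        out.append([sum(w * v[j] for w, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
out = attention([[5.0, 0.0]], keys, values)  # query attends mostly to key 0
```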

LLM recommendations (e.g. "I love these movies, can you suggest others?", "Ask yes or no questions and I'll recommend artists for you")

Gen AI products are similar to old AI: importance of product and UX design, importance of evaluation and metrics, importance of domain knowledge

Gen AI products are different from old AI: the UX is the AI, new eval metrics and frameworks (e.g. reduce hallucinations), domain knowledge needed less

Agent = LLM-based system that has access to tools and can decide how to use them

Unlock AI’s Potential: Best Practices for Business-Led Digital Roadmaps and Implementation Challenges

Anne Hecht (NVIDIA), Stefan Goebel (SAP), Giovanni Di Napoli (Medtronic), Stefano Pasquali (BlackRock), Albert Greenberg (Uber)

https://register.nvidia.com/flow/nvidia/gtcs24/attendeeportaldigital/page/sessioncatalog/session/1696216663526001vmPw 

GenAI value streams at BlackRock - Client Experience (Help Chat, Navigation, Visualization)

Investments (Summarization, Trading and investment signals)

Productivity (translation, coding assistant)

Medtronic - Patient care. Model building and deploying via NVIDIA GPUs

SAP - supply chain management, human capital management, spend management, customer relationship management, business technology platform

Uber - Docker fix tool, an agent; does what would take a developer 20 hours