Meeting Notes

Note: Notes for older meetings may be added progressively to the website, but the focus will be on adding notes from newer meetings.

Meeting: April 2, 2024 (Canceled due to lack of discussion topics.)

Meeting: March 26, 2024

Recording

Zoom: https://octoai.zoom.us/rec/share/dabKkPC-h9W_tFXVUkr1LI7h--KuMh7Mqn-fEwsIUgXRN5xSPvfEzri12l_GB-xq._PJ9sN6-6JgE0EoG (Passcode: op$i9B#a)

YouTube: https://www.youtube.com/watch?v=AmeXM8lmA6A

Meeting Notes

Meeting: March 19, 2024  (Canceled due to lack of discussion topics.)

Meeting: March 12, 2024

Recording

Zoom: https://octoai.zoom.us/rec/share/nDo8GTDYGHCPfKou2P6rsABAKKtQ3NQj_DMr7p_SIQk4NaiPadaF9ZDjV0fESwwj.v-Fz_8tqNuGaiu_o (Passcode: jC!B2bQ@) 

YouTube: https://www.youtube.com/watch?v=5O43sXELMkc

Notes

Meeting: March 5, 2024 (Canceled due to lack of discussion topics.)

Meeting: February 27, 2024 (Canceled due to lack of discussion topics.)

Meeting: February 20, 2024

Recording

Zoom: https://octoai.zoom.us/rec/share/SQcE-58GjScLX16Yb22GapMYge5Jibw22UtSsu9cbRtfpHafgVEAUZEMF1CdpR-X.DUw86XtVPJ_o2L3l (Passcode: Bd=J7$gf)

YouTube: https://www.youtube.com/watch?v=cM3ovgz7nwY

Notes

Meeting: February 13, 2024

Recording

Zoom: https://octoai.zoom.us/rec/share/eGEUdAIKJm6RFUJWXX8rsmelfbS2pm39kJTmZiKqRp_W2MLEoIFnhIfrckBDZ2po.he0leG8LAhgYoBZq (passcode: r#sq.60i)
YouTube: https://www.youtube.com/watch?v=_LGl4ocC-EQ

Agenda

Notes

Vertical-Focused Meeting: February 6, 2024

Recording

Zoom: https://octoai.zoom.us/rec/share/xPsc9hlk8YQ5afxdbw8IUyPTcEL4uubQtOuXeVojH7J5fBc-i388YLol3OoMSDKS.5RGgG2MsF5H2ZD7U (Passcode: Ns?a8#cN)

YouTube: https://www.youtube.com/watch?v=jlauO0Uaw2I

Agenda

Notes

Meeting: January 30, 2024

Recording

Zoom:  https://octoml-ai.zoom.us/rec/share/vP5991j_AOY_bAQ0SMGxIbrfuAhx91RhVs8eYLog3VAXvA1nPzOzVW2Jv4UZmGkU.bGSiFqBKub7pFpyP  (Passcode: Z1w+j8e$)

YouTube: https://www.youtube.com/watch?v=ac0qfCJ_Y60

Agenda

Notes

Meeting: January 23, 2024

Recording

Zoom link: https://octoml-ai.zoom.us/rec/share/r6u753L8oHuRvjgMok7u3aYxVhuobSDrIIx_R9-ahNPytRQQJMiWSLyjX9Ck--mX.oqVxX9JhoaUgQ9bo (Passcode: +$EmhiD3)

YouTube: https://www.youtube.com/watch?v=7qldsXnweM8 

Agenda

Notes

Meeting: January 9, 2024

Recording

Zoom link: https://octoml-ai.zoom.us/rec/share/CMu05h_lUOah0fAmEjCzTQxQlNc-_9y6r56BEatTcjpVrKRqQa3YqY0QdaKD9Z3l.ffp3qwoPry1ADynX (Passcode: tizq?I%6)

YouTube: https://www.youtube.com/watch?v=WlpKUaPg4QA 

Agenda

Notes

SLM update

Heterogeneous Computation

Appendix: End-to-End Example

# python tests/python/relax/test_vm_multi_device.py
@tvm.testing.requires_gpu
def test_multi_device():
    @I.ir_module
    class Example:
        I.module_global_infos({"vdevice": [I.vdevice("cuda", 0), I.vdevice("llvm")]})

        @R.function
        def foo(
            x: R.Tensor((2, 3), "float32"),
            y: R.Tensor((3, 4), "float32"),
            z: R.Tensor((4, 5), "float32"),
        ) -> R.Tensor((2, 5), "float32"):
            with R.dataflow():
                lv0: R.Tensor((2, 4), "float32", "llvm") = R.matmul(x, y)
                lv1: R.Tensor((2, 4), "float32", "cuda") = R.to_vdevice(lv0, "cuda")
                gv: R.Tensor((2, 5), "float32", "cuda") = R.matmul(lv1, z)
                R.output(gv)
            return gv

# relax/op/base.py
def to_vdevice(data, dst_vdevice) -> Expr: ...
def hint_on_device(data, dst_vdevice) -> Expr: ...

# relax_vm/builtin.cc
vm.builtin.to_device


# test_transform_realize_vdevice.py
def test_insert_to_vdevice():
    @I.ir_module
    class Input:
        @R.function
        def foo(
            x: R.Tensor((2, 3), "float32"),
            y: R.Tensor((2, 3), "float32"),
            z: R.Tensor((2, 3), "float32"),
        ) -> R.Tensor((2, 3), "float32"):
            with R.dataflow():
                lv0 = R.hint_on_device(y, tvm.cpu())
                lv1 = R.add(x, lv0)
                lv2 = R.hint_on_device(lv1, tvm.cuda())
                lv3 = R.add(lv2, lv2)
                lv4 = R.hint_on_device(z, tvm.cuda())
                gv = R.multiply(lv3, lv4)
                R.output(gv)
            return gv
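To see how such device hints become concrete placements, here is a toy sketch in plain Python (not TVM) of the forward-propagation idea: a hinted binding is pinned to its device, and an unhinted binding inherits the device its inputs agree on. The function name, the tuple encoding, and the default-device fallback are all assumptions for illustration; the real pass works on Relax IR and is more involved (it also propagates hints backward).

def realize_vdevice(bindings, default="llvm"):
    """bindings: list of (name, op, args, hint_or_None); returns {name: device}.

    A binding with an explicit hint is pinned to that device; otherwise its
    device is inferred from its already-placed inputs, falling back to the
    default when the inputs give no single answer.
    """
    device = {}
    for name, op, args, hint in bindings:
        if hint is not None:
            device[name] = hint
        else:
            arg_devs = {device[a] for a in args if a in device}
            device[name] = arg_devs.pop() if len(arg_devs) == 1 else default
    return device

# Mirrors the dataflow block above: lv0/lv1 land on cpu, the rest on cuda.
prog = [
    ("lv0", "hint", ["y"], "cpu"),
    ("lv1", "add", ["x", "lv0"], None),
    ("lv2", "hint", ["lv1"], "cuda"),
    ("lv3", "add", ["lv2", "lv2"], None),
    ("lv4", "hint", ["z"], "cuda"),
    ("gv", "multiply", ["lv3", "lv4"], None),
]
print(realize_vdevice(prog))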


Additional changes for end-to-end support


Pull Request:

Meeting: December 19, 2023

Recording

Zoom link: https://octoml-ai.zoom.us/rec/share/73F6Vm1vPDuKY2pRc9xhgkCl-WAzeTIW7XZZZdSAbUdBj6MiABiGUTG1bQMC7Pi_.nqcikJ6zVPySsezs (Passcode: XbDK3.%$)

YouTube: https://www.youtube.com/watch?v=zsj0Hpg4i9U 

Agenda

Notes


Ending note: Let’s get Unity into the main branch in 2024! In many ways, the unity branch is already the de facto main branch.

Vertical-Focused Meeting: December 12, 2023

Recording

Zoom: https://octoml-ai.zoom.us/rec/share/8Wr-shUZIrKUuafGHgilhDitI7RCNh37pA6JJEMm5v0mnq9vyqgowtoW9rZ61cPs.aMuyO7vU6YBCG78G (Passcode: CR7C.T*Q)

YouTube: https://www.youtube.com/watch?v=_TfuuNPKLwI

Notes

Continuous Batching Support (see slides: https://drive.google.com/file/d/1yVvxzBv-E_szCk7LWL1t_sFo4-_AVHxm/view?usp=drive_link)


Q: What would moving pipelines like these into production look like? Consider issues like containerization: Docker containers need access to GPUs, and might have to deal with passthrough of different device-specific APIs.

A: For common GPUs, passthrough is likely to be easy; less common ones might require more work. We can consider both containerized and non-containerized deployment.


Q: On batching, is there any benefit when there are only one or two GPUs? Would it help process at a higher token rate or be faster? Or do the benefits only manifest when there are many machines?

A: If there are many queries from many users, batching will indeed increase throughput. The bottleneck in LLM inference tends to be memory access rather than compute, so batching requests can be more efficient in this regard.
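A back-of-the-envelope way to see why batching helps a memory-bound workload (a toy model with made-up dimensions, not a measurement): during decoding, the weight matrix must be streamed from memory once per step regardless of how many requests share it, so the FLOPs performed per byte of weight traffic grow linearly with batch size.

def arithmetic_intensity(d_in, d_out, batch, bytes_per_param=2):
    """FLOPs per byte of weight traffic for a [batch, d_in] x [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out               # one multiply-add per weight per row
    weight_bytes = d_in * d_out * bytes_per_param  # weights dominate decode-time traffic
    return flops / weight_bytes

# At fp16, batch 1 does only 1 FLOP per weight byte; batch 16 does 16x more
# work for the same weight traffic.
print(arithmetic_intensity(4096, 4096, batch=1))   # 1.0
print(arithmetic_intensity(4096, 4096, batch=16))  # 16.0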


Q: On Mistral, would this pipeline be able to handle MoE (Mixture of Experts) models? Would it be able to handle comparable batch sizes?

A: We are exploring integration of MoE models, and we might be able to improve their performance using CUTLASS. MoE involves some operators that we haven’t used in LLMs before (e.g., topk and cumsum), so we would have to add support for them in our flow; right now our focus is on improving performance-critical operators. Mistral has some of these features as well, using top-2 selection to choose the highest-scoring experts and cumsum to do a weighted sum of experts. These operators also use different patterns of matrix multiplication (e.g., batchwise), so we might have to explore how to optimize those.
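As a plain-Python sketch of the top-2 gating pattern mentioned above (illustrative only: `top2_route` and `moe_forward` are hypothetical names, and real MoE layers use batched topk/softmax/cumsum kernels over token batches rather than per-call Python):

import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    exps = [math.exp(logits[i]) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def moe_forward(x, experts, logits):
    """Weighted sum of the selected experts' outputs."""
    return sum(w * experts[i](x) for i, w in top2_route(logits))

# Three toy "experts" that just scale their input; the router picks
# experts 2 and 1 with weights ~0.731 and ~0.269.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
print(top2_route([0.0, 1.0, 2.0]))
print(moe_forward(1.0, experts, [0.0, 1.0, 2.0]))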


Q: How is C support handled in nn.SourceModule?

A: The C functions are compiled into TVM PackedFuncs. This capability could be very convenient for enabling other such integrations.


Future efforts: Supporting OpenAI APIs and supporting universal deployment. We have made progress on distributed systems and multi-GPU batching. We hope to stabilize our implementations and make TVM more feature-complete.

Meeting: December 5, 2023

Recording

Zoom: https://octoml-ai.zoom.us/rec/share/elIBVucMUA_NXtzGxH6JqtuscngSV-ZvkzUVS6h2EgAkqC2g4javqnvs-kepyfgT.n9U2yoQCIyFNoP9n (Passcode: M$^2Mk3q)

YouTube Link: https://www.youtube.com/watch?v=DYp-rVIdTWc

Agenda

Notes

DistIR


Meeting: November 21, 2023

No recording: nobody in the meeting had permission to record in Zoom, and attempts to record with local software failed (OBS crashed, and Audacity on Linux was too hard to get working).

Agenda

Notes

Vertical-Focused Meeting: September 5, 2023

Recording

YouTube link: https://www.youtube.com/watch?v=GcbuODb51Sc

Agenda

Vertical-Focused Meeting: July 25, 2023

Recording

https://octoml-ai.zoom.us/rec/share/tweHayS90DlVHXN_57EHCoW67MBYX2pXICUE3FMTdEV29m8eZjfPXCGSe96t6zsK.ITdBBoO0_N-2LDzg (Passcode: Z@c^2TkZ)

YouTube link: https://www.youtube.com/watch?v=ZpCtJ_0QqgU

Agenda

Notes

Slides: https://docs.google.com/presentation/d/1eNinLTcVrMnPWKKBKUy3DQ5VlrPKNTB5377_kjgTszc/edit#slide=id.p (Archived link)


LLM inference with FasterTransformer kernel: https://gist.github.com/masahi/079a72120cd54fbb0d3ebf29a751fed1


Roadmap Discussion:


Open discussion: