Meeting Notes
Note: Notes for older meetings may be added progressively to the website, but the focus will be on adding notes from newer meetings.
Meeting: April 2, 2024 (Canceled due to lack of discussion topics)
Meeting: March 26, 2024
Recording
Zoom Link: https://octoai.zoom.us/rec/share/dabKkPC-h9W_tFXVUkr1LI7h--KuMh7Mqn-fEwsIUgXRN5xSPvfEzri12l_GB-xq._PJ9sN6-6JgE0EoG (Passcode: op$i9B#a)
YouTube: https://www.youtube.com/watch?v=AmeXM8lmA6A
Meeting Notes
Unknown Purity in Relax
Came up in SLM-to-Relax translations: it is difficult to follow Relax's conventions for well-formed purity annotations
Turned out to be very difficult to avoid introducing overhead for recursive functions
SLM to Relax conversions are inside a dataflow block, but many operations like print cannot be done inside a DF block!
Optimizations might end up removing incorrectly used impure functions
Possible resolution: Do not use DF blocks and use the DF extraction pass as a post-processing step
Remaining problem: Purity for subroutine calls!
More robust solution: Let’s infer purity from function bodies if it’s unspecified
However, recursion does not allow for this to be done fully locally
Global functions get forward declarations in the module, but this would require users to annotate purity in the signature, which they don’t always do
Defaulting to pure is really assuming a special case rather than requiring the user to tell us that it’s the special case or proving it for ourselves
Proposal: Introduce an “unknown” value for purity and have it be the default. Unknown purity should be taken as “possibly impure.”
This would help with inference since it would allow for distinguishing user intent (“I assert that this is pure” vs “I make no assertion at all”)
The presence of unknowns indicates that inference is not complete
Implementation: Optional[bool]
Possible worst case that we need to handle:
def f(x): g(x)
def g(x): f(x)  # can add more calls
We need to make sure we handle weird cases like these!
Can catch with a fixpoint algorithm, but when do we run it?
One proposal is catching it when we get rid of DF blocks because that’s the last chance to catch such an error
Another option is trying to catch these cases earlier in case there’s an error
It’s unlikely we’ll have a lot of mutual recursion so probably fine to catch it later—local inference will suffice in the vast majority of cases
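As an illustration (plain Python, not TVM code), here is a minimal sketch of what such a fixpoint purity inference could look like over the proposed Optional[bool] representation, assuming every function has an entry in the call map:
from typing import Dict, Optional, Set
def infer_purity(calls: Dict[str, Set[str]],
                 locally_impure: Set[str]) -> Dict[str, Optional[bool]]:
    # True = pure, False = impure, None = unknown ("possibly impure")
    purity: Dict[str, Optional[bool]] = {
        f: (False if f in locally_impure else None) for f in calls
    }
    changed = True
    while changed:  # iterate until no annotation changes (fixpoint)
        changed = False
        for f, callees in calls.items():
            if purity[f] is not None:
                continue
            if any(purity[g] is False for g in callees):
                purity[f] = False  # calling a known-impure function is impure
                changed = True
            elif all(purity[g] is True for g in callees):
                purity[f] = True  # no impure ops and all callees known pure
                changed = True
    # Mutually recursive functions with no other information stay None ("unknown"),
    # signalling that inference is incomplete and they must be treated as possibly impure.
    return purity
# Example: infer_purity({"f": {"g"}, "g": {"f"}}, set()) leaves both f and g as None.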
DF block behavior: Don’t throw an error if there’s an unknown because it might be okay
Subtyping: pure <: unknown <: impure. For example, in a DF block, unknown is not an error but impure is an error
Having pure be a subtype of impure can lead to some odd situations, though. For example, if we were checking for impure calls in DF blocks by looking at types, we would consider a pure call to be an error! (This is not how it’s implemented)
The reason we have it this way is so that a higher-order function expecting an impure function argument can accept a pure function
Where pure functions are required, we can provide pure or unknown, but it must be checked eventually
Where impure is permitted, pure is also acceptable
Do we still need force-pure? Yes, because it’s meant to handle the case of mutating only local values. We might want to consider trying to detect that case, because 99% of the time (other than dealing with external funcs) it will be easy to do
Speculative topic: Improving operator StructInfo checking
Many ops check StructInfo by asserting specific shapes or that there’s a tensor type
However, if we pass in ObjectStructInfo or unknown shapes, the values might be valid at run time but not at compile time!
We could handle this by normalizing the operators to use MatchCast to check these conditions. This would correctly assign the symbolic values, not affect purity, etc., but remove the dependency on pass ordering
MatchCast = way to check a precondition directly in Relax
We can try to move MatchCast as early as possible in order to hoist preconditions, possibly into the signature
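A rough TVMScript sketch of the idea (illustrative, not taken from any actual operator's checking logic): instead of rejecting an argument whose shape is unknown at compile time, the precondition is expressed as a MatchCast, which also binds the symbolic size n for later shape reasoning:
from tvm.script import relax as R, tir as T
@R.function
def checked_add(x: R.Tensor(dtype="float32"), y: R.Tensor(dtype="float32")):
    n = T.int64()
    # Precondition made explicit: both inputs must be length-n vectors
    xv = R.match_cast(x, R.Tensor((n,), "float32"))
    yv = R.match_cast(y, R.Tensor((n,), "float32"))
    z = R.add(xv, yv)
    return z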
Meeting: March 19, 2024 (Canceled due to lack of discussion topics.)
Meeting: March 12, 2024
Recording
Zoom: https://octoai.zoom.us/rec/share/nDo8GTDYGHCPfKou2P6rsABAKKtQ3NQj_DMr7p_SIQk4NaiPadaF9ZDjV0fESwwj.v-Fz_8tqNuGaiu_o (Passcode: jC!B2bQ@)
YouTube: https://www.youtube.com/watch?v=5O43sXELMkc
Notes
General theme: Have more explicit representations of things that are currently implicit to assist in lowering and other compiler passes, especially of low-level concepts
First example: Reading information out of DLTensors
Relax normally abstracts over this low-level information
This PR exposes functions for accessing various representation parameters at run time
Example: This enables a conditional datatype conversion based on the tensor’s representation. This is occasionally important but Relax would not be able to express this without constructs like these
Expanding the use of primitives: Support for R.Prim(“bool”) in If nodes and assert ops
This change would allow for symbolic values to affect control flow
Similarly allows for more powerful conditionals that relate to low-level information
Syntactic sugar for void types (unit types), which required a normalization change (inlining unit tuples)
Potential issue: What if you have a var bound to ()? You can’t omit those bindings or else the vars are invalid
Solution: Inline the assignments
These changes make Relax easier to use, easier to read, and easier to debug without restricting functionality
Q: Can you have scalars on another device? Do we ever need to move values in order to support conditionals?
A: Generally they’re supposed to be on the host, but not necessarily! TIR PrimFuncs can create scalars on other devices. There is a TIR pass that checks for device movement. The PackedFunc interface that TIR generates will always move *scalar* values over, so this should still work out for using R.Prim values in the control flow (data that is pointed to must be explicitly transferred)
We test this case
Do we want to revisit the issue of the distinction between scalar tensors and PrimValues?
Other frameworks deal with this distinction too, e.g., zero-dimensional np.arrays versus Python scalars in NumPy
We could say “tensors are for things that are expensive to move, PrimValues are for things that are cheap to move”: Arithmetic with other tensors is for tensor values, but PrimValues are easy to send along
Scalar tensors are useful for propagating results of prior operations, PrimValues are useful for propagating values to the users
Scalar tensors are mutable but PrimValues aren’t. You could use a scalar tensor as a “future” for the result of a later operation versus a PrimValue for something that can be precomputed.
This is important to deal with because pattern matching might be looking for one or the other
There’s a question of ordering as well: Can we push scalar computations earlier or later? If it’s something that could be lifted out from another, more expensive computation, it might be good to express it with PrimValues
We could come up with properties that are good to have for PrimValues: cheap, can be precomputed, should be treated as immutable scalars. It becomes a question of should rather than must. Tensors are more flexible and processed more by the compiler. PrimValues are less automated and expose lower-level things
This is an option available to the user but should be used with consideration
Q: Are there other corners of the language where we don’t handle PrimValues but we should?
Arithmetic operations: We don’t have any way to, say, divide by the size of an array. It would be good to expose some of this metadata as PrimValues
We should have a good way of converting from a PrimValue to a scalar tensor. Possible principle: Make this automatic and implicit?
When is this a problem? PrimFuncs and PackedFuncs care about the representation of data. For these cases, we can have a manual conversion/assertion—for PackedFuncs, we can’t automate this analysis
Rule we can go with: Implicit promotion from PrimValue to scalar tensor when this can be done, always allow for explicit changes
Possible future direction: Pattern replacement in SLM. Existing compiler optimizations don't really target the implementations in SLM; they target a different interface that is very hard for SLM users to wrap their heads around.
Proposal is to have SLM-to-SLM transformations that are easier for high-level users to write
Have a user provide an optimized version of a network that can be substituted for the original instances as a SLM-to-SLM transform
The transformation can also be lowered into a Relax-to-Relax transform via the dataflow pattern matcher!
Tricky part: Lowering from SLM-to-SLM to Relax-to-Relax makes the pattern more granular (SLM has nn.Module boundaries, Relax doesn’t). We can replicate this by compiling SLM into separate Relax functions versus a single bundled function (this was proposed as an option)
Need to see if this will be useful for SLM’s intended audience (people writing machine learning models). We can solicit feedback from the TVM Discuss forum and discuss in future meetings as needed
Meeting: March 5, 2024 (Canceled due to lack of discussion topics.)
Meeting: February 27, 2024 (Canceled due to lack of discussion topics.)
Meeting: February 20, 2024
Recording
Zoom: https://octoai.zoom.us/rec/share/SQcE-58GjScLX16Yb22GapMYge5Jibw22UtSsu9cbRtfpHafgVEAUZEMF1CdpR-X.DUw86XtVPJ_o2L3l (Passcode: Bd=J7$gf)
YouTube: https://www.youtube.com/watch?v=cM3ovgz7nwY
Notes
List of open PRs that could use review! Courtesy of Eric Lunderberg (many thanks for compiling and categorizing the list!)
Improve TIR analysis
PR-16588: Improvement to arith.analyzer.ConstIntBound. Improve bounds for common dynamic patterns occurring in batched LoRA.
Add new Relax transforms
PR-16596: New optional pass relax.transform.ReorderPermuteDimsAfterConcat. Intended for use after CombineParallelMatmul, to restore R.matmul(state, R.permute_dims(weights)) pattern.
Improve existing Relax transforms. These PRs either add opt-in configuration options to existing relax transforms, or improve the handling of specific IR constructs that may appear in their inputs
PR-16450: Improvement to relax.transform.FuseOps. Avoid extra symbolic variables if they are always used within the same TIR expression.
PR-16589 (requires PR-16588): Improvement to relax.transform.AdjustMatmulOrder. Find optimal ordering of R.linear as well as R.matmul.
PR-16590: Improvement to relax.transform.ExpandMatmulOfSum. In addition to R.matmul(state, R.add(base,lora_contrib)), handle R.linear(state, R.add(base,lora_contrib)).
PR-16591: Improvement to relax.transform.CombineParallelMatmul. Allows dynamic shapes in the matmuls being combined, so long as the result can still use R.op.split.
PR-16594: Improvement to relax.transform.LiftTransformParams. Propagate any human-readable names from the original relax function.
PR-16595 (requires PR-16594): Improvement to relax.transform.LiftTransformParams. If an expression doesn't depend on the model weights (e.g. R.zeros([256], "float16")), it does not need to be lifted.
PR-16597: New argument for relax.transform.BundleModelParams, to specify the name of the bundled param (e.g. Distinct "base_params" and "lora_params" arguments.)
PR-16599: Improvement to relax.transform.EliminateCommonSubexpr(). Eliminate common relax::MatchCast bindings as well as common relax::VarBinding
PR-16602: Improvement to relax.transform.LazyTransformParams. Model weights may be provided as multiple function parameters instead of a single tuple argument.
Bugfixes. These PRs resolve incorrect output or avoid throwing errors from well-formed IR.
PR-16598: Bugfix for relax.transform.FuseOps. Input IRModule may contain relax-to-relax function calls where the subroutine is not marked with attr::kComposite.
PR-16592: Bugfix in legalization of R.nn.attention. Use tir.abs instead of python abs in case the input shape is a symbolic expression.
PR-16584: Bugfix for tir.PrimFunc.specialize. Remove StructInfo if present, in case the PrimFunc inherited annotations from a relax primitive function.
PR-16562: Bugfix for TVMScript. Handle R.match_cast that occurs as the last binding of an if/else block.
Debugging improvements. These PRs do not resolve any specific issue, nor add new functionality, but were implemented as part of investigation and debugging failures in other PRs.
PR-16544: Expand DWARF debug symbols in LLVM codegen. Generate DWARF debug symbols with the PrimFunc's name, parameter names, and internal variable names.
PR-16421: Opportunistically check for SSA violation in relax.transform.FuseOpsByPattern. These errors are the result of malformed input IRModule, and checking for it here is quick to do, and produces much, much clearer error messages than if the SSA violation were allowed to continue propagating downstream.
PR-16585: Usability improvement to relax.analysis.VerifyWellFormed. If an impure call is used in an invalid context, tell the user what the impure call was.
PR-16574: Test coverage for relax.transform.RemoveUnusedParameters. Test removal of parameters that could define symbolic variables, but those symbolic variables are already defined.
MLC-LLM update: https://github.com/mlc-ai/web-llm/pull/300 OpenAI-like API has been PR’d and is ready for review
Well-formedness PR: https://github.com/apache/tvm/pull/16569 Adding in a well-formedness check for PrimFuncs has exposed some potential bugs in test cases. Advice on fixing the underlying bugs might be helpful
Update on documentation?
Documentation for migrating from Relay would be helpful
MLC AI course has a lot of material (some of it might be dated) and could be used as a source of documentation: https://mlc.ai/
Do not hesitate to post questions in the forum! They will draw attention to what is lacking in the docs
Meeting: February 13, 2024
Recording
Zoom: https://octoai.zoom.us/rec/share/eGEUdAIKJm6RFUJWXX8rsmelfbS2pm39kJTmZiKqRp_W2MLEoIFnhIfrckBDZ2po.he0leG8LAhgYoBZq (passcode: r#sq.60i)
YouTube: https://www.youtube.com/watch?v=_LGl4ocC-EQ
Agenda
Standing agenda item: Updates, PRs, and announcements (open floor)
Notes
Updates
FuseTIR and FuseOps now handle in-place operators, though we need more fusion patterns (that is the next step)
PR coming for vectorized sampling
Open Discussion
Issue with removing deprecated dependencies. There are some old dependencies in tests
Example: TensorFlow 1.x is unlikely to be supported in the future, so we shouldn't keep tests for it
This can go hand-in-hand with documentation updates
Also a cost to run this in CI constantly
Example: MXNet dependency issue https://github.com/apache/tvm/issues/16547
We also have a lot of very old open issues that may not be actionable or relevant, as well as open PRs. It’s likely worth it to close these. The Unity transition is a good time to clean these up
Questions about gather and scatter patterns in in-place PrimFuncs
Gather pattern: result[iterator] = input[f(iterator)]
Scatter pattern: result[f(iterator)] = input[iterator]
See https://discuss.tvm.apache.org/t/discuss-inplace-and-fusion-opportunities-in-llm-runtime/16292 for an example
Rather than checking for any specific pattern in the index function f, we should instead check that the other side of the assignment just uses the iterator in order (see the sketch below)
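A minimal sketch of the two patterns in plain Python (not TIR); the distinguishing feature is which side of the assignment applies the index function f (assumed here to be a permutation of range(n)):
def gather(inp, f, n):
    out = [None] * n
    for i in range(n):       # gather: reads are indirect, writes use the iterator in order
        out[i] = inp[f(i)]
    return out
def scatter(inp, f, n):
    out = [None] * n
    for i in range(n):       # scatter: writes are indirect, reads use the iterator in order
        out[f(i)] = inp[i]
    return out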
Vertical-Focused Meeting: February 6, 2024
Recording
Zoom: https://octoai.zoom.us/rec/share/xPsc9hlk8YQ5afxdbw8IUyPTcEL4uubQtOuXeVojH7J5fBc-i388YLol3OoMSDKS.5RGgG2MsF5H2ZD7U (Passcode: Ns?a8#cN)
YouTube: https://www.youtube.com/watch?v=jlauO0Uaw2I
Agenda
WebLLM v1 roadmap
Recent updates
Roadmap
OpenAI API
Support for more modalities
Chrome plugin – start with gmail/overleaf
Vision
Agent-ify: https://github.com/mlc-ai/mlc-assistant
MLC Serve
Notes
WebLLM v1
WebLLM allows for running LLMs locally using the WebGPU API
Testbed for new language models, users can demo new models with it
Very quick turnaround on supporting new models using TVM
Implementation: Using nn.Module in Relax, migrating to a new KVCache interface in Relax
Lately mostly chat demos, but hoping to add more functionality
Moving forward, it is very important to have API compatibility with what is being used widely in practice, esp. OpenAI
Interested in more modalities, like vision. Can use OpenAI APIs to embed images and do visual question answering (LLaVA)
May develop new modules for including new embeddings or including audio data
Another area of interest: Making a Chrome plugin. Another experiment: MLC-Assistant, which uses a local REST API (uses MLC-LLM locally, not via WebGPU), but could be made more lightweight with WebLLM (may be hard to use WebGPU in a way that persists across webpages, may also be hard to extract DOM information from certain services)
MLC Serve
https://github.com/mlc-ai/mlc-llm/tree/serving Old branch
Now on main branch! Greater OpenAI API compatibility, improved performance, and more multi-platform support (NVIDIA, AMD, Apple)
OpenAI API compatibility: JSON format and tool calls support, updating REST API as well (deprecate old code and use methods that are more compatible with OpenAI). Will update documentation as well
Currently mainly supports NVIDIA, but also adding support for Apple and AMD via native attention libraries
Model support: Models are available via the nn.Module framework (SLM) and do not need to be compiled separately. The Llama family and Mistral are the main priorities for support, but hoping to extend to more model architectures
Performance: Currently ~10% better throughput and latency than vLLM, on par with MLC Chat in single-sequence applications
Q: What is the difference between the APIs?
A: Previous API processed requests sequentially. Current one can batch requests and handle them together, saving time. This is the main change.
Q: Is the JSON format vectorized? Previous attempts to handle it have not done that
A: Requests are processed via a finite state machine. This should be pretty performant, though we will profile it.
Q: Update on TVMCon?
A: Not currently planned, it will likely not be Q1 if it happens this year
Q: What issues do attendees face running MLC in the cloud? Is there advice for running MLC operationally? Is educating the public a priority for the MLC project?
A: Partly a matter of TVM’s documentation, partly about MLC itself. MLC has gone through several rounds of changes so as features stabilize, we will document it further. Updating TVM documentation will be a priority as well, especially since we have merged unity
Response: Hard for users to feel comfortable asking for help given that a lot is changing, but MLC-LLM is very valuable and useful so people will want to try to use it. Hard to debug issues related to specific hardware, specific systems. Tough to ask for help because it’s hard to know if it’s a TVM issue, MLC issue, driver issue, etc., so some guidance would be very helpful.
We would love to have more testing of models on different hardware platforms so that we could improve reliability in these contexts
Maybe MLC lab hours would be valuable in general
Specific technical question: Having some trouble embedding with MLC. Interested in processing user data locally out of privacy concerns, hoping to compute some large-scale embeddings. Have been able to use sentence-level transformer on AWS, but hoping to use e5-mistral-7b-instruct locally. Problem: names for weights tend to be an issue when converting weights. Is that hard-coded anywhere?
A: Yes, there are some manual name mappings in the handling of Mistral models. It may indeed require some changes on the MLC side
May need to slice a padding layer out of the model and do it separately to fix this issue
Q: Why are names not just read from the model files?
A: Likely due to variations among implementations of Mistral. We need to assume some convention, hard to “infer” from just the names
Meeting: January 30, 2024
Recording
Zoom: https://octoml-ai.zoom.us/rec/share/vP5991j_AOY_bAQ0SMGxIbrfuAhx91RhVs8eYLog3VAXvA1nPzOzVW2Jv4UZmGkU.bGSiFqBKub7pFpyP (Passcode: Z1w+j8e$)
YouTube: https://www.youtube.com/watch?v=ac0qfCJ_Y60
Agenda
Now will be called simply the “TVM Open Development Meeting”
In-place opportunities in the LLM Run-time: https://discuss.tvm.apache.org/t/discuss-inplace-and-fusion-opportunities-in-llm-runtime/16292 (posted by Tianqi Chen)
Documentation
Please review the Relax spec RFC: https://github.com/apache/tvm-rfcs/pull/106
Let’s get old tutorials and explanations together and cleaned up for the site
New forum post: https://discuss.tvm.apache.org/t/discuss-tvm-unity-transition-docs-refactor/16325
Revisiting previous issue: Checking well-formedness after parsing (last discussed Oct. 3)
MLC occasionally puts packed calls into DF blocks
call_pure_packed was added to address this when it’s reasonable
DF block extraction can be used to simplify this process too
It would be preferable to check DF blocks after parsing, but let’s ensure it won’t break MLC and other important code
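For reference, a small illustrative sketch of the call_pure_packed pattern mentioned above, which lets a packed call appear inside a DF block by asserting that it is pure (the packed function name here is hypothetical):
from tvm.script import relax as R
@R.function
def use_packed(x: R.Tensor((4,), "float32")) -> R.Tensor((4,), "float32"):
    with R.dataflow():
        # The user asserts purity, so the packed call is allowed inside the DF block
        y = R.call_pure_packed("my_pure_packed_func", x,
                               sinfo_args=R.Tensor((4,), "float32"))
        R.output(y)
    return y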
Procedure: How do we wrangle topics for meetings?
Notes
In-place opportunities
Fusion with in-place operations can eliminate many intermediate allocations
Overwriting memory is a real issue though, since it can violate purity. We should expect users to do this explicitly
Q: Is overwriting parameters even when the user does it explicitly a violation of purity that we should be careful about?
A: Yes, it's dangerous, but we should consider having call_tir_inplace consume its input
Yet another ownership issue, we should find some way to express this
Follow-up: https://discuss.tvm.apache.org/t/inplacetransformparams/16323
Doing in-place transformations on parameters to prepare for other operations (doing it explicitly and allocating more memory incurs too much memory cost)
Goal is to avoid intermediate allocations
Q: How would we deal with a case like doing a tensor and its transpose from the same backing allocation? What about casting from one dtype to another with the same size in bytes?
A: There are some dangers here.
Can accomplish this with storage manipulation (builtin.alloc_storage and builtin.alloc_tensor), but this creates many issues; it might be worth experimenting with.
Documentation: Let’s take discussion to the thread
Well-formedness issue: Let’s make a PR to MLC and update the parser, there’s no obstacle really
Gathering topics
We could have a default agenda to ensure that there will be some discussion
Look at recent big PRs
LLM-focused ones: Try to reach out early and often, especially to MLC contributors
We could also try posting farther in advance, e.g., two weeks before ordinary meetings and a month before vertical-focused meetings
Meeting: January 23, 2024
Recording
Zoom link: https://octoml-ai.zoom.us/rec/share/r6u753L8oHuRvjgMok7u3aYxVhuobSDrIIx_R9-ahNPytRQQJMiWSLyjX9Ck--mX.oqVxX9JhoaUgQ9bo (Passcode: +$EmhiD3)
YouTube: https://www.youtube.com/watch?v=7qldsXnweM8
Agenda
General point: Unity is now in main!
We should strive to get some docs up!
Procedural change: Should we start using RFCs to propose changes now? Where do we draw the line?
Phase ordering and dataflow block extraction (Steven Lyubomirsky)
Heterogeneous execution implementation: Correctness conditions (continuation) (Yong Wu)
Notes
Procedural Points
Very important to incorporate changes quickly!
We can continue this meeting in its current form
Documentation is indeed important
Something else to consider: Deprecating older components
Many changes we might want to consider in terms of testing infrastructure too. There are lots of long-running tests and repetitive tests, especially in older files. We might want to triage test cases, find ways to run certain tests less frequently
Right now, we're not stepping on each other's toes as far as Relax development goes. We will need to have a policy eventually. Relax is still in a lot of flux and is chasing the moving target of generative AI, so we should prioritize being open to change.
Q: Will Relay be deprecated?
A: We encourage adoption of Relax, especially for the latest models, but Relay is continuing to be tested and maintained. If Relax can’t handle something Relay does, that should be pointed out.
Q: What about training?
A: Training in Relax is being worked on and support for gradients is comparable to that in Relay.
Q: Will Relax support 2DTexture as Relay does?
A: It doesn’t right now but it should not be hard to implement this in Relax’s flow. All that would be needed is an additional compilation pass. Relax’s memory planning is based on TVM’s previous allocators so it should not be hard to port over such features if users are interested.
Q: Does Relax have the same end-to-end tests as in Relay?
A: We have some end-to-end tests but the emphasis is more on unit tests, since end-to-end testing is expensive. Unit tests are also easier to maintain and understand.
On documentation
Relax specification RFC is public: https://github.com/apache/tvm-rfcs/pull/106
Links to old tutorials will be appreciated, we will try to put these on the website
Dataflow Block Extraction (continuation from past discussion)
Should DF block extraction be built into the compiler? Users don’t have a reason to specify DF blocks in the front-end code and it’s arguably redundant to have the user manually write out what the compiler is capable of automatically inferring
One opinion: It might be risky to have dataflow block extraction always used by default, so we could leave it as an option in the parser. Will make a PR and we can see if there are any issues with using it
Using DF extraction late in compilation flow: In principle, we can cross this bridge when we get to it, i.e., if we have a concrete example of a pass that needs dataflow information late in compilation, that would be a reason to try to get it all working at that stage. We would have to track purity late in compilation to make this work
Heterogeneous Execution (continuation from past discussion)
We should be careful about stating how VDevices are supposed to work
Some error conditions
All VDevices must be listed in the global module info
Every tensor in the program should have an inferred VDevice if tensors are specified
The hint_on_device operator is used by RealizeVDevice to figure out where tensors are supposed to be located; manual type annotations and MatchCast are used as well
Difference between MatchCast and hint_on_device: MatchCast will fill in all parts of the tensor struct info, but hint_on_device changes only the VDevice
MatchCast only checks conditions, while RealizeVDevice might insert copies to move tensors from one device to another
RealizeVDevice is applied early in compilation to propagate VDevice information and should not interfere with other passes or ordinary StructInfo checking
The draft Relax specification has text addressing this subject, so we can carry on discussion in the RFC.
Meeting: January 9, 2024
Recording
Zoom link: https://octoml-ai.zoom.us/rec/share/CMu05h_lUOah0fAmEjCzTQxQlNc-_9y6r56BEatTcjpVrKRqQa3YqY0QdaKD9Z3l.ffp3qwoPry1ADynX (Passcode: tizq?I%6)
YouTube: https://www.youtube.com/watch?v=WlpKUaPg4QA
Agenda
SLM update by Lesheng Jin
Heterogeneous execution implementation by Yong Wu
Notes
SLM update
Slides: https://docs.google.com/presentation/d/1Ap61I1VozBUM2k6MtwQF5CrC4dnsbnA4N3kenZPV_Zc/edit?usp=sharing (original: https://docs.google.com/presentation/d/1j5Dl-4P-RQCT5Ts4xsJRyNKmHyrp1sDL0r7j1qn5qBo/edit#slide=id.g27f8f7d3040_0_2304)
Goal: Improving compilation and quantization (made agile!) via a convenient PyTorch-like interface
Anecdotally, it takes less time to define a model using SLM than manually in Relax (figuring out shapes and dtypes from PyTorch models can be a source of pain, as is debugging). Debugging with SLM was relatively painless: it was possible to simply print intermediate values and compare them to the Hugging Face counterpart, like with PyTorch
SLM can be found in mlc-llm
Also led to speedups! Thanks to incorporating better kernels (e.g., applying FlashInfer), which would have been more difficult when using Relax directly
Used to implement support for Mixtral in less than a day, already achieved better performance than the listed benchmark
Easier to implement support for more quantization algorithms, batched serving, etc.
Difficulties in defining a model directly in Relax
Need to know a lot about TVM’s infrastructure
Operators spread out over Relax, TIR, TOPI, TE
Hard to print intermediate values for debugging
No JIT, have to recompile the whole thing between every change
Some of PyTorch’s useful features: Inheriting from nn.Module to avoid filling in lots of boilerplate, defining recursive modules for model layers. This is implemented in SLM
Another convenience: Wrapper over KVCache (otherwise requires builtin functions in Relax). Classed as an Effect (used to implement other kinds of IO and stateful operators)
Modules work like in PyTorch, can be invoked by calling the provided “forward” method
Dtype and shape are accessible through convenient wrappers. Arithmetic functions also have syntactic sugar (+, -, /, *)
Models can be exported into Relax using the model.export_tvm function. This relies on having a specification for the IRModule, which defines the function signatures (it's a manual step, but adds some control over the final IRModule)
Effects are useful for debugging. For example, there’s a print operator that is implemented as an Effect
JIT compiler: Converts the SLM module into a TorchModule for quick testing. TorchModules take inputs as Torch tensors and convert them into TVM values, run the model in TVM, and convert the results back into Torch tensors. Torch tensors have a convenient interface, saving users some work in otherwise converting inputs to TVM’s interfaces
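A minimal sketch of the workflow described above, assuming the tvm.relax.frontend.nn API that SLM builds on (the model and its dimensions are illustrative, and the exact return value of export_tvm may differ slightly):
from tvm.relax.frontend import nn
class TwoLayerMLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int):
        self.fc1 = nn.Linear(in_features, hidden)
        self.fc2 = nn.Linear(hidden, out_features)
    def forward(self, x: nn.Tensor) -> nn.Tensor:
        # Operator sugar (+, *, etc.) and nn.op wrappers keep this close to PyTorch
        return self.fc2(nn.op.silu(self.fc1(x))) + x
model = TwoLayerMLP(64, 128, 64)
# The export spec pins down the signature of each exported Relax function
mod, params = model.export_tvm(
    spec={"forward": {"x": nn.spec.Tensor((1, 64), "float32")}}
)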
Quantization: New quantization algorithms have emerged recently, such as AutoGPTQ and Auto AWQ. It is important to support new approaches quickly.
In practice, these can be implemented by iterating over the layers and making substitutions of specialized quantized implementations. nn.Mutator performs such substitutions in SLM. These do not require any knowledge of TVM internals to use
Support for multi-GPU settings implemented via integration with Disco! (See past presentation on Disco.) Tensors can be sharded across GPUs via certain operators, with a sharding strategy specified via an attribute
Many operators have been implemented to support cross-layer operators, needed for Mixtral and FlashInfer
Q: There is some syntactic sugar in SLM, like x + y being R.add(x, y). How is this implemented?
A: These are just operator overrides on the Tensor class in SLM. No need for a separate parser like in TVMScript
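A toy illustration (not the real SLM Tensor class) of the mechanism described in the answer: dunder methods on the tensor wrapper dispatch to explicit op constructors, so no separate parser is needed.
class ToyTensor:
    def __init__(self, expr: str):
        self.expr = expr  # stands in for a wrapped Relax expression
    def __add__(self, other: "ToyTensor") -> "ToyTensor":
        # In SLM this would emit the corresponding Relax op (R.add)
        return ToyTensor(f"R.add({self.expr}, {other.expr})")
print((ToyTensor("x") + ToyTensor("y")).expr)  # prints: R.add(x, y)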
Heterogeneous Computation
See past discussions on this subject; today we will focus on the implementation of heterogeneous execution
Highlights of changes:
Tensor StructInfo now notes virtual devices that tensors are located on (from list of all devices, given as a module-level attr)
R.to_vdevice operator indicates that a value should be copied from one device to another
R.hint_on_device: Serves as a hint that a tensor should be located on a given device. The RealizeVDevice pass will use this to include the virtual device information in the annotation
RealizeVDevice pass propagates virtual device information throughout the IRModule (inserting annotations where they are missing)
No need to specify a target to the Relax build if there are VDevices given. It is an error to give a “target” if the Module contains virtual device info
Q: Why does hint_on_device need to be an operator as opposed to using StructInfo labels or match_cast?
A: It is easy to look for hint_on_device calls to make changes to VDevices. Using StructInfo directly could work too, but the operator might be easier to work with
Q: Should RealizeVDevice be part of the normalizer? It updates StructInfo, which is also what the normalizer does
A: RealizeVDevice does some backward propagation, unlike the normalizer. This could cause some issues if we threw it into the normalizer, since it would cause the BlockBuilder (which calls the normalizer constantly, assuming it to work only in the forward direction) to have some issues.
(Followup: Maybe it’s worth including it alongside the normalizer but not in the normalizer since there is also the issue that users would have to know when RealizeVDevice needs to be invoked.)
We will probably continue the discussion, since there are other potential spec issues
Appendix: End-to-End Example
# python tests/python/relax/test_vm_multi_device.py
@tvm.testing.requires_gpu
def test_multi_device():
    @I.ir_module
    class Example:
        I.module_global_infos({"vdevice": [I.vdevice("cuda", 0), I.vdevice("llvm")]})

        @R.function
        def foo(
            x: R.Tensor((2, 3), "float32"),
            y: R.Tensor((3, 4), "float32"),
            z: R.Tensor((4, 5), "float32"),
        ) -> R.Tensor((2, 5), "float32"):
            with R.dataflow():
                lv0: R.Tensor((2, 4), "float32", "llvm") = R.matmul(x, y)
                lv1: R.Tensor((2, 4), "float32", "cuda") = R.to_vdevice(lv0, "cuda")
                gv: R.Tensor((2, 5), "float32", "cuda") = R.matmul(lv1, z)
                R.output(gv)
            return gv
# relax/op/base.py
def to_vdevice(data, dst_vdevice) -> Expr
def hint_on_device(data, dst_vdevice) -> Expr
# relax_vm/builtin.cc
vm.builtin.to_device
# test_transform_realize_vdevice.py
def test_insert_to_vdevice():
    @I.ir_module
    class Input:
        @R.function
        def foo(
            x: R.Tensor((2, 3), "float32"),
            y: R.Tensor((2, 3), "float32"),
            z: R.Tensor((2, 3), "float32"),
        ) -> R.Tensor((2, 3), "float32"):
            with R.dataflow():
                lv0 = R.hint_on_device(y, tvm.cpu())
                lv1 = R.add(x, lv0)
                lv2 = R.hint_on_device(lv1, tvm.cuda())
                lv3 = R.add(lv2, lv2)
                lv4 = R.hint_on_device(z, tvm.cuda())
                gv = R.multiply(lv3, lv4)
                R.output(gv)
            return gv
Additional changes for end to end
Add the kTarget attribute to prim_func during LegalizeOps if the related vdevice is annotated.
Change the CallTIRRewrite pass to a ModulePass; find and fill in the device_index in the emitted alloc_tensor by looking up the global vdevice list of the context module.
Make the target argument of relax.build optional, and use the annotated target in the vdevice list for compilation if it is not specified.
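A sketch of the intended usage after these changes, reusing the Example module from the appendix above (the exact API surface and device ordering may differ):
import tvm
from tvm import relax
ex = relax.build(Example)  # no explicit target: taken from the module's vdevice list
vm = relax.VirtualMachine(ex, [tvm.cuda(0), tvm.cpu(0)])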
Pull Request:
Meeting: December 19, 2023
Recording
Zoom link: https://octoml-ai.zoom.us/rec/share/73F6Vm1vPDuKY2pRc9xhgkCl-WAzeTIW7XZZZdSAbUdBj6MiABiGUTG1bQMC7Pi_.nqcikJ6zVPySsezs (Passcode: XbDK3.%$)
YouTube: https://www.youtube.com/watch?v=zsj0Hpg4i9U
Agenda
Dataflow block extraction (https://github.com/apache/tvm/pull/16204)
Notes
Bottom line: No need to write DF blocks manually anymore (why would we?)
Main question: Should this be part of the standard compilation flow?
Only reason not to run every time would be if we need preprocessing before the conversion. For example, unrolling a dynamic loop might be something we want to do before DF extraction
Another case: Getting rid of conditionals, which we might want to do earlier rather than later
There’s a phase ordering issue: We want to do the transformation early because we have lots of passes that operate on DF blocks but doing it immediately might make some kinds of preprocessing difficult. We might need to revisit the phase ordering tracking issue
Q: If we have two blocks with an impure operation separating them and we get rid of the operation, do they get combined?
A: Yes, the normalizer does it.
Might we want to run this pass multiple times in the course of compilation in that case? Passes can affect the DF requirements of parts of the program
Perhaps DF blocks should only exist for the passes that require them
One approach might be running this pass before any pass that requires DF blocks. Downside: compile-time expense.
Other approach: Using this pass as a cleanup step if a pass knowingly messes with DF requirements. Downside: Passes need to be aware of this
We can discuss further in 2024
Ending note: Let’s get Unity into the main branch in 2024! In many ways, the unity branch is already the de facto main branch
Vertical-Focused Meeting: December 12, 2023
Recording
Zoom: https://octoml-ai.zoom.us/rec/share/8Wr-shUZIrKUuafGHgilhDitI7RCNh37pA6JJEMm5v0mnq9vyqgowtoW9rZ61cPs.aMuyO7vU6YBCG78G (Passcode: CR7C.T*Q)
YouTube: https://www.youtube.com/watch?v=_TfuuNPKLwI
Notes
Continuous Batching Support (see slides: https://drive.google.com/file/d/1yVvxzBv-E_szCk7LWL1t_sFo4-_AVHxm/view?usp=drive_link)
Serving Branch of MLC-LLM (“serving”)
Implements a serving framework for continuous batching
API compatibility with OpenAI (not yet complete, but making progress)
For chat completions
JSON format
Distributed (“Disco”) runtime used to implement tensor parallelism
Paged KV Cache used to support efficient KV value management
Adopted from the vLLM Project
Supports common-prefix caching
Allows for storing additional context for queries
Also using efficient optimized attention kernels
On CUDA, we use FlashInfer
Plan is to generalize techniques from FlashInfer and FlashAttention to other platforms, implemented with handwritten TVM IR code
Will be advertised in a blog post once we have benchmark numbers
Support for AMD and Apple GPUs in addition to NVIDIA
Plan is to support Llama for the first version, others to come later (e.g., Mistral)
We also intend to use the SLIM model definition/compilation flow later on
Expected rollout by the end of the year
SLIM: Our “second-generation” model input format, goal is to make it easier to have custom quantization formats and make it easier to ship models to more platforms
FlashInfer Integration
Focused on long-context and batching/serving scenarios
Library actually shipped with some TVM support built in
In MLC-LLM, we hook into FlashInfer using nn.SourceModule (which allows for directly injecting C code into the compilation pipeline and specifying compilation options)
The FlashInfer APIs are accessed directly through these hooks
Substantial performance gains at larger context sizes with FlashInfer (however, there might be an accuracy tradeoff at the largest sizes, so these results should be interpreted carefully)
Q: What does moving pipelines like these into production look like? Consider issues like Dockerization (container access to GPUs); for example, Docker might have to deal with passthrough of different device-specific APIs
A: For common GPUs, passthrough is likely to be easy. Less common ones might require more work to have passthrough. We can consider cases of containerized and non-containerized deployment
Q: On batching, is there any benefit when there are only one or two GPUs? Would it help process at a higher token rate or be faster? Or do the benefits only manifest when there are many machines?
A: If there are many queries from many users, batching will indeed increase throughput. The bottlenecks in LLMs tend to come from memory accessing, so batching requests might be more efficient in this regard.
Q: On Mistral, would this pipeline be able to handle MoE (Mixture of Experts) models? Would it be able to handle comparable batch sizes?
A: We are exploring integration of MoE models. We might be able to improve their performance using CUTLASS. MoE involves some operators that we haven’t used in LLMs, so we would have to add support for them in our flow. (E.g., topk, cumsum.) Right now our focus is on improving performance-critical operators. Mistral has some of these features as well (using top2 to choose a branch with the best score and cumsum to do a weighted sum of experts). These operators also use different patterns of matrix multiplication (e.g., batchwise) so we might have to explore how to optimize those.
Q: How is C support handled in nn.SourceModule?
A: The C functions are compiled into TVM PackedFuncs. This capability could be very convenient for enabling other such integrations.
Future efforts: Supporting OpenAI APIs and supporting universal deployment. We have made progress on distributed systems and multi-GPU batching. We hope to stabilize our implementations and make TVM more feature-complete.
Meeting: December 5, 2023
Recording
https://octoml-ai.zoom.us/rec/share/elIBVucMUA_NXtzGxH6JqtuscngSV-ZvkzUVS6h2EgAkqC2g4javqnvs-kepyfgT.n9U2yoQCIyFNoP9n (Passcode: M$^2Mk3q)
YouTube Link: https://www.youtube.com/watch?v=DYp-rVIdTWc
Agenda
Hongyi Jin: DistIR
Notes
DistIR
Presentation Notes: https://thirsty-airplane-926.notion.site/distir-unity-meeting-presentation-717b0bde2fe248898a66c5e22d5ead00?pvs=4 (archive link: https://drive.google.com/file/d/12Z3SNrDGx1OsO8asEWBW29cU5df2QEzM/view?usp=drive_link)
Addresses problems related to distributed execution
Sharding has been a particular issue historically
There is sharding in terms of both data and computation
DistIR takes user input on how data should be sharded and uses that to infer computation sharding; computation sharding occasionally requires communication between devices
New StructInfo: DTensor, which has fields similar to a tensor but also takes device_mesh and placement fields. Device_mesh describes which devices the tensor is sharded across (the mesh is described with a shape and range of device IDs) and placement describes what data is given to each device in the mesh (the tensor can be replicated on a device or part of it can be sharded to the device)
Sharding propagation: Sharding annotations can be propagated according to certain rules, allowing a model’s behavior to be defined over the whole device mesh
The sharding information is very helpful for lowering into TIR, which will also produce sharded computations
Future work: Improving sharding specifications, adding Python parser support
Meeting: November 21, 2023
No recording! Nobody in the meeting had permissions to record in Zoom and attempts to use local software to do it all failed (OBS crashed and Audacity on Linux was too hard to get working)
Agenda
Progress on a website to collect meeting notes and maybe a Youtube channel (Yong Wu, Steven Lyubomirsky)
In-place operations: https://github.com/apache/tvm/pull/16129 (Steven Lyubomirsky)
Dynamic tuple indices: https://github.com/apache/tvm/pull/16002 (Eric Lunderberg)
Notes
Progress on setting up a website and YouTube channel: No progress, but we will keep it in the agenda until it’s done
In-place transformation runs end-to-end
Next step is to profile at scale and see if there’s a performance impact
LLaMa: Only about 3% of the total operators can be made in-place by this method, but a performance gain of a couple percent is still a performance gain
Q: Is it permissible to use the arguments as an in-place destination? The current test case does
Probably not a good idea. However, if we inline the functions, this might be a reasonable move
Principle of least surprise: Don’t overwrite arguments. However, we can get some benefit from inlining internal function calls to get rid of this detriment
Action item: Revise the PR not to overwrite function arguments
Dynamic tuple indices
Motivation: Useful for pre-sharded model weights, surprisingly. Currently using strided slice to get out a parameter set from sharded weights when it would be easier to use split and a dynamic rank. Using strided slice in this case ends up being less efficient (strided slices are not necessarily contiguous, so the op is more expensive than an ordinary split, which would be a view without copying). See: https://github.com/mlc-ai/mlc-llm/pull/1096/files#diff-b633effd581b1f5f13d5b961dfbaca3b43a3d9f417046319de4e5e96556a53f1R216-R230
Likely to simplify a lot of special-case handling of TupleGetItem in many passes
Implementation: Use a builtin op to handle dynamic indexing and use the custom normalization (FNormalize) to normalize it into a static TupleGetItem when the index is statically known
Also allows for an enormous simplification in the parser, now we can use Python indexing notation and we don’t have to worry about whether the argument is an integer literal
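A rough sketch of the normalization idea (not the actual FNormalize hook), using a hypothetical "tuple_get_item_dyn" builtin taking (tuple_value, index) as its arguments:
from tvm import relax, tir
def normalize_dynamic_tuple_get(call: relax.Call) -> relax.Expr:
    # If the index argument is a statically known integer PrimValue,
    # rewrite the dynamic accessor into an ordinary static TupleGetItem;
    # otherwise keep the dynamic builtin call as-is.
    tup, index = call.args
    if isinstance(index, relax.PrimValue) and isinstance(index.value, tir.IntImm):
        return relax.TupleGetItem(tup, index.value.value)
    return call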
What is holding up the PR?
It turned out we were missing unit tests for tuples of primitive types, and this has become a bit of an issue! The problem has to do with the TVM FFI (tests compile but fail at run time): the FFI coerces Python integers into TIR IntImms when they are passed as arguments to the VM, which is not the same thing as an integer PrimValue. Correcting this will require low-level changes, as the problem impacts every TVM runtime container
Possible solution: https://github.com/apache/tvm/pull/15983#issuecomment-1802501675 Don’t use IntImm as a runtime container, instead have boxed representations. This will require checks at the FFI to determine when to box/unbox (currently we have a forward transformation into ObjectRefs, so we should also have a reverse transformation that undoes it)
“Convenience is a benefit in itself”: Even though there’s a workaround, improving the underlying issue will improve our tools for users
Possible simplification to the type system: if we can have ordinary tuples of PrimValues, shape values are no longer necessary, which would let us get rid of a type and a separate runtime object
Note for readers’ benefit: The issue of boxing was raised right when PrimValues were proposed over a year ago
Vertical-Focused Meeting: September 5, 2023
Recording
YouTube link: https://www.youtube.com/watch?v=GcbuODb51Sc
Agenda
MultiGPU support
Batching discussions
FlashInfer - Zihao
Disco - Junru
New model support roadmap discussions
Vertical-Focused Meeting: July 25, 2023
Recording
https://octoml-ai.zoom.us/rec/share/tweHayS90DlVHXN_57EHCoW67MBYX2pXICUE3FMTdEV29m8eZjfPXCGSe96t6zsK.ITdBBoO0_N-2LDzg (Passcode: Z@c^2TkZ)
YouTube link: https://www.youtube.com/watch?v=ZpCtJ_0QqgU
Agenda
MLC-LLM documentation
Supporting more models and hardware
Performance optimization
Roadmaps discussions and suggestions
Notes
Slides: https://docs.google.com/presentation/d/1eNinLTcVrMnPWKKBKUy3DQ5VlrPKNTB5377_kjgTszc/edit#slide=id.p (Archived link)
LLM inference with FasterTransformer kernel: https://gist.github.com/masahi/079a72120cd54fbb0d3ebf29a751fed1
Roadmap Discussion:
Aiming to improve performance, especially on Android, with DLight
Multimodality: Hoping to get more image generation models
Python API: Expanding tutorials and improving usability
nn.Module: Should also be a major UX improvement
Open discussion:
Q: Is finetuning quantized models part of our goals?
Finetuning is something we are interested in doing and it’s certainly possible
Memory constraints and taking advantage of hardware properties are challenges
We may have finetuning results in the coming months
Right now, MLC-LLM can deploy finetuned checkpoints
Q: Why was LLaMa 2 supported so quickly whereas other models have taken a long time to support?
Question of priorities. We have been focusing on getting “out-of-the-box performance” for many models
Not hard to get a model minimally working, but users expect to get state-of-the-art performance, so we do not want to rush the release
With LLaMa 2, we got lucky: It came out right after we had developed DLight, which helped us get strong performance right away
We are now aiming to support larger sizes of LLaMa 2. We will focus on model coverage more in coming months
Q: Are there any plans to bring autotuning support for TIR dispatch?
The premise of DLight is getting good performance “out of the box” without autotuning and that’s our priority right now
We may add a tuner that works on top of DLight’s rules to try to get the best of both worlds
DLight is designed to be compatible with existing autotuning infrastructure, so it should be possible in principle to perform autotuning on anything we’re producing
Qualcomm has some recent tuning results that they are planning to share with the community, perhaps showing the possibility of combining autotuning with DLight