Generative AI with Scientific Diagrams
General Discussion
When we recall how we were educated in science and math, our first learning resource was the textbook, and one of the biggest aids in learning an abstract science or math topic was the scientific diagram or figure attached to the concept or problem. In this area, we explore (1) how well generative AI produces scientific diagrams and (2) whether attaching a scientific diagram to a problem helps an LLM solve a science or math question.
Research on scientific diagram generation
Controllable text-to-image generation with GPT-4: https://arxiv.org/pdf/2305.18583
Sparks of AGI: Early experiments with GPT-4, Bubeck et al. https://arxiv.org/abs/2303.12712 - where the idea of "generate the TikZ code for a unicorn" first appeared.
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model https://arxiv.org/abs/2311.18248 Research done by Alibaba Group - diagram to text.
DiagrammerGPT https://arxiv.org/abs/2310.12128 Done by researchers from UNC Chapel Hill - text to image; they concentrate on a novel two-stage text-to-diagram generation framework that leverages the layout-guidance capabilities of LLMs (e.g., GPT-4) to generate more accurate open-domain, open-platform diagrams:
Stage 1: DiagrammerGPT: use an LLM to generate and iteratively refine "diagram plans" that describe objects, text labels, and their layout.
Stage 2: DiagramGLIGEN: use a layout-guided diagram generator to render diagrams from the diagram plans produced by DiagrammerGPT, adding the planned text labels.
NOTE: To benchmark performance, a new "AI2D-Caption" densely annotated diagram dataset was built on top of the AI2D dataset.
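As a rough illustration of Stage 1's output (the field names below are our own, not the paper's actual schema), a DiagrammerGPT-style "diagram plan" can be modeled as objects and text labels with normalized bounding-box layout, which Stage 2 then renders:

```python
from dataclasses import dataclass, field

@dataclass
class DiagramObject:
    name: str    # e.g. "sun", "plant"
    bbox: tuple  # (x, y, w, h) in normalized [0, 1] coordinates

@dataclass
class TextLabel:
    text: str
    bbox: tuple

@dataclass
class DiagramPlan:
    caption: str
    objects: list = field(default_factory=list)
    labels: list = field(default_factory=list)

# A toy plan for a photosynthesis diagram; a Stage-2 generator would
# draw the objects inside these boxes and overlay the text labels.
plan = DiagramPlan(
    caption="Photosynthesis",
    objects=[DiagramObject("sun", (0.05, 0.05, 0.2, 0.2)),
             DiagramObject("plant", (0.4, 0.5, 0.3, 0.4))],
    labels=[TextLabel("sunlight", (0.2, 0.15, 0.15, 0.05))],
)
```

The iterative-refinement step in the paper amounts to the LLM editing such a plan (moving boxes, adding or removing objects) before anything is rendered.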
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ https://arxiv.org/abs/2310.00367
Highlights:
DaTikZ, the first large-scale TikZ dataset, featuring approximately 120k paired TikZ drawings and captions.
Fine-tuned the open-source LLaMA on DaTikZ and compared performance to GPT-4 and Claude 2.
Try it out here on Hugging Face: https://huggingface.co/spaces/nllg/AutomaTikZ
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots https://arxiv.org/pdf/2405.07990 (uses matplotlib images generated by Python).
Research on MLLM reasoning with diagrams
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI https://arxiv.org/abs/2311.16502 A new benchmark of 11.5K multimodal questions covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.
Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination https://arxiv.org/abs/2401.08025 Basic result: if the model generates an image of the question and that image is attached to the prompt, it solves the math or science question better.
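A minimal sketch of the two-step Self-Imagine loop (the prompt wording and function name are ours, not the paper's exact templates): the model is first asked to render the question as an image (HTML in the paper), and the rendered image is then attached when asking for the answer.

```python
def self_imagine_prompts(question: str):
    """Build the two Self-Imagine prompts; wording is illustrative."""
    # Step 1: ask the model to visualize the question itself
    # (the paper has the model emit HTML, which is rendered to an image).
    visualize = (
        "Generate an HTML page that visually depicts the following "
        f"question:\n{question}"
    )
    # Step 2: answer using both the original text and the rendered image.
    answer = (
        "Using the attached image of the problem and the text below, "
        f"solve the question step by step:\n{question}"
    )
    return visualize, answer

viz, ans = self_imagine_prompts(
    "A train travels 60 km in 45 minutes. What is its speed in km/h?"
)
```

The key design point is that no external image source is needed: the same multimodal model both imagines the picture and consumes it.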
Chain of Images for Intuitively Reasoning https://arxiv.org/abs/2311.09241
Visualization-of-Thought (VoT) Elicits Spatial Reasoning in LLMs https://arxiv.org/abs/2404.03622 (4 Apr 2024) Done by Microsoft researchers. The paper compares four prompting conditions:
• GPT-4 CoT: Let’s think step by step.
• GPT-4 w/o Viz: Don’t use visualization. Let’s think step by step.
• GPT-4V CoT: Let’s think step by step.
• GPT-4 VoT: Visualize the state after each reasoning step.
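The four conditions above differ only in the instruction appended to the task; a sketch of the prompt construction (instruction strings copied from the list above, dictionary keys and function name are ours):

```python
# Instruction suffixes for the four conditions listed above.
PROMPTS = {
    "gpt4_cot": "Let's think step by step.",
    "gpt4_no_viz": "Don't use visualization. Let's think step by step.",
    "gpt4v_cot": "Let's think step by step.",
    "gpt4_vot": "Visualize the state after each reasoning step.",
}

def build_prompt(task: str, condition: str) -> str:
    """Append the condition's instruction to a spatial-reasoning task."""
    return f"{task}\n{PROMPTS[condition]}"

prompt = build_prompt("Navigate the 3x3 grid from A to B.", "gpt4_vot")
```

Note that GPT-4 CoT and GPT-4V CoT share the same instruction; the variable there is the model, not the prompt, while VoT changes only the prompt.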
One can try this with science and math problems in multimodal LLMs to see whether it improves problem-solving performance.