diff --git a/docs/awesome_llm_data.md b/docs/awesome_llm_data.md
index c5f4dc51a..1156c7427 100644
--- a/docs/awesome_llm_data.md
+++ b/docs/awesome_llm_data.md
@@ -2,10 +2,11 @@
 Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, tagged according to the **taxonomy** proposed in our data-model co-development [survey](https://arxiv.org/abs/2407.08583), as illustrated below.
 ![Overview of Our Taxonomy](https://img.alicdn.com/imgextra/i1/O1CN01aN3TVo1mgGZAuSHJ4_!!6000000004983-2-tps-3255-1327.png)
-Soon we will provide a dynamic table of contents to help readers more easily navigate through the materials with features such as search, filter, and sort.
-
 Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. **Please feel free to make pull requests or open issues to [contribute to](#contribution-to-this-survey) this list and add more related resources!**
+# News
++ ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-10-23] We built a [dynamic table](https://modelscope.github.io/data-juicer/_static/awesome-list.html) based on the [paper list](#paper-list) that supports filtering and searching.
++ ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-10-22] We restructured our [paper list](#paper-list) to provide more streamlined information.
 ## Candidate Co-Development Tags
@@ -72,6 +73,8 @@ These tags correspond to the taxonomy in our paper, and each work may be assigne
 ## Paper List
+Below is a paper list summarized based on our survey. Additionally, we have provided a [dynamic table](https://modelscope.github.io/data-juicer/_static/awesome-list.html) that supports filtering and searching and uses the same data source as the list below.
+
 | Title | Tags |
 |-------|-------|
 |No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance|![](https://img.shields.io/badge/Data4Model--Scaling--Up--Acquisition-f1db9d) ![](https://img.shields.io/badge/Data4Model--Scaling--Effectiveness--CrossModalAlignment-f1db9d) ![](https://img.shields.io/badge/Model4Data--Synthesis--Evaluator-b4d4fb)|
diff --git a/docs/sphinx_doc/source/_static/awesome-list.html b/docs/sphinx_doc/source/_static/awesome-list.html
new file mode 100644
index 000000000..560e0c88b
--- /dev/null
+++ b/docs/sphinx_doc/source/_static/awesome-list.html
@@ -0,0 +1,1478 @@
+ + + + +
+ + + +Welcome to the "Awesome List" for data-model co-development of Multi-Modal Large Language Models (MLLMs), a continually updated resource tailored for the open-source community. This compilation features cutting-edge research and insightful articles on improving MLLMs through data-model co-development, tagged according to the taxonomy proposed in our data-model co-development survey, as illustrated below.
+ + + +Due to the rapid development in the field, this repository and our paper are continuously being updated and synchronized with each other. Please feel free to make pull requests or open issues to contribute to this list and add more related resources!
+ +Title | +Tags | +
---|---|
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Effectiveness->CrossModalAlignment + Model4Data->Synthesis->Evaluator + | +
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | ++ Model4Data->Synthesis->Creator + | +
Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain | ++ Data4Model->Usability->Ethic->Toxicity + | +
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets | ++ Data4Model->Scaling Effectiveness->Condensation + | +
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | ++ Model4Data->Synthesis->Creator + Model4Data->Insights->Visualizer + | +
VideoChat: Chat-Centric Video Understanding | ++ Model4Data->Synthesis->Creator + Model4Data->Synthesis->Mapper + | +
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex | ++ Model4Data->Synthesis->Mapper + | +
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | ++ Model4Data->Synthesis->Creator + | +
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting | ++ Data4Model->Scaling Up->Acquisition + Model4Data->Synthesis->Mapper + | +
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | ++ Data4Model->Scaling Up->Acquisition + | +
Audio Retrieval with WavText5K and CLAP Training | ++ Data4Model->Scaling Up->Diversity + Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Eval->Retrieval + | +
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Demystifying CLIP Data | ++ Data4Model->Scaling Effectiveness->Mixture + | +
Learning Transferable Visual Models From Natural Language Supervision | ++ Data4Model->Scaling Up->Acquisition + | +
DataComp: In search of the next generation of multimodal datasets | ++ Data4Model->Scaling Effectiveness->Condensation + Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Eval->Generation + Model4Data->Synthesis->Filter + | +
Beyond neural scaling laws: beating power law scaling via data pruning | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Flamingo: a visual language model for few-shot learning | ++ Data4Model->Scaling Effectiveness->Mixture + | +
Quality not quantity: On the interaction between dataset design and robustness of CLIP | ++ Data4Model->Scaling Effectiveness->Condensation + Data4Model->Scaling Effectiveness->Mixture + | +
VBench: Comprehensive Benchmark Suite for Video Generative Models | ++ Data4Model->Usability->Eval->Generation + | +
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | ++ Data4Model->Usability->Eval->Generation + | +
Training Compute-Optimal Large Language Models | ++ Data4Model->Scaling Up->Acquisition + | +
NExT-GPT: Any-to-Any Multimodal LLM | ++ Data4Model->Scaling Up->Acquisition + | +
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
ChartReformer: Natural Language-Driven Chart Image Editing | ++ Data4Model->Scaling Up->Acquisition + Model4Data->Insights->Visualizer + | +
GroundingGPT: Language Enhanced Multi-modal Grounding Model | ++ Data4Model->Usability->Responsiveness->ICL + | +
Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | ++ Data4Model->Usability->Responsiveness->Prompt + | +
Kosmos-2: Grounding Multimodal Large Language Models to the World | ++ Data4Model->Usability->Responsiveness->Prompt + | +
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters | ++ Model4Data->Synthesis->Filter + Model4Data->Synthesis->Creator + | +
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | ++ Model4Data->Synthesis->Creator + Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Up->Diversity + Data4Model->Usability->Responsiveness->HumanBehavior + | +
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset | ++ Data4Model->Usability->Eval->Understanding + | +
Structured Packing in LLM Training Improves Long Context Utilization | ++ Data4Model->Scaling Effectiveness->Packing + | +
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | ++ Data4Model->Scaling Effectiveness->Packing + | +
MoDE: CLIP Data Experts via Clustering | ++ Data4Model->Scaling Effectiveness->Packing + | +
Efficient Multimodal Learning from Data-centric Perspective | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | ++ Data4Model->Scaling Up->Augmentation + | +
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | ++ Data4Model->Usability->Eval->Understanding + | +
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | ++ Data4Model->Usability->Eval->Understanding + | +
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | ++ Data4Model->Scaling Up->Acquisition + | +
Perception Test: A Diagnostic Benchmark for Multimodal Video Models | ++ Data4Model->Usability->Eval->Understanding + | +
FunQA: Towards Surprising Video Comprehension | ++ Data4Model->Usability->Eval->Reasoning + | +
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | ++ Data4Model->Usability->Eval->Understanding + Model4Data->Synthesis->Creator + | +
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | ++ Data4Model->Usability->Eval->Reasoning + | +
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Reasoning->SingleHop + | +
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Eval->Understanding + | +
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | ++ Data4Model->Usability->Eval->Understanding + Model4Data->Synthesis->Creator + Data4Model->Scaling Up->Diversity + | +
WorldGPT: Empowering LLM as Multimodal World Model | ++ Data4Model->Usability->Eval->Generation + | +
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | ++ Data4Model->Usability->Responsiveness->Prompt + Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Responsiveness->ICL + | +
TextSquare: Scaling up Text-Centric Visual Instruction Tuning | ++ Data4Model->Scaling Up->Acquisition + Model4Data->Synthesis->Creator + Model4Data->Synthesis->Filter + Model4Data->Synthesis->Evaluator + | +
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | ++ Data4Model->Usability->Eval->Understanding + Data4Model->Scaling Up->Acquisition + | +
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | ++ Data4Model->Usability->Responsiveness->ICL + Model4Data->Insights->Navigator + | +
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | ++ Data4Model->Scaling Effectiveness->Packing + | +
Fewer Truncations Improve Language Modeling | ++ Data4Model->Scaling Effectiveness->Packing + | +
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | ++ Data4Model->Usability->Reasoning->MultiHop + Model4Data->Synthesis->Mapper + | +
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | ++ Data4Model->Scaling Up->Acquisition + Model4Data->Synthesis->Mapper + | +
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark | ++ Data4Model->Usability->Eval->Understanding + Model4Data->Synthesis->Creator + | +
Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | ++ Data4Model->Scaling Up->Augmentation + Model4Data->Synthesis->Creator + | +
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | ++ Data4Model->Usability->Responsiveness->Prompt + Data4Model->Usability->Ethic->Toxicity + Model4Data->Synthesis->Evaluator + | +
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | ++ Data4Model->Scaling Up->Acquisition + | +
The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | ++ Data4Model->Usability->Ethic->Toxicity + | +
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | ++ Model4Data->Synthesis->Mapper + Data4Model->Scaling Up->Acquisition + | +
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | ++ Data4Model->Usability->Eval->Understanding + Model4Data->Synthesis->Evaluator + | +
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | ++ Data4Model->Usability->Eval->Generation + Data4Model->Usability->Ethic->Toxicity + | +
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | ++ Data4Model->Usability->Responsiveness->ICL + Data4Model->Usability->Reasoning->MultiHop + Data4Model->Scaling Up->Diversity + | +
M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | ++ Data4Model->Usability->Eval->Understanding + | +
MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | ++ Model4Data->Insights->Analyzer + Model4Data->Synthesis->Mapper + | +
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | ++ Model4Data->Insights->Analyzer + | +
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | ++ Model4Data->Insights->Analyzer + | +
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | ++ Data4Model->Scaling Up->Augmentation + | +
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | ++ Model4Data->Insights->Analyzer + | +
Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | ++ Data4Model->Usability->Eval->Understanding + Data4Model->Usability->Eval->Retrieval + | +
On the Adversarial Robustness of Multi-Modal Foundation Models | ++ Data4Model->Usability->Ethic->Toxicity + | +
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | ++ Data4Model->Usability->Reasoning->SingleHop + Model4Data->Synthesis->Filter + Model4Data->Synthesis->Creator + | +
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | ++ Data4Model->Scaling Up->Acquisition + | +
PaLM-E: An Embodied Multimodal Language Model | ++ Data4Model->Scaling Up->Diversity + | +
Multimodal Data Curation via Object Detection and Filter Ensembles | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Sieve: Multimodal Dataset Pruning Using Image Captioning Models | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Towards a statistical theory of data selection under weak supervision | ++ Data4Model->Scaling Effectiveness->Condensation + | +
𝐷2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | ++ Data4Model->Scaling Up->Diversity + Data4Model->Scaling Effectiveness->Condensation + | +
UIClip: A Data-driven Model for Assessing User Interface Design | ++ Data4Model->Scaling Up->Acquisition + | +
CapsFusion: Rethinking Image-Text Data at Scale | ++ Data4Model->Scaling Up->Augmentation + | +
Improving CLIP Training with Language Rewrites | ++ Model4Data->Synthesis->Mapper + Data4Model->Scaling Up->Augmentation + | +
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | ++ Data4Model->Usability->Eval->Generation + | +
A Decade's Battle on Dataset Bias: Are We There Yet? | ++ Data4Model->Scaling Effectiveness->Mixture + | +
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
Data Filtering Networks | ++ Data4Model->Scaling Effectiveness->Condensation + | +
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | ++ Data4Model->Scaling Effectiveness->Condensation + | +
InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Align and Attend: Multimodal Summarization with Dual Contrastive Losses | ++ Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | ++ Data4Model->Usability->Reasoning->SingleHop + Data4Model->Usability->Reasoning->MultiHop + Data4Model->Usability->Eval->Reasoning + | +
Text-centric Alignment for Multi-Modality Learning | ++ Model4Data->Synthesis->Mapper + | +
Noisy Correspondence Learning with Meta Similarity Correction | ++ Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | ++ Data4Model->Usability->Reasoning->MultiHop + | +
Language-Image Models with 3D Understanding | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Reasoning->SingleHop + Data4Model->Usability->Reasoning->MultiHop + | +
Scaling Laws for Generative Mixed-Modal Language Models | ++ Data4Model->Scaling Up->Acquisition + | +
BLINK: Multimodal Large Language Models Can See but Not Perceive | ++ Data4Model->Usability->Eval->Understanding + | +
Visual Hallucinations of Multi-modal Large Language Models | ++ Data4Model->Usability->Eval->Generation + | +
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | ++ Data4Model->Usability->Responsiveness->Prompt + Data4Model->Usability->Reasoning->MultiHop + | +
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Reasoning->MultiHop + Model4Data->Synthesis->Creator + | +
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Reasoning->MultiHop + | +
Visual Instruction Tuning | ++ Data4Model->Scaling Up->Acquisition + Model4Data->Synthesis->Creator + Model4Data->Synthesis->Mapper + | +
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Effectiveness->CrossModalAlignment + Data4Model->Usability->Responsiveness->HumanBehavior + | +
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | ++ Data4Model->Usability->Responsiveness->Prompt + | +
On the De-duplication of LAION-2B | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Scaling Effectiveness->Mixture + | +
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | ++ Data4Model->Usability->Eval->Understanding + | +
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | ++ Data4Model->Usability->Responsiveness->Prompt + | +
Data Augmentation for Text-based Person Retrieval Using Large Language Models | ++ Data4Model->Scaling Up->Augmentation + Data4Model->Scaling Effectiveness->Mixture + Model4Data->Synthesis->Mapper + | +
Aligning Actions and Walking to LLM-Generated Textual Descriptions | ++ Data4Model->Scaling Up->Augmentation + Model4Data->Synthesis->Mapper + | +
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | ++ Data4Model->Scaling Up->Augmentation + | +
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | ++ Data4Model->Scaling Up->Diversity + | +
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | ++ Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | ++ Model4Data->Synthesis->Creator + | +
Probing Multimodal LLMs as World Models for Driving | ++ Data4Model->Usability->Eval->Understanding + Data4Model->Usability->Eval->Reasoning + | +
Unified Hallucination Detection for Multimodal Large Language Models | ++ Data4Model->Usability->Eval->Generation + Model4Data->Insights->Extractor + Model4Data->Synthesis->Mapper + | +
SemDeDup: Data-efficient learning at web-scale through semantic deduplication | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Automated Multi-level Preference for MLLMs | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
Silkie: Preference distillation for large visual language models | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
Aligning Large Multimodal Models with Factually Augmented RLHF | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | ++ Data4Model->Usability->Responsiveness->HumanBehavior + | +
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | ++ Data4Model->Scaling Effectiveness->CrossModalAlignment + | +
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | ++ Data4Model->Usability->Eval->Generation + Model4Data->Synthesis->Evaluator + | +
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | ++ Data4Model->Usability->Eval->Understanding + Data4Model->Usability->Eval->Retrieval + | +
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | ++ Data4Model->Usability->Eval->Reasoning + | +
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | ++ Data4Model->Usability->Ethic->Toxicity + Model4Data->Synthesis->Evaluator + Model4Data->Synthesis->Creator + | +
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | ++ Data4Model->Usability->Ethic->Toxicity + | +
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | ++ Data4Model->Usability->Ethic->Toxicity + | +
Improving Multimodal Datasets with Image Captioning | ++ Data4Model->Scaling Effectiveness->Condensation + | +
Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | ++ Model4Data->Insights->Analyzer + | +
LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | ++ Model4Data->Insights->Extractor + | +
PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | ++ Model4Data->Insights->Extractor + Model4Data->Synthesis->Mapper + | +
CiT: Curation in Training for Effective Vision-Language Data | ++ Data4Model->Scaling Effectiveness->Condensation + Data4Model->Scaling Effectiveness->Mixture + | +
InstructPix2Pix: Learning to Follow Image Editing Instructions | ++ Model4Data->Synthesis->Creator + | +
Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | ++ Model4Data->Insights->Visualizer + | +
ModelGo: A Practical Tool for Machine Learning License Analysis | ++ Data4Model->Usability->Ethic->Privacy&IP + | +
Scaling Laws of Synthetic Images for Model Training ... for Now | ++ Data4Model->Scaling Up->Acquisition + Data4Model->Usability->Responsiveness->Prompt + | +
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | ++ Data4Model->Scaling Up->Diversity + | +
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | ++ Data4Model->Usability->Responsiveness->Prompt + | +
Segment Anything | ++ Data4Model->Scaling Up->Acquisition + | +
AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | ++ Data4Model->Usability->Responsiveness->ICL + | +
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ++ Data4Model->Usability->Responsiveness->ICL + | +
All in an Aggregated Image for In-Image Learning | ++ Data4Model->Usability->Responsiveness->ICL + | +
Panda-70M: Captioning 70M videos with multiple cross-modality teachers | ++ Data4Model->Scaling Up->Acquisition + | +
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | ++ Data4Model->Scaling Up->Acquisition + | +
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ++ Data4Model->Scaling Up->Acquisition + | +