Video Understanding
-
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
Machine Learning/MLLM 2024. 8. 17. 22:50
Can Multimodal Large Language Models (MLLMs) effectively serve as judges in multimodal domains, and how closely do their evaluations align with human preferences? -> Evaluating the judging ability of MLLMs (meta-evaluation). 1. Introduction. Inspiration: LLM-as-a-Judge (https://arxiv.org/abs/2306.05685 / https://arxiv.org/html/2403.02839v1). Overview: evaluates MLLMs' ability to make judgments across diverse modalities, covering the following three forms of judgment: Scoring Eval..
-
Video ReCap: Recursive Captioning for Hour-Long Videos
Machine Learning/MLLM 2024. 8. 17. 22:29
https://sites.google.com/view/vidrecap (Video ReCap: Hierarchical Video Captioning Task)
https://arxiv.org/abs/2402.13250
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
UNC Chapel Hill and Meta AI. Accepted by CVPR 2024.
[Paper] [Code] [Dataset] [Demo] [HF]
Abstract: Video ReCap. Existing v..
-
Video Understanding Paper Summary (data-centric)
Machine Learning/MLLM 2024. 8. 17. 22:19
0. Overall Summary

|              | Data Source                                       | Data Generation          | Post-Processing    | # Data (for tuning) |
|--------------|---------------------------------------------------|--------------------------|--------------------|---------------------|
| 1. LLaVA     | Public Dataset (COCO Images)                      | ChatGPT-4                |                    | 158K                |
| 2. MiniGPT-4 | Public Dataset (Conceptual Captions)              | Initial Pretrained Model | ChatGPT-4 + manual | 3.5K                |
| 3. Valley    | Jukinmedia (73k) + LLaVA (150k) + VideoChat (11k) | ChatGPT-4                |                    | 234k                |

..