Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture ...
Abstract: This work proposes TimeChat, a time-sensitive multi-modal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: ...