With the rise of various generative models, music generation has become a prominent field. This paper addresses the challenge of generating background music that aligns both semantically and temporally with the video content. The main contributions of this paper are twofold: first, the introduction of V2M, a large-scale video-music paired dataset, and second, the development of VidMuse, a framework that combines both long-term and short-term modules to effectively integrate local and global cues from video into the generated music.
The related work spans four areas: Video Representation, Audio-Visual Alignment, Conditional Music Generation, and Video-to-Music Datasets. Various models have been proposed to learn spatio-temporal video representations, such as ViViT, which targets video classification and action recognition, and VideoMAE, a masked autoencoder for video understanding. Models like ImageBind have advanced multi-modal alignment, including audio-visual alignment; ImageBind also serves as an evaluation metric in this paper. Conditional music generation has been explored in works like M2uGen and Video2Music, which use diverse conditional inputs across modalities (e.g., emotional cues, text, video, and images). Datasets such as HIMV-200K provide video-music pairs for music retrieval, but are limited in genre diversity and quality. Addressing these limitations, this paper presents a video-to-music generation framework that relies exclusively on video input, alongside a large-scale, high-quality video-music paired dataset.
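Since ImageBind is used here as an evaluation metric, the video-music alignment score can be understood as a similarity between cross-modal embeddings. The sketch below is a minimal, hedged illustration: it computes cosine similarity between placeholder vectors, whereas the paper's evaluation would obtain the embeddings from ImageBind's actual video and audio encoders (the function name and embedding dimension are assumptions for illustration).

```python
import numpy as np

def imagebind_style_score(video_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between a video embedding and an audio embedding.

    In the paper's evaluation these embeddings would come from ImageBind's
    video and audio encoders; here they are placeholder vectors.
    """
    v = video_emb / np.linalg.norm(video_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(np.dot(v, a))

# Toy example: a well-aligned pair is the video embedding plus small noise.
rng = np.random.default_rng(0)
video_emb = rng.standard_normal(1024)
audio_emb = video_emb + 0.1 * rng.standard_normal(1024)
score = imagebind_style_score(video_emb, audio_emb)  # close to 1 for aligned pairs
```

A higher score indicates the generated music's embedding sits closer to the video's embedding in the shared space, which is the intuition behind using ImageBind as an alignment metric.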
The paper makes two major contributions: the V2M dataset and the VidMuse framework.
To ensure high-quality data, V2M employs a multi-step dataset construction pipeline to clean and process source videos from YouTube, drawing on a subset of HIMV-200K and YouTube-8M videos labeled with "Music" and "Trailer" tags. The paper notes that the resulting dataset consists mainly of movie trailers.
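The initial selection step can be pictured as tag-based filtering over video metadata. This is a hedged sketch only: the record fields (`tags`) and function name are assumptions, and the paper's real pipeline applies further cleaning stages beyond this first filter.

```python
def select_candidates(records, required_tags=("Music", "Trailer")):
    """Keep only metadata records carrying all required tags.

    Mirrors the paper's source selection from HIMV-200K / YouTube-8M,
    where candidate videos are those labeled "Music" and "Trailer".
    The `tags` field name is an assumption for illustration.
    """
    return [r for r in records if set(required_tags) <= set(r.get("tags", []))]

# Usage: only the first record passes the tag filter.
records = [
    {"id": "a", "tags": ["Music", "Trailer"]},
    {"id": "b", "tags": ["Music"]},
    {"id": "c", "tags": []},
]
kept = select_candidates(records)  # -> [{"id": "a", ...}]
```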
Figure: Dataset Construction pipeline.
The proposed framework comprises four major components: (1) a Visual Encoder, (2) a Long-Short-Term Visual module (LSTV), (3) a Music Token Decoder, and (4) an Audio Encoder/Decoder.
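The key idea of the LSTV module is to combine global cues (sparsely sampled frames covering the whole video) with local cues (a dense window around the current position). The sketch below illustrates that idea only: it fuses the two branches by mean-pooling and concatenation, whereas the actual LSTV module uses learned attention; the stride and window sizes are assumptions.

```python
import numpy as np

def long_short_term_features(frame_feats: np.ndarray,
                             window: int = 8,
                             long_stride: int = 4) -> np.ndarray:
    """Toy fusion of global and local visual context.

    frame_feats: (T, D) per-frame embeddings from a visual encoder.
    Long-term branch: frames strided across the whole clip (global cues).
    Short-term branch: the most recent `window` frames (local cues).
    Fusion here is mean-pooling + concatenation; the real LSTV module
    fuses the branches with attention before the music token decoder.
    """
    long_term = frame_feats[::long_stride].mean(axis=0)   # global context, (D,)
    short_term = frame_feats[-window:].mean(axis=0)       # local context, (D,)
    return np.concatenate([long_term, short_term])        # fused, (2*D,)

# Usage: 32 frames of 16-dim features yield a 32-dim fused vector.
feats = np.random.default_rng(1).standard_normal((32, 16))
fused = long_short_term_features(feats)
```

The fused representation then conditions the music token decoder, whose discrete tokens the audio decoder turns back into a waveform.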