A Multimodal Text-To-MIDI Transformer Model with Special Consideration To Artist Usage
Abstract
Generative AI systems are of great interest in the field of automated music production. Current state-of-the-art music generation systems such as MusicLM, while highly versatile and tractable, do not allow direct control over the tone or texture of the generated instruments except by altering the text prompt, which also changes the structure of the generated composition. Symbolic music generation therefore remains of great use to musicians.
Building on current advances in deep learning, an off-the-shelf MIDI dataset, and a pretrained English text encoder, this work proposes a novel sequence-to-sequence music generation model that converts written descriptions of the style and artistic themes of a song into coherent MIDI musical representations.
The architecture and the synthetic dataset used to train this model were constructed with special consideration to the needs of musicians: a native musical output format, no association between any particular artist and a musical style, and the possibility of live usage with human accompaniment.
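As context for the "native musical format" claim above, sequence-to-sequence symbolic music models commonly serialize MIDI into a flat stream of note and timing tokens that a transformer decoder can emit directly. The sketch below is purely illustrative: the event names, tick granularity, and function are assumptions for exposition, not the representation actually used in this thesis.

```python
# Illustrative event-based MIDI tokenization (hypothetical, not the
# thesis's actual vocabulary): notes become NOTE_ON / NOTE_OFF events
# separated by TIME_SHIFT tokens, a common seq2seq target format.

def encode_events(notes):
    """Convert (start_tick, end_tick, pitch) notes into a flat token list."""
    events = []
    for start, end, pitch in notes:
        events.append((start, f"NOTE_ON_{pitch}"))
        events.append((end, f"NOTE_OFF_{pitch}"))
    events.sort()  # interleave all events in chronological order

    tokens, clock = [], 0
    for tick, name in events:
        if tick > clock:
            # Advance the running clock with an explicit time-shift token.
            tokens.append(f"TIME_SHIFT_{tick - clock}")
            clock = tick
        tokens.append(name)
    return tokens

# Two overlapping notes: C4 (pitch 60) then E4 (pitch 64).
print(encode_events([(0, 4, 60), (2, 6, 64)]))
# → ['NOTE_ON_60', 'TIME_SHIFT_2', 'NOTE_ON_64', 'TIME_SHIFT_2',
#    'NOTE_OFF_60', 'TIME_SHIFT_2', 'NOTE_OFF_64']
```

A decoder trained over such a vocabulary outputs MIDI-convertible sequences natively, which is what distinguishes this setting from audio-domain systems like MusicLM.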
Table of Contents
Introduction -- Background and related work -- Methodology -- Results and discussion -- Conclusion
Degree
M.S. (Master of Science)