Meta has introduced Spirit LM, a groundbreaking language model that integrates spoken and written text, enhancing the capabilities of traditional models while addressing previous limitations.
Meta has unveiled a pioneering language model known as Spirit LM, designed to seamlessly integrate spoken and written text within a unified multimodal framework. This approach allows speech and text components to be mixed in a single model, addressing the limitations of previous solutions that relied on separate pipelines for processing speech and text inputs.
Spirit LM builds upon Llama 2, Meta’s 7-billion-parameter pre-trained language model, expanding its capabilities by incorporating speech. This integration is achieved through continued training on both textual and spoken language, combining text and speech sequences into a single token stream. Meta’s researchers employed a word-level interleaving method, using a curated parallel corpus of speech-text data, to refine the training process.
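To make the idea concrete, here is a minimal sketch of word-level interleaving between aligned text and speech tokens. The function names, token formats, and switching heuristic are all hypothetical; the actual pipeline relies on Meta’s tokenisers and aligned speech-text data.

```python
# Hypothetical sketch of word-level interleaving: build one token stream
# that alternates between text and speech spans, switching modality only
# at word boundaries, as described for Spirit LM's training.
import random

def interleave(words, speech_tokens_per_word, switch_prob=0.3, seed=0):
    rng = random.Random(seed)
    stream, in_speech = ["[TEXT]"], False
    for word, speech_toks in zip(words, speech_tokens_per_word):
        if rng.random() < switch_prob:  # flip modality at this word boundary
            in_speech = not in_speech
            stream.append("[SPEECH]" if in_speech else "[TEXT]")
        stream.extend(speech_toks if in_speech else [word])
    return stream

# Two aligned representations of the same utterance (toy token IDs).
words = ["the", "cat", "sat", "down"]
speech = [["hu_12", "hu_7"], ["hu_33"], ["hu_5", "hu_19"], ["hu_2"]]
print(interleave(words, speech))
```

The one constraint the sketch preserves is that modality changes only at word boundaries, never mid-word.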
The primary advantage of Spirit LM, as highlighted by its creators at Meta, lies in its ability to merge the semantic understanding typical of text-based language models with the expressive richness inherent in speech models. However, its performance on text-only inputs currently lags slightly behind that of the original Llama 2 model.
Traditional approaches to incorporating speech capability into language models involve a sequential pipeline: speech input is first transcribed via automatic speech recognition (ASR), then processed by the language model, and finally converted back to speech. This cascade limits the model’s capacity to generate expressive spoken output, because prosodic cues such as pitch and emphasis are discarded at the transcription step. In contrast, Spirit LM utilises a mixed training approach, handling text-only sequences, speech-only sequences, and hybrid interleaved sequences. Speech is converted into phonetic tokens derived from HuBERT, alongside additional tokens for pitch and style, allowing an interleaved training sequence that switches between text and speech at designated word boundaries.
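As a rough illustration of what an expressive speech span might look like once flattened into discrete tokens, the sketch below mixes HuBERT-style phonetic units with sparser pitch tokens and a single style token. The token names and sampling ratio are invented for illustration and do not reflect Spirit LM’s actual encoding.

```python
# Illustrative only: flatten one speech span into discrete tokens, mixing
# HuBERT phonetic units with sparser pitch tokens and one style token.
# Token formats and rates are hypothetical, not Spirit LM's real scheme.
def encode_speech_span(hubert_units, pitch_units, style_unit):
    tokens = [f"[Style{style_unit}]"]  # one style token per span
    for i, hu in enumerate(hubert_units):
        if i % 4 == 0 and i // 4 < len(pitch_units):
            tokens.append(f"[Pitch{pitch_units[i // 4]}]")  # pitch sampled less often
        tokens.append(f"[Hu{hu}]")
    return tokens

print(encode_speech_span([12, 7, 33, 5, 19, 2, 8, 40], [3, 6], 1))
```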
Meta also introduced the Speech-Text Sentiment Preservation benchmark to evaluate Spirit LM’s capability to maintain the sentiment expressed in text and speech prompts. This benchmark assesses whether the model can generate outputs that accurately reflect the sentiment—positive, negative, or neutral—of the given prompt.
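In outline, such an evaluation reduces to checking whether a sentiment classifier assigns the same label to the prompt and to the model’s continuation. The sketch below assumes a generic classify function as a placeholder; the real benchmark uses dedicated text and speech sentiment classifiers and is more involved.

```python
# Hedged sketch of a sentiment-preservation metric: a continuation
# "preserves" sentiment when a classifier gives it the same label
# (positive/negative/neutral) as the prompt. `classify` is a stand-in
# for a real sentiment classifier.
def preservation_rate(pairs, classify):
    hits = sum(classify(prompt) == classify(cont) for prompt, cont in pairs)
    return hits / len(pairs)

# Toy usage with a trivial keyword-based stand-in classifier.
def toy_classify(text):
    if "great" in text or "love" in text:
        return "positive"
    if "awful" in text or "hate" in text:
        return "negative"
    return "neutral"

pairs = [("I love this song", "It sounds great"),
         ("This is awful", "I feel neutral about it")]
print(preservation_rate(pairs, toy_classify))  # 0.5
```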
Despite its promising features, Spirit LM does face certain limitations. It does not yet match the text processing performance of Llama 2, a challenge that Meta aims to address through further training refinements and potentially adopting a larger base model. Additionally, Spirit LM currently lacks mechanisms to prevent misuse, such as generating misleading information or unauthorised impersonation, and its language capabilities are restricted to English, with no provisions for diverse accents and dialects.
For enthusiasts and developers interested in exploring Spirit LM, Meta has made the model available in two versions: a base version that uses phonetic speech units via HuBERT, and an expressive version that additionally incorporates pitch and style units. The models, along with their weights, are accessible on GitHub; however, they are restricted to non-commercial use under the present licensing terms.
Source: Noah Wire Services