Meta has introduced Spirit LM, a groundbreaking language model that integrates spoken and written text, enhancing the capabilities of traditional models while addressing previous limitations.
Meta has unveiled a pioneering language model known as Spirit LM, designed to seamlessly integrate spoken and written text within a unified multimodal framework. This approach allows speech and text components to be mixed in a single model, addressing the limitations of previous solutions that relied on separate pipelines for processing speech and text inputs.
Spirit LM builds upon Llama 2, Meta’s 7-billion-parameter pre-trained language model, expanding its capabilities by incorporating speech. This integration is achieved through continued training on both textual and spoken language, combining text and speech sequences into a single token stream. Meta’s researchers employed a word-level interleaving method, using a curated parallel corpus of speech-text data, to refine the training process.
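To make the idea concrete, here is a minimal sketch of word-level interleaving between aligned text and speech tokens. The function names, token formats, and switching heuristic are all hypothetical; the actual pipeline relies on Meta’s tokenisers and aligned speech-text data.

```python
# Hypothetical sketch of word-level interleaving: build one token stream
# that alternates between text and speech spans, switching modality only
# at word boundaries, as described for Spirit LM's training.
import random

def interleave(words, speech_tokens_per_word, switch_prob=0.3, seed=0):
    rng = random.Random(seed)
    stream, in_speech = ["[TEXT]"], False
    for word, speech_toks in zip(words, speech_tokens_per_word):
        if rng.random() < switch_prob:  # flip modality at this word boundary
            in_speech = not in_speech
            stream.append("[SPEECH]" if in_speech else "[TEXT]")
        stream.extend(speech_toks if in_speech else [word])
    return stream

# Two aligned representations of the same utterance (toy token IDs).
words = ["the", "cat", "sat", "down"]
speech = [["hu_12", "hu_7"], ["hu_33"], ["hu_5", "hu_19"], ["hu_2"]]
print(interleave(words, speech))
```

The one constraint the sketch preserves is that modality changes only at word boundaries, never mid-word.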
The primary advantage of Spirit LM, as highlighted by its creators at Meta, lies in its ability to merge the semantic understanding typical of text-based language models with the expressive richness inherent in speech models. However, its performance on text-only inputs currently lags slightly behind that of the original Llama 2 model.
Traditional approaches to incorporating speech capability into language models involve a sequential pipeline: speech input is first transcribed via automatic speech recognition (ASR), then processed by the language model, and finally converted back to speech. This cascade limits the model’s capacity to generate expressive spoken output, because prosodic cues such as pitch and emphasis are discarded at the transcription step. In contrast, Spirit LM utilises a mixed training approach, handling text-only sequences, speech-only sequences, and hybrid interleaved sequences. Speech is converted into phonetic tokens derived from HuBERT, alongside additional tokens for pitch and style, allowing an interleaved training sequence that switches between text and speech at designated word boundaries.
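As a rough illustration of what an expressive speech span might look like once flattened into discrete tokens, the sketch below mixes HuBERT-style phonetic units with sparser pitch tokens and a single style token. The token names and sampling ratio are invented for illustration and do not reflect Spirit LM’s actual encoding.

```python
# Illustrative only: flatten one speech span into discrete tokens, mixing
# HuBERT phonetic units with sparser pitch tokens and one style token.
# Token formats and rates are hypothetical, not Spirit LM's real scheme.
def encode_speech_span(hubert_units, pitch_units, style_unit):
    tokens = [f"[Style{style_unit}]"]  # one style token per span
    for i, hu in enumerate(hubert_units):
        if i % 4 == 0 and i // 4 < len(pitch_units):
            tokens.append(f"[Pitch{pitch_units[i // 4]}]")  # pitch sampled less often
        tokens.append(f"[Hu{hu}]")
    return tokens

print(encode_speech_span([12, 7, 33, 5, 19, 2, 8, 40], [3, 6], 1))
```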
Meta also introduced the Speech-Text Sentiment Preservation benchmark to evaluate Spirit LM’s capability to maintain the sentiment expressed in text and speech prompts. This benchmark assesses whether the model can generate outputs that accurately reflect the sentiment—positive, negative, or neutral—of the given prompt.
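In outline, such an evaluation reduces to checking whether a sentiment classifier assigns the same label to the prompt and to the model’s continuation. The sketch below assumes a generic classify function as a placeholder; the real benchmark uses dedicated text and speech sentiment classifiers and is more involved.

```python
# Hedged sketch of a sentiment-preservation metric: a continuation
# "preserves" sentiment when a classifier gives it the same label
# (positive/negative/neutral) as the prompt. `classify` is a stand-in
# for a real sentiment classifier.
def preservation_rate(pairs, classify):
    hits = sum(classify(prompt) == classify(cont) for prompt, cont in pairs)
    return hits / len(pairs)

# Toy usage with a trivial keyword-based stand-in classifier.
def toy_classify(text):
    if "great" in text or "love" in text:
        return "positive"
    if "awful" in text or "hate" in text:
        return "negative"
    return "neutral"

pairs = [("I love this song", "It sounds great"),
         ("This is awful", "I feel neutral about it")]
print(preservation_rate(pairs, toy_classify))  # 0.5
```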
Despite its promising features, Spirit LM does face certain limitations. It does not yet match the text processing performance of Llama 2, a challenge that Meta aims to address through further training refinements and potentially adopting a larger base model. Additionally, Spirit LM currently lacks mechanisms to prevent misuse, such as generating misleading information or unauthorised impersonation, and its language capabilities are restricted to English, with no provisions for diverse accents and dialects.
For enthusiasts and developers interested in exploring Spirit LM, Meta has made the model available in two versions: a base version that uses phonetic speech units via HuBERT, and an expressive version that additionally incorporates pitch and style units. The models, along with their weights, are accessible on GitHub; however, they are restricted to non-commercial use under the present licensing terms.
Source: Noah Wire Services