DBRX, Grok, Mixtral: Mixture-of-Experts is a trending architecture for LLMs

Mixture-of-Experts (MoE) architecture has become a major direction in the development of large language models (LLMs), offering a flexible way to tackle computational challenges. Models that use the MoE technique, such as DBRX, improve performance by activating only a relevant subset of 'experts' for each input. This reduces computational cost and scales model capacity without proportionately increasing resource demands. The recent introduction of models such as Databricks' DBRX, Grok-1 by xAI, and Mixtral 8x7B by Mistral AI marks a significant trend toward the adoption of MoE architecture in open-source LLM development, making it a focal point for researchers and practitioners alike.

The adoption of MoE models, including DBRX, is paving the way for advancements in efficient LLM training, addressing critical aspects like FLOP efficiency per parameter and lower latency. Such models have become instrumental in applications requiring retrieval-augmented generation (RAG) and autonomous agents, thanks to their cost-effective training methods and improved generalization capabilities. With a focus on scalable, high-performing, and efficient LLMs, this article explores the intricacies of MoE architecture, highlighting how pioneering open implementations by Databricks and others are setting new benchmarks in the field.

The Rise of Mixture-of-Experts in LLMs

Mixture-of-Experts (MoE) dates back to the early 1990s, a pivotal moment in neural network design. The architecture, introduced by Jacobs et al.[1], combines multiple "expert" networks, each specializing in distinct subsets of the input data, with a gating mechanism that directs each input to the most relevant expert(s). Applied to LLMs, this approach increases model capacity while significantly reducing the computation needed per input.

Key Features of MoE Models:
- Scalability: Because only a few experts run for each token, the inference cost per token stays roughly constant as more experts are added. Model size can therefore grow without the proportional increase in compute typically seen in dense models.
- Efficiency: These models offer high FLOP efficiency per parameter, making them well suited to scenarios with fixed computational budgets. This efficiency enables the processing of more tokens within the same time or compute constraints.
- Challenges and Solutions:
  - Training Stability and Overfitting: MoE models are more susceptible to training instabilities and tend to overfit, especially with smaller datasets. Strategies like careful regularization and dataset augmentation are vital.
  - Load Balancing and Communication Overhead: Ensuring even distribution of workload among experts and managing communication overhead in distributed setups are critical for optimal performance.
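One common way to encourage even expert utilization is an auxiliary load-balancing loss added to the training objective. The sketch below follows the form popularized by the Switch Transformer (number of experts times the sum over experts of token fraction times mean router probability). It is an illustration only, not the method used by any model discussed in this article; the function name and inputs are hypothetical.

```python
from collections import Counter


def load_balancing_loss(router_probs, top1_choices, num_experts):
    """Auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_probs: one probability list per token (length num_experts each).
    top1_choices: index of the expert each token was sent to.
    f_i is the fraction of tokens sent to expert i, and P_i is the mean
    router probability given to expert i. Perfectly uniform routing
    gives 1.0; routing that collapses onto one expert gives a larger value.
    """
    num_tokens = len(top1_choices)
    counts = Counter(top1_choices)
    total = 0.0
    for i in range(num_experts):
        f_i = counts.get(i, 0) / num_tokens
        p_i = sum(probs[i] for probs in router_probs) / num_tokens
        total += f_i * p_i
    return num_experts * total


# Four tokens, two experts: balanced routing vs. collapsed routing.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], num_experts=2)
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], num_experts=2)
print(balanced, collapsed)
```

In practice a term like this is multiplied by a small coefficient so that it nudges the router toward balance without dominating the language-modeling loss.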

MoE's application in LLMs, such as DBRX and Mixtral 8x7B, demonstrates its capability to handle complex and diverse datasets with high efficiency. By dynamically allocating tasks to specialized experts, MoE models achieve nuanced understanding and high-performance standards, setting a new benchmark in the field of AI and opening avenues for further exploration in various domains.

Inside the Architecture: Understanding MoE

Applying the Mixture-of-Experts (MoE) architecture to transformers involves a significant architectural shift, particularly in how dense feedforward neural network (FFN) layers are reimagined. Here’s a closer look at this transformative process:

Replacement of Dense FFN Layers:
- Traditional Architecture: Dense FFN layers, where every parameter of each layer participates in the computation for every input token.
- MoE Architecture: Sparse MoE layers replace dense FFNs. Each MoE layer houses multiple expert FFNs and a gating mechanism (router), fundamentally altering the network's computation strategy.

Operational Dynamics:
- Gating Mechanism: Acts as a traffic director, routing each input token to the most relevant subset of experts.
- Selective Activation: Only a small group of experts is activated for a given token, which saves compute.

Scalability and Efficiency:
- Per-token compute in an MoE model depends on the number of active experts rather than on the total number of experts, a stark contrast to dense models where cost escalates with size. This trait is particularly valuable in compute-constrained deployment scenarios, since larger models can be trained and deployed without proportional increases in computational demands, although all experts must still be held in memory.
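The gating and selective activation described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation of DBRX, Grok-1, or Mixtral: the class name `ToyMoELayer` is hypothetical, each "expert" is a single linear map with ReLU rather than a full FFN, and the gate weights are a softmax over the top-k router logits, which is one common choice.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyMoELayer:
    """Hypothetical top-k MoE layer operating on one token vector."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight row per expert, producing one logit per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: simplified to one dim x dim linear map each.
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def route(self, token):
        """Return the indices of the top-k experts and their gate weights."""
        logits = matvec(self.router, token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[:self.top_k]
        gates = softmax([logits[i] for i in chosen])
        return chosen, gates

    def forward(self, token):
        """Run only the chosen experts and mix their outputs by gate weight."""
        chosen, gates = self.route(token)
        output = [0.0] * len(token)
        for index, gate in zip(chosen, gates):
            hidden = [max(0.0, h) for h in matvec(self.experts[index], token)]
            output = [o + gate * h for o, h in zip(output, hidden)]
        return output, chosen


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, chosen = layer.forward([0.5, -1.0, 0.25, 2.0])
print("experts used:", chosen)
```

Only two of the eight expert matrices are touched for this token, which is why per-token compute tracks the number of active experts rather than the total.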

The shift to MoE architecture, as seen in models like DBRX, Grok-1, and Mixtral 8x7B, represents a new trend in developing large, efficient LLMs. By partitioning tasks among specialized experts, MoE models offer a refined approach to handling complex, high-dimensional tasks, setting the stage for more sophisticated and capable AI systems.


The Real Example of MoE Performance

You can explore the capabilities of the MoE architecture yourself. Below is an example of a text generation task performed by the MoE model Mixtral 8x7B Instruct through the AI/ML API:

import time

import openai

# Replace "***" with your own AI/ML API key.
client = openai.OpenAI(
    api_key="***",
    base_url="https://api.aimlapi.com",
)


def get_chat_completion(messages, max_tokens=2500, model="mistralai/Mixtral-8x7B-Instruct-v0.1"):
    """Request a single chat completion from the given model."""
    return client.chat.completions.create(
        messages=messages,
        model=model,
        max_tokens=max_tokens,
        top_p=1,
        n=1,  # only the first choice is printed, so one completion is enough
        temperature=0.7,
    )


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "Assist in writing an article on a given topic. Write a detailed text with examples and reasoning."},
        {"role": "user", "content": "I need an article about the impact of AI on the World Wide Web."},
    ]
    # Time only the API call, not the printing of the result.
    start = time.perf_counter()
    chat_completion = get_chat_completion(messages)
    elapsed = time.perf_counter() - start
    print(chat_completion.choices[0].message.content)
    print(f"Elapsed time (sec): {elapsed:.2f}")

You can replace the model ID mistralai/Mixtral-8x7B-Instruct-v0.1 with another supported model, for example meta-llama/Llama-2-70b-chat-hf, and vary the prompt to compare MoE performance with other models. Two things you will likely notice are Mixtral's fast inference and accurate instruction following, both of which benefit from the computationally efficient MoE architecture and its per-token selection of experts.


DBRX: A New Benchmark in LLM Efficiency

DBRX, developed by Databricks, sets a new reference point among LLMs such as GPT-3.5, Gemini 1.0 Pro, CodeLLaMA-70B, and Grok-1, pushing the frontiers of efficiency and performance. This open LLM distinguishes itself through several key features:

Performance Benchmarks:
- Outperforms GPT-3.5 and rivals Gemini 1.0 Pro in standard benchmarks.
- Demonstrates superior capabilities in coding tasks, surpassing CodeLLaMA-70B.

Efficiency and Size:
- Achieves up to double the inference speed of LLaMA2-70B.
- Maintains a compact size, with both total and active parameter counts being about 40% of those of Grok-1.

Generative Speed and Training Data:
- When served with Mosaic AI Model Serving, it achieves a generation speed of up to 150 tokens per second per user.
- Pre-trained on a massive corpus of 12T tokens of text and code data, supporting a maximum context length of 32k tokens.

DBRX's standing on the Open LLM leaderboard is noteworthy: it outperforms models like Mixtral Instruct and Grok-1 in the majority of benchmarks. Its license is designed to encourage wide usage while imposing restrictions on very large user bases (more than 700 million monthly active users). Positioned as twice as compute-efficient as leading LLMs, DBRX not only sets a new standard for open-source models but also paves the way for customizable, transparent generative AI across various enterprises. Its availability across major cloud platforms and its expected integration into NVIDIA's ecosystem further underscore its accessibility and potential for widespread adoption.

Grok: The first open MoE model of 300B+ size

Grok-1 by xAI stands as a pioneering implementation of the Mixture-of-Experts (MoE) architecture in the realm of large-scale LLMs. This transformer-based model has 314 billion parameters, but only about 86 billion of them (roughly a quarter) are active for any given token. This selective activation significantly reduces computational demands while maintaining high performance.
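To make the savings concrete, the figures quoted above can be turned into a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass; that rule, and the function name, are assumptions for illustration and ignore attention-score computation and memory traffic.

```python
TOTAL_PARAMS = 314e9   # total parameters, as quoted in the text
ACTIVE_PARAMS = 86e9   # parameters active per token, as quoted in the text


def forward_flops_per_token(params):
    """Rough forward-pass cost: ~2 FLOPs per parameter per token (assumed)."""
    return 2 * params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"Active fraction: {active_fraction:.1%}")
print(f"MoE forward pass: ~{sparse_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense equivalent: ~{dense_flops / 1e9:.0f} GFLOPs per token")
print(f"Estimated compute saving: {dense_flops / sparse_flops:.2f}x")
```

With these rounded figures the active share comes out slightly above a quarter, and a hypothetical dense model of the same total size would need several times more compute per token.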

Key Attributes of Grok-1:

- Architecture: Mixture-of-8-Experts, with each token processed by two experts during inference.
- Training: Developed from scratch using a custom stack based on JAX and Rust, without fine-tuning for specific applications.
- Accessibility: Available under the Apache 2.0 license for broad usage, including commercial applications.

Grok-1's technical specifications are impressive, with 64 transformer layers, 6,144-dimensional embeddings, and the ability to process sequences up to 8,192 tokens long. Despite its large size and the substantial computational resources required (e.g., 8x A100 GPUs), Grok-1's design facilitates efficient computation, employing bfloat16 precision. Another notable technical detail is the use of rotary positional embeddings to further enhance the model’s capability to manage extensive data sequences efficiently. This model exemplifies the new trend in open-source LLM development, emphasizing the importance of MoE architecture for achieving both scale and efficiency in AI models.

Mixtral: Fine-Grained MoE for Enhanced Performance

Mixtral 8x7B, developed by Mistral AI, represents a significant advancement in the mixture-of-experts (MoE) architecture, showcasing the power of fine-grained MoE for enhanced performance in large language models (LLMs).

Configuration:
- Uses eight expert FFNs in each MoE layer. Because attention and embedding weights are shared across experts, the model has about 46.7 billion parameters in total rather than 8 x 7 = 56 billion.
- During inference, only two experts are activated per token, so roughly 12.9 billion parameters are used per token, which keeps computational cost low.

Performance:
- Outperforms Llama 2 70B on most benchmarks.
- Offers up to six times faster inference than Llama 2 70B, according to Mistral AI.

Multilingual Support and Context Handling:
- Supports multiple languages including English, French, Italian, German, and Spanish.
- Can process up to 32,000 tokens, approximately 50 pages of text, showcasing its robustness in handling extensive data sequences.
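The "8x7B" name suggests 56 billion parameters, but the experts are only the feedforward blocks, while attention and embeddings are shared. The following sketch estimates the total and per-token active parameter counts. The configuration values are not given in this article; they are assumptions taken from the publicly released Mixtral 8x7B model configuration, and the helper names are hypothetical.

```python
# Assumed values from the publicly released Mixtral 8x7B configuration.
HIDDEN = 4096            # model (embedding) dimension
FFN_HIDDEN = 14336       # inner dimension of each expert FFN
LAYERS = 32
HEADS = 32
KV_HEADS = 8             # grouped-query attention
VOCAB = 32000
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

HEAD_DIM = HIDDEN // HEADS


def attention_params():
    """Query and output projections are full size; key/value use fewer heads."""
    q_and_o = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    return q_and_o + k_and_v


def expert_params():
    """Gated FFN with three weight matrices (gate, up, down)."""
    return 3 * HIDDEN * FFN_HIDDEN


def layer_params(experts_counted):
    router = HIDDEN * NUM_EXPERTS
    norms = 2 * HIDDEN
    return attention_params() + experts_counted * expert_params() + router + norms


def model_params(experts_counted):
    embeddings = 2 * VOCAB * HIDDEN  # input embedding plus untied output projection
    final_norm = HIDDEN
    return LAYERS * layer_params(experts_counted) + embeddings + final_norm


total = model_params(NUM_EXPERTS)
active = model_params(EXPERTS_PER_TOKEN)
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Almost all of the parameters sit in the expert FFNs, which is why activating two of eight experts cuts per-token compute to well under a third of the total.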

An easy way to try out the capabilities of the model is to sign up for access to the AI/ML API.

Mixtral 8x7B not only excels in general benchmarks, outperforming Llama 2 70B in areas like commonsense reasoning, world knowledge, and code but also demonstrates remarkable proficiency in multilingual benchmarks. This proficiency is particularly notable in French, German, Spanish, and Italian, where it significantly outperforms Llama 2 70B. Additionally, Mixtral's approach to bias and sentiment, as evidenced in the BBQ and BOLD benchmarks, shows less bias and more positive sentiment compared to its counterparts. This combination of efficiency, performance, and ethical considerations positions Mixtral 8x7B as a model of choice for developers and researchers seeking scalable, high-performance, and ethically conscious LLM solutions.

Future Trends and Directions in MoE LLMs

Exploring the horizon of LLMs reveals a shift towards a more nuanced architecture, Mixture of Tokens (MoT), which aims to address the challenges faced by MoE. By blending different token representations, the MoT technique paves the way for richer data understanding in NLP tasks. Its potential lies in:

- Enhanced Scalability and Efficiency: MoTs tackle MoE's limitations like training instability and load imbalance head-on, offering a scalable solution without the computational heft.
- Performance and Training Efficiency: By mixing tokens from various examples before presenting them to experts, MoTs not only boost model performance but also streamline the training process.
- Parameter Reduction: A notable achievement is the drastic cut in parameters, showcasing MoT's capability to deliver high-performing models with fewer resources.
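The idea of mixing tokens from several examples before an expert sees them can be sketched roughly as follows. This is a loose illustration, not a reference implementation: the controller vector, the softmax weighting, and the rule for redistributing the expert output back to tokens are all assumptions, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_tokens_step(group, controller, expert):
    """Mix a group of tokens, run one expert once, redistribute the result.

    group: token vectors taken from different examples.
    controller: weight vector that scores each token (assumed form).
    expert: any callable mapping a vector to a vector of the same size.
    """
    scores = [sum(c * x for c, x in zip(controller, token)) for token in group]
    weights = softmax(scores)  # continuous mixing weights, no hard routing
    dim = len(group[0])
    mixed = [sum(w * token[d] for w, token in zip(weights, group))
             for d in range(dim)]
    processed = expert(mixed)
    # Each token receives the expert output scaled by its own mixing weight.
    outputs = [[w * value for value in processed] for w in weights]
    return outputs, weights


group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = mixture_of_tokens_step(
    group,
    controller=[0.5, -0.5],
    expert=lambda vector: [2.0 * v for v in vector],
)
print(weights)
```

Because every token contributes to every mix through a smooth weight, there is no discrete top-k choice to balance, which is the intuition behind the stability and load-balancing claims above.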

Models like GLaM by Google and initiatives by Cohere AI underscore the industry's move towards refining MoE architectures and exploring alternatives such as MoT. These advancements hint at a future where LLMs achieve greater efficiency and specialization, making them more accessible and effective across a wider range of applications. The journey from MoE to MoT is an attempt to overcome existing barriers and make AI more adaptable, efficient, and powerful.


[1] https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf

Written by Ruslanthedev
