Time-MoE: Mixture-of-Experts Time-Series Foundation Model

Time-MoE (Mixture-of-Experts Time-Series Foundation Model) · Also known as: Time Mixture-of-Experts, Time-MoE Foundation Model, Sparse Time-Series Transformer, Zaman Karışık Uzmanlar Modeli

Time-MoE is a billion-scale autoregressive foundation model for universal time-series forecasting, introduced by Shi et al. in 2024 and accepted at ICLR 2025. It combines a decoder-only transformer architecture with sparse Mixture-of-Experts (MoE) feed-forward layers, so that only a small subset of expert networks is activated for each token. This lets the model scale to billions of parameters without a proportional increase in per-token compute.

When to use it

Time-MoE is appropriate when you need a general-purpose forecasting foundation model that generalizes across domains without task-specific training. It is particularly well suited to zero-shot or few-shot forecasting on univariate series, to large-scale deployments where inference efficiency matters, and to settings where little historical data is available for training a dedicated model. It assumes well-behaved scalar time series and is not designed to model cross-series dependencies in multivariate data out of the box. Alternatives for similar zero-shot settings include TimesFM, Chronos, and Moirai.

Strengths & limitations

Strengths

Scales to billions of parameters with sublinear compute growth via sparse expert activation
Strong zero-shot generalization across diverse time-series domains due to large-scale pretraining
Multi-resolution forecasting heads allow simultaneous short- and long-horizon predictions from a single model
Decoder-only autoregressive design enables flexible context length without architectural changes
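
The first strength can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts the weights in one MoE feed-forward layer and compares the total with the number actually used per token. All sizes (`d_model`, `d_ff`, `num_experts`, `top_k`) are hypothetical values chosen for illustration, not Time-MoE's published configuration, and biases are ignored.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights of one two-layer feed-forward expert (biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) weight counts for one sparse MoE layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts          # linear router: one logit per expert
    total = num_experts * expert + router   # everything that must be stored
    active = top_k * expert + router        # what a single token actually uses
    return total, active


# Hypothetical sizes, for illustration only.
total, active = moe_layer_params(d_model=512, d_ff=2048, num_experts=8, top_k=2)
print(f"total weights:  {total:,}")
print(f"active weights: {active:,}")
print(f"active share:   {active / total:.1%}")
```

With `top_k` held fixed, adding experts increases the stored weights but leaves the expert compute per token unchanged; only the small router term grows.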

Limitations

Primarily designed for univariate forecasting; multivariate cross-series relationships require adaptation
Large model size demands substantial memory and infrastructure for deployment
Autoregressive decoding accumulates prediction errors over long horizons
Pretraining data distribution may not cover highly specialized or rare time-series domains
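
The error-accumulation limitation can be illustrated with a toy rollout. This is a minimal sketch, not Time-MoE itself: `true_step` and `biased_step` are hypothetical one-step predictors, and the only point is that feeding each prediction back in as input lets a small per-step bias compound as the horizon grows.

```python
def rollout(step_fn, context, horizon):
    """Autoregressive decoding: each prediction is appended to the history."""
    history = list(context)
    preds = []
    for _ in range(horizon):
        nxt = step_fn(history)
        preds.append(nxt)
        history.append(nxt)  # the model now conditions on its own output
    return preds


def true_step(history):
    """Hypothetical ground-truth process: a linear trend with slope 1."""
    return history[-1] + 1.0


def biased_step(history):
    """Hypothetical one-step model that slightly overestimates the slope."""
    return history[-1] + 1.1


context = [0.0, 1.0, 2.0]
truth = rollout(true_step, context, 10)
preds = rollout(biased_step, context, 10)
errors = [abs(p - t) for p, t in zip(preds, truth)]

for step, err in enumerate(errors, start=1):
    print(f"horizon {step:2d}: abs error {err:.2f}")
```

In this toy case the absolute error grows by the same amount at every step, so the error at horizon 10 is ten times the one-step error.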

Frequently asked

How does Time-MoE differ from a standard transformer forecasting model?

Standard transformer forecasters use dense feed-forward layers in which every parameter is activated for every token. Time-MoE replaces these with sparse MoE blocks that activate only a small subset of expert networks per token. The total parameter count, and with it model capacity, can therefore grow substantially while per-token compute stays roughly constant, which makes billion-parameter models feasible at manageable inference cost.
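
A generic top-k routing step can be sketched as follows. This is an illustration of sparse expert selection in general, under stated assumptions: the softmax router, the renormalized gates, and the toy scaling experts are all hypothetical, and the exact gating used in Time-MoE (for example any shared expert or load-balancing term) is described in the paper and not reproduced here.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sparse_moe(token, router_weights, experts, top_k):
    """Route one token vector to its top_k experts and mix their outputs."""
    # One router logit per expert: dot product of a router row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize gates over chosen experts
    out = [0.0] * len(token)
    for i in chosen:                      # only the chosen experts are evaluated
        gate = probs[i] / norm
        out = [o + gate * v for o, v in zip(out, experts[i](token))]
    return out, chosen


calls = []  # records which experts actually ran


def make_expert(index, scale):
    """Toy expert: scales the token and logs that it was called."""
    def expert(vec):
        calls.append(index)
        return [scale * x for x in vec]
    return expert


experts = [make_expert(i, float(i + 1)) for i in range(4)]
router_weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.1], [0.0, 1.0]]
token = [1.0, 2.0]

out, chosen = sparse_moe(token, router_weights, experts, top_k=2)
print("chosen experts:", chosen)
print("experts executed:", sorted(calls))
print("output:", out)
```

Four experts are stored, but only two are executed for this token, which is the source of the gap between total and active parameters.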

Can Time-MoE handle multivariate time series?

Time-MoE is primarily architected for univariate forecasting, processing each series independently. While it can be applied channel-independently to multivariate data (treating each variable as a separate series), it does not natively model cross-variable dependencies. Tasks requiring explicit inter-series correlation modeling may benefit from purpose-built multivariate models instead.
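
Channel-independent application amounts to a loop over variables. In the sketch below, `naive_last_value` is a hypothetical stand-in for any univariate forecaster (it is not Time-MoE's API), and `forecast_channels` is an illustrative helper that calls it once per channel with no information shared across channels.

```python
from typing import Callable, Dict, List, Sequence

# Any function mapping (history, horizon) to a list of forecasts.
Forecaster = Callable[[Sequence[float], int], List[float]]


def naive_last_value(series: Sequence[float], horizon: int) -> List[float]:
    """Placeholder univariate model: repeat the last observed value."""
    return [series[-1]] * horizon


def forecast_channels(
    channels: Dict[str, Sequence[float]],
    forecaster: Forecaster,
    horizon: int,
) -> Dict[str, List[float]]:
    """Forecast each variable separately; cross-variable effects are ignored."""
    return {name: forecaster(values, horizon) for name, values in channels.items()}


# Hypothetical two-variable dataset.
data = {
    "load": [1.0, 2.0, 3.0],
    "temp": [18.0, 19.0, 20.0],
}
result = forecast_channels(data, naive_last_value, horizon=2)
print(result)
```

Because each call sees only one channel, a relationship such as temperature driving load cannot influence the forecast, which is the gap a purpose-built multivariate model would address.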

What is Time-300B and why does it matter?

Time-300B is the large-scale pretraining corpus introduced alongside Time-MoE. It comprises more than 300 billion time points spanning multiple domains, including energy, finance, weather, and transportation. Pretraining on this diverse corpus enables Time-MoE to learn general temporal representations that transfer zero-shot to unseen datasets, much as web-scale text corpora underpin the generalization of large language models.

Sources

Shi, X., Wang, S., Nie, Y., Li, D., Ye, Z., Wen, Q., & Jin, M. (2024). Time-MoE: Billion-scale time series foundation models with mixture of experts. ICLR 2025.

How to cite this page

ScholarGate. (2026, June 2). Time-MoE (Mixture-of-Experts Time-Series Foundation Model). ScholarGate. https://scholargate.app/en/deep-learning/time-moe
