Tree-based Methods for Clustering Time Series Using Domain-Relevant Attributes



We propose and compare two new methods for clustering time series that capture temporal information (trend, seasonality, and autocorrelation) together with domain-relevant cross-sectional attributes. The methods are based on model-based partitioning (MOB) trees and can serve as an automated yet transparent tool for clustering a large collection of time series. Our approach addresses the challenge of using common time series models within the MOB framework by relying on computationally advantageous ordinary least squares (OLS) estimation. The single-step method clusters series by trend, seasonality, time series lags, and domain-relevant cross-sectional attributes, using a single linear regression model. The two-step method first clusters by trend, seasonality, and domain-relevant cross-sectional attributes, and then further clusters the residual series by autocorrelation and the domain-relevant cross-sectional attributes. Both methods produce clusters that are interpretable by domain experts. We illustrate the usefulness of the proposed clustering approach through one-step-ahead forecasting. Using a large set of Wikipedia article pageview time series, we compare our approach empirically with forecasting each series using an autoregressive integrated moving average (ARIMA) model. Our results show that the tree-based approach produces forecasts that are practically on par with those of ARIMA models, yet is considerably faster and more efficient, making it suitable for scaling to large collections of time series. Moreover, our method produces simple parametric forecasting models for interpretable clusters of time series, an interpretability that per-series ARIMA models do not provide.
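
To make the regression component concrete, the following is a minimal sketch of the kind of OLS model the single-step method could fit within one cluster: an intercept, a linear trend, seasonal dummies, and lagged values, followed by a one-step-ahead forecast. It is an illustration under stated assumptions, not the implementation used in this work. The MOB partitioning on cross-sectional attributes is not shown, the function names (build_design, fit_ols, forecast_next) are hypothetical, and the weekly period and single lag are assumed defaults.

```python
import numpy as np

def build_design(y, period=7, n_lags=1):
    """Build regressors for t = n_lags .. len(y)-1.

    Columns: intercept, linear trend, seasonal dummies, lagged values.
    period=7 (weekly seasonality) and n_lags=1 are illustrative assumptions.
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(n_lags, len(y))
    # Seasonal dummies; the first season is the baseline to avoid collinearity
    season = np.eye(period)[t % period][:, 1:]
    lags = np.column_stack([y[t - k] for k in range(1, n_lags + 1)])
    X = np.column_stack([np.ones(len(t)), t, season, lags])
    return X, y[t]

def fit_ols(y, period=7, n_lags=1):
    """Estimate the coefficients by ordinary least squares."""
    X, target = build_design(y, period, n_lags)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta

def forecast_next(y, beta, period=7, n_lags=1):
    """One-step-ahead forecast for time index len(y)."""
    y = np.asarray(y, dtype=float)
    t_next = len(y)
    season = np.eye(period)[t_next % period][1:]
    lags = np.array([y[t_next - k] for k in range(1, n_lags + 1)])
    x_next = np.concatenate([[1.0, t_next], season, lags])
    return float(x_next @ beta)

# Usage on a synthetic series (made-up data, for illustration only)
rng = np.random.default_rng(0)
time_index = np.arange(120)
series = 50 + 0.2 * time_index + 5 * np.sin(2 * np.pi * time_index / 7)
series = series + rng.normal(0, 1, size=120)
coefficients = fit_ols(series)
print(forecast_next(series, coefficients))
```

Because every term enters linearly, the fit reduces to a single least-squares solve, which is the computational advantage over iteratively estimated ARIMA models that the comparison above relies on.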