2.1.0
User Documentation for Apache MADlib

Given a time series of data X, the Autoregressive Integrated Moving Average (ARIMA) model is a tool for understanding and, perhaps, predicting future values in the series. The model consists of three parts, an autoregressive (AR) part, a moving average (MA) part, and an integrated (I) part where an initial differencing step can be applied to remove any non-stationarity in the signal. The model is generally referred to as an ARIMA(p, d, q) model where parameters p, d, and q are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

Training Function

The ARIMA training function has the following syntax.

arima_train( input_table,
       output_table,
       timestamp_column,
       timeseries_column,
       grouping_columns,
       include_mean,
       non_seasonal_orders,
       optimizer_params
     )

Arguments

input_table

TEXT. The name of the table containing time series data.

output_table

TEXT. The name of the table to store the ARIMA model. Three tables are created, with names based on the value of the output_table argument in the training function:

  1. output_table: Table containing the ARIMA model. Contains the following columns:
    mean Model mean (only if 'include_mean' is TRUE)
    mean_std_error Standard errors for mean
    ar_params Auto-regressions parameters of the ARIMA model
    ar_std_errors Standard errors for AR parameters
    ma_params Moving average parameters of the ARIMA model
    ma_std_errors Standard errors for MA parameters
  2. output_table_summary: Table containing descriptive statistics of the ARIMA model. Contains the following columns:
    input_table Table name with the source data
    timestamp_col Column name in the source table that contains the timestamp index to data
    timeseries_col Column name in the source table that contains the data values
    non_seasonal_orders Orders of the non-seasonal ARIMA model
    include_mean TRUE if intercept was included in ARIMA model
    residual_variance Variance of the residuals
    log_likelihood Log likelihood value (when using MLE)
    iter_num The number of iterations executed
    exec_time Total time taken to train the model
  3. output_table_residual: Table containing the residuals for each data point in 'input_table'. Contains the following columns:
    timestamp_col Same as the 'timestamp_col' parameter (all indices from source table included except the first d elements, where d is the differencing order value from 'non_seasonal_orders')
    residual Residual value for each data point

timestamp_column

TEXT. The name of the column containing the timestamp (index) data. This could be a serial index (INTEGER) or date/time value (TIMESTAMP).

timeseries_column

TEXT. The name of the column containing the time series data. This data is currently restricted to DOUBLE PRECISION.

grouping_columns (not currently implemented)

TEXT, default: NULL.

A comma-separated list of column names used to group the input dataset into discrete groups, training one ARIMA model per group. It is similar to the SQL GROUP BY clause. When this value is null, no grouping is used and a single result model is generated.

Note
Grouping is not currently implemented for ARIMA, but will be added in the future. Any non-NULL value for this parameter is ignored.
include_mean (optional)

BOOLEAN, default: FALSE. Mean value of the data series is added in the ARIMA model if this variable is True.

non_seasonal_orders (optional)

INTEGER[], default: 'ARRAY[1,1,1]'. Orders of the ARIMA model. The orders are [p, d, q], where parameters p, d, and q are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

optimizer_params (optional)
TEXT. Comma-separated list of optimizer-specific parameters of the form ‘name=value'. The order of the parameters does not matter. The following parameters are recognized:
  • max_iter: Maximum number of iterations to run learning algorithm (Default = 100)
  • tau: Computes the initial step size for gradient algorithm (Default = 0.001)
  • e1: Algorithm-specific threshold for convergence (Default = 1e-15)
  • e2: Algorithm-specific threshold for convergence (Default = 1e-15)
  • e3: Algorithm-specific threshold for convergence (Default = 1e-15)
  • hessian_delta: Delta parameter to compute a numerical approximation of the Hessian matrix (Default = 1e-6)

Forecasting Function

The ARIMA forecast function has the following syntax.

arima_forecast( model_table,
                output_table,
                steps_ahead
              )

Arguments

model_table

TEXT. The name of the table containing the ARIMA model trained on the time series dataset.

output_table

TEXT. The name of the table to store the forecasted values. The output table produced by the forecast function contains the following columns.

group_by_cols Grouping column values (if grouping parameter is provided)
step_ahead Time step for the forecast
forecast_value Forecast of the current time step

steps_ahead
INTEGER. The number of steps to forecast at the end of the time series.

Examples
  1. View online help for the ARIMA training function.
    SELECT madlib.arima_train();
    
  2. Create an input data set.
    DROP TABLE IF EXISTS arima_beer;
    CREATE TABLE arima_beer (time_id integer NOT NULL, value double precision NOT NULL );
    COPY arima_beer (time_id, value) FROM stdin WITH DELIMITER '|';
    1  | 93.2
    2  | 96.0
    3  | 95.2
    4  | 77.0
    5  | 70.9
    6  | 64.7
    7  | 70.0
    8  | 77.2
    9  | 79.5
    10 | 100.5
    11 | 100.7
    12 | 107.0
    13 | 95.9
    14 | 82.7
    15 | 83.2
    16 | 80.0
    17 | 80.4
    18 | 67.5
    19 | 75.7
    20 | 71.0
    21 | 89.2
    22 | 101.0
    23 | 105.2
    24 | 114.0
    25 | 96.2
    26 | 84.4
    27 | 91.2
    28 | 81.9
    29 | 80.5
    30 | 70.4
    31 | 74.7
    32 | 75.9
    33 | 86.2
    34 | 98.7
    35 | 100.9
    36 | 113.7
    37 | 89.7
    38 | 84.4
    39 | 87.2
    40 | 85.5
    \.
    
  3. Train an ARIMA model.
    -- Train ARIMA model with 'grouping_columns'=NULL, 'include_mean'=TRUE,
    --   and 'non_seasonal_orders'=[1,1,1]
    SELECT madlib.arima_train( 'arima_beer',
                               'arima_beer_output',
                               'time_id',
                               'value',
                               NULL,
                               FALSE,
                               ARRAY[1, 1, 1]
                             );
    
  4. Examine the ARIMA model.
    \x ON
    SELECT * FROM arima_beer_output;
    
    Result:
    -[ RECORD 1 ]-+------------------
    ar_params     | {0.221954769696}
    ar_std_errors | {0.575367782602}
    ma_params     | {-0.140623564576}
    ma_std_errors | {0.533445214346}
    
  5. View the summary statistics table.
    SELECT * FROM arima_beer_output_summary;
    
    Result:
    -[ RECORD 1 ]-------+---------------
    input_table         | arima_beer
    timestamp_col       | time_id
    timeseries_col      | value
    non_seasonal_orders | {1,1,1}
    include_mean        | f
    residual_variance   | 100.989970539
    log_likelihood      | -145.331516396
    iter_num            | 28
    exec_time (s)       | 2.75
    
  6. View the residuals.
    \x OFF
    SELECT * FROM arima_beer_output_residual;
    
    Result:
     time_id |      residual
    ---------+--------------------
           2 |                  0
           4 |   -18.222328834394
           6 |  -5.49616627282665
    ...
          35 |   1.06298837051437
          37 |  -25.0886854003757
          39 |   3.48401666299571
    (40 rows)
    
  7. Use the ARIMA forecast function to forecast 10 future values.
    SELECT madlib.arima_forecast( 'arima_beer_output',
                                  'arima_beer_forecast_output',
                                  10
                                );
    SELECT * FROM arima_beer_forecast_output;
    
    Result:
     steps_ahead | forecast_value
    -------------+----------------
               1 |  85.3802343659
               3 |  85.3477516875
               5 |  85.3461514635
               7 |  85.3460726302
               9 |  85.3460687465
               2 |  85.3536518121
               4 |  85.3464421267
               6 |  85.3460869494
               8 |  85.3460694519
              10 |    85.34606859
    (10 rows)
    

Technical Background
An ARIMA model is an auto-regressive integrated moving average model. An ARIMA model is typically expressed in the form

\[ (1 - \phi(B)) Y_t = (1 + \theta(B)) Z_t, \]

where \(B\) is the backshift operator. The time \( t \) is from \( 1 \) to \( N \).

ARIMA models involve the following variables:

The auto regression operator models the prediction for the next observation as some linear combination of the previous observations. More formally, an AR operator of order \( p \) is defined as

\[ \phi(B) Y_t= \phi_1 Y_{t-1} + \dots + \phi_{p} Y_{t-p} \]

The moving average operator is similar, and it models the prediction for the next observation as a linear combination of the errors in the previous prediction errors. More formally, the MA operator of order \( q \) is defined as

\[ \theta(B) Z_t = \theta_{1} Z_{t-1} + \dots + \theta_{q} Z_{t-q}. \]

We estimate the parameters using the Levenberg-Marquardt Algorithm. In mathematics and computing, the Levenberg-Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, provides a numerical solution to the problem of minimizing a function, generally nonlinear, over a space of parameters of the function.

Like other numeric minimization algorithms, LMA is an iterative procedure. To start a minimization, the user has to provide an initial guess for the parameter vector, $p$, as well as some tuning parameters \(\tau, \epsilon_1, \epsilon_2, \epsilon_3,\).

Literature

[1] Rob J Hyndman and George Athanasopoulos: Forecasting: principles and practice, http://otexts.com/fpp/

[2] Robert H. Shumway, David S. Stoffer: Time Series Analysis and Its Applications With R Examples, Third edition Springer Texts in Statistics, 2010

[3] Henri Gavin: The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems, 2011

Related Topics

File arima.sql_in documenting the ARIMA functions