The Formula, ModelFrame and ModelMatrix Types

In regression modeling, we often want to describe the relationship between a response variable and one or more input variables in terms of main effects and interactions. To make it easy to specify a regression model in terms of the columns of a DataFrame, the DataFrames package provides a Formula type, which is created with the binary ~ operator in Julia:

fm = Z ~ X + Y

A Formula object can be used to transform a DataFrame into a ModelFrame object:

df = DataFrame(X = randn(10), Y = randn(10), Z = randn(10))
mf = ModelFrame(Z ~ X + Y, df)

A ModelFrame object is a simple wrapper around a DataFrame. For modeling purposes, one generally wants to construct a ModelMatrix from it, which holds a Matrix{Float64} that can be used directly to fit a statistical model:

mm = ModelMatrix(ModelFrame(Z ~ X + Y, df))

Note that mm contains an additional column consisting entirely of 1.0 values. This is used to fit an intercept term in a regression model.
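
To make the role of the intercept column concrete, here is a minimal sketch in plain Julia that builds the same kind of matrix by hand, without DataFrames. The names n, x, y, z, M and beta are illustrative and not part of the package, and the column order (intercept, X, Y) is an assumption about how the model matrix is laid out.

```julia
# Hand-built analogue of the model matrix for Z ~ X + Y (illustrative names only).
n = 10
x = randn(n)
y = randn(n)
z = randn(n)

# Column 1 is the intercept column of 1.0 values; columns 2 and 3 hold X and Y.
M = hcat(ones(n), x, y)

# Least-squares coefficients: the intercept first, then the slopes for X and Y.
beta = M \ z
```

Because the first column is constant, its coefficient acts as the intercept of the fitted model.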

In addition to specifying main effects, it is possible to specify interactions using the & operator inside a Formula:

mm = ModelMatrix(ModelFrame(Z ~ X + Y + X&Y, df))
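
As a rough illustration of what the interaction term adds, the sketch below builds a comparable matrix by hand in plain Julia. It assumes that X and Y are numeric columns, in which case the X&Y term is taken to be their elementwise product; the names n, x, y, xy and M are hypothetical and the column order is an assumption.

```julia
# Hand-built analogue of the model matrix for Z ~ X + Y + X&Y (illustrative only).
n = 10
x = randn(n)
y = randn(n)

# For numeric columns, the interaction column is assumed to be the elementwise product.
xy = x .* y

# Columns: intercept, X, Y, and the X&Y interaction.
M = hcat(ones(n), x, y, xy)
```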

If you would like to specify both main effects and an interaction term at once, use the * operator inside a Formula:

mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))
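
Written out as an equation, and assuming X and Y are numeric, the formula Z ~ X*Y is shorthand for Z ~ X + Y + X&Y and corresponds to the linear model below. The coefficient symbols and the error term are notation introduced here for illustration.

```latex
Z_i = \beta_0 + \beta_1 X_i + \beta_2 Y_i + \beta_3 X_i Y_i + \varepsilon_i,
\qquad i = 1, \dots, n
```

Here \beta_0 is the intercept, \beta_1 and \beta_2 are the main effects, and \beta_3 is the interaction effect.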

Constructing model matrices from formulas makes it easy to specify complex statistical models. The GLM package uses them to good effect.