# Why Use the DataFrames Package?¶

We believe that Julia is the future of technical computing. Nevertheless, Base Julia is not sufficient for statistical computing. The DataFrames package (and its sibling, DataArrays) extends Base Julia by introducing three basic types needed for statistical computing:

`NA`

: An indicator that a data value is missing`DataArray`

: An extension to the`Array`

type that can contain missing values`DataFrame`

: A data structure for representing tabular data sets

`NA`

: An Indicator for Missing Data Points¶

Suppose that we want to calculate the mean of a list of five `Float64`

numbers: `x1`

, `x2`

, `x3`

, `x4`

and `x5`

. We would normally do this
in Julia as follows:

- Represent these five numbers as a
`Vector`

:`v = [x1, x2, x3, x4, x5]`

- Compute the mean of
`v`

using the`mean`

function

*But what if one of the five numbers were missing?*

The concept of a missing data point cannot be directly expressed in Julia.
In contrast with languages like Java and R, which provide `NULL`

and `NA`

values that represent missingness, there is no missing data value in Base
Julia.

The DataArrays package therefore provides `NA`

, which serves as an indicator
that a specific value is missing. In order to exploit Julia’s multiple dispatch
rules, `NA`

is a singleton object of a new type called `NAtype`

.

Like R’s `NA`

value and unlike Java’s `NULL`

value, Julia’s `NA`

value represents
epistemic uncertainty. This means that operations involving `NA`

return `NA`

when the result of the operation cannot be determined, but operations whose
value can be determined despite the presence of `NA`

will return a value that
is not `NA`

.

For example, `false && NA`

evaluates to `false`

and `true || NA`

evaluates
to `true`

. In contrast, `1 + NA`

evaluates to `NA`

because the outcome is
uncertain in the absence of knowledge about the missing value represented
by `NA`

.

`DataArray`

: Efficient Arrays with Missing Values¶

Although the `NA`

value is sufficient for representing missing scalar values,
it cannot be stored efficiently inside of Julia’s standard `Array`

type. To
represent arrays with potentially missing entries, the DataArrays package
introduces a `DataArray`

type. For example, a `DataArray{Float64}`

can
contain `Float64`

values and `NA`

values, but nothing else. In contrast, the
most specific `Array`

that can contain both `Float64`

and `NA`

values is an
`Array{Any}`

.

Except for the ability to store `NA`

values, the `DataArray`

type is meant to
behave exactly like Julia’s standard `Array`

type (n.b. Base methods that make
assumptions about container types may not work with DataArrays). In particular,
`DataArray`

provides two typealiases called `DataVector`

and `DataMatrix`

that mimic the `Vector`

and `Matrix`

typealiases for 1D and 2D `Array`

types.

`DataFrame`

: Tabular Data Sets¶

`NA`

and `DataArray`

provide mechanisms for handling missing values for scalar
types and arrays, but most real world data sets have a tabular structure that
does not correspond to a simple `DataArray`

.

For example, the data table shown below highlights some of the ways in which a
typical data set is not like a `DataArray`

:

Name | Height | Weight | Gender |
---|---|---|---|

John Smith | 73.0 | NA | Male |

Jane Doe | 68.0 | 130 | Female |

Note three important properties that this table possesses:

- The columns of a tabular data set may have different types. A
`DataMatrix`

can only contain values of one type: these might all be`UTF8String`

or`Int`

, but we cannot have one column of`UTF8String`

type and another column of`Int`

type. - The values of the entries within a column always have a consistent type.
This means that a single column could be represented using a
`DataVector`

. Unfortunately, the heterogeneity of types between columns means that we need some way of wrapping a group of columns together into a coherent whole. We could use a standard`Vector`

to wrap up all of the columns of the table, but this will not enforce an important constraint imposed by our intuitions:*every column of a tabular data set has the same length as all of the other columns*. - The columns of a tabular data set are typically named using some sort of
`UTF8String`

. Often, one wants to access the entries of a data set by using a combination of verbal names and numeric indices.

We can summarize these concerns by noting that we face four problems when with working with tabular data sets:

- Tabular data sets may have columns of heterogeneous type
- Each column of a tabular data set has a consistent type across all of its entries
- All of the columns of a tabular data set have the same length
- The columns of a tabular data set should be addressable using both verbal names and numeric indices

The DataFrames package solves these problems by adding a `DataFrame`

type
to Julia. This type will be familiar to anyone who has worked with R’s
`data.frame`

type, Pandas’ `DataFrame`

type, an SQL-style database, or
Excel spreadsheet.