# Getting Started

## Installation

The DataFrames package is available through the Julia package system. Throughout
the rest of this tutorial, we will assume that you have installed the DataFrames
package and have already typed `using DataArrays, DataFrames` to bring all of
the relevant variables into your current namespace. In addition, we will make
use of the `RDatasets` package, which provides access to hundreds of
classical data sets.
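As a minimal sketch of the setup described above, the packages can be installed and loaded from the REPL. This assumes a Julia release from the era of this tutorial, in which `Pkg` is available without an import and `DataArrays` is installed as a dependency of `DataFrames`; on newer releases you may need `using Pkg` first, and the `DataArrays` API used in this tutorial may no longer be available.

```julia
# Assumption: an older Julia release where Pkg is usable without `using Pkg`.
Pkg.add("DataFrames")   # the package this tutorial covers
Pkg.add("RDatasets")    # classical data sets used later in the tutorial

# Bring NA, DataArray, DataFrame and related names into the current namespace.
using DataArrays, DataFrames
```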

## The `NA` Value

To get started, let’s examine the `NA` value. Type the following into the
REPL:

```
NA
```

One of the essential properties of `NA` is that it poisons other items. To
see this, try to add something like `1` to `NA`:

```
1 + NA
```
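Because the result of such an expression is itself `NA`, it can be useful to test for it explicitly. The sketch below assumes the `isna` predicate exported by `DataArrays`; the variable name `x` is purely illustrative.

```julia
using DataArrays

x = 1 + NA    # arithmetic involving NA yields NA
isna(x)       # true: the result is missing
isna(1)       # false: an ordinary value is not missing
```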

## The `DataArray` Type

Now that we see that `NA` is working, let’s insert one into a `DataArray`.
We’ll create one now using the `@data` macro:

```
dv = @data([NA, 3, 2, 5, 4])
```

To see how `NA` poisons even complex calculations, let’s try to take
the mean of the five numbers stored in `dv`:

```
mean(dv)
```

In many cases we’re willing to just ignore `NA` values and remove them
from our vector. We can do that using the `dropna` function:

```
dropna(dv)
mean(dropna(dv))
```
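Before dropping values, it can help to see how many entries are missing. The following sketch assumes that `isna`, applied to a `DataArray`, returns an elementwise Boolean mask; it rebuilds `dv` so that it can be run on its own.

```julia
using DataArrays

dv = @data([NA, 3, 2, 5, 4])

mask = isna(dv)            # true where an entry is NA
n_missing = sum(mask)      # count of missing entries
n_kept = length(dropna(dv))  # entries that remain after dropping NA
```

Here `n_missing + n_kept` equals `length(dv)`, which is a quick sanity check that `dropna` removed only the missing entries.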

Instead of removing `NA` values, you can try to convert the `DataArray`
into a normal Julia `Array` using `convert`:

```
convert(Array, dv)
```

This fails in the presence of `NA` values, but will succeed if there are
no `NA` values:

```
dv[1] = 3
convert(Array, dv)
```

Besides removing `NA` values or hoping that none occur, you can
also replace any `NA` values using the `convert` function, which takes a
replacement value as an additional argument:

```
dv = @data([NA, 3, 2, 5, 4])
mean(convert(Array, dv, 11))
```

Which strategy for dealing with `NA` values is most appropriate will
typically depend on the specific details of your data analysis pathway.

Although the examples above employed only 1D `DataArray` objects, the
`DataArray` type defines a completely generic N-dimensional array type.
Operations on generic `DataArray` objects work in higher dimensions in
the same way that they work on Julia’s Base `Array` type:

```
dm = @data([NA 0.0; 0.0 1.0])
dm * dm
```

## The `DataFrame` Type

The `DataFrame` type can be used to represent data tables, each column of
which is a `DataArray`. You can specify the columns using keyword arguments:

```
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
```

It is also possible to construct a `DataFrame` in stages:

```
df = DataFrame()
df[:A] = 1:8
df[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
df
```

The `DataFrame` we build in this way has 8 rows and 2 columns. You
can check this using the `size` function:

```
nrows = size(df, 1)
ncols = size(df, 2)
```
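Called without a dimension argument, `size` returns both dimensions as a tuple, which can be destructured in one step. The sketch below rebuilds the same 8-row table with keyword arguments so that it is self-contained.

```julia
using DataFrames

df = DataFrame(A = 1:8, B = ["M", "F", "F", "M", "F", "M", "M", "F"])

# size(df) returns (number of rows, number of columns).
nrows, ncols = size(df)
```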

We can also look at small subsets of the data in a couple of different ways:

```
head(df)
tail(df)
df[1:3, :]
```

Having seen what some of the rows look like, we can try to summarize the
entire data set using `describe`:

```
describe(df)
```

To focus our search, we can start by looking at just the means and medians of
specific columns. In the example below, we use numeric indexing to access
the columns of the `DataFrame`:

```
mean(df[1])
median(df[1])
```

We could also have used column names to access individual columns:

```
mean(df[:A])
median(df[:A])
```

We can also apply a function to each column of a `DataFrame` with the `colwise`
function. For example:

```
df = DataFrame(A = 1:4, B = randn(4))
colwise(cumsum, df)
```

## Accessing Classic Data Sets

To see more of the functionality for working with `DataFrame` objects, we need
a more complex data set to work with. We’ll use the `RDatasets` package, which
provides access to many of the classical data sets that are available in R.

For example, we can access Fisher’s iris data set using the following functions:

```
using RDatasets
iris = dataset("datasets", "iris")
head(iris)
```

In the next section, we’ll discuss generic I/O strategies for reading and writing
`DataFrame` objects that you can use to import and export your own data files.