## Box-plot and violin plot

The descriptions below are adapted from Wikipedia and are here to recap the concepts. **You can skip this part entirely.**

In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

Box plots display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically. (Source: Wikipedia)
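As a rough illustration of the quantities a box plot draws, the sketch below computes the quartiles, the interquartile range, and the whisker limits in Python. The 1.5 × IQR whisker rule (Tukey's convention), the "inclusive" quantile method, and the helper name `box_stats` are assumptions made for this example; they are not taken from `boxplot.sh`.

```python
import statistics

def box_stats(values):
    """Return the numbers a Tukey-style box plot is drawn from."""
    data = sorted(values)
    # method="inclusive" interpolates between the observed min and max
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: whiskers reach the most extreme points within 1.5 * IQR
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this made-up sample the value `50` lies beyond the upper fence, so it would be drawn as an individual point rather than being covered by the whisker.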

Violin plots are a method of plotting numeric data. A violin plot is a combination of a box plot and a kernel density plot. Specifically, it starts with a box plot. It then adds a rotated kernel density plot to each side of the box plot.

Violin plots are similar to box plots, except that they also show the probability density of the data at different values (in the simplest case this could be a histogram). Typically, violin plots include a marker for the median of the data and a box indicating the interquartile range, as in standard box plots. Overlaid on this box plot is a kernel density estimation. (Source: Wikipedia)
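To make the kernel density part concrete, here is a minimal sketch of a Gaussian kernel density estimate, the kind of curve a violin plot mirrors on both sides of the box. The function name `gaussian_kde`, the fixed bandwidth of `0.4`, and the sample data are assumptions for illustration; plotting libraries normally choose the bandwidth themselves.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centred on each sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1.0, 1.2, 1.9, 2.0, 2.1, 2.2, 4.0]
density = gaussian_kde(samples, bandwidth=0.4)

# Evaluate on a grid from -2.0 to 8.0; a violin plot draws this curve mirrored
step = 0.1
grid = [i * step for i in range(-20, 81)]
curve = [density(x) for x in grid]
area = sum(curve) * step  # Riemann sum, should be close to 1
print(round(area, 3))
```

The density is highest where the samples cluster (around `2.0` here), which is where the violin would be widest.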

## How to do the plot

Here I will introduce a script, `boxplot.sh`, as a one-line command to plot various box plots on given data.

### Input file format

Two types of input files are supported. The first type is a table file with the first column as ID and the other columns as data values, just as you would expect. It is suitable when all boxes share the same IDs. If each box contains a different set of IDs, you may use the second format.

#### Matrix data table like `diamonds` as described in Basic things you should know to use s-plot.

The first column is the ID variable; normally the values in this column should be unique. The other columns are data columns, and you can have any number of them.

#### Molten format as described in Basic things you should know to use s-plot.

As a practical example, if you want to compare the expression levels of all genes in multiple samples, the table format is fine. If instead you only want to compare the expression distribution of the top 100 genes in each sample, the molten format should be used, since each sample may have a different set of top 100 genes.
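The relation between the two formats can be sketched as follows: melting turns every data column of the matrix into `variable`/`value` rows, so each row carries its own ID. The column names `variable` and `value` follow the defaults of `-F` and `-d` mentioned in this document; the function `melt`, the tab-separated layout, and the sample rows are assumptions for illustration, not the actual s-plot code.

```python
import csv
import io

def melt(matrix_text, id_column="ID"):
    """Convert a tab-separated wide table into molten (long) rows."""
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    rows = []
    for record in reader:
        for name, value in record.items():
            if name == id_column:
                continue  # the ID is repeated on every molten row instead
            rows.append(
                {id_column: record[id_column], "variable": name, "value": value}
            )
    return rows

# Hypothetical two-gene, two-sample matrix
wide = "ID\tsampleA\tsampleB\ngene1\t5.0\t7.5\ngene2\t0.0\t3.2\n"
molten = melt(wide)
for row in molten:
    print(row["ID"], row["variable"], row["value"], sep="\t")
```

Because every molten row names its own ID, different samples can list different genes, which is what the top-100 example needs.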

### Begin plotting

#### Simply running `boxplot.sh -f diamond.extract.matrix` or `boxplot.sh -f diamond.extract.matrix.melt -m TRUE` will get the following boxplot.

#### To also get the violin plot, run `boxplot.sh -f diamond.extract.matrix -V TRUE`. If you do not want to plot the inner boxplot, give `FALSE` to `-W`.

#### Plot the distribution of `carat` and `price` in each given category.

In the `cut` category, `boxplot.sh -f diamond.extract.matrix -a cut -I "'color'"`; in the `color` category, `boxplot.sh -f diamond.extract.matrix -a color -I "'cut'"`.

Remember to *exclude other columns*, if there are any, by giving their names to `-I` in the format `"'col1','col2'"` or `"'col'"`. Pay attention to the **double quotation** marks.

If the input file is in molten format, `boxplot.sh -f diamond.extract.matrix.melt -m TRUE -d value -F variable -a color -I "'cut'" -x color` or `boxplot.sh -f diamond.extract.matrix.melt -m TRUE -a color -I "'cut'" -x color` would be suitable, since the default value for `-d` is `value` and the default for `-F` is `variable`. For the practical example above, `rpkm` should be given to `-d` and `gene` to `-F`.

#### Manually set color for boxes in each category and exclude outliers to display clearly.

`boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -o TRUE`.

For manual color setting, give `TRUE` to `-c` and the color list to `-C`.

When `-o` is `TRUE`, outliers are excluded. By default, points outside [minimum/1.05, maximum*1.05] are treated as outliers. You can give other numbers to `-O` when `-o` is `TRUE` to set a different outlier range.

Excluding outliers is useful for displaying data with very large ranges. Another way to handle such data is to scale the y-axis using `log10` or `log2`.

For example, `boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -x 'cut' -c TRUE -C "'red','blue'" -s TRUE -v "scale_y_log10()"`.

Giving `TRUE` to `-s` turns on y-axis scaling. The string given to `-v` specifies how to transform the y-axis; options include `scale_y_log10()`, `coord_trans(y="log10")`, `scale_y_continuous(trans=log2_trans())`, and `coord_trans(y="log2")`. You may also want to give a value such as `1` to `-S` to avoid taking the logarithm of `0`.
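The reason for giving a value to `-S` can be seen in a small sketch: the logarithm of `0` is undefined, so a constant is added before transforming. Whether `boxplot.sh` applies the value in exactly this way is an assumption, and the helper `log10_shifted` is hypothetical.

```python
import math

def log10_shifted(values, shift=1.0):
    """Add a constant to every value, then take log10."""
    return [math.log10(v + shift) for v in values]

expression = [0, 9, 99, 999]  # made-up values that include a zero

# Without a shift, the zero cannot be transformed at all
try:
    math.log10(expression[0])
    zero_failed = False
except ValueError:
    zero_failed = True

shifted = log10_shifted(expression, shift=1.0)
print(zero_failed, shifted)
```

A shift of `1` maps a zero to `0` on the log scale, so it stays on the plot instead of being dropped.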

#### Set the order of boxes in one category or set the order of categories (default alphabetical order).

`boxplot.sh -f diamond.extract.matrix -a cut -I "'color'" -l "'price','carat'" -L "'Ideal','Premium','Very Good','Good','Fair'" -x 'cut'`.

For the legend variable, the order should be given to `-l`. For the category variable, the order should be given to `-L`. If you want to rotate the x-axis tick labels so that they display vertically, give `-90` to `-b`.

#### Plot only the distribution of one column, `price`, in each category.

In the `cut` category, `boxplot.sh -f diamond.extract.matrix -r 70 -a cut -I "'color','carat'"`; in the `color` category, `boxplot.sh -f diamond.extract.matrix -r 70 -a color -I "'cut','carat'"`.

Remember to exclude other columns, if there are any, by giving their names to `-I` in the format `"'col1','col2'"` or `"'col'"`. If you want to do this with molten files, remove the unneeded numerical columns before the melting process using shell commands like `grep -v`, and give the new file to `boxplot.sh`.

#### Plot the distribution of `price` in different `carat` categories.

`boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B 4 -x 'carat' -y 'price'`

or `boxplot.sh -f diamond.extract.matrix -a carat -I "'cut','color'" -B "c(0.1,0.4,0.7,1,6)" -x 'carat' -y 'price'`

The first command gives `4` to `-B`, which splits `carat` into 4 categories.

The second command gives the numerical vector `c(0.1,0.4,0.7,1,6)` to `-B` to split `carat` into the given ranges.
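What the two forms of `-B` mean can be sketched in Python. This assumes the binning behaves roughly like R's `cut` function (equal-width intervals for a single number, right-closed intervals for explicit breaks), which the source does not state. The helper names and the sample carat values are hypothetical.

```python
import bisect

def equal_width_breaks(values, bins):
    """Split the observed range into `bins` intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    breaks = [low + i * width for i in range(bins + 1)]
    breaks[-1] = high  # guard against floating-point drift at the top edge
    return breaks

def assign_bin(value, breaks):
    """Return the label of the right-closed interval (a,b] holding value."""
    if value < breaks[0] or value > breaks[-1]:
        return None  # outside every interval
    # bisect_left finds the first break >= value; the lowest value joins bin 1
    i = max(bisect.bisect_left(breaks, value), 1)
    return f"({breaks[i - 1]},{breaks[i]}]"

carats = [0.23, 0.4, 0.71, 1.0, 2.5]

# Like `-B "c(0.1,0.4,0.7,1,6)"`: explicit breaks
explicit = [assign_bin(c, [0.1, 0.4, 0.7, 1, 6]) for c in carats]
print(explicit)

# Like `-B 4`: four equal-width categories over the observed range
auto_breaks = equal_width_breaks(carats, 4)
print([assign_bin(c, auto_breaks) for c in carats])
```

Explicit breaks are useful when the values are unevenly spread, since equal-width bins can leave most points in a single category.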

#### Please see Basic things you should know to use s-plot to modify other formats like the *position of legend*, *width*, *height*, *resolution* and *output type* of pictures, *install required module* or *generating R scripts only*.

## Ref

- ggplot2
- http://www.statmethods.net/graphs/boxplot.html
- http://stackoverflow.com/questions/2492947/boxplot-in-r-showing-the-mean
- http://stackoverflow.com/questions/10441214/r-boxplot-with-multiple-factor-labels
- http://stackoverflow.com/questions/10805643/ggplot2-add-color-to-boxplot-continuous-value-supplied-to-discrete-scale-er