10 Basic Statistics

This week we’ll be diving into some basic statistical procedures. R started out as THE statistical programming language. It spread like wildfire for its performance and efficiency at crunching numbers. Its early success as a statistical programming language attracted the early developers who really made R into the do-it-all language we’re using today. First, I’ll throw in the useful definitions list at the top of this tutorial just for ease of access. Second, we’ll dive into a recap of for loops and how they’re structured. Then we’ll show off some of R’s great statistics functions and how to add them to a plot. Finally, we’ll have an assignment where we put these statistical tools to work.

10.1 Statistical Function Dictionary

Function Description
length returns length of a vector
sum returns the sum
mean returns the mean
median returns the median
sd returns the standard deviation (n − 1 in denominator)
min returns minimum
max returns maximum
sort sorts a vector (rearranges the values in order)
order returns the indices that would sort the vector
rank returns the rank of each element in a vector
lm multiple linear regression
glm generalized linear models
anova analysis of variance
chisq.test Pearson’s Chi-squared test for count data
summary shows results of various model fitting functions
predict returns predicted values from a fitted model
abline adds a straight line to a plot; passing an lm object will result in adding the predicted line to the plot
sample produces a random sample of the specified values
set.seed sets seed for next random sample (makes random sampling repeatable)
rnorm produces a random sample from a normal distribution
qnorm quantiles (percentiles) of normal distribution
pnorm CDF of normal distribution
dnorm PDF of normal distribution
rbinom produces a random sample from a binomial distribution
fitdistrplus package that helps find the fit of univariate parametric distributions
plotdist empirical distribution plotting function
descdist computes descriptive parameters of an empirical distribution and provides a skewness-kurtosis plot
fitdist Fits various distributions to data
denscomp plots the histogram against fitted density functions
cdfcomp plots the empirical cumulative distribution against fitted distribution functions
qqcomp plots theoretical quantiles against empirical ones
ppcomp plots theoretical probabilities against empirical ones
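
A few of these entries are easy to mix up, so here’s a quick sketch with a toy vector (the values are arbitrary) showing how sort, order, and rank differ, plus how set.seed makes a random draw repeatable:

x <- c(30, 10, 20)
sort(x)       # 10 20 30 - the values, rearranged into order
order(x)      # 2 3 1    - the indices that would sort x (x[order(x)] equals sort(x))
rank(x)       # 3 1 2    - each element's position in the sorted vector
set.seed(1)   # fix the seed so the draw below can be reproduced exactly
rnorm(5)      # five draws from a standard normal distribution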

10.2 Basic Statistics in R

R is famous for its easy-to-use statistics features, and once you get the hang of them you’ll never want to touch Microsoft Excel again. Despite the fact that these stats functions seem pretty fancy to me, I call them basic functions because R has so much to offer to statisticians. In this tutorial, we examine meteorological observations that were recorded in Willow Creek, Wisconsin, USA. There are 3 years of Willow Creek data - WCr_1hr.2010.nc, WCr_1hr.2011.nc, WCr_1hr.2012.nc. Let’s start with the 2010 dataset.

10.2.1 Load Willow Creek data

library(ncdf4)
# open the netcdf file of Willow Creek, Wisconsin meteorology data
nc_file = nc_open("/Users/james/Documents/Github/geog473-673/datasets/WCr_1hr.2010.nc")
# what does the nc file look like 
nc_file
## File /Users/james/Documents/Github/geog473-673/datasets/WCr_1hr.2010.nc (NC_FORMAT_CLASSIC):
## 
##      10 variables (excluding dimension variables):
##         float air_temperature[latitude,longitude,time]   
##             units: Kelvin
##             _FillValue: -999
##         float air_temperature_max[latitude,longitude,time]   
##             units: Kelvin
##             _FillValue: -999
##         float air_temperature_min[latitude,longitude,time]   
##             units: Kelvin
##             _FillValue: -999
##         float surface_downwelling_longwave_flux_in_air[latitude,longitude,time]   
##             units: W/m2
##             _FillValue: -999
##         float air_pressure[latitude,longitude,time]   
##             units: Pascal
##             _FillValue: -999
##         float surface_downwelling_shortwave_flux_in_air[latitude,longitude,time]   
##             units: W/m2
##             _FillValue: -999
##         float eastward_wind[latitude,longitude,time]   
##             units: m/s
##             _FillValue: -999
##         float northward_wind[latitude,longitude,time]   
##             units: m/s
##             _FillValue: -999
##         float specific_humidity[latitude,longitude,time]   
##             units: g/g
##             _FillValue: -999
##         float precipitation_flux[latitude,longitude,time]   
##             units: kg/m2/s
##             _FillValue: -999
## 
##      3 dimensions:
##         latitude  Size:1
##             units: degree_north
##             long_name: latitude
##         longitude  Size:1
##             units: degree_east
##             long_name: longitude
##         time  Size:8760   *** is unlimited ***
##             units: sec
##             long_name: time
# ok, still a lot of info...let's list the names of the variables
names(nc_file$var)
##  [1] "air_temperature"                          
##  [2] "air_temperature_max"                      
##  [3] "air_temperature_min"                      
##  [4] "surface_downwelling_longwave_flux_in_air" 
##  [5] "air_pressure"                             
##  [6] "surface_downwelling_shortwave_flux_in_air"
##  [7] "eastward_wind"                            
##  [8] "northward_wind"                           
##  [9] "specific_humidity"                        
## [10] "precipitation_flux"
# alright, now we have some names, so let's put the variables into a new dataframe separate from the nc_file
var_names = names(nc_file$var)
# create an empty list to hold each variable
willow_creek_2010 = list()
# save the dimension metadata - we'll need the time units from it later
dim <- nc_file$dim
# loop over the variable names and extract each one from the netcdf file
for (v in seq_along(var_names)){
  willow_creek_2010[[v]] = ncvar_get(nc_file, varid = var_names[v])
}
# convert the list into a dataframe
wcreek_df = data.frame(willow_creek_2010)
# tell the dataframe what the column names are
colnames(wcreek_df) = var_names

# let's rename the variables to make them shorter - note that these short names MUST be in the same order as the longer names
short_names = c("tair", "tmax", "tmin", "lwave", "pres", "swave", "ewind", "nwind", "shum", "prec")
# rename the column names to our new short name vector
colnames(wcreek_df) = short_names
# print the first few lines
head(wcreek_df)
##       tair    tmax    tmin    lwave    pres swave    ewind     nwind
## 1 260.1135 260.137 260.090 237.6100 95704.5     0 1.522976 -1.462524
## 2 260.0315 260.044 260.019 241.6900 95735.5     0 1.624952 -1.563508
## 3 259.9805 259.994 259.967 240.3500 95765.5     0 1.696362 -1.553177
## 4 259.9005 259.941 259.860 240.9325 95795.5     0 1.493329 -1.702995
## 5 259.7585 259.780 259.737 247.3875 95824.5     0 1.921457 -1.460475
## 6 259.6825 259.693 259.672 246.5075 95853.0     0 1.963118 -1.667144
##          shum prec
## 1 0.001176792    0
## 2 0.001170627    0
## 3 0.001167514    0
## 4 0.001160727    0
## 5 0.001142952    0
## 6 0.001133232    0
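
One housekeeping note: everything we need has now been copied into wcreek_df (and the dim list), so it’s good practice to close the netCDF connection. nc_close comes from the ncdf4 package we already loaded:

# close the netcdf file now that the variables are extracted
nc_close(nc_file)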

Our data has been read in and the variables have been renamed for our convenience. A major note here is that the replacement names must be in the same order as the original variable names. Now, we turn our attention to the time component of this data. This is a time series dataset of weather variables, so we’ll need a time variable to keep us organized. Here’s how we can create one…

### TIME ###
# what are the units of time
dim$time$units
## [1] "sec"
# how are the time values spaced
dim$time$vals[1:10]
##  [1]  3600  7200 10800 14400 18000 21600 25200 28800 32400 36000
# we can back out that this is hourly data just from knowing the units are seconds
# and that each recorded value is 3600 seconds apart (3600 seconds == 1 hour)
#### Add a datetime column ####
date.seq = seq(as.POSIXct("2010-01-01 00:00:00"), as.POSIXct("2010-12-31 23:00:00"), by="hour")
wcreek_df['datetime'] = date.seq
wcreek_df$datetime[1:10]
##  [1] "2010-01-01 00:00:00 EST" "2010-01-01 01:00:00 EST"
##  [3] "2010-01-01 02:00:00 EST" "2010-01-01 03:00:00 EST"
##  [5] "2010-01-01 04:00:00 EST" "2010-01-01 05:00:00 EST"
##  [7] "2010-01-01 06:00:00 EST" "2010-01-01 07:00:00 EST"
##  [9] "2010-01-01 08:00:00 EST" "2010-01-01 09:00:00 EST"

The data is loaded, organized, and ready for statistical analysis.

10.2.2 Subsetting Data

In certain cases, our dataset may be larger than we need. For example, our dataset contains a full year of data. How can we subset it to just January 1 through June 30?

We can use the which function to find out which row a particular datetime is located in. In other words, the which function returns the date’s index position. In order for this to work, the datetime we search with must match an entry in the dataset. In our case, searching with a YYYY-mm-dd string matches that date’s midnight (00:00:00) entry. After we have our indices, we can subset the wcreek_df dataset using wcreek_df[rows, columns] subsetting rules.

# use YYYY-mm-dd to return which datetime index has a value of 2010-01-01
start_ind = which(wcreek_df$datetime == "2010-01-01")
# use YYYY-mm-dd to return which datetime index has a value of 2010-06-30
end_ind = which(wcreek_df$datetime == "2010-06-30")
# index our wcreek_df dataframe using the indices gathered above - we want all
# columns, so the column slot after the comma stays blank
jan2jun = wcreek_df[start_ind:end_ind,]

Note that there is also a subset function which is handy for interactive use, but it relies on non-standard evaluation, which can produce unintended results when used inside other functions. The indexing method above is the preferred way to subset data.
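
For comparison, here is what the same subset looks like with subset() and with plain logical indexing - a sketch assuming the datetime column we built above (the names jan2jun_v2 and jan2jun_v3 are just for illustration):

# equivalent subsets of January 1 through June 30
jan2jun_v2 <- subset(wcreek_df, datetime >= as.POSIXct("2010-01-01") &
                       datetime <= as.POSIXct("2010-06-30"))
jan2jun_v3 <- wcreek_df[wcreek_df$datetime >= as.POSIXct("2010-01-01") &
                          wcreek_df$datetime <= as.POSIXct("2010-06-30"), ]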

10.2.4 Finding a Best-Fit Distribution

The fitdistrplus package is excellent for performing statistical tests and fitting distributions. We can use this package to tell us which distribution best fits a particular set of data. One useful function for quick histograms with empirical density and cumulative distribution overlays is the plotdist function.

library(fitdistrplus)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:raster':
## 
##     area, select
## Loading required package: survival
plotdist(jan2jun$tair, histo = TRUE, demp = TRUE)

In the previous section, we fit our histogram of Air Temperature with a normal distribution. Was this a good selection? We can use the descdist function to plot a Cullen and Frey graph. This graph helps identify which theoretical distribution (normal, uniform, exponential, etc.) best fits our data - basically, we just look to see which theoretical distribution lies closest to our observation point.

descdist(jan2jun$tair)

## summary statistics
## ------
## min:  248.4165   max:  304.0025 
## median:  278.452 
## mean:  277.785 
## estimated sd:  11.34891 
## estimated skewness:  -0.176518 
## estimated kurtosis:  2.190581
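
One caveat: the skewness and kurtosis above are single-sample estimates, so the observation point on the Cullen and Frey graph can shift with sampling noise. descdist accepts a boot argument that overlays bootstrapped estimates to visualize that spread (100 replicates here is an arbitrary illustrative choice):

# redraw the Cullen and Frey graph with 100 bootstrapped skewness-kurtosis estimates
descdist(jan2jun$tair, boot = 100)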

Uniform, normal, and gamma distributions are the 3 closest to our air temperature data. Let’s fit each of these distributions to the empirical data and create a density plot (denscomp), a cumulative distribution function plot (cdfcomp), a Q-Q plot (qqcomp), and a P-P plot (ppcomp) for these distributions.

fit_u  <- fitdist(jan2jun$tair, "unif")
fit_n  <- fitdist(jan2jun$tair, "norm")
fit_g <- fitdist(jan2jun$tair, "gamma")
summary(fit_g)
## Fitting of the distribution ' gamma ' by maximum likelihood 
## Parameters : 
##         estimate  Std. Error
## shape 595.374330 12.80305602
## rate    2.143291  0.04610913
## Loglikelihood:  -16634.45   AIC:  33272.91   BIC:  33285.65 
## Correlation matrix:
##           shape      rate
## shape 1.0000000 0.9995799
## rate  0.9995799 1.0000000
par(mfrow=c(2,2))
plot.legend <- c("uniform", "normal", "gamma")
denscomp(list(fit_u, fit_n, fit_g), legendtext = plot.legend)
cdfcomp (list(fit_u, fit_n, fit_g), legendtext = plot.legend)
qqcomp  (list(fit_u, fit_n, fit_g), legendtext = plot.legend)
ppcomp  (list(fit_u, fit_n, fit_g), legendtext = plot.legend)

Some of the statistics shown in this section are beyond the scope of this course, but this package is important to know about.
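
If you want numbers to back up the plots, fitdistrplus also provides gofstat, which tabulates goodness-of-fit statistics (AIC, BIC, Kolmogorov-Smirnov, and others) for a list of fitted objects:

# goodness-of-fit statistics for all three candidate distributions
gofstat(list(fit_u, fit_n, fit_g), fitnames = c("uniform", "normal", "gamma"))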

10.2.5 Correlation Plots

Correlation plots show correlation coefficients across variables. For example, we expect shortwave radiation and temperature to have a high correlation coefficient because, generally speaking, when a lot of sunlight is received at the surface, temperature increases. There’s a handy package called corrplot that calculates and plots correlation coefficients quickly and intuitively. If you don’t have the package installed, you’ll need to use - install.packages("corrplot").

For the correlation plot, we’re going to use the full year dataset, wcreek_df, rather than the jan2jun dataset. When we calculate correlations, the more data we have the better. We’re also going to test the significance of each correlation - corrplot’s cor.mtest function computes a p-value for every pair of variables, and we’ll use those p-values to blank out any correlation that isn’t statistically significant at the 95% confidence level.

library(corrplot)
## corrplot 0.84 loaded
# create a duplicate of the wcreek_df - remove the datetime variable from this
wcreek_nodate = wcreek_df
# we need to remove the datetime so our dataset is full of numeric values only - no datetime values or characters
wcreek_nodate$datetime = NULL

# awesome, now let's calculate the correlation coefficients
cor_wcreek = cor(wcreek_nodate)
head(cor_wcreek)
##             tair       tmax       tmin      lwave        pres      swave
## tair   1.0000000  0.9998053  0.9998039  0.8200776 -0.16229303 0.35668470
## tmax   0.9998053  1.0000000  0.9992183  0.8177929 -0.16085314 0.35552272
## tmin   0.9998039  0.9992183  1.0000000  0.8220495 -0.16367456 0.35771119
## lwave  0.8200776  0.8177929  0.8220495  1.0000000 -0.29557674 0.27059006
## pres  -0.1622930 -0.1608531 -0.1636746 -0.2955767  1.00000000 0.02962979
## swave  0.3566847  0.3555227  0.3577112  0.2705901  0.02962979 1.00000000
##             ewind       nwind       shum        prec
## tair  -0.35196115  0.07174053  0.9064799  0.08391199
## tmax  -0.35195936  0.07146779  0.9056241  0.08342905
## tmin  -0.35182514  0.07198617  0.9069838  0.08436382
## lwave -0.29876208  0.09322482  0.8416941  0.14016558
## pres   0.22972978  0.03904779 -0.1628097 -0.09972888
## swave -0.07181433 -0.11448984  0.2681873 -0.02659980
# now let's run significance tests on the correlations - cor.mtest returns a
# matrix of p-values (plus confidence intervals) for every variable pair
sig_test <- cor.mtest(wcreek_nodate, conf.level = .95)
# now let's plot this up. 
corrplot(cor_wcreek, p.mat = sig_test$p, method = 'color', number.cex = .7, type = 'lower',
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         # blank out correlations that are not significant at the 0.05 level
         sig.level = 0.05, insig = "blank")

Do these correlations make sense? Let’s take two variables we would expect to be highly correlated - temperature and shortwave radiation (sunlight). Our correlation is only .36 for these two variables - a fairly weak correlation. Why isn’t it higher? Advection - that is, air being transported from other locations via wind.
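
If you want to examine a single pair more closely, the base cor.test function reports the correlation coefficient along with a p-value and a 95% confidence interval; here’s a quick sketch for air temperature versus shortwave radiation:

# Pearson correlation between air temperature and shortwave radiation
cor.test(wcreek_df$tair, wcreek_df$swave)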

10.3 Assignment

Using the WCr_1hr.2012.nc dataset found in the datasets folder, create a document (word, pdf, notepad, etc.) answering these questions with accompanying figures (no code).

  1. Subset the data between July 1 and Dec 31.
  2. Create a scatterplot with a trendline of Shortwave Radiation similar to the air temperature example above. Does this trendline make sense?
  3. Create a filled density plot of Shortwave Radiation similar to the air temperature example above. Note, you will need to change your binwidth to fit the shortwave dataset. Does this density plot make sense? Why or why not?
  4. Create a correlation plot of your willow creek 2012 dataset. How does it compare to the 2010 dataset in the tutorial?

Save your plots created in #2, #3, #4, attach them to a document and answer the questions. Submit this document to Canvas.

10.3.1 Extra Credit - 2 points

Using the WCr_1hr.2010.nc, WCr_1hr.2011.nc and WCr_1hr.2012.nc found in the datasets folder, complete the following

  1. Fuse together the 3 datasets into one continuous dataframe.
  2. Resample the data from an hourly to a daily resolution.
  3. Plot Air temperature for your new combined data frame and add a trendline to it.
  4. Submit it to the assignment above, labeled ‘extra_credit.png’.