R: Descriptive stats


So which investment fund should we buy for our ISA? That is a question the media discuss between Christmas and New Year. I want to show you that you can select consistently well-performing funds by just knowing a bit of stats and a bit of R. It is a good way of getting back into stats after the Christmas festivities.

Bar plots are for when we have categorical variables – known also as factors in ANOVA.
Instances where we may use bar plots: descriptive stats with categorical variables; regression with categorical predictors; crosstabs.

#compare with hist() command
windows() # R is case-sensitive; windows() is Windows-only, dev.new() works on any platform
hist(num$cells)
#2. stacked bar plot 1: plot of frequencies of 2 categorical variables
table(num$smoker, num$weight)
barplot(table(num$smoker, num$weight))
#2. plot of means of DV across categorical variables in regression
tab = with(num, tapply(cells,list(smoker,weight),mean))
tab
barplot(tab)
#3 bar plot with bars arranged by group
barplot(tab,beside=T)
#color and labels
barplot(tab,beside=T,col=c(1,2))
#shading and legend
barplot(tab,beside=T,col=NA,density=c(10,20))
legend(1,3.5, c("non-smoker","smoker"), col=NA, density=c(10,20))
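The num data frame used above does not ship with R, so the commands will not run as they stand. As a rough sketch of the same steps on data anyone has, here I use the built-in ToothGrowth data as a stand-in (my assumption, not the original data): supp plays the role of one factor and dose the other. The names tab2 and mids are just illustrative.

```r
# Sketch only: ToothGrowth (shipped with R) stands in for the 'num' data frame
# mean tooth length by supplement (rows) and dose (columns)
tab2 <- with(ToothGrowth, tapply(len, list(supp, dose), mean))
tab2
# grouped bar plot with a legend; barplot() returns the bar midpoints
mids <- barplot(tab2, beside = TRUE, col = c("grey30", "grey80"),
                xlab = "dose", ylab = "mean len",
                legend.text = rownames(tab2),
                args.legend = list(x = "topleft"))
```

Letting barplot() draw the legend via legend.text avoids having to guess x and y coordinates for a separate legend() call.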

One of the first plots we learn about is the histogram, which is easy to interpret. Not so the q-q plot, whose purpose is to shed light on whether the variable (data) comes from a specified distribution. Here I want to simulate data to see what the normal q-q plot looks like for symmetric distributions with fat tails, and for skewed distributions. The command to plot the normal q-q plot is qqnorm().

R code for simulating data from a number of distributions and then plotting their q-q plots


#simulate from various distributions
simn=rnorm(10000,0,2) # simulate 10k observations from a normal with mean 0 and sd 2 (rnorm takes the sd, not the variance)
simchi=rchisq(10000,6) #simulate from chi-square(6)
simchi2= - simchi # create negative skew distribution from chi-squared distribution
simt= rt(10000,10) # simulate from a t-distribution with 10 degrees of freedom
#Plots in 2 graphics windows
par(mfrow=c(2,2)) #set up graphics page, 2x2 table
hist(simn, main="Symmetric distribution", xlab="")
qqnorm(simn)
qqline(simn)
hist(simt, main="Symmetric with fat tails", xlab="")
qqnorm(simt)
qqline(simt)
windows() #second graphics windows pops up
par(mfrow=c(2,2))
hist(simchi, main="Positive skew", xlab="")
qqnorm(simchi)
qqline(simchi)
hist(simchi2, main="Negative skew", xlab="")
qqnorm(simchi2)
qqline(simchi2)
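To see what qqnorm() is doing, it can help to rebuild the plot by hand. The sketch below is only an illustration on a small simulated sample (the seed and sample size are arbitrary choices of mine): it plots the sorted data against the standard normal quantiles at the probabilities given by ppoints(), and then checks that against what qqnorm() returns.

```r
# Sketch: rebuild the normal q-q plot by hand for a small simulated sample
set.seed(1)                       # arbitrary seed, only for reproducibility
y  <- rt(200, 10)                 # illustrative sample from a t-distribution, 10 df
th <- qnorm(ppoints(length(y)))   # theoretical N(0,1) quantiles, in increasing order
plot(th, sort(y), xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qq <- qqnorm(y, plot.it = FALSE)  # the coordinates qqnorm() would plot
all.equal(sort(qq$x), th)         # compare the theoretical quantiles
all.equal(sort(qq$y), sort(y))    # compare the sample quantiles
```

If the data come from a normal distribution the points fall close to a straight line. Fat tails bend the two ends away from the line in opposite directions, while skew bends both ends the same way.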

Once your data is ready for analysis, you need to obtain the descriptive statistics.
Video with examples showing how to obtain the following in R:

#summary stats (mean, median, min, max, sd, quantile, range, skewness, kurtosis; not the mode) for one variable: vector and dataframe
#individual stats for observations in a vector/dataframe
#individual stats for subset of variables in a dataframe
#summary stats for a continuous variable over a factor/group
#frequency table applicable for factors


#summary stats for one variable: vector and dataframe
x=c(10,15,18,25,30)
summary(x)
ToothGrowth # len = continuous, supp = nominal, dose = ordinal/cts
summary(ToothGrowth) # summary of every column in the dataframe


#individual stats for observations in a vector
mean(x)
sd(x) # other commands are: median, min, max, quantile, range
# to extend to skewness and kurtosis install moments package
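install.packages("moments") # package name must be quoted; only needed once
library(moments) # load the package so skewness() and kurtosis() are available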
skewness(x)
kurtosis(x)
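If you would rather not install a package, the moment-based versions of both statistics are short enough to write in base R. This is only a sketch: skew_m and kurt_m are helper names I made up, and I am assuming the plain moment definitions (central moments divided by n), so compare against the moments package before relying on the numbers agreeing exactly.

```r
# Sketch: moment-based skewness and kurtosis in base R (no package needed)
# skew_m and kurt_m are hypothetical helper names, not from any package
skew_m <- function(v) {
  d <- v - mean(v)                # deviations from the mean
  mean(d^3) / mean(d^2)^1.5       # third central moment, standardised
}
kurt_m <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2         # fourth central moment, standardised (not excess kurtosis)
}
x <- c(10, 15, 18, 25, 30)
skew_m(x)
kurt_m(x)
```

With this definition a normal distribution has kurtosis 3, so values above 3 point to fatter tails than the normal.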


#individual stats for variables in a dataframe
# we show this for the mean. In place of the mean, you can use
# median, min, max, quantile, range, skewness, kurtosis

mean(ToothGrowth) # no longer works: current R returns NA with a warning for a dataframe
mean(ToothGrowth$len)
skewness(ToothGrowth$len)
kurtosis(ToothGrowth$len)


#format for sapply is sapply(data, stat) where stat can be:
# mean, sd, min, max, median, range,quantile, skewness, kurtosis
sapply(ToothGrowth[,c(1,3)], mean) # mean for vars in col 1 and 3


#summary stats by factor/group
# use split() in sapply()
# sapply(split(variable, factor), stat)
sapply(split(ToothGrowth$len, ToothGrowth$supp), mean)
with(ToothGrowth, sapply(split(len,supp), mean)) #same but using with() instead of df$var
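Two other base R routes to the same kind of group summary are tapply() and aggregate(). The sketch below is just one way of doing it; m1 and by_grp are names I picked for illustration. aggregate() is handy when you want more than one statistic, or more than one grouping factor, in a single table.

```r
# Sketch: group summaries with tapply() and aggregate()
m1 <- with(ToothGrowth, tapply(len, supp, mean))   # mean of len for each level of supp
m1
# mean and sd of len for every supp x dose combination
by_grp <- aggregate(len ~ supp + dose, data = ToothGrowth,
                    FUN = function(v) c(mean = mean(v), sd = sd(v)))
by_grp
```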


#frequency table applicable for factors
with(ToothGrowth, table(dose,supp)) # with() is needed because dose and supp live inside the dataframe

Testing for an association/relationship/independence between two (qualitative) factors.