Thursday, December 20, 2007

Data Exploration: State Education Data

Again, we will want to read our data into R. The data, called StateEducation.sav, are in the SPSS (.sav) format. In order to read data in the .sav format into R, we need to use a function that is available in the

foreign

library. Load the library and read in the data using the following commands.


> library(foreign)
> education=read.spss("StateEducation.sav", to.data.frame=T)
> attach(education)



Next, we will summarize the B2005 variable by finding the mean and median.


> mean(B2005)
[1] 27.35098
> median(B2005)
[1] 25.8



To obtain the trimmed mean, we add the argument

trim=x

to the

mean

command, where x is the fraction (0 to 0.5) of observations to be trimmed from each end of the data before the mean is computed.


> mean(B2005, trim=.2)
[1] 26.84194



Unfortunately, R does not have a native function to compute the winsorized mean. To obtain the winsorized mean, we have to create our own function. You can copy and paste the following lines of code into R.


win<-function(x,tr=.2){
#
# Compute the gamma Winsorized mean for the data in the vector x.
# tr is the amount of Winsorization
#
   y<-sort(x)
   n<-length(x)
   ibot<-floor(tr*n)+1
   itop<-length(x)-ibot+1
   xbot<-y[ibot]
   xtop<-y[itop]
   y<-ifelse(y<=xbot,xbot,y)
   y<-ifelse(y>=xtop,xtop,y)
   win<-mean(y)
   win
}



To obtain the Winsorized mean we can now employ the function that we just put into R.


> win(B2005, tr=.20)
[1] 26.94314



Both the variance and standard deviation can be obtained in R by using the following syntax.


> var(B2005)
[1] 33.22055
> sd(B2005)
[1] 5.763727



To obtain a robust estimate of the variance of the trimmed mean - which can be based on the Winsorized sum of squared deviations - we will again copy and paste the following function into R.


winvar<-function(x,tr=.2,na.rm=F){
#
# Compute the gamma Winsorized variance for the data in the vector x.
# tr is the amount of Winsorization which defaults to .2.
#
   if(na.rm)x<-x[!is.na(x)]
   y<-sort(x)
   n<-length(x)
   ibot<-floor(tr*n)+1
   itop<-length(x)-ibot+1
   xbot<-y[ibot]
   xtop<-y[itop]
   y<-ifelse(y<=xbot,xbot,y)
   y<-ifelse(y>=xtop,xtop,y)
   winvar<-var(y)
   winvar
}



We can then find the Winsorized variance and for the B2005 data by using,


> winvar(B2005)
[1] 8.514902



To get the standard error for the mean we can employ the formula from the notes as,


> sd(B2005)/sqrt(length(B2005))
[1] 0.8070832



The robust estimate of the standard error for the trimmed mean can be obtained by again copying and pasting the following formula into R


trimse<-function(x,tr=.2,na.rm=F){
#
# Estimate the standard error of the gamma trimmed mean
# The default amount of trimming is tr=.2.
#
   if(na.rm)x<-x[!is.na(x)]
   trimse<-sqrt(winvar(x,tr))/((1-2*tr)*sqrt(length(x)))
   trimse
}



Then entering the following command.


> trimse(B2005, tr=.2)
[1] 0.68101



To obtain the mean plot and the mean plot with error bars, we need to use data that is in a different format. This data has been reformatted and is available as StateEducationlongFormat.sav. Read in this data and load the

gplots

library using the following commands.


> education2=read.spss("StateEducationLongFormat.sav", to.data.frame=T)
> library(gplots)



The mean plot can be obtained by entering,


> plotmeans(BACH~YEAR, data=education2, connect=T, p=0, n.label=F, xlab="Year", ylab="Percent", main="US Population Obtaining Bachelor's Degree")




The mean plot with error bars can be obtained by modifying the

p=0

argument to

p=.95

as follows,


> plotmeans(BACH~YEAR, data=education2, connect=T, p=.95, n.label=F, xlab="Year", ylab="Percent", main="US Population Obtaining Bachelor's Degree")


No comments: