Introduction

The dataset is a collection of data from Snijders and Bosker (2012), adapted from the raw data of a 1989 study by H. P. Brandsma and W. M. Knuver, containing information on 4106 pupils at 216 schools; it is found in the R mice library (1). The 14 variables of the adapted dataset are listed below, featuring demographic information on the students and schools along with their pre- and post-test scores for language and arithmetic. Previously, we conducted an initial investigation of each of the features (2) with regard to their distributions and the relationships between them.

In that exploratory analysis, we found that eleven of the fourteen features in the dataset contained missing values, ranging from roughly 0.2% to 15.15% missing within each individual feature. Depending on the nature of the missing values, we intend to use three different imputation approaches: k-nearest neighbors imputation (for categorical features), regression-based imputation (for numeric features), and multiple imputation. We will construct two imputed datasets, one using k-nearest neighbors imputation combined with regression-based imputation, and the other using multiple imputation through the R mice library. After the missing values in both datasets have been imputed, we will compare their performance in a regression model.

Using the selected imputed dataset, we will then conduct several feature engineering procedures encompassing feature transformation, feature selection, and feature creation. We will correct the skewness issues identified in the initial exploratory data analysis (2) and standardize the features, then use recursive feature elimination from the caret package in R and LASSO-based feature selection through the glmnet package to identify features that could be eliminated when predicting the total post-test score as the response variable. Finally, we will combine the sparse categories identified in the initial EDA and examine the applicability of procedures such as K-means clustering and PCA to identify patterns and reduce the dimensionality of the data.

Missing Value Imputation

Missing values are an inevitable consequence of data collection, but they can lead to erroneous analysis if not properly handled. Simply eliminating observations with missing values discards valuable information and can introduce bias into the analysis. Imputation methods therefore offer a way to preserve the information originally collected while mitigating the issues that arise from missing values in the data.

Missing values can be categorized as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Data is considered missing completely at random if the probability of the information being missing is the same for all observations, missing at random if the missingness of the data depends only on information that has been collected in the data, and missing not at random if the missingness of the data is dependent on information that has not been collected. The first two categories can be adequately addressed through different imputation mechanisms.

The methods of missing value imputation that will be utilized in this analysis are k-nearest neighbors imputation (for categorical variables), regression imputation (for continuous variables), and multiple imputation. The first two methods will be used together to create one imputed dataset, and multiple imputation will be used to create a second. The two will then be compared.

k-nearest neighbors imputation was implemented using the VIM package. A k-nearest neighbors approach finds the k observations that are “closest” to the observation with a missing value according to a distance metric and uses their values to impute the missing value (3). Regression imputation uses the observed values of identified predictors in the dataset to fit a linear regression model for a selected response variable, which is then used to predict values for the missing responses. This method is limited by the presence of missing values in the predictors themselves, as we will see below. Finally, multiple imputation generates several plausible values for each missing value, which allows the uncertainty in the estimation of the missing values to be quantified (4). The completed datasets are analyzed separately, and the resulting estimates are then pooled. The R library mice, which stands for multivariate imputation by chained equations, imputes each variable iteratively using a method suited to its type while preserving the structure of the dataset.

An initial summary of the dataset reveals the number of missing values for each of the individual features as well as some preliminary statistics.

##       sch             pup            iqv               iqp          
##  Min.   :  1.0   Min.   :   1   Min.   :-7.8535   Min.   :-6.72275  
##  1st Qu.: 66.0   1st Qu.:1073   1st Qu.:-1.3535   1st Qu.:-1.72275  
##  Median :132.0   Median :2128   Median : 0.1465   Median :-0.05608  
##  Mean   :129.2   Mean   :2121   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.:187.0   3rd Qu.:3170   3rd Qu.: 1.1465   3rd Qu.: 1.60392  
##  Max.   :259.0   Max.   :4214   Max.   : 6.6465   Max.   : 6.61058  
##                                 NA's   :17        NA's   :8         
##    sex            ses          min        rpg            lpr       
##  0   :2096   Min.   :-17.667   0:3868   0   :3562   Min.   : 9.00  
##  1   :2000   1st Qu.: -7.667   1: 238   1   : 521   1st Qu.:30.00  
##  NA's:  10   Median : -1.667            2   :  10   Median :35.00  
##              Mean   :  0.000            NA's:  13   Mean   :34.16  
##              3rd Qu.:  8.333                        3rd Qu.:39.00  
##              Max.   : 22.333                        Max.   :49.00  
##              NA's   :137                            NA's   :320    
##       lpo             apr            apo          den            ssi       
##  Min.   : 8.00   Min.   : 1.0   Min.   : 2.00   1   :1200   Min.   :10.00  
##  1st Qu.:36.00   1st Qu.: 9.0   1st Qu.:15.00   2   :1489   1st Qu.:16.00  
##  Median :42.00   Median :12.0   Median :21.00   3   : 983   Median :18.00  
##  Mean   :41.33   Mean   :11.9   Mean   :19.76   4   : 185   Mean   :18.73  
##  3rd Qu.:48.00   3rd Qu.:15.0   3rd Qu.:25.00   NA's: 249   3rd Qu.:22.00  
##  Max.   :58.00   Max.   :20.0   Max.   :30.00               Max.   :30.00  
##  NA's   :204     NA's   :309    NA's   :200                 NA's   :622

As seen above, the features iqv, iqp, sex, ses, rpg, lpr, lpo, apr, apo, den, and ssi all have missing values. The feature with the fewest missing values is iqp, which represents the performal IQ of the student and has eight missing values, and the feature with the most missing values is ssi, which represents a score for the socioeconomic status of the school and has 622 missing values. The features without missing values are sch and pup, the unique identification numbers for the school and the pupil represented by each observation, and min, which represents the minority status of the student.

## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##       ssi 0.151485631
##       lpr 0.077934730
##       apr 0.075255723
##       den 0.060642962
##       lpo 0.049683390
##       apo 0.048709206
##       ses 0.033365806
##       iqv 0.004140283
##       rpg 0.003166098
##       sex 0.002435460
##       iqp 0.001948368
##       sch 0.000000000
##       pup 0.000000000
##       min 0.000000000

A k-nearest neighbors algorithm can impute a missing categorical value by assigning the value most common among the k nearest neighbors of the point with the missing data. We will use it to impute values for the categorical features with missing values (sex, rpg, and den) using the kNN() function in the VIM package with k = 3.
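
A minimal sketch of this call is shown below; the data frame name brandsma.data and the choice to drop the indicator columns (imp_var = FALSE) are assumptions for illustration.

library(VIM)

# Impute only the categorical features sex, rpg, and den using the 3 nearest neighbors;
# imp_var = FALSE drops the logical indicator columns that kNN() would otherwise append
brandsma.knn <- kNN(brandsma.data, variable = c("sex", "rpg", "den"),
                    k = 3, imp_var = FALSE)
summary(brandsma.knn)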

##       sch             pup            iqv               iqp           sex     
##  Min.   :  1.0   Min.   :   1   Min.   :-7.8535   Min.   :-6.72275   0:2101  
##  1st Qu.: 66.0   1st Qu.:1073   1st Qu.:-1.3535   1st Qu.:-1.72275   1:2005  
##  Median :132.0   Median :2128   Median : 0.1465   Median :-0.05608           
##  Mean   :129.2   Mean   :2121   Mean   : 0.0000   Mean   : 0.00000           
##  3rd Qu.:187.0   3rd Qu.:3170   3rd Qu.: 1.1465   3rd Qu.: 1.60392           
##  Max.   :259.0   Max.   :4214   Max.   : 6.6465   Max.   : 6.61058           
##                                 NA's   :17        NA's   :8                  
##       ses          min      rpg           lpr             lpo       
##  Min.   :-17.667   0:3868   0:3575   Min.   : 9.00   Min.   : 8.00  
##  1st Qu.: -7.667   1: 238   1: 521   1st Qu.:30.00   1st Qu.:36.00  
##  Median : -1.667            2:  10   Median :35.00   Median :42.00  
##  Mean   :  0.000                     Mean   :34.16   Mean   :41.33  
##  3rd Qu.:  8.333                     3rd Qu.:39.00   3rd Qu.:48.00  
##  Max.   : 22.333                     Max.   :49.00   Max.   :58.00  
##  NA's   :137                         NA's   :320     NA's   :204    
##       apr            apo        den           ssi       
##  Min.   : 1.0   Min.   : 2.00   1:1227   Min.   :10.00  
##  1st Qu.: 9.0   1st Qu.:15.00   2:1686   1st Qu.:16.00  
##  Median :12.0   Median :21.00   3:1005   Median :18.00  
##  Mean   :11.9   Mean   :19.76   4: 188   Mean   :18.73  
##  3rd Qu.:15.0   3rd Qu.:25.00            3rd Qu.:22.00  
##  Max.   :20.0   Max.   :30.00            Max.   :30.00  
##  NA's   :309    NA's   :200              NA's   :622

As we can see from the summary() function, the missing values for the categorical features identified above have been imputed.

We continue with regression imputation. We create a response variable, post, defined as the total post-test score for each student and obtained by adding the language and arithmetic post-test scores together. The rest of the variables are selected as predictors.
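
A sketch of this setup, assuming the kNN-imputed data frame is named brandsma.knn as above and the working dataset for regression imputation is named reg.data (both names are illustrative):

# Total post-test score: language post-test plus arithmetic post-test
brandsma.knn$post <- brandsma.knn$lpo + brandsma.knn$apo

# Keep the predictors and response, dropping observations with missing predictor values
pred_cols <- c("iqv", "iqp", "sex", "ses", "min", "rpg", "lpr", "apr", "den", "ssi")
reg.data <- brandsma.knn[complete.cases(brandsma.knn[, pred_cols]),
                         c(pred_cols, "post")]
summary(reg.data)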

##       iqv                iqp           sex           ses           min     
##  Min.   :-7.85351   Min.   :-6.72275   0:1606   Min.   :-17.6667   0:2912  
##  1st Qu.:-1.35351   1st Qu.:-1.72275   1:1489   1st Qu.: -7.6667   1: 183  
##  Median : 0.14649   Median :-0.05608            Median : -2.6667           
##  Mean   :-0.07096   Mean   :-0.04532            Mean   : -0.2906           
##  3rd Qu.: 1.14649   3rd Qu.: 1.60392            3rd Qu.:  7.3333           
##  Max.   : 6.14649   Max.   : 6.61058            Max.   : 22.3333           
##                                                                            
##  rpg           lpr             apr        den           post      
##  0:2696   Min.   :10.00   Min.   : 1.00   1:1014   Min.   :13.00  
##  1: 392   1st Qu.:30.00   1st Qu.: 9.00   2:1091   1st Qu.:50.00  
##  2:   7   Median :35.00   Median :12.00   3: 835   Median :62.00  
##           Mean   :34.06   Mean   :11.82   4: 155   Mean   :60.46  
##           3rd Qu.:39.00   3rd Qu.:14.00            3rd Qu.:72.00  
##           Max.   :49.00   Max.   :20.00            Max.   :88.00  
##                                                    NA's   :174    
##       ssi       
##  Min.   :10.00  
##  1st Qu.:16.00  
##  Median :18.00  
##  Mean   :18.73  
##  3rd Qu.:22.00  
##  Max.   :30.00  
## 

With the observations that had missing values in the predictors removed, the response variable post still has 174 missing values, which we will impute through regression imputation.
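
A minimal sketch of the regression imputation step, continuing with the illustrative reg.data object from above:

# Fit the imputation model on the observations where post is observed
post_lm <- lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi,
              data = reg.data)

# Predict and fill in the missing post values
missing_post <- is.na(reg.data$post)
reg.data$post[missing_post] <- predict(post_lm, newdata = reg.data[missing_post, ])
summary(reg.data)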

##       iqv                iqp           sex           ses           min     
##  Min.   :-7.85351   Min.   :-6.72275   0:1606   Min.   :-17.6667   0:2912  
##  1st Qu.:-1.35351   1st Qu.:-1.72275   1:1489   1st Qu.: -7.6667   1: 183  
##  Median : 0.14649   Median :-0.05608            Median : -2.6667           
##  Mean   :-0.07096   Mean   :-0.04532            Mean   : -0.2906           
##  3rd Qu.: 1.14649   3rd Qu.: 1.60392            3rd Qu.:  7.3333           
##  Max.   : 6.14649   Max.   : 6.61058            Max.   : 22.3333           
##  rpg           lpr             apr        den           post      
##  0:2696   Min.   :10.00   Min.   : 1.00   1:1014   Min.   :11.06  
##  1: 392   1st Qu.:30.00   1st Qu.: 9.00   2:1091   1st Qu.:50.00  
##  2:   7   Median :35.00   Median :12.00   3: 835   Median :62.00  
##           Mean   :34.06   Mean   :11.82   4: 155   Mean   :60.14  
##           3rd Qu.:39.00   3rd Qu.:14.00            3rd Qu.:72.00  
##           Max.   :49.00   Max.   :20.00            Max.   :88.85  
##       ssi       
##  Min.   :10.00  
##  1st Qu.:16.00  
##  Median :18.00  
##  Mean   :18.73  
##  3rd Qu.:22.00  
##  Max.   :30.00

We can see that removing observations where the predictor variables had missing values has led to a substantial loss of information: the imputed dataset has only 3095 observations as opposed to the original dataset's 4106, a 24.6% loss.

We now perform multiple imputation with the R library mice, setting m = 5.
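
A sketch of the multiple imputation step, assuming the raw data frame is named brandsma.data, default imputation methods are used (the method vector printed below shows what mice selected for each variable), and post is added to the completed data afterwards; the object names and seed are illustrative.

library(mice)

# Multiple imputation with m = 5 completed datasets
imp <- mice(brandsma.data, m = 5, seed = 552, printFlag = FALSE)
imp$method                                     # imputation method chosen per variable

# Extract one completed dataset for the later feature engineering steps
mice.brandsma <- complete(imp, 1)
mice.brandsma$post <- mice.brandsma$lpo + mice.brandsma$apo   # total post-test score
str(mice.brandsma)
summary(mice.brandsma)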

## Warning: Number of logged events: 1
##       sch       pup       iqv       iqp       sex       ses       min       rpg 
##        ""        ""     "pmm"     "pmm"  "logreg"     "pmm"        "" "polyreg" 
##       lpr       lpo       apr       apo       den       ssi 
##     "pmm"     "pmm"     "pmm"     "pmm" "polyreg"     "pmm"
## 'data.frame':    4106 obs. of  15 variables:
##  $ iqv : num  -1.35 2.15 3.15 2.65 -2.35 ...
##  $ iqp : num  -3.7227 3.2773 1.2739 -1.0561 -0.0561 ...
##  $ sex : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
##  $ ses : num  -17.67 -4.67 -4.67 -17.67 -12.67 ...
##  $ min : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 2 2 1 ...
##  $ rpg : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lpr : num  33 44 36 36 33 29 19 22 20 44 ...
##  $ lpo : num  35 50 46 45 33 46 20 30 30 57 ...
##  $ apr : num  10 18 14 12 10 13 8 8 7 17 ...
##  $ apo : num  12 30 24 19 24 26 9 13 13 30 ...
##  $ den : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ssi : num  11 11 11 11 11 11 11 11 11 11 ...
##  $ post: num  43 80 70 64 57 72 29 43 43 87 ...
##  $ pup : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sch : int  1 1 1 1 1 1 1 1 1 1 ...
##       iqv                iqp            sex           ses           min     
##  Min.   :-7.85351   Min.   :-6.722750   0:2100   Min.   :-17.6667   0:3868  
##  1st Qu.:-1.35351   1st Qu.:-1.722750   1:2006   1st Qu.: -7.6667   1: 238  
##  Median : 0.14649   Median :-0.056083            Median : -1.6667           
##  Mean   :-0.00256   Mean   :-0.000923            Mean   : -0.1209           
##  3rd Qu.: 1.14649   3rd Qu.: 1.603917            3rd Qu.:  8.3333           
##  Max.   : 6.64649   Max.   : 6.610584            Max.   : 22.3333           
##  rpg           lpr             lpo             apr             apo       
##  0:3572   Min.   : 9.00   Min.   : 8.00   Min.   : 1.00   Min.   : 2.00  
##  1: 524   1st Qu.:30.00   1st Qu.:35.00   1st Qu.: 9.00   1st Qu.:14.00  
##  2:  10   Median :35.00   Median :42.00   Median :12.00   Median :20.00  
##           Mean   :34.28   Mean   :41.14   Mean   :11.95   Mean   :19.62  
##           3rd Qu.:39.00   3rd Qu.:48.00   3rd Qu.:15.00   3rd Qu.:25.00  
##           Max.   :49.00   Max.   :58.00   Max.   :20.00   Max.   :30.00  
##  den           ssi             post            pup            sch       
##  1:1271   Min.   :10.00   Min.   :13.00   Min.   :   1   Min.   :  1.0  
##  2:1600   1st Qu.:16.00   1st Qu.:50.00   1st Qu.:1073   1st Qu.: 66.0  
##  3:1038   Median :18.00   Median :61.00   Median :2128   Median :132.0  
##  4: 197   Mean   :18.75   Mean   :60.14   Mean   :2121   Mean   :129.2  
##           3rd Qu.:22.00   3rd Qu.:72.00   3rd Qu.:3170   3rd Qu.:187.0  
##           Max.   :30.00   Max.   :88.00   Max.   :4214   Max.   :259.0

Using multiple imputation through mice, all of the missing values for all 4106 observations were imputed.

We construct the same multiple linear regression model on both imputed datasets, with post as the response and iqv, iqp, sex, ses, min, rpg, lpr, apr, den, and ssi as the predictors. In the model based on the mice-imputed data, each predictor is significant at the 0.05 level except min (and the sparse second level of rpg); in the model based on the kNN and regression-imputed data, min is only marginally significant (p = 0.049).
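
One way to fit and compare the two models is sketched below, assuming imp is the mids object from the mice() call above; the names mice_fits and knn_reg_fit are illustrative, while comp.imputedData is the kNN + regression-imputed dataset appearing in the lm() output below.

# Fit the same model on each of the m = 5 completed datasets, then pool the results
mice_fits <- lapply(1:5, function(i) {
  d <- complete(imp, i)
  d$post <- d$lpo + d$apo                     # total post-test score, as defined earlier
  lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi, data = d)
})
summary(pool(as.mira(mice_fits)))             # pooled coefficient estimates
pool.r.squared(as.mira(mice_fits))            # pooled R-squared with 95% bounds

# The corresponding model on the kNN + regression-imputed dataset
knn_reg_fit <- lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi,
                  data = comp.imputedData)
summary(knn_reg_fit)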

## # A tibble: 70 × 6
##    term        estimate std.error statistic   p.value  nobs
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
##  1 (Intercept)  16.2       1.24     13.1    1.16e- 38  4106
##  2 iqv           1.21      0.0896   13.5    1.51e- 40  4106
##  3 iqp           0.834     0.0748   11.2    1.69e- 28  4106
##  4 sex1          0.993     0.278     3.58   3.53e-  4  4106
##  5 ses           0.119     0.0147    8.09   7.67e- 16  4106
##  6 min1          0.268     0.602     0.445  6.56e-  1  4106
##  7 rpg1         -3.60      0.426    -8.44   4.19e- 17  4106
##  8 rpg2         -0.0959    2.76     -0.0348 9.72e-  1  4106
##  9 lpr           0.744     0.0290   25.6    2.65e-134  4106
## 10 apr           1.04      0.0496   21.1    1.51e- 93  4106
## # ℹ 60 more rows
##           est     lo 95     hi 95       fmi
## R^2 0.6574773 0.6398391 0.6744685 0.0370674
## 
## Call:
## lm(formula = post ~ iqv + iqp + sex + ses + min + rpg + lpr + 
##     apr + den + ssi, data = comp.imputedData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.914  -5.555   0.083   5.870  29.218 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.28956    1.37364  12.587  < 2e-16 ***
## iqv          1.30204    0.09968  13.062  < 2e-16 ***
## iqp          0.81934    0.08388   9.768  < 2e-16 ***
## sex1         1.14646    0.31109   3.685 0.000232 ***
## ses          0.13092    0.01676   7.811 7.74e-15 ***
## min1         1.31255    0.66532   1.973 0.048607 *  
## rpg1        -3.87468    0.47827  -8.102 7.74e-16 ***
## rpg2         3.06984    3.21076   0.956 0.339092    
## lpr          0.73159    0.03232  22.638  < 2e-16 ***
## apr          0.99006    0.05474  18.088  < 2e-16 ***
## den2         3.51989    0.37384   9.415  < 2e-16 ***
## den3         1.31260    0.39670   3.309 0.000948 ***
## den4         1.78517    0.75532   2.363 0.018166 *  
## ssi          0.24385    0.04058   6.010 2.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.437 on 3081 degrees of freedom
## Multiple R-squared:  0.6678, Adjusted R-squared:  0.6664 
## F-statistic: 476.4 on 13 and 3081 DF,  p-value: < 2.2e-16

We see that the model based on the dataset created through kNN and regression imputation has a higher R-squared estimate, 0.6678, than the model based on the dataset imputed through multiple imputation. However, this makes sense: the missing values of post were imputed using a regression model built from these same predictors, which likely biases the performance estimate upwards for the regression-imputed dataset. In addition, that model was fit on only 3095 observations, whereas the mice-imputed dataset preserved all 4106 observations from the original data. The R-squared estimate for the kNN and regression-imputed dataset also falls within the confidence bounds of the pooled R-squared from the mice-imputed model, so there is no clear evidence of significantly better performance for the dataset created through single imputation methods. Therefore, for the rest of the analysis, we proceed with the dataset created through multiple imputation.

Feature Engineering

The feature engineering methods that will be utilized in this analysis all fall in one of three categories:

  1. Feature Transformation
  2. Feature Selection
  3. Feature Creation

Feature Transformation

The variables identified as having notable skewness in the initial analysis were the following: left skew was observed in lpr, lpo, apo, and possibly iqv, while evidence of right skew appeared in ssi and ses.

We can examine their distributions in the following histograms.

As we can see, there is clear left skew in lpr and lpo, though all of the distributions do appear unimodal. We conduct Box-Cox transformations on each of the above features to create transformed versions, shifting iqv and ses by 8 and 18 respectively to keep them positive, since the Box-Cox transformation only works for strictly positive values.

par(mfrow = c(2, 3))

# boxcox() is provided by the MASS package.
# Shift iqv and ses so that every variable is strictly positive before the
# Box-Cox transformation; the remaining variables are already positive.
shifts <- c(iqv = 8, ses = 18, lpr = 0, lpo = 0, apo = 0, ssi = 0)

for (v in names(shifts)) {
  shifted <- mice.brandsma[[v]] + shifts[[v]]

  # Profile the Box-Cox log-likelihood and pick the lambda that maximizes it
  bc <- boxcox(lm(shifted ~ 1), lambda = seq(0, 3, by = 0.1))
  title("Box-Cox Transformation")
  optimal_lambda <- bc$x[which.max(bc$y)]

  # Apply the transformation, storing the result in a new "t"-prefixed column
  mice.brandsma[[paste0("t", v)]] <- if (optimal_lambda == 0) {
    log(shifted)
  } else {
    (shifted^optimal_lambda - 1) / optimal_lambda
  }
}

h1 <- ggplot(mice.brandsma, aes(x = tiqv)) + geom_histogram(color = "#32373B", fill = "#C83E4D", bins = 15) 
h2 <- ggplot(mice.brandsma, aes(x = tses)) + geom_histogram(color = "#32373B", fill = "#F4D6CC", bins = 15)
h3 <- ggplot(mice.brandsma, aes(x = tlpr)) + geom_histogram(color = "#32373B", fill = "#F4B860", bins = 15)
h4 <- ggplot(mice.brandsma, aes(x = tlpo)) + geom_histogram(color = "#32373B", fill = "#C83E4D", bins = 15)
h5 <- ggplot(mice.brandsma, aes(x = tapo)) + geom_histogram(color = "#32373B", fill = "#F4D6CC", bins = 15)
h6 <- ggplot(mice.brandsma, aes(x = tssi)) + geom_histogram(color = "#32373B", fill = "#F4B860", bins = 15)

grid.arrange(h1, h2, h3, h4, h5, h6, nrow = 2, ncol = 3)

The transformed values of iqv, lpr, lpo, and ssi do appear more symmetric; however, apo and ses still do not appear to be normally distributed.

We then create a function to standardize each of the variables based on its mean and standard deviation, centering it at 0. Algorithms based on Euclidean distances, such as k-means clustering and principal component analysis, can give more weight to features whose units produce larger magnitudes than to those with smaller magnitudes. Standardizing the features can improve the performance of these methods.
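
A minimal sketch of the standardization step; the exact set of numeric columns (num_cols) is an assumption for illustration.

# Center at 0 and scale to unit standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

# Apply to the transformed and untransformed numeric features (illustrative column set)
num_cols <- c("tiqv", "iqp", "tses", "tlpr", "tlpo", "tapo", "tssi", "apr")
mice.brandsma[num_cols] <- lapply(mice.brandsma[num_cols], standardize)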

Feature Selection

Two methods of feature selection will be explored in this analysis. The first is recursive feature elimination from the caret package. This method can be computationally intensive and slow to run: it uses backwards selection with random forest models to consider subsets of features and their interactions, and uses cross-validation to select the optimal set of features.
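
A sketch of the recursive feature elimination call, assuming random-forest ranking functions, 10-fold cross-validation, and the candidate feature set below; the seed and control settings are illustrative.

library(caret)

# Candidate predictors and the response
rfe_x <- mice.brandsma[, c("tiqv", "iqp", "sex", "tses", "min", "rpg",
                           "tlpr", "apr", "den", "tssi")]
rfe_y <- mice.brandsma$post

# Backwards selection with random forests, evaluated by 10-fold cross-validation
set.seed(552)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit <- rfe(x = rfe_x, y = rfe_y, sizes = 1:10, rfeControl = ctrl)
predictors(rfe_fit)   # the selected feature subset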

##  [1] "apr"  "tlpr" "tiqv" "iqp"  "rpg"  "tses" "tssi" "den"  "min"  "sex"

The features selected based on this method are the following: apr, tlpr, tiqv, iqp, tses, rpg, tssi, den, min, and sex.

The other method of feature selection we will use is the LASSO, implemented through the glmnet package. We use the same response variable, post.
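
A sketch of the LASSO-based selection, assuming the penalty is chosen by cross-validation and the factors are expanded into dummy variables with model.matrix(); the exact encoding of the categorical variables is an assumption.

library(glmnet)

# glmnet requires a numeric design matrix, so expand factors into dummy columns
lasso_x <- model.matrix(post ~ sex + min + rpg + den + tlpr + tses + tssi + tiqv + apr + iqp,
                        data = mice.brandsma)[, -1]
lasso_y <- mice.brandsma$post

# Cross-validated LASSO (alpha = 1); keep the terms with nonzero coefficients
set.seed(552)
cv_fit <- cv.glmnet(lasso_x, lasso_y, alpha = 1)
lasso_coefs <- coef(cv_fit, s = "lambda.min")
rownames(lasso_coefs)[as.vector(lasso_coefs) != 0]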

##  [1] "(Intercept)" "sex"         "min"         "rpg"         "den"        
##  [6] "tlpr"        "tses"        "tssi"        "tiqv"        "apr"        
## [11] "iqp"

The features selected are the following: sex, min, rpg, den, tlpr, tses, tssi, tiqv, apr, and iqp.

Both of these methods retained all of the features that were originally included, which may be due to the large sample size of the dataset. Small p-values were also observed for all of these predictors except min in the linear models constructed earlier.

Feature Creation

The process of feature creation in this report will consist of a few analytic tasks.

To begin, sparse categories were identified in the original exploratory analysis for the variable rpg, where only 10 observations had repeated 2 groups. Therefore, we create a binary variable, has_repeated_group, that identifies whether an observation repeated any groups, combining rpg values of 1 and 2 into one category.
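
A minimal sketch of this recoding; the exact construction is an assumption, but it is consistent with the counts shown below.

# Combine rpg levels 1 and 2 into a single "repeated a group" category
mice.brandsma$has_repeated_group <- factor(ifelse(mice.brandsma$rpg == "0", 0, 1))
table(mice.brandsma$has_repeated_group)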

##  sex      min      rpg      den     
##  0:2100   0:3868   0:3572   1:1271  
##  1:2006   1: 238   1: 524   2:1600  
##                    2:  10   3:1038  
##                             4: 197
##    0    1 
## 3572  534

The categories for min, has_repeated_group, and den are still unbalanced in terms of sample size. However, with no meaningful way to combine the categories of den and only two categories in the other features, any issues that arise from these imbalances would be better mitigated through analytic techniques other than category regrouping.

PCA, or principal component analysis, can be used to simplify the dataset, especially when it contains a number of highly correlated variables. It creates new variables that are linear combinations of the original variables, are uncorrelated with each other, and concentrate as much of the variance as possible in the first few principal components. We will investigate its applicability to this dataset. We will also use clustering methods such as k-means clustering in combination with PCA to create plots of the clusters and investigate whether well-defined groupings exist in the dataset.

To begin, we will look at the correlations between the numerical features in the data.

There are some features that show notable positive correlations, such as the transformed language pre-score and the transformed language post-score, which have a correlation of 0.715, and the transformed arithmetic post-score and the transformed language post-score, which have a correlation of 0.709. However, we do not see any features that are very highly correlated (r = 0.8-1.0).

Highly correlated features can create issues in regression analysis, leading to multicollinearity and overfitting. Principal component analysis uses linear combinations of the features to create “principal components” that are uncorrelated with one another. This can be used to reduce the dimensionality of the data, capturing much of its variability in fewer principal components than the number of original variables and simplifying future analysis.

We will attempt PCA on the standardized numeric features in this dataset to examine its usefulness.
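
A sketch of the PCA step on the standardized numeric features, reusing the num_cols set assumed in the standardization sketch above:

# PCA on the standardized numeric features; scaling again here is harmless
pca <- prcomp(mice.brandsma[, num_cols], center = TRUE, scale. = TRUE)

summary(pca)                                    # variance explained per component
screeplot(pca, type = "lines", main = "Scree Plot")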

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0287 1.0904 0.8347 0.73195 0.70081 0.63768 0.59461
## Proportion of Variance 0.5144 0.1486 0.0871 0.06697 0.06139 0.05083 0.04419
## Cumulative Proportion  0.5144 0.6631 0.7502 0.81715 0.87854 0.92937 0.97356
##                            PC8
## Standard deviation     0.45989
## Proportion of Variance 0.02644
## Cumulative Proportion  1.00000

Using PCA, we can see that the first two principal components capture 66.3% of the variance in the dataset, the first three capture 75%, and the first four capture over 80%. The scree plot shows that after the first principal component, the decrease in variance with each added component slows drastically and proceeds more or less evenly between PC3 and PC8. The elbow in the scree plot appears around the second or third principal component, and if capturing 75% of the variability in the dataset is acceptable, the first 3 PCs can be used for future analysis. A threshold of 85% would require 5 PCs, and a threshold of 95% would require 7 PCs. In general, while PCA is certainly usable, the features do not show enough correlation among themselves for it to be considered extremely effective, and its applicability should be weighed against interpretability and its relevance to the response variable chosen in future regression analysis.

We will also use the first two PCs to visualize clustering algorithms such as k-means clustering. K-means is a distance-based algorithm that partitions the points in a dataset into groups so as to minimize the distance between each point and the centroid of its group (4). We will use an elbow plot to identify the ideal number of clusters.
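
A sketch of the elbow plot, computing the total within-cluster sum of squares for k = 1 to 10 on the standardized numeric features; the number of random starts and the seed are illustrative.

# Total within-cluster sum of squares for each candidate number of clusters
set.seed(552)
wss <- sapply(1:10, function(k) {
  kmeans(mice.brandsma[, num_cols], centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares", main = "Elbow Plot")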

## Warning: did not converge in 1 iteration

We see an obvious drop off from 1 to 2 clusters, then a smoother decline for higher values of k. The most appropriate elbow point seems to be around k=3.

We can graph the dataset across its first and second principal components and then visualize the clusters using clusplot() from the R library cluster.
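
A sketch of this visualization, assuming k = 3 clusters based on the elbow plot above; clusplot() projects the data onto its first two principal components.

library(cluster)

# Fit k-means with the chosen k, then plot the clusters over the first two PCs
set.seed(552)
km <- kmeans(mice.brandsma[, num_cols], centers = 3, nstart = 10)
clusplot(mice.brandsma[, num_cols], km$cluster,
         color = TRUE, shade = TRUE, labels = 0, lines = 0,
         main = "K-means Clusters over the First Two Principal Components")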

Looking purely at the data points when graphed across the principal components, we do not see much evidence of explicit clustering in this dataset. The clusters identified by k-means clustering appear to split horizontally across varying values of PC1 while depending very little on the values of PC2. Without much evidence in favor of clustering in the scatterplot over the first and second principal components, clustering methods do not seem to be the most appropriate techniques to apply in future analysis of this data.

Conclusion

This analysis of the brandsma dataset on elementary school pupils and their test scores in language and arithmetic is a follow-up to an initial exploratory analysis (2), intended to create a final analysis dataset for future modeling and analytic activities. This report can be broken into two main sections: 1) missing data imputation and 2) feature engineering.

For missing data imputation, we used k-nearest neighbors imputation, regression imputation, and multiple imputation through the R mice library. In order to retain as much information as possible, we chose the multiply imputed data for the final analytic dataset. However, multiple imputation through mice has limitations, including the possible propagation of imputation errors through the data and its inability to handle data that are missing not at random (MNAR).

For feature engineering, analytic tasks were split between three main categories: 1) feature transformation, 2) feature selection, and 3) feature creation. Skewed features were transformed through Box-Cox transformations and then standardized to increase the applicability of distance-based algorithms. Two feature selection algorithms indicated that all features in the dataset are worth retaining for future analysis. Sparse categories were handled through regrouping, and PCA and clustering algorithms were evaluated for their relevance to future analysis. While PCA may be applicable depending on the goals of future analysis, there did not appear to be strong evidence supporting the use of clustering in future analytic tasks.

The created analytic dataset can now be used for future model building and prediction.

References and Appendix

  1. https://amices.org/mice/reference/brandsma.html

  2. https://xiang-a.github.io/sta552-project1/

  3. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-016-0318-z

  4. https://www.ibm.com/think/topics/k-means-clustering