The dataset is a collection of data from Snijders and Bosker (2012), adapted from the raw data of a 1989 study by H. P. Brandsma and W. M. Knuver, and contains information on 4106 pupils at 216 schools; it is available in the R mice library (1). The 14 variables of the adapted dataset are listed below, featuring demographic information on the students and schools as well as their pre- and post-test scores for language and arithmetic. Last time, we did some initial investigation of each of the features (2) with regard to their distributions and the relationships among them.
In this exploratory analysis, we found that eleven of the fourteen features in the dataset included missing values, ranging from roughly 0.2% to 15.15% missing within each individual feature. Depending on the nature of the missing values, we intend to use three different imputation algorithms: k-nearest neighbors imputation (for categorical features), regression-based imputation (for numeric features), and multiple imputation. We will construct two imputed datasets, one using k-nearest neighbors imputation combined with regression-based imputation, and the other using multiple imputation via the R mice library. After the missing values in both datasets have been imputed, we will compare their performance in a regression model.
Using the selected imputed dataset, we will then conduct several feature engineering procedures encompassing feature transformation, feature selection, and feature creation. We will correct the skewness issues identified in the initial exploratory data analysis (2) and standardize the features, then use recursive feature elimination from the caret package in R and LASSO-based feature selection through the glmnet package to identify features that could be eliminated when predicting the total post-test score as the response variable. Finally, we will combine sparse categories identified in the initial EDA and examine the applicability of procedures such as K-means clustering and PCA for identifying patterns and reducing the dimensionality of the data.
Missing values are an inevitable consequence of data collection, but can lead to erroneous analysis if not properly handled. Simply eliminating observations with missing values leads to loss of valuable information and can introduce bias into the analysis. Therefore, different imputation methods can be used as another means of preserving the original information collected and mitigating the issues that arise due to missing values in the data.
Missing values can be categorized as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Data is considered missing completely at random if the probability of the information being missing is the same for all observations, missing at random if the missingness of the data depends only on information that has been collected in the data, and missing not at random if the missingness of the data is dependent on information that has not been collected. The first two categories can be adequately addressed through different imputation mechanisms.
The methods of missing value imputation that will be utilized in this analysis are k-nearest neighbors imputation (for categorical variables), regression imputation (for continuous variables), and multiple imputation. The first two methods will be used together to create one imputed dataset, and the third will be used to create a second imputed dataset. The two will then be compared.
k-nearest neighbors imputation was implemented using the VIM package. A k-nearest neighbors approach finds the k observations that are “closest” in distance to an observation with a missing value and then uses their values to impute the missing value in the observation under consideration (3). Regression imputation uses the observed values of identified predictors in the dataset to fit a linear regression model for a selected response variable, which is then used to impute values for the missing entries of the response. This method can be limited by the existence of missing values in the predictors, as we will see below. Finally, multiple imputation generates multiple plausible values for each missing value, which allows the uncertainty in the estimation of the missing values to be quantified (4). The analysis is run on each of the completed datasets, and the resulting estimates are then pooled, for example using Rubin's rules. The R library mice, which stands for multivariate imputation by chained equations, can apply a different imputation method depending on the type of variable being imputed, iterating over the incomplete variables while preserving the structure of the dataset.
An initial summary of the dataset reveals the number of missing values for each of the individual features as well as some preliminary statistics.
## sch pup iqv iqp
## Min. : 1.0 Min. : 1 Min. :-7.8535 Min. :-6.72275
## 1st Qu.: 66.0 1st Qu.:1073 1st Qu.:-1.3535 1st Qu.:-1.72275
## Median :132.0 Median :2128 Median : 0.1465 Median :-0.05608
## Mean :129.2 Mean :2121 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.:187.0 3rd Qu.:3170 3rd Qu.: 1.1465 3rd Qu.: 1.60392
## Max. :259.0 Max. :4214 Max. : 6.6465 Max. : 6.61058
## NA's :17 NA's :8
## sex ses min rpg lpr
## 0 :2096 Min. :-17.667 0:3868 0 :3562 Min. : 9.00
## 1 :2000 1st Qu.: -7.667 1: 238 1 : 521 1st Qu.:30.00
## NA's: 10 Median : -1.667 2 : 10 Median :35.00
## Mean : 0.000 NA's: 13 Mean :34.16
## 3rd Qu.: 8.333 3rd Qu.:39.00
## Max. : 22.333 Max. :49.00
## NA's :137 NA's :320
## lpo apr apo den ssi
## Min. : 8.00 Min. : 1.0 Min. : 2.00 1 :1200 Min. :10.00
## 1st Qu.:36.00 1st Qu.: 9.0 1st Qu.:15.00 2 :1489 1st Qu.:16.00
## Median :42.00 Median :12.0 Median :21.00 3 : 983 Median :18.00
## Mean :41.33 Mean :11.9 Mean :19.76 4 : 185 Mean :18.73
## 3rd Qu.:48.00 3rd Qu.:15.0 3rd Qu.:25.00 NA's: 249 3rd Qu.:22.00
## Max. :58.00 Max. :20.0 Max. :30.00 Max. :30.00
## NA's :204 NA's :309 NA's :200 NA's :622
As seen above, the features iqv, iqp, sex, ses, rpg, lpr, lpo, apr, apo, den, and ssi all have missing values. The feature with the fewest missing values is iqp, the performal IQ of the student, with eight missing values; the feature with the most missing values is ssi, a score for the socioeconomic status of the school, with 622 missing values. The features without missing values are sch and pup, the unique identification numbers for the school and the pupil represented by each observation, and min, the minority status of the student.
## Warning in plot.aggr(res, ...): not enough vertical space to display
## frequencies (too many combinations)
##
## Variables sorted by number of missings:
## Variable Count
## ssi 0.151485631
## lpr 0.077934730
## apr 0.075255723
## den 0.060642962
## lpo 0.049683390
## apo 0.048709206
## ses 0.033365806
## iqv 0.004140283
## rpg 0.003166098
## sex 0.002435460
## iqp 0.001948368
## sch 0.000000000
## pup 0.000000000
## min 0.000000000
A k-nearest neighbors algorithm can impute missing values by assigning the value most common among the k nearest neighbors surrounding the point with the missing data. We will use it to impute values for the categorical features with missing values, namely sex, rpg, and den, using the kNN() function in the VIM package with k = 3.
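A minimal sketch of how this step could be implemented is shown below; the data frame name brandsma and the choice to drop the imputation indicator columns are illustrative assumptions.

# Sketch of the kNN imputation step (assumed object name: brandsma).
library(VIM)

# Impute only the categorical features using k = 3 nearest neighbors;
# imp_var = FALSE drops the TRUE/FALSE indicator columns kNN() would add.
knn.brandsma <- kNN(brandsma, variable = c("sex", "rpg", "den"), k = 3,
                    imp_var = FALSE)
summary(knn.brandsma)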
## sch pup iqv iqp sex
## Min. : 1.0 Min. : 1 Min. :-7.8535 Min. :-6.72275 0:2101
## 1st Qu.: 66.0 1st Qu.:1073 1st Qu.:-1.3535 1st Qu.:-1.72275 1:2005
## Median :132.0 Median :2128 Median : 0.1465 Median :-0.05608
## Mean :129.2 Mean :2121 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.:187.0 3rd Qu.:3170 3rd Qu.: 1.1465 3rd Qu.: 1.60392
## Max. :259.0 Max. :4214 Max. : 6.6465 Max. : 6.61058
## NA's :17 NA's :8
## ses min rpg lpr lpo
## Min. :-17.667 0:3868 0:3575 Min. : 9.00 Min. : 8.00
## 1st Qu.: -7.667 1: 238 1: 521 1st Qu.:30.00 1st Qu.:36.00
## Median : -1.667 2: 10 Median :35.00 Median :42.00
## Mean : 0.000 Mean :34.16 Mean :41.33
## 3rd Qu.: 8.333 3rd Qu.:39.00 3rd Qu.:48.00
## Max. : 22.333 Max. :49.00 Max. :58.00
## NA's :137 NA's :320 NA's :204
## apr apo den ssi
## Min. : 1.0 Min. : 2.00 1:1227 Min. :10.00
## 1st Qu.: 9.0 1st Qu.:15.00 2:1686 1st Qu.:16.00
## Median :12.0 Median :21.00 3:1005 Median :18.00
## Mean :11.9 Mean :19.76 4: 188 Mean :18.73
## 3rd Qu.:15.0 3rd Qu.:25.00 3rd Qu.:22.00
## Max. :20.0 Max. :30.00 Max. :30.00
## NA's :309 NA's :200 NA's :622
As we can see from the summary() function, the missing values for the categorical features identified above have been imputed.
We continue with regression imputation. We create a response variable, post, defined as the total post-test score for each student and obtained by adding the language and arithmetic post-test scores together. The remaining variables are selected as predictors, as sketched below.
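A minimal sketch of this setup step, assuming the kNN-imputed data frame is named knn.brandsma (object names are illustrative):

# Build the dataset for regression imputation from the kNN-imputed data.
predictors <- c("iqv", "iqp", "sex", "ses", "min", "rpg", "lpr", "apr", "den", "ssi")

reg.brandsma <- knn.brandsma
reg.brandsma$post <- reg.brandsma$lpo + reg.brandsma$apo  # total post-test score
reg.brandsma <- reg.brandsma[, c(predictors, "post")]

# Drop observations with missing values in any predictor; missing values in the
# response post are kept so they can be imputed by regression in the next step.
reg.brandsma <- reg.brandsma[complete.cases(reg.brandsma[, predictors]), ]
summary(reg.brandsma)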
## iqv iqp sex ses min
## Min. :-7.85351 Min. :-6.72275 0:1606 Min. :-17.6667 0:2912
## 1st Qu.:-1.35351 1st Qu.:-1.72275 1:1489 1st Qu.: -7.6667 1: 183
## Median : 0.14649 Median :-0.05608 Median : -2.6667
## Mean :-0.07096 Mean :-0.04532 Mean : -0.2906
## 3rd Qu.: 1.14649 3rd Qu.: 1.60392 3rd Qu.: 7.3333
## Max. : 6.14649 Max. : 6.61058 Max. : 22.3333
##
## rpg lpr apr den post
## 0:2696 Min. :10.00 Min. : 1.00 1:1014 Min. :13.00
## 1: 392 1st Qu.:30.00 1st Qu.: 9.00 2:1091 1st Qu.:50.00
## 2: 7 Median :35.00 Median :12.00 3: 835 Median :62.00
## Mean :34.06 Mean :11.82 4: 155 Mean :60.46
## 3rd Qu.:39.00 3rd Qu.:14.00 3rd Qu.:72.00
## Max. :49.00 Max. :20.00 Max. :88.00
## NA's :174
## ssi
## Min. :10.00
## 1st Qu.:16.00
## Median :18.00
## Mean :18.73
## 3rd Qu.:22.00
## Max. :30.00
##
With the observations that had missing values in the predictors removed, the response variable post now has 174 observations with missing values, which we will impute through regression imputation.
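A minimal sketch of the imputation step, continuing with the assumed object names above; comp.imputedData matches the name used for this dataset later in the report.

# Fit the imputation model on the observed responses and predict the missing ones.
reg.fit <- lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi,
              data = reg.brandsma)

comp.imputedData <- reg.brandsma
missing_post <- is.na(comp.imputedData$post)
comp.imputedData$post[missing_post] <- predict(reg.fit,
                                               newdata = comp.imputedData[missing_post, ])
summary(comp.imputedData)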
## iqv iqp sex ses min
## Min. :-7.85351 Min. :-6.72275 0:1606 Min. :-17.6667 0:2912
## 1st Qu.:-1.35351 1st Qu.:-1.72275 1:1489 1st Qu.: -7.6667 1: 183
## Median : 0.14649 Median :-0.05608 Median : -2.6667
## Mean :-0.07096 Mean :-0.04532 Mean : -0.2906
## 3rd Qu.: 1.14649 3rd Qu.: 1.60392 3rd Qu.: 7.3333
## Max. : 6.14649 Max. : 6.61058 Max. : 22.3333
## rpg lpr apr den post
## 0:2696 Min. :10.00 Min. : 1.00 1:1014 Min. :11.06
## 1: 392 1st Qu.:30.00 1st Qu.: 9.00 2:1091 1st Qu.:50.00
## 2: 7 Median :35.00 Median :12.00 3: 835 Median :62.00
## Mean :34.06 Mean :11.82 4: 155 Mean :60.14
## 3rd Qu.:39.00 3rd Qu.:14.00 3rd Qu.:72.00
## Max. :49.00 Max. :20.00 Max. :88.85
## ssi
## Min. :10.00
## 1st Qu.:16.00
## Median :18.00
## Mean :18.73
## 3rd Qu.:22.00
## Max. :30.00
We can see that the removal of observations where the predictor variables had missing values has led to a pretty substantial loss of information, as the imputed dataset only has 3095 observations as opposed to the original dataset’s 4106, or a 24.6% loss.
We will perform multiple imputation through the R library mice, setting m = 5.
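A minimal sketch of this step is shown below; the data frame name brandsma and the seed are illustrative assumptions.

# Multiple imputation by chained equations with m = 5 imputed datasets.
library(mice)

imp <- mice(brandsma, m = 5, seed = 123, printFlag = FALSE)
imp$method                         # imputation method chosen per variable

# Extract one completed dataset for the later feature engineering steps and
# recreate the total post-test score.
mice.brandsma <- complete(imp, 1)
mice.brandsma$post <- mice.brandsma$lpo + mice.brandsma$apo
str(mice.brandsma)
summary(mice.brandsma)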
## Warning: Number of logged events: 1
## sch pup iqv iqp sex ses min rpg
## "" "" "pmm" "pmm" "logreg" "pmm" "" "polyreg"
## lpr lpo apr apo den ssi
## "pmm" "pmm" "pmm" "pmm" "polyreg" "pmm"
## 'data.frame': 4106 obs. of 15 variables:
## $ iqv : num -1.35 2.15 3.15 2.65 -2.35 ...
## $ iqp : num -3.7227 3.2773 1.2739 -1.0561 -0.0561 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ ses : num -17.67 -4.67 -4.67 -17.67 -12.67 ...
## $ min : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 1 2 2 1 ...
## $ rpg : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ lpr : num 33 44 36 36 33 29 19 22 20 44 ...
## $ lpo : num 35 50 46 45 33 46 20 30 30 57 ...
## $ apr : num 10 18 14 12 10 13 8 8 7 17 ...
## $ apo : num 12 30 24 19 24 26 9 13 13 30 ...
## $ den : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ ssi : num 11 11 11 11 11 11 11 11 11 11 ...
## $ post: num 43 80 70 64 57 72 29 43 43 87 ...
## $ pup : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sch : int 1 1 1 1 1 1 1 1 1 1 ...
## iqv iqp sex ses min
## Min. :-7.85351 Min. :-6.722750 0:2100 Min. :-17.6667 0:3868
## 1st Qu.:-1.35351 1st Qu.:-1.722750 1:2006 1st Qu.: -7.6667 1: 238
## Median : 0.14649 Median :-0.056083 Median : -1.6667
## Mean :-0.00256 Mean :-0.000923 Mean : -0.1209
## 3rd Qu.: 1.14649 3rd Qu.: 1.603917 3rd Qu.: 8.3333
## Max. : 6.64649 Max. : 6.610584 Max. : 22.3333
## rpg lpr lpo apr apo
## 0:3572 Min. : 9.00 Min. : 8.00 Min. : 1.00 Min. : 2.00
## 1: 524 1st Qu.:30.00 1st Qu.:35.00 1st Qu.: 9.00 1st Qu.:14.00
## 2: 10 Median :35.00 Median :42.00 Median :12.00 Median :20.00
## Mean :34.28 Mean :41.14 Mean :11.95 Mean :19.62
## 3rd Qu.:39.00 3rd Qu.:48.00 3rd Qu.:15.00 3rd Qu.:25.00
## Max. :49.00 Max. :58.00 Max. :20.00 Max. :30.00
## den ssi post pup sch
## 1:1271 Min. :10.00 Min. :13.00 Min. : 1 Min. : 1.0
## 2:1600 1st Qu.:16.00 1st Qu.:50.00 1st Qu.:1073 1st Qu.: 66.0
## 3:1038 Median :18.00 Median :61.00 Median :2128 Median :132.0
## 4: 197 Mean :18.75 Mean :60.14 Mean :2121 Mean :129.2
## 3rd Qu.:22.00 3rd Qu.:72.00 3rd Qu.:3170 3rd Qu.:187.0
## Max. :30.00 Max. :88.00 Max. :4214 Max. :259.0
Using multiple imputation through mice, all of the missing values for all 4106 observations were imputed.
We construct the same multiple linear regression model using both imputed datasets, with post as the response and iqv, iqp, sex, ses, min, rpg, lpr, apr, den, and ssi as the predictors. In both models, nearly all of the predictors are significant at the 0.05 significance level. The main exception is min, which is not significant in the fits to the multiply imputed data (p ≈ 0.66 in the output shown) and only marginally significant (p ≈ 0.049) in the model fit to the kNN + regression imputed data; the coefficient for the sparse rpg = 2 category is also not significant in either model.
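A minimal sketch of fitting and comparing the two models is given below; the object names (imp, comp.imputedData) and the particular pooling and summary calls are assumptions about how output of this form could be produced.

library(dplyr)
library(broom)

# Fit the model within each of the m = 5 completed datasets (recreating post),
# stack the per-dataset coefficient estimates, and pool R^2 with Rubin's rules.
completed <- complete(imp, action = "all")
fits <- lapply(completed, function(d) {
  d$post <- d$lpo + d$apo
  lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi, data = d)
})
bind_rows(lapply(fits, function(f) mutate(tidy(f), nobs = nobs(f))))
pool.r.squared(as.mira(fits))      # pooled R^2 with 95% interval and fmi

# The comparison model on the single kNN + regression imputed dataset.
summary(lm(post ~ iqv + iqp + sex + ses + min + rpg + lpr + apr + den + ssi,
           data = comp.imputedData))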
## # A tibble: 70 × 6
## term estimate std.error statistic p.value nobs
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 (Intercept) 16.2 1.24 13.1 1.16e- 38 4106
## 2 iqv 1.21 0.0896 13.5 1.51e- 40 4106
## 3 iqp 0.834 0.0748 11.2 1.69e- 28 4106
## 4 sex1 0.993 0.278 3.58 3.53e- 4 4106
## 5 ses 0.119 0.0147 8.09 7.67e- 16 4106
## 6 min1 0.268 0.602 0.445 6.56e- 1 4106
## 7 rpg1 -3.60 0.426 -8.44 4.19e- 17 4106
## 8 rpg2 -0.0959 2.76 -0.0348 9.72e- 1 4106
## 9 lpr 0.744 0.0290 25.6 2.65e-134 4106
## 10 apr 1.04 0.0496 21.1 1.51e- 93 4106
## # ℹ 60 more rows
|     | est       | lo 95     | hi 95     | fmi       |
|-----|-----------|-----------|-----------|-----------|
| R^2 | 0.6574773 | 0.6398391 | 0.6744685 | 0.0370674 |
##
## Call:
## lm(formula = post ~ iqv + iqp + sex + ses + min + rpg + lpr +
## apr + den + ssi, data = comp.imputedData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.914 -5.555 0.083 5.870 29.218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.28956 1.37364 12.587 < 2e-16 ***
## iqv 1.30204 0.09968 13.062 < 2e-16 ***
## iqp 0.81934 0.08388 9.768 < 2e-16 ***
## sex1 1.14646 0.31109 3.685 0.000232 ***
## ses 0.13092 0.01676 7.811 7.74e-15 ***
## min1 1.31255 0.66532 1.973 0.048607 *
## rpg1 -3.87468 0.47827 -8.102 7.74e-16 ***
## rpg2 3.06984 3.21076 0.956 0.339092
## lpr 0.73159 0.03232 22.638 < 2e-16 ***
## apr 0.99006 0.05474 18.088 < 2e-16 ***
## den2 3.51989 0.37384 9.415 < 2e-16 ***
## den3 1.31260 0.39670 3.309 0.000948 ***
## den4 1.78517 0.75532 2.363 0.018166 *
## ssi 0.24385 0.04058 6.010 2.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.437 on 3081 degrees of freedom
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.6664
## F-statistic: 476.4 on 13 and 3081 DF, p-value: < 2.2e-16
We see that the model based on the dataset created through kNN and regression imputation has a higher estimate for the R-squared, at 0.6678, than the model based on the dataset imputed through multiple imputation. This makes sense: the missing values of post were imputed using a regression model built from the same predictors, which likely biases the apparent performance of the kNN + regression model upwards. In addition, that model was fit on only 3095 observations, as opposed to the 4106 of the mice-imputed dataset, which preserved all of the information contained in the original data. The R-squared estimate for the kNN + regression imputed dataset also falls within the confidence bounds of the R-squared of the model based on the mice-imputed dataset, so there is no clear evidence of significantly better performance for the dataset created through single imputation methods. Therefore, for the rest of the analysis, we proceed with the dataset created through multiple imputation.
The feature engineering methods that will be utilized in this analysis all fall into one of three categories: feature transformation, feature selection, and feature creation.
The variables identified as having notable skewness in the initial analysis were the following: left skew was observed in lpr, lpo, apo, and possibly iqv, while evidence of right skew appeared in ssi and ses.
We can examine their distributions in the following histograms.
As we can see, there is fairly obvious left skew in lpr and lpo, though all of the distributions do appear unimodal. We conduct Box-Cox transformations on each of the above features to create transformed versions of them, shifting iqv and ses by 8 and 18, respectively, in order to keep them positive, as the Box-Cox transformation requires strictly positive values.
library(MASS)  # provides boxcox(); assumed to be loaded earlier in the report
par(mfrow = c(2, 3))
# iqv - add a constant of 8 to keep the values strictly positive
boxcox_iqv <- boxcox(lm(mice.brandsma$iqv+8 ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_iqv$x[which.max(boxcox_iqv$y)]
mice.brandsma$tiqv <- if (optimal_lambda == 0) {
log(mice.brandsma$iqv+8)
} else {
((mice.brandsma$iqv+8)^optimal_lambda - 1) / optimal_lambda
}
# ses - add a constant of 18 to keep the values strictly positive
boxcox_ses <- boxcox(lm(mice.brandsma$ses+18 ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_ses$x[which.max(boxcox_ses$y)]
mice.brandsma$tses <- if (optimal_lambda == 0) {
log(mice.brandsma$ses+18)
} else {
((mice.brandsma$ses+18)^optimal_lambda - 1) / optimal_lambda
}
# lpr
boxcox_lpr <- boxcox(lm(mice.brandsma$lpr ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_lpr$x[which.max(boxcox_lpr$y)]
mice.brandsma$tlpr <- if (optimal_lambda == 0) {
log(mice.brandsma$lpr)
} else {
(mice.brandsma$lpr^optimal_lambda - 1) / optimal_lambda
}
# lpo
boxcox_lpo <- boxcox(lm(mice.brandsma$lpo ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_lpo$x[which.max(boxcox_lpo$y)]
mice.brandsma$tlpo <- if (optimal_lambda == 0) {
log(mice.brandsma$lpo)
} else {
(mice.brandsma$lpo^optimal_lambda - 1) / optimal_lambda
}
# apo
boxcox_apo <- boxcox(lm(mice.brandsma$apo ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_apo$x[which.max(boxcox_apo$y)]
mice.brandsma$tapo <- if (optimal_lambda == 0) {
log(mice.brandsma$apo)
} else {
(mice.brandsma$apo^optimal_lambda - 1) / optimal_lambda
}
# ssi
boxcox_ssi <- boxcox(lm(mice.brandsma$ssi ~ 1), lambda = seq(0, 3, by = 0.1))
title("Box-Cox Transformation")
optimal_lambda <- boxcox_ssi$x[which.max(boxcox_ssi$y)]
mice.brandsma$tssi <- if (optimal_lambda == 0) {
log(mice.brandsma$ssi)
} else {
(mice.brandsma$ssi^optimal_lambda - 1) / optimal_lambda
}
library(ggplot2)    # assumed to be loaded earlier; included for completeness
library(gridExtra)  # provides grid.arrange()
h1 <- ggplot(mice.brandsma, aes(x = tiqv)) + geom_histogram(color = "#32373B", fill = "#C83E4D", bins = 15)
h2 <- ggplot(mice.brandsma, aes(x = tses)) + geom_histogram(color = "#32373B", fill = "#F4D6CC", bins = 15)
h3 <- ggplot(mice.brandsma, aes(x = tlpr)) + geom_histogram(color = "#32373B", fill = "#F4B860", bins = 15)
h4 <- ggplot(mice.brandsma, aes(x = tlpo)) + geom_histogram(color = "#32373B", fill = "#C83E4D", bins = 15)
h5 <- ggplot(mice.brandsma, aes(x = tapo)) + geom_histogram(color = "#32373B", fill = "#F4D6CC", bins = 15)
h6 <- ggplot(mice.brandsma, aes(x = tssi)) + geom_histogram(color = "#32373B", fill = "#F4B860", bins = 15)
grid.arrange(h1, h2, h3, h4, h5, h6, nrow = 2, ncol = 3)
The transformed versions of iqv, lpr, lpo, and ssi do appear more symmetric; however, apo and ses still do not appear to be normally distributed.
We then create a function to standardize each of the variables based on its mean and standard deviation, centering it at 0. Certain algorithms, particularly those based on Euclidean distances such as k-means clustering and principal component analysis, can give disproportionate weight to features whose units produce larger magnitudes. Standardizing the features can improve the performance of these algorithms.
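A minimal sketch of such a standardization step; the set of numeric columns listed below is an illustrative assumption.

# Center each numeric feature at 0 and scale to unit standard deviation.
standardize <- function(x) (x - mean(x)) / sd(x)

num_feats <- c("tiqv", "iqp", "tses", "tlpr", "tlpo", "apr", "tapo", "tssi")
mice.brandsma[num_feats] <- lapply(mice.brandsma[num_feats], standardize)
summary(mice.brandsma[num_feats])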
Two methods of feature selection will be explored in this analysis. The first is recursive feature elimination from the caret package. This method can be computationally intensive and slow to run; it uses backwards selection with random forest models to evaluate subsets of features and their interactions, using cross-validation to select the optimal set of features.
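A minimal sketch of this step with caret is shown below; the random forest helper functions and 5-fold cross-validation settings are illustrative assumptions.

# Recursive feature elimination with random forest models and cross-validation.
library(caret)

rfe_feats <- c("tiqv", "iqp", "sex", "tses", "min", "rpg", "tlpr", "apr", "den", "tssi")
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = mice.brandsma[, rfe_feats], y = mice.brandsma$post,
               sizes = 1:length(rfe_feats), rfeControl = rfe_ctrl)
predictors(rfe_fit)   # the selected subset of features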
## [1] "apr" "tlpr" "tiqv" "iqp" "rpg" "tses" "tssi" "den" "min" "sex"
The features selected based on this method are the following: apr, tlpr, tiqv, iqp, tses, rpg, tssi, den, min, and sex.
The other method of feature selection we will use is the LASSO, implemented through the glmnet package. We use the same response variable, post.
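A minimal sketch with glmnet follows; coding the factors numerically with data.matrix() and choosing lambda by cross-validation are illustrative assumptions.

# LASSO feature selection: features with nonzero coefficients at the
# cross-validated lambda are retained.
library(glmnet)

lasso_feats <- c("sex", "min", "rpg", "den", "tlpr", "tses", "tssi", "tiqv", "apr", "iqp")
x <- data.matrix(mice.brandsma[, lasso_feats])  # factors coded as integers for simplicity
y <- mice.brandsma$post
cv_fit <- cv.glmnet(x, y, alpha = 1)            # alpha = 1 corresponds to the LASSO penalty
lasso_coefs <- coef(cv_fit, s = "lambda.min")
rownames(lasso_coefs)[as.vector(lasso_coefs != 0)]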
## [1] "(Intercept)" "sex" "min" "rpg" "den"
## [6] "tlpr" "tses" "tssi" "tiqv" "apr"
## [11] "iqp"
The features selected are the following: sex, min, rpg, den, tlpr, tses, tssi, tiqv, apr, and iqp.
Both of these methods selected all of the features that were originally included, which may be related to the large sample size of the dataset. Small p-values were also observed for all of these predictors except min in the linear models constructed earlier.
The process of feature creation in this report will consist of a few analytic tasks.
To begin, sparse categories were identified in the original exploratory analysis in the variable rpg, where only 10 observations fall into the category of 2 repeated groups. Therefore, we will create a binary variable that identifies whether or not an observation repeated any groups, combining the rpg values of 1 and 2 into one category.
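A minimal sketch of this step; the combined variable name has_repeated_group matches the one referenced below.

# Collapse the sparse rpg = 2 category: 0 = no repeated group, 1 = one or more.
mice.brandsma$has_repeated_group <- factor(ifelse(mice.brandsma$rpg == "0", 0, 1))

summary(mice.brandsma[, c("sex", "min", "rpg", "den")])
table(mice.brandsma$has_repeated_group)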
## sex min rpg den
## 0:2100 0:3868 0:3572 1:1271
## 1:2006 1: 238 1: 524 2:1600
## 2: 10 3:1038
## 4: 197
## 0 1
## 3572 534
The categories for min, has_repeated_group, and den are still unbalanced in terms of sample size; however, with no meaningful way to combine the categories of den and only two categories in the other features, any issues that arise due to these imbalances would be better mitigated through analytic techniques other than category regrouping.
PCA, or principal component analysis, can be used to simplify the dataset, especially when it contains a number of highly correlated variables. It creates new variables that are linear combinations of the original variables, are uncorrelated with each other, and concentrate as much of the variance as possible in the first few principal components. We will investigate its applicability to this dataset. We will also use clustering methods such as k-means clustering, in combination with PCA, to plot the clusters and investigate whether well-defined groupings exist in the dataset.
To begin, we will look at the correlations between the numerical features in the data.
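A minimal sketch of this check on the standardized numeric features (the column set is the illustrative one defined above):

# Pairwise Pearson correlations between the numeric features.
round(cor(mice.brandsma[, num_feats]), 3)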
There are some features that do show notable positive correlations, such as the transformed language pre-score and the transformed language post-score, which have a correlation of 0.715, as well as the transformed arithmetic post-score and the transformed language post-score, which have a correlation of 0.709. However, we don’t see any features that are obviously highly correlated at r = 0.8-1.0.
Highly correlated features can create issues in regression analysis, leading to multicollinearity and overfitting. Principal component analysis uses linear combinations of features to create “principal components” that are designed to have as little correlation between them as possible. This can be used to reduce the dimensionality of the data, capturing much of its variability in fewer principal components than the number of variables in the original data, simplifying future analysis procedures.
We will attempt PCA on the numeric features in this dataset to examine its usefulness.
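A minimal sketch of the PCA step on the standardized numeric features (column set as assumed above):

# PCA on the (already standardized) numeric features; rescaling here is harmless.
pca_fit <- prcomp(mice.brandsma[, num_feats], center = TRUE, scale. = TRUE)
summary(pca_fit)
screeplot(pca_fit, type = "lines", main = "Scree plot")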
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0287 1.0904 0.8347 0.73195 0.70081 0.63768 0.59461
## Proportion of Variance 0.5144 0.1486 0.0871 0.06697 0.06139 0.05083 0.04419
## Cumulative Proportion 0.5144 0.6631 0.7502 0.81715 0.87854 0.92937 0.97356
## PC8
## Standard deviation 0.45989
## Proportion of Variance 0.02644
## Cumulative Proportion 1.00000
Using PCA, we can see that the first two principal components capture 66.3% of the variance in the dataset, the first three capture 75%, and the first four capture over 80%. The scree plot shows that after the first principal component, the decrease in variance with each added component slows drastically and progresses more or less evenly between PC3 and PC8. The elbow appears around the 2nd or 3rd principal component, so if capturing 75% of the variability in the dataset is acceptable, the first 3 PCs can be used for future analysis; a threshold of 85% would require 5 PCs, and a threshold of 95% would require 7 PCs. In general, while PCA is certainly usable, the features are not correlated strongly enough for it to be especially effective, and its applicability should be weighed against interpretability and its relevance to the chosen response variable in future regression analysis.
We will also use the first two PCs to visualize clustering algorithms such as k-means clustering. K-means is a distance-based algorithm that partitions the points in a dataset into groups in order to minimize the distance between each point and the centroid of its group (4). We will use an elbow plot to identify the ideal number of clusters.
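A minimal sketch of the elbow plot; the range of k and the use of the total within-cluster sum of squares are illustrative assumptions.

# Total within-cluster sum of squares for k = 1..10 clusters.
set.seed(1)
wss <- sapply(1:10, function(k) {
  kmeans(mice.brandsma[, num_feats], centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares", main = "Elbow plot")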
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
## Warning: did not converge in 1 iteration
We see an obvious drop off from 1 to 2 clusters, then a smoother decline for higher values of k. The most appropriate elbow point seems to be around k=3.
We can graph the dataset across its first and second principal components and then visualize the clusters using clusplot() from the R library cluster.
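A minimal sketch of this visualization, using k = 3 as suggested by the elbow plot above:

# k-means with k = 3, plotted over the first two principal components.
library(cluster)

set.seed(1)
km_fit <- kmeans(mice.brandsma[, num_feats], centers = 3, nstart = 10)
clusplot(mice.brandsma[, num_feats], km_fit$cluster,
         color = TRUE, shade = TRUE, labels = 0, lines = 0,
         main = "k-means clusters over the first two principal components")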
Looking purely at the data points when graphed across the principal components, we don’t see much evidence of explicit clustering in this dataset. The clusters identified by k-means clustering seem to spread horizontally across varying values of PC1 while depending very little on the values of PC2. Without much evidence in favor of clustering in the scatterplot over the first and second principal components, clustering methods do not seem to be the most appropriate techniques to carry into future analysis of these data.
This analysis of the brandsma dataset on school pupils and their test scores in language and arithmetic is a follow-up to an initial exploratory analysis (2), with the goal of creating a final analysis dataset for future modeling and analytic activities. This report can be broken up into two main sections: 1) missing data imputation and 2) feature engineering.
For missing data imputation, we used k-nearest neighbors imputation, regression imputation, and multiple imputation through the R mice library. In order to retain as much information as possible, we chose to use multiple imputation for the final analytic dataset. However, multiple imputation through MICE has limitations, including the propagation of imputation errors through the data and its inability to handle data that are missing not at random (MNAR).
For feature engineering, analytic tasks were split between three main categories: 1) feature transformation, 2) feature selection, and 3) feature creation. Skewed features were transformed through Box-Cox transformations and then standardized to increase the applicability of distance-based algorithms. Features were selected through two feature selection algorithms, which indicated that all features in the dataset are worth retaining for future analysis. Sparse categories were handled through regrouping, and PCA and clustering algorithms were assessed for their relevance to future analysis. While PCA may be applicable depending on the goals of future analysis, there did not appear to be strong evidence supporting the use of clustering for future analytic tasks.
The created analytic dataset can now be used for future model building and prediction.