In this project, we study the relationship between personal characteristics and the income of United States citizens. Finding a suitable dataset was not easy; the data was ultimately retrieved from the Minnesota Population Center, a database that allows manual selection of the variables to include. After assembling the dataset, we preprocessed the data to account for missing values and other inconsistencies. Several multiple linear regression models are then fit to the preprocessed dataset: a preliminary regression on all variables, a regression on manually selected variables, stepwise regression, and regression on data transformed with the Box-Cox method. These models are compared on their \(\bar{R}^2\) and on how well they satisfy the model assumptions. Additionally, we explore the outliers in the data to build a potentially stronger model without these values.
The initial set of variables was selected manually from the Minnesota Population Center interface.
However, the raw dataset has some complexities. First, it contains factor variables with very many levels (e.g., the industry variable uses codes ranging up to nearly 10,000). Moreover, the data is presented as a frequency table, with a weight column giving the number of individuals each row represents. Finally, the number of rows is close to 10,000, which is too large for this analysis.
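To make the weighting concrete, a weighted frequency table can in principle be expanded back to person-level records by repeating each row according to its weight. A minimal sketch with a hypothetical toy table:

# hypothetical toy frequency table: each row represents WEIGHT individuals
toy <- data.frame(AGE = c(25, 40), INCOME = c(30000, 55000), WEIGHT = c(3, 2))
# expand to one record per represented individual
expanded <- toy[rep(seq_len(nrow(toy)), toy$WEIGHT), c("AGE", "INCOME")]
nrow(expanded) # 5 individuals represented by 2 rows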
In this section we preprocess the data to account for missing values and other inconsistencies. We aggregate several variables, apply weighting, and sample the dataset to 500 observations.
# load the preprocessed dataset and convert the categorical columns to factors
df <- read.csv('dataset.csv')
df$VETSTAT <- as.factor(df$VETSTAT)
df$IND <- as.factor(df$IND)
df$SCHLTYPE_revised <- as.factor(df$SCHLTYPE_revised)
df_raw <- read.csv('dataset_original.csv')
head(df_raw)
This is what the original dataset looks like.
income$IND_revised <- income$IND
# map the numeric industry codes onto broad industry categories; the
# conditions test the original numeric IND column, because the first string
# assignment coerces IND_revised to character, which would make later
# range comparisons lexicographic
income$IND_revised[income$IND >= 170 & income$IND <= 3990] <- "Agriculture, Forestry, Fishing and Hunting, and Mining"
income$IND_revised[income$IND >= 4070 & income$IND <= 4590] <- "Wholesale Trade"
income$IND_revised[income$IND >= 4670 & income$IND <= 5790] <- "Retail Trade"
income$IND_revised[income$IND >= 6070 & income$IND <= 6390] <- "Transportation and Warehousing, and Utilities"
income$IND_revised[income$IND %in% c(570, 580, 590, 670, 680, 690)] <- "Transportation and Warehousing, and Utilities"
income$IND_revised[income$IND >= 6470 & income$IND <= 6780] <- "Information"
income$IND_revised[income$IND >= 6870 & income$IND <= 7190] <- "Finance and Insurance, and Real Estate and Rental and Leasing"
income$IND_revised[income$IND >= 7270 & income$IND <= 7790] <- "Professional, Scientific, and Management, and Administrative, and Waste Management Services"
income$IND_revised[income$IND >= 7860 & income$IND <= 8470] <- "Educational Services, and Health Care and Social Assistance"
income$IND_revised[income$IND >= 8560 & income$IND <= 8690] <- "Arts, Entertainment, and Recreation, and Accommodation and Food Services"
income$IND_revised[income$IND >= 8770 & income$IND <= 9290] <- "Other Services, Except Public Administration"
income$IND_revised[income$IND >= 9370 & income$IND <= 9590] <- "Public Administration"
income$IND_revised[income$IND >= 9670 & income$IND <= 9920] <- "Active Duty Military"
The IND variable contains detailed census industry codes. We aggregated it by mapping each code into its broader industry category.
# aggregating the industry variable
par(mfrow=c(1,2))
plot(df$IND)
plot(df$IND_revised)
states_revised <- read.csv("dataset.csv")
states_revised$PWSTATE2_revised <- states_revised$PWSTATE2
table(states_revised$PWSTATE2)
# drop rows with an unknown (0) or non-state (>= 61) place-of-work code
states_revised <- states_revised[!(states_revised$PWSTATE2_revised >= 61),]
states_revised <- states_revised[!(states_revised$PWSTATE2_revised == 0),]
table(states_revised$PWSTATE2_revised)
states_revised$PWSTATE2_revised[states_revised$PWSTATE2_revised %in% c(53,41,6,30,16,56,32,49,8,4,35,2,15)] <- "WEST"
states_revised$PWSTATE2_revised[states_revised$PWSTATE2_revised %in% c(38,27,46,19,31,20,29,55,26,17,18,39)] <- "MIDWEST"
states_revised$PWSTATE2_revised[states_revised$PWSTATE2_revised %in% c(40,5,48,22,28,1,47,21,54,51,37,45,13,12,10,24,11)] <- "SOUTH"
states_revised$PWSTATE2_revised[states_revised$PWSTATE2_revised %in% c(33,50,23,25,44,9,36,34,42)] <- "NORTHEAST"
We transformed the PWSTATE2 variable by categorizing each observation's place of work into one of the four census regions of the United States.
cat_revised$RACE_revised <- cat_revised$RACE
table(cat_revised$RACE)
cat_revised$RACE_revised[cat_revised$RACE_revised == 1] <- "White"
cat_revised$RACE_revised[cat_revised$RACE_revised == 2] <- "Black"
cat_revised$RACE_revised[cat_revised$RACE_revised == 3] <- "American Indian"
cat_revised$RACE_revised[cat_revised$RACE_revised %in% c(4,5,6)] <- "Asian"
cat_revised$RACE_revised[cat_revised$RACE_revised %in% c(7,8,9)] <- "Other"
We transformed the RACE variable by assigning descriptive names rather than numeric codes.
cat_revised$HCOVANY_revised <- cat_revised$HCOVANY
cat_revised$HCOVANY_revised[cat_revised$HCOVANY_revised == 1] <- "No health insurance coverage"
cat_revised$HCOVANY_revised[cat_revised$HCOVANY_revised == 2] <- "With health insurance coverage"
We transformed the HCOVANY variable by assigning descriptive names rather than numeric codes.
cat_revised$WKSWORK2_revised <- cat_revised$WKSWORK2
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 1] <- "1-13 weeks"
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 2] <- "14-26 weeks"
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 3] <- "27-39 weeks"
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 4] <- "40-47 weeks"
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 5] <- "48-49 weeks"
cat_revised$WKSWORK2_revised[cat_revised$WKSWORK2_revised == 6] <- "50-52 weeks"
We transformed the WKSWORK2 variable by assigning descriptive names rather than numeric codes.
cat_revised$VETSTAT_revised <- cat_revised$VETSTAT
cat_revised$VETSTAT_revised[cat_revised$VETSTAT_revised == 1] <- "Not a veteran"
cat_revised$VETSTAT_revised[cat_revised$VETSTAT_revised == 2] <- "Veteran"
We transformed the VETSTAT variable by assigning descriptive names rather than numeric codes.
# aggregate the education variable
df$EDUC_revised <- as.factor(sapply(df$EDUC, function(x) {
if (x == 0) return('no school or N/A')
else if (x < 7) return('1-12 years of pre-college')
else if (x < 11) return('1-4 years of college')
else return ('5+ years of college')
}))
par(mfrow=c(1,2))
plot(as.factor(df$EDUC))
plot(df$EDUC_revised)
We aggregated the EDUC variable into four levels that capture no schooling, pre-college education, college, and advanced college degrees.
library(ggplot2)
ggplot(df) +
  geom_point(aes(x = seq_len(nrow(df)), y = INCTOT, color = EDUC_revised)) +
  theme_bw()
Unsurprisingly, there is an obvious correlation between education level and income.
# aggregate the school type variable
df$SCHLTYPE_revised <- as.factor(sapply(df$SCHLTYPE, function(x) {
if (x == 0) return('N/A')
else if (x == 1) return('Not Enrolled')
else if (x == 2) return('Public School')
else return ('Private School')
}))
par(mfrow=c(1,2))
plot(as.factor(df$SCHLTYPE))
plot(df$SCHLTYPE_revised)
We dropped the INCWAGE variable after realizing it was not very descriptive, and we use total income (INCTOT) as the response variable instead.
# drop old variables
dfn <- subset(df, select = -c(MARRNO_revised, RACE, HCOVANY, IND, VETSTAT, PWSTATE2, EDUC, SCHLTYPE))
# rename columns
colnames(dfn) <- c("WEIGHT", "NUM_CHILD", "SEX", "AGE", "NUM_MARR", "WEEKS_WORKED_NUMERIC", "NUM_HOURS_WEEK", "INCOME", "TRAVEL_TIME", "INDUSTRY", "EDUC", "DIVISION", "SCHL_TYPE", "RACE", "HEALTH_INS", "WEEKS_WORKED", "VETERAN")
head(dfn)
In the following script, we calculate the income for observations with identical predictor values as a weighted mean, using WEIGHT as the weights.
library(data.table)
dt <- as.data.table(dfn)
# group by all predictor variables and collapse duplicate rows into a weighted-mean income
dt <- dt[, .(INCOME = weighted.mean(INCOME, WEIGHT)), by = setdiff(names(dt), c("WEIGHT", "INCOME"))]
head(dt)
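As a quick illustration of the aggregation rule, weighted.mean computes \(\sum_i w_i y_i / \sum_i w_i\); for example, two rows with incomes 10,000 and 20,000 and weights 1 and 3 collapse into a single row:

weighted.mean(c(10000, 20000), w = c(1, 3)) # (1*10000 + 3*20000) / 4 = 17500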
The data is now prepared, but the dataset still contains around 10,000 observations.
distributions <- function(dt) {
par(mfrow=c(4,4))
# show distributions of the variables
for (i in names(dt)) {
if (is.numeric(dt[[i]])) {
hist(dt[[i]], main = i)
} else {
plot(dt[[i]], main = i)
}
}
}
# visualise distributions
distributions(dt)
The plots above show the distributions of the variables in the full dataset. Next, we take a random sample of 500 observations and check whether these distribution properties are maintained.
set.seed(1337)
dts <- dt[sample(nrow(dt), 500), ]
distributions(dts)
It can be seen that the general distribution properties are indeed captured in the 500-observation sample.
After all preprocessing operations, the final dataset contained 16 variables:
Variable | Description
---|---
NUM_CHILD | Counts the number of own children (of any age or marital status) residing with each individual, including step-children and adopted children as well as biological children.
SEX | Reports whether the person was male or female.
AGE | Reports the person's age in years as of the last birthday.
EDUC | Reports the person's education level.
NUM_MARR | Reports the number of times ever-married persons have been married.
NUM_HOURS_WEEK | Reports the usual number of hours the person worked per week.
TRAVEL_TIME | Reports the number of minutes it usually took the person to travel from home to work.
INDUSTRY | Reports the type of industry in which the person worked. Persons who worked in multiple industries reported the one from which they earned the most money; unemployed persons reported their most recent industry.
DIVISION | Reports the region of the country in which the person's primary workplace is located.
SCHL_TYPE | Reports whether the person attended public or private school, if enrolled at all.
RACE | Reports the race of the person.
HEALTH_INS | Reports whether or not the person has health insurance coverage.
WEEKS_WORKED | Reports the number of weeks the person worked last year, binned into intervals.
WEEKS_WORKED_NUMERIC | Reports the number of weeks the person worked last year.
VETERAN | Reports whether or not the person is a veteran.
INCOME | Reports the person's total pre-tax personal income from the previous year. This is the response variable of the analysis.
In this first model, we run a multiple regression of income on all 14 independent variables in the dataset. Before building the model, we made a few final preprocessing steps:
- The WEEKS_WORKED_NUMERIC variable has been removed from the dataset.
- The NUM_MARR variable has been converted from a numerical variable to a categorical variable.

orig_data <- read.csv("dataset_final_sample.csv")
orig_data <- orig_data[, !(colnames(orig_data) %in% c("WEEKS_WORKED_NUMERIC"))]
orig_data$INDUSTRY <- sapply(orig_data$INDUSTRY, function(x) substr(x, 1, 60)) # truncate long industry labels
orig_data$NUM_MARR <- as.factor(orig_data$NUM_MARR)
mlr <- lm(orig_data$INCOME ~ ., data = orig_data )
summary(mlr)
Call:
lm(formula = orig_data$INCOME ~ ., data = orig_data)
Residuals:
Min 1Q Median 3Q Max
-128772 -27124 -3844 15457 499322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -165788.76 70528.26 -2.351 0.019160 *
NUM_CHILD 2046.01 2748.93 0.744 0.457079
SEXM 1134.39 6154.00 0.184 0.853832
AGE 830.05 242.86 3.418 0.000687 ***
NUM_MARR1 -6869.47 8157.50 -0.842 0.400166
NUM_MARR2 -8278.84 11200.04 -0.739 0.460174
NUM_MARR3 -12990.01 17548.30 -0.740 0.459529
NUM_HOURS_WEEK 1624.57 243.89 6.661 7.76e-11 ***
TRAVEL_TIME 88.69 123.45 0.718 0.472858
INDUSTRYAgriculture, Forestry, Fishing and Hunting, and Mining 86980.34 61230.35 1.421 0.156126
INDUSTRYArts, Entertainment, and Recreation, and Accommodation and F 71257.54 61407.54 1.160 0.246485
INDUSTRYEducational Services, and Health Care and Social Assistance 69823.35 60625.52 1.152 0.250035
INDUSTRYFinance and Insurance, and Real Estate and Rental and Leasin 93888.77 61451.75 1.528 0.127237
INDUSTRYInformation 85829.84 63137.54 1.359 0.174681
INDUSTRYOther Services, Except Public Administration 72156.69 61784.31 1.168 0.243459
INDUSTRYProfessional, Scientific, and Management, and Administrative 92833.81 60938.76 1.523 0.128346
INDUSTRYPublic Administration 76667.05 61591.52 1.245 0.213850
INDUSTRYRetail Trade 63849.68 61220.12 1.043 0.297517
INDUSTRYTransportation and Warehousing, and Utilities 92340.16 62031.19 1.489 0.137274
INDUSTRYWholesale Trade 92271.72 62373.79 1.479 0.139734
EDUC1-4 years of college 17001.63 6254.97 2.718 0.006813 **
EDUC5+ years of college 73774.17 9826.65 7.508 3.14e-13 ***
EDUCno school or N/A 2006.79 28248.65 0.071 0.943397
DIVISIONNORTHEAST 47545.64 8970.93 5.300 1.80e-07 ***
DIVISIONSOUTH 13721.27 7422.74 1.849 0.065164 .
DIVISIONWEST 10935.99 8346.16 1.310 0.190746
SCHL_TYPEPrivate School -4270.07 17321.88 -0.247 0.805395
SCHL_TYPEPublic School 8276.74 11767.30 0.703 0.482181
RACEAsian 11384.88 26644.44 0.427 0.669368
RACEBlack -2060.72 25130.19 -0.082 0.934681
RACEOther 13034.37 25764.01 0.506 0.613159
RACEWhite 9501.71 23492.58 0.404 0.686065
HEALTH_INSWith health insurance coverage 11342.18 10730.66 1.057 0.291071
WEEKS_WORKED14-26 weeks -28019.15 26296.32 -1.066 0.287200
WEEKS_WORKED27-39 weeks -10906.77 26017.77 -0.419 0.675262
WEEKS_WORKED40-47 weeks -31954.51 24974.32 -1.279 0.201367
WEEKS_WORKED48-49 weeks 2138.16 27505.78 0.078 0.938073
WEEKS_WORKED50-52 weeks -6065.70 23081.82 -0.263 0.792829
VETERANVeteran 1993.57 13615.86 0.146 0.883658
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59190 on 461 degrees of freedom
Multiple R-squared: 0.3618, Adjusted R-squared: 0.3092
F-statistic: 6.878 on 38 and 461 DF, p-value: < 2.2e-16
For this model, \(\bar{R}^2\) turns out to be \(0.3092\): the model explains only about 31% of the variance in income and thus fails to describe the data accurately.
qf(0.95, df1 = 38, df2 = 461)
[1] 1.431182
Running the F-test on the preliminary model shows that the model is significant at \(\alpha = 0.05\), because the model's F-statistic (\(6.878\)) is greater than \(F_{0.95}(38, 461) = 1.431182\). The p-value is also essentially \(0\).
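The same conclusion can be reached by computing the p-value of the observed F-statistic directly (a quick check, not part of the original model output):

# P(F(38, 461) > 6.878) is essentially zero
pf(6.878, df1 = 38, df2 = 461, lower.tail = FALSE)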
VIF test (multicollinearity test)

library(car)
vif(mlr)
GVIF Df GVIF^(1/(2*Df))
NUM_CHILD 1.275544 1 1.129400
SEX 1.350841 1 1.162257
AGE 1.890971 1 1.375126
NUM_MARR 2.286037 3 1.147750
NUM_HOURS_WEEK 1.429269 1 1.195520
TRAVEL_TIME 1.117092 1 1.056926
INDUSTRY 2.964278 11 1.050633
EDUC 1.618329 3 1.083539
DIVISION 1.386669 3 1.055996
SCHL_TYPE 1.564106 2 1.118321
RACE 1.660683 4 1.065457
HEALTH_INS 1.153855 1 1.074176
WEEKS_WORKED 1.727593 5 1.056195
VETERAN 1.160938 1 1.077468
mean(vif(mlr)[,1]) # mean of vifs = 1.6133
[1] 1.6133
max(vif(mlr)[,1]) # maximum value of vifs = 2.964278
[1] 2.964278
Since the mean of the VIFs (\(1.6133\)) is not substantially greater than \(1\) and the maximum VIF (\(2.964278\)) is less than \(10\), we can conclude that our “preliminary” model doesn't exhibit multicollinearity.
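As a sanity check on the vif output, the VIF of a single predictor can be reproduced from its definition \(\mathrm{VIF}_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all remaining predictors. A sketch using AGE:

# regress AGE on every other predictor (excluding the response)
r2_age <- summary(lm(AGE ~ . - INCOME, data = orig_data))$r.squared
1 / (1 - r2_age) # should match vif(mlr)["AGE", "GVIF"]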
# Split the plotting panel into a 2 x 2 grid
par(mfrow = c(2, 2), pch = 16, cex = .5)
plot(mlr)
not plotting observations with leverage one:
24
The four plots above show that our model does not satisfy the regression assumptions: the Residuals vs. Fitted Values plot suggests non-constant error variance, and the Normal Probability Plot indicates that the residuals are not normally distributed.
For the second model, we selected the variables that were significant at the \(\alpha = 0.05\) level in the “preliminary” model: AGE, NUM_HOURS_WEEK, EDUC, and DIVISION.
mlr_t <- lm(orig_data$INCOME ~ AGE + NUM_HOURS_WEEK + EDUC + DIVISION, data = orig_data)
summary(mlr_t)
Call:
lm(formula = orig_data$INCOME ~ AGE + NUM_HOURS_WEEK + EDUC +
DIVISION, data = orig_data)
Residuals:
Min 1Q Median 3Q Max
-136847 -25843 -6832 17561 516542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -78920.2 13214.0 -5.972 4.49e-09 ***
AGE 772.2 180.1 4.288 2.18e-05 ***
NUM_HOURS_WEEK 1826.2 207.2 8.815 < 2e-16 ***
EDUC1-4 years of college 17439.2 5790.9 3.011 0.00273 **
EDUC5+ years of college 70118.3 8850.6 7.922 1.57e-14 ***
EDUCno school or N/A 4534.5 27067.9 0.168 0.86703
DIVISIONNORTHEAST 45091.6 8809.4 5.119 4.43e-07 ***
DIVISIONSOUTH 7645.3 7061.9 1.083 0.27951
DIVISIONWEST 10141.6 7924.9 1.280 0.20125
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59190 on 491 degrees of freedom
Multiple R-squared: 0.3203, Adjusted R-squared: 0.3092
F-statistic: 28.92 on 8 and 491 DF, p-value: < 2.2e-16
For this model, \(\bar{R}^2\) turned out to be \(0.3092\), the same as for the “preliminary” model. This means that reducing the dimensionality did not lose much information about the response variable.
qf(0.95, df1 = 8, df2 = 491)
[1] 1.957253
The “manual” model is statistically significant because its F-statistic (\(28.92\)) is greater than \(F_{0.95}(8, 491) = 1.957253\).
par(mfrow = c(2, 2), pch = 16, cex = .5)
plot(mlr_t)
The four plots above show that the second model does not satisfy the regression assumptions either. As with the first model, the Residuals vs. Fitted Values plot and the Normal Probability Plot deviate from their expected shapes, indicating that the error terms are not normally distributed with constant variance.
set.seed(1337)
step(mlr, direction = "backward", trace = F)
Call:
lm(formula = orig_data$INCOME ~ SEX + AGE + NUM_HOURS_WEEK +
EDUC + DIVISION, data = orig_data)
Coefficients:
(Intercept) SEXM AGE NUM_HOURS_WEEK EDUC1-4 years of college
-81462 7924 777 1774 18103
EDUC5+ years of college EDUCno school or N/A DIVISIONNORTHEAST DIVISIONSOUTH DIVISIONWEST
72010 2280 44859 7582 10098
Running the backward elimination method reveals that we should include SEX, AGE, NUM_HOURS_WEEK, EDUC, and DIVISION as our independent variables. Notably, these are the same variables chosen as significant for the manual model above, with the addition of the SEX variable.
step(mlr, direction = "both", trace = F)
Call:
lm(formula = orig_data$INCOME ~ SEX + AGE + NUM_HOURS_WEEK +
EDUC + DIVISION, data = orig_data)
Coefficients:
(Intercept) SEXM AGE NUM_HOURS_WEEK EDUC1-4 years of college
-81462 7924 777 1774 18103
EDUC5+ years of college EDUCno school or N/A DIVISIONNORTHEAST DIVISIONSOUTH DIVISIONWEST
72010 2280 44859 7582 10098
Running the stepwise regression method reveals that we should include SEX, AGE, NUM_HOURS_WEEK, EDUC, and DIVISION as our independent variables. This is the same conclusion drawn from the backward elimination.
The automatic model-building procedures suggested that we should use SEX, AGE, NUM_HOURS_WEEK, EDUC, and DIVISION as our independent variables.
mlr_bs <- lm(orig_data$INCOME ~ SEX + AGE + NUM_HOURS_WEEK + EDUC + DIVISION, data = orig_data)
summary(mlr_bs)
Call:
lm(formula = orig_data$INCOME ~ SEX + AGE + NUM_HOURS_WEEK +
EDUC + DIVISION, data = orig_data)
Residuals:
Min 1Q Median 3Q Max
-142127 -25991 -6689 17095 520568
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81462.3 13312.7 -6.119 1.93e-09 ***
SEXM 7924.1 5420.1 1.462 0.14439
AGE 777.0 179.9 4.319 1.90e-05 ***
NUM_HOURS_WEEK 1774.0 210.0 8.448 3.39e-16 ***
EDUC1-4 years of college 18102.6 5802.0 3.120 0.00191 **
EDUC5+ years of college 72009.6 8934.5 8.060 5.89e-15 ***
EDUCno school or N/A 2280.3 27080.5 0.084 0.93293
DIVISIONNORTHEAST 44859.1 8800.7 5.097 4.93e-07 ***
DIVISIONSOUTH 7581.7 7053.9 1.075 0.28298
DIVISIONWEST 10098.3 7915.8 1.276 0.20266
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59120 on 490 degrees of freedom
Multiple R-squared: 0.3232, Adjusted R-squared: 0.3108
F-statistic: 26 on 9 and 490 DF, p-value: < 2.2e-16
\(\bar{R}^2\) turned out to be \(0.3108\), slightly better than the previous models.
par(mfrow = c(2, 2), pch = 16, cex = .5)
plot(mlr_bs)
Although \(\bar{R}^2\) improved slightly compared to the prior models, the residual plots still show that our model doesn't satisfy the regression assumptions.
hist(mlr_bs$residuals, col = "lightgray")
The histogram of residuals confirms that they are not normally distributed; they are clearly positively skewed. We can try to apply a transformation to normalize the residuals.
library(MASS)
# profile the Box-Cox log-likelihood over a narrow lambda grid
b_bs <- boxcox(mlr, lambda = seq(0, 0.2, 0.1))
# pick the lambda value that maximizes the profile log-likelihood
value_bs <- b_bs$x[which.max(b_bs$y)]
value_bs
[1] 0.01414141
Running the Box-Cox procedure on the original data reveals that we should use \(\lambda = 0.01414141\) to transform our dependent variable, income. The plot above shows that the profile log-likelihood peaks at \(\lambda = 0.01414141\).
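For reference, the Box-Cox family of transformations is

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases} \]

so a \(\lambda\) this close to zero indicates that the appropriate transformation is essentially logarithmic, a fact we rely on later when interpreting coefficients.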
We created a new model using the transformed response variable. Re-running backward elimination (steps omitted) showed that we should include AGE, NUM_HOURS_WEEK, TRAVEL_TIME, INDUSTRY, EDUC, DIVISION, SCHL_TYPE, HEALTH_INS, and WEEKS_WORKED. Notably, these variables differ from the ones picked for the non-transformed data: SEX, AGE, NUM_HOURS_WEEK, EDUC, and DIVISION.
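# with lambda near 0 this is close to a log transform; we apply the monotone
# power y^lambda directly, which is equivalent to the full Box-Cox form
# (y^lambda - 1) / lambda up to a linear rescaling of the response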
orig_data$INCOME_TRANSFORMED <- orig_data$INCOME^value_bs
mlr_transform_bs <- lm(formula = orig_data$INCOME_TRANSFORMED ~ AGE + NUM_HOURS_WEEK +
TRAVEL_TIME + INDUSTRY + EDUC + DIVISION + SCHL_TYPE + HEALTH_INS + WEEKS_WORKED, data = orig_data)
summary(mlr_transform_bs)
Call:
lm(formula = orig_data$INCOME_TRANSFORMED ~ AGE + NUM_HOURS_WEEK +
TRAVEL_TIME + INDUSTRY + EDUC + DIVISION + SCHL_TYPE + HEALTH_INS +
WEEKS_WORKED, data = orig_data)
Residuals:
Min 1Q Median 3Q Max
-0.033972 -0.005723 0.000365 0.006087 0.038519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.083e+00 1.166e-02 92.880 < 2e-16 ***
AGE 2.339e-04 3.647e-05 6.414 3.44e-10 ***
NUM_HOURS_WEEK 4.454e-04 4.205e-05 10.593 < 2e-16 ***
TRAVEL_TIME 4.370e-05 2.143e-05 2.039 0.04196 *
INDUSTRYAgriculture, Forestry, Fishing and Hunting, and Mining 1.533e-02 1.069e-02 1.435 0.15196
INDUSTRYArts, Entertainment, and Recreation, and Accommodation and F 5.329e-03 1.074e-02 0.496 0.62015
INDUSTRYEducational Services, and Health Care and Social Assistance 9.123e-03 1.062e-02 0.859 0.39080
INDUSTRYFinance and Insurance, and Real Estate and Rental and Leasin 1.513e-02 1.074e-02 1.408 0.15976
INDUSTRYInformation 1.451e-02 1.103e-02 1.315 0.18900
INDUSTRYOther Services, Except Public Administration 1.035e-02 1.081e-02 0.958 0.33851
INDUSTRYProfessional, Scientific, and Management, and Administrative 1.437e-02 1.065e-02 1.349 0.17804
INDUSTRYPublic Administration 1.417e-02 1.076e-02 1.318 0.18831
INDUSTRYRetail Trade 9.134e-03 1.070e-02 0.853 0.39385
INDUSTRYTransportation and Warehousing, and Utilities 1.379e-02 1.082e-02 1.274 0.20326
INDUSTRYWholesale Trade 1.572e-02 1.087e-02 1.446 0.14884
EDUC1-4 years of college 5.522e-03 1.085e-03 5.089 5.22e-07 ***
EDUC5+ years of college 1.330e-02 1.714e-03 7.757 5.43e-14 ***
EDUCno school or N/A 1.491e-05 4.890e-03 0.003 0.99757
DIVISIONNORTHEAST 6.735e-03 1.571e-03 4.286 2.21e-05 ***
DIVISIONSOUTH 9.181e-05 1.277e-03 0.072 0.94273
DIVISIONWEST 2.440e-03 1.436e-03 1.699 0.08992 .
SCHL_TYPEPrivate School -4.918e-03 3.017e-03 -1.630 0.10381
SCHL_TYPEPublic School -5.215e-03 2.031e-03 -2.568 0.01054 *
HEALTH_INSWith health insurance coverage 4.412e-03 1.879e-03 2.348 0.01930 *
WEEKS_WORKED14-26 weeks 1.456e-02 4.562e-03 3.191 0.00151 **
WEEKS_WORKED27-39 weeks 2.360e-02 4.508e-03 5.235 2.49e-07 ***
WEEKS_WORKED40-47 weeks 2.177e-02 4.350e-03 5.005 7.92e-07 ***
WEEKS_WORKED48-49 weeks 3.154e-02 4.745e-03 6.646 8.32e-11 ***
WEEKS_WORKED50-52 weeks 2.789e-02 3.997e-03 6.978 1.02e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01042 on 471 degrees of freedom
Multiple R-squared: 0.6251, Adjusted R-squared: 0.6029
F-statistic: 28.05 on 28 and 471 DF, p-value: < 2.2e-16
The transformed model has \(\bar{R}^2\) equal to \(0.6029\), a large improvement over all previous models.
par(mfrow = c(2, 2), pch = 16, cex = .5)
plot(mlr_transform_bs)
not plotting observations with leverage one:
24
The transformed model still does not quite satisfy the regression assumptions. There are, however, visible improvements in the Normal Probability Plot and the Residuals vs. Fitted Values plot, indicating that this model follows the constant-variance and normality assumptions more closely than the previous models.
hist(mlr_transform_bs$residuals, col = "lightgray")
The histogram above looks approximately normal, further indicating that the normality assumption may hold for the transformed model. However, we run the Anderson-Darling (AD) normality test to determine more formally whether the normality condition is met.
library(nortest)
ad.test(mlr_transform_bs$residuals)
Anderson-Darling normality test
data: mlr_transform_bs$residuals
A = 1.4942, p-value = 0.0007409
Applying the AD test to our Box-Cox-transformed, stepwise-selected model reveals that the normality assumption is still not satisfied, because we can reject the AD test's null hypothesis of normality at \(\alpha = 0.05\).
library(car)
plot(mlr_transform_bs, pch = 18, col ="red", which = c(4))
cook <- cooks.distance(mlr_transform_bs)
abline(h = 4/nrow(orig_data), col="blue") # add cutoff line
According to the plot of the Cook’s distances, there might be several outlying values.
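Recall that Cook's distance for observation \(i\) combines its residual and leverage:

\[ D_i = \frac{r_i^2}{k + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}, \]

where \(r_i\) is the internally studentized residual, \(h_{ii}\) is the leverage, and \(k + 1\) is the number of estimated coefficients.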
k <- length(coef(mlr_transform_bs)) - 1
f_50 <- qf(0.50, df1 = k+1, df2 = mlr_transform_bs$df)
f_80 <- qf(0.20, df1 = k+1, df2 = mlr_transform_bs$df)
any(cook > f_50, na.rm = T)
[1] FALSE
According to the Cook's distance test described in the textbook, there seem to be no influential observations: none of the observations has a Cook's distance greater than \(F_{0.5}\), and every observation's Cook's distance falls below even the 20th percentile \(F_{0.2}\) computed above. Note that different sources suggest different cut-off values for spotting highly influential points.
omit_cook <- which(cook > 4/nrow(orig_data))
length(omit_cook)
[1] 39
Using the cut-off value \(D_i > 4/n\), where \(n\) is the number of observations (shown as the horizontal line on the plot above), we detected 39 outlying values.
stud_residual <- rstandard(mlr_transform_bs) # internally studentized residuals
omit_stud <- which(stud_residual < -2 | stud_residual > 2 | is.na(stud_residual)) # residuals with absolute value greater than 2, plus NA values
length(omit_stud) # 29 potential outliers wrt y values
[1] 29
After studentizing the residuals of the transformed model, we selected 29 potential outliers with respect to \(y\) values.
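Note that rstandard returns internally studentized residuals; an externally studentized alternative is available via rstudent and can be used as a cross-check (not used for the final selection):

# externally studentized residuals: each point is excluded from its own variance estimate
omit_stud_ext <- which(abs(rstudent(mlr_transform_bs)) > 2)
length(omit_stud_ext)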
lev <- hatvalues(mlr_transform_bs)
n <- nrow(orig_data)
cutoff_lev <- 2 * (k + 1) / n # rule-of-thumb leverage cutoff
plot(lev, pch = 16, cex = .5)
abline(h = cutoff_lev, col="red") # add cutoff line
omit_lev <- which(lev > cutoff_lev)
length(omit_lev)
[1] 25
There are 25 observations that are associated with large leverage values.
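Here the leverage values are the diagonal entries of the hat matrix,

\[ H = X(X^{\top}X)^{-1}X^{\top}, \qquad h_{ii} = [H]_{ii}, \]

and the rule of thumb used above flags observations with \(h_{ii} > 2(k+1)/n\).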
Studying the nature of the outliers is beyond the scope of this analysis, so we decided to see how the model improves if we exclude the outliers with respect to \(y\). We did not exclude outliers with respect to \(x\), because doing so turned out to decrease the utility of the model.
orig_data_stud_omit <- orig_data[-c(omit_stud, omit_cook),]
mlr_transform_orig_bs_omit <- lm(INCOME_TRANSFORMED ~ AGE + NUM_HOURS_WEEK +
TRAVEL_TIME + INDUSTRY + EDUC + DIVISION + SCHL_TYPE + HEALTH_INS +
WEEKS_WORKED, data = orig_data_stud_omit)
summary(mlr_transform_orig_bs_omit)
Call:
lm(formula = INCOME_TRANSFORMED ~ AGE + NUM_HOURS_WEEK + TRAVEL_TIME +
INDUSTRY + EDUC + DIVISION + SCHL_TYPE + HEALTH_INS + WEEKS_WORKED,
data = orig_data_stud_omit)
Residuals:
Min 1Q Median 3Q Max
-0.0221479 -0.0054799 0.0003281 0.0052736 0.0193001
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.092e+00 6.123e-03 178.317 < 2e-16 ***
AGE 2.222e-04 2.905e-05 7.650 1.36e-13 ***
NUM_HOURS_WEEK 4.196e-04 3.362e-05 12.483 < 2e-16 ***
TRAVEL_TIME 7.416e-05 1.789e-05 4.145 4.10e-05 ***
INDUSTRYArts, Entertainment, and Recreation, and Accommodation and F -9.026e-03 1.677e-03 -5.383 1.21e-07 ***
INDUSTRYEducational Services, and Health Care and Social Assistance -5.963e-03 1.302e-03 -4.581 6.08e-06 ***
INDUSTRYFinance and Insurance, and Real Estate and Rental and Leasin -1.529e-03 1.892e-03 -0.808 0.419343
INDUSTRYInformation 9.638e-04 2.593e-03 0.372 0.710311
INDUSTRYOther Services, Except Public Administration -4.914e-03 1.822e-03 -2.697 0.007284 **
INDUSTRYProfessional, Scientific, and Management, and Administrative -1.718e-03 1.519e-03 -1.131 0.258726
INDUSTRYPublic Administration -4.715e-04 1.898e-03 -0.248 0.803929
INDUSTRYRetail Trade -4.926e-03 1.453e-03 -3.391 0.000761 ***
INDUSTRYTransportation and Warehousing, and Utilities -3.328e-03 2.045e-03 -1.627 0.104403
INDUSTRYWholesale Trade 3.210e-03 2.132e-03 1.506 0.132889
EDUC1-4 years of college 5.475e-03 8.474e-04 6.461 2.84e-10 ***
EDUC5+ years of college 1.526e-02 1.405e-03 10.865 < 2e-16 ***
EDUCno school or N/A -6.480e-03 8.056e-03 -0.804 0.421649
DIVISIONNORTHEAST 6.014e-03 1.247e-03 4.823 1.97e-06 ***
DIVISIONSOUTH 9.501e-04 1.008e-03 0.942 0.346620
DIVISIONWEST 2.074e-03 1.146e-03 1.810 0.071052 .
SCHL_TYPEPrivate School -4.361e-03 2.636e-03 -1.655 0.098719 .
SCHL_TYPEPublic School -5.455e-03 1.604e-03 -3.402 0.000733 ***
HEALTH_INSWith health insurance coverage 5.516e-03 1.446e-03 3.814 0.000157 ***
WEEKS_WORKED14-26 weeks 2.310e-02 6.004e-03 3.847 0.000138 ***
WEEKS_WORKED27-39 weeks 3.331e-02 6.074e-03 5.484 7.15e-08 ***
WEEKS_WORKED40-47 weeks 2.564e-02 5.934e-03 4.320 1.94e-05 ***
WEEKS_WORKED48-49 weeks 3.586e-02 6.101e-03 5.878 8.40e-09 ***
WEEKS_WORKED50-52 weeks 3.389e-02 5.762e-03 5.883 8.19e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.007834 on 425 degrees of freedom
Multiple R-squared: 0.7139, Adjusted R-squared: 0.6957
F-statistic: 39.28 on 27 and 425 DF, p-value: < 2.2e-16
nrow(orig_data) - nrow(orig_data_stud_omit) # number of excluded observations
[1] 47
After dropping the 47 observations classified as potential outliers, the model has \(\bar{R}^2\) equal to \(0.6957\), a notable increase from the \(0.6029\) value obtained before the outliers were dropped.
library(nortest)
ad.test(mlr_transform_orig_bs_omit$residuals)
Anderson-Darling normality test
data: mlr_transform_orig_bs_omit$residuals
A = 0.67046, p-value = 0.07951
Applying the AD test to our final, best model reveals that the normality assumption is satisfied, because we fail to reject the AD test's null hypothesis at \(\alpha = 0.05\).
par(mfrow = c(2, 2), pch = 16, cex = .5)
plot(mlr_transform_orig_bs_omit)
not plotting observations with leverage one:
103
The four plots above indicate that the model without the suspected outliers satisfies the regression assumptions. The Normal Probability Plot is close to linear, and the Residuals vs. Fitted Values plot shows only minor signs of violation of the constant-variance assumption (slight fanning out).
Our final model demonstrates satisfactory adherence to the regression assumptions and has a relatively high \(\bar{R}^2\) of \(0.6957\), which means we can draw reasonably supported conclusions from its interpretation.
Unsurprisingly, some of the most significant predictors are NUM_HOURS_WEEK and WEEKS_WORKED, since hours and weeks worked are usually highly correlated with a person's income. More interesting relationships are shown by the variables corresponding to the Northeastern division of the United States, health insurance coverage, 5+ years of college, public school type, etc. To understand the relationships of these variables to a person's income, their estimated coefficients must be interpreted. Interpreting the coefficients of a model with a transformed response variable turns out to be a difficult task. Since the value of \(\lambda = 0.01414141\) in our transformation was close to zero, we decided to follow the procedure described here, assuming that the transformation was logarithmic. Thus, we can interpret the exponentiated coefficients as an approximate percentage change in a person's income. For example, the estimated coefficient for the health insurance variable (HEALTH_INS) is \(5.516 \times 10^{-3}\); since \(100(e^{5.516 \times 10^{-3}} - 1) \approx 0.55\), the expected difference in yearly income is about 0.55% for an individual with health insurance compared to one without.
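Under the same logarithmic approximation, the percentage interpretation can be applied to all coefficients at once. A sketch (it relies on the \(\lambda \approx 0\) assumption discussed above, and the intercept has no percentage interpretation):

# approximate percentage change in income per unit change in each predictor
pct_effect <- 100 * (exp(coef(mlr_transform_orig_bs_omit)) - 1)
round(sort(pct_effect, decreasing = TRUE), 2)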
Some of the most interesting regression analysis results can be summed up as follows:
Since the SEX variable was not included in the final model, we conclude that gender does not have a significant relationship to an individual's income.
Our preliminary model has \(\bar{R}^2 = 0.3092\), and the F-test indicates that the overall model is significant at \(\alpha = 0.05\). However, the residual plots reveal that the preliminary model doesn't satisfy the regression assumptions. The manual model does no better than the preliminary model in terms of \(\bar{R}^2\). The VIF test reveals that the preliminary model doesn't exhibit any serious multicollinearity among the independent variables. The model was further narrowed down by running both backward elimination and stepwise regression on the preliminary model; both methods leave us with the SEX, AGE, NUM_HOURS_WEEK, EDUC, and DIVISION variables for regressing income. This “automatic” model does slightly better than the preliminary model (\(\bar{R}^2 = 0.3108\)), but the residual plots show that it still doesn't satisfy the regression assumptions.
To meet the regression assumptions, we transformed income, the dependent variable, using the Box-Cox method. Box-Cox transforming the income variable yields a preliminary transformed model with \(\bar{R}^2 = 0.6029\). We then applied stepwise regression and backward elimination to this model; both methods give identical models using the same independent variables. Finally, we tested for influential observations with respect to both \(y\) and \(x\) values, excluded the outliers with respect to \(y\), and ran a new regression on the reduced dataset. This produced our best model so far: after stepwise selection and outlier removal it has \(\bar{R}^2 = 0.6957\), and its residual plots approximately satisfy the regression assumptions, even though a few observations remain that are possibly outliers with respect to \(x\) values.
Lastly, we give an interpretation of the final model. We faced the problem of interpreting coefficients in a transformed model. To interpret the effect of a predictor on the outcome variable, we have to: (1) treat the Box-Cox transformation with \(\lambda \approx 0\) as a logarithmic transformation, (2) exponentiate the estimated coefficients, and (3) subtract one and multiply by 100 to obtain an approximate percentage change.
As a result, exponentiated coefficients approximately correspond to changes in the ratio of the expected geometric means of the original outcome variable. Consequently, we study the relationships of the most influential variables and present a summary of the most important conclusions.
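In symbols, for a one-unit increase in a predictor \(x_j\),

\[ \frac{\operatorname{GM}(y \mid x_j + 1)}{\operatorname{GM}(y \mid x_j)} \approx e^{\hat{\beta}_j}, \]

so \(100(e^{\hat{\beta}_j} - 1)\) is the approximate percent change in the geometric mean of income.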
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. Integrated Public Use Microdata Series: Version 7.0 [dataset]. Minneapolis, MN: University of Minnesota, 2017. https://doi.org/10.18128/D010.V7.0