In this article, we’ll cover the main functions used to conduct a statistical analysis in MATLAB. I’m working with a sample data set that describes the salaries of employees along with several of their features.
I. Data exploration
The file I’m working with comes with a small description file (DES). The description of the data is the following:
The data set contains 526 observations of the following 24 variables:
1. wage: average hourly earnings
2. educ: years of education
3. exper: years of potential experience
4. tenure: years with current employer
5. nonwhite: =1 if nonwhite
6. female: =1 if female
7. married: =1 if married
8. numdep: number of dependents
9. smsa: =1 if living in an SMSA (standard metropolitan statistical area)
10. northcen: =1 if living in the north central U.S.
11. south: =1 if living in the southern region
12. west: =1 if living in the western region
13. construc: =1 if working in the construction industry
14. ndurman: =1 if in the nondurable manufacturing industry
15. trcommpu: =1 if in the transport, communication or public utilities industry
16. trade: =1 if in wholesale or retail trade
17. services: =1 if in the services industry
18. profserv: =1 if in the professional services industry
19. profocc: =1 if in a professional occupation
20. clerocc: =1 if in a clerical occupation
21. servocc: =1 if in a service occupation
22. lwage: log(wage)
23. expersq: exper^2
24. tenursq: tenure^2
The first step is to load the data set, which must be in our path:
wage1=load('WAGE1.raw')   % load the raw data as a numeric matrix
[n,k]=size(wage1)         % n = number of observations, k = number of features
Then, we’ll plot a histogram of the hourly wages:
hist(wage1(:,1))
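By default, hist uses 10 bins; to get a finer view, we can pass the number of bins explicitly. A minimal sketch:
hist(wage1(:,1), 50)   % same histogram with 50 bins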
Now, let’s plot the histogram for the number of years of education.
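From the description above, educ is the second column of the data set, so a minimal sketch would be:
hist(wage1(:,2))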
We can take a look at the mean of each feature:
mean(wage1)'
ans =
5.8961
12.5627
17.0171
5.1046
0.1027
...
We can also take a look at other descriptive statistics:
std(wage1)'        % standard deviation of each feature
min(wage1)'
max(wage1)'
cov(wage1)         % covariance matrix (symmetric, so no transpose needed)
corrcoef(wage1)    % correlation matrix
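For instance, to focus on how each feature correlates with the wage (the first column), a minimal sketch:
R = corrcoef(wage1);   % full correlation matrix
R(:,1)                 % correlation of each variable with wage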
II. Linear regression
Compute coefficients
We now want to assess the effect of columns 2, 3 and 4 (education, experience, years with current employer) on the salary.
\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i\]
y=wage1(:,1)
X=[ones(n,1),wage1(:,[2,3,4])]
We can compute the beta coefficients and the residuals:
beta=inv(X'*X)*X'*y
u=y-X*beta
The coefficients are the following:
beta =
-2.8727
0.5990
0.0223
0.1693
The first coefficient is the constant (intercept). Then, the model indicates that an additional year of education should increase the hourly wage by $0.60.
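As a side note, MATLAB’s backslash operator solves the same least-squares problem and is numerically more stable than explicitly inverting X'*X; a minimal equivalent sketch:
beta = X\y   % same estimates as inv(X'*X)*X'*y, but better conditioned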
Residuals
We can also plot a histogram of the residuals. Their distribution appears to be close to a normal one.
We can decide to remove outliers, for example residuals above 2.5:
indices = find(u>2.5);   % indices of the large positive residuals
u(indices) = [];         % remove them
hist(u)
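To double-check normality, a normal probability plot can complement the histogram (this assumes the Statistics Toolbox, which provides normplot, is available):
normplot(u)   % points close to the reference line suggest normality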
Hypothesis testing
Are the coefficients significant?
To assess the significance of each estimator, we need to estimate its standard error.
u=y-X*beta;                      % recompute the residuals (the trimming above was only for the histogram)
sig2=u'*u/(n-4)                  % estimated error variance, with n-4 degrees of freedom
se=sqrt(diag(sig2*inv(X'*X)))    % standard errors (named se to avoid shadowing MATLAB's std function)
Then, we can easily compute the test statistics:
t = beta./se
t =
-5.6300
16.6857
2.6471
11.1725
What should we compare the test statistics to? Well, to the critical value! For example, since we estimate 4 parameters, the 95% critical value of the Student distribution with n-4 degrees of freedom is:
C1 = tinv(0.95, n-4)
C1 =
1.6478
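As a quick vectorized check, a minimal sketch:
abs(t) > C1   % returns 1 for each coefficient significant at the 5% level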
The t-statistics are all greater than the critical value in absolute value. For this reason, we reject the hypothesis \(H_0\) that the variables have no impact on the output.
Effect of a single feature
If we are interested in a single feature, we can also test a specific value for its effect. For example, to test whether an additional year of education increases the hourly wage by $0.10, i.e. \(H_0 : \beta_1 = 0.1\):
t_educ = (beta(2)-0.1)/se(2)
We then compare this statistic to the critical value as above.
Effect of each feature on log(wage)
Testing the effect of the features on log(wage) is a way to assess their importance in terms of percentage change: each coefficient is then read as the approximate proportional change in the wage for a one-unit increase in the feature (a coefficient of 0.09 on education, for example, would mean roughly a 9% wage increase per additional year).
y=log(wage1(:,1))
beta=inv(X'*X)*X'*y
u=y-X*beta;                      % residuals of the log-wage regression
sig2=u'*u/(n-4);                 % the error variance must be re-estimated on these residuals
se=sqrt(diag(sig2*inv(X'*X)))
t = beta./se
Test if 2 features have the same effect
Now, let’s test the following hypothesis: the effect of education is the same as the effect of professional experience. This can be written as \(H_0 : \beta_1 = \beta_2\). However, estimating the variance of \(\beta_1 - \beta_2\) to build the test statistic would be challenging. We can apply a simple trick instead: since \(\beta_1 x_1 + \beta_2 x_2 = (\beta_1 - \beta_2) x_1 + \beta_2 (x_1 + x_2)\), regressing on \(x_1\) and on the sum \(x_1 + x_2\) turns the hypothesis into a test that a single coefficient is zero.
First of all, build the new feature as the sum of education and experience:
capitaltot = X(:,2) + X(:,3)
Keeping educ, exper and their sum together would make the design matrix perfectly collinear, so we replace the exper column with the sum:
X=[X(:,[1,2,4]),capitaltot];
y=log(wage1(:,1))
We can now run the test \(H_0 : \beta_1 - \beta_2 = 0\), where \(\beta_1 - \beta_2\) is the coefficient on educ, the second column of the new design matrix:
beta=inv(X'*X)*X'*y
u=y-X*beta;                      % residuals of this new specification
sig2=u'*u/(n-4);                 % re-estimated error variance
se=sqrt(diag(sig2*inv(X'*X)))
t = beta(2)/se(2)
We then compare this statistic to the critical value, as before: if it is smaller in absolute value, we cannot reject the hypothesis that, in percentage terms, the effect of education is the same as the effect of professional experience; otherwise, we reject it.
Fisher test for a joint hypothesis
Let us now consider the joint hypothesis \(H_0 : \beta_1 = 0, \beta_2 = 0\), i.e. neither education nor experience has an effect. In such a case, we should not apply 2 independent tests, since we are interested in the overall hypothesis.
We need to define an unconstrained model and a constrained model, and test the difference between those two models.
% unconstrained model
X0=[ones(n,1),wage1(:,[2,3,4])]
y0=log(wage1(:,1))
beta0=inv(X0'*X0)*X0'*y0
u0=y0-X0*beta0
SSR0=u0'*u0
% constrained model (educ and exper removed)
X1=[ones(n,1),wage1(:,[4])]
y1=log(wage1(:,1))
beta1=inv(X1'*X1)*X1'*y1
u1=y1-X1*beta1
SSR1=u1'*u1
The Fisher F-statistic is defined as:
\[\hat{F} = \frac{SSR_1 - SSR_0}{SSR_0} \cdot \frac{n - K}{2}\]
where K is the number of parameters estimated in the unconstrained model (4 here, including the constant) and 2 is the number of restrictions being tested.
K = 4
F = (SSR1-SSR0)/(SSR0) * (n-K)/2
This yields:
F =
80.1478
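We can compare this statistic to the 95% critical value of the Fisher distribution with (2, n-K) degrees of freedom; a minimal sketch, assuming the Statistics Toolbox (finv) is available and using the hypothetical name C2 by analogy with C1 above:
C2 = finv(0.95, 2, n-K)
The F-statistic is far above this critical value (around 3), so we reject the joint hypothesis: education and experience do matter.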
Interaction effect
Let’s define new variables:
X=wage1
educ=X(:,2)
exper=X(:,3)
tenure=X(:,4)
female=X(:,6)
married=X(:,7)
marrmale = (1-female).*married
marrfem = female.*married
singfem=female.*(1-married)
We created 3 interaction variables:
- marrmale: effect of being a married male
- marrfem: effect of being a married female
- singfem: effect of being a single female
The base case is a single male. We can define test statistics on those variables as we did above, as shown in the sketch below.
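As a sketch, following exactly the same steps as above, we can regress log(wage) on these dummies together with the other features and read the t-statistics:
y = log(wage1(:,1));
X = [ones(n,1), educ, exper, tenure, marrmale, marrfem, singfem];
beta = inv(X'*X)*X'*y            % each dummy coefficient approximates the percentage wage gap with a single male
u = y - X*beta;
sig2 = u'*u/(n - size(X,2));     % 7 parameters are estimated here
se = sqrt(diag(sig2*inv(X'*X)));
t = beta./se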
**Conclusion**: This was a quick introduction to statistical analysis in MATLAB. I hope you found it useful. Don’t hesitate to drop a comment if you have a question.