In this article, we’ll cover the main MATLAB functions used to conduct a statistical analysis. I’m working on a sample data set that describes employees’ salaries as a function of several features.

I. Data exploration

The file I’m working on comes with a short description file (DES). The description of the data is the following:

wage      educ      exper     tenure    nonwhite  female    married   numdep   
smsa      northcen  south     west      construc  ndurman   trcommpu  trade    
services  profserv  profocc   clerocc   servocc   lwage     expersq   tenursq  

Obs:   526

1. wage                     average hourly earnings
2. educ                     years of education
3. exper                    years potential experience
4. tenure                   years with current employer
5. nonwhite                 =1 if nonwhite
6. female                   =1 if female
7. married                  =1 if married
8. numdep                   number of dependents
9. smsa                     =1 if live in SMSA
10. northcen                 =1 if live in north central U.S
11. south                    =1 if live in southern region
12. west                     =1 if live in western region
13. construc                 =1 if work in construc. indus.
14. ndurman                  =1 if in nondur. manuf. indus.
15. trcommpu                 =1 if in trans, commun, pub ut
16. trade                    =1 if in wholesale or retail
17. services                 =1 if in services indus.
18. profserv                 =1 if in prof. serv. indus.
19. profocc                  =1 if in profess. occupation
20. clerocc                  =1 if in clerical occupation
21. servocc                  =1 if in service occupation
22. lwage                    log(wage)
23. expersq                  exper^2
24. tenursq                  tenure^2

The first step is to load the data set from our path:

wage1=load('WAGE1.raw')   % load the raw data set into a matrix
[n,k]=size(wage1)         % n = number of observations, k = number of features

Then, we’ll plot a histogram of hourly wages:

hist(wage1(:,1))

[Figure: histogram of hourly wages]

Now, let’s plot the histogram for the number of years of education.
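Using the second column of the data set (educ, per the description above):

hist(wage1(:,2))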

[Figure: histogram of years of education]

We can take a look at the mean of each feature:

mean(wage1)'   % mean of each feature, as a column vector
ans =

5.8961
12.5627
17.0171
5.1046
0.1027
...

We can also take a look at other descriptive statistics:

std(wage1)'        % standard deviation of each feature
min(wage1)'        % minimum of each feature
max(wage1)'        % maximum of each feature
cov(wage1)         % covariance matrix (symmetric, no transpose needed)
corrcoef(wage1)    % correlation matrix
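For example, to look at the correlation between the wage and education alone (columns 1 and 2):

R = corrcoef(wage1(:,1),wage1(:,2))   % 2x2 correlation matrix
r_wage_educ = R(1,2)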

II. Linear regression

Compute coefficients

We now want to assess the effect of columns 2, 3 and 4 (education, experience, years with current employer) on the salary.

\[y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i\]

y=wage1(:,1)
X=[ones(n,1),wage1(:,[2,3,4])]

We can compute the beta coefficients and the residuals:

beta=inv(X'*X)*X'*y   % OLS estimator (X'X)^(-1) X'y
u=y-X*beta            % residuals
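As a side note, MATLAB’s backslash operator solves the same least-squares problem and is numerically more stable than forming an explicit inverse:

beta_alt = X\y   % QR-based least squares, same estimates as inv(X'*X)*X'*y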

The coefficients are the following:

beta =

-2.8727
0.5990
0.0223
0.1693

The first coefficient is the intercept (constant term). The model then indicates that an additional year of education should increase the hourly wage by about $0.60.
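For example, using these estimates, the predicted hourly wage of a hypothetical worker with 12 years of education, 5 years of experience and 2 years of tenure is:

yhat = [1,12,5,2]*beta   % -2.8727 + 0.5990*12 + 0.0223*5 + 0.1693*2, about $4.77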

Residuals


We can also plot a histogram of the residuals. The distribution of the residuals appears to be close to a normal one.
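Using the residuals computed above:

hist(u)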

[Figure: histogram of the residuals]

We can decide to remove outliers, for example residuals greater than 2.5:

indices = find(u>2.5);   % indices of the outlying residuals
u(indices) = [];         % drop them
hist(u)

Hypothesis tests

Are the coefficients significant?

To assess the significance of each estimator, we need to estimate its standard error.
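Under the standard OLS assumptions, the estimated covariance matrix of the coefficients is

\[\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}, \qquad \hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n - K}\]

and the standard errors are the square roots of its diagonal elements, which is what the following code computes (with K = 4 parameters):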

u=y-X*beta;                     % recompute the residuals on the full sample (u was truncated above)
sig2=u'*u/(n-4)                 % estimated error variance
se=sqrt(diag(sig2*inv(X'*X)))   % standard errors ("se" avoids shadowing the built-in std)

Then, we can easily compute the test statistics:

t = beta./se
t =

-5.6300
16.6857
2.6471
11.1725

What should we compare the test statistics to? Well, to the critical value! For example, since we estimate 4 parameters, the 95% critical value of the Student distribution with n-4 degrees of freedom is:

C1 = tinv(0.95, n-4)
C1 =

1.6478

The absolute values of the t-statistics are all greater than the critical value. For this reason, we reject the null hypothesis \(H_0\) that a given variable has no impact on the output.
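Note that tinv(0.95, n-4) is a one-sided critical value; for the usual two-sided test at the 5% level, we would compare \(|t|\) with the 97.5% quantile, which leads to the same conclusion here:

C2 = tinv(0.975, n-4)   % about 1.96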

Effect of a single feature

If we are interested in a single feature, we can also test a specific value for its effect. For example, to test \(H_0 : \beta_2 = 0.1\), i.e. that an additional year of education increases the hourly wage by $0.10:

t_educ = (beta(2)-0.1)/se(2)

Effect of each feature on log(wage)

Testing the effect of the features on log(wage) is a way to assess their importance in terms of percentage changes in the wage.

y=log(wage1(:,1))
beta=inv(X'*X)*X'*y
u=y-X*beta;                     % recompute the residuals for the log model
sig2=u'*u/(n-4);                % recompute the error variance
se=sqrt(diag(sig2*inv(X'*X)))
t = beta./se
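In this log-level specification, a coefficient multiplied by 100 is approximately the percentage change in the wage for a one-unit increase in the feature. For example:

pct_educ = 100*beta(2)   % approximate % wage increase per additional year of education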

Test if 2 features have the same effect

Now, let’s test the following hypothesis: the effect of education is the same as the effect of professional experience. This can be rewritten as \(H_0 : \beta_2 = \beta_3\). However, directly estimating the variance of \(\beta_2 - \beta_3\) to build the test statistic would be challenging. We can apply a simple trick: build a new feature as the sum of the 2 features under test.

First of all, build the new feature:

capitaltot = X(:,2) + X(:,3)   % educ + exper

Since capitaltot is the sum of educ and exper, keeping all three would introduce perfect multicollinearity, so we remove exper (\(X_3\)). Substituting exper = capitaltot - educ into the model, the coefficient on educ becomes \(\theta = \beta_2 - \beta_3\), which is exactly the quantity we want to test. We order the columns so that this coefficient comes last:

X=[X(:,[1,4]),capitaltot,X(:,2)];   % intercept, tenure, capitaltot, educ
y=log(wage1(:,1))

We can now run the test \(H_0 : \beta_4 = \theta = 0\):

beta=inv(X'*X)*X'*y
u=y-X*beta;                     % recompute the residuals for this model
sig2=u'*u/(n-4);
se=sqrt(diag(sig2*inv(X'*X)))
t = beta(4)/se(4)

This yields:

t =

0.4883

The t-statistic is smaller than the critical value. For this reason, we cannot reject the hypothesis that the percentage effect of education is the same as that of professional experience.

Fisher test for a joint hypothesis

Let us now consider the hypothesis \(H_0 : \beta_2 = 0, \beta_3 = 0\). In that case, we should not apply 2 independent tests, since we are interested in the joint hypothesis.

We need to define an unconstrained model and a constrained model, and test the difference between those two models.

% unconstrained model
X0=[ones(n,1),wage1(:,[2,3,4])]
y0=log(wage1(:,1))
beta0=inv(X0'*X0)*X0'*y0
u0=y0-X0*beta0
SSR0=u0'*u0
% constrained model (educ and exper removed)
X1=[ones(n,1),wage1(:,4)]
y1=log(wage1(:,1))
beta1=inv(X1'*X1)*X1'*y1
u1=y1-X1*beta1
SSR1=u1'*u1

The Fisher F-statistic is defined as:

\[F = \frac{ {SSR}_1 - {SSR}_0 }{ {SSR}_0 } \cdot \frac{n - K}{q}\]

where K is the number of parameters in the unconstrained model (4 here) and q is the number of restrictions tested (2 here).

K = 4
q = 2
F = (SSR1-SSR0)/SSR0 * (n-K)/q

This yields:

F =

80.1478
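The F-statistic should be compared with the 95% critical value of the Fisher distribution with (q, n-K) degrees of freedom; finv is the analogue of the tinv function used above:

CF = finv(0.95, q, n-K)   % about 3.0

Since F is far above this critical value, we reject \(H_0\): education and experience are jointly significant.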

Interaction effect

Let’s define new variables:

X=wage1
educ=X(:,2)
exper=X(:,3)
tenure=X(:,4)
female=X(:,6)
married=X(:,7)
marrmale = (1-female).*married   % =1 if married male
marrfem = female.*married        % =1 if married female
singfem = female.*(1-married)    % =1 if single female

The interaction variables we just created capture:

  • the effect of being a married male
  • the effect of being a married female
  • the effect of being a single female

The base case is a single male. We can define test statistics on those variables as we did above; a minimal sketch follows.
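A minimal sketch, assuming the intended model regresses log(wage) on the three dummies plus educ, exper and tenure:

% log(wage) on the status dummies and the usual controls; single male is the base case
y=log(wage1(:,1));
X=[ones(n,1),marrmale,marrfem,singfem,educ,exper,tenure];
beta=inv(X'*X)*X'*y;
u=y-X*beta;
K=size(X,2);                     % number of estimated parameters
sig2=u'*u/(n-K);                 % error variance
se=sqrt(diag(sig2*inv(X'*X)));   % standard errors
t=beta./se                       % compare |t| with tinv(0.975, n-K)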

**Conclusion**: This was a quick introduction to statistics with MATLAB. I hope you found it useful. Don’t hesitate to drop a comment if you have a question.