What is normality in regression? Some examples from Stata

What is normality in regression?

Normality means that the residuals from the regression are normally distributed: most of the observations cluster near the center, with thin, roughly symmetric tails to the left and right.

Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is only required for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and F-test will be valid. Normality is not required to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be identically and independently distributed.

Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case, then we would not be able to use dummy-coded variables in our models.
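For instance (a hedged sketch, assuming the elemapi2 data used below include the 0/1 year-round-school indicator yr_rnd), a dummy variable is a perfectly legitimate predictor even though it is clearly not normally distributed:

regress api00 meals yr_rnd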

After we run a regression analysis, we can use the predict command to create residuals and then use commands such as kdensity, qnorm and pnorm to check the normality of the residuals.

Let’s use the elemapi2 data file we saw in Chapter 1 for these analyses.  Let’s predict academic performance (api00) from the percent receiving free meals (meals), percent of English language learners (ell), and percent of teachers with emergency credentials (emer).

use https://stats.idre.ucla.edu/stat/stata/webbooks/reg/elemapi2
regress api00 meals ell emer

  Source |       SS       df       MS                  Number of obs =     400
---------+------------------------------               F(  3,   396) =  673.00
   Model |  6749782.75     3  2249927.58               Prob > F      =  0.0000
Residual |  1323889.25   396  3343.15467               R-squared     =  0.8360
---------+------------------------------               Adj R-squared =  0.8348
   Total |  8073672.00   399  20234.7669               Root MSE      =   57.82

------------------------------------------------------------------------------
   api00 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
   meals |  -3.159189   .1497371    -21.098   0.000      -3.453568   -2.864809
     ell |  -.9098732   .1846442     -4.928   0.000      -1.272878   -.5468678
    emer |  -1.573496    .293112     -5.368   0.000      -2.149746   -.9972456
   _cons |   886.7033    6.25976    141.651   0.000       874.3967    899.0098
------------------------------------------------------------------------------

We then use the predict command to generate residuals.

predict r, resid

Below we use the kdensity command to produce a kernel density plot, with the normal option requesting that a normal density be overlaid on the plot. kdensity stands for kernel density estimate; it can be thought of as a histogram with very narrow bins that has been smoothed with a moving average.

kdensity r, normal
  
  [Kernel density plot of the residuals with a normal density overlaid]
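If you prefer, a histogram of the residuals gives a similar picture; the normal and kdensity options overlay a normal curve and a kernel density estimate (this extra check is not part of the original output above):

histogram r, normal kdensity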

The pnorm command graphs a standardized normal probability (P-P) plot while qnorm plots the quantiles of a variable against the quantiles of a normal distribution. pnorm is sensitive to non-normality in the middle range of data and qnorm is sensitive to non-normality near the tails. As you see below, the results from pnorm show no indications of non-normality, while the qnorm command shows a slight deviation from normal at the upper tail, as can be seen in the kdensity above.  Nevertheless, this seems to be a minor and trivial deviation from normality. We can accept that the residuals are close to a normal distribution.

pnorm r
  
[Standardized normal probability (P-P) plot of the residuals]
  
qnorm r

[Quantile-normal (Q-Q) plot of the residuals]

There are also numerical tests for normality. One of them, iqr, was written by Lawrence C. Hamilton (Dept. of Sociology, Univ. of New Hampshire). You can get this program by typing search iqr in Stata (see How can I use the search command to search for programs and get additional help? for more information about using search).

iqr stands for inter-quartile range and assumes symmetry of the distribution. Severe outliers are points more than 3 inter-quartile ranges below the first quartile or above the third quartile (beyond the outer fences); mild outliers are points more than 1.5 inter-quartile ranges beyond the quartiles (beyond the inner fences). The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level, whereas mild outliers are common in samples of any size. In our case, we don't have any severe outliers and the distribution seems pretty symmetric, so the residuals have an approximately normal distribution.

iqr r

   mean=  7.4e-08         std.dev.=   57.6          (n= 400)
 median= -3.657    pseudo std.dev.=  56.69        (IQR=  76.47)
10 trim= -1.083
                                               low         high
                                               -------------------
                                inner fences   -154.7       151.2
                           # mild outliers     1           5
                           % mild outliers     0.25%       1.25%

                                outer fences   -269.4       265.9
                           # severe outliers   0           0
                           % severe outliers   0.00%       0.00%
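The fences can also be checked by hand: they follow Tukey's rule, with the inner fences 1.5 inter-quartile ranges beyond the quartiles and the outer fences 3 inter-quartile ranges beyond them. A minimal sketch using summarize, whose stored p25 and p75 results should reproduce the fences reported above:

summarize r, detail
display "inner fences: " r(p25) - 1.5*(r(p75)-r(p25)) "   " r(p75) + 1.5*(r(p75)-r(p25))
display "outer fences: " r(p25) - 3*(r(p75)-r(p25)) "   " r(p75) + 3*(r(p75)-r(p25))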

Another test available is swilk, which performs the Shapiro-Wilk W test for normality. The p-value is computed under the null hypothesis that the distribution is normal. In our example it is large (.51), indicating that we cannot reject the hypothesis that r is normally distributed.

swilk r

                   Shapiro-Wilk W test for normal data
 Variable |    Obs           W         V          z   Pr > z
 ---------+-------------------------------------------------
        r |    400     0.99641     0.989     -0.025  0.51006
