The GESD statistical test can be used to answer the following question: How many outliers does the data set contain?
Here, I present an implementation of the GESD test written in R, which can be accessed at the following URL: https://github.com/raunakms/GESD
Generalized Extreme Studentized Deviate (GESD) test
The generalized extreme Studentized deviate (ESD) test (Rosner 1983) is a statistical method used to detect one or more outliers in a univariate data set that follows an approximately normal distribution. The primary limitation of the Grubbs test and the Tietjen-Moore test is that the suspected number of outliers, $k$, must be specified exactly. If $k$ is not specified correctly, this can distort the conclusions of these tests. On the other hand, the generalized ESD test (Rosner 1983) only requires that an upper bound for the suspected number of outliers be specified.
Given the upper bound, $r$, the generalized ESD test essentially performs $r$ separate tests: a test for one outlier, a test for two outliers, and so on up to $r$ outliers.
- The generalized ESD test is defined for the hypotheses:
- $H_{0}$: There are no outliers in the data set
- $H_{a}$: There are up to $r$ outliers in the data set
Test Statistic: Compute $$R_{i} = \frac{\max_{i}|x_{i} - \bar{x}|}{s} $$ with $\bar{x}$ and $s$ denoting the sample mean and sample standard deviation, respectively. Remove the observation that maximizes $|x_{i} - \bar{x}|$ and then recompute the above statistic with $n-1$ observations. Repeat this process until $r$ observations have been removed. This results in the $r$ test statistics $R_{1}, R_{2}, …, R_{r}$.
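This iterative computation can be sketched in a few lines of R. The data and variable names below are purely illustrative (they are not part of the package):

```r
# Sequential computation of the r test statistics R_1, ..., R_r
x <- c(2.1, 2.6, 2.4, 2.5, 2.3, 2.1, 2.3, 2.6, 8.2, 8.3)  # toy data with two extreme values
r <- 3               # upper bound on the number of outliers
R <- numeric(r)
x_work <- x          # working copy of the observations
for (i in 1:r) {
  z <- abs(x_work - mean(x_work)) / sd(x_work)  # studentized deviations
  R[i] <- max(z)                                # R_i = max_i |x_i - xbar| / s
  x_work <- x_work[-which.max(z)]               # drop the most extreme point
}
print(R)
```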
Significance Level: alpha $(\alpha)$
Critical Region: Corresponding to the $r$ test statistics, compute the following $r$ critical values $$ \lambda_{i} = \frac{(n-i) * t_{n-i-1,p}}{ \sqrt{(n-i-1 + t^2_{n-i-1,p})*(n-i+1)} } $$
where $i = 1, 2, …, r$, and $t_{d,p}$ is the $p^{th}$ percentile of a $t$ distribution with $d$ degrees of freedom.
$$ p = 1 - \frac{\alpha}{2*(n-i+1)} $$
The number of outliers is determined by finding the largest $i$ such that $ R_{i} > \lambda_{i} $. Simulation studies by Rosner indicate that this critical value approximation is very accurate for $n \ge 25$ and reasonably accurate for $n \ge 15$.
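The critical values can be computed directly with `qt()`. Below is a short sketch; the values of `n`, `alpha`, and `r` are illustrative only:

```r
# Critical values lambda_i for the generalized ESD test
alpha <- 0.05   # significance level
n <- 25         # sample size (illustrative)
r <- 3          # upper bound on the number of outliers

lambda <- numeric(r)
for (i in 1:r) {
  p <- 1 - alpha / (2 * (n - i + 1))  # percentile from the formula above
  t_p <- qt(p, df = n - i - 1)        # t quantile with n-i-1 degrees of freedom
  lambda[i] <- ((n - i) * t_p) / sqrt((n - i - 1 + t_p^2) * (n - i + 1))
}
print(lambda)  # lambda_1 is about 2.77 for n = 25, alpha = 0.05

# Given the test statistics R_1, ..., R_r, the number of outliers is the
# largest i with R_i > lambda_i, e.g.: m <- max(c(0, which(R > lambda)))
```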
Note that although the generalized ESD is essentially Grubbs test applied sequentially, there are a few important distinctions:
- The generalized ESD test makes appropriate adjustments to the critical values based on the number of outliers being tested for, which the sequential application of Grubbs test does not.
- If there is significant masking, applying Grubbs test sequentially may stop too soon. For example, in a masked data set the generalized ESD test may identify three outliers at the 5% level,
- while Grubbs test applied sequentially would stop at the first iteration and declare no outliers.
Original Paper Citation
- B. Rosner (1983). Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 25(2), pp. 165-172. http://www.jstor.org/stable/1268549?seq=1
Description
Computes outliers for the given data using the GESD test.
Usage
gesd(obs, alpha, value.zscore, r)
Arguments
- $obs$ : a vector of observations
- $alpha$ : significance level
- $value.zscore$ : whether the observation values are already z-normalized. Takes values "YES" or "NO".
- $r$ : upper bound on the number of observations to call outliers. If NA, the GESD test statistic is computed until half of the sample has been removed.
Value
- Total : the first column gives the total number of outliers
- The remaining columns correspond to the individual observations
- The value for each observation gives its outlier rank, where 0 indicates the observation is not an outlier and 1 indicates the highest-ranked outlier (i.e., the most extreme observation)
R-code
Below is the R code of the function that implements the GESD test:
```r
gesd <- function(obs, alpha, value.zscore = "NO", r = NA) {
  #### Define and declare variables ####
  n <- length(obs)
  if (is.na(r)) { r <- floor(n/2) }   # by default, set upper bound 'r' to half the sample size
  R <- numeric(length = r)            # test statistics for 'r' outliers
  lambda <- numeric(length = r)       # critical values for 'r' outliers
  outlier_ind <- numeric(length = r)  # indices of removed outlier observations
  outlier_val <- numeric(length = r)  # values of removed outlier observations
  m <- 0                              # number of outliers
  obs_new <- obs                      # working copy of the observations

  if (!(value.zscore %in% c("YES", "Y", "NO", "N"))) {
    stop("Inappropriate value for value.zscore: use \"YES\" or \"NO\"")
  }

  #### Find outliers ####
  for (i in 1:r) {
    #### Compute test statistic ####
    if (value.zscore %in% c("YES", "Y")) {
      z <- abs(obs_new)  # Z-scores already computed
    } else {
      z <- abs(obs_new - mean(obs_new)) / sd(obs_new)  # Z-scores
    }
    max_ind <- which(z == max(z))[1]  # in case of ties, take the first one
    R[i] <- z[max_ind]                # max Z-score
    outlier_val[i] <- obs_new[max_ind]                   # removed outlier observation value
    outlier_ind[i] <- which(obs == obs_new[max_ind])[1]  # its index in the original vector
    obs_new <- obs_new[-max_ind]  # remove the observation that maximizes |x_i - x_bar|

    #### Compute critical value ####
    p <- 1 - alpha/(2*(n - i + 1))   # percentile
    t_pv <- qt(p, df = (n - i - 1))  # quantile of Student's t distribution
    lambda[i] <- ((n - i) * t_pv) / sqrt((n - i - 1 + t_pv^2) * (n - i + 1))

    #### Number of outliers: largest 'i' such that R_i > lambda_i ####
    if (!is.na(R[i]) & !is.na(lambda[i])) {  # qt() can produce NaNs for very small samples
      if (R[i] > lambda[i]) {
        m <- i
      }
    }
  }
  vals <- data.frame(NumOutliers = 1:r, TestStatistic = R, CriticalValue = lambda)  # for inspection

  #### Rank outlier observations ####
  outlier_rank <- numeric(length = n)
  if (m > 0) {
    for (i in 1:m) {
      outlier_rank[outlier_ind[i]] <- i  # rank 1 = most extreme observation
    }
  }
  num_outliers <- sum(outlier_rank != 0)

  res <- c(num_outliers, outlier_rank)
  obs_names <- if (is.null(names(obs))) paste0("obs", 1:n) else names(obs)
  names(res) <- c("Total", obs_names)
  return(res)
}
```
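As a usage sketch, the snippet below applies the GESD procedure to a toy vector with one injected extreme value. So that it runs standalone, the loop condenses the same steps inline; in practice you would simply call the `gesd()` function above on `obs`. All data and names here are illustrative:

```r
# Toy sample: 19 tight values around 1.0 plus one extreme value at position 20
obs <- c(1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.1, 0.95, 1.05,
         1.15, 0.85, 1.0, 1.2, 0.9, 1.1, 0.8, 1.0, 1.05, 6.0)
alpha <- 0.05
n <- length(obs)
r <- floor(n / 2)  # default upper bound on the number of outliers

R <- lambda <- numeric(r)
idx <- integer(r)
work <- obs
for (i in 1:r) {
  z <- abs(work - mean(work)) / sd(work)  # studentized deviations
  j <- which.max(z)
  R[i] <- z[j]                            # test statistic R_i
  idx[i] <- which(obs == work[j])[1]      # index in the original vector
  work <- work[-j]                        # remove the most extreme point
  p <- 1 - alpha / (2 * (n - i + 1))
  t_p <- qt(p, df = n - i - 1)
  lambda[i] <- ((n - i) * t_p) / sqrt((n - i - 1 + t_p^2) * (n - i + 1))
}
m <- if (any(R > lambda)) max(which(R > lambda)) else 0  # largest i with R_i > lambda_i
cat("number of outliers:", m, "\n")
cat("outlier indices:", if (m > 0) idx[1:m] else "none", "\n")
```

Here only the injected value at position 20 is flagged; the corresponding call to the function above would return a vector whose "Total" entry is 1 and whose 20th observation has outlier rank 1.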