Title: | Tukey's Vacuum Cleaner |
---|---|
Description: | An implementation of three procedures developed by John Tukey: FUNOP (FUll NOrmal Plot), FUNOR-FUNOM (FUll NOrmal Rejection-FUll NOrmal Modification), and vacuum cleaner. Combined, they provide a way to identify, treat, and analyze outliers in two-way (i.e., contingency) tables, as described in his landmark paper "The Future of Data Analysis", Tukey, John W. (1962) <doi:10.1214/aoms/1177704711>. |
Authors: | Ron Sielinski |
Maintainer: | Ron Sielinski |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-05 03:16:50 UTC |
Source: | https://github.com/sielinski/vacuum |
Returns the typical value from a unit-normal distribution for the ith ordered observation in an n-sized sample.
This is a helper function for FUNOP, which uses the output of this function as the denominator for its slope calculation.
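As a minimal sketch of the idea, not necessarily the package's own code, the typical value can be approximated by passing the plotting position (3i - 1)/(3n + 1) from Tukey's paper through the normal quantile function. The name a_qnorm_sketch is hypothetical, and the formula is an assumption about what a_qnorm computes internally.

```r
# Hypothetical stand-in for a_qnorm(); assumes Tukey's approximation
# a(i|n) = normal quantile of (3i - 1) / (3n + 1).
a_qnorm_sketch <- function(i, n) {
  stopifnot(all(n > 0), all(i > 0), all(i <= n))
  qnorm((3 * i - 1) / (3 * n + 1))
}

a_qnorm_sketch(25, 42)    # position above the middle of the sample: positive
a_qnorm_sketch(21.5, 42)  # halfway position of 42 values: probability 0.5
```

Under this assumption, the halfway position 21.5 in a sample of 42 maps to a probability of exactly 0.5, whose normal quantile is 0.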
a_qnorm(i, n)
i |
Non-zero index of an array |
n |
Non-zero length of the array |
Quantile of i from a unit-normal distribution
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
a_qnorm(i = 25, n = 42)
a_qnorm(21.5, 42)
FUNOP stands for FUll NOrmal Plot.
The procedure identifies outliers by calculating their slope (z), relative to the vector's median.
The procedure ignores values in the middle third of the ordered vector. The remaining values are all candidates for consideration. The slopes of candidate values are calculated, and the median of their slopes is used as the primary basis for identifying outliers.
Any value whose slope is B times larger than the median slope is identified as an outlier. Additionally, any value whose magnitude is larger than that of the slope-based outliers is also identified as an outlier.
However, the procedure will not identify as outliers any values within A standard deviations of the vector's median (i.e., not the median of candidate slopes).
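The steps above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: funop_sketch is a hypothetical name, the typical normal values use the (3i - 1)/(3n + 1) approximation, the middle third is taken to be positions i with n/3 < i <= 2n/3, the A threshold is measured in multiples of the median slope (used here as a stand-in for the standard deviation), and "larger magnitude" is read as "at least as extreme on the same side of the median".

```r
# Hypothetical sketch of the FUNOP steps; see the assumptions listed above.
funop_sketch <- function(x, A = 0, B = 1.5) {
  stopifnot(is.numeric(x), length(x) >= 3)
  n <- length(x)
  i <- rank(x, ties.method = "first")        # ordinal position in sorted x
  a <- qnorm((3 * i - 1) / (3 * n + 1))      # assumed typical normal value
  y_split <- median(x)
  dev <- x - y_split                         # deviation from the median
  z <- ifelse(a == 0, NA_real_, dev / a)     # slope; undefined at the exact middle
  middle <- i > n / 3 & i <= 2 * n / 3       # assumed middle third (ignored)
  z_split <- median(z[!middle])              # median slope of the candidates

  # Candidates with a large slope that also lie far enough from the median
  flagged <- !middle & z >= B * z_split & abs(dev) >= A * z_split

  # Values at least as extreme as a flagged value on the same side are special too
  special <- rep(FALSE, n)
  hi <- flagged & dev > 0
  lo <- flagged & dev < 0
  if (any(hi)) special <- special | dev >= min(dev[hi])
  if (any(lo)) special <- special | dev <= max(dev[lo])

  data.frame(y = x, i = i, middle = middle, a = a, z = z, special = special)
}

funop_sketch(c(1, 2, 3, 11))
```

With the small vector above, only the value 11 has a slope well beyond 1.5 times the median candidate slope, so it is the only row marked special in this sketch.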
funop(x, A = 0, B = 1.5)
x |
Numeric vector to inspect for outliers (does not need to be ordered) |
A |
Number of standard deviations beyond the median of x |
B |
Multiples beyond the median slope of candidate values |
A data frame containing one row for every member of x (in the same order as x) and the following columns:
y: Original values of vector x
i: Ordinal position of value y in the sorted vector x
middle: Boolean indicating whether ordinal position i is in the middle third of the vector
a: Result of a_qnorm(i, length(x))
z: Slope of y relative to median(y)
special: Boolean indicating whether y is an outlier
Additionally, the data frame will have the following attributes, which FUNOP calculates as part of its procedure:
y_split: Median of the vector
y_trimmed: Mean of the top and bottom thirds of the ordered vector
z_split: Median slope of candidate values
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
funop(c(1, 2, 3, 11))
funop(table_1)
attr(funop(table_1), 'z_split')
FUNOR-FUNOM stands for FUll NOrmal Rejection-FUll NOrmal Modification.
The procedure treats a two-way (contingency) table for outliers by isolating residuals from the table's likely systemic effects, which are calculated from the table's grand, row, and column means.
The residuals are passed to separate rejection (FUNOR) and modification (FUNOM) procedures, which both depend upon FUNOP to identify outliers. As such, this procedure requires two sets of A and B parameters.
The procedure treats outliers by reducing their residuals, resulting in values that are much closer to their expected values (i.e., combined grand, row, and column effects).
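To make the first step concrete, the following sketch separates a small table into expected values (combined grand, row, and column effects) and residuals. It covers only the isolation of residuals, not the FUNOR or FUNOM treatment itself, and the function name two_way_residuals and the example matrix are hypothetical.

```r
# Hypothetical helper: additive fit of a two-way table from its means.
two_way_residuals <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x))
  grand   <- mean(x)                 # grand effect
  row_eff <- rowMeans(x) - grand     # row effects
  col_eff <- colMeans(x) - grand     # column effects
  expected <- grand + outer(row_eff, col_eff, "+")
  list(expected = expected, residuals = x - expected)
}

# Illustrative 3x3 table with one suspicious cell (the 30)
x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 30), nrow = 3)
fit <- two_way_residuals(x)
fit$residuals
which(abs(fit$residuals) == max(abs(fit$residuals)), arr.ind = TRUE)
```

In this example the largest residual sits in the cell holding 30. The residuals in every row and column sum to zero, so a single extreme cell also leaves its mark on the rest of its row and column.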
funor_funom(x, A_r = 10, B_r = 1.5, A_m = 0, B_m = 1.5)
x |
Two-way table to treat for outliers |
A_r |
A for the FUNOR phase (see FUNOP for details) |
B_r |
B for the FUNOR phase slope |
A_m |
A for the FUNOM phase (defaults to 0) |
B_m |
B for the FUNOM phase |
A two-way table of the same size as x, treated for outliers.
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
funor_funom(table_2)
which(funor_funom(table_2) != table_2)
Example data taken from Table 1 of John Tukey's "Future of Data Analysis."
table_1
A numeric vector containing 14 elements.
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
table_1
funop(table_1)
Example data taken from Table 2 of John Tukey's "Future of Data Analysis."
table_2
A 36x15 numeric matrix.
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
table_2
funor_funom(table_2)
Example data taken from Table 8 of John Tukey's "Future of Data Analysis." Note that this dataset fixes a typo in the original document: Column 23, row 10 contains 0.100, corrected from the original -0.100.
table_8
A 36x15 numeric matrix.
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
table_8
vacuum_cleaner(table_8)
To remove systemic effects from values in a contingency table, vacuum cleaner uses regression to identify the table's main effect (dual regression), row effect (deviations of row regression from dual regression), and column effect (deviations of column regression from dual regression).
Regression is performed twice: First on the table's original values, then on the resulting residuals. The output is a table of residuals "vacuum cleaned" of likely systemic effects.
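A single regression pass of this kind can be sketched as follows. The function regress_out and the carrier vectors a (one value per row) and b (one value per column) are hypothetical names, and the sketch makes no claim about how the package chooses its carriers for either pass. It only illustrates how the main, row, and column effects combine, using constant carriers as an assumed starting point.

```r
# Hypothetical one-pass sketch: remove the dual, row, and column
# regressions of table y on row carrier a and column carrier b.
regress_out <- function(y, a, b) {
  stopifnot(is.matrix(y), length(a) == nrow(y), length(b) == ncol(y))
  dual   <- sum(outer(a, b) * y) / (sum(a^2) * sum(b^2))  # main effect
  by_col <- as.vector(crossprod(a, y)) / sum(a^2)  # each column regressed on a
  by_row <- as.vector(y %*% b) / sum(b^2)          # each row regressed on b
  row_effect <- by_row - a * dual   # deviation of row regression from dual
  col_effect <- by_col - b * dual   # deviation of column regression from dual
  y - dual * outer(a, b) - outer(row_effect, b) - outer(a, col_effect)
}

# Illustrative 3x4 table with assumed constant carriers
y <- matrix(c(2, 4, 9, 1, 5, 7, 3, 8, 6, 0, 2, 11), nrow = 3)
a <- rep(1 / sqrt(nrow(y)), nrow(y))
b <- rep(1 / sqrt(ncol(y)), ncol(y))
res <- regress_out(y, a, b)

# With constant carriers the pass reduces to removing row and column means
additive <- y - outer(rowMeans(y), rep(1, ncol(y))) -
  outer(rep(1, nrow(y)), colMeans(y)) + mean(y)
all.equal(res, additive)
```

With constant carriers, one pass is equivalent to subtracting row and column means and adding back the grand mean. A second pass with non-constant carriers would be needed to pick up effects that this additive fit leaves behind.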
vacuum_cleaner(x)
x |
Two-way table to analyze (must be 3x3 or greater). |
Residuals of x
Tukey, John W. "The Future of Data Analysis." The Annals of Mathematical Statistics, 33(1), 1962, pp 1-67. JSTOR, https://www.jstor.org/stable/2237638.
vacuum_cleaner(table_8)