public class KolmogorovSmirnovTest extends Object
The KS test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follows a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
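As an illustration of the one-sample statistic defined above, the following is a minimal sketch that evaluates \(D_n\) for a continuous reference CDF. The class name `OneSampleKsSketch` and the use of a `DoubleUnaryOperator` for the CDF are assumptions made for this example; it is not the implementation used by this class.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helper for illustration only; not part of the documented API. */
final class OneSampleKsSketch {

    /** Computes D_n = sup_x |F_n(x) - F(x)| for a continuous reference CDF F. */
    static double statistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // F_n jumps from i/n to (i+1)/n at sorted[i], so the supremum
            // is attained just before or exactly at one of the data points.
            double below = f - (double) i / n;
            double above = (double) (i + 1) / n - f;
            d = Math.max(d, Math.max(below, above));
        }
        return d;
    }

    public static void main(String[] args) {
        // Reference distribution: uniform on [0, 1].
        DoubleUnaryOperator uniformCdf = x -> Math.min(1, Math.max(0, x));
        System.out.println(statistic(uniformCdf, new double[] {0.1, 0.4, 0.7}));
    }
}
```

For the three points in `main`, the largest gap occurs at 0.7, where \(F_n\) reaches 1 while \(F\) is 0.7, so the printed value is approximately 0.3.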
Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t |F_n(t)-F_m(t)|\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x, and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:
For small samples, the exact p-value is computed (see exactP(double, int, int, boolean)).
For large samples, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.
For small samples (former case), if the data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[],double[],int,boolean,UniformRandomProvider) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.
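The two-sample statistic \(D_{n,m}\) can be sketched by merging the two sorted samples and tracking both empirical distribution functions. The class name `TwoSampleKsSketch` and the exception type are assumptions made for this example; this is not the implementation used by this class, and it applies no jitter when ties are present.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the documented API. */
final class TwoSampleKsSketch {

    /** Computes D_{n,m} = sup_t |F_n(t) - F_m(t)| for two samples without NaN values. */
    static double statistic(double[] x, double[] y) {
        if (x.length < 2 || y.length < 2) {
            throw new IllegalArgumentException("each sample needs at least 2 values");
        }
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        // Arrays.sort places NaN last, so checking the final element is enough.
        if (Double.isNaN(sx[n - 1]) || Double.isNaN(sy[m - 1])) {
            throw new IllegalArgumentException("NaN in input");
        }
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            // Advance past every value equal to t in both samples so that
            // tied values are counted together before the gap is measured.
            while (i < n && sx[i] == t) {
                i++;
            }
            while (j < m && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(statistic(new double[] {1, 3, 5}, new double[] {2, 4, 6}));
    }
}
```

Once one sample is exhausted its empirical distribution stays at 1, and the gap can only shrink from that point on, so the loop can stop early.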
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
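To make the effect of the strict and non-strict forms concrete, the sketch below enumerates every partition of a tie-free pooled sample for very small \(n\) and \(m\) and reports both tail probabilities. The class name `StrictVersusNonStrictSketch` and the integer scaling of \(d\) by \(nm\) are assumptions made for this example. It is a brute-force illustration only and is not the algorithm used by exactP.

```java
/** Hypothetical helper for illustration only; not part of the documented API. */
final class StrictVersusNonStrictSketch {

    /**
     * Returns {P(D > d), P(D >= d)} for d = scaledD / (n * m), obtained by
     * enumerating all partitions of a tie-free pooled sample of size n + m.
     */
    static double[] tailProbabilities(int n, int m, long scaledD) {
        long[] counts = new long[3]; // {total, strictly greater, greater or equal}
        walk(0, 0, n, m, 0, scaledD, counts);
        return new double[] {(double) counts[1] / counts[0], (double) counts[2] / counts[0]};
    }

    // A path of n "x" steps and m "y" steps is one assignment of the ranked
    // pooled values to the two samples. After i x-steps and j y-steps,
    // n * m * |F_n - F_m| equals |i * m - j * n|.
    private static void walk(int i, int j, int n, int m, long maxDiff, long scaledD, long[] counts) {
        if (i == n && j == m) {
            counts[0]++;
            if (maxDiff > scaledD) {
                counts[1]++;
            }
            if (maxDiff >= scaledD) {
                counts[2]++;
            }
            return;
        }
        if (i < n) {
            long diff = Math.abs((long) (i + 1) * m - (long) j * n);
            walk(i + 1, j, n, m, Math.max(maxDiff, diff), scaledD, counts);
        }
        if (j < m) {
            long diff = Math.abs((long) i * m - (long) (j + 1) * n);
            walk(i, j + 1, n, m, Math.max(maxDiff, diff), scaledD, counts);
        }
    }

    public static void main(String[] args) {
        // n = m = 2 and d = 0.5, that is scaledD = 0.5 * n * m = 2.
        double[] tails = tailProbabilities(2, 2, 2);
        System.out.println("P(D > d) = " + tails[0] + ", P(D >= d) = " + tails[1]);
    }
}
```

With \(n = m = 2\) there are six partitions: two give \(D = 1\) and four give \(D = 0.5\). For \(d = 0.5\) the strict probability is therefore 1/3 and the non-strict probability is 1, and the two differ by the mass 2/3 at \(d\).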
The methods used by the 2-sample default implementation are also exposed directly:
exactP(double, int, int, boolean) computes exact 2-sample p-values
approximateP(double, int, int) uses the asymptotic distribution
The boolean strict argument, where a method takes one, allows the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See kolmogorovSmirnovTest(double[], double[], boolean).
References:
Constructor and Description 

KolmogorovSmirnovTest() 
Modifier and Type  Method and Description 

double 
approximateP(double d,
int n,
int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

double 
bootstrap(double[] x,
double[] y,
int iterations,
boolean strict,
org.apache.commons.rng.UniformRandomProvider rng)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double 
cdf(double d,
int n)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme
values given in [2] (see above).

double 
cdf(double d,
int n,
boolean exact)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
double 
cdfExact(double d,
int n)
Calculates \(P(D_n < d)\).
double 
exactP(double d,
int n,
int m,
boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
double 
kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution,
double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
double 
kolmogorovSmirnovStatistic(double[] x,
double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
double 
kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution,
double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double 
kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution,
double[] data,
boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
boolean 
kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution,
double[] data,
double alpha)
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
double 
kolmogorovSmirnovTest(double[] x,
double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double 
kolmogorovSmirnovTest(double[] x,
double[] y,
boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
double 
ksSum(double t,
double tolerance,
int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
double 
monteCarloP(double d,
int n,
int m,
boolean strict,
int iterations,
org.apache.commons.rng.UniformRandomProvider rng)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

double 
pelzGood(double d,
int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.

public KolmogorovSmirnovTest()
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
Parameters:
distribution - reference distribution
data - sample being evaluated
exact - whether or not to force exact computation of the p-value
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
public double kolmogorovSmirnovStatistic(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
Parameters:
distribution - reference distribution
data - sample being evaluated
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
Parameters:
x - first sample dataset.
y - second sample dataset.
strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples).
Returns:
p-value associated with the null hypothesis that x and y represent samples from the same distribution.
Throws:
InsufficientDataException - if either x or y does not have length at least 2.
NullArgumentException - if either x or y is null.
NotANumberException - if the input arrays contain NaN values.
See Also:
bootstrap(double[],double[],int,boolean,UniformRandomProvider)
public double kolmogorovSmirnovTest(double[] x, double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
Parameters:
x - first sample dataset
y - second sample dataset
Returns:
p-value associated with the null hypothesis that x and y represent samples from the same distribution
Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
public double kolmogorovSmirnovStatistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
Parameters:
x - first sample
y - second sample
Returns:
test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
Throws:
InsufficientDataException - if either x or y does not have length at least 2
NullArgumentException - if either x or y is null
public double kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data)
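Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being evaluated
Returns:
the p-value associated with the null hypothesis that data is a sample from distribution
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
public boolean kolmogorovSmirnovTest(org.apache.commons.statistics.distribution.ContinuousDistribution distribution, double[] data, double alpha)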
Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
Parameters:
distribution - reference distribution
data - sample being evaluated
alpha - significance level of the test
Returns:
true if the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
Throws:
InsufficientDataException - if data does not have length at least 2
NullArgumentException - if data is null
public double bootstrap(double[] x, double[] y, int iterations, boolean strict, org.apache.commons.rng.UniformRandomProvider rng)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
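The resampling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses `java.util.Random` in place of `UniformRandomProvider`, the class name `BootstrapKsSketch` is hypothetical, and it is not claimed to reproduce the library's or ks.boot's exact procedure.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper for illustration only; not part of the documented API. */
final class BootstrapKsSketch {

    /** Two-sample statistic sup_t |F_n(t) - F_m(t)|; tied values are advanced together. */
    static double statistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        if (n < 2 || m < 2 || Double.isNaN(sx[n - 1]) || Double.isNaN(sy[m - 1])) {
            throw new IllegalArgumentException("need at least 2 non-NaN values per sample");
        }
        int i = 0;
        int j = 0;
        double d = 0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            while (i < n && sx[i] == t) {
                i++;
            }
            while (j < m && sy[j] == t) {
                j++;
            }
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }

    /** Estimates the p-value by resampling, with replacement, from the pooled sample. */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        double observed = statistic(x, y);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int tail = 0;
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < bx.length; i++) {
                bx[i] = pooled[rng.nextInt(pooled.length)];
            }
            for (int j = 0; j < by.length; j++) {
                by[j] = pooled[rng.nextInt(pooled.length)];
            }
            double stat = statistic(bx, by);
            if (strict ? stat > observed : stat >= observed) {
                tail++;
            }
        }
        return (double) tail / iterations;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 2, 3, 5};
        double[] y = {2, 4, 4, 6, 7};
        System.out.println(bootstrap(x, y, 1000, true, new Random(42)));
    }
}
```

Because each resample is drawn from the pooled data, ties in the input are reproduced naturally in the resamples, which is why this approach is suggested when ties are known to be present.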
Parameters:
x - First sample.
y - Second sample.
iterations - Number of bootstrap resampling iterations.
strict - Whether or not the null hypothesis is expressed as a strict inequality.
rng - RNG for creating the sampling sets.
public double cdf(double d, int n)
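Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). The result is not exact, unlike that of cdfExact(double, int), because calculations are based on double rather than BigFraction.
Parameters:
d - statistic
n - sample size
Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
public double cdfExact(double d, int n)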
Calculates \(P(D_n < d)\). The result is exact in the sense that BigFraction/BigReal is used everywhere, at the expense of very slow execution time. This method is intended almost solely for verification purposes and should rarely be chosen in real applications; normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
Parameters:
d - statistic
n - sample size
Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
public double cdf(double d, int n, boolean exact)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
Parameters:
d - statistic
n - sample size
exact - whether the probability should be calculated exactly using BigFraction everywhere, at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
Throws:
MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
public double pelzGood(double d, int n)
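Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
Parameters:
d - value of d-statistic (x in [2])
n - sample size
public double ksSum(double t, double tolerance, int maxIterations)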
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations, a TooManyIterationsException is thrown.
Parameters:
t - argument
tolerance - Cauchy criterion for partial sums
maxIterations - maximum number of partial sums to compute
Throws:
TooManyIterationsException - if the series does not converge
public double exactP(double d, int n, int m, boolean strict)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
Parameters:
d - D-statistic value
n - first sample size
m - second sample size
strict - whether or not the probability to compute is expressed as a strict inequality
Returns:
\(P(D_{n,m} > d)\) if strict is true, otherwise \(P(D_{n,m} \ge d)\)
public double approximateP(double d, int n, int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined.
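The formula above can be evaluated with a short series loop. The sketch below is an illustration under stated assumptions: the class name `KsAsymptoticSketch`, the tolerance, the iteration cap, the exception type, and the convention \(k(t) = 0\) for \(t \le 0\) are choices made for this example and are not taken from this class.

```java
/** Hypothetical helper for illustration only; not part of the documented API. */
final class KsAsymptoticSketch {

    /** Evaluates k(t) = 1 + 2 * sum_{i >= 1} (-1)^i * exp(-2 i^2 t^2). */
    static double kolmogorovCdf(double t, double tolerance, int maxIterations) {
        if (t <= 0) {
            // Assumption of this sketch: the limiting distribution has no mass at or below 0.
            return 0;
        }
        double sum = 1;
        for (int i = 1; i <= maxIterations; i++) {
            double term = 2 * Math.exp(-2.0 * i * i * t * t);
            sum += (i % 2 == 0) ? term : -term;
            // Successive partial sums differ by exactly this term (Cauchy criterion).
            if (term <= tolerance) {
                return sum;
            }
        }
        throw new IllegalStateException("series did not converge in " + maxIterations + " terms");
    }

    /** Approximates P(D_{n,m} > d) as 1 - k(d * sqrt(mn / (m + n))). */
    static double approximateP(double d, int n, int m) {
        double scale = Math.sqrt((double) m * n / ((double) m + n));
        return 1 - kolmogorovCdf(d * scale, 1e-20, 100_000);
    }

    public static void main(String[] args) {
        // n = m = 200 and d = 0.1 give t = 0.1 * sqrt(100) = 1.
        System.out.println(approximateP(0.1, 200, 200));
    }
}
```

For \(t = 1\) the series gives \(k(1) \approx 0.7300\), so the example prints a value close to 0.27.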
Parameters:
d - D-statistic value
n - first sample size
m - second sample size
Returns:
approximation of \(P(D_{n,m} > d)\)
public double monteCarloP(double d, int n, int m, boolean strict, int iterations, org.apache.commons.rng.UniformRandomProvider rng)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.
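The simulation described above can be sketched by shuffling sample labels over the pooled ranks. This is a minimal illustration under stated assumptions: it uses `java.util.Random` in place of `UniformRandomProvider`, the class name `MonteCarloKsSketch` is hypothetical, and it assumes a tie-free pooled sample.

```java
import java.util.Random;

/** Hypothetical helper for illustration only; not part of the documented API. */
final class MonteCarloKsSketch {

    /** Estimates P(D_{n,m} > d), or P(D_{n,m} >= d) when strict is false. */
    static double monteCarloP(double d, int n, int m, boolean strict, int iterations, Random rng) {
        // labels[k] is true when the k-th smallest pooled value belongs to the n-set.
        boolean[] labels = new boolean[n + m];
        for (int k = 0; k < n; k++) {
            labels[k] = true;
        }
        int tail = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle: every partition into an n-set and an m-set is equally likely.
            for (int k = labels.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                boolean tmp = labels[k];
                labels[k] = labels[r];
                labels[r] = tmp;
            }
            // Walk the ranks; n * m * |F_n - F_m| equals |i * m - j * n|.
            long maxDiff = 0;
            int i = 0;
            int j = 0;
            for (boolean inFirstSample : labels) {
                if (inFirstSample) {
                    i++;
                } else {
                    j++;
                }
                maxDiff = Math.max(maxDiff, Math.abs((long) i * m - (long) j * n));
            }
            double stat = (double) maxDiff / ((double) n * m);
            if (strict ? stat > d : stat >= d) {
                tail++;
            }
        }
        return (double) tail / iterations;
    }

    public static void main(String[] args) {
        System.out.println(monteCarloP(0.5, 2, 2, true, 20_000, new Random(7)));
    }
}
```

For \(n = m = 2\) and \(d = 0.5\) the exact strict probability is 1/3, so the estimate printed by `main` should be close to that value, with the usual Monte Carlo error.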
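Parameters:
d - D-statistic value.
n - First sample size.
m - Second sample size.
iterations - Number of random partitions to generate.
strict - whether or not the probability to compute is expressed as a strict inequality
rng - RNG used for generating the partitions.
Returns:
proportion of simulated values of \(D_{n,m}\) that are greater than d (or greater than or equal to d if strict is false).
Copyright © 2003–2021 The Apache Software Foundation. All rights reserved.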