The statistics package provides frameworks and implementations for basic descriptive statistics, frequency distributions, bivariate regression, and t-, chi-square and ANOVA test statistics.
Descriptive statistics
Frequency distributions
Simple Regression
Multiple Regression
Rank transformations
Covariance and correlation
Statistical Tests
The stat package includes a framework and default implementations for the descriptive statistics summarized in the table of aggregates below, including minimum, maximum, mean, geometric mean, sum, sum of squares, variance, standard deviation, percentiles, skewness, kurtosis and median.
With the exception of percentiles and the median, all of these statistics can be computed without maintaining the full list of input data values in memory. The stat package provides interfaces and implementations that do not require value storage as well as implementations that operate on arrays of stored values.
The top level interface is org.apache.commons.math.stat.descriptive.UnivariateStatistic. This interface, implemented by all statistics, consists of evaluate() methods that take double[] arrays as arguments and return the value of the statistic. This interface is extended by StorelessUnivariateStatistic, which adds increment(), getResult() and associated methods to support "storageless" implementations that maintain counters, sums or other state information as values are added using the increment() method.
Abstract implementations of the top level interfaces are provided in AbstractUnivariateStatistic and AbstractStorelessUnivariateStatistic respectively.
Each statistic is implemented as a separate class, in one of the subpackages (moment, rank, summary) and each extends one of the abstract classes above (depending on whether or not value storage is required to compute the statistic). There are several ways to instantiate and use statistics. Statistics can be instantiated and used directly, but it is generally more convenient (and efficient) to access them using the provided aggregates, DescriptiveStatistics and SummaryStatistics.
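For example, an individual statistic such as Mean (in the moment subpackage) or Percentile (in the rank subpackage) can be applied directly to a stored double[] array. This is a minimal sketch; the data values are illustrative only:

// Direct use of UnivariateStatistic implementations on a stored array
double[] values = new double[] {1.0, 2.0, 3.0, 4.0, 5.0};

Mean mean = new Mean();
double average = mean.evaluate(values);       // 3.0

Percentile percentile = new Percentile();
double p90 = percentile.evaluate(values, 90); // 90th percentile of the values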
DescriptiveStatistics maintains the input data in memory and has the capability of producing "rolling" statistics computed from a "window" consisting of the most recently added values. SummaryStatistics does not store the input data values in memory, so the statistics included in this aggregate are limited to those that can be computed in one pass through the data without access to the full array of values.
Aggregate | Statistics Included | Values stored? | "Rolling" capability? |
---|---|---|---|
DescriptiveStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance, percentiles, skewness, kurtosis, median | Yes | Yes |
SummaryStatistics | min, max, mean, geometric mean, n, sum, sum of squares, standard deviation, variance | No | No |
SummaryStatistics can be aggregated using AggregateSummaryStatistics. This class can be used to concurrently gather statistics for multiple datasets as well as for a combined sample including all of the data. MultivariateSummaryStatistics is similar to SummaryStatistics but handles n-tuple values instead of scalar values. It can also compute the full covariance matrix for the input data.
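A rough sketch of MultivariateSummaryStatistics usage follows; the constructor arguments are the dimension of the input vectors and the covariance bias-correction flag, and the data values are illustrative only:

// Summarize 2-dimensional observations without storing them
MultivariateSummaryStatistics mStats = new MultivariateSummaryStatistics(2, true);
mStats.addValue(new double[] {1.0, 2.0});
mStats.addValue(new double[] {3.0, 5.0});
mStats.addValue(new double[] {4.0, 4.0});

double[] componentMeans = mStats.getMean();     // mean of each component
RealMatrix covariance = mStats.getCovariance(); // full covariance matrix of the inputs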
Neither DescriptiveStatistics nor SummaryStatistics is thread-safe. SynchronizedDescriptiveStatistics and SynchronizedSummaryStatistics, respectively, provide thread-safe versions for applications that require concurrent access to statistical aggregates by multiple threads. SynchronizedMultivariateSummaryStatistics provides a thread-safe version of MultivariateSummaryStatistics.
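For instance, a synchronized aggregate can be constructed directly and then used wherever a SummaryStatistics is expected. This is a minimal sketch; the values added are illustrative only:

// SynchronizedSummaryStatistics extends SummaryStatistics, adding synchronization
SummaryStatistics stats = new SynchronizedSummaryStatistics();
stats.addValue(1.0);  // addValue and the getters may now be called from multiple threads
stats.addValue(2.0);
double mean = stats.getMean();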
There is also a utility class, StatUtils, that provides static methods for computing statistics directly from double[] arrays.
Here are some examples showing how to compute descriptive statistics.
DescriptiveStatistics aggregate (values are stored in memory):
// Get a DescriptiveStatistics instance using factory method
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();

// Add the data from the array
for (int i = 0; i < inputArray.length; i++) {
    stats.addValue(inputArray[i]);
}

// Compute some statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
double median = stats.getMedian();
SummaryStatistics aggregate (values are not stored in memory):
// Get a SummaryStatistics instance using factory method
SummaryStatistics stats = SummaryStatistics.newInstance();

// Read data from an input stream,
// adding values and updating sums, counters, etc.
String line = in.readLine();
while (line != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    line = in.readLine();
}
in.close();

// Compute the statistics
double mean = stats.getMean();
double std = stats.getStandardDeviation();
// double median = stats.getMedian(); <-- NOT AVAILABLE
StatUtils utility class:
// Compute statistics directly from the array
// assume values is a double[] array
double mean = StatUtils.mean(values);
double var = StatUtils.variance(values);
double median = StatUtils.percentile(values, 50);

// Compute the mean of the first three values in the array
mean = StatUtils.mean(values, 0, 3);
DescriptiveStatistics instance with window size set to 100:
// Create a DescriptiveStatistics instance and set the window size to 100
DescriptiveStatistics stats = DescriptiveStatistics.newInstance();
stats.setWindowSize(100);

// Read data from an input stream,
// displaying the mean of the most recent 100 observations
// after every 100 observations
long nLines = 0;
String line = in.readLine();
while (line != null) {
    stats.addValue(Double.parseDouble(line.trim()));
    nLines++;
    if (nLines == 100) {
        nLines = 0;
        System.out.println(stats.getMean());
    }
    line = in.readLine();
}
in.close();
SynchronizedDescriptiveStatistics instance:
// Create a SynchronizedDescriptiveStatistics instance and
// use it as any other DescriptiveStatistics instance
DescriptiveStatistics stats = DescriptiveStatistics.newInstance(SynchronizedDescriptiveStatistics.class);
There are two ways to aggregate SummaryStatistics results using AggregateSummaryStatistics. The first is to use an AggregateSummaryStatistics instance to accumulate overall statistics contributed by SummaryStatistics instances created using AggregateSummaryStatistics.createContributingStatistics():
// Create an AggregateSummaryStatistics instance to accumulate the overall statistics
// and contributing SummaryStatistics instances for the subsamples
AggregateSummaryStatistics aggregate = new AggregateSummaryStatistics();
SummaryStatistics setOneStats = aggregate.createContributingStatistics();
SummaryStatistics setTwoStats = aggregate.createContributingStatistics();

// Add values to the subsample aggregates
setOneStats.addValue(2);
setOneStats.addValue(3);
setTwoStats.addValue(2);
setTwoStats.addValue(4);
...

// Full sample data is reported by the aggregate
double totalSampleSum = aggregate.getSum();
With this approach, addValue calls must be synchronized on the SummaryStatistics instance maintained by the aggregate, and each value addition updates the aggregate as well as the subsample. For applications that can wait to do the aggregation until all values have been added, a static aggregate method is available, as shown in the following example. This method should be used when aggregation needs to be done across threads.
// Create SummaryStatistics instances for the subsample data
SummaryStatistics setOneStats = new SummaryStatistics();
SummaryStatistics setTwoStats = new SummaryStatistics();

// Add values to the subsample SummaryStatistics instances
setOneStats.addValue(2);
setOneStats.addValue(3);
setTwoStats.addValue(2);
setTwoStats.addValue(4);
...

// Aggregate the subsample statistics into a StatisticalSummary
// describing the full set of data
Collection<SummaryStatistics> aggregate = new ArrayList<SummaryStatistics>();
aggregate.add(setOneStats);
aggregate.add(setTwoStats);
StatisticalSummary aggregatedStats = AggregateSummaryStatistics.aggregate(aggregate);
org.apache.commons.math.stat.Frequency provides a simple interface for maintaining counts and percentages of discrete values.
Strings, integers, longs and chars are all supported as value types, as well as instances of any class that implements Comparable. The ordering of values used in computing cumulative frequencies is by default the natural ordering, but this can be overridden by supplying a Comparator to the constructor. Adding values that are not comparable to those that have already been added results in an IllegalArgumentException.
Here are some examples.
Frequency f = new Frequency();
f.addValue(1);
f.addValue(new Integer(1));
f.addValue(new Long(1));
f.addValue(2);
f.addValue(new Integer(-1));
System.out.println(f.getCount(1));            // displays 3
System.out.println(f.getCumPct(0));           // displays 0.2
System.out.println(f.getPct(new Integer(1))); // displays 0.6
System.out.println(f.getCumPct(-2));          // displays 0
System.out.println(f.getCumPct(10));          // displays 1
Frequency f = new Frequency();
f.addValue("one");
f.addValue("One");
f.addValue("oNe");
f.addValue("Z");
System.out.println(f.getCount("one"));  // displays 1
System.out.println(f.getCumPct("Z"));   // displays 0.5
System.out.println(f.getCumPct("Ot"));  // displays 0.25
Frequency f = new Frequency(String.CASE_INSENSITIVE_ORDER);
f.addValue("one");
f.addValue("One");
f.addValue("oNe");
f.addValue("Z");
System.out.println(f.getCount("one")); // displays 3
System.out.println(f.getCumPct("z"));  // displays 1
org.apache.commons.math.stat.regression.SimpleRegression provides ordinary least squares regression with one independent variable, estimating the linear model:
y = intercept + slope * x
Standard errors for intercept and slope are available, as well as ANOVA, r-square and Pearson's r statistics. Observations (x,y pairs) can be added to the model one at a time or they can be provided in a 2-dimensional array. The observations are not stored in memory, so there is no limit to the number of observations that can be added to the model.
Usage Notes: When there are fewer than two observations in the model, or when there is no variation in the x values, all regression statistics return NaN. At least two observations with different x coordinates are required to estimate a bivariate regression model.
Here are some examples.
// Create a SimpleRegression instance and add data points one at a time
SimpleRegression regression = new SimpleRegression();

regression.addData(1d, 2d);
// At this point, with only one observation,
// all regression statistics will return NaN

regression.addData(3d, 3d);
// With only two observations,
// slope and intercept can be computed,
// but inference statistics will return NaN

regression.addData(3d, 3d);
// Now all statistics are defined
System.out.println(regression.getIntercept());   // displays intercept of regression line
System.out.println(regression.getSlope());       // displays slope of regression line
System.out.println(regression.getSlopeStdErr()); // displays slope standard error
System.out.println(regression.predict(1.5d)); // displays predicted y value for x = 1.5
// Estimate the model from a 2-dimensional array of data points
double[][] data = { { 1, 3 }, { 2, 5 }, { 3, 7 }, { 4, 14 }, { 5, 11 } };
SimpleRegression regression = new SimpleRegression();
regression.addData(data);
System.out.println(regression.getIntercept());   // displays intercept of regression line
System.out.println(regression.getSlope());       // displays slope of regression line
System.out.println(regression.getSlopeStdErr()); // displays slope standard error
org.apache.commons.math.stat.regression.MultipleLinearRegression provides ordinary least squares regression with a generic multiple variable linear model, which in matrix notation can be expressed as:
y=X*b+u
where y is an n-vector regressand, X is an [n, k] matrix whose k columns are called regressors, b is a k-vector of regression parameters and u is an n-vector of error terms or residuals. The notation is quite standard in the literature, cf., e.g., Davidson and MacKinnon, Econometric Theory and Methods, 2004.
Two implementations are provided: org.apache.commons.math.stat.regression.OLSMultipleLinearRegression and org.apache.commons.math.stat.regression.GLSMultipleLinearRegression. Observations (x, y and covariance data matrices) can be added to the model via the addData(double[] y, double[][] x, double[][] covariance) method. The observations are stored in memory until the next time the addData method is invoked.
Usage Notes: Data are validated when invoking the addData(double[] y, double[][] x, double[][] covariance) method, and an IllegalArgumentException is thrown when the data are inappropriate. For the OLS regression the covariance matrix is not used and may be null.

Here are some examples.
MultipleLinearRegression regression = new OLSMultipleLinearRegression();
double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
double[] x = new double[6][];
x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
regression.addData(y, x, null); // we don't need covariance
Estimate the regression parameters and diagnostics using the methods of the MultipleLinearRegression interface:
double[] beta = regression.estimateRegressionParameters();
double[] residuals = regression.estimateResiduals();
double[][] parametersVariance = regression.estimateRegressionParametersVariance();
double regressandVariance = regression.estimateRegressandVariance();
// GLS regression, supplying the error covariance (omega) matrix
MultipleLinearRegression regression = new GLSMultipleLinearRegression();
double[] y = new double[]{11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
double[] x = new double[6][];
x[0] = new double[]{1.0, 0, 0, 0, 0, 0};
x[1] = new double[]{1.0, 2.0, 0, 0, 0, 0};
x[2] = new double[]{1.0, 0, 3.0, 0, 0, 0};
x[3] = new double[]{1.0, 0, 0, 4.0, 0, 0};
x[4] = new double[]{1.0, 0, 0, 0, 5.0, 0};
x[5] = new double[]{1.0, 0, 0, 0, 0, 6.0};
double[][] omega = new double[6][];
omega[0] = new double[]{1.1, 0, 0, 0, 0, 0};
omega[1] = new double[]{0, 2.2, 0, 0, 0, 0};
omega[2] = new double[]{0, 0, 3.3, 0, 0, 0};
omega[3] = new double[]{0, 0, 0, 4.4, 0, 0};
omega[4] = new double[]{0, 0, 0, 0, 5.5, 0};
omega[5] = new double[]{0, 0, 0, 0, 0, 6.6};
regression.addData(y, x, omega); // we do need covariance
Estimation is then accomplished using the same methods of the MultipleLinearRegression interface as in the OLS example above.
Some statistical algorithms require that input data be replaced by ranks. The org.apache.commons.math.stat.ranking package provides rank transformation. RankingAlgorithm defines the interface for ranking. NaturalRanking provides an implementation with two configuration options: a NaN strategy, determining how NaN values in the input are handled, and a ties strategy, determining how ties are resolved.
Examples:
NaturalRanking ranking = new NaturalRanking(NaNStrategy.MINIMAL, TiesStrategy.MAXIMUM);
double[] exampleData = { 20, 17, 30, 42.3, 17, 50, Double.NaN, Double.NEGATIVE_INFINITY, 17 };
double[] ranks = ranking.rank(exampleData);
results in a ranks array containing {6, 5, 7, 8, 5, 9, 2, 2, 5}.
new NaturalRanking(NaNStrategy.REMOVED, TiesStrategy.SEQUENTIAL).rank(exampleData);

returns {5, 2, 6, 7, 3, 8, 1, 4}.
The default NaNStrategy is NaNStrategy.MAXIMAL. This makes NaN values larger than any other value (including Double.POSITIVE_INFINITY). The default TiesStrategy is TiesStrategy.AVERAGE, which assigns tied values the average of the ranks applicable to the sequence of ties. See the NaturalRanking javadoc for more examples, and TiesStrategy and NaNStrategy for details on these configuration options.
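For illustration, here is a sketch of the default configuration applied to the example data above; the expected ranks in the comments follow from the MAXIMAL NaN strategy and AVERAGE ties strategy:

// Default configuration: NaNStrategy.MAXIMAL, TiesStrategy.AVERAGE
NaturalRanking defaultRanking = new NaturalRanking();
double[] exampleData = { 20, 17, 30, 42.3, 17, 50, Double.NaN, Double.NEGATIVE_INFINITY, 17 };
double[] defaultRanks = defaultRanking.rank(exampleData);
// NaN ranks highest (9); the three 17s share the average rank (2 + 3 + 4) / 3 = 3
// result: {5, 3, 6, 7, 3, 8, 9, 1, 3}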
The org.apache.commons.math.stat.correlation package computes covariances and correlations for pairs of arrays or columns of a matrix. Covariance computes covariances, PearsonsCorrelation provides Pearson's Product-Moment correlation coefficients and SpearmansCorrelation computes Spearman's rank correlation.
Implementation Notes

Unbiased covariance estimates are given by the formula

cov(X, Y) = sum [(xi - E(X))(yi - E(Y))] / (n - 1)

where E(X) is the mean of X and E(Y) is the mean of the Y values. Non-bias-corrected estimates use n in place of n - 1. Whether or not covariances are bias-corrected is determined by the optional parameter "biasCorrected", which defaults to true.
PearsonsCorrelation computes correlation estimates using the formula

cor(X, Y) = sum[(xi - E(X))(yi - E(Y))] / [(n - 1)s(X)s(Y)]

where E(X) and E(Y) are the means of X and Y and s(X), s(Y) are their standard deviations.
Examples:

To compute the unbiased covariance between two double arrays x and y, use:

new Covariance().covariance(x, y)

For non-bias-corrected covariances, use covariance(x, y, false).
A covariance matrix over the columns of a data matrix data can be computed using

new Covariance().computeCovarianceMatrix(data)

The i-jth entry of the returned matrix is the covariance of the ith and jth columns of data. As above, to get non-bias-corrected covariances, use computeCovarianceMatrix(data, false).
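As a small sketch (the arrays and data matrix below are hypothetical; rows are observations and columns are variables):

// Covariance of two arrays and the covariance matrix of a data matrix
double[] x = {1.0, 2.0, 3.0, 4.0};
double[] y = {2.0, 2.5, 4.1, 5.0};
double[][] data = { {1.0, 2.0}, {2.0, 2.5}, {3.0, 4.1}, {4.0, 5.0} };

Covariance covariance = new Covariance();
double covXY = covariance.covariance(x, y);                      // unbiased covariance of x and y
RealMatrix covMatrix = covariance.computeCovarianceMatrix(data); // covariance of the columns of data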
To compute Pearson's product-moment correlation between two double arrays x and y, use:

new PearsonsCorrelation().correlation(x, y)

A correlation matrix over the columns of a data matrix data can be computed using

new PearsonsCorrelation().computeCorrelationMatrix(data)

The i-jth entry of the returned matrix is the Pearson's product-moment correlation between the ith and jth columns of data.
To compute standard errors and/or significances of correlation coefficients associated with Pearson's correlations, start by creating a PearsonsCorrelation instance:

PearsonsCorrelation correlation = new PearsonsCorrelation(data);

where data is either a rectangular array or a RealMatrix. Then the matrix of standard errors is

correlation.getCorrelationStandardErrors();

The formula used to compute the standard error is

SE_r = sqrt((1 - r^2) / (n - 2))

where r is the estimated correlation coefficient and n is the number of observations in the source dataset.

correlation.getCorrelationPValues() returns a matrix of significances: getCorrelationPValues().getEntry(i,j) is the probability that a random variable distributed as t with n - 2 degrees of freedom takes a value with absolute value greater than or equal to |r_ij| * sqrt((n - 2) / (1 - r_ij^2)), where r_ij is the estimated correlation between the ith and jth columns of the source array or RealMatrix. This is sometimes referred to as the significance of the coefficient.

For example, if data is a RealMatrix with 2 columns and 10 rows, then

new PearsonsCorrelation(data).getCorrelationPValues().getEntry(0,1)

is the significance of the Pearson's correlation coefficient between the two columns of data. If this value is less than .01, we can say that the correlation between the two columns of data is significant at the 99% level.
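The following sketch pulls these pieces together; the 10 x 2 data matrix is hypothetical and exception handling is omitted:

// Correlation matrix, standard errors and p-values from a data matrix
double[][] data = {
    {1.0, 2.1}, {2.0, 3.9}, {3.0, 6.2}, {4.0, 7.8}, {5.0, 10.1},
    {6.0, 12.2}, {7.0, 13.8}, {8.0, 16.1}, {9.0, 18.0}, {10.0, 19.9}
};
PearsonsCorrelation correlation = new PearsonsCorrelation(data);
RealMatrix corr = correlation.getCorrelationMatrix();
RealMatrix stdErrors = correlation.getCorrelationStandardErrors();
RealMatrix pValues = correlation.getCorrelationPValues();
System.out.println(pValues.getEntry(0, 1)); // significance of the correlation between the two columns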
To compute Spearman's rank correlation between two double arrays x and y, use:

new SpearmansCorrelation().correlation(x, y)

This is equivalent to

RankingAlgorithm ranking = new NaturalRanking();
new PearsonsCorrelation().correlation(ranking.rank(x), ranking.rank(y))
The interfaces and implementations in the org.apache.commons.math.stat.inference package provide Student's t, Chi-Square and One-Way ANOVA test statistics as well as p-values associated with t-, Chi-Square and One-Way ANOVA tests. The interfaces are TTest, ChiSquareTest, and OneWayAnova, with provided implementations TTestImpl, ChiSquareTestImpl and OneWayAnovaImpl, respectively. The TestUtils class provides static methods to get test instances or to compute test statistics directly. The examples below all use the static methods in TestUtils to execute tests. To get test object instances, either use, e.g., TestUtils.getTTest() or use the implementation constructors directly, e.g., new TTestImpl().
Implementation Notes

p-values for the t-, chi-square and one-way ANOVA tests are computed using the distribution implementations in the distributions package.

Examples:
One-sample t tests

To compute the t-statistic for a one-sample t-test, use:

double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
System.out.println(TestUtils.t(mu, observed));
This code displays the t-statistic comparing the mean of the observed values against mu. The same test can be performed using a SummaryStatistics instance in place of the raw data array:
double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
SummaryStatistics sampleStats = SummaryStatistics.newInstance();
for (int i = 0; i < observed.length; i++) {
    sampleStats.addValue(observed[i]);
}
System.out.println(TestUtils.t(mu, sampleStats));
To compute the p-value for the test, use:

double[] observed = {1d, 2d, 3d};
double mu = 2.5d;
System.out.println(TestUtils.tTest(mu, observed));
The value displayed is the p-value associated with the null hypothesis that the mean of the population from which the observed values are drawn equals mu.
To perform the test using a fixed significance level, use:

TestUtils.tTest(mu, observed, alpha);

where 0 < alpha < 0.5 is the significance level of the test. The boolean value returned will be true iff the null hypothesis can be rejected with confidence 1 - alpha. To test, for example, at the 95% level of confidence, use alpha = 0.05.
Two-sample t tests

Example 1: Paired test evaluating the null hypothesis that the mean difference between corresponding (paired) elements of the double[] arrays sample1 and sample2 is zero.
To compute the t-statistic:
TestUtils.pairedT(sample1, sample2);
To compute the p-value:
TestUtils.pairedTTest(sample1, sample2);
To perform a fixed significance level test with alpha = .05:
TestUtils.pairedTTest(sample1, sample2, .05);
The last example will return true iff the p-value returned by TestUtils.pairedTTest(sample1, sample2) is less than .05.
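A minimal sketch using hypothetical paired samples:

// Paired observations on the same nine subjects (illustrative values)
double[] sample1 = { 105.0, 119.0, 100.0, 97.0, 96.0, 101.0, 94.0, 95.0, 98.0 };
double[] sample2 = { 110.0, 115.0, 104.0, 101.0, 99.0, 103.0, 93.0, 99.0, 101.0 };

double t = TestUtils.pairedT(sample1, sample2);                 // paired t-statistic
double p = TestUtils.pairedTTest(sample1, sample2);             // 2-sided p-value
boolean reject = TestUtils.pairedTTest(sample1, sample2, 0.05); // true iff rejected at the 95% level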
Example 2: Unpaired, two-sided, two-sample t-test using StatisticalSummary instances, without assuming that subpopulation variances are equal.
First create the StatisticalSummary instances. Both DescriptiveStatistics and SummaryStatistics implement this interface. Assume that summary1 and summary2 are SummaryStatistics instances, each of which has had at least 2 values added to the (virtual) dataset that it describes. The sample sizes do not have to be the same -- all that is required is that both samples have at least 2 elements.
Note: The SummaryStatistics class does not store the dataset that it describes in memory, but it does compute all statistics necessary to perform t-tests, so this method can be used to conduct t-tests with very large samples. One-sample tests can also be performed this way. (See Descriptive statistics for details on the SummaryStatistics class.)
To compute the t-statistic:
TestUtils.t(summary1, summary2);
To compute the p-value:
TestUtils.tTest(summary1, summary2);
To perform a fixed significance level test with alpha = .05:
TestUtils.tTest(summary1, summary2, .05);
In each case above, the test does not assume that the subpopulation variances are equal. To perform the tests under this assumption, replace "t" at the beginning of the method name with "homoscedasticT".
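For example, here is a sketch of the summary-based (unequal variances) test above; the sample values are hypothetical and exception handling is omitted:

// Build the StatisticalSummary instances from two (possibly very large) samples
double[] sampleOne = { 7.2, 8.1, 6.9, 7.7, 8.4, 7.0 };
double[] sampleTwo = { 6.1, 6.8, 7.2, 5.9, 6.6, 6.3 };

SummaryStatistics summary1 = SummaryStatistics.newInstance();
for (int i = 0; i < sampleOne.length; i++) {
    summary1.addValue(sampleOne[i]);
}
SummaryStatistics summary2 = SummaryStatistics.newInstance();
for (int i = 0; i < sampleTwo.length; i++) {
    summary2.addValue(sampleTwo[i]);
}

double t = TestUtils.t(summary1, summary2);                 // t-statistic, unequal variances assumed
double p = TestUtils.tTest(summary1, summary2);             // 2-sided p-value
boolean reject = TestUtils.tTest(summary1, summary2, 0.05); // true iff rejected at the 95% level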
Chi-square tests

To compute a chi-square statistic measuring the agreement between a long[] array of observed counts and a double[] array of expected counts, use:

long[] observed = {10, 9, 11};
double[] expected = {10.1, 9.8, 10.3};
System.out.println(TestUtils.chiSquare(expected, observed));
The value displayed will be sum((expected[i] - observed[i])^2 / expected[i]).
To get the p-value associated with the null hypothesis that observed conforms to expected, use:

TestUtils.chiSquareTest(expected, observed);
To test the null hypothesis that observed conforms to expected with alpha significance level (equivalently, 100 * (1 - alpha)% confidence) where 0 < alpha < 1, use:

TestUtils.chiSquareTest(expected, observed, alpha);
The boolean value returned will be true iff the null hypothesis can be rejected with confidence 1 - alpha.
To compute a chi-square statistic associated with a chi-square test of independence based on a two-dimensional (long[][]) counts array viewed as a two-way table, use:

TestUtils.chiSquare(counts);

The rows of the two-way table are count[0], ... , count[count.length - 1].
The chi-square statistic returned is sum((counts[i][j] - expected[i][j])^2 / expected[i][j]) where the sum is taken over all table entries and expected[i][j] is the product of the row and column sums at row i, column j divided by the total count.
To compute the p-value associated with the null hypothesis that the classifications represented by the counts in the columns of the two-way table are independent of the rows, use:

TestUtils.chiSquareTest(counts);
To perform a chi-square test of independence with alpha significance level (equivalently, 100 * (1 - alpha)% confidence) where 0 < alpha < 1, use:

TestUtils.chiSquareTest(counts, alpha);
The boolean value returned will be true iff the null hypothesis can be rejected with confidence 1 - alpha.
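For example, with a hypothetical 2 x 3 counts table:

// Test independence of the row and column classifications (illustrative counts)
long[][] counts = { { 40, 22, 43 }, { 91, 21, 28 } };
System.out.println(TestUtils.chiSquare(counts));           // chi-square statistic for the table
System.out.println(TestUtils.chiSquareTest(counts));       // p-value of the independence test
System.out.println(TestUtils.chiSquareTest(counts, 0.05)); // true iff independence is rejected at the 95% level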
One-way ANOVA tests

To conduct a one-way analysis of variance evaluating, for example, the hypothesis that the means of several classes are equal, start by loading the datasets into a collection:

double[] classA = {93.0, 103.0, 95.0, 101.0, 91.0, 105.0, 96.0, 94.0, 101.0 };
double[] classB = {99.0, 92.0, 102.0, 100.0, 102.0, 89.0 };
double[] classC = {110.0, 115.0, 111.0, 117.0, 128.0, 117.0 };
List<double[]> classes = new ArrayList<double[]>();
classes.add(classA);
classes.add(classB);
classes.add(classC);
Then you can compute ANOVA statistics using a OneWayAnova instance or the TestUtils methods:
double fStatistic = TestUtils.oneWayAnovaFValue(classes); // F-value
double pValue = TestUtils.oneWayAnovaPValue(classes);     // P-value
To perform the test at a fixed significance level (here 99%):

TestUtils.oneWayAnovaTest(classes, 0.01); // returns a boolean;
                                          // true means reject the null hypothesis