mean — Estimate means


Description

mean produces estimates of means, along with standard errors.

Quick start

Mean, standard error, and 95% confidence interval for v1

mean v1

Also compute statistics for v2

mean v1 v2

Same as above, but for each level of categorical variable catvar1

mean v1 v2, over(catvar1)

Weighting by probability weight wvar

mean v1 v2 [pweight=wvar]

Population mean using svyset data

svy: mean v3

Subpopulation means for each level of categorical variable catvar2 using svyset data

svy: mean v3, over(catvar2)

Test equality of two subpopulation means

svy: mean v3, over(catvar2)

test c.v3@1.catvar2 = c.v3@2.catvar2

Menu

Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Means


Syntax

mean varlist [if] [in] [weight] [, options]



options               Description

Model
  stdize(varname)     variable identifying strata for standardization
  stdweight(varname)  weight variable for standardization
  nostdrescale        do not rescale the standard weight variable

if/in/over
  over(varlist_o)     group over subpopulations defined by varlist_o

SE/Cluster
  vce(vcetype)        vcetype may be analytic, cluster clustvar, bootstrap, or jackknife

Reporting
  level(#)            set confidence level; default is level(95)
  noheader            suppress table header
  display_options     control column formats, line width, display of omitted variables
                      and base and empty cells, and factor-variable labeling

  coeflegend          display legend instead of statistics

varlist may contain factor variables; see [U] 11.4.3 Factor variables.

bootstrap, collect, jackknife, mi estimate, rolling, statsby, and svy are allowed; see [U] 11.1.10 Prefix commands.

vce(bootstrap) and vce(jackknife) are not allowed with the mi estimate prefix; see [MI] mi estimate.

Weights are not allowed with the bootstrap prefix; see [R] bootstrap.

aweights are not allowed with the jackknife prefix; see [R] jackknife.

vce() and weights are not allowed with the svy prefix; see [SVY] svy.

fweights, aweights, iweights, and pweights are allowed; see [U] 11.1.6 weight.

coeflegend does not appear in the dialog box.

See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands.

Options

Model

stdize(varname) specifies that the point estimates be adjusted by direct standardization across the strata identified by varname. This option requires the stdweight() option.

stdweight(varname) specifies the weight variable associated with the standard strata identified in the stdize() option. The standardization weights must be constant within the standard strata.

nostdrescale prevents the standardization weights from being rescaled within the over() groups. This option requires stdize() but is ignored if the over() option is not specified.



 

if/in/over

over(varlist_o) specifies that estimates be computed for multiple subpopulations, which are identified by the different values of the variables in varlist_o. Only numeric, nonnegative, integer-valued variables are allowed in over(varlist_o).



 

SE/Cluster

vce(vcetype) specifies the type of standard error reported, which includes types that are derived from asymptotic theory (analytic), that allow for intragroup correlation (cluster clustvar), and that use bootstrap or jackknife methods (bootstrap, jackknife); see [R] vce_option.

vce(analytic), the default, uses the analytically derived variance estimator associated with the sample mean.



 

Reporting

level(#); see [R] Estimation options.

noheader prevents the table header from being displayed.

display_options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), fvwrapon(style), cformat(%fmt), and nolstretch; see [R] Estimation options.

The following option is available with mean but is not shown in the dialog box:

coeflegend; see [R] Estimation options.

Remarks and examples

Example 1

Using the fuel data from example 3 of [R] ttest, we estimate the average mileage of the cars

without the fuel treatment (mpg1) and those with the fuel treatment (mpg2).

. use https://www.stata-press.com/data/r18/fuel

. mean mpg1 mpg2

Mean estimation                          Number of obs = 12

               Mean   Std. err.     [95% conf. interval]
  mpg1           21    .7881701      19.26525    22.73475
  mpg2        22.75    .9384465      20.68449    24.81551

Using these results, we can test the equality of the mileage between the two groups of cars.

. test mpg1 = mpg2

( 1) mpg1 - mpg2 = 0

F( 1, 11) = 5.04

Prob > F = 0.0463
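The confidence limits in the table of means can be checked by hand: each interval is the mean plus or minus a Student's t critical value times the standard error, using the 11 degrees of freedom that appear in the F(1, 11) test above. The Python sketch below is one way to run that check outside Stata. The helper names (t_pdf, t_cdf, t_quantile, conf_interval) are illustrative, and the t quantile is obtained by plain numerical integration and bisection, not by any Stata routine.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF at x >= 0 via composite Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for 0.5 < p < 1, found by bisection on t_cdf."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conf_interval(mean, se, df, level=95):
    """Two-sided interval: mean +/- t critical value * standard error."""
    crit = t_quantile(0.5 + level / 200, df)
    return mean - crit * se, mean + crit * se

# Mean and standard error reported for mpg1; df = 12 - 1 = 11.
lower, upper = conf_interval(21, 0.7881701, df=11)
print(round(lower, 5), round(upper, 5))
```

With 23 degrees of freedom, as in the F(1, 23) test of example 2, the same standard error gives a slightly narrower interval, which is why the limits differ between the two examples even though the means and standard errors agree.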


Example 2

In example 1, the joint observations of mpg1 and mpg2 were used to estimate a covariance between

their means.

. matrix list e(V)

symmetric e(V)[2,2]
            mpg1       mpg2
mpg1   .62121212
mpg2    .4469697  .88068182

If the data were organized this way out of convenience but the two variables represent independent

samples of cars (coincidentally of the same sample size), we should reshape the data and use the

over() option to ensure that the covariance between the means is zero.

. use https://www.stata-press.com/data/r18/fuel

. stack mpg1 mpg2, into(mpg) clear

. rename _stack trt

. label define trt_lab 1 "without" 2 "with"

. label values trt trt_lab

. label var trt "Fuel treatment"

. mean mpg, over(trt)

Mean estimation                          Number of obs = 24

               Mean   Std. err.     [95% conf. interval]
c.mpg@trt
  without        21    .7881701      19.36955    22.63045
  with        22.75    .9384465      20.80868    24.69132

. matrix list e(V)

symmetric e(V)[2,2]
                  c.mpg@     c.mpg@
                   1.trt      2.trt
c.mpg@1.trt    .62121212
c.mpg@2.trt            0  .88068182

Now, we can test the equality of the mileage between the two independent groups of cars.

. test c.mpg@1.trt = c.mpg@2.trt

 ( 1)  c.mpg@1.trt - c.mpg@2.trt = 0

       F(  1,    23) =    2.04
            Prob > F =    0.1667
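Both test statistics can be reproduced from the reported means and the entries of e(V). For a single linear restriction, the Wald F statistic is the squared difference of the two means divided by the estimated variance of that difference. A minimal Python sketch, using only numbers printed in examples 1 and 2 (the function name wald_f is illustrative):

```python
def wald_f(b1, b2, v11, v22, v12):
    """Wald F statistic for H0: b1 = b2, given variances v11, v22 and covariance v12."""
    diff = b1 - b2
    var_diff = v11 + v22 - 2 * v12  # variance of the difference of the two means
    return diff * diff / var_diff

# Example 1: means estimated jointly, so e(V) carries a nonzero covariance.
f_joint = wald_f(21, 22.75, 0.62121212, 0.88068182, 0.4469697)

# Example 2: independent groups via over(), so the covariance is zero.
f_indep = wald_f(21, 22.75, 0.62121212, 0.88068182, 0.0)

print(round(f_joint, 2), round(f_indep, 2))  # prints 5.04 2.04
```

The positive covariance in example 1 shrinks the variance of the difference, which is why the same pair of means yields a larger F statistic there than under independence.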


Example 3: standardized means

Suppose that we collected the blood pressure data from example 2 of [R] dstdize, and we wish to

obtain standardized high blood pressure rates for each city in 1990 and 1992, using, as the standard,

the age, sex, and race distribution of the four cities and two years combined. Our rate is really the

mean of a variable that indicates whether a sampled individual has high blood pressure. First, we

generate the strata and weight variables from our standard distribution, and then use mean to compute

the rates.

. use https://www.stata-press.com/data/r18/hbp, clear

. egen strata = group(age race sex) if inlist(year, 1990, 1992)

(675 missing values generated)

. by strata, sort: gen stdw = _N

. mean hbp, over(city year) stdize(strata) stdweight(stdw)

Mean estimation

N. of std strata = 24                    Number of obs = 455

                      Mean   Std. err.     [95% conf. interval]
c.hbp@city#year
  1 1990           .058642    .0296273      .0004182    .1168657
  1 1992          .0117647    .0113187     -.0104789    .0340083
  2 1990          .0488722    .0238958      .0019121    .0958322
  2 1992           .014574     .007342      .0001455    .0290025
  3 1990          .1011211    .0268566      .0483425    .1538998
  3 1992          .0810577    .0227021      .0364435    .1256719
  5 1990          .0277778    .0155121     -.0027066    .0582622
  5 1992          .0548926           0             .           .

The standard error of the high blood pressure rate estimate is missing for city 5 in 1992 because

there was only one individual with high blood pressure; that individual was the only person observed

in the stratum of white males 30–35 years old.

By default, mean rescales the standard weights within the over() groups. In the following, we

use the nostdrescale option to prevent this, thus reproducing the results in [R] dstdize.

. mean hbp, over(city year) stdize(strata) stdweight(stdw) nostdrescale

Mean estimation

N. of std strata = 24                    Number of obs = 455

                      Mean   Std. err.     [95% conf. interval]
c.hbp@city#year
  1 1990          .0073302    .0037034      .0000523    .0146082
  1 1992          .0015432    .0014847     -.0013745     .004461
  2 1990          .0078814    .0038536      .0003084    .0154544
  2 1992          .0025077    .0012633       .000025    .0049904
  3 1990          .0155271    .0041238       .007423    .0236312
  3 1992          .0081308    .0022772      .0036556     .012606
  5 1990          .0039223    .0021904     -.0003822    .0082268
  5 1992          .0088735           0             .           .


Example 4: profile plots and contrasts

The first example in [R] marginsplot shows how to use margins and marginsplot to get profile plots from a linear regression. We can similarly explore the data using marginsplot after mean with the over() option. Here we use marginsplot to plot the means of systolic blood pressure for each age group.

. use https://www.stata-press.com/data/r18/nhanes2, clear

. mean bpsystol, over(agegrp)

Mean estimation                          Number of obs = 10,351

                        Mean   Std. err.     [95% conf. interval]
c.bpsystol@agegrp
  20--29            117.3466    .3247329        116.71    117.9831
  30--39            120.2374    .4095845      119.4345    121.0402
  40--49            126.9442     .532033      125.9013    127.9871
  50--59            135.6754    .6061842      134.4872    136.8637
  60--69            141.5227    .4433527      140.6537    142.3918
  70+               148.1765    .8321116      146.5454    149.8076

. marginsplot

Variables that uniquely identify means: agegrp

[Figure: Estimated means of bpsystol with 95% CIs. Horizontal axis: Age group (20–29, 30–39, 40–49, 50–59, 60–69, 70+); vertical axis from 110 to 150.]

We see that the mean systolic blood pressure increases with age. We can use contrast to formally

test whether each mean is different from the mean in the previous age group using the ar. contrast

operator; see [R] contrast for more information on this command.


. contrast ar.agegrp#c.bpsystol, effects nowald

Contrasts of means

                          Contrast   Std. err.       t    P>|t|     [95% conf. interval]
agegrp#c.bpsystol
  (30--39 vs 20--29)       2.89081    .5226958     5.53   0.000      1.866225    3.915394
  (40--49 vs 30--39)      6.706821    .6714302     9.99   0.000      5.390688    8.022954
  (50--59 vs 40--49)      8.731263    .8065472    10.83   0.000      7.150275    10.31225
  (60--69 vs 50--59)      5.847282    .7510133     7.79   0.000      4.375151    7.319413
  (70+ vs 60--69)         6.653743    .9428528     7.06   0.000       4.80557    8.501917

The first row of the output reports that the mean systolic blood pressure for the 30–39 age group is 2.89 higher than the mean for the 20–29 age group. The mean for the 40–49 age group is 6.71 higher than the mean for the 30–39 age group, and so on. Each of these differences is significantly different from zero.
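Because means for different over() groups are estimated from disjoint sets of observations, their estimated covariance is zero (as the e(V) matrix in example 2 shows), so each adjacent contrast can be rebuilt from the table of means alone. The Python sketch below does this for the first row; the function name contrast is illustrative, and the inputs are the rounded values printed by mean, so agreement is only up to rounding.

```python
import math

def contrast(mean_a, se_a, mean_b, se_b):
    """Difference mean_a - mean_b, its standard error, and t, assuming zero covariance."""
    diff = mean_a - mean_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se, diff / se

# 30--39 versus 20--29, using the means and standard errors reported by mean.
diff, se, t = contrast(120.2374, 0.4095845, 117.3466, 0.3247329)
print(round(diff, 4), round(se, 6), round(t, 2))
```

The result matches the contrast, standard error, and t statistic in the first row of the contrast output to the precision of the inputs.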

We can include both agegrp and sex in the over() option to estimate means separately for men

and women in each age group.

. mean bpsystol, over(agegrp sex)

Mean estimation                          Number of obs = 10,351

                            Mean   Std. err.     [95% conf. interval]
c.bpsystol@agegrp#sex
  20--29#Male           123.8862    .4528516      122.9985    124.7739
  20--29#Female         111.2849    .3898972      110.5206    112.0492
  30--39#Male           124.6818    .5619855      123.5802    125.7834
  30--39#Female         116.2207    .5572103      115.1284    117.3129
  40--49#Male           129.0033    .7080788      127.6153    130.3912
  40--49#Female         125.0468    .7802558      123.5174    126.5763
  50--59#Male           136.0864     .855435      134.4096    137.7632
  50--59#Female         135.3164    .8556015      133.6393    136.9935
  60--69#Male           140.7451    .6059786      139.5572    141.9329
  60--69#Female         142.2368    .6427981      140.9767    143.4968
  70+#Male              146.3951    1.141126      144.1583    148.6319
  70+#Female            149.6599    1.189975      147.3273    151.9924


. marginsplot

Variables that uniquely identify means: agegrp sex

[Figure: Estimated means of bpsystol with 95% CIs. Horizontal axis: Age group (20–29, 30–39, 40–49, 50–59, 60–69, 70+); vertical axis from 110 to 150; separate lines for Male and Female.]

Are the means different for men and women within each age group? We can again perform the

tests using contrast. This time, we will use r.sex to obtain contrasts comparing men and women

and use @agegrp to request that the tests are performed for each age group.

. contrast r.sex#c.bpsystol@agegrp, effects nowald

Contrasts of means

                              Contrast   Std. err.       t    P>|t|     [95% conf. interval]
sex@agegrp#c.bpsystol
  (Female vs Male) 20--29    -12.60132    .5975738   -21.09   0.000     -13.77268   -11.42996
  (Female vs Male) 30--39    -8.461161    .7913981   -10.69   0.000     -10.01245   -6.909868
  (Female vs Male) 40--49    -3.956451    1.053648    -3.76   0.000     -6.021805   -1.891097
  (Female vs Male) 50--59    -.7699782    1.209886    -0.64   0.525     -3.141588    1.601631
  (Female vs Male) 60--69     1.491684    .8834022     1.69   0.091     -.2399545    3.223323
  (Female vs Male) 70+        3.264762    1.648699     1.98   0.048      .0329927    6.496531


Using a 0.05 significance level, we find that the mean systolic blood pressure is different for men and women in all age groups except the fifties and sixties.

Video example

Descriptive statistics in Stata

Stored results

mean stores the following in e():

Scalars
  e(N)             number of observations
  e(N_over)        number of subpopulations
  e(N_stdize)      number of standard strata
  e(N_clust)       number of clusters
  e(k_eq)          number of equations in e(b)
  e(df_r)          sample degrees of freedom
  e(rank)          rank of e(V)

Macros
  e(cmd)           mean
  e(cmdline)       command as typed
  e(varlist)       varlist
  e(stdize)        varname from stdize()
  e(stdweight)     varname from stdweight()
  e(wtype)         weight type
  e(wexp)          weight expression
  e(title)         title in estimation output
  e(clustvar)      name of cluster variable
  e(over)          varlist from over()
  e(vce)           vcetype specified in vce()
  e(vcetype)       title used to label Std. err.
  e(properties)    b V
  e(estat_cmd)     program used to implement estat
  e(marginsnotok)  predictions disallowed by margins

Matrices
  e(b)             vector of mean estimates
  e(V)             (co)variance estimates
  e(sd)            vector of standard deviation estimates
  e(_N)            vector of numbers of nonmissing observations
  e(_N_stdsum)     number of nonmissing observations within the standard strata
  e(_p_stdize)     standardizing proportions
  e(error)         error code corresponding to e(b)

Functions
  e(sample)        marks estimation sample

In addition to the above, the following is stored in r():

Matrices
  r(table)         matrix containing the coefficients with their standard errors, test statistics, p-values, and confidence intervals

Note that results stored in r() are updated when the command is replayed and will be replaced when

any r-class command is run after the estimation command.


Methods and formulas

Methods and formulas are presented under the following headings:

The mean estimator

Survey data

The survey mean estimator

The standardized mean estimator

The poststratified mean estimator

The standardized poststratified mean estimator

Subpopulation estimation

The mean estimator

Let $y$ be the variable on which we want to calculate the mean and $y_j$ an individual observation on $y$, where $j = 1, \dots, n$ and $n$ is the sample size. Let $w_j$ be the weight, and if no weight is specified, define $w_j = 1$ for all $j$. For aweights, the $w_j$ are normalized to sum to $n$. See The survey mean estimator for pweighted data.

Let $W$ be the sum of the weights

$$W = \sum_{j=1}^{n} w_j$$

The mean is defined as

$$\bar{y} = \frac{1}{W} \sum_{j=1}^{n} w_j y_j$$

The default variance estimator for the mean is

$$\widehat{V}(\bar{y}) = \frac{1}{W(W-1)} \sum_{j=1}^{n} w_j (y_j - \bar{y})^2$$

The standard error of the mean is the square root of the variance.

If $x$, $x_j$, and $\bar{x}$ are similarly defined for another variable (observed jointly with $y$), the covariance estimator between $\bar{x}$ and $\bar{y}$ is

$$\widehat{\mathrm{Cov}}(\bar{x}, \bar{y}) = \frac{1}{W(W-1)} \sum_{j=1}^{n} w_j (x_j - \bar{x})(y_j - \bar{y})$$
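The mean, variance, and covariance formulas of this subsection translate directly into code. The Python sketch below is an illustration only: the function name mean_estimates and the toy data are hypothetical, and it covers just the default (analytic) estimator described here, not the survey estimators that follow.

```python
def mean_estimates(x, y, w=None, aweights=False):
    """Means of x and y with variance and covariance estimates of those means.

    w defaults to 1 for every observation; with aweights=True the weights
    are first normalized to sum to n.
    """
    n = len(y)
    if w is None:
        w = [1.0] * n
    if aweights:
        total = sum(w)
        w = [wj * n / total for wj in w]
    W = sum(w)  # sum of the weights
    xbar = sum(wj * xj for wj, xj in zip(w, x)) / W
    ybar = sum(wj * yj for wj, yj in zip(w, y)) / W
    scale = 1.0 / (W * (W - 1))
    var_x = scale * sum(wj * (xj - xbar) ** 2 for wj, xj in zip(w, x))
    var_y = scale * sum(wj * (yj - ybar) ** 2 for wj, yj in zip(w, y))
    cov_xy = scale * sum(
        wj * (xj - xbar) * (yj - ybar) for wj, xj, yj in zip(w, x, y)
    )
    return {"xbar": xbar, "ybar": ybar, "var_x": var_x, "var_y": var_y, "cov_xy": cov_xy}

# Hypothetical data, observed jointly on four units.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]
est = mean_estimates(x, y)
se_x = est["var_x"] ** 0.5  # standard error is the square root of the variance
print(est["xbar"], est["ybar"], se_x)
```

With unit weights, W equals n, so the variance reduces to the familiar sample variance divided by n.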

Survey data

See [SVY] Variance estimation, [SVY] Direct standardization, and [SVY] Poststratification for discussions that provide background information for the following formulas. The following formulas are derived from the fact that the mean is a special case of the ratio estimator where the denominator variable is one, $x_j = 1$; see [R] ratio.


The survey mean estimator

Let $Y_j$ be a survey item for the $j$th individual in the population, where $j = 1, \dots, M$ and $M$ is the size of the population. The associated population mean for the item of interest is $\bar{Y} = Y/M$ where

$$Y = \sum_{j=1}^{M} Y_j$$

Let $y_j$ be the survey item for the $j$th sampled individual from the population, where $j = 1, \dots, m$ and $m$ is the number of observations in the sample.

The estimator for the mean is $\bar{y} = \hat{Y}/\hat{M}$, where

$$\hat{Y} = \sum_{j=1}^{m} w_j y_j \quad\text{and}\quad \hat{M} = \sum_{j=1}^{m} w_j$$

and $w_j$ is a sampling weight. The score variable for the mean estimator is

$$z_j(\bar{y}) = \frac{y_j - \bar{y}}{\hat{M}} = \frac{\hat{M} y_j - \hat{Y}}{\hat{M}^2}$$
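As a concrete reading of the survey mean and its score variable, the following Python sketch computes both from sampling weights. The function name survey_mean and the data are hypothetical, and the sketch stops at the score values; turning them into a variance estimate requires the design information discussed in [SVY] Variance estimation and is not attempted here.

```python
def survey_mean(y, w):
    """Weighted mean Y_hat / M_hat and the score value for each observation."""
    Y_hat = sum(wj * yj for wj, yj in zip(w, y))  # estimated population total
    M_hat = sum(w)                                # estimated population size
    ybar = Y_hat / M_hat
    scores = [(M_hat * yj - Y_hat) / M_hat ** 2 for yj in y]
    return ybar, scores

# Hypothetical sample of three observations with sampling weights.
y = [10.0, 20.0, 30.0]
w = [1.0, 2.0, 1.0]
ybar, z = survey_mean(y, w)
weighted_score_sum = sum(wj * zj for wj, zj in zip(w, z))
print(ybar, z, weighted_score_sum)
```

A useful sanity check is that the weighted scores sum to zero, which follows from substituting the definitions of the estimated total and the estimated population size into the score formula.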

The standardized mean estimator

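Let $D_g$ denote the set of sampled observations that belong to the $g$th standard stratum and define $I_{D_g}(j)$ to indicate if the $j$th observation is a member of the $g$th standard stratum, where $g = 1, \dots, L_D$ and $L_D$ is the number of standard strata. Also, let $\pi_g$ denote the fraction of the population that belongs to the $g$th standard stratum; thus $\pi_1 + \cdots + \pi_{L_D} = 1$. $\pi_g$ is derived from the stdweight() option.

The estimator for the standardized mean is

$$\bar{y}^D = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g}{\hat{M}_g}$$

where

$$\hat{Y}_g = \sum_{j=1}^{m} I_{D_g}(j)\, w_j y_j \quad\text{and}\quad \hat{M}_g = \sum_{j=1}^{m} I_{D_g}(j)\, w_j$$

The score variable for the standardized mean is

$$z_j(\bar{y}^D) = \sum_{g=1}^{L_D} \pi_g I_{D_g}(j) \frac{\hat{M}_g y_j - \hat{Y}_g}{\hat{M}_g^2}$$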
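Direct standardization replaces each stratum's share of the sample with a fixed standard share. The Python sketch below illustrates the standardized mean with hypothetical data. It assumes that each standard proportion is the stratum's standard weight divided by the sum of the standard weights over strata; the text above says only that the proportions are derived from stdweight(), so treat that normalization as an assumption. The function name standardized_mean is likewise illustrative.

```python
from collections import defaultdict

def standardized_mean(y, w, stratum, std_weight):
    """Sum over strata of pi_g * (Y_hat_g / M_hat_g).

    std_weight maps each standard stratum to its standard weight.
    Assumes every standard stratum has at least one sampled observation.
    """
    total = sum(std_weight.values())
    pi = {g: sw / total for g, sw in std_weight.items()}  # assumed normalization
    Y_hat = defaultdict(float)
    M_hat = defaultdict(float)
    for yj, wj, g in zip(y, w, stratum):
        Y_hat[g] += wj * yj
        M_hat[g] += wj
    return sum(pi[g] * Y_hat[g] / M_hat[g] for g in pi)

# Hypothetical sample: stratum "a" has mean 2, stratum "b" has mean 10.
y = [1.0, 3.0, 10.0]
w = [1.0, 1.0, 2.0]
stratum = ["a", "a", "b"]
std = standardized_mean(y, w, stratum, {"a": 1.0, "b": 3.0})
crude = sum(wj * yj for wj, yj in zip(w, y)) / sum(w)
print(std, crude)
```

In this toy sample the two strata carry equal weight, so the crude mean is 6, whereas the standard distribution puts three quarters of the population in stratum "b" and moves the standardized mean to 8.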


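The poststratified mean estimator

Let $P_k$ denote the set of sampled observations that belong to poststratum $k$ and define $I_{P_k}(j)$ to indicate if the $j$th observation is a member of poststratum $k$, where $k = 1, \dots, L_P$ and $L_P$ is the number of poststrata. Also let $M_k$ denote the population size for poststratum $k$. $P_k$ and $M_k$ are identified by specifying the poststrata() and postweight() options on svyset; see [SVY] svyset.

The estimator for the poststratified mean is

$$\bar{y}^P = \frac{\hat{Y}^P}{\hat{M}^P}$$

where

$$\hat{Y}^P = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{Y}_k = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{P_k}(j)\, w_j y_j$$

and

$$\hat{M}^P = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{M}_k = M$$

Here $\hat{M}_k = \sum_{j=1}^{m} I_{P_k}(j)\, w_j$ is the estimated size of poststratum $k$, and $M = M_1 + \cdots + M_{L_P}$.

The score variable for the poststratified mean is

$$z_j(\bar{y}^P) = \frac{z_j(\hat{Y}^P)}{\hat{M}^P} = \frac{1}{M} \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left( y_j - \frac{\hat{Y}_k}{\hat{M}_k} \right)$$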
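Poststratification rescales the sampling weights within each poststratum so that the estimated poststratum size matches its known population size. The Python sketch below shows the resulting mean on hypothetical data; the function name poststratified_mean and the population sizes are illustrative assumptions.

```python
from collections import defaultdict

def poststratified_mean(y, w, post, pop_size):
    """Y_hat^P / M_hat^P, where pop_size maps poststratum k to its known size M_k."""
    Y_hat = defaultdict(float)
    M_hat = defaultdict(float)
    for yj, wj, k in zip(y, w, post):
        Y_hat[k] += wj * yj
        M_hat[k] += wj
    # Each poststratum total is scaled by M_k / M_hat_k.
    Y_P = sum(pop_size[k] / M_hat[k] * Y_hat[k] for k in pop_size)
    M_P = sum(pop_size[k] / M_hat[k] * M_hat[k] for k in pop_size)  # equals sum of M_k
    return Y_P / M_P, M_P

# Hypothetical sample: the weights estimate 20 units in each poststratum,
# but the known population sizes are 60 and 40.
y = [1.0, 3.0, 10.0]
w = [10.0, 10.0, 20.0]
post = [1, 1, 2]
ybar_p, M_P = poststratified_mean(y, w, post, {1: 60.0, 2: 40.0})
print(ybar_p, M_P)
```

The unadjusted weighted mean of these data is 6; after the adjustment, poststratum 1 counts for 60% of the population and the estimate becomes 5.2.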

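The standardized poststratified mean estimator

The estimator for the standardized poststratified mean is

$$\bar{y}^{DP} = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g^P}{\hat{M}_g^P}$$

where

$$\hat{Y}_g^P = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{Y}_{g,k} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{D_g}(j) I_{P_k}(j)\, w_j y_j$$

and

$$\hat{M}_g^P = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{M}_{g,k} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{D_g}(j) I_{P_k}(j)\, w_j$$

The score variable for the standardized poststratified mean is

$$z_j(\bar{y}^{DP}) = \sum_{g=1}^{L_D} \pi_g \frac{\hat{M}_g^P z_j(\hat{Y}_g^P) - \hat{Y}_g^P z_j(\hat{M}_g^P)}{(\hat{M}_g^P)^2}$$

where

$$z_j(\hat{Y}_g^P) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_{D_g}(j)\, y_j - \frac{\hat{Y}_{g,k}}{\hat{M}_k} \right\}$$

and

$$z_j(\hat{M}_g^P) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_{D_g}(j) - \frac{\hat{M}_{g,k}}{\hat{M}_k} \right\}$$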


Subpopulation estimation

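Let $S$ denote the set of sampled observations that belong to the subpopulation of interest, and define $I_S(j)$ to indicate if the $j$th observation falls within the subpopulation.

The estimator for the subpopulation mean is $\bar{y}^S = \hat{Y}^S/\hat{M}^S$, where

$$\hat{Y}^S = \sum_{j=1}^{m} I_S(j)\, w_j y_j \quad\text{and}\quad \hat{M}^S = \sum_{j=1}^{m} I_S(j)\, w_j$$

Its score variable is

$$z_j(\bar{y}^S) = I_S(j) \frac{y_j - \bar{y}^S}{\hat{M}^S} = I_S(j) \frac{\hat{M}^S y_j - \hat{Y}^S}{(\hat{M}^S)^2}$$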
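The subpopulation formulas keep every sampled observation in the computation and let the indicator switch off those outside the subpopulation. A short Python sketch with hypothetical data makes this visible; the function name subpop_mean is illustrative.

```python
def subpop_mean(y, w, in_sub):
    """Subpopulation mean and score values; in_sub holds 1 for members of S, else 0."""
    Y_S = sum(i * wj * yj for i, wj, yj in zip(in_sub, w, y))
    M_S = sum(i * wj for i, wj in zip(in_sub, w))
    ybar_S = Y_S / M_S
    # Observations outside S get a score of exactly zero but remain in the list.
    scores = [i * (M_S * yj - Y_S) / M_S ** 2 for i, yj in zip(in_sub, y)]
    return ybar_S, scores

# Hypothetical sample of four observations; the third is outside the subpopulation.
y = [10.0, 20.0, 30.0, 40.0]
w = [1.0, 2.0, 1.0, 3.0]
in_sub = [1, 1, 0, 1]
ybar_S, z_S = subpop_mean(y, w, in_sub)
print(ybar_S, z_S)
```

The score vector has one entry per sampled observation, with zeros for nonmembers, so its length is the full sample size rather than the subpopulation size.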

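The estimator for the standardized subpopulation mean is

$$\bar{y}^{DS} = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g^S}{\hat{M}_g^S}$$

where

$$\hat{Y}_g^S = \sum_{j=1}^{m} I_{D_g}(j) I_S(j)\, w_j y_j \quad\text{and}\quad \hat{M}_g^S = \sum_{j=1}^{m} I_{D_g}(j) I_S(j)\, w_j$$

Its score variable is

$$z_j(\bar{y}^{DS}) = \sum_{g=1}^{L_D} \pi_g I_{D_g}(j) I_S(j) \frac{\hat{M}_g^S y_j - \hat{Y}_g^S}{(\hat{M}_g^S)^2}$$

The estimator for the poststratified subpopulation mean is

$$\bar{y}^{PS} = \frac{\hat{Y}^{PS}}{\hat{M}^{PS}}$$

where

$$\hat{Y}^{PS} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{Y}_k^S = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{P_k}(j) I_S(j)\, w_j y_j$$

and

$$\hat{M}^{PS} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{M}_k^S = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{P_k}(j) I_S(j)\, w_j$$

Its score variable is

$$z_j(\bar{y}^{PS}) = \frac{\hat{M}^{PS} z_j(\hat{Y}^{PS}) - \hat{Y}^{PS} z_j(\hat{M}^{PS})}{(\hat{M}^{PS})^2}$$

where

$$z_j(\hat{Y}^{PS}) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_S(j)\, y_j - \frac{\hat{Y}_k^S}{\hat{M}_k} \right\}$$

and

$$z_j(\hat{M}^{PS}) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_S(j) - \frac{\hat{M}_k^S}{\hat{M}_k} \right\}$$

The estimator for the standardized poststratified subpopulation mean is

$$\bar{y}^{DPS} = \sum_{g=1}^{L_D} \pi_g \frac{\hat{Y}_g^{PS}}{\hat{M}_g^{PS}}$$

where

$$\hat{Y}_g^{PS} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{Y}_{g,k}^S = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{D_g}(j) I_{P_k}(j) I_S(j)\, w_j y_j$$

and

$$\hat{M}_g^{PS} = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \hat{M}_{g,k}^S = \sum_{k=1}^{L_P} \frac{M_k}{\hat{M}_k} \sum_{j=1}^{m} I_{D_g}(j) I_{P_k}(j) I_S(j)\, w_j$$

Its score variable is

$$z_j(\bar{y}^{DPS}) = \sum_{g=1}^{L_D} \pi_g \frac{\hat{M}_g^{PS} z_j(\hat{Y}_g^{PS}) - \hat{Y}_g^{PS} z_j(\hat{M}_g^{PS})}{(\hat{M}_g^{PS})^2}$$

where

$$z_j(\hat{Y}_g^{PS}) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_{D_g}(j) I_S(j)\, y_j - \frac{\hat{Y}_{g,k}^S}{\hat{M}_k} \right\}$$

and

$$z_j(\hat{M}_g^{PS}) = \sum_{k=1}^{L_P} I_{P_k}(j) \frac{M_k}{\hat{M}_k} \left\{ I_{D_g}(j) I_S(j) - \frac{\hat{M}_{g,k}^S}{\hat{M}_k} \right\}$$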



Also see

[R] mean postestimation — Postestimation tools for mean

[R] ameans — Arithmetic, geometric, and harmonic means

[R] proportion — Estimate proportions

[R] ratio — Estimate ratios

[R] summarize — Summary statistics

[R] total — Estimate totals

[MI] Estimation — Estimation commands for use with mi estimate

[SVY] Direct standardization — Direct standardization of means, proportions, and ratios

[SVY] Poststratification — Poststratification for survey data

[SVY] Subpopulation estimation — Subpopulation estimation for survey data

[SVY] svy estimation — Estimation commands for survey data

[SVY] Variance estimation — Variance estimation for survey data

[U] 20 Estimation and postestimation commands

Stata, Stata Press, and Mata are registered trademarks of StataCorp LLC. Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations. StataNow and NetCourseNow are trademarks of StataCorp LLC. Other brand and product names are registered trademarks or trademarks of their respective companies. Copyright © 1985–2023 StataCorp LLC, College Station, TX,

For suggested citations, see the FAQ on citing Stata documentation.