What America's Users Spend on Illegal Drugs 1988–1998
December 2000
Spoken Math version of Appendix D
Appendix D
Imputations for Missing Data on Marijuana Use
Calculating the amount of marijuana used by household members was straightforward. We multiplied the number of marijuana users per month by the average number of joints smoked per user and by the average weight of a joint. The result was then multiplied by twelve months to give a yearly estimate. The principal problems in making this calculation are missing data and responses reported as a range, which cannot be used directly in the calculation. Missing data about recent use was not a problem, because the Substance Abuse and Mental Health Services Administration (SAMHSA) had already imputed those responses. This appendix explains how we imputed responses when either the number of joints smoked or the amount of marijuana smoked was missing or was reported as a range.
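The multiplication described above can be sketched in a few lines of Python. The function name and the input figures are hypothetical and serve only as an illustration; they are not estimates from the survey.

```python
def annual_marijuana_ounces(users_per_month, joints_per_user, ounces_per_joint):
    """Yearly amount = monthly users x joints per user per month
    x weight per joint x 12 months."""
    monthly_ounces = users_per_month * joints_per_user * ounces_per_joint
    return monthly_ounces * 12


# Hypothetical inputs, for illustration only (not survey estimates).
example = annual_marijuana_ounces(
    users_per_month=1000, joints_per_user=20, ounces_per_joint=0.0125
)
print(example)
```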
Imputing the Number of Joints Smoked
From the National Household Survey for 1991, analysts selected respondents who said they used marijuana in the past month and who gave valid responses to three related questions. The first question was the number of days they smoked marijuana in the past month (DAYS). Valid responses were 1–30 days. The second question was the number of marijuana cigarettes smoked per day in the past month (JOINTS). From the responses to these two questions, analysts created a variable, TOTAL JOINTS = DAYS × JOINTS, the total number of joints smoked in the past month.
The third question was the amount of marijuana used during the last month (AMOUNT). This is exactly the question that the analysts sought to answer, but the AMOUNT question was not directly useful for this purpose because it was specified as a range. The acceptable answers to AMOUNT were:
- 1–10 joints
- 11–20 joints
- 1 ounce
- 2 ounces
- 3–4 ounces
- 5–6 ounces
The analysts' problem was to infer the amount of marijuana used by people who said they used marijuana in the last month based on the variables TOTAL JOINTS and AMOUNT.
As short-hand, let J represent TOTAL JOINTS, let A represent AMOUNT, and let W equal the weight of marijuana used in ounces. The analysts wanted to estimate W.
Now, W is unknown, but it might be represented as:
$$W = \beta J + \varepsilon \qquad [1]$$
where $\beta$ is the weight per joint and $\varepsilon$ is a random error term, which will be discussed below. Equation [1] says that, on average, a person who smokes J joints will use W ounces of marijuana, because $\beta$ is the average weight of a single joint. Of course, some people who smoke J joints use a little less; some use a little more. This variation about what is typical is reflected in the term $\varepsilon$.
Assume that $\varepsilon$ is distributed normally with a mean of zero and a standard deviation of $\sigma$, and that the error terms are independently and identically distributed. It turns out that these assumptions about the distribution of $\varepsilon$ are hard to justify, and alternative assumptions are adopted later. However, this simple, if somewhat unrealistic, specification is useful for explaining the approach.
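Although W is unknown to the analysts, it is known to the respondent, and by assumption the value of W determines the respondent's answer for AMOUNT. Specifically, the respondent will say that he used:
- 1–10 joints if $W < \tau_1$
- 11–20 joints if $\tau_1 \le W < \tau_2$
- 1 ounce if $\tau_2 \le W < 1.5$
- 2 ounces if $1.5 \le W < 2.5$
- 3–4 ounces if $2.5 \le W < 4.5$
- 5–6 ounces if $W \ge 4.5$
The logic here is that the respondent will select the usage category that most closely describes his use, although it seems reasonable to suppose that he makes errors when making this translation. Two terms are unknown, $\tau_1$ and $\tau_2$. The first, $\tau_1$, is presumably the weight of 10.5 joints. The second is harder to interpret, but $\tau_2$ is some value that distinguishes the response "11–20 joints" from "1 ounce," at least in the eyes of the respondent.
There are four parameters to be estimated here: $\beta$ (the weight per joint), $\sigma$ (the standard deviation of the error term), $\tau_1$, and $\tau_2$. These parameters can be estimated by maximum likelihood once a probability has been assigned to every response:
$$\begin{aligned}
P(\text{1–10 joints}) &= \Phi\!\left(\frac{\tau_1 - \beta J}{\sigma}\right) \\
P(\text{11–20 joints}) &= \Phi\!\left(\frac{\tau_2 - \beta J}{\sigma}\right) - \Phi\!\left(\frac{\tau_1 - \beta J}{\sigma}\right) \\
P(\text{1 ounce}) &= \Phi\!\left(\frac{1.5 - \beta J}{\sigma}\right) - \Phi\!\left(\frac{\tau_2 - \beta J}{\sigma}\right) \\
P(\text{2 ounces}) &= \Phi\!\left(\frac{2.5 - \beta J}{\sigma}\right) - \Phi\!\left(\frac{1.5 - \beta J}{\sigma}\right) \\
P(\text{3–4 ounces}) &= \Phi\!\left(\frac{4.5 - \beta J}{\sigma}\right) - \Phi\!\left(\frac{2.5 - \beta J}{\sigma}\right) \\
P(\text{5–6 ounces}) &= 1 - \Phi\!\left(\frac{4.5 - \beta J}{\sigma}\right)
\end{aligned}$$
where $\Phi$ is the standard normal distribution function.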
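As a minimal sketch of how a probability could be assigned to every response, assume the weight used is normally distributed around (weight per joint × TOTAL JOINTS) and that a response category is chosen when the weight falls between two adjacent thresholds. The names `beta` (weight per joint), `sigma` (standard deviation of the error), `tau1` and `tau2` (the two unknown thresholds), the function names, and all numeric values below are illustrative assumptions, not the report's notation or estimates.

```python
import math

CATEGORIES = ["1-10 joints", "11-20 joints", "1 ounce",
              "2 ounces", "3-4 ounces", "5-6 ounces"]


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def response_probabilities(total_joints, beta, sigma, tau1, tau2):
    """Probability of each AMOUNT answer when W ~ Normal(beta * J, sigma)."""
    cuts = [tau1, tau2, 1.5, 2.5, 4.5]  # two unknown, three known thresholds
    mean_w = beta * total_joints
    cdf = [0.0] + [normal_cdf((c - mean_w) / sigma) for c in cuts] + [1.0]
    return {name: cdf[k + 1] - cdf[k] for k, name in enumerate(CATEGORIES)}


def log_likelihood(data, beta, sigma, tau1, tau2):
    """Sum of log probabilities over (total_joints, answer) pairs."""
    total = 0.0
    for total_joints, answer in data:
        p = response_probabilities(total_joints, beta, sigma, tau1, tau2)[answer]
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


# Hypothetical parameter values, for illustration only.
probs = response_probabilities(30, beta=0.014, sigma=0.3, tau1=0.147, tau2=0.6)
```

A maximum likelihood estimate would be the parameter values that make `log_likelihood` as large as possible over the observed responses; any general-purpose numerical optimizer could be used for that step.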
This approach is similar to an ordered probit model. There is an important difference between this approach and a traditional probit model, however. Specifically, the threshold values of 1.5, 2.5, and 4.5 are known, although the thresholds $\tau_1$ and $\tau_2$ are unknown. This allows the standard deviation $\sigma$ to be identified and estimated. In turn, this allows the coefficient $\beta$ to be identified and interpreted as the weight of a marijuana cigarette.
One further extension is to assume that:
$$\tau_1 = 10.5\,\beta$$
That is, the parameter $\tau_1$ equals the weight of 10.5 joints, because the weight of 10.5 joints is the threshold value between the responses "1–10 joints" and "11–20 joints." There are only three remaining parameters to estimate: $\beta$, $\sigma$, and $\tau_2$.
As stated, this model is an unacceptable representation of the relationship between the number of joints smoked and the amount of marijuana smoked. A more convincing model is:
$$W = \beta J + u$$
This implies that the average joint weighs $\beta$ ounces, but that the weight varies across users. This variation is represented by the distribution of $u$. The model would be complete once the distribution of $u$ is specified.
The distribution of $u$ has to satisfy some a priori constraints. First, W must be positive, so $u$ has a lower limit that depends on $\beta J$. Second, the distribution of $u$ should account for an apparent upward skew: inspection of the data shows that some users seem to use much more than the average amount of marijuana, but nobody can use much less because zero is a lower limit. Third, the error term is heteroscedastic.
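A new specification is more useful, given these a priori constraints:
$$W = \theta J \qquad [3]$$
where $\theta = e^{\mu + \eta}$ with $\eta \sim N(0, \sigma^2)$. Here, $\theta$ has a lognormal distribution, and thus $\theta J$ is always positive and $\theta$ is skewed upward. In this specification:
$$E(\theta) = e^{\mu + \sigma^2/2}$$
Taking logarithms on both sides of [3], we have
$$\ln W = \mu + \ln J + \eta$$
where $\mu$ is the mean of $\ln \theta$ and $\eta$ is the normally distributed error term. As with the earlier, less realistic model, the parameters can be estimated using maximum likelihood. A simple extension is to let $\mu = \gamma_0 + \gamma_1 (J/100)$. The "100" is just a scale factor that has no effect on analysis. This specification allows frequent smokers to smoke larger or smaller joints than average smokers.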
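To illustrate how the log-scale specification could feed the same maximum likelihood machinery, here is a hedged sketch. It assumes that the logarithm of the weight used equals a mean term plus the logarithm of TOTAL JOINTS plus a normal error with standard deviation `sigma`, and that the mean term is `gamma0 + gamma1 * J / 100`. These parameter names, the threshold names `tau1` and `tau2`, and every numeric value are assumptions for illustration, not the report's notation or estimates.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lognormal_response_probabilities(total_joints, gamma0, gamma1,
                                     sigma, tau1, tau2):
    """Probabilities of the six AMOUNT answers when
    ln W = mu + ln J + eta, eta ~ Normal(0, sigma),
    and mu = gamma0 + gamma1 * J / 100 (assumed form of the extension)."""
    mu = gamma0 + gamma1 * total_joints / 100.0
    cuts = [tau1, tau2, 1.5, 2.5, 4.5]  # thresholds in ounces
    z = [(math.log(c) - math.log(total_joints) - mu) / sigma for c in cuts]
    cdf = [0.0] + [normal_cdf(v) for v in z] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]


# Hypothetical parameter values, for illustration only.
probs = lognormal_response_probabilities(
    30, gamma0=math.log(0.013), gamma1=0.1, sigma=0.5, tau1=0.14, tau2=0.6
)
```

Because the thresholds are compared on the log scale, the implied weight is always positive, which is the property the lognormal specification is meant to guarantee.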
The most important estimate is $E(\theta)$, the average weight of a marijuana cigarette. An estimate of W, then, is:
$$W = E(\theta)\,J$$
This tells us that if a respondent says he smoked J joints during the month (TOTAL JOINTS), then $E(\theta)J$ is the best estimate of the quantity (in ounces) of marijuana smoked.
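The estimate above can be sketched under the assumption that the weight per joint is lognormal, so that its logarithm is normal with mean `mu` and standard deviation `sigma`. The mean of such a variable is exp(mu + sigma²/2), a standard property of the lognormal distribution. The function names and parameter values are hypothetical.

```python
import math


def expected_weight_per_joint(mu, sigma):
    """Mean of a lognormal variable whose logarithm is Normal(mu, sigma)."""
    return math.exp(mu + 0.5 * sigma ** 2)


def estimated_ounces(total_joints, mu, sigma):
    """Estimated ounces smoked: average weight per joint times TOTAL JOINTS."""
    return expected_weight_per_joint(mu, sigma) * total_joints


# Hypothetical parameter values, for illustration only.
print(estimated_ounces(30, mu=math.log(0.013), sigma=0.5))
```

Note that the upward skew raises the mean above exp(mu): with `sigma` greater than zero, the average joint is heavier than the median joint.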
Table D1 presents parameter estimates based on an analysis of 1623 smokers who reported DAYS, JOINTS, and AMOUNT.
Table D1
Regression Results: The Total Amount of Marijuana Smoked in the Past Month
Before estimating these parameters, the analysts changed some of the data in two ways:
- Before calculating TOTAL JOINTS, responses of more than 30 for JOINTS (number of marijuana cigarettes smoked per day in the past month) were truncated to 30. These extreme responses represented only about 0.1% of the total number of monthly users.
- After calculating TOTAL JOINTS, analysts compared TOTAL JOINTS with AMOUNT and corrected for extreme inconsistencies between (or highly unlikely combinations of) the two variables. If JOINTS >= 100 and AMOUNT <= 20 joints, or if JOINTS >= 200 and AMOUNT <= 2 ounces, then analysts assumed that the respondents had mistakenly given the total number of joints they had smoked in the past month for the question on JOINTS (number of marijuana cigarettes smoked per day in the past month). For these respondents, analysts treated JOINTS as TOTAL JOINTS in calculating the quantity estimates.
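The two data adjustments described above can be sketched as follows. The integer coding of AMOUNT (1 through 6, in the order of the answer list) and the function name are assumptions made for this illustration. The sketch also assumes that the inconsistency check is applied to the JOINTS response as reported, before truncation; the text does not spell out this ordering.

```python
# Hypothetical coding of AMOUNT, following the order of the answer list:
# 1 = 1-10 joints, 2 = 11-20 joints, 3 = 1 ounce,
# 4 = 2 ounces,    5 = 3-4 ounces,   6 = 5-6 ounces
MAX_JOINTS_PER_DAY = 30


def total_joints(days, joints_per_day, amount_code):
    """Return TOTAL JOINTS for one respondent after both adjustments."""
    # Extreme inconsistency: the respondent probably reported a monthly
    # total where a daily count was asked for.
    misreported = (
        (joints_per_day >= 100 and amount_code <= 2)      # AMOUNT <= 20 joints
        or (joints_per_day >= 200 and amount_code <= 4)   # AMOUNT <= 2 ounces
    )
    if misreported:
        return joints_per_day  # treat JOINTS as TOTAL JOINTS
    # Otherwise truncate the daily count and multiply by days of use.
    return days * min(joints_per_day, MAX_JOINTS_PER_DAY)
```

For example, a respondent reporting 150 joints per day on 5 days but an AMOUNT of "11–20 joints" would be assigned 150 total joints rather than 750.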
Results from the analysis imply that a person who smokes 1 joint per month uses 0.013 ounces (0.37 grams per joint) of marijuana. A person who smokes thirty joints per month uses 0.4 ounces (0.38 grams per joint) of marijuana. A person who smokes 120 joints per month uses 1.79 ounces (0.43 grams per joint) of marijuana. Applying the parameter estimates from Table D1, Equation [7] was then used to compute the average weight per joint (W/J) for every respondent in each year of the NHSDA. Results, which appear in Table 6 of the main report, are used in the calculations reported in the body of this report.
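The grams-per-joint figures quoted above follow from converting ounces to grams and dividing by the number of joints. The short check below uses the standard conversion of about 28.35 grams per ounce; the function name is hypothetical. Because the ounce totals in the text are rounded, a recomputed value can differ from the published one in the second decimal place.

```python
GRAMS_PER_OUNCE = 28.3495  # avoirdupois ounce


def grams_per_joint(ounces_per_month, joints_per_month):
    """Average joint weight in grams implied by a monthly total in ounces."""
    return ounces_per_month * GRAMS_PER_OUNCE / joints_per_month


# (joints per month, ounces per month) pairs quoted in the text
for joints, ounces in [(1, 0.013), (30, 0.4), (120, 1.79)]:
    print(joints, round(grams_per_joint(ounces, joints), 3))
```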
Imputing Joints
A related problem is that the variable JOINTS was sometimes missing. We could not just substitute the average response from cases where JOINTS was known, because those with missing data seemed to have different usage patterns from those who did not have missing data. Instead, we estimated regressions where JOINTS was the dependent variable and MJFREQ was the independent variable. MJFREQ is "frequency used marijuana in the past 12 months." We used results from these regressions to impute responses when JOINTS was missing.
MJFREQ is coded:
- 1 several times a day;
- 2 daily;
- 3 almost daily (3 to 6 days a week);
- 4 1 or 2 times a week;
- 5 several times a month (about 25 to 51 days a year);
- 6 1 or 2 times a month (12 to 24 days a year);
- 7 every other month or so (6 to 11 days a year);
- 8 3 to 5 days in the past 12 months;
- 9 1 or 2 days in the past 12 months.
We treated this variable as a continuous measure. To capture nonlinearities, we added an additional independent variable, MJFREQ2 = MJFREQ × MJFREQ.
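As a small illustration of how the regressors could be built, the sketch below treats the MJFREQ code as a number and adds its square. The function name and the inclusion of a constant term are assumptions; the report does not list the full set of regressors here.

```python
MJFREQ_LABELS = {
    1: "several times a day",
    2: "daily",
    3: "almost daily (3 to 6 days a week)",
    4: "1 or 2 times a week",
    5: "several times a month (about 25 to 51 days a year)",
    6: "1 or 2 times a month (12 to 24 days a year)",
    7: "every other month or so (6 to 11 days a year)",
    8: "3 to 5 days in the past 12 months",
    9: "1 or 2 days in the past 12 months",
}


def regressors(mjfreq):
    """Return (constant, MJFREQ, MJFREQ2) for one respondent."""
    if mjfreq not in MJFREQ_LABELS:
        raise ValueError(f"unknown MJFREQ code: {mjfreq}")
    return (1.0, float(mjfreq), float(mjfreq * mjfreq))
```

Note that a lower code means more frequent use, so a negative coefficient on MJFREQ would correspond to more joints among frequent users.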
The regression had two special features. The first was that the respondent could have said that he used zero joints during the month before the interview. After all, marijuana use during the year (MJFREQ) does not imply marijuana use during the month before the survey (JOINTS). To take this special feature into account, the regression specification was written:
where

Note that in this specification the error term is heteroscedastic and a linear function of the underlying latent variable Z.
Table D2 shows regression results.
Table D2
Regression Results: The Average Number of Joints Smoked in the Past Month
The table shows two regressions. Model 1 was estimated for the 1418 respondents who reported use of marijuana in the 1991 NHSDA survey. Model 2 was estimated for the 190 respondents whose use of marijuana was imputed by SAMHSA. We estimated two separate models because specification testing showed that estimates based on the 1418 cases did not work well for the 190 cases and vice versa.
The regressions overpredict slightly. Based on the 1418 cases, the regressions predict 23.4 joints on average per month. In reality, respondents said they used an average of 21.6 joints per month. For the 190 cases, the prediction was 10.7 joints on average per month and the actual was 8.5 joints. Because these predictions were only used when responses were missing for the variable JOINTS, we considered them to be close enough for our purposes.
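The size of the overprediction can be worked out directly from the averages quoted above. The helper name is hypothetical; the four input numbers are those given in the text.

```python
def overprediction(predicted, actual):
    """Return the absolute and percentage gap between predicted and actual."""
    gap = predicted - actual
    return gap, 100.0 * gap / actual


# (label, predicted joints per month, reported joints per month)
cases = [
    ("Model 1 (1418 cases)", 23.4, 21.6),
    ("Model 2 (190 cases)", 10.7, 8.5),
]
for label, predicted, actual in cases:
    gap, pct = overprediction(predicted, actual)
    print(f"{label}: {gap:.1f} joints, {pct:.1f}% above the reported average")
```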



