class: center, middle, inverse, title-slide .title[ # Survey Weights II
] .author[ ### Jason Thomas ] .institute[ ### R Working Group ] .date[ ### April 4th, 2025 ] --- # Introduction * Today we dive into the **survey** package ``` r install.packages("survey") ``` + *Complex Surveys: A Guide to Analysis Using R* -- Thomas Lumley * Basic concepts (quickly) * Survey Design + `svydesign()` * Analysis + `svyFUNCTION()`: e.g., `svytotal()` or `svymean()` --- # Survey Weights * design-based inference vs. model-based inference * survey package treats survey weights as 1 / probability of being sampled * finite population correction (sampling without replacement) + practically, you KNOW the population size, and want to adjust (reduce uncertainty) for the fact that you observe a fraction of the population + `fpc` argument for `svydesign()` --- class: inverse, center, middle # Survey Design --- # Survey Design * **Simple Random Sample** (SRS) + every member in the population has the same chance of being selected + example datasets + `apipop` — population of schools in California (with at least 100 students) with a total of 3,811,472 students + `apisrs` — SRS of size 200 from apipop * Stratified Samples * (Multistage) Cluster Sampling --- # Example data (apisrs) ``` r library(survey) data(api) ls() ## [1] "apiclus1" "apiclus2" "apipop" "apisrs" "apistrat" # help(api) ``` Explore `apisrs` (a simple random sample)
--- class: codefs-55 # Simple Random Sample ``` r *srs_design <- svydesign(ids = ~1, # individual rows (schools) are sampled * weights = ~pw, * data = apisrs) summary(srs_design) ``` ``` ## Independent Sampling design (with replacement) ## svydesign(ids = ~1, weights = ~pw, data = apisrs) ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.03229 0.03229 0.03229 0.03229 0.03229 0.03229 ## Data variables: ## [1] "cds" "stype" "name" "sname" "snum" "dname" ## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00" ## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both" ## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3" ## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col" ## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll" ## [37] "api.stu" "pw" "fpc" ``` ``` r svytotal(~enroll, srs_design) ## (truth: 3,811,472 students) ``` ``` ## total SE ## enroll 3621074 172325 ``` --- # Stratified Sample * divide the population into groups and sample within groups * adds precision, but need to know the stratum membership & size ``` r table(apistrat$stype) # type of school ``` ``` ## ## E H M ## 100 50 50 ``` ``` r unique(apistrat$fpc[apistrat$stype == "E"]) # Elementary ## [1] 4421 unique(apistrat$fpc[apistrat$stype == "M"]) # Middle ## [1] 1018 unique(apistrat$fpc[apistrat$stype == "H"]) # High School ## [1] 755 ``` --- class: codefs-55 # Stratified Sample ``` r strat_design <- svydesign(ids = ~1, * strata = ~stype, weights = ~pw, * # fpc = ~fpc * data = apistrat) summary(strat_design) ``` ``` ## Stratified Independent Sampling design (with replacement) ## svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat) ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.02262 0.02262 0.03587 0.04014 0.05339 0.06623 ## Stratum Sizes: ## E H M ## obs 100 50 50 ## design.PSU 100 50 50 ## actual.PSU 100 50 50 ## Data variables: ## [1] "cds" "stype" "name" "sname" "snum" "dname" ## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00" ## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both" ## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3" ## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col" ## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll" ## [37] "api.stu" "pw" "fpc" ``` --- class: slide-font-25, codefs-55 # Stratified Sample (cont.) Since we know the number of stratum (needed to calculate `pw`), there is no uncertainty. ``` r svytotal(~stype, strat_design) ``` ``` ## total SE ## stypeE 4421 0 ## stypeH 755 0 ## stypeM 1018 0 ``` Compare standard errors to get an idea of improved efficiency offered by stratifying ``` r svytotal(~enroll, srs_design) ## (truth: 3,811,472 students) ``` ``` ## total SE ## enroll 3621074 172325 ``` ``` r svytotal(~enroll, strat_design) ``` ``` ## total SE ## enroll 3687178 117319 ``` --- # Cluster Sampling * Divide the population into (mutually exclusive) groups/clusters, draw a (small) random sample of clusters, and collect data on all items in the cluster * Example dataset for single-stage cluster sampling: `apiclus1` + draw a sample of 15 school districts (out of 757 total) + include each school in the sampled districts * Example datasets for multistage cluster sampling: `nhanes` & `apiclus2` + stratified sampling & cluster sampling within each stratum (`nhanes`) + cluster sampling at both stages (`apiclus2`) --- # Example data (apiclus1) ``` r nrow(apiclus1) ## sample size ``` ``` ## [1] 183 ``` ``` r length(unique(apipop$dnum)) ## number of school districts ``` ``` ## [1] 757 ``` ``` r unique(apiclus1$fpc) ## check n districts ``` ``` ## [1] 757 ``` ``` r table(apiclus1$dnum) ## sampled districts ``` ``` ## ## 61 135 178 197 255 406 413 437 448 510 568 637 716 778 815 ## 13 34 4 13 16 2 1 4 12 21 9 11 37 2 4 ``` ``` r summary(apiclus1$pw) ## should be 757/15 !!! ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 33.85 33.85 33.85 33.85 33.85 33.85 ``` --- class: codefs-55 # Single-stage cluster sample ``` r *clus1_design <- svydesign(id = ~dnum, # cluster id weights = ~pw, #fpc=~fpc, data = apiclus1) summary(clus1_design) ``` ``` ## 1 - level Cluster Sampling design (with replacement) ## With (15) clusters. ## svydesign(id = ~dnum, weights = ~pw, data = apiclus1) ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.02954 0.02954 0.02954 0.02954 0.02954 0.02954 ## Data variables: ## [1] "cds" "stype" "name" "sname" "snum" "dname" ## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00" ## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both" ## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3" ## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col" ## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll" ## [37] "api.stu" "fpc" "pw" ``` --- # Relative efficiency (truth: 3,811,472 students) ``` r svytotal(~enroll, srs_design) ``` ``` ## total SE ## enroll 3621074 172325 ``` ``` r svytotal(~enroll, strat_design) ``` ``` ## total SE ## enroll 3687178 117319 ``` ``` r svytotal(~enroll, clus1_design) ``` ``` ## total SE ## enroll 3404940 941611 ``` --- # Example data (nhanes) A data frame with 8,591 observations on the following 7 variables. ``` r # help(nhanes) data(nhanes) ``` * SDMVPSU — Primary sampling units * SDMVSTRA — Sampling strata * WTMEC2YR — Sampling weights (text is quoted from `nhanes` help file) --- # Multistage Design ``` r nhanes_design <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, nest = TRUE, # PSU id is recycled in strata data=nhanes) summary(nhanes_design) ``` ``` ## Stratified 1 - level Cluster Sampling design (with replacement) ## With (31) clusters. ## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, ## nest = TRUE, data = nhanes) ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 6.323e-06 2.470e-05 4.482e-05 5.097e-05 6.984e-05 2.330e-04 ## Stratum Sizes: ## 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 ## obs 650 634 711 689 564 630 577 605 571 532 499 757 459 504 209 ## design.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 ## actual.PSU 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 ## Data variables: ## [1] "SDMVPSU" "SDMVSTRA" "WTMEC2YR" "HI_CHOL" "race" "agecat" "RIAGENDR" ``` --- # nest = TRUE ``` r table(nhanes$SDMVPSU, nhanes$SDMVSTRA) ``` ``` ## ## 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 ## 1 307 321 389 310 290 323 267 295 270 282 239 256 207 260 87 ## 2 343 313 322 379 274 307 310 310 301 250 260 291 252 244 122 ## 3 0 0 0 0 0 0 0 0 0 0 0 210 0 0 0 ``` --- class: codefs-50 # Another example Cluster sampling at both stages ``` r *clus2_design <- svydesign(id = ~dnum + snum, weights = ~pw, #fpc = ~fpc1 + fpc2, data = apiclus2) summary(clus2_design) ``` ``` ## 2 - level Cluster Sampling design (with replacement) ## With (40, 126) clusters. ## svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2) ## Probabilities: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.003669 0.037743 0.052840 0.042390 0.052840 0.052840 ## Data variables: ## [1] "cds" "stype" "name" "sname" "snum" "dname" ## [7] "dnum" "cname" "cnum" "flag" "pcttest" "api00" ## [13] "api99" "target" "growth" "sch.wide" "comp.imp" "both" ## [19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3" ## [25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col" ## [31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll" ## [37] "api.stu" "pw" "fpc1" "fpc2" ``` ``` r svytotal(~enroll, clus2_design, na.rm=TRUE) ``` ``` ## total SE ## enroll 2639273 820261 ``` --- class: inverse, center, middle # Analysis of Survey Data --- # Analysis of Survey Data * Descriptive statistics + creating new variables in survey design object * Estimates for Subpopulations * Regression --- # Descriptives — mean ``` r mean(apipop$api99) ## true value in population ``` ``` ## [1] 631.913 ``` ``` r svymean(~api99, srs_design) ## mean SE ## api99 624.68 9.6575 svymean(~api99, strat_design) ## mean SE ## api99 629.39 10.097 svymean(~api99, clus1_design) ## mean SE ## api99 606.98 24.469 svymean(~api99, clus2_design) ## mean SE ## api99 645.03 30.308 ``` --- # Descriptives — mean (cont.) ``` r mean(apipop$api99) ## value in pop ``` ``` ## [1] 631.913 ``` ``` r svymean(~ api99 + api00, clus2_design) ``` ``` ## mean SE ## api99 645.03 30.308 ## api00 670.81 30.712 ``` ``` r confint(svymean(~ api99 + api00, clus2_design), level = 0.95) ``` ``` ## 2.5 % 97.5 % ## api99 585.6321 704.4358 ## api00 610.6182 731.0054 ``` --- # Descriptives — mean (cont.) ``` r nhanes_tab <- svymean(~interaction(race, agecat), nhanes_design) nhanes_tab ``` ``` ## mean SE ## interaction(race, agecat)1.(0,19] 0.0426715 0.0077 ## interaction(race, agecat)2.(0,19] 0.1194048 0.0080 ## interaction(race, agecat)3.(0,19] 0.0291613 0.0024 ## interaction(race, agecat)4.(0,19] 0.0165118 0.0023 ## interaction(race, agecat)1.(19,39] 0.0562393 0.0107 ## interaction(race, agecat)2.(19,39] 0.1732181 0.0104 ## interaction(race, agecat)3.(19,39] 0.0378246 0.0039 ## interaction(race, agecat)4.(19,39] 0.0261259 0.0043 ## interaction(race, agecat)1.(39,59] 0.0371274 0.0081 ## interaction(race, agecat)2.(39,59] 0.2103115 0.0125 ## interaction(race, agecat)3.(39,59] 0.0348247 0.0030 ## interaction(race, agecat)4.(39,59] 0.0210259 0.0041 ## interaction(race, agecat)1.(59,Inf] 0.0145142 0.0040 ## interaction(race, agecat)2.(59,Inf] 0.1544932 0.0107 ## interaction(race, agecat)3.(59,Inf] 0.0175686 0.0022 ## interaction(race, agecat)4.(59,Inf] 0.0089771 0.0016 ``` --- # Descriptives — mean (cont.) ``` r nhanes_ftab <- ftable(nhanes_tab, rownames = list(race = c("Hisp", "NHW", "NHB", "Other"), agecat = c("0-19", "19-39", "39-59", "59+"))) round(nhanes_ftab * 100, 1) ``` ``` ## race Hisp NHW NHB Other ## agecat ## 0-19 mean 4.3 11.9 2.9 1.7 ## SE 0.8 0.8 0.2 0.2 ## 19-39 mean 5.6 17.3 3.8 2.6 ## SE 1.1 1.0 0.4 0.4 ## 39-59 mean 3.7 21.0 3.5 2.1 ## SE 0.8 1.2 0.3 0.4 ## 59+ mean 1.5 15.4 1.8 0.9 ## SE 0.4 1.1 0.2 0.2 ``` --- # Contrasts & Updates ``` r api_means <- svymean(~ api00 + api99, clus1_design) svycontrast(api_means, c(api00 = 1, api99 = -1)) ``` ``` ## contrast SE ## contrast 37.191 3.1162 ``` ``` r clus1_design <- update(clus1_design, apidiff = api00 - api99) clus1_design <- update(clus1_design, apipct = apidiff/api99) svymean(~apidiff + apipct, clus1_design) ``` ``` ## mean SE ## apidiff 37.191257 3.1162 ## apipct 0.067339 0.0064 ``` --- # Variance & Ratio ``` r svyvar(~api99, srs_design) ``` ``` ## variance SE ## api99 18654 1400.2 ``` ``` r svyratio(~api.stu, ~enroll, srs_design) # api.stu = num of students tested ``` ``` ## Ratio estimator: svyratio.survey.design2(~api.stu, ~enroll, srs_design) ## Ratios= ## enroll ## api.stu 0.8253537 ## SEs= ## enroll ## api.stu 0.01023543 ``` --- # Quantiles ``` r svyquantile(~api99, srs_design, c(.25, .50, .75), ci = TRUE) ``` ``` ## $api99 ## quantile ci.2.5 ci.97.5 se ## 0.25 513 485 531 11.66354 ## 0.5 618 593 643 12.67776 ## 0.75 738 704 758 13.69199 ## ## attr(,"hasci") ## [1] TRUE ## attr(,"class") ## [1] "newsvyquantile" ``` --- # Subpopulations `emerg` — percent of teachers with emergency qualifications ``` r emerg_high <- subset(strat_design, emer > 20) emerg_low <- subset(strat_design, emer == 0) svymean(~ api00 + api99, emerg_high) ``` ``` ## mean SE ## api00 558.52 21.983 ## api99 523.99 21.853 ``` ``` r svymean(~ api00 + api99, emerg_low) ``` ``` ## mean SE ## api00 749.09 17.739 ## api99 720.07 19.305 ``` --- # Subpopulations (cont.) * `sch.wide` — met school-wide growth target (0/1) * `hsg` — % of parents who are high-school graduates ``` r strat_design <- update(strat_design, hsg_gt25 = hsg > 25) bys <- svyby(~api00, ~sch.wide + hsg_gt25, svymean, design = strat_design) print(bys, digits = 3) ``` ``` ## sch.wide hsg_gt25 api00 se ## No.FALSE No FALSE 643 20.0 ## Yes.FALSE Yes FALSE 712 13.3 ## No.TRUE No TRUE 516 24.8 ## Yes.TRUE Yes TRUE 623 14.3 ``` --- class: codefs-55 # Regression — linear model ``` r mod1 <- svyglm(HI_CHOL ~ as.factor(race) + as.factor(agecat) + RIAGENDR, design = nhanes_design) summary(mod1) ``` ``` ## ## Call: ## svyglm(formula = HI_CHOL ~ as.factor(race) + as.factor(agecat) + ## RIAGENDR, design = nhanes_design) ## ## Survey design: ## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, ## nest = TRUE, data = nhanes) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.011831 0.011101 -1.066 0.3143 ## as.factor(race)2 -0.006547 0.007099 -0.922 0.3804 ## as.factor(race)3 -0.034668 0.010817 -3.205 0.0107 * ## as.factor(race)4 -0.012214 0.028693 -0.426 0.6803 ## as.factor(agecat)(19,39] 0.069709 0.009083 7.675 3.08e-05 *** ## as.factor(agecat)(39,59] 0.169166 0.012563 13.465 2.87e-07 *** ## as.factor(agecat)(59,Inf] 0.144529 0.013859 10.429 2.52e-06 *** ## RIAGENDR 0.020132 0.007915 2.543 0.0315 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.09634641) ## ## Number of Fisher Scoring iterations: 2 ``` --- class: codefs-55 # Regression — logistic regression ``` r mod2 <- svyglm(HI_CHOL ~ as.factor(race) + as.factor(agecat) + RIAGENDR, family = quasibinomial(link = "logit"), design = nhanes_design) summary(mod2) ``` ``` ## ## Call: ## svyglm(formula = HI_CHOL ~ as.factor(race) + as.factor(agecat) + ## RIAGENDR, design = nhanes_design, family = quasibinomial(link = "logit")) ## ## Survey design: ## svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, ## nest = TRUE, data = nhanes) ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4.95074 0.28789 -17.196 3.42e-08 *** ## as.factor(race)2 -0.08489 0.07988 -1.063 0.3156 ## as.factor(race)3 -0.43322 0.15119 -2.865 0.0186 * ## as.factor(race)4 -0.14621 0.33642 -0.435 0.6741 ## as.factor(agecat)(19,39] 2.27973 0.32702 6.971 6.53e-05 *** ## as.factor(agecat)(39,59] 3.21236 0.35587 9.027 8.33e-06 *** ## as.factor(agecat)(59,Inf] 3.02997 0.35057 8.643 1.19e-05 *** ## RIAGENDR 0.21276 0.08461 2.515 0.0331 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for quasibinomial family taken to be 1.015249) ## ## Number of Fisher Scoring iterations: 7 ``` --- # Regression The `survey` package has additional functions for fitting regression models * `svyolr` — ordinal logistic regression * `svyloglin` — loglinear model * `svycoxph` — Cox proportional hazards model There are also useful functions for creating plots * `svyplot` * `svyboxplot` * `svyhist` & `svysmooth` * `svycoplot`