
A novel cluster sampling design that couples multiple surveys to support multiple inferential objectives

Health Services and Outcomes Research Methodology

Abstract

In the United States, the number of health systems that own practices or hospitals has increased, and the systems themselves have grown more complex, leading to interest in assessing the relationship between health-organization factors and health outcomes. However, the existence of multiple types of organizations, combined with the nesting of some hospitals and practices within health systems and of some health systems within larger health systems, generates numerous analytic objectives and complicates the construction of optimal survey designs. An objective function that explicitly weighs all objectives is theoretically appealing but becomes unwieldy and increasingly ad hoc as the number of objectives increases. To overcome this problem, we develop an alternative approach based on constraining the sampling design to satisfy desired statistical properties. For example, to support evaluations of the comparative importance of factors measured in different surveys on health-system performance, one may enforce a constraint requiring at least one organization of each type (corporate owner, hospital, practice) to be sampled whenever any component of a system is sampled. Multiple such constraints define a nonlinear system of equations that "couples" the survey sampling designs and whose solution yields the sample inclusion probabilities for each organization in each survey. A Monte Carlo algorithm is developed to solve this simultaneous system of equations, determine the sampling probabilities, and extract the samples for each survey. We illustrate the new sampling methodology by developing the constraints and solving the ensuing systems of equations to obtain the sampling design for the National Surveys of United States Health Care Systems, Hospitals and Practices. We illustrate the virtues of "coupled sampling" by comparing the proportion of eligible systems for which the corporate owner, a hospital, and a practice are all expected to be sampled with the corresponding proportion under alternative sampling designs. Comparative and descriptive analyses that illustrate features of the sampling design are also presented.


References

  • AHRQ: Comparative Health System Performance. AHRQ-Funded Center of Excellence: Dartmouth-Harvard-Mayo Clinic, Berkeley (2019)

  • Biemer, P.P., Christ, S.L.: Weighting survey data. In: de Leeuw, E.D., Hox, J.J., Dillman, D.A. (eds.) International Handbook of Survey Methodology, pp. 317–341. Routledge, London (2008)

  • Boldi, P., Santini, M., Vigna, S.: Do your worst to make the best: paradoxical effects in PageRank incremental computations. In: Leonardi, S. (ed.) Algorithms and Models for the Web-Graph. WAW 2004. Lecture Notes in Computer Science, vol. 3243. Springer, Berlin (2004)

  • Bollen, K.A., Biemer, P.P., Tueller, S., Berzofsky, M.E.: Are survey weights needed? A review of diagnostic tests in regression analysis. Annu. Rev. Stat. Appl. 3, 375–392 (2016)

  • Bucher, C.G.: Adaptive sampling—an iterative fast Monte-Carlo procedure. Struct. Saf. 5(2), 119–126 (1988)

  • Chen, P.C.: Heuristic sampling: a method for predicting the performance of tree searching programs. SIAM J. Comput. 21(2), 295–315 (1992)

  • Cohen, M.P.: Determining sample sizes for surveys with data analyzed by hierarchical linear models. J. Off. Stat. 14, 267–275 (1998)

  • Cohen, R., Havlin, S., Ben-Avraham, D.: Efficient immunization strategies for computer networks and populations. Phys. Rev. Lett. 91(24), 247901 (2003)

  • Cook, R.D., Wong, W.K.: On the equivalence of constrained and compound optimal designs. J. Am. Stat. Assoc. 89, 687–692 (1994)

  • Currie, J.: Early childhood education programs. J. Econ. Perspect. 15(2), 213–238 (2001)

  • Diez Roux, A.V.: Investigating neighborhood and area effects on health. Am. J. Public Health 91(11), 1783–1789 (2001)

  • Durlauf, S.N.: Neighborhood effects. In: Handbook of Regional and Urban Economics, vol. 4, pp. 2173–2242. Elsevier, Amsterdam (2004)

  • Etikan, I., Musa, S.A., Alkassim, R.S.: Comparison of convenience sampling and purposive sampling. Am. J. Theor. Appl. Stat. 5(1), 1–4 (2016)

  • Faber, J., Sharkey, P.: Neighborhood effects. In: International Encyclopedia of the Social & Behavioral Sciences, 2nd edn, pp. 443–449. Elsevier, Amsterdam (2015)

  • Fisher, E., Shortell, S., O’Malley, A.J., Fraze, T., Wood, A., Palm, M., Colla, C., Rosenthal, M., Rodriguez, H., Lewis, V., Woloshin, S., Shah, N., Meara, E.: Do integrated systems adopt more care delivery and payment reforms? Results from a national survey. Health Aff. (2020)

  • Garner, C.L., Raudenbush, S.W.: Neighborhood effects on educational attainment: a multilevel analysis. Sociol. Educ. 64(4), 251–262 (1991)

  • Gelman, A.: Struggles with survey weighting and regression modeling. Stat. Sci. 22(2), 153–164 (2007)

  • Gilks, W.R.: Derivative-free adaptive rejection sampling for Gibbs sampling. In: Bernardo, J., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.) Bayesian Statistics, vol. 4, pp. 641–649. Oxford University Press, Oxford (1992)

  • Gilks, W.R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. J. R. Stat. Soc. Ser. C (Appl. Stat.) 41(2), 337–348 (1992)

  • Goodman, L.A.: Snowball sampling. Ann. Math. Stat. 32, 148–170 (1961)

  • Handcock, M.S., Gile, K.J.: Comment: on the concept of snowball sampling. Sociol. Methodol. 41(1), 367–371 (2011)

  • Harbitz, A.: Efficient and accurate probability of failure calculation by use of the importance sampling technique. In: ICASP4, Firenze (1983)

  • Heckathorn, D.D.: Respondent-driven sampling: a new approach to the study of hidden populations. Soc. Probl. 44(2), 174–199 (1997)

  • IQVIA: OneKey Data (2019). https://www.onekeydata.com/about

  • Kaminska, O., Lynn, P.: The implications of alternative allocation criteria in adaptive design for panel surveys. J. Off. Stat. 33(3), 781–799 (2017)

  • Kish, L.: Weighting: why, when, and how. In: Proceedings of the Survey Research Methods Section, Joint Statistical Meetings (1990)

  • Little, R.J.: Inference with survey weights. J. Off. Stat. 7(4), 405 (1991)

  • Little, R.J.: To model or not to model? Competing modes of inference for finite population sampling. J. Am. Stat. Assoc. 99(466), 546–556 (2004)

  • Lohr, S.L.: Sampling: Design and Analysis, 2nd edn. Brooks/Cole, Boston (2009)

  • Maiya, A.S., Berger-Wolf, T.Y.: Benefits of bias: towards better characterization of network sampling. In: 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011)

  • Moerbeek, M., Wong, W.K.: Multiple-objective optimal designs for the hierarchical linear model. J. Off. Stat. 18(2), 291–303 (2002)

  • Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. 97(4), 558–625 (1934)

  • Palinkas, L.A., Horwitz, S.M., Green, C.A., Wisdom, J.P., Duan, N., Hoagwood, K.: Purposeful sampling for qualitative data collection and analysis in mixed method implementation research. Adm. Policy Ment. Health Ment. Health Serv. Res. 42(5), 533–544 (2015)

  • Patton, M.Q.: Qualitative Research and Evaluation Methods. Sage, Thousand Oaks (2002)

  • Pfeffermann, D.: The role of sampling weights when modeling survey data. Int. Stat. Rev. 61(2), 317–337 (1993)

  • Pfeffermann, D.: The use of sampling weights for survey data analysis. Stat. Methods Med. Res. 5(3), 239–261 (1996)

  • Pfeffermann, D., Skinner, C.J., Holmes, D.J., Goldstein, H., Rasbash, J.: Weighting for unequal selection probabilities in multilevel models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 60, 23–40 (1998)

  • Rao, J.N.K., Verret, F., Hidiroglou, M.A.: A weighted composite likelihood approach to inference for two-level models from survey data. Surv. Methodol. 39(2), 263–282 (2013)

  • Särndal, C.-E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)

  • StataCorp: Stata Statistical Software: Release 15. StataCorp LLC, College Station (2017)

  • Stutzbach, D., Rejaie, R., Duffield, N., Sen, S., Willinger, W.: On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans. Netw. 17(2), 377–390 (2009)

  • Tillé, Y., Favre, A.-C.: Optimal allocation in balanced sampling. Stat. Probab. Lett. 74(1), 31–37 (2005)

  • Yi, G.Y., Rao, J.N., Li, H.: A weighted composite likelihood approach for analysis of survey data under two-level models. Stat. Sin. 26(2), 569–587 (2016)

Funding

This work was supported by the Agency for Healthcare Research and Quality’s (AHRQ’s) Comparative Health System Performance Initiative under Grant # 1U19HS024075, which studies how health care delivery systems promote evidence-based practices and patient-centered outcomes research in delivering care. The findings and conclusions in this article are those of the author(s) and do not necessarily reflect the views of AHRQ. The statements, findings, conclusions, views, and opinions contained and expressed in this article are based in part on data obtained under license from IQVIA information services: OneKey subscription information services 2010–2017, IQVIA Incorporated. All rights reserved. The statements, findings, conclusions, views, and opinions contained and expressed herein are not necessarily those of IQVIA Incorporated or any of its affiliated or subsidiary entities.

Author information

Corresponding author

Correspondence to A. James O’Malley.

Ethics declarations

Conflict of interest

Neither author has received honoraria from for-profit companies, non-profit organizations, or government agencies, or owns stock in any company, that would create a conflict of interest in relation to this paper. Accordingly, neither author declares any conflict of interest.

Ethical approval

This paper does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was not required as the study subjects are organizations, not individuals, and a commercial data set was the basis of the research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Multiple-objectives optimal designs

Classical survey designs often involve some form of Neyman allocation; for example, the objective is often to minimize the variance of an unbiased estimator of the target quantity subject to a budget or sample-size constraint. With multiple targets of inference within a single regression model the situation is more complicated, let alone when multiple regression models will be estimated. When one is interested in evaluating the effects of multiple factors in one or more regression models, a multiple-objective design problem obtains. If the success in meeting the K objectives is quantified by statistical efficiency measures denoted \(\text{Eff}_k(Y,X,n)\), a multiple-objective optimizing function that combines them additively is:

$$\text{Eff}(Y,X,n) = \sum_{k=1}^{K} w_k \, \text{Eff}_k(Y,X,n)$$

where \(w_k \ge 0\) and \(\sum_{k=1}^{K} w_k = 1\).
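
For concreteness, a minimal R sketch of the compound objective follows; the efficiency values and weights are illustrative placeholders, not quantities from the paper.

```r
# Compound objective: combine K per-objective efficiencies Eff_k with
# nonnegative weights w_k that sum to 1 (all values below are made up).
compound_efficiency <- function(eff, w) {
  stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))
  sum(w * eff)  # Eff(Y, X, n) = sum_k w_k * Eff_k(Y, X, n)
}

compound_efficiency(eff = c(0.8, 0.6, 0.9), w = c(0.5, 0.3, 0.2))  # 0.76
```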

The design effect of a standard cluster randomized design with equal sample sizes per cluster is \(1 + (m-1)\rho\), where \(m\) is the number of units sampled within each cluster, \(\rho = \sigma_b^2/(\sigma_b^2 + \sigma_w^2)\) is the intraclass correlation coefficient, and \(\sigma_b^2\) and \(\sigma_w^2\) are the between-cluster and within-cluster variance components. Let \(n\) be the number of clusters. If the cost of sampling a cluster is \(C_u\) and the cost of sampling a unit within a cluster is \(C_k\), the total cost is \(C = n(C_u + mC_k)\). The optimal design for estimating the coefficient of a within-cluster predictor (e.g., \(\beta_4\) in the first regression model) maximizes the total number of observations, which for \(C_u > 0\) occurs when \(n = 1\) and \(m = (C - C_u)/C_k\). Note the indifference of this solution to \(\rho\). However, the optimal design for estimating the coefficient of a cluster-level predictor (e.g., \(\beta_1\) in the first regression model) is given by

$$m = \max\left\{1,\ \left(\frac{C_u\,(1-\rho)}{C_k\,\rho}\right)^{1/2}\right\}$$

and

$$n = \frac{C}{C_u + mC_k}$$

If \(\rho \approx 0\) the two optimal designs are essentially equivalent, yet if \(\rho \approx 1\) they are polar opposites. Furthermore, if \(\rho\) and \(C_u\) are large there is a great loss of statistical efficiency from using the \(\beta_4\)-optimal design to estimate \(\beta_1\), while in general the \(\beta_1\)-optimal design fails to identify, let alone efficiently estimate, \(\beta_4\). Therefore, even in this simple case, different objectives lead to drastically different optimal designs. If the above two objectives were combined, the relative weight given to each would have a substantial impact on the resulting design, yet the specification of such weights might be arbitrary.
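
The closed-form solution above is easy to evaluate numerically. The following R sketch (with a made-up budget and costs, not values from the paper) computes the \(\beta_1\)-optimal allocation and shows how sharply it varies with \(\rho\):

```r
# Optimal cluster size m and number of clusters n for estimating a
# cluster-level effect, using the closed-form solution above.
# C = total budget, Cu = cost per cluster, Ck = cost per within-cluster unit.
optimal_allocation <- function(C, Cu, Ck, rho) {
  m <- max(1, sqrt(Cu * (1 - rho) / (Ck * rho)))  # within-cluster sample size
  n <- C / (Cu + m * Ck)                          # clusters the budget affords
  c(m = m, n = n, design_effect = 1 + (m - 1) * rho)
}

optimal_allocation(C = 10000, Cu = 100, Ck = 5, rho = 0.05)  # m ~ 19.5: many units per cluster
optimal_allocation(C = 10000, Cu = 100, Ck = 5, rho = 0.90)  # m ~ 1.5: spread budget across clusters
```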

To avoid the above predicament, we favor specifying the design directly via the type of solution known to be amenable to the analytic scenarios of interest, such as the estimability of complex hierarchical models. The paper develops a novel computational procedure that solves a system of equations to yield a numerical solution for the optimal sampling design (i.e., the sampling probabilities) that satisfies the design constraints. This approach essentially specifies the weights \(w_k\) for each objective implicitly (i.e., they are inversely defined by the constraints placed on the optimal solution) rather than specifying them upfront and holding them fixed while the optimal design is determined. However, making a formal connection between the two approaches (i.e., establishing a primal–dual relationship between the problems) was not an objective of this paper.

1.2 GitHub site and code

The code used to perform the calculations in this paper is an R script available at the GitHub site maintained by the first author: https://github.com/kiwijomalley/Novel-Sampling-Design-Algorithm. The script takes as input a data set containing summary information about health systems, owner subsidiaries, and their underlying hospitals and physician practices. The data set provided on the GitHub site is synthetic because the Data Use Agreement for the project prohibits sharing the actual data; however, it allows the computations performed in the paper to be fully illustrated.

1.3 Accounting for sampling design in statistical analyses in Stata

Statistical analyses that use the survey weights can be operationalized with relative ease. The sampling design may be declared in advance using the svyset command in Stata. The presence of hospitals and practices nested within a corporate owner, and of owner subsidiaries nested within corporate parents, leads to a three-level hierarchical data structure. The appropriate svyset command has the form:

svyset CP_ID, weight(CP_weight) || OS_ID, weight(OS_weight) || HP_ID, weight(HP_weight)

where CP_ID, OS_ID and HP_ID denote the identification codes of the corporate parent, the owner subsidiary, and the hospital or practice, and CP_weight, OS_weight and HP_weight denote the inverse of the inclusion probability of the corporate parent and the inverses of the conditional inclusion probabilities of the owner subsidiary and of the hospital or practice, respectively. As noted in Sect. 3.2, the conditional inclusion probability for an owner subsidiary equals the inclusion probability determined by Algorithm 1 divided by the inclusion probability of its corporate parent. The conditional inclusion probabilities for hospitals and practices are determined from the sampling design used within systems and owner subsidiaries to select hospitals and practices. For example, under simple random sampling (SRS) these probabilities equal the number of surveys allocated to hospitals (practices) divided by the number of hospitals (practices) within the organization. Expanding on Sect. 2.1, to allow survey designs with three different structures (CP–OS–HP, CP–HP, and independent HP) to be combined in a single analysis, we set OS_ID = HP_ID if OS_ID is not defined and CP_ID = HP_ID if CP_ID is not defined (e.g., as for an independent hospital or practice). This ensures that the IDs are defined for all hospitals and practices, allowing statistical models and procedures to be applied to the combined data.
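
For analysts working in R rather than Stata, the following is a hedged sketch of an analogous design declaration using the survey package; the data frame dat and its *_prob columns are hypothetical stand-ins for the quantities described above, not outputs shipped with the paper.

```r
# Sketch of an R analogue to the svyset call, assuming a data frame `dat`
# with one row per hospital/practice and (hypothetical) columns holding the
# inclusion probabilities produced by the sampling algorithm.
library(survey)

# Weights are inverses of the (conditional) inclusion probabilities; the
# owner-subsidiary conditional probability is its marginal probability
# divided by that of its corporate parent (Sect. 3.2).
dat$CP_weight <- 1 / dat$CP_prob
dat$OS_weight <- 1 / (dat$OS_prob / dat$CP_prob)
dat$HP_weight <- 1 / dat$HP_cond_prob

# The overall weight for the three-stage design is the product of the
# stage-specific weights.
dat$pw <- dat$CP_weight * dat$OS_weight * dat$HP_weight

des <- svydesign(ids = ~CP_ID + OS_ID + HP_ID, weights = ~pw, data = dat)
```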

The meglm command in Stata allows mixed-effects models to be estimated with survey weights. For a binary-valued outcome, the code

svy: melogit {model} || CP_ID: || OS_ID:

or

meglm {model} [pweight=HP_weight] || CP_ID:, pweight(CP_weight) || OS_ID:, pweight(OS_weight), family(binomial) link(logit)

could be used. The difference between the two specifications is that the latter does not rely on the sampling design having been declared via svyset. In general, it is best to declare the design in advance, as some procedures do not otherwise accept sampling-design weights.
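
As a rough R counterpart (not part of the paper's workflow), a weighted multilevel logistic model could in principle be fit with the WeMix package, assuming its mix() interface accepts level-specific weight variables (level-1 weights listed first) and a binomial family; the outcome and predictor names below are hypothetical.

```r
# Hedged sketch: weighted three-level logistic regression in R via WeMix,
# mirroring the meglm call above. `dat`, `adopted`, and `system_factor`
# are hypothetical names, not variables from the paper.
library(WeMix)

fit <- mix(adopted ~ system_factor + (1 | OS_ID) + (1 | CP_ID),
           data = dat,
           weights = c("HP_weight", "OS_weight", "CP_weight"),
           family = binomial(link = "logit"))
summary(fit)
```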

About this article

Cite this article

O’Malley, A.J., Park, S. A novel cluster sampling design that couples multiple surveys to support multiple inferential objectives. Health Serv Outcomes Res Method 20, 85–110 (2020). https://doi.org/10.1007/s10742-020-00210-y
