Abstract
In the United States, health systems that own practices or hospitals have increased in number and complexity, leading to interest in assessing the relationship between health organization factors and health outcomes. However, the existence of multiple types of organizations, combined with the nesting of some hospitals and practices within health systems and the nesting of some health systems within larger health systems, generates numerous analytic objectives and complicates the construction of optimal survey designs. An objective function that explicitly weighs all objectives is theoretically appealing but becomes unwieldy and increasingly ad hoc as the number of objectives increases. To overcome this problem, we develop an alternative approach based on constraining the sampling design to satisfy desired statistical properties. For example, to support evaluations of the comparative importance of factors measured in different surveys on health system performance, a constraint may be enforced that requires at least one organization of each type (corporate owner, hospital, practice) to be sampled whenever any component of a system is sampled. Multiple such constraints define a nonlinear system of equations that “couples” the survey sampling designs; its solution yields the sample inclusion probabilities for each organization in each survey. A Monte Carlo algorithm is developed to solve the simultaneous system of equations, determine the sampling probabilities, and extract the samples for each survey. We illustrate the new sampling methodology by developing the constraints and solving the ensuing systems of equations to obtain the sampling design for the National Surveys of United States Health Care Systems, Hospitals and Practices.
We illustrate the virtues of “coupled sampling” by comparing the proportion of eligible systems for which the corporate owner, a hospital, and a practice are all expected to be sampled with the corresponding proportion under alternative sampling designs. Comparative and descriptive analyses that illustrate features of the sampling design are also presented.
Funding
This work was supported by the Agency for Healthcare Research and Quality’s (AHRQ’s) Comparative Health System Performance Initiative under Grant # 1U19HS024075, which studies how health care delivery systems promote evidence-based practices and patient-centered outcomes research in delivering care. The findings and conclusions in this article are those of the author(s) and do not necessarily reflect the views of AHRQ. The statements, findings, conclusions, views, and opinions contained and expressed in this article are based in part on data obtained under license from IQVIA information services: OneKey subscription information services 2010–2017, IQVIA Incorporated, all rights reserved. The statements, findings, conclusions, views, and opinions contained and expressed herein are not necessarily those of IQVIA Incorporated or any of its affiliated or subsidiary entities.
Ethics declarations
Conflict of interest
Neither author has received honoraria from for-profit companies, non-profit organizations, or government agencies, or owns stock in any company, that would create a conflict of interest in relation to this paper. Accordingly, neither author declares any conflict of interest.
Ethical approval
This paper does not contain any studies with human participants or animals performed by either of the authors.
Informed consent
Informed consent was not required as the study subjects are organizations, not individuals, and a commercial data set was the basis of the research.
Appendix
1.1 Multiple-objectives optimal designs
Classical survey designs often involve some form of Neyman allocation. For example, the objective is often minimization of the variance of an unbiased estimator of the quantity being estimated subject to a budgetary or sample-size constraint. With multiple targets of inference within a single regression model the situation is more complicated, let alone when multiple regression models will be estimated. When one is interested in evaluating the effects of multiple factors in one or more regression models, a multiple-objectives design problem results. If the success of meeting K objectives is quantified by statistical efficiency measures denoted \({\text{Eff}}_{k} \left( {Y,X,n} \right)\), a multiple-objectives optimizing function that combines them additively is:
$$\mathop \sum \limits_{k = 1}^{K} w_{k} {\text{Eff}}_{k} \left( {Y,X,n} \right),$$
where \(w_{k} \ge 0\) and \(\mathop \sum \nolimits_{k = 1}^{K} w_{k} = 1\).
The design effect of a standard cluster randomized design with equal sample sizes per cluster is \(1 + \left( {m - 1} \right)\rho\), where \(m\) is the number of units sampled within each cluster, \(\rho = \sigma_{b}^{2} /\left( {\sigma_{b}^{2} + \sigma_{w}^{2} } \right)\) is the intraclass correlation coefficient, and \(\sigma_{b}^{2}\) and \(\sigma_{w}^{2}\) are the between-cluster and within-cluster variance components. Let \(n\) be the number of clusters. If the cost of sampling a cluster is \(C_{u}\) and the cost of sampling a unit within the cluster is \(C_{k}\), the total cost is \(C = n\left( {C_{u} + mC_{k} } \right)\). The optimal design for estimating the coefficient of a within-cluster predictor (e.g., \(\beta_{4}\) in the first regression model) maximizes the total number of observations, which for \(C_{u} > 0\) occurs when \(n = 1\) and \(m = \left( {C - C_{u} } \right)/C_{k}\). Note the indifference of this solution to \(\rho\). However, the optimal design for estimating the coefficient of a cluster-level predictor (e.g., \(\beta_{1}\) in the first regression model) is given by
$$m_{\text{opt}} = \sqrt {\frac{{C_{u} }}{{C_{k} }} \cdot \frac{1 - \rho }{\rho }}$$
and
$$n_{\text{opt}} = \frac{C}{{C_{u} + m_{\text{opt}} C_{k} }}.$$
If \(\rho \approx 0\) the optimal designs are essentially equivalent, yet if \(\rho \approx 1\) they are polar opposites. Furthermore, if \(\rho\) and \(C_{u}\) are large, there is a great loss of statistical efficiency from using the \(\beta_{4}\)-optimal design to estimate \(\beta_{1}\), while in general the \(\beta_{1}\)-optimal design fails to identify, let alone estimate efficiently, \(\beta_{4}\). Therefore, even in this simple case, different objectives lead to drastically different optimal designs. If the above two objectives were combined, the relative weight given to each would have a substantial impact on the resulting design; however, the specification of such weights might be arbitrary.
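As a quick numeric check of the cluster-level allocation \(m_{\text{opt}} = \sqrt{(C_{u}/C_{k})(1-\rho)/\rho}\), consider an illustrative case (the cost ratio and \(\rho\) below are assumptions for exposition, not values from the surveys):

```latex
% Illustrative assumptions: C_u = 4 C_k (a cluster costs four times a
% within-cluster unit) and rho = 0.2.
m_{\mathrm{opt}} = \sqrt{\frac{C_u}{C_k}\cdot\frac{1-\rho}{\rho}}
                 = \sqrt{4\cdot\frac{0.8}{0.2}} = \sqrt{16} = 4,
\qquad
n_{\mathrm{opt}} = \frac{C}{C_u + 4\,C_k} = \frac{C}{8\,C_k}.
```

Thus each sampled cluster would receive four units, and the number of clusters follows from the available budget.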
To avoid the above predicament, we favor specifying the design by directly imposing the type of solution that is known to be amenable to the analytic scenarios of interest, such as the estimability of complex hierarchical models. The paper develops a novel computational procedure that solves a system of equations to yield a numerical solution for the optimal sampling design (i.e., the sampling probabilities) that satisfies the constraints of the design. This approach essentially specifies the weights \(w_{k}\) for each objective implicitly (i.e., they are inversely defined by the specified optimal-solution constraints) rather than having them specified upfront and held fixed while the optimal design (and thus its form) is determined. However, making a formal connection between the two approaches (i.e., establishing a primal–dual relationship) was not an objective of this paper.
1.2 GitHub site and code
The code used to perform the calculations in this paper is an R script available at the GitHub site maintained by the first author: https://github.com/kiwijomalley/Novel-Sampling-Design-Algorithm. The script takes as input a data set containing summary information about health systems and owner subsidiaries and their underlying hospitals and physician practices. The data set provided on the GitHub site is simulated because the Data Use Agreement for the project prohibits sharing the actual data; however, it allows the computations performed in the paper to be fully illustrated.
1.3 Accounting for sampling design in statistical analyses in Stata
Statistical analyses that use the survey weights can be operationalized with relative ease. The sampling design may be declared in advance using the svyset command in Stata. The presence of hospitals and practices nested within a corporate owner, and of owner subsidiaries nested within corporate parents, leads to a three-level hierarchical data structure. The appropriate svyset command has the form:
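A sketch of such a three-stage declaration follows; the variable names track the text, but the exact specification used for the surveys may differ:

```stata
* Three-stage design: corporate parent, owner subsidiary, hospital/practice.
* The weight variables are the inverses of the (conditional) inclusion
* probabilities at each stage.
svyset CP_ID [pweight=CP_weight] ///
    || OS_ID, weight(OS_weight)  ///
    || HP_ID, weight(HP_weight)
```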
where CP_ID, OS_ID and HP_ID denote the identification codes for the corporate parent, the owner subsidiary, and the hospital or practice, and CP_weight, OS_weight and HP_weight denote the inverses of the inclusion probability of the corporate parent and of the conditional inclusion probabilities of the owner subsidiary and of the hospital or practice, respectively. As noted in Sect. 3.2, the conditional inclusion probability for an owner subsidiary equals the inclusion probability determined by Algorithm 1 divided by the inclusion probability of its corporate parent. The conditional inclusion probabilities for hospitals and practices are determined from the sampling design used within systems and owner subsidiaries to select hospitals and practices. For example, under simple random sampling (SRS) these sampling probabilities equal the number of surveys allocated to hospitals (practices) divided by the number of hospitals (practices) within the organization. Expanding on Sect. 2.1, to allow for the fact that survey designs with three different structures (CP–OS–HP, CP–HP and independent HP) may be combined in a single analysis, we set OS_ID = HP_ID if OS_ID is not defined and CP_ID = HP_ID if CP_ID is not defined (e.g., for an independent hospital or practice). This ensures that the IDs are defined for all hospitals and practices, allowing statistical models and procedures to be applied to the combined data.
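A minimal sketch of constructing the conditional weights described above (the input variable names, such as OS_inclprob and n_hp_sampled, are hypothetical placeholders):

```stata
* Conditional inclusion probability of an owner subsidiary given that its
* corporate parent was sampled, and the corresponding weight.
gen OS_condprob = OS_inclprob / CP_inclprob
gen OS_weight   = 1 / OS_condprob

* Under SRS within an organization: surveys allocated / units available.
gen HP_condprob = n_hp_sampled / n_hp_total
gen HP_weight   = 1 / HP_condprob

* Fill in IDs so that all three survey structures combine in one analysis.
replace OS_ID = HP_ID if missing(OS_ID)
replace CP_ID = HP_ID if missing(CP_ID)
```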
The meglm command in Stata allows for the estimation of mixed effects models with survey weights. For a binary-valued outcome, the model can be fit either with the svy prefix, which relies on the sampling design having been specified via svyset, or by supplying the level-specific weights directly in the meglm call. In general, it is best to set the design in advance, as some procedures do not allow sampling design weights to be specified directly.
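A sketch of the two specifications for a binary outcome (the outcome y and covariates x1 and x2 are hypothetical placeholders; the random-effects structure mirrors the CP–OS–HP hierarchy):

```stata
* Specification 1: rely on the design previously declared with svyset.
svy: meglm y x1 x2 || CP_ID: || OS_ID:, family(binomial) link(logit)

* Specification 2: pass the level-specific weights directly to meglm.
meglm y x1 x2 [pweight=HP_weight] ///
    || CP_ID:, pweight(CP_weight) ///
    || OS_ID:, pweight(OS_weight), family(binomial) link(logit)
```

The first form keeps the design declaration in one place; the second is useful when svyset has not been run or a different weighting scheme is being explored.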
About this article
Cite this article
O’Malley, A.J., Park, S. A novel cluster sampling design that couples multiple surveys to support multiple inferential objectives. Health Serv Outcomes Res Method 20, 85–110 (2020). https://doi.org/10.1007/s10742-020-00210-y