Recent advances in single-cell sequencing technologies provide the opportunity to perform large-scale association studies, such as case-control studies to define disease-associated cell phenotypes, stimulus-response studies where cell phenotypes may be altered by a stimulus, or QTL studies where genotype may be associated with cell type abundances. However, best practices for designing these experiments with sufficient statistical power remain understudied. We simulated single-cell RNA sequencing (scRNA-seq) datasets under various conditions and used MASC, a method for association testing with single-cell data that controls for interindividual differences and technical effects that can confound analysis, to detect differentially abundant populations in the simulated datasets. For example, we considered an scRNA-seq study containing 100,000 cells from equal numbers of cases and controls, in which a cell population at 5% frequency was on average 20% more abundant in cases than in controls. A 100-sample dataset with 1,000 cells collected per sample detected this differential abundance in 49% of simulations (at p < 0.05), while a 50-sample dataset with 2,000 cells collected per sample reached only 26% power. More generally, across a range of potential effect sizes, we consistently found that sample size affected power to detect differential abundance considerably more than the number of cells collected per sample did. We believe that our findings will allow researchers to design properly powered scRNA-seq experiments given the frequency of a cell population and the magnitude of its differential abundance between cases and controls, which will be key to defining disease-linked immunological subpopulations.
Chamith Fonseka – Doctoral Student, Harvard Medical School
Ilya Korsunsky – Postdoctoral Research Fellow, Harvard Medical School
Nghia Millard – Doctoral Student, Harvard Medical School
Soumya Raychaudhuri – Professor of Medicine, Brigham and Women's Hospital/Harvard Medical School
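
The trade-off between sample size and cells per sample described in the abstract can be illustrated with a small Monte Carlo sketch. This is not MASC and does not attempt to reproduce the reported 49% and 26% figures: it swaps in a Welch-type test on per-sample cluster fractions, assumes a logit-normal model for donor-to-donor variability, and approximates cell sampling with a normal distribution. The function names, the `donor_sd` value, and the number of simulations are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, mean, variance


def simulate_sample_fraction(rng, base_freq, n_cells, donor_sd):
    """Observed fraction of cells in the cluster for one simulated donor."""
    # Assumption: donor-level frequency is logit-normal around base_freq.
    logit = math.log(base_freq / (1 - base_freq)) + rng.gauss(0.0, donor_sd)
    p = 1 / (1 + math.exp(-logit))
    # Assumption: normal approximation to binomial sampling of n_cells cells.
    count = rng.gauss(n_cells * p, math.sqrt(n_cells * p * (1 - p)))
    count = min(max(round(count), 0), n_cells)
    return count / n_cells


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (unequal variances).

    Uses a normal approximation to the t distribution, which is slightly
    anti-conservative for small groups.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 1.0
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def estimate_power(n_samples, n_cells, base_freq=0.05, fold=1.2,
                   donor_sd=0.4, n_sims=200, alpha=0.05, seed=0):
    """Fraction of simulated studies in which the test rejects at alpha."""
    rng = random.Random(seed)
    n_per_group = n_samples // 2  # equal numbers of cases and controls
    hits = 0
    for _ in range(n_sims):
        controls = [simulate_sample_fraction(rng, base_freq, n_cells, donor_sd)
                    for _ in range(n_per_group)]
        cases = [simulate_sample_fraction(rng, base_freq * fold, n_cells, donor_sd)
                 for _ in range(n_per_group)]
        if welch_p_value(cases, controls) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Both designs contain 100,000 cells in total.
    print("100 samples x 1000 cells:", estimate_power(100, 1000))
    print(" 50 samples x 2000 cells:", estimate_power(50, 2000))
```

Under these assumptions, donor-to-donor variability dominates the sampling noise within each donor, so adding donors reduces the standard error of the group difference more than adding cells per donor does. The exact power values depend heavily on the assumed `donor_sd` and on the test used.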