Response to Orlova et al. “Science not art: statistically sound methods for identifying subsets in multi-dimensional flow and mass cytometry data sets”

Y Saeys, S Van Gassen, B Lambrecht - Nature Reviews Immunology, 2018 - nature.com
Unsupervised learning techniques such as clustering and dimensionality reduction have been widely used in many high-dimensional biological settings where they shed light on the internal problem structure. In their correspondence on our Review (Computational flow cytometry: helping to make sense of high-dimensional immunology data. Nat. Rev. Immunol. 16, 449–462 (2016))1, Orlova et al. argue against the use of these techniques to identify cell populations in high-dimensional flow and mass cytometry data, based on arguments related to the curse of dimensionality (Science not art: statistically sound methods for identifying subsets in multi-dimensional flow and mass cytometry data sets. Nat. Rev. Immunol. http://dx.doi.org/10.1038/nri.2017.150-c1)2.
The curse of dimensionality states that the number of samples needed to fit a model to an arbitrary degree of precision increases exponentially with the number of parameters that describe the data. This in itself might not be problematic for cytometry data, as the number of parameters is still relatively low (a few tens of markers) and sample sizes are large (up to millions of cells). Other high-dimensional biological settings, such as transcriptomics, measure many more parameters (for example, 10,000 transcripts) for fewer samples (typically a few tens or hundreds of samples), resulting in far more challenging situations from a statistical point of view. Nevertheless, even in these high-dimensional, low-sample settings, clustering techniques have proved useful for highlighting grouping structures.
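As a rough illustration of the scaling described above, the following sketch uses a deliberately simple model that is our assumption rather than anything from the correspondence: covering a regular grid with one sample per cell, which requires a number of samples equal to the bins per dimension raised to the number of dimensions. The function names are hypothetical. The sample and parameter counts are round figures chosen to match the orders of magnitude mentioned in the text (30 markers for "a few tens", 1,000,000 cells, 100 samples, 10,000 transcripts).

```python
def samples_for_grid(n_dims: int, bins_per_dim: int = 2) -> int:
    """Samples needed to put one point in every cell of a regular grid.

    Illustrative model only: bins_per_dim ** n_dims grows exponentially
    with the number of dimensions.
    """
    return bins_per_dim ** n_dims


def samples_per_parameter(n_samples: int, n_params: int) -> float:
    """Ratio of observations to measured parameters."""
    return n_samples / n_params


# Exponential growth of the grid size with dimensionality (2 bins per axis).
growth = [samples_for_grid(d) for d in (1, 10, 20, 30)]

# Assumed round figures based on the ranges quoted in the text.
cytometry = samples_per_parameter(1_000_000, 30)      # cells per marker
transcriptomics = samples_per_parameter(100, 10_000)  # samples per transcript

print(growth)
print(f"cytometry: {cytometry:.0f} samples per parameter")
print(f"transcriptomics: {transcriptomics:.2f} samples per parameter")
```

Under these assumed figures, cytometry has tens of thousands of observations per measured parameter, whereas transcriptomics has far fewer observations than parameters, which is the contrast the paragraph draws.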