We often ask what the subtypes of COPD may be, but it’s less common to ask what we intend to do with the subtypes we want to discover. What are COPD subtypes for? Why do we need to discover them? There are usually two things that people want from a COPD subtype:
- the subtype accurately captures an aspect of COPD biology
- the subtype is correlated with an important clinical characteristic or outcome of COPD
An ideal subtype would give us both things at the same time, but I don’t think we’re there yet. There is still a gap between clearly “biological” subtypes and the best predictions we can get using supervised models. When we have subtypes that are clearly “biological” and are also essential for accurate prediction of important COPD outcomes, then perhaps we will have attained something like an updated, biologically driven COPD classification. It’s clear that we should think of the search for COPD subtypes as a long-term project. Since subtypes can mean different things to different people, trying to solve all of our subtyping needs in one step may not be the most productive approach.
To put things a bit more bluntly, there is a kind of machine learning analysis cycle that usually leads to disappointment, and that I think we should try to avoid. It goes like this:
- Step 1 (This is going to be great): We have lots of great data, and we’re going to do machine learning!
- Step 2 (Unbiased machine learning is great): We’re going to be unbiased, so we’re going to define some variables/outcomes that we want our subtypes to be associated with, and we’re going to hold these out of the machine learning process to validate what we find.
- Step 3 (Why isn’t this working): We did machine learning and found stuff, but that stuff isn’t as strongly associated with our validation measures as we hoped.
- Step 4 (The unacknowledged descent into supervised learning): Let’s try to modify our set of input variables and tweak our parameters so that our “validation” metrics look better.
- Step 5 (The results desert): We’ve done a lot of things, but we haven’t found a great solution, and we don’t have a good rationale for picking which one of our many clustering results should be the main one (as described in detail in this post).
- Step 6 (Strategic retreat): These results aren’t what we’d hoped for, but let’s make some ad hoc justifications and write this up.
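Step 4 in the cycle above can be made concrete with a small simulation. This is only a sketch on pure noise, not an analysis of any real COPD data: `crude_split` is a hypothetical stand-in for a clustering run (it just splits subjects on the sign of a few summed features), and every name and parameter here is an assumption made for illustration. The point is that if we try many analysis variants and keep the one whose “held-out” association looks best, the held-out variable has quietly become a training signal.

```python
import math
import random


def abs_corr(a, b):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)


rng = random.Random(1)
n_subjects, n_features, n_variants = 200, 20, 200

# Pure noise: by construction, no feature is related to the outcome.
features = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]


def crude_split(subset):
    """Hypothetical stand-in for one clustering run on a chosen feature subset."""
    return [1 if sum(row[j] for j in subset) > 0 else 0 for row in features]


# "Tweak the inputs" many times, scoring each variant against the held-out outcome.
scores = []
for _ in range(n_variants):
    subset = rng.sample(range(n_features), 3)
    scores.append(abs_corr(crude_split(subset), outcome))

prespecified = scores[0]                    # the single analysis we planned up front
median_score = sorted(scores)[n_variants // 2]
best = max(scores)                          # the one we are tempted to write up

print(f"pre-specified run: {prespecified:.3f}")
print(f"median variant:    {median_score:.3f}")
print(f"best of {n_variants}:       {best:.3f}")
```

The best-of-many score can never be lower than the pre-specified one, and on noise any gap between them is entirely an artifact of selection. A validation measure only stays a validation measure if it is not used to choose among results.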
Sound familiar? I’ve been there many times, and it can be frustrating. But it’s important to be honest about the challenges of unsupervised clustering in COPD with the intent of finding the best possible way forward. I think it helps to acknowledge the following things.
We’re a long way from the dream of true, biologically correct COPD subtypes. COPD is complex and difficult to study. We don’t fully understand the natural history and causative factors of COPD, and dividing COPD into its biological subtypes implies a level of understanding that doesn’t yet exist. Paradoxically, we won’t really understand COPD biology and natural history without accounting for its subtypes. So we need to pull ourselves up by our bootstraps, so to speak, and iterate between subtype definition and biological discovery.
We should use supervised learning methods rather than focusing exclusively on unsupervised clustering. The ability of supervised models to help identify (or construct) important features can really help us in the first stage of data exploration. Too often, unsupervised clustering yields multiple possible solutions with only modest associations to COPD outcomes and weak justification for choosing a single, “optimal” solution. Supervised methods are much better equipped to discriminate between models and reduce data dimensionality in a way that is relevant to COPD-related outcomes, as discussed here. Using “all the data” in an unbiased manner will not, in and of itself, be sufficient.
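A toy example may help show why “all the data” is not enough on its own. The sketch below uses synthetic data only: the feature names (`nuisance`, `signal`, `noise`), the helper functions, and all parameter values are assumptions chosen for illustration. One feature has large variance but no relation to the outcome, and another has small variance but drives the outcome. A plain two-cluster k-means (written out by hand here, with a deterministic start) splits on the high-variance feature, while a simple supervised ranking of features by their correlation with the outcome picks out the relevant one.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def make_cohort(n=400, seed=0):
    """Synthetic subjects (not real COPD data): returns (feature rows, outcome)."""
    rng = random.Random(seed)
    rows, outcome = [], []
    for _ in range(n):
        nuisance = rng.choice([-5.0, 5.0]) + rng.gauss(0, 1)  # big variance, unrelated to outcome
        signal = rng.gauss(0, 1)                              # small variance, drives outcome
        noise = rng.gauss(0, 1)                               # irrelevant
        rows.append([nuisance, signal, noise])
        outcome.append(2.0 * signal + rng.gauss(0, 0.5))
    return rows, outcome


def two_means(rows, n_iter=20):
    """Lloyd's k-means with k=2, started from rows[0] and the row farthest from it."""
    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    centers = [list(rows[0]), list(max(rows, key=lambda r: dist2(r, rows[0])))]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [0 if dist2(r, centers[0]) <= dist2(r, centers[1]) else 1 for r in rows]
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old center if a cluster ever empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


rows, outcome = make_cohort()

# Unsupervised: cluster on all features, then check association with the outcome.
labels = two_means(rows)
cluster_r2 = pearson(labels, outcome) ** 2

# Supervised: rank each feature by how much outcome variance it explains.
feature_r2 = [pearson([r[j] for r in rows], outcome) ** 2 for j in range(3)]
best_feature = max(range(3), key=lambda j: feature_r2[j])

print(f"outcome variance explained by cluster label: {cluster_r2:.3f}")
print("outcome variance explained by each feature:", [round(v, 3) for v in feature_r2])
print("feature ranked first by the supervised screen:", best_feature)
```

The clusters here are perfectly “real” in the sense that they reflect genuine structure in the data; they are simply not the structure that matters for the outcome. Real datasets are far messier than this, but the same logic is one reason a supervised first pass can be a useful way to decide which features deserve a closer look.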
Integrating prior knowledge into machine learning studies is essential, for two reasons. First, when machine learning methods are applied to biological problems, we care not only about the accuracy but also about the interpretability of the model. “Interpretability” in this case means comparing the structure of our models to already known biology (think of gene set enrichment analysis as a crude example of this). Second, even though biology is now a data science, we still don’t have the volume of data required to learn all the necessary parameters for most models by “brute force” (see an excellent review here). The best applications of machine learning to biology typically use biological knowledge to guide parameter selection or model structure. For COPD and other chronic diseases, we need to be prepared to assist our algorithms by “hard-coding” biological knowledge into our models (as here, for example).
We need to be more specific about what we mean by subtype. Do we think there is a single set of subtypes that captures the essential differences between people with COPD? Maybe. But think about it this way – is there a single set of lung cancer subtypes that captures the important differences between subjects with lung cancer? It’s complicated. Pathology matters – there are important differences between adenocarcinoma and small cell cancers. But molecular markers also matter, especially in light of available treatments. EGFR mutation and expression patterns matter, but so do ALK, RET, and other genes. Do these markers define subtypes because they are more biologically important than other genes, or is this driven primarily by the fact that we have effective treatments targeting these genes? We want our subtypes to be biologically relevant and clinically useful, and we want machine learning to find all of this for us in an “unbiased” manner. But if we’re not careful, we end up asking machine learning methods to answer questions that really can’t be answered, at least not with the data we have provided. This is a reason for focusing on outcome-specific subtypes (rather than one-size-fits-all subtypes) as an interim measure.
Where will this ultimately end for COPD? We are still in the relatively early stages of learning how to best utilize big datasets of images and genomics. While it seems clear that supervised approaches have promise, it is also true that unbiased genomic screens have advanced our biological knowledge significantly. From genome-wide association studies, we now know of more than 80 genetic loci that contribute to COPD risk. Detailed functional analysis of individual loci has led to novel functional insights that implicate specific genes, regulatory elements, and cell types, as has been shown for genetic influences on TGFB2 expression in lung fibroblasts. These rich datasets have already produced new biological insights. But we are also now in the post-GWAS era, where the challenge is not only to discover disease-associated molecules but also to put them in context and better define their utility and applicability. For these second-order projects, of which subtyping is an example, some degree of supervision will probably lead to more fruitful applications of machine learning.