Crossover designs, saxophones, resonators, and musician perceptions

Journal Club: Plastic or Metal? Saxophone pad resonators

Brief summary: Fun discussion about study design and randomization as it relates to crossover study designs. The article is a study of musician perceptions of a modification to saxophones called “resonators.” This study was introduced to me by my friend Allan Walkey, a pulmonologist, health services researcher, and a pretty good sax player as well. Thanks to Gary Scavone, the senior author of the study who joined us at the end to give his insights!

Guests: Allan Walkey (AW) – https://www.bumc.bu.edu/pulmonary/about-us/people/allanwalkey/

Gary Scavone (GS) – https://www.mcgill.ca/music/gary-scavone

Article Links:

http://www.music.mcgill.ca/caml/lib/exe/fetch.php?media=publications:eveno_pads_aaa_2015.pdf

Plastic or Metal? Pad resonators blind test | Inside the saxophone

PC: I loved this study, which was presented in two separate posts online and as a full paper. I thought it would be fun to start journal club with something non-medical. I’ll explain my positive response to it. The two things I liked most about this study were:

  1. The question is incredibly clear and the practical importance is obvious. Do I need pad resonators for my saxophone and if so, what kind?
  2. The elegance of studying the question on multiple levels.

To expand on the second point, there is “objective” data (measurements of impedance, amplitude, and efficiency) and there is “subjective” data on musicians’ assessments of “brightness.” To translate it into the biomedical realm, it was like a paper where a convincing trial outcome is also supported by a strong and plausible biological mechanism. What was your main reaction?

AW: OK, first a little (OK, a lot of) background. What are resonators? When each key on a saxophone opens and closes, it changes the note. A closed key uses a leather pad on the bottom to make a seal so no air escapes and the note is strong, in tune, and doesn’t squeak. If you imagine a sax with closed keys as a metal tube that conducts sound, you can also imagine that the material on the bottom of the closed keys might matter in terms of the quality of the sound being produced. Additionally, you might expect a ‘dose response’ in which the more keys that are needed to produce a note, the greater the effect on the quality of the sound. “Resonators” are material placed over the center of the leather pads on the bottom of the keys and are designed to reflect the sound from the closed key back through the horn. More accurately, they are ‘sound reflectors’ rather than resonators. You can buy resonators in brass, nickel, silver, gold, or plastic, or use no resonator and go with all leather on the bottom of the key.

About 15 years ago I graduated medical school, started residency, and was excited to buy myself a present for completing my intern year: my first professional-grade alto sax. After some initial research I bought a Yamaha unlacquered ‘custom Z’ sax (saxophones are typically coated in lacquer to maintain a shiny finish, which can change the sound) in order to try to get the purest, most free-blowing sound unencumbered by the vanity of lacquer. So I was a little disappointed when I saw that my new sax arrived with plastic resonators on the pads, rather than ‘pure brass’. My dream of having the purest sax sound in history was foiled by this unnatural plastic underneath my keys! From that moment, I dreamed of the day that I would place metal resonators on my horn to allow it to realize its full brass potential.

Fast forward to today. After 15 years of wear and tear (on me and my sax), my pads were literally torn up, the springs rusty, and my sax was in need of an overhaul. In an overhaul, a sax is disassembled, cleaned, and oiled, and the pads and springs are removed and replaced with fresh new parts. It’s a big, expensive job that takes a sax repair person probably a week of work. But I saw this as an opportunity – while my sax was disassembled, I could (for about $150 extra) have new metal resonators placed to finally achieve my brass dreams.

Would it be worth the expense and changes from my 15 years of familiarity with my plastic resonator sax to make the change?

PC: Ahh. Now I know what a resonator is! So this was the question that brought you to the article. Two questions – 1) how did you find it? And 2) what did you think of the article?

AW: 1) I just Googled “do sax resonators make a difference” or something like that. And the first hit was this article called “Plastic or Metal?”. It was posted on the website of Syos, a company that makes custom sax mouthpieces, and I thought it was going to be an advertisement for a new line of resonators. But it wasn’t. It was a summary of a “real” research study.

2) This was a “practice-changing” study for me. It used a mixed-methods approach (similar to what we use in our own research center) of quantitative analysis and qualitative evaluation of stakeholder perceptions to get a full “picture” of the effect of different resonator materials on sax sound quality. It used randomization, blinding, and a within-subjects crossover design to minimize both random and systematic bias. It had clear figures and presentation of results. Finally, the results seemed unambiguous. In some ways, the “negative” result here reassured me that this study was not biased by commercial interests.

PC: Some interesting points there. Let’s talk about the study design, and specifically as it relates to the results in this figure.

To fill in some detail, you have four different saxophones with three outcomes. Saxophone 38 is the “control” (no resonators), sax 40 has metal resonators, and for some reason both sax 37 and sax 39 have plastic resonators. Each saxophone is then graded according to three measures by 13 musicians. It’s interesting to think about how this could be translated into a typical randomized clinical trial. The way I see this, the saxophones are like people, and the resonator status is like the treatment or exposure. The ratings from each of the 13 musicians are the study outcomes. So, if you look at it this way, it’s not quite as good a study as it could have been, because the randomization is done only with respect to the outcome measurement, but not with respect to the saxophone and the “treatment.” Do you agree, or do you see it differently?

AW: Interesting. I saw it differently. To me the musicians were the subjects (their perceptions are being measured), the 4 different saxes were the treatments or exposures (plastic, metal, or none), and the outcomes were the 3 different perceptions: brightness, evenness, and ease of play. So I viewed this as a 13-subject, randomized, blinded, crossover study in which the treatments were identical saxes with different resonator types. For these perception outcomes, the scale is 0-1, and the best sax had to get a 1 and the worst a 0. They repeated the measurements for each player. In the full manuscript there are lots of additional results, including intra- and inter-individual concordance ratings for each criterion, results according to sax experience, as well as a repeated-measures ANOVA for the main results. There are some interesting results hidden there as well.

PC: Got it. My point is that technically they are not “identical” saxes. In this study, the specific saxophone and the exposure (resonator type) are perfectly confounded. But I think you are saying that, if one assumes that the saxes are identical, then this isn’t a problem. If that’s not a safe assumption, then I say this is a potential problem, and if it would be easy to swap out resonators (is it?), it would have been better to randomly assign the resonators to different sax/musician pairings. So, if one were to distill this down to a single concept, it is that the purpose of randomization is to address confounding that cannot be handled in other ways, but that proper randomization must be done with respect to your exposure. In other words, the randomization must intervene between your exposure and your study subjects, which is why it is important to decide whether we think the study subjects are the saxophones or the musicians here. Capiche?

I drew up my conception of three ways of viewing the study (shown below). The first one is my idea of your view; the second and third are alternate ways of assigning the labels of study subject and treatment/exposure. The interesting thing about the third one is that, in this study, the people are essentially viewed as the equivalent of laboratory equipment – they are the measurement device that gauges the effect of the different resonator types. My contention is that the most accurate way to represent the study is option 3, but that in reality none of these is the optimal study design (which I would be happy to expound on). I would say that if one is willing to assume that the saxophones are truly exchangeable, then options 1 and 2 are both defensible. The problem with 3 is that the intervention is not located in its proper place between the study subjects and the exposure of interest. Thoughts?

AW: Right, an individual sax = resonator type. So there is potential confounding by sax. In the investigators’ defense, they use the same exact model, with consecutive serial numbers, and then test them for acoustic qualities at baseline, and there seems to be no objective difference in sound before they switch out the resonators. But there’s still a risk of confounding by intrinsic sonic differences between the saxes. It’s not easy to switch resonators. One way to further reduce the risk of confounding by sax would be to have 2 saxes for each resonator exposure, because it would be unlikely that random variation in sax sound would correlate between 2 saxes. They do this only for the plastic resonators, but the results are reassuring since they are similar for the 2 plastic-resonator saxes. I’m having a little trouble with your figure. It seems strange to me to say the study subjects are the resonators. It’s like saying the study subjects are the drugs under study. Yes, the resonator is the subject of the study, but the subjects that are subjected to the treatment (resonator) here are the musicians. The sax here is just a delivery device for the treatment (resonator).

Let’s make an analogy: you want to know if inhaled albuterol (plastic resonator), inhaled ipratropium (metal), or inhaled saline (no resonator) results in different perceptions of breathlessness for people with asthma (musicians). In a random order (to account for confounding by order effects), you have 13 asthmatics use the albuterol inhaler, the ipratropium inhaler, the saline inhaler, and another albuterol inhaler, and they rate breathlessness on a scale. Now you could say – wait! How do we know that the inhalers themselves don’t work very differently to deliver the medication? Which is a good point, but we can be reassured by prior testing of the inhalers and the fact that the same factory made them all, etc.

So our causal diagram to me, is as follows:

If we assume that confounding by musician differences is taken away by the within-subject design (i.e., exposure of all musicians to all resonators and saxes), that differences between saxes is negligible (like differences between the same model inhaler), and order effects are taken away by randomized order, then there should be negligible confounding to this study.
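To make the “randomized order” part concrete, here is a minimal sketch in Python of how one could assign each musician a random playing order. This is illustrative only – the sax labels match the study, but the code is not the investigators’ actual procedure:

```python
import random

# Illustrative sketch (not the study's actual code): give each of the
# 13 musicians an independently shuffled order in which to play the
# 4 saxophones, so order effects are not tied to resonator type.
SAXES = {"sax 37": "plastic", "sax 38": "none",
         "sax 39": "plastic", "sax 40": "metal"}

def randomize_orders(n_musicians, seed=2015):
    rng = random.Random(seed)  # fixed seed for a reproducible example
    return [rng.sample(list(SAXES), k=len(SAXES)) for _ in range(n_musicians)]

for i, order in enumerate(randomize_orders(13), start=1):
    print(f"musician {i:2d}: " + " -> ".join(order))
```

A fancier design might further balance the orders (e.g., a Latin square) so that each sax appears equally often in each position; the simple shuffle above only guarantees that order is random, not balanced.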

Thoughts?

PC: I like it. I actually completely agree after hearing you explain it that way. The key for me was to focus on the musician as the subject (natural), which then highlights the crossover nature of the trial (everybody experiences every resonator) and also highlights that the randomization relates to the order in which they experience the resonators, to account for that potential bias, as your diagram illustrates. So, to return to the original diagram, #1 is the right way to think of the study, and saxophone is theoretically a perfect confounder, but if one is willing to assume it isn’t, then it is a good study! One could also imagine a way in which the relationship between sax and resonator could be randomized so as to avoid that bias, but perhaps that is not a matter of practical importance. On that note, let’s hear some thoughts from Gary Scavone, the senior author of this study.

GS: I helped organize and supervise this study, so I can add a few perspectives on the design: 

The saxophones were loaned for free by Yamaha for the study. As mentioned in the paper, they were consecutive serial numbers off the manufacturing line of a well respected company and their tolerances seem to be quite low, so they are about as similar to one another as could be expected. That said, when I was looking to buy a new Selmer saxophone back in 1989, I tried 6 different horns in a shop in Bordeaux and finally settled on one that “seemed” the best match for my priorities. It is hard to compare horns quickly, as you have to set one instrument down, take the mouthpiece off, put it on the next instrument, and so on. I remember thinking that there were some slight differences between those saxophones but nothing super significant. For sure, metal saxophones made on manufacturing lines are significantly more similar to each other compared to instruments made from wood and by hand (such as violins), due to natural variations in the materials. When we received the 4 Yamaha saxophones, I tried them all and could not detect any particular differences between them. My feeling was that either Yamaha has better tolerances than Selmer and/or the manufacturing process has become more consistent compared to the 1980s.

More importantly, in my opinion, is that all 4 saxophones came with plastic resonators. Thus, we had to send 2 of them to a shop to have the pads changed. That was the biggest weakness of the study because we cannot say for sure that extra variations between the horns were not introduced during that process. We even had an issue with leaks after the new pads were installed and had to take one of the horns to another shop to get it adjusted. I don’t know how this limitation could be minimized, but it was an issue.

That all said, when I played the 4 saxophones myself, the horn without resonators was very noticeably “darker” in tone quality, while the other 3 were relatively similar to each other. I was not one of the subjects in the study but I was not surprised by the results … this was one of the clearest perceptual studies with musical instruments that I have supervised. We have done many other studies with violins and it has normally been hard to get consensus on the results due to large variations in preferences.

PC: Thanks Gary. One of the things that strikes me is that, as much as we try to put all of the relevant information into our scientific papers, I rarely speak to the authors of a study directly without getting some useful insight into matters of practical importance. It’s interesting to hear your point that, on the scale of things that you have studied in this way, the effect of the resonators was large and pretty clear. With respect to saxophone as a potential confounder, it sounds to me like it’s possible but not particularly likely based on your and Allan’s previous experiences and perceptions of variation between saxophones. During the course of our discussion we touched on issues related to the design and analysis of crossover studies, and I’d like to return to this in a future post if we can get some of my statistician friends to join, but I think this is a good place to end our current discussion. Again, really appreciate your time and input here Gary, and congratulations on the work you and your group did on this very interesting and well done study.

So Allan, let’s end with the real world implications of this work on an end user. This all started with your question, “Would it be worth the expense and changes from my 15 years of familiarity with my plastic resonator sax to make the change?” So what did you end up doing based on reading this study?

AW: I discussed putting on metal resonators with my sax repair guy. His opinion was also that the material of the resonator did not seem to be very important, but he thought size might matter (for resonators). He sent me links to many different choices in different textures and finishes. Depending on the type of resonator, a switch to metal would add between $75 and $150 to the cost of the overhaul. Because I was unsure if or how new resonators would change my sound, I elected to stick with my old plastic ones.

PC: And there it is. Science in action. Enjoy your overhauled sax Allan!

Predicting COPD Progression: Impact of regression to the mean for noisy variables

One of the holy grails of COPD research is being able to identify “rapid progressors”, i.e. individuals who are destined for rapid loss of lung function. One of the challenges in doing this in COPD is that the rate of disease progression is usually slow, so that it (probably) takes years of observation to distinguish a “rapid progressor” from a “nonprogressor.”

In the COPDGene Study, our spirometric datapoints come at 5-year follow-up intervals. After the COPDGene 5-year study visit, there was great interest in being able to identify “rapid progressors” who could be studied intensively at the 10-year visit. At first glance, it might seem natural to look for rapid progressors in the set of individuals who had the greatest lung function decline between visit 1 and visit 2; however, as I show below, this approach can backfire pretty badly, and it’s a nice illustration of both regression to the mean and the practical importance of measurement error. The three COPDGene study visits are shown below. When we built the prediction model we were at the 5-year time point, using the visit 1 and visit 2 data to try to predict behavior at visit 3. Now, we have roughly half of the visit 3 data available to us, so we can learn some things about how successful we were at prognosticating from the 5-year visit.

When we tried out different ways to identify rapid progressors, we compared two approaches: 1) defining rapid progressors based on the observed change in FEV1 from Visit 1 to Visit 2 (observed ΔFEV1, V1→V2), or 2) defining rapid progressors based on the predicted change in FEV1 (predicted ΔFEV1, V1→V2) using the output of a random forests model built from a standard set of clinical and radiographic predictors. The model’s fit to the observed ΔFEV1 (V1→V2) was pretty good. The scatterplot is shown below, and the model explained 87% of the variance of the observed ΔFEV1 (V1→V2).

The real question, though, is how well the model predicted future change in lung function, i.e. the FEV1 change from V2 to V3 (observed ΔFEV1, V2→V3). Here is the correlation of the observed ΔFEV1 (V2→V3) with the predicted ΔFEV1 (V2→V3), obtained by entering V2 values of the predictor variables into the same prediction model described above. This model explains 10% of the variance of the observed ΔFEV1 (V2→V3).

Not great. But, interestingly, so much better than using the observed ΔFEV1 (V1→V2) to predict the observed ΔFEV1 (V2→V3). Take a look at the correlation between those two below. It’s negative! In other words, if you want to find rapid V2→V3 progressors, you’d be better off selecting SLOW V1→V2 progressors than you would be selecting rapid V1→V2 progressors.

Sometimes it’s easier to see this when subjects are broken into groups, so we identified groups of 500 fast and slow progressors (stratified by gender to account for differences in lung size) according to their observed ΔFEV1 (V1→V2) and their predicted ΔFEV1 (V1→V2), and then we compared the prospective 5-year progression of these groups (observed ΔFEV1, V2→V3). These results are shown side-by-side below. It’s apparent that for the groups defined by observed change you actually get the opposite of what you want: the slow group progresses more rapidly over the next five years. But if you rely on the output from the prediction model, you get the desired behavior. A t-test comparing the V2→V3 FEV1 change between rapid and slow groups is highly significant.

This is simple regression to the mean, but it can still be counterintuitive. We had to look at the data a bit to really absorb the magnitude of this effect. I think one of the learning points here is that when FEV1 is observed only twice over a relatively short time period, the delta FEV1 variable can be really noisy. When cutoffs are applied to a variable with a low signal-to-noise ratio, the selected subjects at the extremes of the distribution also happen to be the ones with the largest measurement errors. Those subjects are not actually the ones with the greatest rate of “true progression”; they are just the ones with the largest measurement errors that, by virtue of the thresholding, tend to all be aligned in the same direction. If the measurement errors at visit 2 and visit 3 are independent, the measured FEV1s of those subjects will move in the opposite direction at the next time point, since those measurement errors are evenly distributed around a mean of zero.
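The mechanism is easy to reproduce in a toy simulation (illustrative numbers, not COPDGene data): every simulated subject declines at exactly the same true rate, and the only differences between subjects are independent measurement errors at each visit.

```python
import random
import statistics

# Toy simulation of regression to the mean (illustrative numbers, not
# COPDGene data). Every subject has the same true FEV1 decline per
# interval; subjects differ only in independent measurement errors.
def simulate(n=5000, true_decline=-60.0, noise_sd=150.0, seed=42):
    rng = random.Random(seed)
    deltas = []
    for _ in range(n):
        v1 = 2500.0 + rng.gauss(0, noise_sd)
        v2 = 2500.0 + true_decline + rng.gauss(0, noise_sd)
        v3 = 2500.0 + 2 * true_decline + rng.gauss(0, noise_sd)
        deltas.append((v2 - v1, v3 - v2))  # (observed V1-V2, observed V2-V3)
    return deltas

deltas = simulate()
deltas.sort(key=lambda d: d[0])  # most negative observed V1-V2 change first
rapid, slow = deltas[:500], deltas[-500:]
mean_rapid = statistics.mean(d23 for _, d23 in rapid)
mean_slow = statistics.mean(d23 for _, d23 in slow)
# Selecting on observed V1-V2 change mostly selects extreme measurement
# errors at visit 2, which rebound at visit 3.
print(f"'rapid' group mean V2-V3 change: {mean_rapid:+.0f} mL")
print(f"'slow' group mean V2-V3 change:  {mean_slow:+.0f} mL")
```

Even though every subject has an identical true rate of decline, the group selected as “rapid” by observed V1-V2 change reliably shows a less negative (rebounding) V2-V3 change than the “slow” group – the entire effect is measurement error aligned by thresholding.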

Interpretability in machine learning for genomics

“All science is either physics or stamp collecting.” The implication of this famous quote, attributed to Ernest Rutherford, is that physics, with its mathematical quantification of natural laws, is superior to disciplines like biology that accumulate observations without synthesizing them through mathematical description. From this math-centric perspective, the data revolution in biology is a step up the purity ladder towards physics and away from the social sciences. Despite the odiousness of this reductive view, the biology-physics analogy is enlightening to the extent that it prompts critical thinking about where biology is headed and what the phrase “biology is a data science” might imply. Can the proliferation of new biological data be leveraged to discover the fundamental rules that govern biological systems? Or will we simply end up with billions of stamps and expensive storage bills?

https://imgs.xkcd.com/comics/purity.png

Now that biologists can generate millions of data points from a single sample, it is often assumed that fundamental insights will follow. If one takes the beginning of the Human Genome Project (1990) as the start of the genomics era, we are now 30 years in and still at the stage of large-scale stamp accumulation. The sheer size of a new data set remains one of the major selling points for new papers, and significant time and resources continue to be devoted to the generation of new biological data. Whereas one might have wished for early quantitative breakthroughs from all of this new data, the most obvious short-term effect of genomic technology has been to turn biologists from artisanal into industrial stamp collectors. The fact that the word “landscape” is a staple of high profile genomics papers reflects the degree to which data generation continues to be a primary focus of genomics research.

So, after describing all these landscapes, what comes next? Will genomics be the Bob Ross of science, or can we hope that the descriptions of all these landscapes are building towards some kind of quantitative synthesis? Like Einstein in the patent office, is biology’s next quantitative genius currently designing recommender systems by day and dreaming about fundamental biological principles in her spare time? While it would be great for a new biological synthesis to emerge unexpectedly, most people think that the breakthrough insights in biology will emerge in a more systematic manner through the application of algorithms to big biological datasets. This is where machine learning (ML) comes in.

It is important to avoid a naive faith that ML will magically extract truth from big data. I encounter this ML-centric magical thinking with depressing regularity in biomedicine, and I can personally attest to the many useless ways that ML can be applied to biological data. Too often, the goals of ML-based biology projects are poorly defined, with ML playing a deus ex machina kind of role. These projects don’t adequately account for biological complexity and the limitations of biological data. It’s much easier to use algorithms to identify cats in pictures than to identify the molecular drivers of colon cancer, never mind more detailed questions like why the incidence of early colon cancer is increasing in high income countries. As expert ML practitioners know, domain knowledge is usually an essential ingredient for successful projects.

A compelling roadmap for ML in biology can be found in Yu and Ideker’s “Visible Machine Learning,” which makes use of the Visible V8, a toy engine, as its organizing principle. Just as mechanics need to understand how an engine works in order to repair it, so biologists need to understand how organisms work in order to understand biology and cure diseases. The Visible V8 approximates a true engine closely enough that important lessons about real engines can be learned by studying it. Thus the authors propose that when ML is applied to biological data, the emphasis should be on developing visible (i.e. interpretable) ML models, because the biological insights will come primarily from the structure of the models themselves rather than the model outputs.

Yu et al., Cell 2018

This is a great way of thinking about how to use ML with biological data. In this paradigm, biology, in its fully realized quantitative form, will be more like engineering than physics as a discipline. Like engineers, most biologists are interested in learning about how a specific system works. For example, since my research focus is human lung disease, I am relatively (but not entirely) uninterested in understanding lung disease in hamsters or horses. Biologists study contingent systems with specific evolutionary histories, and therefore biological truths are more context-dependent than physics truths. Much of biological research is motivated by understanding how specific organisms solve specific problems within the constraints of physical laws. In biology, context almost always matters, a fact that informs the main arguments of Yu et al. in favor of visible machine learning in biology.

Data Heterogeneity

Yu et al. point to data heterogeneity as a major challenge for applying ML algorithms to biological datasets. In their words, “biological systems are almost certainly more complex than those addressed by machine learning in other areas.” Imagine that you are given a dataset with one million pictures of cats. A standard ML problem might be to identify the cat in each of the pictures. In contrast, a standard biology ML problem would be more akin to getting one thousand fuzzy pictures with noisy labels, and the task is to find out what kind of animals might be in the pictures.

To get more concrete, a fairly well-formulated ML problem is to take a series of gene expression datasets and infer the gene regulatory model that generated these data. Gene expression is partly governed by other genes, and the regulatory connections between genes can be expressed as a network or graph. The problem can then be formulated as one of learning which graph, from the space of all possible graphs, represents the true gene regulatory model that produced the training data. Data heterogeneity complicates this problem in two ways:

  • the true model may contain redundancies such that identical output can arise from different inputs
  • the data may arise from more than one model, or may be informative for only certain aspects of one overall model
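To see why the first kind of redundancy matters, consider a deliberately tiny, hypothetical example (not from the paper): two different regulatory graphs that are indistinguishable on a given dataset.

```python
# Hypothetical toy example: gene C could be activated by gene A
# (graph 1) or by gene B (graph 2). If A and B are always co-expressed
# in our data -- say they share an upstream activator -- the two graphs
# generate identical observations.
def c_under_graph1(a, b):
    return a  # graph 1: A -> C

def c_under_graph2(a, b):
    return b  # graph 2: B -> C

# Every condition we happened to observe has A and B co-expressed.
observed = [(0, 0), (1, 1), (1, 1), (0, 0)]
assert all(c_under_graph1(a, b) == c_under_graph2(a, b) for a, b in observed)

# Only data from a decoupling intervention (e.g., knock out A while B
# stays on) separates the two graphs:
print(c_under_graph1(0, 1), c_under_graph2(0, 1))  # the graphs disagree here
```

No amount of the observational data above can choose between the two graphs; the training data simply never visits the part of the state space where they differ.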

With respect to real gene regulatory networks, we know that these networks work differently in different cell types. Some genes, like RFX1, can activate gene expression in one state while inhibiting gene expression under different conditions. The information content of any given biological dataset is often low with respect to the complexity of the generative model; thus, even the biggest biological datasets aren’t really that big relative to the scope of the problem. This is one reason why, 30 years into the genomics era, we are still mapping biological landscapes and we haven’t yet begun to exhaust the space of possible biological states in need of characterization.

Visible, Interpretable Biological Models

So what do Yu et al. mean by “visible” biological models? They don’t give a precise definition, but they state that visible models incorporate “prior knowledge of biological structure.” Interestingly, this presupposes that biological knowledge is encoded and accessible to algorithms, which is its own challenge. But if one assumes that appropriate encodings are available, visible algorithms are a tool for synthesizing prior biological knowledge and novel data. The defining feature of visible models is that their “internal states” can be accessed for further study. Here, visible essentially means interpretable, and the authors make the strong claim that interpretable biological models reflect causal processes in biological systems.

But how can we know that an interpretable model faithfully recapitulates causal processes? One big problem with the Visible V8 analogy is that we already know that the model is a faithful representation of the real thing, but with biological models there isn’t a frame of reference based on reliable ground truth. Our prior biological knowledge is not extensive enough to be at all comparable to the Visible V8. Yu et al. propose that algorithms should include prior information on biological structures, but this does not really ensure that the “internal states” of these models recapitulate the underlying biological reality at a meaningful level of detail. In fact, as I’m sure the authors would agree, they surely don’t. There is a chicken and egg problem here – if we really knew the biological models we wouldn’t need to do ML, but the model is so complex that we are hoping that ML + data can help us to discover it. Some might respond that “all models are wrong but some are useful,” but the visible ML argument relies upon interpretation of internal model states as if they were causal factors, so it is of course important that the states have some meaningful connection to the true model.

Rashomon sets

An important problem for the Visible Models framework is that accurate models don’t necessarily have internal states that reflect causal processes. In an excellent opinion piece on interpretable models, Cynthia Rudin points out that equivalent predictive accuracy for the same task is often achieved by several different methods, implying that there is not a single “best” model but rather a set of different but functionally equivalent models. This set of equivalent models is a Rashomon set (a reference to a famous Japanese movie about multiple perspectives on the same event), and Rudin argues that when the Rashomon set is large there is a reasonable chance that it includes an interpretable (but not necessarily causal) model.
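A tiny synthetic illustration of a Rashomon set (hypothetical code, not Rudin’s): when two features are correlated with the outcome and with each other, structurally different models achieve essentially the same accuracy while telling different “causal” stories.

```python
import random

# Synthetic data: two correlated features, both tracking the label.
rng = random.Random(0)
data = []
for _ in range(2000):
    y = rng.randint(0, 1)
    x1 = y + rng.gauss(0, 0.3)
    x2 = y + rng.gauss(0, 0.3)
    data.append((x1, x2, y))

def accuracy(predict):
    return sum(predict(x1, x2) == y for x1, x2, y in data) / len(data)

def model_x1(x1, x2):    # story 1: "x1 drives the outcome"
    return int(x1 > 0.5)

def model_x2(x1, x2):    # story 2: "x2 drives the outcome"
    return int(x2 > 0.5)

def model_both(x1, x2):  # story 3: "both features matter equally"
    return int(x1 + x2 > 1.0)

for name, m in [("x1 only", model_x1), ("x2 only", model_x2), ("both", model_both)]:
    print(f"{name}: accuracy {accuracy(m):.3f}")
# All three models are accurate; accuracy alone cannot tell us which
# (if any) reflects the true mechanism.
```

All three classifiers sit in the same Rashomon set for this task, which is exactly why equivalent predictive performance is weak evidence that any one model’s internal structure is causal.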

Interestingly, Rudin briefly entertains the causal argument – “Why is it that accurate interpretable models could possibly exist in so many different domains? Is it really possible that many aspects of nature have simple truths that are waiting to be discovered by ML?” – but then she opts for a more conservative argument for interpretable models based on Rashomon sets. The fact that the argument for visible models in biology depends on the claim that these models are approximately accurate reflections of nature is problematic. I personally think it is reasonable, but the onus is on the biology ML community to demonstrate that it is possible to generate models in which studying their internal states produces meaningful biological insights. While there are multiple examples of using prior biological knowledge to guide ML, some of which we have adopted in COPD genomics, I would say that the effectiveness of this approach has not yet been conclusively demonstrated.

Encoding (and updating) biological knowledge

Current ML models don’t closely approximate biological systems, except in very specialized experimental situations. Humans, the systems that we care most about, are incredibly complex multi-cellular, multi-tissue systems with long life spans, such that even the most intensive genomic data collection would only capture a small fraction of our biological states. Large-scale landscape efforts like ENCODE, Roadmap Epigenomics, and FANTOM have generated tens of thousands of datasets using hundreds of assays to capture biological states in human and murine cell types, and we still know relatively little about how these landscapes change in specific disease or cell activation states. And we haven’t really begun to examine the connectivity patterns between cell types in tissues from a genomic perspective. The bottom line is that, in the short and medium term, we will be in a data-poor situation with respect to the complexity of the true biological model. Accordingly, the visible model approach leans heavily on the incorporation of biological knowledge to constrain the model space, presumably ensuring that the resulting models will be representative of “true biology.”

This solution runs the risk of being trivial if, as is often done, results are validated by referencing back to known biology in standard and potentially circular ways. If we are going to use biological knowledge as a constraint, we should focus on examining the internal model states that are initialized by prior knowledge before and after training to determine how they have changed, and we need ways to determine whether these changes are good or bad. Yu et al. propose experimental validation as a means to verify biological predictions arising from the examination of internal model states, which is reasonable but resource intensive. Another alternative is to demonstrate that biologically constrained models objectively improve prediction accuracy relative to unconstrained models. Before you object and state that unconstrained models will always outperform constrained or interpretable ones, consider Rudin’s warning against precisely this assumption. To quote her directly, “There is a widespread belief that more complex models are more accurate, meaning that a complicated black box is necessary for top predictive performance. However, this is often not true, particularly when the data are structured, with a good representation in terms of naturally meaningful features.” After all, the ideal amount of complexity for a model is the amount required to capture the true model; additional complexity just invites overfitting.

I agree with Rudin, because we have shown this to be true in our own domain. By incorporating information related to gene splicing into a previously published gene expression predictive model for smoking behavior, we significantly improved prediction accuracy in test data.

The best performing models were deep neural nets that included an isoform-mapping-layer (IML) manually encoded to represent known relationships between exons and transcript isoforms. Interestingly, we know that this information is only partially accurate, but it still provides enough information to clearly boost accuracy. With additional modifications, it should be possible to update the IML with strong patterns of correlated expression between exons that indicate as yet unidentified transcript isoforms. In this scenario, ML is part of a virtuous cycle in which prior biological knowledge in coded structures guides ML discovery, and ML discovery updates these structures. These structures then become continuously evolving repositories of biological knowledge. I would argue that it is this cycle and these structures that are the desired endpoint of biology as a data science – not static formulas like E = mc², but a continuously evolving encyclopedia of algorithms and data structures.
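To make the idea concrete, here is a minimal numpy sketch of what an isoform-mapping layer could look like: a linear layer whose weight matrix is masked by a binary exon-to-isoform annotation matrix, so that only biologically annotated connections can carry signal. The mapping matrix and all names here are illustrative toys, not the actual architecture from our paper.

```python
import numpy as np

# Hypothetical exon-to-isoform annotation: map[i, j] = 1 if exon j is
# annotated as part of isoform i. In practice this would come from a
# transcript database; this is a toy example.
exon_isoform_map = np.array([
    [1, 1, 0, 0],   # isoform A uses exons 0 and 1
    [1, 0, 1, 1],   # isoform B uses exons 0, 2, and 3
])

rng = np.random.default_rng(0)
weights = rng.normal(size=exon_isoform_map.shape)  # learnable weights

def isoform_mapping_layer(exon_expression):
    """Project exon-level expression onto isoform-level units, allowing
    only the connections permitted by the prior annotation."""
    masked = weights * exon_isoform_map  # zero out unannotated links
    return masked @ exon_expression

x = rng.normal(size=4)        # exon-level expression for one sample
h = isoform_mapping_layer(x)  # isoform-level hidden representation
```

Because the mask zeroes the unannotated connections, perturbing exon 2 can change isoform B’s unit but never isoform A’s, which is exactly the kind of internal state you can interrogate after training.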

Deep neural net architecture including isoform-mapping-layer

Final thoughts

So, how should we go from describing genomic landscapes to understanding the rules that shape these landscapes? I agree with Yu et al. in their overall endorsement of interpretable algorithms that integrate prior knowledge with new data. At present the methods to interrogate internal states are vague, and further development in that area is needed. As Rudin states, interpretability is domain specific, and while Yu et al. have begun to define what interpretability means in the biological domain, this needs to be defined more precisely so that standardized comparisons between algorithms can be more readily made. Finally, the biggest challenge is to prove that the internal states of ML models mean anything at all. For many biological problems, the size of the Rashomon set is large. The use of prior biological knowledge to constrain ML models is a natural solution to this problem, but this establishes a new problem of encoding a wealth of biological facts into data structures that can interact with new data via algorithms. Biologists should and will continue to characterize biological landscapes, but we also need an expandable canvas of algorithms and data structures that can link these disparate landscapes into interpretable models of biological systems.

#2: Reproducibility: Disease axes > clusters, a 10 cohort validation study

In a 2017 collaboration among multiple American and European COPD research groups, we assessed the reproducibility of multiple clustering approaches and PCA in ten independent cohorts. This project remains one of my favorites, because the result was surprising and important. The published paper is fairly dense and complex, so the point here is to distill the essentials from that work. The primary goal of the project was to discover clusters that were replicable across all of our participating cohorts. For the clustering methods, we used the approaches described by Horvath and Langfelder, which are extensions of k-medoids and hierarchical clustering.

Since the choice of clustering parameters is often critical for the final results, we systematically varied k, minimum cluster size, and other more complex parameters for pruning the resulting hierarchical trees. This resulted in 23 clustering solutions within each of the 10 cohorts. The next challenge was how to compare these clustering results across studies. To do this, we built supervised prediction models in each dataset to predict the cluster membership for each of its 23 clustering results. These predictive models were then shared between the groups to “transfer” clusters from one cohort to another. This allowed for all of the 230 (23 solutions x 10 cohorts) clustering solutions to be compared within each cohort. The schematic of this workflow is shown below.
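The transfer step can be sketched in a few lines: cluster locally, fit a supervised model to the local cluster labels, then share the model (rather than the raw data) so another group can apply it to their cohort. The cohort stand-ins and the choice of classifier below are illustrative assumptions, not the exact models used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Stand-ins for two cohorts measured on the same variables
copdgene = rng.normal(size=(500, 6))
eclipse = rng.normal(size=(400, 6))

# 1. Each group clusters its own cohort (one of the 23 solutions)
local_labels = KMeans(n_clusters=4, n_init=10,
                      random_state=0).fit_predict(copdgene)

# 2. A supervised model learns to predict local cluster membership
transfer_model = RandomForestClassifier(random_state=0).fit(
    copdgene, local_labels)

# 3. The fitted model is shared, and the other group applies it to
#    "transfer" the clustering onto their own subjects
transferred = transfer_model.predict(eclipse)
```

Repeating this for every solution in every cohort is what yields the full grid of 230 clustering solutions, each evaluable within any single cohort.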

So what did we find when we compared these clustering results? A disappointingly low level of reproducibility. To give a specific example of what reproducibility means here, imagine that we generated two clustering results using the exact same methods, variables, and parameter settings in the two COPD cohorts, COPDGene and ECLIPSE. In each cohort, then, we have an “original” clustering result that looks like this.

In each cohort, we then build a predictive model to predict cluster membership. We trade models between groups, and we end up with the ECLIPSE clusters in COPDGene and vice versa. So now the cluster assignments look like this.

So, using the same subjects in each cohort, we can now look at the concordance between cluster assignments. If you consider things from the point of view of just one cohort, you now have the 23 original solutions from that cohort, as well as 23 x 9 = 207 transferred solutions from the 9 other cohorts. For each original solution, you could then compare it to its 9 exact matches from the other cohorts, or you could just compare it to every single transferred clustering solution to look for the best match. As our metric of similarity we used normalized mutual information (NMI), which gives a value of 1 for a perfect match and a value of 0 for completely independent solutions (some potential limitations of NMI are mentioned here). In our analysis we did the comparison both ways, and no matter how we looked at it the results were a bit disappointing. You can see the distribution of NMI values for each of the participating cohorts here.
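With scikit-learn, comparing an original and a transferred solution is essentially a one-liner. A small example showing the key property, namely that NMI ignores arbitrary label names and only measures agreement of the partitions:

```python
from sklearn.metrics import normalized_mutual_info_score

original    = [0, 0, 1, 1, 2, 2, 2, 1]
transferred = [1, 1, 0, 0, 2, 2, 2, 0]  # same partition, labels renamed

# NMI is invariant to relabeling: identical partitions score 1
perfect = normalized_mutual_info_score(original, transferred)  # ~1.0

shuffled = [0, 1, 2, 0, 1, 2, 0, 1]
poor = normalized_mutual_info_score(original, shuffled)  # well below 1
```

This label invariance matters here because cluster numbers assigned in one cohort have no intrinsic correspondence to cluster numbers assigned in another.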

To sum this up:

  • Median NMI is almost always below 0.5. Not great agreement.
  • We divided clustering methods into groups, and the red group always has higher NMI values (groups described below).
  • Most cohorts have a handful of very reproducible solutions. But when we compared the high NMI solutions across all cohorts they were inconsistent (i.e. different number of clusters, no consistent high NMI performer across all cohorts).

The blue, green, and red bars indicate the three different general classes of clustering that were used. Blue = k-medoid clusters. Green and Red = hierarchical clustering. Importantly, only the red group of clustering solutions had the option of defining subjects as “unclusterable.” Thus, only the red methods have the option of saying, “I refuse to cluster some proportion of the subjects in this dataset.” For the details of these clustering methods you can refer to the brief paper by Langfelder and follow the citation trail as far as you want.

So if the clustering reproducibility isn’t great, the first three explanations that come to mind are as follows:

  • the algorithms stink (probably not the case since these have been shown to work with other kinds of data)
  • the cohorts are just very different from each other
  • the cohorts are similar, but there aren’t any clusters

The correct answer seems to be the third one – the cohorts are actually fairly similar, but the process of defining clusters is what is not reproducible. The observation that supports this is that, when we calculated principal components from the same data used for clustering, the principal components were extremely similar. Here are the variable loadings for the first four PCs in each of the participating cohorts.

So, bad clusters, good PCs. The most natural conclusion is that the cohorts are in fact very similar, and more specifically the covariance structure of these variables is very consistent across cohorts. But the clustering partitions in these cohorts are not reproducible, probably because there is no “good” place to draw dividing lines in these data. The take home message is that we should probably focus first on PC-like continuous dimensions of variability (disease axes) rather than jumping immediately to clustering algorithms.
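The cross-cohort PC comparison amounts to computing loadings separately in each cohort and checking how similar the loading vectors are. A sketch with two simulated cohorts drawn from one shared covariance structure (the covariance matrix and cohort sizes are made up for illustration; the absolute value handles the fact that a PC’s sign is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Two cohorts drawn from the same underlying covariance structure
cov = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
cohort_a = rng.multivariate_normal(np.zeros(3), cov, size=1000)
cohort_b = rng.multivariate_normal(np.zeros(3), cov, size=1000)

load_a = PCA(n_components=2).fit(cohort_a).components_
load_b = PCA(n_components=2).fit(cohort_b).components_

# Compare PC1 loadings; take the absolute dot product because the
# sign of a principal component is arbitrary
pc1_similarity = abs(np.dot(load_a[0], load_b[0]))
```

When the covariance structure is shared, this similarity sits near 1 even though any clustering of the same two samples can diverge badly.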

#1: The COPD continuum and the downsides of clustering

Most people believe that COPD isn’t just one disease. At the moment, COPD is an umbrella term that includes many different diseases and disease processes. There have also been many papers proposing various subtypes of COPD, but there is no agreement on what the specific subtypes of COPD actually are. Because there isn’t much consensus on how to summarize COPD heterogeneity, current COPD consensus definitions incorporate COPD heterogeneity in only a limited manner.

The best attempt at synthesis of COPD subtypes is probably the paper by Pinto et al. which found that COPD studies using clustering methods were characterized by “significant heterogeneity in the selection of subjects, statistical methods used and outcomes validated” which made it impossible to compare results across studies or do a meta-analysis of COPD clustering. So, despite lots of attention and effort, progress in COPD subtyping has been slower than expected.

This post addresses two questions. First, “why can’t we agree on COPD subtypes?” Second, “How should we study COPD subtypes so as to produce more reliable results that people could agree on?” At a more general level, the issue is how to apply machine learning to make the best use of the current explosion of data available on large numbers of subjects with COPD. Large studies like COPDGene, SPIROMICS, and others have generated research grade clinical phenotyping, chest CT, and multiple kinds of genomic data that provide a whole new level of resolution into the molecular and lung structural changes that occur in COPD. It’s not unreasonable to think that these data would allow us to reclassify COPD in such a way that patterns of lung involvement combined with molecular signatures of disease processes would allow us to understand and predict the progression of different “flavors” of COPD with much greater accuracy. That is the goal, but getting there has proven more difficult than simply dropping all of these data into a machine learning black box and letting the magic happen.

Most subtyping projects assume that subtypes are distinct groups (as defined by some set of measured characteristics). This seems to make sense. After all, clinicians can easily describe patients with COPD whose clinical characteristics are so different as to suggest that they don’t really have the same disease at all, so why wouldn’t there be distinct groups? However, when we look at data from hundreds or thousands of people with COPD, it is abundantly clear that the distinct patients we can so easily recall don’t represent distinct groups but rather the ends of a continuous spectrum of disease. The image below shows why the COPD subtype concept is poorly suited to the clinical reality of COPD.

10,000 smokers from the COPDGene Study

This is the distribution of smokers with and without COPD in a three dimensional space defined by FEV1, FEV1/FVC, and the percentage of emphysema on chest CT scan. Viewing these data in these three dimensions, it is clear that there are no clusters here. If you ask a clustering algorithm to draw partitions in data like this, many algorithms will happily give you clusters despite the fact that there is very little support in the data for distinct groups. Accordingly, these results will be:

  • highly sensitive to arbitrary choices of clustering method and parameters
  • not reproducible in independent data (as investigated here)
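You can demonstrate this behavior in a few lines: hand k-means a single homogeneous Gaussian cloud, which by construction contains no subgroups, and it will cheerfully return however many clusters you ask for. A minimal illustration (the data are simulated, not COPDGene):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# One homogeneous Gaussian cloud: no true subgroups exist
cloud = rng.normal(size=(2000, 3))

cluster_sizes = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(cloud)
    # Every requested k yields k sizeable "clusters", none of which
    # reflect real structure in the data
    cluster_sizes[k] = np.bincount(labels)
```

The algorithm never reports “no clusters here”; the partitions are simply wherever the arbitrary dividing lines happened to fall.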

To emphasize this point about low separability, here is the distribution in principal component space for COPDGene subjects colored by their k-means clusters as described in our 2014 paper in Thorax.

COPD k-means subtypes

Machine learning involves exploring lots of possible models rather than just a few. So if you’re going to be looking at lots of models, how will you choose a final model in a principled way? In our 2014 clustering paper we used a measure called normalized mutual information that measures the similarity between clustering results performed in the entire dataset compared to smaller subsamples of the data. The figure below shows the distribution of NMI values across a range of feature sets and numbers of clusters (k).

The higher the NMI, the more reproducible the clustering result within our dataset. The variable set with the best internal reproducibility across the widest range of k was the simplest one that consisted of only four variables, but it is also clear that there is no single solution that stands out above the rest. If you want an NMI > 0.9, you have at least six clustering solutions to choose from. And they’re all quite different! The figure below shows how the clustering assignments shift as you go from 3 to 5 clusters.

COPD subtype reclassification by k
Emph Upper/Lower = emphysema apico-basal distribution
WA% = measure of airway wall thickness
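The internal-stability check described above can be sketched as: cluster the full dataset, recluster random subsamples, and compute NMI between the two solutions on the overlapping subjects. This is an illustrative version of the idea, not the paper’s exact protocol; the data, subsample fraction, and number of repeats are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(4)
data = rng.normal(size=(1000, 4))  # stand-in for a four-variable set

k = 4
full_labels = KMeans(n_clusters=k, n_init=10,
                     random_state=0).fit_predict(data)

nmis = []
for seed in range(10):
    idx = rng.choice(len(data), size=800, replace=False)  # 80% subsample
    sub_labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(data[idx])
    # Compare the subsample clustering to the full-data clustering,
    # restricted to the subsampled subjects
    nmis.append(normalized_mutual_info_score(full_labels[idx], sub_labels))

stability = float(np.median(nmis))  # higher = more internally reproducible
```

Repeating this across candidate feature sets and values of k gives the kind of NMI distribution summarized in the figure.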

Unfortunately, the choice between 3, 4, or 5 clusters is fairly arbitrary since their NMI values are very similar. So if this paper had been a huge success and transformed COPD care such that every person with COPD was assigned a Castaldi subtype, there would be a lot of people whose COPD subtype depended on the very arbitrary choice between a final model with 3, 4, or 5 clusters. For example, there are 1,347 people (32%) who belong to the airway-predominant COPD cluster (high airway wall thickness, low emphysema) in the k=3 solution, but only 778 (19%) in the k=5 solution. So what happened to those other 569 people? They’re bouncing around between subtypes based on fairly arbitrary analysis choices. We certainly don’t want COPD clinical subtypes to be determined by this sort of arbitrary data analysis choice.

So why do I keep referring to COPD clusters as poorly reproducible despite the fact that the NMI values for our 2014 paper were very high? For three reasons. First, reproducibility in subsamples of a single dataset is a lower bar than reproducibility across comparable independent datasets. Second, common sense suggests that algorithmically drawn separations in manifolds are likely to be driven by subtle differences that are dataset specific. Third, a systematic study of cluster reproducibility involving ten COPD cohorts from the US and Europe found only modest reproducibility of COPD clustering results.

So, is clustering useless? No, clustering in COPD is great for data exploration, which can be very enlightening and has demonstrably led to better understanding of COPD heterogeneity and novel molecular discoveries. But for the goal of producing reliable and rational clinical classification, traditional clustering methods are not well-equipped for summarizing COPD heterogeneity in a way that is likely to be highly reproducible. Other approaches are needed for that.