Irene Chen

MLHC 2018 Recap

28 Aug 2018

Last week, from Aug 16 to 18, researchers flocked to the 2018 Machine Learning for Health Care (MLHC) conference, held at Stanford in Palo Alto.

Unfiltered, candid view of MLHC 2018

One unusual aspect of MLHC is its manageable size and format, especially compared to colossal conferences like NIPS and ICML. With attendance hovering around 300 people, the conference has maintained its “small community” feeling. Without much effort, I quickly met other researchers and reconnected with old friends, spotting the same people across the room multiple times. Although some people reported being unable to register because they had not signed up early enough, I appreciate the focus on controlling the population distribution and maintaining the single-track structure as the community continues to grow.

As Finale Doshi-Velez pointed out in the opening remarks, the focus of this year’s conference was useful machine learning for clinicians. In contrast to last year’s emphasis on the wide array of possibilities for machine learning (medical imaging, treating your child’s rare genetic condition, and hospital operations, to name a few), this year targeted the gap between current machine learning efforts and actual improvements in health care. With more clinicians attending and presenting, we avoided an adversarial “doctors vs. AI” framing and instead focused on common goals.

Capturing patient stories

As the kickoff keynote, Abraham Verghese from Stanford set the tone by advocating for a renewed emphasis on the patient narrative. He noted that for every hour of patient time, clinicians currently spend roughly 2 hours on the electronic health record (EHR). Although he acknowledged EHRs are necessary for billing, he harkened back to the painting The Doctor, notable for the titular subject’s intense visual focus on the sickly child. In response to a later question, Dr. Verghese clarified that he doesn’t believe clinical notes capture the entire patient story.

Despite the limitations of clinical notes, presenters at the conference unveiled research toward better understanding them, including using multi-task learning on clinical notes to predict chronic conditions, generating topic distributions of discharge notes, leveraging multi-label learning with a convolutional residual model, and integrating expert knowledge with machine-learned diagnoses.

From a slightly different angle, keynote speaker Joyce Lee from the University of Michigan motivated the need to consider the patient’s perspective through her expertise as a pediatric endocrinologist. When patients, especially younger patients, need to closely monitor their glucose levels to adapt insulin dosage, the combination of data silos, reporting bias, and outdated technology can hinder their experience; it drove at least one patient to hack together his own wearable solution to measure his own insulin. Comparing the patient record to “Microsoft Word meets Pinterest” because the long text of clinical notes can be peppered with lab tests, imaging, and other additions, Dr. Lee noted that even with the perfect algorithm, the healthcare system itself is imperfect, and that imperfection must be accounted for.

Model robustness matters more and more

As the aphorism goes, all models are wrong, but some are useful. Keynote speaker Russell Greiner from the University of Alberta devoted a significant portion of his talk to what makes a model effective. An association between a single biomarker and an outcome can be much less powerful than a predictive algorithm that draws on a wider set of features. Even the features themselves should be selected with care: subjective symptoms may vary depending on the treating clinician, whereas quantitative measurements are easier to use reliably in models. Additionally, models trained on one distribution and tested on another may suffer from covariate shift, although reweighting the data can help to some extent.
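To make the reweighting idea concrete, here is a minimal sketch of importance weighting for covariate shift on a single discrete feature. This is not from Dr. Greiner's talk: the function names, the age groups, and the per-patient errors are all made up for illustration, and real applications would need to estimate the density ratio for high-dimensional features.

```python
from collections import Counter


def importance_weights(train_groups, test_groups):
    """Estimate w(x) = p_test(x) / p_train(x) for a discrete feature x.

    Hypothetical helper: only groups seen in training get a weight.
    """
    train_counts = Counter(train_groups)
    test_counts = Counter(test_groups)
    n_train, n_test = len(train_groups), len(test_groups)
    weights = {}
    for group, count in train_counts.items():
        p_train = count / n_train
        p_test = test_counts.get(group, 0) / n_test
        weights[group] = p_test / p_train
    return weights


def reweighted_mean(values, groups, weights):
    """Weighted average of per-example values, normalized by total weight."""
    total_weight = sum(weights[g] for g in groups)
    weighted_sum = sum(weights[g] * v for v, g in zip(values, groups))
    return weighted_sum / total_weight


# Toy data (assumed): training skews young, the test population is balanced.
train_groups = ["young"] * 8 + ["old"] * 2
test_groups = ["young"] * 5 + ["old"] * 5

# Assumed per-patient model error on the training set: worse on older patients.
train_errors = [0.1] * 8 + [0.5] * 2

weights = importance_weights(train_groups, test_groups)
naive_error = sum(train_errors) / len(train_errors)
shifted_error = reweighted_mean(train_errors, train_groups, weights)

print(f"naive estimate:      {naive_error:.2f}")
print(f"reweighted estimate: {shifted_error:.2f}")
```

In this toy setup the naive average understates the error expected on the balanced test population, while the reweighted estimate accounts for older patients being underrepresented in training.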

Among the papers presented, I was heartened to see thoughtful investigations into confounding factors beyond straightforward supervised learning on medical tasks. For example, racial disparities and mistrust in doctors can lead to differing end-of-life care. Even the missingness or existence of a measurement in EHR data can give additional signal, which can be used to reconstruct monitor readings for EHR patients. On an orthogonal note, if researchers won’t or can’t share code and data, reproducibility becomes incredibly hard even on widely used datasets.

Since widely available datasets are key to reproducibility and therefore robustness, keynote speaker Mohammed Saeed of the University of Michigan described the process of making MIMIC-II, an open, large, and extensive dataset of intensive care unit (ICU) stays and one of the most impactful contributions to the field of machine learning for healthcare. Most notably, Dr. Saeed credited the dataset’s proliferation to the ability to de-identify it and to the community of researchers who take extensive privacy training before receiving access. Sadly, Dr. Saeed also shared that since MIMIC-II, no other hospitals have been willing to create a dataset nearly as impactful.

Every year the conference invites a “slightly different” keynote in order to expand the scope of the topics. This year, Cynthia Dwork of Harvard raised salient points about privacy and fairness, both fields to which she’s contributed immensely. Defining a method as privacy-preserving if the model learns essentially the same thing when one individual is replaced with another, she explained how differential privacy neutralizes the risk of adaptivity. In her section on fairness, she outlined several viable definitions of fairness and corresponding flaws: fairness through blindness might still discriminate based on proxy features, demographic parity can’t be reached even with a completely accurate model if the two groups have different base rates, and individual fairness requires a similarity function which can be hard to define.
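The demographic parity point is easy to check with a toy example. The sketch below is my own illustration rather than anything from Dr. Dwork's slides: the two groups, their labels, and the helper names are invented. It shows that a classifier that predicts every label correctly inherits the groups' base rates, so its positive-prediction rates differ whenever the base rates do.

```python
def positive_rate(predictions):
    """Fraction of examples that receive a positive (1) prediction."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(preds_a, preds_b):
    """Absolute difference in positive-prediction rates between two groups."""
    return abs(positive_rate(preds_a) - positive_rate(preds_b))


# Toy labels (assumed): group A has a 30% base rate, group B has 10%.
labels_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
labels_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# A completely accurate model predicts exactly the true labels.
perfect_preds_a = list(labels_a)
perfect_preds_b = list(labels_b)

gap = demographic_parity_gap(perfect_preds_a, perfect_preds_b)
print(f"parity gap of a perfect classifier: {gap:.2f}")
```

Closing the gap here would require deliberately mispredicting some patients, which is the tension the definition runs into.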

Actually useful models for patients and clinicians

When examining actually useful models, keynote speaker Danielle Belgrave introduced endotypes, a form of subtyping based on underlying pathophysiological mechanisms. In the talk most relevant to my own research, Dr. Belgrave demonstrated probabilistic programming methods for building causal models for asthma patients. Specifically, structure learning through model evidence allows for recovering causal, rather than merely associative, links from eczema to asthma and then hay fever. This causal link then allows public health officials to launch initiatives to prevent eczema and therefore asthma and hay fever.

In general, making machine learning models useful for clinicians was a strong theme in several talks. Dr. Verghese’s earlier talk included a plea to help clinicians (for example, decreasing keystrokes) to prevent further clinician burnout. Dr. Saeed also mentioned how clinicians suffer from too many false alarms, which one paper at the conference addressed through representation learning with a modified objective function on ECG waveforms.

The clinical abstracts also highlighted the challenges when actually deploying systems. Although I had to take a meeting during these presentations, skimming through the abstracts was exciting for the glimpse into real world applications.

Expanding the team beyond clinicians and machine learners

The last keynote featured Andrew Ng, a man who needs almost no introduction. It was very clever to have him go last to keep people around until the end of the conference. In contrast to the more structured slides of earlier talks, Dr. Ng’s talk featured whiteboard drawings on his iPad, projected onto the screen, along with short project presentations from his graduate students. He urged more collaboration beyond clinicians and machine learning researchers, bringing in ethicists, software engineers, and business people, which would help with actual deployments. In particular, he named trust as a huge blocker to getting people on board and suggested that efforts around interpretability, uncertainty estimates, and accuracy can help.


Other papers I enjoyed include

Blockchain came up multiple times—in audience questions and in an off-hand comment from a presenter—as a potential way to increase trust or track patient data. Each time, the presenter expressed skepticism about the usefulness of blockchain since the main concern is the control of the data itself and risk minimization.

Venn diagrams are basically the unofficial figure of the conference since they appeared in 3+ keynote talks!

Overall the conference was incredibly well run. Everything started reasonably on time. Childcare was provided. Talks were live streamed. Every paper got time to present. Tutorials were offered for clinicians and machine learning researchers. A research guidelines panel offered best practices. As a PhD student early in my career, I very much felt a sense of community.

See everyone next year at MLHC 2019 at University of Michigan!


Mike Hughes gave a great tutorial, “Machine Learning for Clinicians: Advances for Multi-Modal Health Data.”

Unfortunately, I couldn’t find other slides or videos available. If you find them, let me know and I will add them here.

Many thanks to Christina Ji for helpful comments on an earlier draft.

Tweet me @irenetrampoline if you like this post.