At the core of the controversy ignited by the recent New England Journal of Medicine (NEJM) editorial comparing data scientists to parasites is the emergence of data science as a distinct discipline, and the question of how it should relate to traditional academic clinical research. (I’ve discussed the editorial here and the implications for incentives here.)
The Core Issue
To put it another way, is the activity of data analysis properly viewed as separate, and separable, from the activity of data gathering?
For most academic clinical studies today, the researcher recruits patients, collects the data, analyzes the data, and publishes the results. Raw data are rarely shared and are zealously guarded.
As data science has gained momentum and sophistication – and as the reproducibility of published science (particularly observational studies) has been increasingly scrutinized – the question becomes (as many have debated on Twitter) whether the researchers collecting the data are necessarily the ones best positioned to interpret them.
Wouldn’t science be best served, many data scientists ask, if the data collectors would deposit their information in an easily accessible repository? Data scientists could then, in the words of Cloudera co-founder Jeffrey Hammerbacher, “party on the data,” extract hidden relationships and deliver new insights.
Many academic clinical researchers worry that without a deep understanding of how data are generated, analysts might misinterpret the data; this is a key argument made by the NEJM editors.
Clinical researchers are also disinclined to part with their data because they often spend years and even decades recruiting and maintaining cohorts of patients who are analyzed and characterized. After devoting so much time not only to building these cohorts but also to exploring the right ways to study them (including understanding the subtleties and limitations of measurement techniques), these investigators are often loath to simply give away the data they’ve worked so hard to collect, and to which they feel so viscerally connected.
In theory, an appealing solution would be to recognize and reward data gathering and data analysis as distinct activities, each associated with its own methodologies, hurdles, training and reward system.
But the issue is that the research enterprise is fundamentally about insight generation, not cohort building. Clinical investigators invest their careers in building cohorts and gathering data not because they particularly enjoy these activities, but because they’re keenly interested in getting to the analysis stage at the end. From their perspective, it can seem absurd to spend years collecting the ingredients and baking a cake only to have someone else come along at the end and enjoy it.
In response, data scientists have asked, what about the patients? After all, isn’t research supposed to be conducted to improve the care of patients, not to advance careers?
This stinging criticism has struck many clinical researchers–especially many MD-investigators–as particularly hypocritical; many clinical investigators believe they’ve devoted their lives to caring for and understanding the lives of afflicted patients. To many clinical researchers, data scientists (often but not always PhDs with no access to patients) are simply looking for yet another data set to compute upon; while data scientists may say they want access in the name of patients and the name of science, many clinical investigators are skeptical and believe data scientists are simply trying to advance their own careers.
In brief: data scientists say clinical researchers are hoarding data to advance their careers; clinical researchers say data scientists are seeking access to data to advance their careers; and each group believes the other is disingenuously claiming to be motivated by an interest in the quality of science and in the care of patients.
I suspect both science and medicine would be best served by a close integration of data gathering and data analysis, and by ensuring patients truly own and control access to the use of their data.
First: while collecting and interpreting data may seem like separate activities, in practice, they are–or should be–closely related. Data are collected based on how the researcher plans to analyze them, and how you analyze data may be influenced by what you learn about how they are collected. This doesn’t mean raw data shouldn’t be made available. Rather, the idea is that science is far better served, and will progress faster, if the research model isn’t premised on the idea of throwing raw data over a wall and receiving a completed analysis back over a wall, but rather if the gathering and analysis occur as part of a rich, ongoing and dynamic conversation.
Second: we must account for the role of impassioned, integrative clinical investigators (not necessarily clinicians; consider the example of Mary-Claire King), who through their passion and determination, through their enduring commitment to and unique understanding of their patients, are able to drive the science forward in a way that atomized researchers cannot. Medicine, I would argue, recognized too late the cost of deconstruction, of optimizing process segments (Taylorization) at the expense of an integrated whole, of prioritizing a care system over a caring individual. Rather than Taylorize clinical research, it would seem preferable to bring together data collection and data analysis in a way that doesn’t lose sight of the whole picture: the patient, the science and the value of a researcher committed to and invested in connecting the two.
Finally, of course, there’s something more than a little unseemly about different groups of scientists arguing over who has access to patient data. The patient should be at the center of the modern research model, and should easily and explicitly control and manage access to his or her own data. A research enterprise centered around self-organizing patient groups might effectively compel both data collectors and data analyzers to commit to a level of engagement and data sharing or risk being excluded from the system.
The editors of the NEJM chose perhaps the worst framing (and most inflammatory language) possible to make what is actually a reasonable point: data gathering and data analysis are ideally integrated activities, and it will be better for both patients and science if these disciplines work cooperatively rather than independently. More explicit data ownership by patients could go a long way toward ensuring constructive collaboration among all stakeholders.
from THCB http://ift.tt/1PIyxsU