Big data is a hot topic right now--insert rant about new names for old ideas here--and that wave is finally breaking on the clinical lab shores. So the time is right for my "Big Data and Lab" manifesto. (Or at least an observation born of decades of experience.)
Big data has two jobs: archiving and analyzing. Both involve data on computers. There, I claim, the similarity ends. While it is tempting to try to kill both of these birds with a single stone, I assert that this is a terrible idea.
Specifically, I find that in order to archive data effectively, I need a free-form, attribute rich environment which accommodates evolving lab practice without losing the old or failing to capture the new. But in order to analyze effectively, I need a rigid, highly optimized and targeted environment, preferably with only the attributes I am analyzing and ideally with those attributes expressed in an easily selected and compared way.
In other words, I find that any environment rich enough to be a long-term archive is unwieldy for analysis and any environment optimized for analysis is very bad at holding all possible attributes.
Specifically, I have seen what happens to inflexible environments when a new LIS comes in, or a second new LIS and the programmers are struggling to fit new data into an old data model which was a reworking of an older data model. It ain't pretty--or easy to process or fast to process. I have also seen what happens when people, especially people without vast institutional knowledge, try to report on a flexible format with three different kinds of data in it. They get answers, but those answers are often missing entire classes of answers. "They used to code that HOW?" is a conversation I have had far too many times.
Yes, I am aware of Mongo & co and of the rise (and stalling) of XML databases and of the many environments which claim to be able to both. I have seen them all, I have tried them all and I have not changed my views.
So I use a two pronged approach: acquire and store the data in as free-form a manner as possible--structured text, such as raw HL7 or XML are great at this--and then extract what I need for any given analysis into a tradition (usually relational) database on which standard reporting tools work well and work quickly.
The biggest clinical lab-specific issue I find is the rise of complex results, results which are actually chunks of text with a clinical impression in them. For such results, we ended up cheating: tagging the results with keywords for later reporting and asking the clinicians to create simple codes to summarize the text.
I am eager to be proved wrong, because the two pronged approach is kind of a pain in the neck. But so far, I have seen no magic bullet that stands the test of time: either the reportable format is too rigid to hold evolving lab data or the flexible format is too slow to actually handle years and years of data.
If you feel that you have a better idea, I would love to hear it so leave a comment.