Introduction

Recently, the term “big data” has been used more and more in topics related to the analysis of huge amounts of information. The characteristics of big data, including medical data, are volume, variety, velocity, and veracity. Here, volume refers to the size of the data, variety refers to different types/sources of data, velocity refers to the speed of data generation, and veracity refers to the quality of data or data uncertainty due to factors such as noise, artifacts, and missing data. In the health care system, a variety of resources, such as randomized controlled clinical trials, wearable devices (eg, clothing and accessories incorporating sensors that measure activity or parameters such as blood pressure), video streams (eg, a video-based system for detecting fall events in elderly persons living alone at home), personal genomic services, imaging devices, and social media or Internet searches, provide data that could be useful for many applications.1 Such applications include drug and medical device safety surveillance, quality of care and performance measurement, making of diagnoses and prediction of prognosis, population management, decision support and precision medicine, and public health and research applications.2,3

Over the last decade, medical researchers have taken into account the heterogeneity of data in their work: the genetics of subjects have been studied as a function of epistasis, and family history and personal life events have been used to predict clinical evolution. Big data technology should expand this field of multivariate research and overcome the inability of existing approaches to gather, share, and use information comprehensively within the health care system.2 To make use of health care big data, research groups and organizations have designed and implemented many frameworks and methods. One of the most established frameworks is Hadoop, which supports the analysis of large data sets. This framework has been used in the implementation of various applications, such as disease prediction in patients, diagnosis of cancer, patient emergency alerts, generation of disease decision rules, medical data quality assessment, and personalized recommendation systems.4-10

In precision medicine, a patient's unique characteristics are used to tailor treatment in a manner that might be more elaborate than the standard course. For example, cardiologists currently use an algorithm that predicts, for a given patient, the occurrence of a myocardial infarction within 5 or 10 years based on body weight, arterial pressure, smoking status, blood lipid analysis results, and personal and family cardiovascular history. Precision medicine can be used in the diagnosis and prevention of diseases such as cancer, owing to advances in next-generation sequencing (NGS), liquid biopsy technology, computational biology methods, high-throughput functional screening, and analytical approaches.11

In the abovementioned domains, big data mining techniques have led to interesting results. For example, in some applications the performance of such techniques has been reported to be comparable to that of medical experts. It will be interesting to follow studies that compare the efficiency of these mining techniques with that of usual clinical management.

In this article, we briefly review data analysis methods for health care systems and examine the challenges facing the use of these data.

Computational approaches toward personalized medicine

Although the concept of personalized medicine is not new, the emergence of powerful analytical tools has recently opened new avenues to predictive, preventive, participatory, and personalized medicine, known as P4 medicine.12 The hope is to reduce cost and improve the quality of care. Personalized medicine was involved in more than 25% of novel new drugs approved by the US Food and Drug Administration (FDA) in 2015,13 which shows that personalized medicine is moving toward becoming a substantial component of treatment products.

Research groups have investigated different aspects of personalized medicine, such as diagnosis, prognosis, and pharmacogenomics, through computational approaches or through improving/revising standards and regulations. Many of these research works, such as the “Baseline Study” project by Google Inc., the Cancer Genome Atlas, and the 100 000 Genomes Project (100KGP), are focused on high-throughput genomic analysis to achieve personalized health care by developing computational methods.11,14,15 Genomic mutations can be exploited in the development of drugs that target a protein to treat disease.

By analyzing large amounts of data, Forkan et al showed that there is a trend or pattern in each individual patient's data.16 In one use case, their model was used to identify the true abnormal conditions of patients with variations in blood pressure and heart rate. Vidyasagar reviewed machine learning techniques for predicting a drug response and found that there are biomarkers, even some without biological significance, that could predict a drug response.17 Krishnan and Westhead, in a study of the application of machine learning and probabilistic approaches to the prediction of functional effects of single-nucleotide polymorphisms (SNPs), found that machine learning methods could outperform probabilistic methods.18 An integration of clinical variables such as race (white vs nonwhite), intensive care unit (ICU) type (medical vs surgical), sex, and age has been used in developing multivariate logistic regression models to estimate a personalized initial dose of heparin.19 Using these models, investigators observed statistically significant associations between sub- and supratherapeutic activated partial thromboplastin time (aPTT), the aforementioned clinical variables, heparin dose, and sequential organ failure assessment (SOFA) scores, with area under the curve (AUC; also called area under a receiver operating characteristic [ROC] curve, a two-dimensional depiction of classifier performance) of 0.78 and 0.79, respectively.
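To make the AUC figure quoted above concrete, the following minimal sketch computes it from its probabilistic definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not taken from the heparin study or from any cited work.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 class labels (1 = positive case).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up toy data, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives some context for reported values near 0.78. The pairwise loop is quadratic in the number of samples; for large data sets a rank-based computation would be preferable.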

None of the state-of-the-art big data-driven approaches has reported an accuracy (the ratio between correctly identified/classified samples and the total number of samples) of 100%. This is probably due to challenges such as missing data, the quality of data, and variations in experimental results, which are addressed in the next section.
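The definition of accuracy given above can be written out directly. This is a minimal sketch; the function name `accuracy` and the example labels are hypothetical and do not come from any of the cited studies.

```python
def accuracy(true_labels, predicted_labels):
    """Ratio of correctly classified samples to the total number of samples."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up example: 8 samples, 2 of them misclassified.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2%}")
```

Accuracy alone can be misleading when classes are imbalanced, which is one reason measures such as AUC are often reported alongside it.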

Challenges

Besides general challenges inherent to the analysis of big data—such as missing data, erroneous/imprecise data, and heterogeneous data—employing big data in health care systems imposes new challenges, including the lack of reliability and repeatability of some (but by no means all) biological data; issues of privacy, ownership (ie, determining owner(s) of data), and confidentiality; inadequate data from randomized controlled clinical trials; and low quality of data in general.1,17,18 To address the technical challenges, such as missing data and imprecise data, statistical as well as machine learning methods have been investigated.20-26 However, there is no unique solution to these problems; similar to other approaches, the efficacy of statistical and machine learning methods needs to be proven for new medical applications.
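As an illustration of the simplest kind of statistical treatment of missing data mentioned above, the sketch below replaces missing entries with the mean of the observed values. This is only one basic technique, chosen here for brevity; it is not claimed to be the method used in the cited references, and the function name `impute_mean` and the blood pressure readings are made up.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(values):
    """Return a copy of `values` with missing entries replaced by the observed mean."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]


# Made-up systolic blood pressure readings with two gaps.
systolic = [120.0, None, 135.0, 128.0, None, 117.0]
filled = impute_mean(systolic)
print(filled)
```

Mean imputation keeps the sample mean unchanged but shrinks the variance and ignores relationships between variables, so its suitability has to be checked for each medical application, in line with the caveat in the paragraph above.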

Another challenge is disparity in ethnic and socioeconomic status, which results in inequalities in health care; indeed, utilization of “omic” technologies is costly and might not be affordable for resource-poor populations. Integrating molecular pathology, epidemiology, and social sciences could be a strategy to explore health disparities linked to social environments.27 However, such future studies will influence the global health setting only if their results are reflected in political and economic decisions.

To develop disease-specific models applicable to personalizing therapeutic interventions, we need to incorporate biomarkers (indicators of normal biological processes, pathogenic processes, or pharmacological responses to therapeutic intervention12) from DNA sequencing and improve the quality of data. However, in some diseases, such as cancer, cell heterogeneity in a single tumor makes detection of low-level mutations difficult, and a chemotherapy selected on the basis of specific genetic characteristics of that patient's cancer might be impractical.28 To reveal a correlation between results of DNA studies and disease type, more samples from different cells at different locations would be required, a procedure with low feasibility.28

Another challenge is the lack of knowledge about the human system. From a big data perspective, our understanding of the functionality of each part of this system needs to be converted to computational models and then integrated with other models of the human body. Understanding the biological networks and molecular processes, and thus the treatment outcome, in neuropsychiatric disorders has been severely hampered by limited access to the brain. Major big data projects such as BRAIN (Brain Research through Advancing Innovative Neurotechnologies), HBP (Human Brain Project), and TVB (The Virtual Brain)10 have been undertaken to enable investigators to fully understand the activity and connectivity of neuronal systems. However, these projects are far from complete, and various aspects of brain functionality may remain unresolved. For instance, understanding placebo effects at the psychological level, as well as in terms of neuroimaging and neurobiological/physiological changes, is an ongoing and fascinating field of research.

Discussion and conclusion

With technological advances, different research groups and organizations are generating and using increasingly complex and diverse data sets in health care systems. However, as the human system is very complex, a comprehensive model is required in order to achieve P4 medicine. To develop such a model, new sensors, methods, platforms, and unique biomarkers for diagnosis and therapeutic outcome prediction are required.29 There is still a need for devices and sensors able to provide good-quality reports of relevant information on patient health. For instance, no thoroughly validated device for measuring cardiac output is currently available.30 To design a personalized model applicable to P4 medicine, more investment is required toward understanding the human body and relevant correlations so that it can be described with computational models. Moreover, to design an accurate model, more studies investigating the influence of parameters such as environmental factors, family history, and lifestyle on health are warranted. However, this might be particularly challenging in the fields of neurology and psychiatry.