Research Bibliography

The following are some resources to check out for developing a full-fledged population health monitoring program. They are under review for application to the surveillance program being developed.


Part 1.  Techniques – Machine Learning


Machine learning here means using your Big Data/EMR and software tools to develop routine methods for analyzing results. Most recently, I produced a method for evaluating the top 20 ICDs (by groups I defined) for each of the ethnic groups in the EMR. These lists were combined in a final table, with the ethnic groups listed side by side, each in descending order of its ‘hot diagnoses’, followed by rank, n and percent. That SAS program took about 1000 lines of code and a minimum of 53 processes (according to the SAS information displayed during a run).
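The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the SAS program itself: the function name `top_diagnoses`, the `(ethnic_group, icd_group)` record layout, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def top_diagnoses(records, n=20):
    """Return {ethnic_group: [(rank, icd_group, count, percent), ...]}.

    records: iterable of (ethnic_group, icd_group) pairs, one per diagnosis.
    percent is the share of that ethnic group's diagnosis records.
    """
    counts = defaultdict(Counter)
    for ethnic_group, icd_group in records:
        counts[ethnic_group][icd_group] += 1

    table = {}
    for ethnic_group, counter in counts.items():
        total = sum(counter.values())
        # Sort by descending count, then by label so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        table[ethnic_group] = [
            (rank, icd, cnt, round(100.0 * cnt / total, 1))
            for rank, (icd, cnt) in enumerate(ranked, start=1)
        ]
    return table

# Made-up illustration data, not taken from any EMR.
sample = [
    ("Group A", "asthma"), ("Group A", "asthma"), ("Group A", "diabetes"),
    ("Group B", "diabetes"), ("Group B", "diabetes"),
    ("Group B", "hypertension"), ("Group B", "asthma"),
]
result = top_diagnoses(sample, n=2)
```

Each ethnic group's list can then be laid out side by side as columns of one table, which is the presentation described above.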

Machine learning uses two methods: a supervised classification process and an unsupervised classification process. Supervised classification is where you, the researcher, manually define the different groupings. For example, my ICD lists and sets are defined based upon personal impressions of ICDs that need to stand out more than they do within the predefined ICD groups defined at CMS. My first group of ICDs is 135, my second 303. These groups are to some degree subjective, and based upon clinical observations regarding priority healthcare issues.
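A researcher-defined grouping of this kind can be expressed as a simple lookup. The sketch below is illustrative only: the group labels, the ICD-9 prefixes chosen, and the function name `assign_group` are assumptions, not the 135 or 303 groups mentioned above.

```python
# Hypothetical researcher-defined groups: each label maps to ICD-9 code prefixes.
# The labels and prefix choices here are illustrative only.
CUSTOM_GROUPS = {
    "diabetes": ("250",),
    "asthma": ("493",),
    "hypertension": ("401", "402"),
}

def assign_group(icd_code, groups=CUSTOM_GROUPS, default="other"):
    """Return the first researcher-defined group whose prefix matches the code."""
    code = icd_code.strip().upper()
    for label, prefixes in groups.items():
        # str.startswith accepts a tuple and matches any of its members.
        if code.startswith(prefixes):
            return label
    return default
```

Keeping the groups in a plain table like this makes the subjective choices visible and easy to revise as clinical priorities change.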

Unsupervised classification is where SAS itself analyzes the data and determines where clusters appear, with the clinical variables (parametric and non-parametric) used to define these clusters. Although these are fairly easy to produce, they are not always logical and may link one outlier ICD in Group A to the cluster in Group B, making the outcomes questionable. There are hundreds of classifications to be tested in terms of ICD and ICD comorbidity relationships. The following are a few examples of these.

Most of these are available in full text on the internet. (Sorry, no downloadable PDF copies for the moment, due to copyright concerns.)


Afzal, Z., Engelkes, M., Verhamme, K., Janssens, H. M., Sturkenboom, M. C., Kors, J. A., & Schuemie, M. J. (2013). Automatic generation of case‐detection algorithms to identify children with asthma from large electronic health record databases. Pharmacoepidemiology and Drug Safety, 22(8), 826-833. doi:10.1002/pds.3438

Afzal, Z., Schuemie, M. J., van Blijderveen, J. C., Sen, E. F., Sturkenboom, M. C., & Kors, J. A. (2013). Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records. BMC medical informatics and decision making, 13(1), 30. doi:10.1186/1472-6947-13-30

Boland, M. R., Tatonetti, N. P., & Hripcsak, G. (2014). CAESAR: a Classification Approach for Extracting Severity Automatically from Electronic Health Records.

Boxwala, A. A., Kim, J., Grillo, J. M., & Ohno-Machado, L. (2011). Using statistical and machine learning to help institutions detect suspicious access to electronic health records. Journal of the American Medical Informatics Association, 18(4), 498-505. doi:10.1136/amiajnl-2011-000217

Caballero Barajas, K. L., & Akella, R. (2015, August). Dynamically Modeling Patient’s Health State from Electronic Medical Records: A Time Series Approach. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 69-78). ACM. doi:10.1145/2783258.2783289

Dua, S., Acharya, U. R., & Dua, P. (2014). Machine learning in healthcare informatics. Springer Berlin Heidelberg.  [Fuzzy logic, supervised and unsupervised classifications, rule learning, black box, predictions, longitudinal data, fraud, imagery.]

FitzHenry, F., Murff, H. J., Matheny, M. E., Gentry, N., Fielstein, E. M., Brown, S. H., … & Speroff, T. (2013). Exploring the Frontier of Electronic Health Record Surveillance: The Case of Post-Operative Complications. Medical Care, 51(6), 509. doi:10.1097/MLR.0b013e31828d1210

Gupta, S., Tran, T., Luo, W., Phung, D., Kennedy, R. L., Broad, A., … & Matheson, L. (2014). Machine-learning prediction of cancer survival: a retrospective study using electronic administrative records and a cancer registry. BMJ open, 4(3), e004007. doi:10.1136/bmjopen-2013-004007

Hoogendoorn, M., Moons, L. M., Numans, M. E., & Sips, R. J. (2014). Utilizing Data Mining for Predictive Modeling of Colorectal Cancer Using Electronic Medical Records. In Dominik Ślȩzak, Ah-Hwee Tan, James F. Peters, Lars Schwabe (Eds.), Brain Informatics and Health (pp. 132-141). Springer International Publishing. doi:10.1007/978-3-319-09891-3_13

Jensen, P. B., Jensen, L. J., & Brunak, S. (2012). Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6), 395-405. doi:10.1038/nrg3208

Liu, H., Bielinski, S. J., Sohn, S., Murphy, S., Wagholikar, K. B., Jonnalagadda, S. R., … Chute, C. G. (2013). An Information Extraction Framework for Cohort Identification Using Electronic Health Records. AMIA Summits on Translational Science Proceedings, 2013, 149–153.

Mo, H., Thompson, W. K., Rasmussen, L. V., Pacheco, J. A., Jiang, G., Kiefer, R., … & Lingren, T. (2015). Desiderata for computable representations of electronic health records-driven phenotype algorithms. Journal of the American Medical Informatics Association, 22(6), 1220-1230. doi:10.1093/jamia/ocv112

Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1), 89-109.

Koutsojannis, C., Nabil, E., Tsimara, M., & Hatzilygeroudis, I. (2009, November). Using machine learning techniques to improve the behaviour of a medical decision support system for prostate diseases. In Intelligent Systems Design and Applications, 2009. ISDA’09. Ninth International Conference on (pp. 341-346). IEEE. doi:10.1109/ISDA.2009.110

Kreuzthaler, M., Schulz, S., & Berghold, A. (2015). Secondary use of electronic health records for building cohort studies through top-down information extraction. Journal of biomedical informatics, 53, 188-195. doi:10.1016/j.jbi.2014.10.010 [COHORTS]

Lin C, Karlson EW, Canhao H, Miller TA, Dligach D, Chen PJ, et al. (2013) Automatic Prediction of Rheumatoid Arthritis Disease Activity from the Electronic Medical Records. PLoS ONE 8(8): e69932. doi:10.1371/journal.pone.0069932

Martin-Sanchez, F., Iakovidis, I., Nørager, S., Maojo, V., de Groen, P., Van der Lei, J., … & Baud, R. (2004). Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care. Journal of biomedical informatics, 37(1), 30-42.

Meystre, S. M., Friedlin, F. J., South, B. R., Shen, S., & Samore, M. H. (2010). Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC medical research methodology, 10(1), 70. doi:10.1186/1471-2288-10-70

Patel, V. L., Shortliffe, E. H., Stefanelli, M., Szolovits, P., Berthold, M. R., Bellazzi, R., & Abu-Hanna, A. (2009). The coming of age of artificial intelligence in medicine. Artificial intelligence in medicine, 46(1), 5-17. doi:10.1016/j.artmed.2008.07.017

Pathak, J., Bailey, K. R., Beebe, C. E., Bethard, S., Carrell, D. S., Chen, P. J., … & Huff, S. M. (2013). Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium. Journal of the American Medical Informatics Association, 20(e2), e341-e348. doi:10.1136/amiajnl-2013-001939 [MU]

Pineda, A. L., Ye, Y., Visweswaran, S., Cooper, G. F., Wagner, M. M., & Tsui, F. R. (2015). Comparison of machine learning classifiers for influenza detection from emergency department free-text reports. Journal of Biomedical Informatics, 58, 60-69. doi:10.1016/j.jbi.2015.08.019

Prather, J. C., Lobach, D. F., Goodwin, L. K., Hales, J. W., Hage, M. L., & Hammond, W. E. (1996, December). Medical data mining: knowledge discovery in a clinical data warehouse. In Proceedings: a conference of the American Medical Informatics Association/… AMIA Annual Fall Symposium. AMIA Fall Symposium (pp. 101-105).

Sada, Y., Hou, J., Richardson, P., El-Serag, H., & Davila, J. (2013). Validation of case finding algorithms for hepatocellular cancer from administrative data and electronic health records using natural language processing. Medical care, 54(2), e9–e14. doi:10.1097/MLR.0b013e3182a30373

Skeppstedt, M., Kvist, M., Nilsson, G. H., & Dalianis, H. (2014). Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study. Journal of Biomedical Informatics, 49, 148-158. doi:10.1016/j.jbi.2014.01.012

Szarvas, G., Farkas, R., & Busa-Fekete, R. (2007). State-of-the-art anonymization of medical records using an iterative machine learning framework. Journal of the American Medical Informatics Association, 14(5), 574-580. doi:10.1197/j.jamia.M2441

Wang, Z., Shah, A. D., Tate, A. R., Denaxas, S., Shawe-Taylor, J., & Hemingway, H. (2012). Extracting diagnoses and investigation results from unstructured text in electronic health records by semi-supervised machine learning. PLoS One, 7(1), e30412. doi:10.1371/journal.pone.0030412

Wiens, J., Campbell, W. N., Franklin, E. S., Guttag, J. V., & Horvitz, E. (2014, September). Learning Data-Driven Patient Risk Stratification Models for Clostridium difficile. In Open Forum Infectious Diseases (Vol. 1, No. 2, p. ofu045). Oxford University Press. doi: 10.1093/ofid/ofu045

Weiss, J. C., Natarajan, S., Peissig, P. L., McCarty, C. A., & Page, D. (2012). Machine learning for personalized medicine: predicting primary myocardial infarction from electronic health records. AI Magazine, 33(4), 33. doi:10.1609/aimag.v33i4.2438

Wolfson, J., Bandyopadhyay, S., Elidrisi, M., Vazquez-Benitez, G., Musgrove, D., Adomavicius, G., … & O’Connor, P. (2013). A Naive Bayes machine learning approach to risk prediction using censored, time-to-event electronic health record data. [Draft of presentation/publication; not completed.]

Wu, J., Roy, J., & Stewart, W. F. (2010). Prediction modeling using EHR data: challenges, strategies, and a comparison of machine learning approaches. Medical care, 48(6), S106-S113. doi:10.1097/MLR.0b013e3181de9e17


Observational Studies=Data Mining


I most often incorporate GIS into my work by using the raw data provided by the EMR, reclassifying it as need be, and adding longitude-latitude data whenever possible. This use of GIS may be considered an extension of the increasingly popular “Observational Studies” term and techniques now found in the literature. A GIS study of the raw or freshly mined and slightly modified data may also be labelled an “ecological study.”
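As a rough sketch of the "adding longitude-latitude data" step, the Python below joins records to a centroid lookup table and counts records per point, ready for mapping. The field names (`area`, `lon`, `lat`), the function names, and the coordinates are placeholders invented for illustration; a real workflow would use a proper geocoding source.

```python
from collections import Counter

# Placeholder centroid table keyed by an area code; the coordinates are
# invented for illustration and do not correspond to real places.
CENTROIDS = {
    "AREA1": (-73.90, 41.70),   # (longitude, latitude)
    "AREA2": (-74.10, 41.50),
}

def attach_coordinates(records, centroids=CENTROIDS):
    """Add 'lon' and 'lat' to each record dict; None when the area is unknown."""
    out = []
    for rec in records:
        lon, lat = centroids.get(rec.get("area"), (None, None))
        out.append({**rec, "lon": lon, "lat": lat})
    return out

def counts_by_point(geocoded):
    """Count records per (lon, lat) point, skipping records without coordinates."""
    return Counter(
        (r["lon"], r["lat"]) for r in geocoded if r["lon"] is not None
    )
```

Records that cannot be matched keep `None` coordinates rather than being dropped, so the share of unmappable data stays visible.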

Grimes, D. A., & Schulz, K. F. (2002). Bias and causal associations in observational research. The Lancet, 359(9302), 248-252. doi:10.1016/S0140-6736(02)07451-2

Hansen, R. A., Gray, M. D., Fox, B. I., Hollingsworth, J. C., Gao, J., & Zeng, P. (2013). How well do various health outcome definitions identify appropriate cases in observational studies? Drug Safety, 36(1), 27-32. doi:10.1007/s40264-013-0104-0

Madigan, D., Stang, P. E., Berlin, J. A., Schuemie, M., Overhage, J. M., Suchard, M. A., … & Ryan, P. B. (2014). A systematic statistical approach to evaluating evidence from observational studies. Annual Review of Statistics and Its Application, 1, 11-39. doi:10.1146/annurev-statistics-022513-115645

Nagisetty, N., Huang, E. Y., Wade, G., & Viangteeravat, T. (2014). Building a knowledge base to assist clinical decision-making using the Pediatric Research Database (PRD) and machine learning: a case study on pediatric asthma patients. BMC Bioinformatics, 15(Suppl 10), P17. doi:10.1186/1471-2105-15-S1-S10

Roche, J. J. W., Wenn, R. T., Sahota, O., & Moran, C. G. (2005). Effect of comorbidities and postoperative complications on mortality after hip fracture in elderly people: prospective observational cohort study. BMJ, 331(7529), 1374. doi:10.1136/bmj.38643.663843.55

Schuemie, M. J., Ryan, P. B., DuMouchel, W., Suchard, M. A., & Madigan, D. (2014). Interpreting observational studies: why empirical calibration is needed to correct p‐values. Statistics in medicine, 33(2), 209-218. doi:10.1002/sim.5925

Shiomi, H., Nakagawa, Y., Morimoto, T., Furukawa, Y., Nakano, A., Shirai, S., … & Mitsuoka, H. (2012). Association of onset to balloon and door to balloon time with long term clinical outcome in patients with ST elevation acute myocardial infarction having primary percutaneous coronary intervention: observational study. BMJ, 344, e3257. doi: 10.1136/bmj.e3257

Tannen, R. L., Weiner, M. G., & Xie, D. (2009). Use of primary care electronic medical record database in drug efficacy research on cardiovascular outcomes: comparison of database and randomised controlled trial findings. BMJ, 338. doi:10.1136/bmj.b81

Twisk, J. W. (1997). Different statistical models to analyze epidemiological observational longitudinal data: an example from the Amsterdam Growth and Health Study. International Journal of Sports Medicine, 18, S216-24.

Yost, N. P., Bloom, S. L., McIntire, D. D., & Leveno, K. J. (2005). A prospective observational study of domestic violence during pregnancy. Obstetrics & gynecology, 106(1), 61-65. doi:10.1097/01.AOG.0000164468.06070.2a


Comparative Effectiveness Research

CER is when treatment programs at several institutions or facilities are contrasted and compared statistically. This refers to database settings where the data come from several places and, in order to retain HIPAA compliance, are cleaned of personal identifiers and other data, as specified by some program and/or HIPAA guidelines. Researchers follow these guidelines as much as possible, but full compliance is difficult when the restricted data are essential to the study process itself, such as 5-digit zip code identification and even street and house number data. CER involves institutions cross-comparing their healthcare results and performance. These measures are often implemented as part of the meaningful use program as well.
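The cleaning step can be illustrated with a small Python sketch that drops direct identifiers and shortens the ZIP code to three digits, loosely following the HIPAA Safe Harbor idea of retaining only the first three ZIP digits. It is a simplification, not a complete de-identification procedure: the field names, the function name `deidentify`, and the `low_population_zip3` input are assumptions made for the example.

```python
DIRECT_IDENTIFIERS = ("name", "street", "house_number")  # hypothetical field names

def deidentify(record, low_population_zip3=frozenset()):
    """Return a copy with direct identifiers dropped and ZIP cut to 3 digits.

    low_population_zip3: 3-digit prefixes that must be masked as "000"
    (an input the caller would need to supply, e.g. from census data).
    """
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    zip5 = str(clean.pop("zip", "") or "")
    # A missing or too-short ZIP is masked rather than passed through.
    zip3 = zip5[:3] if len(zip5) >= 3 else "000"
    clean["zip3"] = "000" if zip3 in low_population_zip3 else zip3
    return clean
```

This also shows the tension noted above: once the ZIP is cut to three digits and the street data removed, fine-grained spatial analysis is no longer possible on the shared copy.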


Hersh, W. R., Weiner, M. G., Embi, P. J., Logan, J. R., Payne, P. R., Bernstam, E. V., … & Saltz, J. H. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care, 51(8 Suppl 3), S30-7. doi:10.1097/MLR.0b013e31829b1dbd

Holve, E., Segal, C., Lopez, M. H., Rein, A., & Johnson, B. H. (2012). The Electronic Data Methods (EDM) forum for comparative effectiveness research (CER). Medical care, 50, S7-S10. doi:10.1097/MLR.0b013e318257a66b

Kudyakov, R., Bowen, J., Ewen, E., West, S. L., Daoud, Y., Fleming, N., & Masica, A. (2012). Electronic health record use to classify patients with newly diagnosed versus preexisting type 2 diabetes: infrastructure for comparative effectiveness research and population health management. Population Health Management, 15(1), 3-11. doi:10.1089/pop.2010.0084.

Lopez, M. H., Holve, E., Sarkar, I. N., & Segal, C. (2012). Building the informatics infrastructure for comparative effectiveness research (CER): a review of the literature. Medical Care, 50, S38-S48. doi: 10.1097/MLR.0b013e318259becd

Masica, M. D., & Collinsworth, M. P. H. (2012). Leveraging Electronic Health Records in Comparative Effectiveness Research. Prescriptions for Excellence in Health Care Newsletter Supplement, 1(14), 6.

Ogunyemi, O. I., Meeker, D., Kim, H. E., Ashish, N., Farzaneh, S., & Boxwala, A. (2013). Identifying appropriate reference data models for comparative effectiveness research (CER) studies based on data from clinical information systems. Medical Care, 51, S45-S52. doi:10.1097/MLR.0b013e31829b1e0b

Toh, S., Platt, R., Steiner, J. F., & Brown, J. S. (2011). Comparative‐Effectiveness Research in Distributed Health Data Networks. Clinical Pharmacology & Therapeutics, 90(6), 883-887. doi:10.1038/clpt.2011.236

Toh, S., & Platt, R. (2013). Is size the next big thing in epidemiology?. Epidemiology, 24(3), 349-351. doi:10.1097/EDE.0b013e31828ac65e

Data Sharing (iDASH, a HIPAA certified cloud)

Ohno-Machado, L., Bafna, V., Boxwala, A. A., Chapman, B. E., Chapman, W. W., Chaudhuri, K., … & Kim, H. (2012). iDASH: integrating data for analysis, anonymization, and sharing. Journal of the American Medical Informatics Association, 19(2), 196-201. doi:10.1136/amiajnl-2011-000538

Reisinger, S. J., Ryan, P. B., O’Hara, D. J., Powell, G. E., Painter, J. L., Pattishall, E. N., & Morris, J. A. (2010). Development and evaluation of a common data model enabling active drug safety surveillance using disparate healthcare databases. Journal of the American Medical Informatics Association, 17(6), 652-662. doi:10.1136/jamia.2009.002477

Data Quality Assessment Model

Kahn, M. G., Raebel, M. A., Glanz, J. M., Riedlinger, K., & Steiner, J. F. (2012). A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Medical care, 50. doi:10.1097/MLR.0b013e318257dd67

Brown, J., Kahn, M., & Toh, S. (2013). Data quality assessment for comparative effectiveness research in distributed data networks. Medical care, 51(8 Suppl 3), S22. doi:10.1097/MLR.0b013e31829b1e2c

Dreyer, N. A., Schneeweiss, S., McNeil, B. J., Berger, M. L., Walker, A. M., Ollendorf, D. A., & Gliklich, R. E. (2010). GRACE principles: recognizing high-quality observational studies of comparative effectiveness. The American Journal of Managed Care, 16(6), 467-471.

Data Accuracy

Cipparone, C. W., Withiam-Leitch, M., Kimminau, K. S., Fox, C. H., Singh, R., & Kahn, L. (2015). Inaccuracy of ICD-9 Codes for Chronic Kidney Disease: A Study from Two Practice-based Research Networks (PBRNs). The Journal of the American Board of Family Medicine, 28(5), 678-682. doi:10.3122/jabfm.2015.05.140136


Malin, B. A., El Emam, K., & O’Keefe, C. M. (2013). Biomedical data privacy: problems, perspectives, and recent advances. Journal of the American medical informatics association, 20(1), 2-6. doi:10.1136/amiajnl-2012-001509

Tran, D. T., Halgrim, S., & Carrell, D. (2014). C3-4: An Algorithm to Combine Machine Learning and Structured Data to Automate De-identification of Clinical Text. Clinical Medicine & Research, 12(1-2), 94-95. doi:10.3121/cmr.2014.1250.c3-4



Part 2 – Applications, Methods and Skills

These are examples of how to employ population health analysis procedures.   


Alghwiri, A., Alghadir, A., & Awad, H. (2014). The Arab Risk (ARABRISK): Translation and Validation. Biomedical Research, 25(2), 271-275.

Carroll, R. J., Thompson, W. K., Eyler, A. E., Mandelin, A. M., Cai, T., Zink, R. M., … & Karlson, E. W. (2012). Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19(e1), e162-e169. doi:10.1136/amiajnl-2011-000583

Holroyd-Leduc, J. M., Lorenzetti, D., Straus, S. E., Sykes, L., & Quan, H. (2011). The impact of the electronic medical record on structure, process, and outcomes within primary care: a systematic review of the evidence. Journal of the American Medical Informatics Association, 18(6), 732-737. doi:10.1136/amiajnl-2010-000019

Lin, Y. K., Chen, H., Brown, R., Li, S. H., & Yang, H. J. (2014). Time-to-Event Predictive Modeling for Chronic Conditions using Electronic Health Records. Intelligent Systems, IEEE, 29(3), 14-20. doi:10.1109/MIS.2014.18

Sovio, U., Skow, A., Falconer, C., Park, M. H., Viner, R. M., & Kinra, S. (2013). Improving prediction algorithms for cardiometabolic risk in children and adolescents. Journal of obesity, 2013. doi:10.1155/2013/684782


Anderson, A. E., Kerr, W. T., Thames, A., Li, T., Xiao, J., & Cohen, M. S. (2015). Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: A cross-sectional, unselected, retrospective study. arXiv preprint arXiv:1501.02402.

Boland, M. R., Tatonetti, N. P., & Hripcsak, G. (2015). Development and validation of a classification approach for extracting severity automatically from electronic health records. Journal of Biomedical Semantics, 6(1), 14.  doi:10.1186/s13326-015-0010-8

Carroll, R. J., Eyler, A. E., & Denny, J. C. (2011). Naïve electronic health record phenotype identification for rheumatoid arthritis. In AMIA annual symposium proceedings (Vol. 2011, p. 189). American Medical Informatics Association.

Chen, Y., Carroll, R. J., Hinz, E. R. M., Shah, A., Eyler, A. E., Denny, J. C., & Xu, H. (2013). Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association, 20(e2), e253-e259. doi:10.1136/amiajnl-2013-001945

Pecci, A., Klersy, C., Gresele, P., Lee, K. J., De Rocco, D., Bozzi, V., … & Fabris, F. (2014). MYH9‐Related Disease: A Novel Prognostic Model to Predict the Clinical Evolution of the Disease Based on Genotype–Phenotype Correlations. Human Mutation, 35(2), 236-247. doi:10.1002/humu.22476

Hripcsak, G., & Albers, D. J. (2013). Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 20(1), 117-121. doi:10.1136/amiajnl-2012-001145


Peissig, P. L., Costa, V. S., Caldwell, M. D., Rottscheit, C., Berg, R. L., Mendonca, E. A., & Page, D. (2014). Relational machine learning for electronic health record-driven phenotyping. Journal of biomedical informatics, 52, 260-270. doi:10.1016/j.jbi.2014.07.007

Rasmussen, L. V., Thompson, W. K., Pacheco, J. A., Kho, A. N., Carrell, D. S., Pathak, J., … & Starren, J. B. (2014). Design patterns for the development of electronic health record-driven phenotype extraction algorithms. Journal of Biomedical Informatics, 51, 280-286. doi:10.1016/j.jbi.2014.06.007

Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., & Lai, A. M. (2014). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association, 21(2), 221-230. doi:10.1136/amiajnl-2013-001935

Wei, W. Q., Teixeira, P. L., Mo, H., Cronin, R. M., Warner, J. L., & Denny, J. C. (2015). Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. Journal of the American Medical Informatics Association, ocv130.



Chai, K. E., Anthony, S., Coiera, E., & Magrabi, F. (2013). Using statistical text classification to identify health information technology incidents. Journal of the American Medical Informatics Association, 20(5), 980-985. doi:10.1136/amiajnl-2012-001409

Dai, W., Brisimi, T. S., Adams, W. G., Mela, T., Saligrama, V., & Paschalidis, I. C. (2015). Prediction of hospitalization due to heart diseases by supervised learning methods. International journal of medical informatics, 84(3), 189-197. doi:10.1016/j.ijmedinf.2014.10.002

Pak, T. R., & Kasarskis, A. (2015). How next-generation sequencing and multiscale data analysis will transform infectious disease management. Clinical Infectious Diseases, 61(11), 1695-1702.  doi: 10.1093/cid/civ670

Ye, Y., Tsui, F., Wagner, M., Espino, J. U., & Li, Q. (2014). Influenza detection from emergency department reports using natural language processing and Bayesian network classifiers. Journal of the American Medical Informatics Association, 21(5), 815-823. doi:10.1136/amiajnl-2013-001934

QOL & Dx post-Tx

Penson, D. F., Feng, Z., Kuniyuki, A., McClerran, D., Albertsen, P. C., Deapen, D., … & Stanford, J. L. (2003). General quality of life 2 years following treatment for prostate cancer: what influences outcomes? Results from the prostate cancer outcomes study. Journal of Clinical Oncology, 21(6), 1147-1154. doi:10.1200/JCO.2003.07.139


Newhouse, J. P., & McClellan, M. (1998). Econometrics in outcomes research: the use of instrumental variables. Annual Review of Public Health, 19(1), 17-34. doi:10.1146/annurev.publhealth.19.1.17

Comorbidity Scores

Austin, S.R., Wong, Y.N., Uzzo, R.G., Beck, J.R., Egleston, B.L. (2015). Why Summary Comorbidity Measures Such As the Charlson Comorbidity Index and Elixhauser Score Work. Medical Care, 53(9), e65-72. doi:10.1097/MLR.0b013e318297429c.

Bang, J. H., Hwang, S.-H., Lee, E.-J., & Kim, Y. (2013). The predictability of claim-data-based comorbidity-adjusted models could be improved by using medication data. BMC Medical Informatics and Decision Making, 13, 128.

Chu, Y.-T., Ng, Y.-Y., & Wu, S.-C. (2010). Comparison of different comorbidity measures for use with administrative data in predicting short- and long-term mortality. BMC Health Services Research, 10, 140.

Gutacker, N, Bloor, K, Cookson, R. (2015). Comparing the performance of the Charlson/Deyo and Elixhauser comorbidity measures across five European countries and three conditions.  European Journal of Public Health. 25 Suppl 1, 15-20. doi:10.1093/eurpub/cku221.

Johnson, A. E., Kramer, A. A., & Clifford, G. D. (2013). A new severity of illness scale using a subset of acute physiology and chronic health evaluation data elements shows comparable predictive accuracy*. Critical Care Medicine, 41(7), 1711-1718. doi: 10.1097/CCM.0b013e31828a24fe  [Oxford Acute Severity of Illness Score ; Particle Swarm Optimization]

Menendez, M. E., et al. (2014). The Elixhauser Comorbidity Method Outperforms the Charlson Index in Predicting Inpatient Death After Orthopaedic Surgery. Clinical Orthopaedics and Related Research, 472(9), 2878–2886.

Schneeweiss, S., Maclure, M. (2000). Use of comorbidity scores for control of confounding in studies using administrative databases. International Journal of Epidemiology, 29(5), 891-8.

Stausberg J, Hagn S (2015) New Morbidity and Comorbidity Scores based on the Structure of the ICD-10. PLoS ONE 10(12): e0143365. doi:10.1371/journal.pone.0143365

Yang, M., Mehta, H.B., Bali, V., Gupta, P., Wang, X., Johnson, M.L., Aparasu, R. R. (2015). Which risk-adjustment index performs better in predicting 30-day mortality? A systematic review and meta-analysis. Journal of Evaluation in Clinical Practice, 21(2), 292-9. doi: 10.1111/jep.12307. [Includes several speciality disease scores]


Castro, V. M., Clements, C. C., Murphy, S. N., Gainer, V. S., Fava, M., Weilburg, J. B., … & Smoller, J. W. (2013). QT interval and antidepressant use: a cross sectional study of electronic health records. BMJ, 346, f288. doi:10.1136/bmj.f288

Costa, F. F. (2014). Big data in biomedicine. Drug discovery today, 19(4), 433-440. doi:10.1016/j.drudis.2013.10.012

Khoury, M. J., Rich, E. C., Randhawa, G., Teutsch, S. M., & Niederhuber, J. (2009). Comparative effectiveness research and genomic medicine: an evolving partnership for 21st century medicine. Genetics in Medicine, 11(10), 707-711. doi:10.1097/GIM.0b013e3181b99b90

Analytics (other)

Schulam, P., Wigley, F., & Saria, S. (2015, February). Clustering Longitudinal Clinical Marker Trajectories from Electronic Health Data: Applications to Phenotyping and Endotype Discovery. In Twenty-Ninth AAAI Conference on Artificial Intelligence.


Part 3 – Risk Analysis

Predicting Risk for Diabetes

Eggleston, E. M., & Klompas, M. (2014). Rational use of electronic health records for diabetes population management. Current Diabetes Reports, 14(4), 1-10. doi:10.1007/s11892-014-0479-z

Exalto, L. G., Biessels, G. J., Karter, A. J., Huang, E. S., Katon, W. J., Minkoff, J. R., & Whitmer, R. A. (2013). Risk score for prediction of 10 year dementia risk in individuals with type 2 diabetes: a cohort study. The Lancet Diabetes & Endocrinology, 1(3), 183-190. doi:10.1016/S2213-8587(13)70048-2

Herman, W. H. (2009). Predicting risk for diabetes: choosing (or building) the right model. Annals of Internal Medicine, 150(11), 812-814.

Jin, H., & Benyshek, D. C. (2013). The “metabolic syndrome index”: A novel, comprehensive method for evaluating the efficacy of diabetes prevention programs. doi:10.4236/jdm.2013.32014

Lawrence, J. M., Black, M. H., Zhang, J. L., Slezak, J. M., Takhar, H. S., Koebnick, C., … & Reynolds, K. (2013). Validation of pediatric diabetes case identification approaches for diagnosed cases by using information in the electronic health records of a large integrated managed health care organization. American Journal of Epidemiology, kwt230. doi:10.1093/aje/kwt230

Makam, A. N., Nguyen, O. K., Moore, B., Ma, Y., & Amarasingham, R. (2013). Identifying patients with diabetes and the earliest date of diagnosis in real time: an electronic health record case-finding algorithm. BMC medical informatics and decision making, 13(1), 81. doi:10.1186/1472-6947-13-81

Onitilo, A. A., Stankowski, R. V., Berg, R. L., Engel, J. M., Williams, G. M., & Doi, S. A. (2014). A novel method for studying the temporal relationship between type 2 diabetes mellitus and cancer using the electronic medical record. BMC medical informatics and decision making, 14(1), 38. doi:10.1186/1472-6947-14-38

Reed, M., Huang, J., Brand, R., Graetz, I., Neugebauer, R., Fireman, B., … & Hsu, J. (2013). Implementation of an outpatient electronic health record and emergency department visits, hospitalizations, and office visits among patients with diabetes. JAMA, 310(10), 1060-1065. doi:10.1001/jama.2013.276733.

Riaz, M., Basit, A., Hydrie, M. Z. I., Shaheen, F., Hussain, A., Hakeem, R., & Shera, A. S. (2012). Risk assessment of Pakistani individuals for diabetes (RAPID). Primary care diabetes, 6(4), 297-302. doi:10.1016/j.pcd.2012.04.002

Tankova, T., Chakarova, N., Atanassova, I., & Dakovska, L. (2011). Evaluation of the Finnish Diabetes Risk Score as a screening tool for impaired fasting glucose, impaired glucose tolerance and undetected diabetes. Diabetes Research and Clinical Practice, 92(1), 46-52. doi:10.1016/j.diabres.2010.12.020

Wang, H., Liu, T., Qiu, Q., Karp, E., Ding, P., He, Y. H., & Chen, W. Q. (2015). Development and validation of a simple risk score for prevalent undiagnosed type 2 diabetes in Southern Chinese population. International Journal of Diabetes in Developing Countries, 35(3), 1-9. doi:10.1007/s13410-014-0285-9

Prediction, Risk, in General

Bandyopadhyay, S., Wolfson, J., Vock, D. M., Vazquez-Benitez, G., Adomavicius, G., Elidrisi, M., … & O’Connor, P. J. (2014). Data mining for censored time-to-event data: A Bayesian network model for predicting cardiovascular risk from electronic health record data. Data Mining and Knowledge Discovery, 1-37. doi: 10.1007/s10618-014-0386-6

Eggleston, E. M., & Weitzman, E. R. (2014). Innovative uses of electronic health records and social media for public health surveillance. Current Diabetes Reports, 14(3), 1-9. doi:10.1007/s11892-013-0468-7

Fox, K. A., Dabbous, O. H., Goldberg, R. J., Pieper, K. S., Eagle, K. A., Van de Werf, F., … & Granger, C. B. (2006). Prediction of risk of death and myocardial infarction in the six months after presentation with acute coronary syndrome: prospective multinational observational study (GRACE). BMJ, 333(7578), 1091. doi:10.1136/bmj.38985.646481.55

Goldstein, B. A., Chang, T. I., Mitani, A. A., Assimes, T. L., & Winkelmayer, W. C. (2014). Near-term prediction of sudden cardiac death in older hemodialysis patients using electronic health records. Clinical Journal of the American Society of Nephrology, 9(1), 82-91. doi:10.2215/CJN.03050313

Gultepe, E., Green, J. P., Nguyen, H., Adams, J., Albertson, T., & Tagkopoulos, I. (2014). From vital signs to clinical outcomes for patients with sepsis: a machine learning basis for a clinical decision support system. Journal of the American Medical Informatics Association, 21(2), 315-325. doi:10.1136/amiajnl-2013-001815

Himes, B. E., Dai, Y., Kohane, I. S., Weiss, S. T., & Ramoni, M. F. (2009). Prediction of chronic obstructive pulmonary disease (COPD) in asthma patients using electronic medical records. Journal of the American Medical Informatics Association, 16(3), 371-379. doi:10.1197/jamia.M2846

Hubbard, R. (2014). Statistical methods for misclassified outcomes and exposures in data from electronic medical records. [Report]. Accessed at

Li, D., Simon, G., Chute, C. G., & Pathak, J. (2013). Using Association Rule Mining for Phenotype Extraction from Electronic Health Records. AMIA Summits on Translational Science Proceedings, 2013, 142–146. [ARM Model building]

Mani, S., Ozdas, A., Aliferis, C., Varol, H. A., Chen, Q., Carnevale, R., … & Weitkamp, J. H. (2014). Medical decision support using machine learning for early detection of late-onset neonatal sepsis. Journal of the American Medical Informatics Association, 21(2), 326-336. doi:10.1136/amiajnl-2013-001854

Melton, L. J., Atkinson, E. J., St Sauver, J. L., Achenbach, S. J., Therneau, T. M., Rocca, W. A., & Amin, S. (2014). Predictors of Excess Mortality After Fracture: A Population‐Based Cohort Study. Journal of Bone and Mineral Research, 29(7), 1681-1690. doi:10.1002/jbmr.2193

Murray, R. E., Ryan, P. B., & Reisinger, S. J. (2011). Design and validation of a data simulation model for longitudinal healthcare data. In AMIA Annual Symposium Proceedings (Vol. 2011, p. 1176). American Medical Informatics Association.

Pearson, J. F., Brownstein, C. A., & Brownstein, J. S. (2011). Potential for electronic health records and online social networking to redefine medical research. Clinical Chemistry, 57(2), 196-204. doi:10.1373/clinchem.2010.148668

Ryan, P. B., Schuemie, M. J., Gruber, S., Zorych, I., & Madigan, D. (2013). Empirical performance of a new user cohort method: lessons for developing a risk identification and analysis system. Drug Safety, 36(1), 59-72. doi:10.1007/s40264-013-0099-6

Ryan, P. B., & Schuemie, M. J. (2013). Evaluating Performance of Risk Identification Methods Through a Large-Scale Simulation of Observational Data. Drug Safety, 36(1), 171-180. doi:10.1007/s40264-013-0110-2

Weiss, J. C., Natarajan, S., Peissig, P. L., McCarty, C. A., & Page, D. (2012, July). Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records. In IAAI. Twenty-Fourth IAAI Conference, Toronto, Ontario, Canada, July 22, 2012 – July 26, 2012



Part 4 – Other Applications, Methodology Information



Gobbel, G. T., Reeves, R., Jayaramaraja, S., Giuse, D., Speroff, T., Brown, S. H., … & Matheny, M. E. (2014). Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives. Journal of Biomedical Informatics, 48, 54-65. doi:10.1016/j.jbi.2013.11.008

Jonnagaddala, J., Dai, H. J., Ray, P., & Liaw, S. T. (2015). A preliminary study on automatic identification of patient smoking status in unstructured electronic health records. ACL-IJCNLP 2015, 147. Accessed at

Poulin, C., Shiner, B., Thompson, P., Vepstas, L., Young-Xu, Y., Goertzel, B., … & McAllister, T. (2014). Predicting the risk of suicide by analyzing the text of clinical notes. PLoS ONE, 9(1). doi:10.1371/journal.pone.0085733

Strauss, J. A., Chao, C. R., Kwan, M. L., Ahmed, S. A., Schottinger, J. E., & Quinn, V. P. (2013). Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm. Journal of the American Medical Informatics Association, 20(2), 349-355. doi:10.1136/amiajnl-2012-000928

Zheng, C., Rashid, N., Wu, Y. L., Koblick, R., Lin, A. T., Levy, G. D., & Cheetham, T. C. (2014). Using natural language processing and machine learning to identify gout flares from electronic clinical notes. Arthritis Care & Research, 66(11), 1740-1748. doi:10.1002/acr.22324


Hripcsak, G., Albers, D. J., & Perotte, A. (2015). Parameterizing time in electronic health record studies. Journal of the American Medical Informatics Association, ocu051. doi:10.1093/jamia/ocu051

Software Other

Mowery, D., Wiebe, J., Ross, M., Velupillai, S., Meystre, S., & Chapman, W. W. (2014). Generating Patient Problem Lists from the ShARe Corpus using SNOMED CT/SNOMED CT CORE Problem List. In Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014) (pp. 54–58). Baltimore, Maryland, USA, June 26-27, 2014. Accessed at

Ng, K., Ghoting, A., Steinhubl, S. R., Stewart, W. F., Malin, B., & Sun, J. (2014). PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records. Journal of Biomedical Informatics, 48, 160-170. doi:10.1016/j.jbi.2013.12.012

Savova, G. K., Masanz, J. J., Ogren, P. V., Zheng, J., Sohn, S., Kipper-Schuler, K. C., & Chute, C. G. (2010). Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5), 507-513. doi:10.1136/jamia.2009.001560

Software Sentinel

Behrman, R. E., Benner, J. S., Brown, J. S., McClellan, M., Woodcock, J., & Platt, R. (2011). Developing the Sentinel System—a national resource for evidence development. New England Journal of Medicine, 364(6), 498-499. doi:10.1056/NEJMp1014427

Curtis, L. H., Weiner, M. G., Boudreau, D. M., Cooper, W. O., Daniel, G. W., Nair, V. P., … & Brown, J. S. (2012). Design considerations, architecture, and use of the Mini‐Sentinel distributed data system. Pharmacoepidemiology and Drug Safety, 21(S1), 23-31. doi:10.1002/pds.2336

Madigan, D., & Ryan, P. (2011). Commentary: What Can We Really Learn From Observational Studies?: The Need for Empirical Assessment of Methodology for Active Drug Safety Surveillance and Comparative Effectiveness Research. Epidemiology, 22(5), 629-631. doi:10.1097/EDE.0b013e318228ca1d

Maro, J. C., Platt, R., Holmes, J. H., Strom, B. L., Hennessy, S., Lazarus, R., & Brown, J. S. (2009). Design of a national distributed health data network. Annals of Internal Medicine, 151(5), 341-344. doi:10.7326/0003-4819-151-5-200909010-00139

Overhage, J. M., Ryan, P. B., Reich, C. G., Hartzema, A. G., & Stang, P. E. (2012). Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association, 19(1), 54-60. doi:10.1136/amiajnl-2011-000376 [Values]

Reich, C., Ryan, P. B., Stang, P. E., & Rocca, M. (2012). Evaluation of alternative standardized terminologies for medical conditions within a network of observational healthcare databases. Journal of Biomedical Informatics, 45(4), 689-696. doi:10.1016/j.jbi.2012.05.002

Ryan, P. B., Madigan, D., Stang, P. E., Overhage, J. M., Racoosin, J. A., & Hartzema, A. G. (2012). Empirical assessment of methods for risk identification in healthcare data: results from the experiments of the Observational Medical Outcomes Partnership. Statistics in Medicine, 31(30), 4401-4415. doi:10.1002/sim.5620

Stang, P. E., Ryan, P. B., Racoosin, J. A., Overhage, J. M., Hartzema, A. G., Reich, C., … & Woodcock, J. (2010). Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership. Annals of Internal Medicine, 153(9), 600-606. doi:10.7326/0003-4819-153-9-201011020-00010

Stang, P. E., Ryan, P. B., Dusetzina, S. B., Hartzema, A. G., Reich, C., Overhage, J. M., & Racoosin, J. A. (2012). Health outcomes of interest in observational data: issues in identifying definitions in the literature. Health Outcomes Research in Medicine, 3(1), e37-e44. doi:10.1016/j.ehrm.2011.11.003

Coloma, P. M., Trifirò, G., Schuemie, M. J., Gini, R., Herings, R., Hippisley‐Cox, J., … & Lei, J. (2012). Electronic healthcare databases for active drug safety surveillance: is there enough leverage? Pharmacoepidemiology and Drug Safety, 21(6), 611-621. doi:10.1002/pds.3197


MISC (some nice reads on this topic)

For more on OMOP, see

Brooks, R., & Grotz, C. (2010). Implementation of electronic medical records: How healthcare providers are managing the challenges of going digital. Journal of Business & Economics Research (JBER), 8(6). doi:10.19030/jber.v8i6.736

Hansen, M. M., Miron-Shatz, T., Lau, A. Y. S., & Paton, C. (2014). Big Data in Science and Healthcare: A Review of Recent Literature and Perspectives: Contribution of the IMIA Social Media Working Group. Yearbook of Medical Informatics, 9(1), 21-26. doi:10.15265/IY-2014-0004  [Table provides examples: location use, visualization, assess disease spread, evaluate cause, predict, define social and environmental factors, crisis and disaster management planning, tracking, storing and mining population health data, bring together data from different sources, monitor, cost model]

Herland, M., Khoshgoftaar, T. M., & Wald, R. (2013, December). Survey of Clinical Data Mining Applications on Big Data in Health Informatics. In Machine Learning and Applications (ICMLA), 2013 12th International Conference (Vol. 2, pp. 465-472). IEEE. doi:10.1109/ICMLA.2013.163

Kennedy, E. H., Wiitala, W. L., Hayward, R. A., & Sussman, J. B. (2013). Improved cardiovascular risk prediction using nonparametric regression and electronic health record data. Medical Care, 51(3), 251. doi:10.1097/MLR.0b013e31827da594

Kerr, W. T., Lau, E. P., Owens, G. E., & Trefler, A. (2012). The future of medical diagnostics: large digitized databases. The Yale Journal of Biology and Medicine, 85(3), 363.

Kushida, C. A., Nichols, D. A., Jadrnicek, R., Miller, R., Walsh, J. K., & Griffin, K. (2012). Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Medical Care, 50, S82-S101. doi:10.1097/MLR.0b013e3182585355


Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care. JAMA, 309(13), 1351-1352.

Pathak, J., Kho, A. N., & Denny, J. C. (2013). Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association, 20(e2), e206-e211. doi:10.1136/amiajnl-2013-002428

Savage, N. (2012). Better medicine through machine learning. Communications of the ACM, 55(1), 17-19. doi:10.1145/2063176.2063182

Schneeweiss, S. (2014). Learning from big health care data. New England Journal of Medicine, 370(23), 2161-2163. doi:10.1056/NEJMp1401111

Scruggs, S. B., Watson, K., Su, A. I., Hermjakob, H., Yates, J. R., Lindsey, M. L., & Ping, P. (2015). Harnessing the Heart of Big Data. Circulation Research, 116(7), 1115-1119. doi:10.1161/CIRCRESAHA.115.306013

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557-570. doi:10.1142/S0218488502001648

