Development of a Sepsis Prediction Model Using Electronic Health Records

Final Capstone

Victoria M. Vazquez, Ph.D.


An estimated 1.7 million cases of sepsis occur in the United States every year (Dantes and Epstein 2018), with an annual cost of $23.7 billion in 2013 (Novosad et al. 2016). The goal of this project was to produce a model that could predict the development of sepsis in hospital patients, allowing nurses and doctors to intervene sooner and thereby reduce both the incidence of sepsis and the mortality rate among patients who develop it.


I used the Medical Information Mart for Intensive Care III (MIMIC-III) database, which consists of deidentified patient data collected at Beth Israel Deaconess Medical Center from 2001 to 2012. The database is freely available, but access requires completing the CITI “Data or Specimens Only Research” course and registering for an account on PhysioNet. It contains 26 tables (Table 1) and includes demographics, vital sign measurements, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (Table 2).

Table 1


Table 2


The chart events table is of particular interest because it includes all data added to a patient's chart, with as many as several measurements every 15 minutes for some patients. As a result, it is unwieldy on a basic computer and with traditional machine learning algorithms. I used SQLite to query the database and narrow the scope of the data used in modeling. I reduced the data to seven categories based on the label column (Table 3), and I filtered it to include only patients who did not have sepsis upon admission to the hospital, based on the free-form diagnosis column in the admissions table. The results of the queries were written to two CSV files, one containing patients who never developed sepsis in the hospital and one containing those who did.

Fig. 1 SQLite Queries of MIMIC-III Database
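
The filtering described above can be illustrated with a minimal sketch. It is not a reproduction of the queries in Fig. 1. It assumes the public MIMIC-III layout in which chart measurements (chartevents) join to a label lookup table (d_items) and to admissions, and it assumes that sepsis on admission can be approximated by a substring match on the free-form diagnosis text. The two labels, the function names, and every row of data are hypothetical and exist only so the example runs against an in-memory database.

```python
import csv
import io
import sqlite3

# Toy in-memory stand-in for three MIMIC-III tables. All rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER, diagnosis TEXT);
CREATE TABLE d_items (itemid INTEGER, label TEXT);
CREATE TABLE chartevents (hadm_id INTEGER, itemid INTEGER,
                          charttime TEXT, valuenum REAL);
INSERT INTO admissions VALUES
    (1, 100, 'PNEUMONIA'), (2, 200, 'SEPSIS'), (3, 300, 'HIP FRACTURE');
INSERT INTO d_items VALUES
    (10, 'Heart Rate'), (11, 'Respiratory Rate'), (12, 'Bed Position');
INSERT INTO chartevents VALUES
    (100, 10, '2101-01-01 00:00:00', 88.0),
    (100, 12, '2101-01-01 00:15:00', NULL),
    (200, 10, '2101-01-01 00:00:00', 120.0),
    (300, 11, '2101-01-02 00:00:00', 18.0);
""")

# Hypothetical subset of labels; the seven categories actually used are in Table 3.
KEPT_LABELS = ("Heart Rate", "Respiratory Rate")


def fetch_not_septic_on_admission(connection, labels):
    """Return chart rows for the kept labels, skipping admissions whose
    free-form diagnosis mentions sepsis (an assumed, simplified filter)."""
    placeholders = ",".join("?" for _ in labels)
    query = f"""
        SELECT a.subject_id, a.hadm_id, d.label, c.charttime, c.valuenum
        FROM chartevents AS c
        JOIN d_items AS d ON d.itemid = c.itemid
        JOIN admissions AS a ON a.hadm_id = c.hadm_id
        WHERE d.label IN ({placeholders})
          AND UPPER(a.diagnosis) NOT LIKE '%SEPSIS%'
        ORDER BY a.hadm_id, c.charttime
    """
    return connection.execute(query, labels).fetchall()


def write_csv(rows, fileobj):
    """Write the query result, with a header row, to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["subject_id", "hadm_id", "label", "charttime", "valuenum"])
    writer.writerows(rows)


rows = fetch_not_septic_on_admission(conn, KEPT_LABELS)
buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_csv(rows, buffer)
print(buffer.getvalue())
```

Restricting the join to a short list of labels before exporting keeps the result small enough to load into memory, which is the point of doing the reduction in SQL rather than after reading the full table. A substring match on a free-form column is crude: it would miss misspellings and synonyms, so a real filter would likely need a broader set of patterns.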