Detailed, granular information entered into the EMR is worth the expense of entry when it proves useful in improving practice efficiency, quality, and return on investment of the EMR.
Transactional vs. Warehoused data
EMRs normally store their data in an underlying relational database (Oracle, SQL-Server, Access, MySQL) or hierarchical/object database (MUMPS, M, Caché) in “transactional” form. The transactional form includes all information need to conduct the healthcare enterprise, including “internal” data of little interest to the end consumer/clinician (internal date-time stamps, update codes, workstation origin codes, incremental data updates, etc.). In some circumstances, there is a case to be made for extracting key clinical data (extraction), cleaning up the data (transformation), and writing (loading) the data into a database specifically designed to ease data analysis. This sequence of events is the warehousing ETL process.
This process may be repeated across multiple databases, providing uniform, concept-compatible data in “normalized” form. In a similar fashion, the normalized warehouse data can be subdivided into small, customized modules (data marts).

By performing the ETL process and paring down the quantity of data, the reliability of analysis is enhanced and summarized data becomes available to less technically-oriented users.
Users in small practices and organizations may not have resources to construct such specialized data warehouses, and may need to settle for copies (or limited) copies of their transactional data from which to do their analyses. A thorough understanding of the EMR’s table structure (the database schema) is essential as the analyst will perform “on-the-fly” what the warehouse ETL process accomplishes in a more structured way.
Reporting tools and reporting languages
Since every EMR has at its heart a database, the method of entering and retrieving data is a special programming language for databases – SQL (Structured Query Language). SQL is considered a 4th generation programming language as it works at a “higher” level than 3rd generation languages such as C, Java, etc. Specifically, the database system is told what information needs to be extracted, not how to do it (this is determined by the database system’s query optimizer).
Database reporting tools provide an “attractive” front end for the querying process, often shielding the analyst from the raw SQL code. Such tools include Crystal Reports, Microsoft’s Access Query tool (which can be used for both Access and non-Access queries), as well as the database vendor’s own internal querying tools. The key to a successful query and report is a properly framed question and the appropriate ODBC driver (“translator”) between the database system and the query tool. (ODBC - Open DataBase Connectivity – a Microsoft “standard” for allowing data extraction from diverse databases – Access, SQL Server, Oracle, Caché, etc.)
A worked example – reporting on Systolic Blood Pressures
A simple query might be the extraction of all Systolic Blood Pressures from all Patients. Information about the Systolic Blood Pressure would reside in a database table of Observations with following attributes (columns) - Test Type, Date, Result Value, and Patient ID number. Information about the Patient would reside in the Patient table with attributes of Name, Address, Phone Number, and Patient ID number.

To extract a result set with Name, Phone Number, Date, and Result Value, the following query is composed in SQL:
|
SELECT Name, Phone_Number, Date, Result_Value FROM Observation, Patient WHERE Observation.Patient_ID = Patient.Patient_ID AND Test_Type = Systolic_Blood_Pressure |
(the attributes desired) (tables accessed)
(link tables on Patient ID) (limits Results returned) |
Any of the above report tools (Crystal Reports, Brio, and Access) would send precisely this query to the database. The only difference between them is how the query might be composed with a “wizard” and how the result set would be display on the screen.
Interestingly, the same extracted data can be formatted differently for different users:
- Formatting for office staff - the information might be grouped with a prominent phone number and color (red) for all individuals with Systolic Blood Pressures equal to or greater than 130 or date older than one year ago. Listings for normal values and dates within the past year are omitted. Listings are sorted alphabetically.
- Formatting for physician review - the information might be grouped with a prominent color (red) for all individuals with Systolic Blood Pressures equal to or greater than 130 or date older than one year ago. Listings are sorted with the highest blood pressures first. A percentage of abnormal blood pressures and a percentage of measurements over one year old are included at the bottom of the report.
By further refinement, this admittedly simple query can be modified to show:
- Only the last Blood Pressure
- The last 3 Blood Pressures within a 6 month time period
- Blood Pressures trending upward
By pulling in Diagnoses information from a Problem table,

one could examine the Blood Pressures of patients with a diagnosis of Hypertension:
|
SELECT Name, Phone_Number, Date, Result_Value FROM Observation, Patient, Problem WHERE Observation.Patient_ID = Patient.Patient_ID AND Problem.Patient_ID = Patient.Patient_ID AND Test_Type = Systolic_Blood_Pressure AND Diagnosis = Hypertension |
(the attributes desired) (tables accessed)
(link tables on Patient ID) (link tables on Patient ID) (limits Results returned) (limits Diagnosis) |
By further refinement one might examine:
- Diagnosis rate of Hypertension of those with Blood Pressures exceeding a set level
- Names of patients who have not been diagnosed but have met Blood Pressure diagnostic criteria
Increasing the complexity by adding in treatment data from a Medication table could yield the following:
- Effects of Medications on blood pressure before and after treatment initiation
- Compliance with Medications regimens based on patient self-reporting and/or office questionnaires
- Prescriptions written for Medications and refills authorized
- Adverse drug events (ADEs) where a “reversal” Medication is given shortly after a treatment Medication in response to an altered physiologic measurement (Naloxone given within 30 minutes after a Morphine dose)
By extension, the clinical domain could be changed from Hypertension to Diabetes simply by a two simple changes in the SQL:
|
SELECT Name, Phone_Number, Date, Result_Value FROM Observation, Patient, Problem WHERE Observation.Patient_ID = Patient.Patient_ID AND Problem.Patient_ID = Patient.Patient_ID AND Test_Type = Hemoglobin_A1c AND Diagnosis = Diabetes |
(the attributes desired) (tables accessed)
(link tables on Patient ID) (link tables on Patient ID) * (limits Results returned) * (limits Diagnosis) |
Problems in formulating queries with Reporting Tools
It is common in formulating clinically “interesting” queries to encounter groups of “like” patients or treatments, and then compare them. This often involves selective limiting of “candidates” for analysis and in SQL terms requires the use of “subqueries” (queries embedded in queries) or Result Set “subtraction”. Virtually no reporting tool supports this explicitly with “wizard”-type query building. “Work-arounds” are possible in two ways:
- Creation of a SQL Stored Procedure placed on the database Server. (A Stored Procedure is a query that resides in the Server, rather than in the Reporting Tool, and allows use of advanced SQL syntax not supported in most Reporting Tools. This type of Stored Procedure is database “safe” as it is “read-only” - existing data is examined, but not altered.)
- Embed the complex SQL manually into the Report Tool’s SQL query window. (This is fraught with problems as the Reporting Tool may attempt to overwrite the embedded SQL with its own, and the Reporting Tools variable names need to be “adjusted” to work with the embedded SQL).
It is also quite common to work with large blocks of text such as found in clinical documents (History & Physicals, Progress Notes, Discharge Summaries, Radiology Reports, etc.). Given limitations of many databases, information stored in blocks greater than 255 characters may be relatively “unsearchable”, especially when extracted from databases using the manufacturer’s own ODBC driver. “Work-arounds” include:
- Using the database manufacturer’s ODBC driver and limiting retrieval and searching to the first 255 characters
- Using a “high-performance” ODBC driver from a third-party vendor that allows searching on large text blocks (variously referred to as VARCHAR(>255), MEMO, CLOB in database documentation).
Extending across multiple data systems
In larger organizations, data may reside on multiple database systems. Common examples are:
- Inpatient EMR
- Outpatient EMR
- Department-specific research and/or clinical databases
- Disease specific registries (cancer, diabetes, cystic fibrosis)
- Billing systems
Extracting data then presents several challenges, amongst them:
- Ensuring that each patient is uniquely and correctly identified across each database
- Mapping clinical concepts between systems to ensure “semantic congruence” (e.g. does the concept of “leg” in System A refer to “thigh” and “lower leg” together as in “common usage” whereas in System B “leg” refers to “lower leg” only as in the Universal Medical Language System?)
- Determining what “episode of care” a given event refers to (e.g. Lab Results applied to the current hospital admission vs. last admission)
- Reconciling the concept of “normal” values (a normal value from Lab A may be different from Lab B given internal criteria of “normal” and test methodology used)
- Determining how results are represented –
o As “numbers” only, as a mixture of “numbers” and characters
o As single numbers only or ranges of numbers
More subtle changes may take place over the life of the database
- Data elements mapped to the same medical concept may subtly change with data-entry form version changes (e.g. a question asked in a slightly different way may be mapped to the same data element.)
- Clinical knowledge may have changed when mapped to the same data element (e.g. over a period of two years, the concept of “SARS” changed from an ill-defined symptom complex – Severe Acute Respiratory Syndrome – to being associated with a specific causative agent – the SARS corona virus.)
- New concept terms may replace older ones. Without being aware of these changes, it is possible for the analyst to “lose” reported data out of ignorance.
Beyond these concerns are the technical details of data extraction “mechanics”:
- If different databases are used, then “compatible” ODBC drivers must be used. Experience has shown that only “high-performance” ODBC drivers ensure maximal inter-database compatibility (this is true between relational databases such as Access, SQL-Server, and Oracle, and also using Caché’s own non-relational ODBC driver with other relational drivers).
- Reconciling individual datatypes (Number, Character, and their subdivisions) across the database systems.
In some cases, data is similar enough that a single-cross database can be performed. In more complex cases, an intermediate database may need to be constructed using the ETL process described for warehouses.
Formal Statistical Analysis
Examining data screened or “rearranged” by a reporting tool may suggest plausible clinical hypotheses to be tested with formal statistical packages (with the usual caveat regarding post-hoc data “drudging” and testing multiple hypotheses without a Bonferroni or similar correction). Conversely, data abstracted for a formal research study may be reformatted for documentation and/or presentation purposes using a standard Reporting Tool.
The same SQL queries used with Reporting Tools can be used with standard formal statistical packages such as SAS, SPSS and Stata. Using these queries require either a one or two-pass setup process:
- For SAS, use PROC SQL
- For SPSS, use ODBC LOAD
- For Stata, use the menu-driven ODBC data connection tool
Summary
EMR data can be invaluable for improving practice efficiency, quality and return on investment. Implementing this is a non-trivial process that requires:
- Proper structuring of the entered data on the front end of the EMR
- Intimate understanding of the underlying EMR database schema to ensure that all relevant data elements are captured for analysis
- Knowledge of enough SQL to ensure that the elements captured reflect “clinical reality”
- An intimate partnership between the information requester (often the clinician) and the database analyst who generates reports to ensure that the requested data and its presentation meet the requester’s requirement
- Realization that data extraction and analysis is an “iterative” process – several passes are usually required before an initial report is considered “finished”
If the process has gone well, the reporting process is often easily generalized across multiple clinical domains – what was previously a single QA process report can be easily modified as different care improvement areas are identified.