All in One View

Content from What is OMOP?


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • What is OMOP?
  • Why is using a standard important in healthcare data?
  • How do OMOP tables relate to each other?
  • What are concept_ids and how can we get an humanly readable name for them?

Objectives

  • Examine the diagram of the OMOP tables and the data specification
  • Understand OMOP standardization and vocabularies
  • Connect to an OMOP database and explore the concept table
  • Get a humanly readable name for a concept_id

Setting up R


Getting started

The “Projects” interface in RStudio not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it). The interface also (optionally) preserves custom settings and open files to make it easier to resume work after a break.

Create a new project

Connect to a database

For this episode we will be using the CDMConnector package to connect to an OMOP Common Data Model database. We define a function that will open this package and connect an appropriate dataset. It is listed below but you will also find it in the workshop/code/CDMConnector directory that you should have downloaded. This package also contains synthetic example data that can be used to demonstrate querying the data.

R

library(dplyr)

# Connect to GiBleed if not already connected
if (!exists("cdm") || !inherits(cdm, "cdm_reference")) {
  db_name <- "GiBleed"
  CDMConnector::requireEunomia(datasetName = db_name)
  con <- DBI::dbConnect(duckdb::duckdb(),
                        dbdir = CDMConnector::eunomiaDir(datasetName = db_name))
  cdm <- CDMConnector::cdmFromCon(con, cdmSchema = "main", writeSchema = "main")
}

OUTPUT


Download completed!

Make sure everyone - has R open - has a project - has managed to connect to the database

Introduction


OMOP is a format for recording Electronic Healthcare Records. It allows you to follow a patient journey through a hospital by linking every aspect to a standard vocabulary thus enabling easy sharing of data between hospitals, trusts and even countries.

OMOP CDM Diagram

A diagram showing the tables that occur in the OMOP-CDM , how they relate to each other and standard vocabularies.
The OMOP Common Data Model

OMOP CDM stands for the Observational Medical Outcomes Partnership Common Data Model. You don’t really need to remember what OMOP stands for. Remembering that CDM stands for Common Data Model can help you remember that it is a data standard that can be applied to different data sources to create data in a Common (same) format. The table diagram will look confusing to start with but you can use data in the OMOP CDM without needing to understand (or populate) all 37 tables.

Challenge

Challenge

Look at the OMOP-CDM figure and answer the following questions:

  1. Which table is the key to all the other tables?

  2. Which table allows you to distinguish between different stays in hospital?

  1. The Person table

  2. The Visit_occurrence table

Why use OMOP?


A diagram showing that different sources of data, transformed to OMOP, can then be used by multiple analysis tools.
Rationale for the use of OMOP-CDM

Once a database has been converted to the OMOP CDM, evidence can be generated using standardized analytics tools. This means that different tools can also be shared and reused. So using OMOP can help make your research FAIR.

Check that everyone knows what FAIR stands for

Read in the database as above.

The data themselves are not actually read into the created cdm object. Rather it is a reference that allows us to query the data from the database.

Typing names(cdm) will give a summary of the tables in the database and we can look at these individually using the $ operator and the colnames command.

OMOP Tables

R

names(cdm)

OUTPUT

 [1] "person"                "observation_period"    "visit_occurrence"
 [4] "visit_detail"          "condition_occurrence"  "drug_exposure"
 [7] "procedure_occurrence"  "device_exposure"       "measurement"
[10] "observation"           "death"                 "note"
[13] "note_nlp"              "specimen"              "fact_relationship"
[16] "location"              "care_site"             "provider"
[19] "payer_plan_period"     "cost"                  "drug_era"
[22] "dose_era"              "condition_era"         "metadata"
[25] "cdm_source"            "concept"               "vocabulary"
[28] "domain"                "concept_class"         "concept_relationship"
[31] "relationship"          "concept_synonym"       "concept_ancestor"
[34] "source_to_concept_map" "drug_strength"        

Looking at the column names in each table

R

colnames(cdm$person)

OUTPUT

 [1] "person_id"                   "gender_concept_id"
 [3] "year_of_birth"               "month_of_birth"
 [5] "day_of_birth"                "birth_datetime"
 [7] "race_concept_id"             "ethnicity_concept_id"
 [9] "location_id"                 "provider_id"
[11] "care_site_id"                "person_source_value"
[13] "gender_source_value"         "gender_source_concept_id"
[15] "race_source_value"           "race_source_concept_id"
[17] "ethnicity_source_value"      "ethnicity_source_concept_id"
Challenge

Challenge

How do you think the visit_occurrence table is used to connect to the person table?

R

colnames(cdm$visit_occurrence)

OUTPUT

 [1] "visit_occurrence_id"           "person_id"
 [3] "visit_concept_id"              "visit_start_date"
 [5] "visit_start_datetime"          "visit_end_date"
 [7] "visit_end_datetime"            "visit_type_concept_id"
 [9] "provider_id"                   "care_site_id"
[11] "visit_source_value"            "visit_source_concept_id"
[13] "admitting_source_concept_id"   "admitting_source_value"
[15] "discharge_to_concept_id"       "discharge_to_source_value"
[17] "preceding_visit_occurrence_id"

Looking at both tables we can see that they both have a column labelled person_id which could be used to link them together.

Notice that the visit_concept_id column in the visit_occurrence table is also a concept_id. This concept_id can be used to find out more information about the type of visit (e.g. inpatient, outpatient etc) by looking it up in the concept table. In this case the visit_concept_id is 9201 which relates to an inpatient visit. We can find this out by filtering the concept table for concept_id 9201 and selecting the concept_name column.

R

cdm$concept |>
  filter(concept_id == 9201) |>
  select(concept_name)

OUTPUT

# Source:   SQL [?? x 1]
# Database: DuckDB 1.4.1 [unknown@Linux 6.8.0-1044-azure:R 4.5.3//tmp/Rtmp81g5qq/file586f1575b8ae.duckdb]
  concept_name
  <chr>
1 Inpatient Visit

CODING_NOTE: We use filter to identify the row(s) we want and select to choose the column(s) we want. These are functions from the dplyr package that can make code clearer by chaining instructions together and can be used for a local R object or database connection. If we were working with a local R object we could use also base R code: cdm$concept$concept_name[cdm$concept$concept_id == 9201] to get the same result - however this would command would not work with a database connection.

A useful function

Finding the humanly readable name for a concept_id will be a useful function. We can create a function get_concept_name() that takes the ‘cdm’ object and a concept_id as an input and returns the concept_name.

Challenge

Challenge

Create the function get_concept_name() that takes the ‘cdm’ object and a concept_id as an input and returns the concept_name.

R

get_concept_name <- function(cdm_obj, id) {
  cdm_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    pull()
}

Explanation of function code

  • The function is called get_concept_name and it takes two arguments, cdm_obj and id.
  • Inside the function, we query the concept table from the cdm_obj object.
  • We use the filter function to select rows where the concept_id matches the input id.
  • We then use select to choose only the concept_name column from the filtered results.
  • Finally, we use pull() to extract the concept_name as a vector, which is returned by the function. We need to use this because we are querying a remote database, not one that is local.
Key Points
  • Using a standard makes it much easier to share data
  • OMOP uses concepts to link different tables together
  • The concept table contains humanly readable names for concept_ids

Content from Exploring OMOP concepts with R


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • Find the vocabulary, domain and concept_class for a given concept_id
  • Establish whether a concept_id is a standard concept
  • Find all concepts within a given domain
  • Find all concepts within a given vocabulary

Objectives

  • Understand that concepts have additional attributes such as vocabulary, domain, classand standard concept status
  • Use R to query the concept table for specific attributes of concepts
  • Filter concepts based on domain, vocabulary and class
  • Identify standard concepts within the OMOP vocabulary

Introduction


The primary purpose of the concept table is to provide a standardised representation of medical Concepts, allowing for consistent querying and analysis across healthcare databases. Users can join the concept table with other tables in the CDM to enrich clinical data with Concept information or use the concept table as a reference for mapping clinical data from source terminologies to Standard or other Concepts.

An OMOP concept_id is a unique integer identifier. These are defined in the OMOP concept table with a corresponding name and other attributes. OMOP contains concept_ids for other medical vocabularies such as SNOMED and LOINC, which OMOP terms as source vocabularies.

Nearly everything in a hospital can be represented by an OMOP concept_id.

Looking up OMOP concepts


OMOP concepts can be looked up in Athena an online tool provided by OHDSI (Observational Health Data Sciences and Informatics, a collaboration that continues to develop OMOP).

The CDMConnector package allows connection to an OMOP Common Data Model in a database. It also contains synthetic example data that can be used to demonstrate querying the data.

In the previous episode we set up the CDMConnector package to connect to an OMOP Common Data Model database and used it to look at the concepts table. We also created the function get_concept_name() to get a humanly readable name for a concept_id. We will use these again in this episode.

Setting up the connection

R

library(CDMConnector)
library(dplyr)

db_name <- "GiBleed"
CDMConnector::requireEunomia(datasetName = db_name)

db <- DBI::dbConnect(duckdb::duckdb(),
                     dbdir = CDMConnector::eunomiaDir(datasetName = db_name))

cdm <- CDMConnector::cdmFromCon(con = db, cdmSchema = "main",
                                writeSchema = "main")

Exploring the concept table


R

colnames(cdm$concept)

OUTPUT

 [1] "concept_id"       "concept_name"     "domain_id"        "vocabulary_id"
 [5] "concept_class_id" "standard_concept" "concept_code"     "valid_start_date"
 [9] "valid_end_date"   "invalid_reason"  

The concept table contains the following columns:

Column Names Description of content
concept_id Unique identifier for the concept.
concept_name Name or description of the concept.
domain_id The domain to which the concept belongs (e.g. Condition, Drug).
vocabulary_id The vocabulary from which the concept originates (e.g. SNOMED, RxNorm).
concept_class_id Classification within the vocabulary (e.g. Clinical Finding, Ingredient).
standard_concept ‘S’ for standard concepts that come from internationally accepted standard vocabularies.
concept_code Code used by the source vocabulary to identify the concept.
valid_start_date Date the concept became valid in OMOP.
valid_end_date Date the concept ceased to be valid.
invalid_reason Reason for invalidation, if applicable.

The concept table is the main table for looking up information about concepts. We can use R to query the concept table for specific attributes of concepts.

Challenge

Challenge

Answer the following questions using R and the concept table:

  1. How many entries are there in the concept table?

  2. How many distinct vocabularies are there in the concept table?

  3. How many distinct domains other than ‘None’ are there in the concept table?

  4. How many distinct concept_classes are there in the concept table?

  1. How many entries are there in the concept table?

R

cdm$concept |>
  summarise(n_concepts = n()) |>
  collect()

OUTPUT

# A tibble: 1 × 1
  n_concepts
       <dbl>
1        444

Answer: There are 444 entries in the concept table. This is a tiny fraction of the overall table which can be found at Athena

CODING_NOTE: The function n() counts the number of rows in the table and summarise() creates a summary table with that count. These functions are part of the dplyr package. When you have loaded the library once your environment will remember it for the rest of the session. We are also using collect at the end of the pipe to ensure that the data is converted to a local object in R. You can get away without using it at this stage, as R works out that you are getting a single value, but you will often need to use collect with database connections.

  1. How many distinct vocabularies are there in the concept table?

R

cdm$concept |>
  summarise(n_distinct_vocabularies = n_distinct(vocabulary_id)) |>
  collect()

OUTPUT

# A tibble: 1 × 1
  n_distinct_vocabularies
                    <dbl>
1                       9

Answer: There are 9 distinct vocabularies used in this dataset.

CODING_NOTE: The function n_distinct(x) counts the number of distinct values in the column x.

  1. How many distinct domains other than ‘None’ are there in the concept table?

R

cdm$concept |>
  filter(domain_id != "None") |>
  summarise(n_distinct_domains = n_distinct(domain_id)) |>
  collect()

OUTPUT

# A tibble: 1 × 1
  n_distinct_domains
               <dbl>
1                  8

Answer: There are 8 distinct domains other than ‘None’ in this dataset.

CODING_NOTE: We use the filter() function to filter out rows where the domain_id is ‘None’ before counting the distinct domains.

  1. How many distinct concept_classes are there in the concept table?

R

cdm$concept |>
  summarise(n_distinct_concept_classes = n_distinct(concept_class_id)) |>
  collect()

OUTPUT

# A tibble: 1 × 1
  n_distinct_concept_classes
                       <dbl>
1                         21

Answer: There are 21 distinct concept_classes used in this dataset.

Filtering concepts by domain, vocabulary, class and standard concept status


Let’s look into filtering concepts based on their domain, vocabulary, concept_class and standard_concept status.

Challenge

Challenge

List the first ten rows of the concept table, ordered by concept_id. List only the concept_id, domain_id, vocabulary_id, concept_class_id and standard_concept columns.

R

cdm$concept |>
  arrange(concept_id) |>
  filter(row_number() <= 10) |>
  select(concept_id,
         domain_id,
         vocabulary_id,
         concept_class_id,
         standard_concept) |>
  collect()

OUTPUT

# A tibble: 10 × 5
   concept_id domain_id vocabulary_id concept_class_id standard_concept
        <int> <chr>     <chr>         <chr>            <chr>
 1          0 Metadata  None          Undefined        <NA>
 2       8507 Gender    Gender        Gender           S
 3       8532 Gender    Gender        Gender           S
 4       9201 Visit     Visit         Visit            S
 5       9202 Visit     Visit         Visit            S
 6       9203 Visit     Visit         Visit            S
 7      28060 Condition SNOMED        Clinical Finding S
 8      30753 Condition SNOMED        Clinical Finding S
 9      78272 Condition SNOMED        Clinical Finding S
10      80180 Condition SNOMED        Clinical Finding S               

CODING_NOTE: The arrange() function orders the rows by concept_id. The filter(row_number() <= 10) function filters to the first 10 rows. The select() function selects only the specified columns. We have to use collect() to pull the data into R memory to view it. This is because we are querying a remote database, rather than a local object. In the previous challenge R would figure out that it should display the calculated result, but here we need to be explicit.

Look at vocabularies

Vocabulary: The source or system of coding for concepts, such as SNOMED, RxNorm, LOINC, or ICD‑10. OMOP maps many vocabularies into a common, standardised set so different coding systems can be analysed together.

Challenge

Challenge

List all distinct vocabularies in the concept table.

R

cdm$concept |>
  filter(!is.na(vocabulary_id)) |>
  distinct(vocabulary_id) |>
  arrange(vocabulary_id) |>
  pull(vocabulary_id)

OUTPUT

[1] "Gender"  "ICD10CM" "LOINC"   "NDC"     "Visit"   "RxNorm"  "CVX"
[8] "SNOMED"  "None"   

CODING_NOTE: Here we can use pull(x) to pull the data x into R memory to view it. This is because we are only requiring one column of data, so we can pull that column directly into R memory without needing to use collect() first.

Look at domains

Domain: A high‑level category that groups concepts by what they represent in clinical data, such as Condition, Drug, Procedure, Measurement, or Observation. A concept’s domain determines which OMOP table it belongs to and how it’s used analytically.

Challenge

Challenge

List all distinct domains in the concept table.

R

cdm$concept |>
  filter(!is.na(domain_id)) |>
  distinct(domain_id) |>
  arrange(domain_id) |>
  pull(domain_id)

OUTPUT

[1] "Drug"        "Measurement" "Observation" "Visit"       "Metadata"
[6] "Gender"      "Condition"   "Procedure"  

Look at concept classes

Class: A more detailed category that groups concepts within a domain by what they represent in clinical data.

Challenge

Challenge

List all distinct concept_classes in the concept table that are in the “Drug” domain.

R

cdm$concept |>
  filter(domain_id == "Drug") |>
  filter(!is.na(concept_class_id)) |>
  distinct(concept_class_id) |>
  arrange(concept_class_id) |>
  pull(concept_class_id)

OUTPUT

 [1] "Clinical Drug"       "Quant Clinical Drug" "CVX"
 [4] "Ingredient"          "11-digit NDC"        "Branded Pack"
 [7] "Clinical Drug Comp"  "Branded Drug"        "Quant Branded Drug"
[10] "Branded Drug Comp"  

Look at non standard concepts

A standard concept is the preferred, harmonised code in OMOP that represents a clinical idea across vocabularies. Standard concepts (standard_concept = “S”) are the target of mappings from source codes, and they define which domain and table the data belong to for consistent analysis. However, OMOP also include nonstandard concepts from sources that are not globally used but maybe useful locally. dm+d, the NHS Dictionary of Medicines and Devices is one such vocabulary that is included in OMOP but is not a standard vocabulary.

Challenge

Challenge

Find any nonstandard concepts (i.e. concepts where standard_concept is not ‘S’) by filtering the concept table. List the first 10 concept_ids of nonstandard concepts. Then look up their concept_name, domain_id, vocabulary_id and standard_concept status.

R

cdm$concept |>
  filter(is.na(standard_concept) | standard_concept != "S") |>
  slice_min(order_by = concept_id, n = 10, with_ties = FALSE) |>
  pull(concept_id)

OUTPUT

[1]        0  1569708 35208414 44923712 45011828

Answer: There are only four nonstandard concepts in this dataset: 1569708, 35208414, 44923712, 45011828.

CODING_NOTE: We use slice_min() to get the first 10 rows which match the filter, ordered by concept_id.

R

cdm$concept |>
  filter(concept_id %in% c(1569708, 35208414, 44923712, 45011828)) |>
  select(concept_id,
         concept_name,
         domain_id,
         vocabulary_id,
         standard_concept) |>
  collect()

OUTPUT

# A tibble: 4 × 5
  concept_id concept_name               domain_id vocabulary_id standard_concept
       <int> <chr>                      <chr>     <chr>         <chr>
1   35208414 Gastrointestinal hemorrha… Condition ICD10CM       <NA>
2   44923712 celecoxib 200 MG Oral Cap… Drug      NDC           <NA>
3    1569708 Other diseases of digesti… Condition ICD10CM       <NA>
4   45011828 Diclofenac Sodium 75 MG D… Drug      NDC           <NA>            

CODING_NOTE: We use %in% to filter for multiple concept_ids which we can list in a vector.

Now you should be able to replicate our get_concept_name() function to look up other attributes of concepts such as domain, vocabulary and standard concept status.

Challenge

Challenge

Find the domain,vocabulary and concept class for concept_id 35208414

Is this concept a standard concept?

R

get_concept_domain <- function(cdm_obj, id) {
  cdm_obj$concept |>
    filter(concept_id == id) |>
    select(domain_id) |>
    pull()
}

get_concept_vocabulary <- function(cdm_obj, id) {
  cdm_obj$concept |>
    filter(concept_id == id) |>
    select(vocabulary_id) |>
    pull()
}

get_concept_concept_class <- function(cdm_obj, id) {
  cdm_obj$concept |>
    filter(concept_id == id) |>
    select(concept_class_id) |>
    pull()
}

get_concept_standard_status <- function(cdm_obj, id) {
  cdm_obj$concept |>
    filter(concept_id == id) |>
    select(standard_concept) |>
    pull()
}

get_concept_domain(cdm, 35208414)

OUTPUT

[1] "Condition"

R

get_concept_vocabulary(cdm, 35208414)

OUTPUT

[1] "ICD10CM"

R

get_concept_concept_class(cdm, 35208414)

OUTPUT

[1] "4-char billing code"

R

get_concept_standard_status(cdm, 35208414)

OUTPUT

[1] NA

Answer:

  • The domain for concept_id 319835 is ‘Condition’.
  • The vocabulary for concept_id 319835 is ‘ICD10CM’.
  • The concept class for concept id 319835 is ‘4-char billing code’
  • This concept is not a standard concept (standard_concept = ‘NA’).
Key Points
  • Concepts have additional attributes such as vocabulary, domain, and standard concept status
  • The concept table can be queried using R to retrieve specific attributes of concepts
  • Concepts can be filtered based on their domain, vocabulary and class
  • Standard concepts are those that are recommended for use in analyses within the OMOP framework

Content from Parquet files


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • What is a parquet file?

  • How to explore and open parquet files in R?

Objectives

  • Understand the structure of parquet files.

  • Learn how to read parquet files in R.

Introduction


In this episode, we will explore parquet files, a popular file format for storing large datasets efficiently. We will learn how to read parquet files in R and understand their structure.

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

Parquet files


Parquet is a columnar storage file format that is optimized for use with big data processing frameworks. It is designed to be efficient in terms of both storage space and read/write performance. Parquet files are often used in data warehousing and big data analytics applications.

Exploring Parquet files


We have provided a function that will allow you to browse the structure of the data in the same way as we did with the database in the previous episode. This code is available below or in the downloaded workshop/code/open_omop_dataset.R file. You can source this file to load the function into your R environment.

R

open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

CODING_NOTE: This function uses the arrow package to read in the parquet files. The open_dataset() function from the arrow package allows us to read in the parquet files without having to load the entire dataset into memory. This is particularly useful when working with large datasets. The function is reasonably complex but it is designed to be flexible and work with any OMOP CDM dataset that is structured in the same way as the one we are using for this course. It will read in all the parquet files in the specified directory and create a nested list structure that allows us to easily access the different tables in the dataset. We leave it to you to explore the code and understand how it works.

Now we can use this function to open the sample OMOP CDM dataset located in the workshop/data/omop/ directory and explore it in the same way as we did with the database in the previous episode.

R

omop <- open_omop_dataset("./data/omop")

Note that the path to the data directory may be different depending on where you have stored the sample OMOP CDM dataset on your local machine.

Check the people have used the right path. Their environment should now have an entry under Data reading ‘omop List of 8’

Explore the data using the following:

R

omop

OUTPUT

$concept
FileSystemDataset with 1 Parquet file
10 columns
concept_id: int32
concept_name: string
domain_id: string
vocabulary_id: string
standard_concept: string
concept_class_id: string
concept_code: string
valid_start_date: date32[day]
valid_end_date: date32[day]
invalid_reason: string

$concept_relationship
FileSystemDataset with 1 Parquet file
6 columns
concept_id_1: int32
concept_id_2: int32
relationship_id: string
valid_start_date: date32[day]
valid_end_date: date32[day]
invalid_reason: string

$condition_occurrence
FileSystemDataset with 1 Parquet file
10 columns
condition_occurrence_id: int32
person_id: int32
condition_concept_id: int32
condition_start_date: date32[day]
condition_end_date: date32[day]
condition_type_concept_id: int32
condition_status_concept_id: int32
visit_occurrence_id: int32
condition_source_value: string
condition_source_concept_id: int32

$drug_exposure
FileSystemDataset with 1 Parquet file
12 columns
drug_exposure_id: int32
person_id: int32
drug_concept_id: int32
drug_exposure_start_date: date32[day]
drug_exposure_start_datetime: timestamp[us, tz=UTC]
drug_exposure_end_date: date32[day]
drug_exposure_end_datetime: timestamp[us, tz=UTC]
drug_type_concept_id: int32
quantity: double
route_concept_id: int32
visit_occurrence_id: int32
drug_source_concept_id: int32

$drug_strength
FileSystemDataset with 1 Parquet file
12 columns
drug_concept_id: int32
ingredient_concept_id: int32
amount_value: double
amount_unit_concept_id: int32
numerator_value: double
numerator_unit_concept_id: int32
denominator_value: double
denominator_unit_concept_id: int32
box_size: double
valid_start_date: date32[day]
valid_end_date: date32[day]
invalid_reason: string

$measurement
FileSystemDataset with 1 Parquet file
12 columns
measurement_id: int32
person_id: int32
measurement_concept_id: int32
measurement_date: date32[day]
measurement_datetime: timestamp[us, tz=UTC]
operator_concept_id: int32
value_as_number: double
value_as_concept_id: int32
unit_concept_id: int32
range_low: double
range_high: double
visit_occurrence_id: int32

$observation
FileSystemDataset with 1 Parquet file
9 columns
observation_id: int32
person_id: int32
observation_concept_id: int32
observation_date: date32[day]
observation_datetime: timestamp[us, tz=UTC]
value_as_number: int32
value_as_string: string
value_as_concept_id: int32
visit_occurrence_id: int32

$person
FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string

$procedure_occurrence
FileSystemDataset with 1 Parquet file
7 columns
procedure_occurrence_id: int32
person_id: int32
procedure_concept_id: int32
procedure_date: date32[day]
procedure_datetime: timestamp[us, tz=UTC]
procedure_type_concept_id: int32
visit_occurrence_id: int32

$visit_occurrence
FileSystemDataset with 1 Parquet file
10 columns
visit_occurrence_id: int32
person_id: int32
visit_concept_id: int32
visit_start_date: date32[day]
visit_start_datetime: timestamp[us, tz=UTC]
visit_end_date: date32[day]
visit_end_datetime: timestamp[us, tz=UTC]
visit_type_concept_id: int32
discharged_to_concept_id: int32
preceding_visit_occurrence_id: int32

You will see that this gives you a list of all the tables in this dataset and what columns they contain. It is obviously a much smaller dataset! You can explore individual tables which will also give you the column names and the data type of the entry.

R

omop$person

OUTPUT

FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string

To actually open each table we can use

R

library(dplyr)
person <- omop$person |>
  collect()

person

OUTPUT

# A tibble: 8 × 8
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1      1111              8507          1993              6           15
2      1112              8532          1970              6           15
3      1113              8507          1983              6           15
4     34567              8532          2015              6           15
5     78901              8532          1989              6           15
6        31              8532          1987              0            0
7         2              8532          2008              0            0
8        58              8507          1985              0            0
# ℹ 3 more variables: race_concept_id <int>, gender_source_value <chr>,
#   race_source_value <chr>

CODING_NOTE: The collect() function is used to actually read the data from the parquet files into memory. This is necessary because the open_dataset() function creates a reference to the data rather than loading it into memory. By using collect(), we can work with the data as a regular data frame in R.

Or we can use the specific functions from the arrow package to read in the parquet files directly.

R

library(arrow)
person <- read_parquet("./data/omop/person/part-0.parquet")
person

OUTPUT

# A tibble: 8 × 8
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1      1111              8507          1993              6           15
2      1112              8532          1970              6           15
3      1113              8507          1983              6           15
4     34567              8532          2015              6           15
5     78901              8532          1989              6           15
6        31              8532          1987              0            0
7         2              8532          2008              0            0
8        58              8507          1985              0            0
# ℹ 3 more variables: race_concept_id <int>, gender_source_value <chr>,
#   race_source_value <chr>

CODING_NOTE: The read_parquet() function from the arrow package allows us to read in a specific parquet file directly into R. This can be useful if we only want to work with a specific table from the dataset and do not need any of the other files.

Check that everyone has been able to read in the person table and see the data.

Points worth a discussion:

  • the day and month of birth

  • gender_source_value and race_source_value are given for some rows but not others

Callout

As part of the privacy preserving policies around health data, dates of birth are often de-identified to only show the year of birth. This is why in this dataset the day and month of birth are set to 15/6 or 0/0 for all individuals.

You can also see that in some cases the gender_source_value and race_source_value columns are populated while in others they are not. This depends on the policy of the individual hospital. Note: the data in this dataset is a number of different sources combined together to form a single OMOP CDM database.

Challenge

Challenge

Adapt the code we had developed for the get_concept_name function in the previous episode to work with this parquet file dataset.

R

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

CODING_NOTE: The get_concept_name() function is adapted to work with the parquet file dataset. It uses the filter() and select() functions from the dplyr package to query the concept table in the parquet dataset. The collect() function is used to read the result into memory so that we can work with it as a regular data frame in R.

Now we can use this function to look up concept names by their concept_id.

R

get_concept_name(omop, 8507)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Male        

Answer: The concept_id 8507 corresponds to the concept “Male”.

Key Points
  • Parquet files are a columnar storage file format optimized for big data processing.

  • The arrow package in R can be used to read and manipulate parquet files.

Content from More on concepts


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • Where to find other concept_ids in OMOP

  • How to link OMOP tables

Objectives

  • Understand that there are many other concept_ids in OMOP tables and that these are usually named with a _concept_id suffix.

  • Learn how to link OMOP tables using common identifiers such as person_id and visit_occurrence_id.

  • Be able to use the concept table to look up humanly readable names for various concept_ids.

  • Use joins to combine data from multiple OMOP tables based on common identifiers.

Introduction


In this episode, we will explore more concepts related to the OMOP Common Data Model (CDM). We will focus on understanding how different tables in the OMOP CDM are linked together through common identifiers. This knowledge is crucial for effectively querying and analysing healthcare data stored in the OMOP format.

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

You will then need to load the database as shown in the previous episode.

R

library(dplyr)

open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

R

omop <- open_omop_dataset("./data/omop")

and the useful function we created in the previous episode to look up concept names.

R

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

Other concept_ids in OMOP


In addition to the concept_id column in various OMOP tables, there are several other columns that use *_concept_id to provide information.

The person table contains the following columns (among others not listed here):

Column Names Description of content
person_id Unique identifier for each person
gender_concept_id Concept identifier for the sex of the person
year_of_birth Year of birth of the person
month_of_birth Month of birth of the person
day_of_birth Day of birth of the person
race_concept_id Concept identifier for the race/ethnic background of the person
gender_source_value Source value for the sex of the person
race_source_value Source value for the race/ethnic background of the person

Look at the column names of the person table.

R

omop$person

OUTPUT

FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string

Several of these columns end with _concept_id, such as gender_concept_id and race_concept_id. These columns link to the concept table to provide humanly readable names for the concepts represented by these IDs.

For example, to get the gender name for a person, you can use the gender_concept_id column in the person table and look it up in the concept table.

R

# First read in the person table
person <- omop$person |> 
  collect()

From this we can see that the gender_concept_id can have values 8507 or 8532. We can look these up in the concept table using the previously defined function get_concept_name().

R

get_concept_name(omop, 8507)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Male        

R

get_concept_name(omop, 8532)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Female      

It might also be useful to look up id values from the names. We can create a function get_concept_id() that takes a concept_name as input and returns the concept_id.

Challenge

Challenge

Create the function get_concept_id() that takes an omop_obj and a concept_name as input and returns the concept_id.

R

get_concept_id <- function(omop_obj, name) {
  omop_obj$concept |>
    filter(concept_name == name) |>
    select(concept_id) |>
    collect()
}

Check that this works by looking up the concept_id for “Female”.

R

get_concept_id(omop, "Female")

OUTPUT

# A tibble: 1 × 1
  concept_id
       <int>
1       8532

Answer: The concept_id for “Female” is 8532.

Challenge

Challenge

Using the person table and the functions get_concept_name() and get_concept_id() that we have defined, answer the following questions:

  1. What is the gender of the White patient in the person table?

  2. What is the gender of the White British patient in the person table?

  3. How many men and women are in the person table?

  1. First we need to know the concept_id of the concept White

R

get_concept_id(omop, "White")

OUTPUT

# A tibble: 1 × 1
  concept_id
       <int>
1       8527

Then we need to know which patient has this race_concept_id and what the corresponding gender_concept_id is for this patient.

R

white <- person |> 
  filter(race_concept_id == 8527)

get_concept_name(omop, white$gender_concept_id)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Female      

Answer: The White patient is female. (Note the code above assumes there is only one White patient.)

  1. Similarly, we first need to know the concept_id of the concept White British

R

get_concept_id(omop, "White British")

OUTPUT

# A tibble: 1 × 1
  concept_id
       <int>
1   46286810

Then we need to know which patient has this race_concept_id and what the corresponding gender_concept_id is for this patient.

R

white_british <- person |> 
  filter(race_concept_id == 46286810)

get_concept_name(omop, white_british$gender_concept_id) 

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 Female      

Answer: The White British patient is female. (Note the code above assumes there is only one White British patient.)

  1. The table is small enough to actually count by hand but also we can use the R package dplyr to count the number of men and women.

R

# Let's create a mini version of the concept table that contains only the concepts(the gender concepts) and the columns(concept_id, concept_name)  we want

gender_concept <- omop$concept |>
  filter(concept_id %in% c(8507, 8532)) |>
  select(concept_id, concept_name) |>
  collect()

# Now we can join to get the number of people of each gender
person |> 
  inner_join(gender_concept, by = join_by(gender_concept_id == concept_id)) |>
  group_by(concept_name) |>
  summarise(count = n())

OUTPUT

# A tibble: 2 × 2
  concept_name count
  <chr>        <int>
1 Female           5
2 Male             3

CODING_NOTE: The inner_join function is used to combine the person table with the gender_concept table we created, based on the gender_concept_id. This allows us to get the humanly readable gender names.

Linking OMOP tables


In the previous episode, we saw how to look up concept names using concept_ids. In this episode, we will explore how different OMOP tables are linked together using common identifiers.

For example, the visit_occurrence table contains a person_id column that links to the person table. This allows us to retrieve information about the person associated with a particular visit.

R

omop$visit_occurrence

OUTPUT

FileSystemDataset with 1 Parquet file
10 columns
visit_occurrence_id: int32
person_id: int32
visit_concept_id: int32
visit_start_date: date32[day]
visit_start_datetime: timestamp[us, tz=UTC]
visit_end_date: date32[day]
visit_end_datetime: timestamp[us, tz=UTC]
visit_type_concept_id: int32
discharged_to_concept_id: int32
preceding_visit_occurrence_id: int32

R

omop$person

OUTPUT

FileSystemDataset with 1 Parquet file
8 columns
person_id: int32
gender_concept_id: int32
year_of_birth: int32
month_of_birth: int32
day_of_birth: int32
race_concept_id: int32
gender_source_value: string
race_source_value: string
Challenge

Challenge

Using the person and visit_occurrence tables, answer the following questions:

  1. How many visits are recorded for each person in the visit_occurrence table
  1. We can count the number of visits for each person by grouping the visit_occurrence table by person_id and counting the number of occurrences.

R

visit_counts <- omop$visit_occurrence |>
  group_by(person_id) |>
  summarise(visit_count = n()) |>
  collect()

visit_counts

OUTPUT

# A tibble: 8 × 2
  person_id visit_count
      <int>       <int>
1      1111           1
2      1112           2
3      1113           1
4     78901           1
5     34567           1
6        31           2
7         2           2
8        58          10

CODING_NOTE: The group_by() function is used to group the data by person_id, and the summarise function is used to count the number of visits for each person. The collect() function is used to retrieve the results from the remote database.

We can use functions from dplyr to join tables together by these links. For instance, we can find all of the visits for Female patients.

R

female_patients <- omop$person |>
  filter(gender_concept_id == 8532)

female_visits <- female_patients |>
  inner_join(omop$visit_occurrence, by = join_by(person_id)) |>
  select(person_id, visit_occurrence_id, visit_start_date, visit_end_date) |>
  collect()

female_visits

OUTPUT

# A tibble: 8 × 4
  person_id visit_occurrence_id visit_start_date visit_end_date
      <int>               <int> <date>           <date>
1      1112                1102 2025-12-24       NA
2      1112                1002 2024-01-15       2024-01-29
3     34567               55667 2025-12-18       2025-12-18
4     78901               44556 2004-09-15       2004-09-18
5        31               37923 2024-07-23       2024-07-23
6        31                1222 2019-05-10       2019-05-10
7         2                1798 2025-05-16       2025-05-19
8         2                 154 2025-06-12       2025-06-12    

CODING_NOTE: We have used the inner_join method to only keep rows where both tables have the same ids, but there are others. We also used the join_by function which allows us to define which column to join on, you can choose multiple columns in a join.

Challenge

Challenge

Using the person and visit_occurrence tables, answer the following questions:

  1. How many visits did each person make, and show that person’s race as a name.
  2. How many visits were there for men and women?
  1. We already have visit counts from the previous solution, and we need to join in the person’s race, and then join in the names for that race concept.

R

# Using the same logic as the previous example
visit_counts <- omop$visit_occurrence |>
  group_by(person_id) |>
  summarise(visit_count = n())

visit_counts_with_race <- visit_counts |>
  inner_join(omop$person, by = join_by(person_id)) |>
  inner_join(omop$concept, by = join_by(race_concept_id == concept_id)) |>
  rename(race_name = concept_name) |>
  select(person_id, visit_count, race_name) |>
  collect()
  
visit_counts_with_race

OUTPUT

# A tibble: 8 × 3
  person_id visit_count race_name
      <int>       <int> <chr>
1      1111           1 Black
2      1112           2 White
3      1113           1 Asian
4     78901           1 Asian
5     34567           1 Asian
6        31           2 White British
7         2           2 Ethnicity not stated
8        58          10 Ethnicity not stated

CODING_NOTE: The rename function allows us to change the name of a column, so that it makes more sense in the final output.

  1. We use joins to get the gender of each person before summarising by their gender concept. Optionally, we can also join in the concept table

R

# Using the same logic as the previous example
counts_by_gender_concept <- omop$person |>
  inner_join(omop$visit_occurrence, by = join_by(person_id)) |>
  group_by(gender_concept_id) |>
  summarise(visit_count = n())

# we can also join in the concept name to make it human readable
counts_by_gender_concept |>
  inner_join(omop$concept, by = join_by(gender_concept_id == concept_id)) |>
  rename(gender = concept_name) |>
  select(gender, visit_count) |>
  collect()

OUTPUT

# A tibble: 2 × 2
  gender visit_count
  <chr>        <int>
1 Male            12
2 Female           8
Key Points
  • OMOP tables contain many concept_ids, usually named with a _concept_id suffix.

  • The concept table can be used to look up humanly readable names for various concept_ids.

  • OMOP tables can be linked using common identifier.

Content from Measurements and Observations


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • What are the differences between measurements and observations?

  • How to find out what measurements and observations have been recorded for a particular person?

  • How to find the values of measurements and observations and associated information such as units?

Objectives

  • Know that measurements are mainly lab results and other records like pulse rate

  • Know observations are other facts obtained through questioning or direct observation

  • Understand concept ids identify the measure or observation, values are stored in value_as_number, value_as_concept_id, or (for observation table only ) value_as_string.

  • Be able to join to the concept table to find a particular measurement or observation concept by name

Introduction


This episode covers the OMOP measurement and observation tables.

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

You will then need to load the database as shown in the previous episode.

R

open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

R

omop <- open_omop_dataset("./data/omop")

and the useful functions we created in the previous episode to look up concept names/ids.

R

library(dplyr)

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

R

get_concept_id <- function(omop_obj, name) {
  omop_obj$concept |>
    filter(concept_name == name) |>
    select(concept_id) |>
    collect()
}

The OMOP measurement and observation tables contain information collected about a person.

The difference between them is that measurement contains numerical or categorical values collected by a standardised process, whereas observation contains less standardised clinical facts. Measurements are often lab results, vital signs or other clinical measurements such as height, weight, blood pressure, pulse rate, respiratory rate, oxygen saturations etc. Observations are other facts obtained through questioning or direct observation, for example smoking status, alcohol intake, family history, symptoms reported by the patient etc.

A person can have multiple measurements and observations. Some columns are similar between measurement and observation.

Concepts and values


Data are stored as questions and answers. A question (e.g. Pulse rate) is defined by a concept_id and the answer is stored in a value column.

The measurement table contains the following columns (among others not listed here):

Column Names Description of content
measurement_id Unique identifier for each measurement
person_id Identifier for the patient
measurement_concept_id Concept identifier for the measurement
measurement_date Date the measurement was taken
measurement_datetime Date and time the measurement was taken
operator_concept_id Concept identifier for the operator (e.g. less than, greater than) used in the measurement value comparison if applicable
value_as_number The numeric value of the measurement
value_as_concept_id Concept identifier for the categorical value of the measurement
unit_concept_id Concept identifier for the unit of measurement
range_low The low end of the normal range for the measurement
range_high The high end of the normal range for the measurement
visit_occurrence_id Identifier for the visit during which the measurement was taken

The observation table contains the following columns (among others not listed here):

Column Names Description of content
observation_id Unique identifier for each observation
person_id Identifier for the patient
observation_concept_id Concept identifier for the observation
observation_date Date the observation was made
observation_datetime Date and time the observation was made
value_as_number The numeric value of the observation
value_as_string The string value of the observation
value_as_concept_id Concept identifier for the categorical value of the observation
visit_occurrence_id Identifier for the visit during which the observation was made

The measurement_concept_id or observation_concept_id columns define what has been recorded. Here are some examples :

Example Measurement concepts Example Observation concepts
Respiratory rate Respiratory function
Pulse rate Wound dressing observable
Hemoglobin saturation with oxygen Mandatory breath rate
Body temperature Body position for blood pressure measurement
Diastolic blood pressure Alcohol intake - finding
Arterial oxygen saturation Tobacco smoking behavior - finding
Body weight Vomit appearance
Leukocytes [#/volume] in Blood State of consciousness and awareness

Look at the column values we have got in the tables associated with our database.

R

print("measurement column names:")

OUTPUT

[1] "measurement column names:"

R

omop$measurement |> 
  colnames()

OUTPUT

 [1] "measurement_id"         "person_id"              "measurement_concept_id"
 [4] "measurement_date"       "measurement_datetime"   "operator_concept_id"
 [7] "value_as_number"        "value_as_concept_id"    "unit_concept_id"
[10] "range_low"              "range_high"             "visit_occurrence_id"   

R

print("measurement column names:")

OUTPUT

[1] "measurement column names:"

R

observation <- omop$observation |> 
  colnames()

observation

OUTPUT

[1] "observation_id"         "person_id"              "observation_concept_id"
[4] "observation_date"       "observation_datetime"   "value_as_number"
[7] "value_as_string"        "value_as_concept_id"    "visit_occurrence_id"   

CODING_NOTE: The colnames() function is used to get the column names of the measurement and observation tables. The print() function is used to print a label before the column names for clarity. The column names are then printed to the console, as the default behaviour when you don’t assign an output to a variable.

In the case of the observation table, we assign the column names to a variable observation and then print it. This is just to demonstrate that you can assign the column names to a variable if you want to use them later in your code.

Challenge

Challenge

Looking at the measurement and observation tables identify the various columns that might store a value and columns which help you make sense of what a value might mean.

The various value columns store values :

column name data type example concept_name
value_as_number numeric value 1.2 -
unit_concept_id units of the numeric value 9529 kilogram
value_as_concept_id categorical value 4328749 High
operator_concept_id optional operators 4172704 >

Note where values are a concept_id, the name of that concept can be looked up in the concept table that is part of the OMOP vocabularies and included in most CDM instances.

You can see from the column names within the tables that for an observation the value can be a string, a number or a concept, whereas for a measurement the value can be a number accompanied by a unit concept or the value can be a concept.

Looking at observation values


Let’s focus on observations.

We could go through each table and use our get_concept_name function to work out what all these measurements and observations are, but that could get a bit tedious!

Let’s try and join to the concept table and produce a table that gives us the humanly readable names to start with.

Challenge

Challenge

By joining to the concept table produce a version of the observation table with concept names. Only include columns that are relevant to the value.

R

# Pre-load concept names and ids
concepts <- omop$concept |> 
  select(concept_id, concept_name) |>
  collect()

# Create a mini observation table with only the columns relevant to value
mini_observation <- omop$observation |>
  select(observation_id, person_id, observation_concept_id, value_as_concept_id, value_as_number) |>
  collect()

# Join to get names of the observation concept id
# Rename the new column to observation_concept_name
# Relocate the new column to be after observation_concept_id
mini_observation <- mini_observation |>
  inner_join(concepts, by=join_by(observation_concept_id == concept_id)) |>
  rename(observation_concept_name = concept_name) |>
  relocate(observation_concept_name, .after = observation_concept_id)

# Repeat the join to get names of the value concept id
mini_observation <- mini_observation |>
  left_join(concepts, by = join_by(value_as_concept_id == concept_id)) |>
  rename(value_as_concept_name = concept_name) |>
  relocate(value_as_concept_name, .after = value_as_concept_id)

CODING_NOTE: In the above code we first read in the concept table and use select to get only the concept_id and concept_name columns to create a smaller table of concepts. Remember that we use collect() to bring the data into memory. Then we create a mini version of the observation table that only contains the columns relevant to the value. We then use inner_join to join this mini observation table to the smaller concepts table joining on the observation_concept_id to get the name of the observation concept. We use rename to rename this new column to observation_concept_name and relocate to move it to be after the observation_concept_id column. We then repeat this process to join to the concepts table by the value_as_concept_id to get the name of the value concept, rename it to value_as_concept_name and relocate it to be after the value_as_concept_id column. We could have done this in one step by joining to the concept table twice in the same code chunk, but we have done it in two steps so that you can inspect it midway. The process is the same whether you join to the concept table once or twice, you just need to specify the correct join condition and rename the new columns appropriately.

Now we can look at this named table.

R

tibble::view(mini_observation)
mini_observation

OUTPUT

# A tibble: 30 × 7
   observation_id person_id observation_concept_id observation_concept_name
            <int>     <int>                  <int> <chr>
 1           6001      1111                4160001 Clinical finding present
 2           6002      1111                4203130 Discharge from hospital
 3           6003      1112                4138933 Admission to intensive care …
 4           6004      1112               45772969 On ventilator
 5           6005      1112                4203130 Discharge from hospital
 6           6006      1112                4103640 Amputated foot
 7           6007      1112                4160001 Clinical finding present
 8           6008      1113                4024958 Throat culture
 9           6009      1113                4232313 Microbial identification kit…
10           6010      1113                4160001 Clinical finding present
# ℹ 20 more rows
# ℹ 3 more variables: value_as_concept_id <int>, value_as_concept_name <chr>,
#   value_as_number <int>

CODING_NOTE: The tibble::view() function is used to open the mini_observation data frame in a spreadsheet-like viewer in RStudio. This allows us to easily explore the data and see the humanly readable names for the observation concepts and value concepts. it is more readable than using print() or simply typing the name of the data frame in the console, especially if the data frame has many rows and columns.

Social indexes

Could be skipped if short of time

It is interesting to note that some observations relate to social indexes such as deprivation indices. As noted in the title these are observations made in England only.

Challenge

Challenge

Create a mini version of the concepts table that contains only the concepts relating to social indices. These concepts are those with concept_id 35812888, 35812884, 35812883, 35812882, 35812883, 35812885.

R

social_concepts <- omop$concept |>
  filter(concept_id %in% c(35812888, 35812884, 35812883, 35812882, 35812883, 35812885)) |>
  collect()

social_concepts

OUTPUT

# A tibble: 5 × 10
  concept_id concept_name               domain_id vocabulary_id standard_concept
       <int> <chr>                      <chr>     <chr>         <chr>
1   35812882 Index of Multiple Depriva… Observat… UK Biobank    ""
2   35812883 Income score (England)     Observat… UK Biobank    ""
3   35812884 Employment score (England) Observat… UK Biobank    ""
4   35812885 Health score (England)     Observat… UK Biobank    ""
5   35812888 Crime score (England)      Observat… UK Biobank    ""
# ℹ 5 more variables: concept_class_id <chr>, concept_code <chr>,
#   valid_start_date <date>, valid_end_date <date>, invalid_reason <chr>

CODING_NOTE: We use %in% to filter on each concept_id in the list given. The collect() function is then used to bring this filtered data into memory so that we can work with it as a regular data frame in R. We then display the resulting table of social concepts by simply typing the name of the data frame social_concepts.

This is an instance of a nonstandard concept being used within OMOP.

Comment on the fact that these concepts are nonstandard and that this is an example of how local data can be mapped to OMOP concepts even if they are not part of the standard vocabulary.

Looking at measurement values


Let’s now look at measurements. As we said before, measurements are often numerical values with associated units. This can arise from lab results or vital signs.

Challenge

Challenge

Consider the concept with the name Heart rate. Use the measurement and concept tables to answer the following question:

  1. What are the units associated with this measurement concept?

  2. What is the average value recorded for this measurement across all persons?

  3. What class of concept is this measurement concept?

  1. What are the units associated with Heart rate?

R

# Get the concept id for Heart rate  
heart_rate_id <- get_concept_id(omop, "Heart rate") |>
  pull(concept_id)

heart_rate_id

OUTPUT

[1] 3027018

R

# Filter measurement table for this concept id
heart_rate_measurements <- omop$measurement |>
  filter(measurement_concept_id == heart_rate_id) 

heart_rate_units <- heart_rate_measurements |>
  distinct(unit_concept_id) |>
  pull()

get_concept_name(omop, heart_rate_units)

OUTPUT

# A tibble: 1 × 1
  concept_name
  <chr>
1 per minute  

Answer: The units associated with the Heart rate measurement concept are per minute.

CODING_NOTE: We first use the get_concept_id() function to get the concept_id for “Heart rate”. We then filter the measurement table for rows where the measurement_concept_id matches this concept_id. We then use distinct() to get the unique unit concept ids associated with these heart rate measurements, and then collect() to bring it into memory. Finally, we use our get_concept_name() function to look up the names of these unit concepts. Note that we use the variable heart_rate_id to store the id so that we can use it in subsequent code.

  1. What is the average value recorded for Heart rate across all persons?

R

average_heart_rate <- heart_rate_measurements |>
  collect() |>
  summarise(heart_rate =   mean(value_as_number, na.rm = TRUE)
)

average_heart_rate

OUTPUT

# A tibble: 1 × 1
  heart_rate
       <dbl>
1         95

Answer: The average value recorded for the Heart rate measurement concept across all persons is 95 beats per minute.

CODING_NOTE: Here we are using dplyr’s summarise() function to apply a function to the entire dataframe. We use the mean() function to calculate the average of the value_as_number column in the heart_rate_measurements data frame. We set na.rm = TRUE to ignore any missing values when calculating the mean.

  1. Get the class of concept for Heart rate

R

heart_rate_class <- omop$concept |>
  filter(concept_id == heart_rate_id) |>
  select(concept_class_id) |>
  collect() 

heart_rate_class

OUTPUT

# A tibble: 1 × 1
  concept_class_id
  <chr>
1 Clinical Observation

Answer: The class of concept for Heart rate is Clinical Observation.

CODING_NOTE: We filter the concept table for the row where the concept_id matches the heart_rate_id. We then select the concept_class_id column to get the class of concept. Finally, we use collect() to bring this data into memory and display it.

You may have noticed that one of the column names in the measurement table is operator_concept_id. This column is used to store optional operators that can be used to indicate whether a measurement value is greater than, less than, equal to etc. a certain value. For example, if a measurement value is recorded as “> 10”, the value_as_number column would contain the number 10 and the operator_concept_id column would contain the concept id for the “greater than” operator. This allows us to capture measurements that are recorded in this way while still being able to work with the numeric value in the value_as_number column.

Challenge

Challenge

Create a version of the measurement table containing columns: measurement_id, measurement_concept_id, operator_concept_id, value_as_number, range_low and range_high for person_id = 31, leaving out any rows where operator_concept_id is 0. Now add a column for the names of the measurement and operator concepts.

Using this reduced measurement table to list the measurements made for this person.

What range should the C reactive protein measurement for this person fall within?

R

# Create a mini version of the measurement table for person_id 31
mini_measurement <- omop$measurement |>
  filter(
    person_id == 31,
    !is.na(operator_concept_id) & operator_concept_id != 0
    ) |>
  select(measurement_id, measurement_concept_id, operator_concept_id, value_as_number, range_low, range_high) |>
  collect() 

# Join to get names of the measurement concept id
mini_measurement <- mini_measurement |>
  left_join(concepts, by = join_by(measurement_concept_id == concept_id)) |>
  rename(measurement_concept_name = concept_name) |>
  relocate(measurement_concept_name, .after = measurement_concept_id)

# Join to get names of the operator concept id
mini_measurement <- mini_measurement |>
  left_join(concepts, by = join_by(operator_concept_id == concept_id)) |>
  rename(operator_concept_name = concept_name) |>
  relocate(operator_concept_name, .after = operator_concept_id)

tibble::view(mini_measurement)
mini_measurement

OUTPUT

# A tibble: 6 × 8
  measurement_id measurement_concept_id measurement_concept_name
           <int>                  <int> <chr>
1         351796                4301868 Pulse rate
2         351800                4313591 Respiratory rate
3         354289                4011919 Hemoglobin saturation with oxygen
4         354292               44810247 LACE (length of stay, acuity of automat…
5         354293                3020460 C reactive protein [Mass/volume] in Ser…
6         354294               46236952 Glomerular filtration rate [Volume Rate…
# ℹ 5 more variables: operator_concept_id <int>, operator_concept_name <chr>,
#   value_as_number <dbl>, range_low <dbl>, range_high <dbl>

Answer: The measurements made for person_id 31 are: Pulse rate = 61 Respiratory rate = 16 Hemoglobin saturation with oxygen = 100 LACE = 4 C reactive protein [Mass/volume] in Serum or Plasma < 0.6 Glomerular filtration rate > 90

The C reactive protein measurement for this person should fall within the range 0 - 5 mg/L.

CODING_NOTE: We first filter the measurement table for rows where person_id is 31 and operator_concept_id is filled and not 0, and select only the relevant columns. We then use collect() to bring this data into memory. We then join to the concepts table to get the names of the measurement concepts and operator concepts, renaming the new columns appropriately and relocating them for better readability. Finally, we display the resulting mini_measurement data frame which contains the measurements made for person_id 31 along with the names of the measurement and operator concepts.

We leave it as an exercise for the student to look up the units of each measurement.

There is also a column called value_as_concept_id which can be used to store categorical values for measurements. For example, if a measurement is recorded as “High”, the value_as_concept_id column would contain the concept_id for “High”. This allows us to capture measurements that are recorded in this way while still being able to work with the categorical value in the value_as_concept_id column.

Challenge

Challenge

Using the measurement table, find all measurements that have a value recorded in the value_as_concept_id column and join to the concept table to get the names of these measurements and values. Display only the unique set of measurement concept names and value concept names.

R

categorical_measurements <- omop$measurement |>
  filter(!is.na(value_as_concept_id) & value_as_concept_id != 0) |>
  select(measurement_id, measurement_concept_id, value_as_concept_id)

# Join to get names of the measurement concept id
categorical_measurements <- categorical_measurements |>
  left_join(concepts, by = join_by(measurement_concept_id == concept_id)) |>
  rename(measurement_concept_name = concept_name) |>
  relocate(measurement_concept_name, .after = measurement_concept_id)

# Join to get names of the value concept id
categorical_measurements <- categorical_measurements |>
  inner_join(concepts, by = join_by(value_as_concept_id == concept_id)) |>
  rename(value_as_concept_name = concept_name) |>
  relocate(value_as_concept_name, .after = value_as_concept_id)

# Get unique set of measurement concept names and value concept names
categorical_measurements <- categorical_measurements |>
  select(measurement_concept_name, value_as_concept_name) |>
  distinct() |>
  collect()

tibble::view(categorical_measurements)
categorical_measurements

OUTPUT

# A tibble: 4 × 2
  measurement_concept_name                                 value_as_concept_name
  <chr>                                                    <chr>
1 Alert Confusion Voice Pain Unresponsiveness scale        Mentally alert
2 Blood group antibody screen [Presence] in Serum or Plas… Not present
3 Rh [Type] in Blood                                       Positive
4 ABO group [Type] in Blood                                Candida sp identify …

Answer: The resulting categorical_measurements data frame contains all measurements that have a categorical value recorded along with the names of the measurement and value concepts.

CODING_NOTE: We first filter the measurement table for rows where value_as_concept_id is not missing, and select only the relevant columns. We then join to the concepts table to get the names of the measurement concepts and value concepts, renaming the new columns appropriately and relocating them for better readability. The we then select only the measurement concept name and value concept name columns, and remove duplicates using distinct(). Finally, we display the resulting categorical_measurements data frame which contains all measurements that have a unique categorical value recorded along with the names of the measurement and value concepts. We delay using collect() to as late as possible, so that we only bring things into memory when needed.

Key Points
  • Measurements are mainly lab results and other clinical measurement records like pulse rate
  • Observations are other facts obtained through questioning or direct observation
  • Concept ids identify the measure or observation and values are stored in value_as_number or value_as_concept_id
  • We can join to the concept table to find a particular measurement or observation concept by name

Content from Conditions and Visits


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • What are conditions in the OMOP CDM?

  • When do we need to consider visits in our analysis?

Objectives

  • Understand the structure and purpose of the conditions table in the OMOP CDM.

  • Know that visits are recorded in the visit_occurrence table.

  • Learn how and when to consider visits in data analysis.

  • Know that a visit is a period of time and patients can have multiple visits.

  • Understand that multiple measurements, conditions etc. can occur within a visit.

  • Understand that other tables link to visits.

Introduction


This episode covers the OMOP conditions and visits table.

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

You will then need to load the database as shown in the previous episode.

R

library(dplyr)
open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

R

omop <- open_omop_dataset("./data/omop")

and the useful functions we created in the previous episode to look up concept names and ids.

R

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

R

get_concept_id <- function(omop_obj, name) {
  omop_obj$concept |>
    filter(concept_name == name) |>
    select(concept_id) |>
    collect()
}

Conditions


Conditions are a key part of the OMOP CDM. They represent diagnoses that have been made for patients. Conditions are stored in the condition_occurrence table. Each record in this table represents a single occurrence of a condition for a patient. The table contains records of diseases, medical conditions, diagnoses, signs, or symptoms observed by providers or reported by patients. Conditions are mapped from diagnostic codes and represented using standardized concepts in a hierarchical structure.

The condition_occurrence table contains the following columns (among others not listed here):

Column Names Description of content
condition_occurrence_id Unique identifier for each condition occurrence
person_id Identifier for the patient
condition_concept_id Concept identifier for the condition
condition_start_date Date the condition started
condition_end_date Date the condition ended
condition_type_concept_id Concept identifier for the type of condition occurrence
condition_status_concept_id Concept identifier for the status of the condition occurrence
visit_occurrence_id Identifier for the visit during which the condition was recorded
condition_source_value Original value from the source data
condition_source_concept_id Concept identifier for the source value
Challenge

Challenge

  1. How many records are there in the condition_occurrence table?

  2. List any of the conditions that occur more than once in the table along with their humanly readable names.

  3. Choose one patient and list all the conditions they have and when they started?

  1. How many records are there in the condition_occurrence table?

R

omop$condition_occurrence |>
  tally() |>
  collect()

OUTPUT

# A tibble: 1 × 1
      n
  <int>
1    35

Answer: There are 35 records in the condition_occurrence table.

CODING NOTE: The tally() function is used to count the number of records in the table. It is more efficient to use tally() directly on the database without collecting the data into R, as this allows the database to perform the counting operation, which is optimized for large datasets. We then use collect() to run the calculation and produce the output. In place of tally() we could have used count(), which is a convenience function to group by a column and then tally(). We could have also used base R nrow(omop$condition_occurrence) in this example.

  1. List any of the conditions that occur more than once in the table along with their humanly readable names.

R

omop$condition_occurrence |>
  group_by(condition_concept_id) |>
  summarise(occurrences = n()) |>
  filter(occurrences > 1) |>
  left_join(
    omop$concept,
    by = join_by(condition_concept_id == concept_id)
  ) |>
  select(concept_name, occurrences) |>
  collect()

OUTPUT

# A tibble: 4 × 2
  concept_name                             occurrences
  <chr>                                          <int>
1 Injury of head                                     2
2 Inflammatory disorder of digestive tract           2
3 Gastritis                                          2
4 Hemorrhoids                                        2

CODING NOTE: We first group the condition_occurrence table by condition_concept_id and count the number of occurrences for each condition. We then filter to keep only those conditions that occur more than once. Finally, we join with the concept table to get the humanly readable names of the conditions and select the relevant columns for display.

  1. Choose one patient and list all the conditions they have and when they started

R

patient_id <- 1111  # Replace with the desired person_id 

omop$condition_occurrence |>
  filter(person_id == patient_id) |>
  left_join(
    omop$concept,
    by = join_by(condition_concept_id == concept_id)
  ) |>
  select(condition_concept_id, concept_name, condition_start_date) |>
  collect()

OUTPUT

# A tibble: 1 × 3
  condition_concept_id concept_name                         condition_start_date
                 <int> <chr>                                <date>
1              4230399 Closed fracture of lateral malleolus 2025-07-22          

CODING NOTE: We filter the condition_occurrence table for the specified person_id to get all conditions for that patient. We then join with the concept table to get the humanly readable names of the conditions and select the relevant columns for display.

Question three can be repeated for different patients by changing the patient_id variable. You might be surprised to see that if you choose patient 31 you will see that the entry for their condition and start date is repeated. This isn’t an error!

Challenge

Challenge

Investigate why patient 31 has repeated entries for their condition and start date in the condition_occurrence table. Look at the condition_type_concept_id, conditions_status_concept_id, and condition_source_value columns to understand the differences between these entries.

R

patient_id <- 31  

omop$condition_occurrence |>
  filter(person_id == patient_id) |>
  left_join(
    omop$concept,
    by = join_by(condition_concept_id == concept_id)
  ) |>
  rename(condition_concept_name = concept_name) |>
  relocate(condition_concept_name, .after = condition_concept_id) |>
  left_join(
    omop$concept,
    by = join_by(condition_type_concept_id == concept_id)
  ) |>
  rename(condition_type_concept_name = concept_name) |>
  relocate(condition_type_concept_name, .after = condition_type_concept_id) |>
  left_join(
    omop$concept,
    by = join_by(condition_status_concept_id == concept_id)
  ) |>
  rename(condition_status_concept_name = concept_name) |>
  relocate(condition_status_concept_name,
           .after = condition_status_concept_id) |>
  select(
    condition_concept_id, 
    condition_concept_name,
    condition_start_date, 
    condition_type_concept_id,
    condition_type_concept_name,
    condition_status_concept_id,
    condition_status_concept_name,
    condition_source_value
    ) |>
  collect()

OUTPUT

# A tibble: 2 × 8
  condition_concept_id condition_concept_name condition_start_date
                 <int> <chr>                  <date>
1               375415 Injury of head         2019-05-10
2               375415 Injury of head         2019-05-10
# ℹ 5 more variables: condition_type_concept_id <int>,
#   condition_type_concept_name <chr>, condition_status_concept_id <int>,
#   condition_status_concept_name <chr>, condition_source_value <chr>

Answer: The output shows that for patient 31, the condition_concept_id and condition_start_date are identical, but the condition_type_concept_id and condition_status_concept_id differ. This indicates that the same condition was recorded in different contexts or with different statuses, which explains the repeated entries in the condition_occurrence table. This is commonly found in hospital records!

CODING NOTE: We filter the condition_occurrence table for person_id 31 and join with the concept table multiple times to get the humanly readable names for the condition_concept_id, condition_type_concept_id, and condition_status_concept_id. We then select and arrange the relevant columns for display to understand the differences between the entries for this patient.

Visits


The visit_occurrence table contains events where each person engages with the healthcare system for a duration of time.

The visit_occurrence table contains the following columns (among others not listed here):

Column Names Description of content
visit_occurrence_id Unique identifier for each visit occurrence
person_id Identifier for the patient
visit_concept_id Concept identifier for the type of visit
visit_start_date Date the visit started
visit_start_datetime Date and time the visit started
visit_end_date Date the visit ended
visit_end_datetime Date and time the visit ended
visit_type_concept_id Concept identifier for the type of visit
discharged_to_concept_id Concept identifier for where the patient was discharged to
preceding_visit_occurrence_id Identifier for the previous visit occurrence

The main clinical tables condition_occurrence, measurement, observation and drug_exposure contain a visit_occurrence_id that links to this table.

visit_concept_id specifies the kind of visit that took place using standardised OMOP concepts. These include Inpatient visit, Emergency Room Visit and Outpatient Visit. Inpatient visits can last for longer than one day.

As we have seen we don’t need to consider visits to answer all research questions. For example if we can count the number of patients with a particular condition without considering visits. However, in some cases visits are important. For example, if we want to know how many emergency room visits resulted in a hospital admission we need to consider visits.

Challenge

Challenge

  1. Find out how many different types of visits are recorded in the visit_occurrence table and link these to get their name.

  2. Find patients who had more than one visit.

  3. How many patients had both an emergency room visit and an inpatient visit?

  1. Find out how many different types of visits are recorded in the visit_occurrence table and link these to get their name.

R

omop$visit_occurrence |>
  count(visit_concept_id) |>
  left_join(
    omop$concept,
    by = join_by(visit_concept_id == concept_id)
  ) |>
  select(visit_concept_id, concept_name, n) |>
  arrange(desc(n)) |>
  collect()

OUTPUT

# A tibble: 4 × 3
  visit_concept_id concept_name                           n
             <int> <chr>                              <int>
1             9203 Emergency Room Visit                   8
2             9201 Inpatient Visit                        7
3             9202 Outpatient Visit                       4
4              262 Emergency Room and Inpatient Visit     1

Answer: The different types of visits recorded in the visit_occurrence table along with their names are: - Emergency Room Visit - Inpatient visit - Outpatient Visit - Emergency Room and Inpatient Visit

CODING NOTE: We use the count() function to count the number of occurrences of each visit_concept_id in the visit_occurrence table. We then join with the concept table to get the humanly readable names of the visit types and select the relevant columns for display. Finally, we arrange the results in descending order of count using the desc() function in conjunction with the arrange() function.

  1. Find patients who had more than one visit.

R

omop$visit_occurrence |>
  group_by(person_id) |>
  tally() |>
  filter(n > 1) |>
  collect()

OUTPUT

# A tibble: 4 × 2
  person_id     n
      <int> <int>
1      1112     2
2        31     2
3         2     2
4        58    10

CODING NOTE: We group the visit_occurrence table by person_id and use tally() to count the number of visits for each patient. We then filter to keep only those patients with an n greater than 1 and collect the results into R for display. We could have also used the summarise() function instead of tally().

  1. How many patients had both an emergency room visit and an inpatient visit?

R

patients_with_both_in_one_visit <- omop$visit_occurrence |>
  filter(visit_concept_id == 262) |>
  count(person_id) |>
  collect()

patients_with_both_visits_seperately <- omop$visit_occurrence |>
  filter(visit_concept_id %in% c(9203, 9201)) |>
  group_by(person_id) |>
  summarise(visit_types = n_distinct(visit_concept_id)) |>
  filter(visit_types  == 2) |>
  collect()

patients_with_both_in_one_visit |>
  bind_rows(patients_with_both_visits_seperately) |>
  distinct(person_id) |>
  tally()

OUTPUT

# A tibble: 1 × 1
      n
  <int>
1     2

Answer: The number of patients who had both an emergency room visit and an inpatient visit is 2.

CODING NOTE: We filter the visit_occurrence table for the visit_concept_id values corresponding to emergency room visits and inpatient visit coded in a single concept. We then use count(person_id) to find the total number of visits per patient. For the next section we look at patients who have had separate inpatient and emergency room visits by filtering the visit_occurrence table for the visit_concept_id for both concepts. We then group by person_id and use summarise() to count the number of distinct visit types for each patient. Finally, we count the number of patients who had more than one distinct visit type to get the number of patients who had both types of visits. Then we join together both tables using bind_rows() and get the distinct patients before counting the number of rows with tally().

Key Points
  • Conditions are stored in the condition_occurrence table in the OMOP CDM.

  • Visits are stored in the visit_occurrence table and linked to other clinical tables via visit_occurrence_id.

  • A visit represents a period of time during which a patient interacts with the healthcare system and there can be multiple types of visits.

  • Visits may be important to consider in analyses depending on the research question.

Content from Medications


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • Where are medications stored ?

  • How do you trace the relationship between concepts from different vocabularies?

Objectives

  • Know that exposure of a patient to medications is stored in the drug_exposure table

  • Understand that drug concepts can be at different levels of granularity

  • Understand that source values are mapped to a standard vocabulary

  • Understand how to access the concentration or strength of an exposure using the drug_strength table

Introduction


This episode considers medications (the drug exposure table) in the OMOP Common Data Model (CDM).

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

You will then need to load the database as shown in the previous episode.

R

library(dplyr)
open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

R

omop <- open_omop_dataset("./data/omop")

and the useful functions we created in the previous episode to look up concept names and ids.

R

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

R

get_concept_id <- function(omop_obj, name) {
  omop_obj$concept |>
    filter(concept_name == name) |>
    select(concept_id) |>
    collect()
}

Drug exposure


The OMOP drug_exposure table stores exposure of a patient to medications. The purpose of records in this table is to indicate an exposure to a certain drug as best as possible. In this context a drug is defined as an active ingredient. Drug Exposures are defined by Concepts from the Drug domain, which form a complex hierarchy. As a result, one drug_source_concept_id may map to multiple standard concept_ids if it is a combination product. Records in this table represent prescriptions written, prescriptions dispensed, and drugs administered by a provider to name a few. The drug_type_concept_id can be used to find and filter on these types. This table includes additional information about the drug products, the quantity given, and route of administration.

The drug_exposure table contains the following columns (among others not listed here):

Column Names Description of content
drug_exposure_id Unique identifier for each drug exposure
person_id Identifier for the patient
drug_concept_id Standard concept identifier for the drug
drug_exposure_start_date Date the drug exposure started
drug_exposure_start_datetime Date and time the drug exposure started
drug_exposure_end_date Date the drug exposure ended
drug_exposure_end_datetime Date and time the drug exposure ended
drug_type_concept_id Concept identifier for the type of drug exposure
quantity The quantity of the drug administered
route_concept_id Concept identifier for the route of administration
visit_occurrence_id Identifier for the visit during which the drug was administered
drug_source_concept_id OMOP concept ID for the source value

Drug data can be very complicated, as can the process of converting from the source data to OMOP. You may not find what you expect depending on this and the quality of the source data.

Drug concepts


The standard OHDSI drug vocabularies are called RxNorm and RxNormExtension. RxNorm contains all drugs currently on the US market. RxNormExtension is maintained by the OHDSI community and contains all other drugs.

A particular concept_id can be at one of a number of different levels in a drug hierarchy.

Challenge

Challenge

List the main levels of drug concepts in RxNorm from the concept table (e.g. the concept_class_id).

R

omop$concept |>
  filter(vocabulary_id == "RxNorm") |>
  select(concept_class_id) |>
  distinct() |>
  arrange(concept_class_id) |>
  collect() 

OUTPUT

# A tibble: 4 × 1
  concept_class_id
  <chr>
1 Clinical Drug
2 Clinical Drug Form
3 Ingredient
4 Quant Branded Drug

Answer: There are more levels than shown here, but that is a disadvantage of using a small sample database. In a full OMOP CDM database you would see more levels.

CODING NOTE: The distinct() function is used to get unique values of concept_class_id for the RxNorm vocabulary. The arrange() function is then used to sort these values in alphabetical order for easier reading.

A fuller example of the drug concept hierarchy in RxNorm is shown in the table below.

RxNorm concept_class_id Description
Clinical Drug A combination of an ingredient, strength, and dose form (e.g., Ibuprofen 200 mg Oral Tablet).
Clinical Drug Comp A drug component with strength but no form (e.g., Ibuprofen 200 mg).
Clinical Drug Form A drug with a specific dose form but no strength (e.g., Ibuprofen Oral Tablet).
Quant Clinical Drug A clinical drug with a specific quantity (e.g., Ibuprofen 200 mg Oral Tablet 1).
Ingredient A base active drug ingredient, without strength or dose form (e.g., Ibuprofen).

There are also concepts for Branded drugs and for packs of drugs (e.g. a box of 30 tablets) but these are not shown in this sample table.

Drug mapping in the NHS


Drugs in the NHS are standardised to the NHS Dictionary of Medicines and Devices (dm+d). dm+d is included in OMOP so there are values of OMOP concept_id for each dm+d. However because dm+d is not a standard vocabulary in OMOP it is translated once more to get to a standard OMOP concept_id in RxNorm or RxNormExtension that can be used in collaborative studies. If there is a drug_concept_id value of 0 and there are source codes, this can be because that drug doesn’t map to a standard concept. Reminder that the source values are stored in these columns.

Challenge

Challenge

Look up the concept_id 871182 and search through the concept table by drug name to find the likely concept in RxNorm and its concept_id.

R

# look up the concept entry
dmd_concept <- omop$concept |>
  filter(concept_id == 871182) |>
  select(concept_id, concept_name, domain_id, vocabulary_id, concept_class_id) |>
  collect()

dmd_concept

OUTPUT

# A tibble: 1 × 5
  concept_id concept_name          domain_id vocabulary_id concept_class_id
       <int> <chr>                 <chr>     <chr>         <chr>
1     871182 Lorazepam 1mg tablets Drug      dm+d          VMP             

R

# this is the dose of lorazepam
# now look up any concepts that have a similar name
similar <- filter(omop$concept, grepl('Lorazepam', concept_name, TRUE)) |>
  collect()

similar

OUTPUT

# A tibble: 4 × 10
  concept_id concept_name               domain_id vocabulary_id standard_concept
       <int> <chr>                      <chr>     <chr>         <chr>
1     871182 Lorazepam 1mg tablets      Drug      dm+d          ""
2   19019113 lorazepam MG Oral Tablet   Drug      RxNorm        "S"
3   35777064 1 ML Lorazepam 4 MG/ML In… Drug      RxNorm Exten… "S"
4   36816707 Lorazepam 4mg/1ml solutio… Drug      dm+d          ""
# ℹ 5 more variables: concept_class_id <chr>, concept_code <chr>,
#   valid_start_date <date>, valid_end_date <date>, invalid_reason <chr>

Answer: We can see from the resulting table that there are entries for each lorazepam dose from both the dm+d and RxNorm vocabularies. The concept_id 871182 corresponds to the dm+d concept “Lorazepam 1mg tablets”. This seems maps to the RxNorm concept “lorazepam Oral Tablet” which has concept_id 19019113, but without knowing the quantity we can’t be sure which dose it maps to. This is an example of the complexity of drug data and the mapping process.

CODING_NOTE: The function grepl() is used to find all concepts that have “Lorazepam” in their name. We have added the ignore.case = TRUE argument to make the search case-insensitive. This allows us to find all relevant concepts regardless of how they are capitalized in the concept names.

This is a snapshot of the Athena table for code 871182. Athena is the OHDSI tool for exploring the OMOP vocabularies and concept relationships. It is available online at https://athena.ohdsi.org/. You can use it to look up concepts and their relationships to other concepts in the OMOP CDM.

A snapshot of the Athena table for code 871182.
The OMOP Code 871182

Looking at the entry for concept_id 871182 we can see that it is in the dm+d vocabulary. We can see that is connected to the RxNorm concept “lorazepam 1 MG Oral Tablet” which has concept_id 19019113. So our assumption above was correct, but the concept table in our dataset didn’t fill in the name fully so we couldn’t be sure without looking it up in the official table. Within the OMOP CDM there are tables (concept_relationship) where you can look up the relationships between concepts programatically, so this doesn’t have to be manually done by using Athena.

Quantity


There are other vocabulary tables for the OMOP CDM other than the concept table. drug_strength stores the amount of concentration and associated units of a specific ingredient within a drug product.

The drug_strength table contains the following columns (among others not listed here):

Column Names Description of content
drug_concept_id Concept representing the Branded Drug or Clinical Drug Product
ingredient_concept_id Concept representing the active ingredient contained within the drug product
amount_value Amount of active ingredient
amount_unit_concept_id Concept for the unit of measure for the amount_value
numerator_value Concentration of the active ingredient in the drug product
numerator_unit_concept_id Concept for the unit of measure for numerator_value
denominator_value Amount of total liquid in the drug product
denominator_unit_concept_id Concept for the unit of measure for denominator_value

Drug data can be very complicated, as can the process of converting from the source data to OMOP. You may not find what you expect depending on this and the quality of the source data.

Challenge

Challenge

Work out what drugs person_id 2 was exposed to and the quantity of each drug by joining to the drug_strength table. Please display the names for the drug_concept_id. Once you have this, also display the names of amount_unit_concept_id, numerator_unit_concept_id and denominator_unit_concept_id.

R

# filter for person_id 2
person_2_drugs <- omop$drug_exposure |>
  filter(person_id == 2) |>
  select(drug_exposure_id, drug_concept_id, drug_source_concept_id, quantity) 

# now create a table with humanly readable names for the drug_concept_id and drug_source_concept_id.
# first, get the concept names for drug_concept_id
concept_names <- omop$concept |>
  select(concept_id, concept_name)

# join with person_2_drugs
person_2_drugs <- person_2_drugs |>
  left_join(concept_names, by = join_by(drug_concept_id == concept_id)) |>
  rename(drug_concept_name = concept_name)

# Join in the drug strength table and get the names
person_2_drugs_with_strength <- person_2_drugs |>
  left_join(omop$drug_strength, by = join_by(drug_concept_id)) |>
  # join in names of concepts
  left_join(concept_names, by = join_by(amount_unit_concept_id == concept_id)) |>
  rename(amount_unit_concept_name = concept_name) |>
  left_join(concept_names, by = join_by(numerator_unit_concept_id == concept_id)) |>
  rename(numerator_unit_concept_name = concept_name) |>
  left_join(concept_names, by = join_by(denominator_unit_concept_id == concept_id)) |>
  rename(denominator_unit_concept_name = concept_name) |>
  select(
    drug_exposure_id,
    drug_concept_name, 
    quantity, 
    amount_value, 
    amount_unit_concept_name, 
    numerator_value, 
    numerator_unit_concept_name, 
    denominator_value, 
    denominator_unit_concept_name
    ) |>
  collect()

person_2_drugs_with_strength

OUTPUT

# A tibble: 7 × 9
  drug_exposure_id drug_concept_name                       quantity amount_value
             <int> <chr>                                      <dbl>        <dbl>
1            38914 1 ML Lorazepam 4 MG/ML Injectable Solu…     0.25           NA
2            43725 100 ML Glucose 50 MG/ML Injectable Sol…     0              NA
3            44660 2 ML Ondansetron 2 MG/ML Injectable so…     1              NA
4            29587 lorazepam MG Oral Tablet                    1               1
5            34505 promethazine hydrochloride 25 MG Oral …     1              25
6            34506 lorazepam MG Oral Tablet                    1               1
7            47679 promethazine hydrochloride 25 MG Oral …     1              25
# ℹ 5 more variables: amount_unit_concept_name <chr>, numerator_value <dbl>,
#   numerator_unit_concept_name <chr>, denominator_value <dbl>,
#   denominator_unit_concept_name <chr>

Answer: Person_id 2 was exposed to three drugs. - They had three doses of lorazepam: two separate doses of 1 x 1mg lorazepam tablets and one dose of 0.25 x 4mg/ml injectable lorazepam. - They had two doses of 1 x 25mg promethazine hydrochloride tablets. - They had one dose of 1 x 2mg/ml injectable ondansetron.

(Note the quantity of glucose is zero.)

CODING NOTE: We first filter the drug_exposure table for person_id 2 and select the relevant columns. Then we join this with the concept table to get the human readable name for the drug. We then join the drug_strength and join the concept table for each of the concept column in the drug_strength table.

Administration route


Challenge

Challenge

Find the route of administration for the drugs person_id 2 was exposed to.

R

# we could have used the same table we created in the previous challenge
person_2_drugs <- omop$drug_exposure |>
  filter(person_id == 2) |>
  select(drug_concept_id, drug_source_concept_id, quantity, route_concept_id) |>
  left_join(concept_names, by = join_by(drug_concept_id == concept_id)) |>
  rename(drug_concept_name = concept_name) 


# join with person_2_drugs
person_2_drugs <- person_2_drugs |>
  left_join(concept_names, by = join_by(route_concept_id == concept_id)) |>
  rename(route_concept_name = concept_name) |>
  collect()

person_2_drugs

OUTPUT

# A tibble: 7 × 6
  drug_concept_id drug_source_concept_id quantity route_concept_id
            <int>                  <int>    <dbl>            <int>
1        19019113                 871182     1             4132161
2        40180065                 874635     1             4132161
3        19019113                 871182     1             4132161
4        35777064               36816707     0.25          4302612
5        21049614               21255277     0             4171047
6        35778239               36817124     1             4171047
7        40180065                 874635     1             4132161
# ℹ 2 more variables: drug_concept_name <chr>, route_concept_name <chr>

CODING NOTE: We follow a similar process as in the previous challenge, but this time we also select the route_concept_id and join it with the concepts table to get the human-readable name for the route of administration. The resulting table now includes the route of administration for each drug exposure.

Key Points
  • Know that exposure of a patient to medications is mainly stored in the drug_exposure table
  • Understand that drug concepts can be at different levels of granularity
  • Understand that source values are mapped to a standard vocabulary
  • Understand how to access the concentration or strength of an exposure using the drug_strength table

Content from Dates and times


Last updated on 2026-04-28 | Edit this page

Estimated time: 0 minutes

Overview

Questions

  • What is the duration of an event?

  • How can I view trends in measurements?:

Objectives

  • Know that dates and times are used in most tables in the OMOP CDM to record when events happened.

  • Understand that dates and times are often used in pairs to record the start and end of an event, which allows us to calculate the duration of that event.

  • Know how to convert a date or datetime string to a date or datetime object in R.

Introduction


This episode considers dates and times in the OMOP Common Data Model (CDM).

Callout

For this episode we will be using a sample OMOP CDM database that is pre-loaded with data. This database is a simplified version of a real-world OMOP CDM database and is intended for educational purposes only.

(UCLH only) This will come in a similar form as you would get data if you asked for a data extract via the SAFEHR platform (i.e. a set of parquet files).

As part of the setup prior to this course you were asked to download and install the sample dataset. If you have not done this yet, please refer to the setup instructions. For now, we will assume that you have the sample OMOP CDM dataset available on your local machine at the following path: ./data/omop/ and the functions in a folder ./code/parquet_dataset.

You will then need to load the database as shown in the previous episode.

R

library(dplyr)

open_omop_dataset <- function(path) {
    # iterate over table level directories
    list.dirs(path, recursive = FALSE) |>
      # exclude folder name from path and use it as index for named list
      purrr::set_names(~ basename(.)) |>
      # "lazy-load" list of parquet files from specified folder
      purrr::map(arrow::open_dataset)
}

R

omop <- open_omop_dataset("./data/omop")

and the useful functions we created in the previous episode to look up concept names and ids.

R

get_concept_name <- function(omop_obj, id) {
  omop_obj$concept |>
    filter(concept_id == id) |>
    select(concept_name) |>
    collect()
}

R

get_concept_id <- function(omop_obj, name) {
  omop_obj$concept |>
    filter(concept_name == name) |>
    select(concept_id) |>
    collect()
}

Dates and times in OMOP

Dates and times are used in most tables to record when events happened, usually with the start and end date recorded. In many places there is also a time component, for example to record the time of day when a measurement was taken. Column names are frequently either suffixed with _date or _datetime to indicate whether they contain just a date or both date and time information.

Parquet has data types for dates and date-times, but if the data was written to CSV or as a string then we can convert these strings to date or datetime objects using the as.Date() or as.POSIXct() functions, respectively. This will allow us to perform date and time calculations and visualizations more easily.

Usually the dates come in pairs, for example condition_start_date and condition_end_date in the condition_occurrence table. This allows us to determine the duration of a condition, for example, by calculating the difference between the start and end dates.

Challenge

Challenge

Using the condition_occurrence table, display the duration in days for each condition and find the average duration of conditions in this dataset.

R

# First we need to read in the condition_occurrence table with only the columns we need to worry about and collect the data into memory
condition_occurrence <- omop$condition_occurrence |>
  select(condition_occurrence_id, condition_start_date, condition_end_date) |>
    collect()

# Exclude any rows where the end date is missing, as we can't calculate the duration for these
condition_occurrence <- condition_occurrence |>
    filter(!is.na(condition_end_date))

# We can calculate the length of each condition in days by taking the difference between the end date and the start date. We add 1 to include both the start and end date in the duration calculation.
condition_occurrence <- condition_occurrence |>
  mutate(
    length_days_inclusive = as.integer(condition_end_date - condition_start_date) + 1L
  )

# Finally we can calculate the average duration of conditions in this dataset, excluding any missing values
average_duration <- mean(condition_occurrence$length_days_inclusive, na.rm = TRUE)

average_duration

OUTPUT

[1] 3.178571

Answer: The average duration of conditions in this dataset is approximately 3.2 days.

CODING NOTE: In the code above, we first read in the condition_occurrence table and selected only the relevant columns. We then filtered out any rows where the condition_end_date was missing, as we cannot calculate the duration for these conditions. Next, we calculated the length of each condition in days by taking the difference between the end date and the start date and adding 1 to include both the start and end date in the duration calculation. Finally, we calculated the average duration of conditions in this dataset by taking the mean of the length_days_inclusive column, excluding any missing values.

It is not uncommon for the end date to be missing in the data, for example if a condition is ongoing at the time of data extraction. In this case, we can only calculate the duration for conditions that have an end date recorded.

Also consider the measurement table where we have both a measurement_date and a measurement_datetime column. The measurement_date column contains just the date of the measurement, while the measurement_datetime column contains both the date and time of the measurement. Depending on the analysis we want to do, we may choose to use one or the other of these columns. Note generally a measurement doesn’t have a start and end date, but just a single date or datetime when the measurement was taken.

Challenge

Challenge

Using the measurement table, graph the Body temperature of patient 113 over time. You can use either the measurement_date or measurement_datetime column for the x-axis.

R

library(ggplot2)
# First we need to know the concept id for body temperature
get_concept_id(omop, "Body temperature")

OUTPUT

# A tibble: 1 × 1
  concept_id
       <int>
1    3020891

R

# Then we need to read in the measurement table, filter for the measurements of interest and collect the data into memory
measurement <- omop$measurement |>
    filter(person_id == 1113 & measurement_concept_id == 3020891) |>
    collect() 

ggplot(measurement, aes(x = measurement_datetime, y = value_as_number)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Body temperature of patient 113 over time",
    x = "Date and time of measurement",
    y = "Body temperature (°C)"
  )

CODING NOTE: In the code above, we first used the get_concept_id() function to find the concept ID for “Body temperature”. We then read in the measurement table and filtered for measurements of patient 113 with the relevant concept ID. We collected this data into memory. Finally, we created a line plot of body temperature over time using ggplot2, we did not need to order the data by datetime as ggplot does this for us when we set the datetime column as the value for the x-axis.

Converting data types

We are working with a dataset from parquet files which contains information about the data type of each column. If you are working with a csv file as an input, then there are no guarantees that a dataset will have the expected data types. We can simulate what this may look like by creating a new measurement table that has string types instead of the expected types.

R

measurement_to_fix <- omop$measurement |>
  mutate(
    across(c(measurement_datetime, measurement_date, value_as_number), ~ as.character((.x)))
    ) |>
  collect()

measurement_to_fix

OUTPUT

# A tibble: 353 × 12
   measurement_id person_id measurement_concept_id measurement_date
            <int>     <int>                  <int> <chr>
 1           8001      1111                3020891 2025-07-22
 2           8002      1111                3004249 2025-07-22
 3           8003      1111                3012888 2025-07-22
 4           8004      1111                3027018 2025-07-22
 5           8005      1112                3013502 2024-01-15
 6           8006      1112                3027018 2024-01-15
 7           8007      1112                3020891 2024-01-15
 8           8009      1112                3027315 2024-01-16
 9           8010      1112                3013502 2024-01-18
10           8011      1112                3013502 2025-12-24
# ℹ 343 more rows
# ℹ 8 more variables: measurement_datetime <chr>, operator_concept_id <int>,
#   value_as_number <chr>, value_as_concept_id <int>, unit_concept_id <int>,
#   range_low <dbl>, range_high <dbl>, visit_occurrence_id <int>

We’ve used the across() function from the dplyr package to apply the as.character() function to three columns.

Challenge

Challenge

Using the measurement_to_fix table.

  1. Calculate the number of days between the first and last Body temperature measurement for each patient, sorting the final output by patient_id. Which patient has Body temperature measurements spanning mulitple days?
  2. Create a scatter plot of all Systolic blood pressure and Diastolic blood pressure measurement values that were taken between the 1st of January 2025 and 1st of October 2025, using measurement_datetime for the x-axis. Colour the points by the patient_id.
  1. Calculate the number of days between the first and last Body temperature measurement for each patient, sorting the final output by patient_id. Which patient has Body temperature measurements spanning mulitple days?

R

# Convert the date column to a date
measurement_with_date <- measurement_to_fix |>
    mutate(measurement_date = as.Date(measurement_date, format = "%Y-%m-%d"))

# Calculate the number of days between first and last body temperature measurement
measurement_with_date |>
  filter(measurement_concept_id == 3020891) |>
  summarise(
    first_measurement_date = min(measurement_date),
    last_measurement_date = max(measurement_date),
    days_between = as.integer(last_measurement_date - first_measurement_date),
    .by = person_id
    ) |>
  arrange(person_id)

OUTPUT

# A tibble: 5 × 4
  person_id first_measurement_date last_measurement_date days_between
      <int> <date>                 <date>                       <int>
1      1111 2025-07-22             2025-07-22                       0
2      1112 2024-01-15             2024-01-15                       0
3      1113 2025-03-15             2025-03-19                       4
4     34567 2025-12-28             2025-12-28                       0
5     78901 2004-09-15             2004-09-15                       0

Answer: Only patient 113 has Body Temperature measurements over multiple days.

CODING NOTE: In the code above, we converted the measurement_date from character to date instances using the as.Date() function. We then filtered the data to the concept for Body Temperature, and used summarise() grouped by person_id to calculate the first and last measurement date, and their difference in numbers of days.

  1. Create a scatter plot of all Systolic blood pressure and Diastolic blood pressure measurement values that were taken between the 1st of January 2025 and 1st of October 2025, using measurement_datetime for the x-axis. Colour the plot by the patient_id.

R

# Correct the data types of remaining columns
measurement_fixed <- measurement_with_date |>
  mutate(
    measurement_datetime = as.POSIXct(measurement_datetime, tz = "UTC"),
    value_as_number = as.double(value_as_number)
  )

# Get concepts for blood pressure
systolic_concept <- get_concept_id(omop, "Systolic blood pressure") |>
  pull()

diastolic_concept <- get_concept_id(omop, "Diastolic blood pressure") |>
  pull()

# Filter measurements by concepts and time range
measurement_fixed |>
  filter(
    measurement_concept_id %in% c(systolic_concept, diastolic_concept),
    measurement_datetime >= "2025-01-01",
    measurement_datetime < "2025-10-01",
  ) |>
  ggplot(aes(x = measurement_datetime, y = value_as_number, colour = as.factor(person_id))) +
  geom_point() +
  labs(
    title = "Blood pressure measurements over time",
    x = "Measurement taken (datetime)",
    y = "Blood pressure",
    colour = "person_id"
  )

CODING NOTE: In the code above, we converted the measurement_datetime from character to datetime instances using the as.POSIXct() function, we would need to confirm which timezone the data is in if we haven’t already been told. We also converted the value_as_number column to a double (numeric allowing for decimal places) data type using as.double(). We then found the concepts required, filtered to these and the date range and used these to plot the data using ggplot(). We used the colour option in aes() to define that the colour of the points should be, converting it to a factor using as.factor() so that the continuous variable is diplayed as categories.

Key Points
  • Dates and times are used in most tables in the OMOP CDM to record when events happened.

  • They are often used in pairs to record the start and end of an event, which allows us to calculate the duration of that event.