Variable reference
This section describes each available function for creating variables within a study definition.
For more information on the datasets contained within the OpenSAFELY database, see the Data sources section.
For more information on writing a study definition, go to the study definition section.
Primary Care Record🔗
These variables are derived from data held in the patients' primary care records.
registered_as_of(reference_date, return_expectations=None)
🔗
All patients registered on the given date. Note this function passes arguments to registered_with_one_practice_between()
Parameters:
Name | Description | Default |
---|---|---|
reference_date |
date of interest as a string with the format |
required |
return_expectations |
a dictionary containing an expectation definition defining an |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable "registered" with patient returning an integer of 1
if patient registered
at date. Patients who are not registered return an integer of 0
:
registered=patients.registered_as_of(
"2020-03-01",
return_expectations={"incidence": 0.98}
)
registered_with_one_practice_between(start_date, end_date, return_expectations=None)
🔗
All patients registered with the same practice through the given period
Parameters:
Name | Description | Default |
---|---|---|
start_date |
start date of interest of period as a string with the format |
required |
end_date |
end date of interest of period as a string with the format |
required |
return_expectations |
a dictionary containing an expectation definition defining an |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable registered_one
with patient returning an integer of 1
if patient registered
at one practice between two dates. Patients who are not registered return an integer of 0
.
registered_one=patients.registered_with_one_practice_between(
start_date="2020-03-01",
end_date="2020-06-01",
return_expectations={"incidence": 0.90}
)
date_deregistered_from_all_supported_practices(on_or_before=None, on_or_after=None, between=None, date_format=None, return_expectations=None)
🔗
Returns the date (if any) on which the patient de-registered from all practices for which OpenSAFELY has data. Events which occur in primary care after this date will not be recorded in the platform (though there may be data from other sources e.g. SGSS, CPNS).
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of strings with a date format returned if patient had deregistered, otherwise empty |
Examples:
A variable called dereg_date
is created with returns a date of de-registration if patient has
deregistered from a practice within the dataset within the specified time period.
dereg_date=patients.date_deregistered_from_all_supported_practices(
on_or_after="2020-03-01",
date_format="YYYY-MM",
return_expectations={
{"date": {"earliest": "2020-03-01"},
"incidence": 0.05
}
)
with_complete_history_between(start_date, end_date, return_expectations=None)
🔗
All patients for which we have a full set of records between the given dates
Parameters:
Name | Description | Default |
---|---|---|
start_date |
start date of interest of period as a string with the format |
required |
end_date |
end date of interest of period as a string with the format |
required |
return_expectations |
a dictionary containing an expectation definition defining an |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable has_consultation_history
with patient returning an integer of 1
if
patient registered at one practice between two dates and has a completed record. Patients who are
not registered with a complete record return an integer of 0
.
has_consultation_history=patients.with_complete_gp_consultation_history_between(
start_date="2019-02-01",
end_date="2020-01-31",
return_expectations={"incidence": 0.9},
)
with_complete_gp_consultation_history_between(start_date, end_date, return_expectations=None)
🔗
The concept of a "consultation" in EHR systems does not map exactly to the GP-patient interaction we're interested in (see above) so there is some processing required on the part of the EHR vendor to produce the consultation record we need. This does not happen automatically as part of the GP2GP transfer, and therefore this query can be used to find just those patients for which the full history is available. This means finding patients who have been continuously registered with a single TPP-using practice throughout a time period.
Parameters:
Name | Description | Default |
---|---|---|
start_date |
start date of interest as a string with the format |
required |
end_date |
end date of interest as a string with the format |
required |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
registered_practice_as_of(date, returning=None, return_expectations=None)
🔗
Return patients' practice address characteristics such as STP or MSOA
Parameters:
Name | Description | Default |
---|---|---|
date |
date of interest as a string with the format |
required |
returning |
string indicating value to be returned. Options are:
|
None |
return_expectations |
a dict defining the |
None |
Returns:
Type | Description |
---|---|
list |
of strings |
Exceptions:
Type | Description |
---|---|
ValueError |
if unsupported |
Examples:
This creates a variable called region
based on practice address of the patient:
region=patients.registered_practice_as_of(
"2020-02-01",
returning="nuts1_region_name",
return_expectations={
"rate": "universal",
"category": {
"ratios": {
"North East": 0.1,
"North West": 0.1,
"Yorkshire and the Humber": 0.1,
"East Midlands": 0.1,
"West Midlands": 0.1,
"East of England": 0.1,
"London": 0.2,
"South East": 0.2,
},
},
},
)
address_as_of(date, returning=None, round_to_nearest=None, return_expectations=None)
🔗
Return patients' address characteristics such as IMD as of a particular date
Parameters:
Name | Description | Default |
---|---|---|
date |
date of interest as a string with the format |
required |
returning |
string indicating value to be returned. Options are:
|
None |
round_to_nearest |
an integer that represents how |
None |
return_expectations |
a dict defining the |
None |
Returns:
Type | Description |
---|---|
list |
of integers for |
Exceptions:
Type | Description |
---|---|
ValueError |
if unsupported |
Examples:
This creates a variable called imd
based on patient address.
imd=patients.address_as_of(
"2020-02-29",
returning="index_of_multiple_deprivation",
round_to_nearest=100,
return_expectations={
"rate": "universal",
"category": {"ratios": {"100": 0.1, "200": 0.2, "300": 0.7}},
},
)
with_gp_consultations(on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', date_format=None, return_expectations=None)
🔗
These are GP-patient interactions, either in person or via phone/video call. The concept of a "consultation" in EHR systems is generally broader and might include things like updating a phone number with the receptionist.
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
find_first_match_in_period |
a boolean that indicates if the data returned is first event if there are multiple matches within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last event if there are multiple matches within the time period |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called gp_count
is created that counts number of GP consultation between two dates in
2019.
gp_count=patients.with_gp_consultations(
between=["2019-01-01", "2020-12-31"],
returning="number_of_matches_in_period",
return_expectations={
"int": {"distribution": "normal", "mean": 6, "stddev": 3},
"incidence": 0.6,
},
)
sex(return_expectations=None)
🔗
Returns the sex of the patient.
Parameters:
Name | Description | Default |
---|---|---|
return_expectations |
a dict containing an expectation definition defining a rate and a ratio for sexes |
None |
Returns:
Type | Description |
---|---|
list |
|
Examples:
This creates a variable 'sex' with all patients returning a sex of either "M", "F" or ""
sex=patients.sex(
return_expectations={
"rate": "universal",
"category": {"ratios": {"M": 0.49, "F": 0.51}},
}
)
age_as_of(reference_date, return_expectations=None)
🔗
Returns age of patient of at a particular date. Note can be negative if born after reference_date
.
Parameters:
Name | Description | Default |
---|---|---|
reference_date |
date of interest as a string with the format |
required |
return_expectations |
a dict defining an expectation definition that includes at least a rate
and a distribution. If |
None |
Returns:
Type | Description |
---|---|
list |
ages as integers |
Examples:
This creates a variable "age" with all patient returning an age as an integer:
age=patients.age_as_of(
"2020-02-01",
return_expectations={
"rate" : "universal",
"int" : {"distribution" : "population_ages"}
}
)
date_of_birth(date_format=None, return_expectations=None)
🔗
Returns date of birth as a string with format "YYYY-MM"
Parameters:
Name | Description | Default |
---|---|---|
date_format |
a string detailing the format of the dates for date of birth to be returned. It can be "YYYY-MM" or "YYYY" and wherever possible the least disclosive data should be returned. i.e returning only year is less disclosive than a date with month and year. |
None |
return_expectations |
a dictionary containing an expectation definition defining a rate and a distribution |
None |
Returns:
Type | Description |
---|---|
list |
dates as strings with "YYYY-MM" format |
Exceptions:
Type | Description |
---|---|
ValueError |
if Date of Birth is attempted to be returned with a |
Examples:
This creates a variable dob
with all patient returning a year and month as a string:
dob=patients.date_of_birth(
"YYYY-MM",
return_expectations={
"date": {"earliest": "1950-01-01", "latest": "today"},
"rate": "uniform",
}
)
most_recent_bmi(on_or_before=None, on_or_after=None, between=None, minimum_age_at_measurement=16, return_expectations=None, include_measurement_date=False, date_format=None, include_month=False, include_day=False)
🔗
Return patients' most recent BMI (in the defined period) either computed from weight and height measurements or, where they are not availble, from recorded BMI values. Measurements taken when a patient was below the minimum age are ignored. The height measurement can be taken before (but not after) the defined period as long as the patient was over the minimum age at the time.
The date of the measurement can be obtained using date_of("<bmi-column-name>")
.
If the BMI is computed from weight and height then we use the date of the
weight measurement for this.
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
minimum_age_at_measurement |
Measurements taken before this age will not count towards BMI calculations. It is an integer and the default value is 16. |
16 |
return_expectations |
a dictionary defining the incidence and distribution of expected BMI
within the population in question. This is a 3-item key-value dictionary of "date" and "float".
"date" is dictionary itself and should contain the |
None |
include_measurement_date |
a boolean indicating if an extra column, named |
False |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
Returns:
Type | Description |
---|---|
float |
most recent BMI |
Examples:
This creates a variable "bmi" returning a float of the most recent bmi calculated from recorded height and weight, or from a recorded bmi record. Patient who do not have this information available do not return a value:
bmi=patients.most_recent_bmi(
between=["2010-02-01", "2020-01-31"],
minimum_age_at_measurement=18,
include_measurement_date=True,
date_format="YYYY-MM",
return_expectations={
"date": {"earliest": "2010-02-01", "latest": "2020-01-31"},
"float": {"distribution": "normal", "mean": 28, "stddev": 8},
"incidence": 0.80,
}
)
mean_recorded_value(codelist, on_most_recent_day_of_measurement=None, return_expectations=None, on_or_before=None, on_or_after=None, between=None, include_measurement_date=False, date_format=None, include_month=False, include_day=False)
🔗
Return patients' mean recorded value of a numerical value as defined by a codelist on a particular day within the defined period. This is important as allows us to account for multiple measurements taken on one day.
The date of the measurement can be included by flagging with date format options.
Parameters:
Name | Description | Default |
---|---|---|
codelist |
a codelist for requested value |
required |
on_most_recent_day_of_measurement |
boolean flag for requesting measurements be on most recent date |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value
within the population in question. This is a 3-item key-value dictionary of "date" and "float".
"date" is dictionary itself and should contain the |
None |
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
include_measurement_date |
a boolean indicating if an extra column, named |
False |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
Returns:
Type | Description |
---|---|
float |
mean of value |
Examples:
This creates a variable bp_sys
returning a float of the most recent systolic blood pressure from
the record within the time period. In the event of repeated measurements on the same day, these
are averaged. Patient who do not have this information
available do not return a value:
bp_sys=patients.mean_recorded_value(
systolic_blood_pressure_codes,
on_most_recent_day_of_measurement=True,
between=["2017-02-01", "2020-01-31"],
include_measurement_date=True,
date_format="YYYY-MM",
return_expectations={
"float": {"distribution": "normal", "mean": 80, "stddev": 10},
"date": {"earliest": "2019-02-01", "latest": "2020-01-31"},
"incidence": 0.95,
},
)
with_these_medications(codelist, return_expectations=None, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', include_date_of_match=False, date_format=None, ignore_days_where_these_clinical_codes_occur=None, episode_defined_as=None, return_binary_flag=None, return_number_of_matches_in_period=False, return_first_date_in_period=False, return_last_date_in_period=False, include_month=False, include_day=False)
🔗
Patients who have been prescribed at least one of this list of medications in the defined period
Parameters:
Name | Description | Default |
---|---|---|
codelist |
a codelist for requested medication(s) |
required |
return_expectations |
a dictionary defining the incidence and distribution of expected value
within the population in question. If returning an integer (returning number_of_matches_in_period,
number_of_episodes), this is a 2-item key-value dictionary of |
None |
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
find_first_match_in_period |
a boolean indicating if any returned date, code, category, or numeric value should be based on the first match in the period. |
None |
find_last_match_in_period |
a boolean indicating if any returned date, code, category, or numeric value should be based on the last match in the period. |
None |
include_date_of_match |
a boolean indicating if an extra column should be included in the output.
The default value is |
False |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
ignore_days_where_these_clinical_codes_occur |
a codelist that contains codes for medications to be
ignored. if a medication is found on this day, the date is not matched even it matches a
code in the main |
None |
episode_defined_as |
a string expression indicating how an episode should be defined |
None |
return_binary_flag |
a bool indicatin if a binary flag should be returned (deprecated: use |
None |
return_number_of_matches_in_period |
a boolean indicating if the number of matches in a period should be
returned (deprecated: use |
False |
return_first_date_in_period |
a boolean indicating if the first matches in a period should be
returned (deprecated: use |
False |
return_last_date_in_period |
a boolean indicating if the last matches in a period should be
returned (deprecated: use |
False |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable exacerbation_count
returning an int of the number of episodes of oral
steroids being prescribed within the time period where a prescription is counted as part of the same
episode if it falls within 28 days of a previous prescription. Days where oral steroids
are prescribed on the same day as a COPD review are also ignored as may not represent true exacerbations.
exacerbation_count=patients.with_these_medications(
oral_steroid_med_codes,
between=["2019-03-01", "2020-02-29"],
ignore_days_where_these_clinical_codes_occur=copd_reviews,
returning="number_of_episodes",
episode_defined_as="series of events each <= 28 days apart",
return_expectations={
"int": {"distribution": "normal", "mean": 2, "stddev": 1},
"incidence": 0.2,
},
)
with_these_clinical_events(codelist, return_expectations=None, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', include_date_of_match=False, date_format=None, ignore_missing_values=False, ignore_days_where_these_codes_occur=None, episode_defined_as=None, return_binary_flag=None, return_number_of_matches_in_period=False, return_first_date_in_period=False, return_last_date_in_period=False, include_month=False, include_day=False)
🔗
Patients who have had at least one of these clinical events in the defined period
Parameters:
Name | Description | Default |
---|---|---|
codelist |
a codelist for requested event(s) |
required |
return_expectations |
a dictionary defining the incidence and distribution of expected value
within the population in question. If returning an integer ( |
None |
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
find_first_match_in_period |
a boolean indicating if any returned date, code, category, or numeric value should be based on the first match in the period. If several matches compare equal, then their IDs are used to break the tie. |
None |
find_last_match_in_period |
a boolean indicating if any returned date, code, category, or numeric value should be based on the last match in the period. This is the default behaviour. If several matches compare equal, then their IDs are used to break the tie. |
None |
include_date_of_match |
a boolean indicating if an extra column should be included in the output.
The default value is |
False |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
ignore_days_where_these_codes_occur |
a codelist that contains codes for events to be
ignored. if a events is found on this day, the date is not matched even it matches a
code in the main |
None |
episode_defined_as |
a string expression indicating how an episode should be defined |
None |
ignore_missing_values |
ignore events where the value is missing or zero. We are unable to distinguish between zeros and null values due to limitations in how the data is recorded in TPP. |
False |
return_binary_flag |
a boolean indicating if the number of matches in a period should be
returned (deprecated: use |
None |
return_number_of_matches_in_period |
a boolean indicating if the number of matches in a period should be
returned (deprecated: use |
False |
return_first_date_in_period |
a boolean indicating if the first matches in a period should be
returned (deprecated: use |
False |
return_last_date_in_period |
a boolean indicating if the last matches in a period should be
returned (deprecated: use |
False |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable haem_cancer
returning the first date of a diagnosis of haematology
malignancy within the time period.
haem_cancer=patients.with_these_clinical_events(
haem_cancer_codes,
between=["2015-03-01", "2020-02-29"],
returning="date",
find_first_match_in_period=True,
return_expectations={"date": {earliest; "2015-03-01", "latest": "2020-02-29"}},
)
with_death_recorded_in_primary_care(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, return_expectations=None)
🔗
Identify patients with a date-of-death in their primary care record.
There is generally a lag between the death being recorded in ONS data and appearing in the primary care record, but the date itself is usually reliable when it appears. By contrast, cause of death is often not accurate in the primary care record so we don't make it available to query here.
Args:
!!! on_or_before "date of interest as a string with the format `YYYY-MM-DD`. Filters results"
on or before the given date. The default value is `None`.
!!! on_or_after "date of interest as a string with the format `YYYY-MM-DD`. Filters results"
on or after the given date. The default value is `None`.
!!! between "two dates of interest as a list with each date as a string with the format `YYYY-MM-DD`."
Filters results to between the two dates provided. The default value is `None`.
returning: string indicating value to be returned. Options are:
* `date_of_death`: Date of death
* `binary_flag`: If they died or not
!!! date_format "a string detailing the format of the dates to be returned. It can be `YYYY-MM-DD`,"
`YYYY-MM` or `YYYY` and wherever possible the least disclosive data should be returned. i.e returning
only year is less disclosive than a date with day, month and year.
!!! return_expectations "a dictionary defining the incidence and distribution of expected value"
within the population in question.
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called died_date_gp
is created that returns the date of death for
any patients have died in the GP dataset.
died_date_gp=patients.with_death_recorded_in_primary_care(
on_or_after="2020-02-01",
returning="date_of_death",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
care_home_status_as_of(date, categorised_as=None, return_expectations=None)
🔗
TPP have attempted to match patient addresses to care homes as stored in the CQC database. At its most simple this query returns a boolean indicating whether the patient's address (as of the supplied time) matched with a care home.
It is also possible return a more complex categorisation based on attributes of the care homes in the CQC database, which can be freely downloaded here: https://www.cqc.org.uk/about-us/transparency/using-cqc-data
At present the only imported fields are: LocationRequiresNursing LocationDoesNotRequireNursing
But we can ask for more fields to be imported if needed.
The categorised_as
argument acts in effectively the same way as for the
categorised_as
function except that the only columns that can be referred
to are those belonging to the care home table (i.e. the two nursing fields
above) and the boolean IsPotentialCareHome
Parameters:
Name | Description | Default |
---|---|---|
date |
date of interest as a string with the format |
required |
categorised_as |
a logic expression that applies an algorithm to specific variables to create categories |
None |
return_expectations |
a dict defining the |
None |
Returns:
Type | Description |
---|---|
list |
of strings which each letter representing a category as defined by the algorithm |
Examples:
This creates a variable called care_home_type
which contains a 2
letter string which represents a type of care home environment. If the
address is not valid, it defaults to an empty string.
care_home_type=patients.care_home_status_as_of(
"2020-02-01",
categorised_as={
"PC":
"""
IsPotentialCareHome
AND LocationDoesNotRequireNursing='Y'
AND LocationRequiresNursing='N'
""",
"PN":
"""
IsPotentialCareHome
AND LocationDoesNotRequireNursing='N'
AND LocationRequiresNursing='Y'
""",
"PS": "IsPotentialCareHome",
"PR": "NOT IsPotentialCareHome",
"": "DEFAULT",
},
return_expectations={
"rate": "universal",
"category": {"ratios": {"PC": 0.05, "PN": 0.05, "PS": 0.05, "PR": 0.84, "": 0.01},},
},
),
with_tpp_vaccination_record(target_disease_matches=None, product_name_matches=None, on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, find_first_match_in_period=None, find_last_match_in_period=None, return_expectations=None)
🔗
Identify patients with a vaccination record for a target disease within the TPP vaccination record
Vaccinations can be recorded via a Vaccination Record or via prescription of a vaccine i.e a product code.
Parameters:
Name | Description | Default |
---|---|---|
target_disease_matches |
the target disease as a string |
None |
product_name_matches |
the product name as a string |
None |
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
find_first_match_in_period |
a boolean that indicates if the data returned is first indication of vaccination if there are multiple matches within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last indication of vaccination if there are multiple matches within the time period |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called flu_vaccine
is created that returns the date of vaccination for
any patients in the GP dataset between 2 dates.
flu_vaccine=patients.with_tpp_vaccination_record(
target_disease_matches="influenza",
between=["2019-09-01", "2020-04-01"],
returning="date",
date_format="YYYY-MM",
find_first_match_in_period=True,
return_expectations={
date": {"earliest": "2019-09-01", "latest": "2020-03-29"}
}
),
household_as_of(reference_date, returning=None, return_expectations=None)
🔗
Return information about the household to which the patient belonged as of the reference date. This is inferred from address data using an algorithm developed by TPP (to be documented soon) so the results are not 100% reliable but are apparently pretty good.
Parameters:
Name | Description | Default |
---|---|---|
reference_date |
date of interest as a string with the format |
required |
returning |
string indicating value to be returned. Options are:
|
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers if Examples:
|
with_these_decision_support_values(algorithm, on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='numeric_value', include_date_of_match=False, date_format=None, ignore_missing_values=False, return_expectations=None)
🔗
Returns values computed by the given decision support algorithm.
Parameters:
Name | Description | Default |
---|---|---|
algorithm |
a string indicating the decision support algorithm. Currently, the only option is |
required |
on_or_before |
the date of interest as a string with the format |
None |
on_or_after |
the date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
find_first_match_in_period |
a boolean indicating if values should be based on the first match in the period. |
None |
find_last_match_in_period |
a boolean indicating if values should be based on the last match in the period. This is the default behaviour. |
None |
returning |
a string indicating the values to return. The options are:
* |
'numeric_value' |
include_date_of_match |
a boolean indicating if an extra column containing the date of the match should be returned. The default value is |
False |
date_format |
a string indicating the format of any dates included in the values. It can be |
None |
ignore_missing_values |
a boolean indicating if matches where the value is missing or zero should be ignored. We are unable to distinguish between null values (missing) and zeros due to limitations in how the data are recorded by TPP. The default value is |
False |
return_expectations |
as described elsewhere. |
None |
ICNARC🔗
These variables are derived from the Intensive Care National Audit and Research Centre Case-Mix Programme (ICNARC-CMP), which collects information on ICU admissions across England. For more information, see the ICNARC data section.
admitted_to_icu(on_or_after=None, on_or_before=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, returning='binary_flag', date_format=None, return_expectations=None, include_month=False, include_day=False)
🔗
Return information about being admitted to ICU.
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
find_first_match_in_period |
a boolean that indicates if the data returned is first admission to icu if there are multiple admissions within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last admission to icu if there are multiple admissions within the time period |
None |
returning |
string indicating value to be returned. Options are:
(Note that the terms "basic" and "advanced" are derived from the underlying ICNARC data.) |
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value
within the population in question. This is a 2-item key-value dictionary of |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This returns two variables — one called icu_date_admitted
and another had_resp_support
:
has_resp_support=patients.admitted_to_icu(
on_or_after="2020-02-01",
find_first_match_in_period=True,
returning="had_respiratory_support",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
icu_date_admitted=patients.admitted_to_icu(
on_or_after="2020-02-01",
find_first_match_in_period=True,
returning="date_admitted",
date_format="YYYY-MM-DD",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
SGSS🔗
These variables are derived from Second Generation Surveillance System (SGSS) data which captures routine laboratory surveillance data on infectious diseases across England. For more information, see the SGSS data section.
with_test_result_in_sgss(pathogen=None, test_result='any', on_or_before=None, on_or_after=None, between=None, find_first_match_in_period=None, find_last_match_in_period=None, restrict_to_earliest_specimen_date=True, returning='binary_flag', date_format=None, return_expectations=None)
🔗
Finds COVID lab test results recorded in SGSS (Second Generation Surveillance System).
Please note that all dates used here are "specimen dates" (i.e. the date the specimen was taken), rather than the date the lab result was obtained.
It's important to note that data is supplied in two separate datasets: an "Earliest Specimen" dataset and an "All Tests" dataset.
Earliest Specimen Dataset🔗
Where a patient has multiple positive tests, SGSS groups these into "episodes" (referred to as "Organism-Patient-Illness-Episodes"). Each pathogen has a maximum episode duration (usually 2 weeks) and unless positive tests are separated by longer than this period they are assumed to be the same episode of illness. The specimen date recorded is the earliest positive specimen within the episode.
For SARS-CoV-2 the episode length has been set to infinity, meaning that once a patient has tested positive every positive test will be part of the same episode and record the same specimen date.
This means that using find_last_match_in_period
is pointless when
querying for positive results as only one date will ever be recorded and it
will be the earliest.
Our original assumption, though the documentation didn't state either way, is that every negative result would be treated as unique. However this does not appear to be the case as though some patients do have multiple negative tests in this dataset, the number is far too small to be realistic.
Information about the SARS-CoV-2 episode length was via email from someone at the National Infection Service:
The COVID-19 episode length in SGSS was set to indefinite, so all
COVID-19 records from a single patient will be classified as one
episode. This may change, but is set as it is due to limited
information around re-infection and virus clearance.
All Tests Dataset🔗
This dataset is not subject to the same restriction as above and we expect each individual test result (postive or negative) to appear in this regardless of whether they are considered as within the same infection episode. In an ideal world we could use just this dataset, but there are some fields we need (e.g. SGTF) which are only supplied on the "earliest specimen" dataset.
S-Gene Target Failure🔗
Using the returning="s_gene_target_failure"
option provides additional
output from PCR tests results which can be used as a proxy for the presence
of certain Variants of Concern.
Possible values are "", "0", "1", "9"
Definitions (from email from PHE)
1: Isolate with confirmed SGTF
Undetectable S gene; CT value (CH3) =0
Detectable ORF1ab gene; CT value (CH2) <=30 and >0
Detectable N gene; CT value (CH1) <=30 and >0
0: S gene detected
Detectable S gene (CH3>0)
Detectable y ORF1ab CT value (CH1) <=30 and >0
Detectable N gene CT value (CH2) <=30 and >0
9: Cannot be classified
Null are where the target is not S Gene. I think LFTs are currently
also coming across as 9 so will need to review those to null as well as
clearly this is a PCR only variable.
Case Category (type of test used)🔗
Using the returning="case_category"
option (only available on positive,
earliest specimen date results) reports whether the test was a Lateral Flow
or PCR test. Possible values are:
"LFT_Only", "PCR_Only", "LFT_WithPCR"
For more detail on SGSS in general see PHE_Laboratory_Reporting_Guidelines.pdf
Parameters:
Name | Description | Default |
---|---|---|
pathogen |
pathogen we are interested in. Only SARS-CoV-2 results are included in our data extract so this will throw an error if the specified pathogen is anything other than "SARS-CoV-2". |
None |
test_result |
must be one of "positive", "negative" or "any" |
'any' |
on_or_before |
date of interest as a string with the format
|
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string
with the format |
None |
find_first_match_in_period |
a boolean that indicates if the data returned is first event if there are multiple matches within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last event if there are multiple matches within the time period |
None |
restrict_to_earliest_specimen_date |
a boolean indicating whether to use the "earliest specimen" or "all tests" dataset (see above). True by default, meaning that the "earliest specimen" dataset is used. |
True |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned.
It can be |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
Two variables are created. One called first_tested_for_covid
is the
first date that a patient has a covid test never mind the result. The
second called first_positive_test_date
is the first date that a
patient has a positive test result.
first_tested_for_covid=patients.with_test_result_in_sgss(
pathogen="SARS-CoV-2",
test_result="any",
on_or_after="2020-02-01",
find_first_match_in_period=True,
returning="date",
date_format="YYYY-MM-DD",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
first_positive_test_date=patients.with_test_result_in_sgss(
pathogen="SARS-CoV-2",
test_result="positive",
on_or_after="2020-02-01",
find_first_match_in_period=True,
returning="date",
date_format="YYYY-MM-DD",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
CPNS🔗
These variables are derived from the COVID-19 Patient Notification System (CPNS), which collects info on all in-hospital covid-related deaths. For more information, see the CPNS data section.
with_death_recorded_in_cpns(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None)
🔗
Identify patients who with death registered in CPNS dataset
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called died_date_cpns
is created that returns the date of death for
any patients have died in the CPNS dataset.
died_date_cpns=patients.with_death_recorded_in_cpns(
on_or_after="2020-02-01",
returning="date_of_death",
include_month=True,
include_day=True,
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
),
ONS deaths🔗
These variables are derived from the Death Registry data provided by the Office for National Statistics. For more information, see the ONS deaths section.
died_from_any_cause(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None)
🔗
Identify patients who with ONS-registered deaths
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called died_any
is created that returns the date of death for
any patients that have died in the time period.
died_any=patients.died_from_any_cause(
on_or_after="2020-02-01",
returning="date_of_death",
date_format="YYYY-MM-DD",
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
)
with_these_codes_on_death_certificate(codelist, on_or_before=None, on_or_after=None, between=None, match_only_underlying_cause=False, returning='binary_flag', date_format=None, include_month=False, include_day=False, return_expectations=None)
🔗
Identify patients with ONS-registered death, where cause of death matches the supplied icd10 codelist
Parameters:
Name | Description | Default |
---|---|---|
codelist |
a codelist for requested value |
required |
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
match_only_underlying_cause |
boolean for indicating if filters results to only specified cause of death. |
False |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
include_month |
a boolean indicating if day should be included in addition to year (deprecated: use
|
False |
include_day |
a boolean indicating if day should be included in addition to year and
month (deprecated: use |
False |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called died_ons_covid_flag_any
is created that returns the date of death for
any patients that have covid on their death certificate even if that is the not the underlying cause
of death.
died_ons_covid_flag_any=patients.with_these_codes_on_death_certificate(
covid_codelist,
on_or_after="2020-02-01",
match_only_underlying_cause=False,
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
)
High Cost Drugs🔗
(Documentation on the source of this data will be forthcoming later.)
with_high_cost_drugs(drug_name_matches=None, on_or_before=None, on_or_after=None, between=None, returning='binary_flag', date_format=None, find_first_match_in_period=None, find_last_match_in_period=None, return_expectations=None)
🔗
Returns data from the High Cost Drugs Dataset
More details available in this notebook: https://github.com/opensafely/highcostdrugs-research/blob/master/analysis/High%20Cost%20Drugs%20Dataset.ipynb
Parameters:
Name | Description | Default |
---|---|---|
drug_name_matches |
a drug name as a string, or a list of such names, or a codelist containing such names. Results will be filtered to just rows matching any of the supplied names exactly. Note these are not standardised names, they are just the names however they come to us in the original data. |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
on_or_before |
as described elsewhere |
None |
on_or_after |
as described elsewhere |
None |
between |
as described elsewhere |
None |
find_first_match_in_period |
as described elsewhere |
None |
find_last_match_in_period |
as described elsewhere |
None |
date_format |
only "YYYY" and "YYYY-MM" supported here as day level data not available |
None |
return_expectations |
as described elsewhere |
None |
Examples:
The first month in which each patient received "ACME Drug" after March 2019:
covid_admission_date=patients.with_high_cost_drugs(
drug_name_matches="ACME Drug",
on_or_after="2019-03-01",
find_first_match_in_period=True,
returning="date",
date_format="YYYY-MM",
return_expectations={"date": {"earliest": "2019-03-01"}},
)
SUS🔗
These variables are derived from the Secondary Uses Services (SUS) data, and their underlying datasets:
admitted_to_hospital(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', find_first_match_in_period=None, find_last_match_in_period=None, date_format=None, with_these_diagnoses=None, with_these_primary_diagnoses=None, with_these_procedures=None, with_admission_method=None, with_source_of_admission=None, with_discharge_destination=None, with_patient_classification=None, with_admission_treatment_function_code=None, with_administrative_category=None, return_expectations=None)
🔗
Return information about admission to hospital.
See https://github.com/opensafely/cohort-extractor/issues/186 for in-depth discussion and background.
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
find_first_match_in_period |
a boolean that indicates if the data returned is first event if there are multiple matches within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last event if there are multiple matches within the time period |
None |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
with_these_diagnoses |
icd10 codes to match against any diagnosis (note
this uses prefix matching so a code like |
None |
with_these_primary_diagnoses |
icd10 codes to match against the primary
diagnosis note this uses prefix matching so a code like |
None |
with_these_procedures |
opcs4 codes to match against the procedure |
None |
with_admission_method |
string or list of strings to match against |
None |
with_source_of_admission |
string or list of strings to match against |
None |
with_discharge_destination |
string or list of strings to match against |
None |
with_patient_classification |
string or list of strings to match against |
None |
with_admission_treatment_function_code |
string or list of strings to match against |
None |
with_administrative_category |
string or list of strings to match against |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
The day of each patient's first hospital admission for Covid19:
covid_admission_date=patients.admitted_to_hospital(
returning= "date_admitted",
with_these_diagnoses=covid_codelist,
on_or_after="2020-02-01",
find_first_match_in_period=True,
date_format="YYYY-MM-DD",
return_expectations={"date": {"earliest": "2020-03-01"}},
)
attended_emergency_care(on_or_before=None, on_or_after=None, between=None, returning='binary_flag', find_first_match_in_period=None, find_last_match_in_period=None, date_format=None, with_these_diagnoses=None, discharged_to=None, return_expectations=None)
🔗
Return information about attendance of A&E from the ECDS dataset. Please note that there is a limited number of diagnoses allowed within this dataset, and so will not match with the range of diagnoses allowed in other datasets such as the primary care record.
Parameters:
Name | Description | Default |
---|---|---|
on_or_before |
date of interest as a string with the format |
None |
on_or_after |
date of interest as a string with the format |
None |
between |
two dates of interest as a list with each date as a string with the format |
None |
returning |
string indicating value to be returned. Options are:
|
'binary_flag' |
find_first_match_in_period |
a boolean that indicates if the data returned is first event if there are multiple matches within the time period |
None |
find_last_match_in_period |
a boolean that indicates if the data returned is last event if there are multiple matches within the time period |
None |
date_format |
a string detailing the format of the dates to be returned. It can be |
None |
with_these_diagnoses |
a list of SNOMED CT codes |
None |
discharged_to |
a list of members of refset 999003011000000105. |
None |
return_expectations |
a dictionary defining the incidence and distribution of expected value within the population in question. |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
A variable called emergency_care
is created with returns a date of first attendence in A&E if
patient had attended emergency room during the time period.
emergency_care=patients.attended_emergency_care(
on_or_after="2020-01-01",
returning="date_arrived",
date_format="YYYY-MM-DD",
find_first_match_in_period=True,
return_expectations={
"date": {"earliest" : "2020-02-01"},
"rate" : "exponential_increase"
},
)
with_ethnicity_from_sus(returning=None, use_most_frequent_code=None, return_expectations=None)
🔗
Returns ethnicity data from the SUS Datasets
Parameters:
Name | Description | Default |
---|---|---|
returning |
string indicating value to be returned. Options are:
|
None |
use_most_frequent_code |
when multiple codes are present, pick the most frequent one |
None |
return_expectations |
a dictionary describing what dummy data should look like |
None |
Examples:
Patients with ethnicity, grouped to our 16 categories:
ethnicity_by_16_grouping=patients.with_ethnicity_from_sus(
returning="group_16",
use_most_frequent_code=True,
)
Utility functions🔗
These variables create new variable from existing variables. They do not extract any data directly.
random_sample(percent=None, return_expectations=None)
🔗
Flags a random sample of approximately percent
patients.
Parameters:
Name | Description | Default |
---|---|---|
percent |
an integer between 1 and 100 for the percent of patients to include within the random sample |
None |
return_expectations |
a dict containing an expectations definition defining at least an |
None |
Returns:
Type | Description |
---|---|
list |
of integers of |
Examples:
This creates a variable example
, flagging approximately 10% of the population with the value 1
:
example=patients.random_sample(percent=10, expectations={'incidence': 0.1})
categorised_as(category_definitions, return_expectations=None, **extra_columns)
🔗
Patients who had had 1 or more code from 1 or more codelists are categorised into groups according to a prescribed algorithm.
Parameters:
Name | Description | Default |
---|---|---|
category_definitions |
a dict that defines the algorithm and the associated category
The keys of the dict are strings representing categories from a defined by other
arguments such as |
required |
return_expectations |
a dict that defined the ratios of each category. The keys are the category values as strings and the values are ratios as floats. The ratios should add up to 1. |
None |
Retyrns
list: of strings which each letter representing a category as defined by the algorithm
Examples:
This creates a variable of asthma status based on codes for asthma and categorising for recent steroid use.
current_asthma=patients.categorised_as(
{
"1": "DEFAULT",
"2": "recent_asthma_code AND
prednisolone_last_year = 0"
"3": "recent_asthma_code AND prednisolone_last_year > 0"
},
recent_asthma_code=patients.with_these_clinical_events(
asthma_codes, between=["2017-02-01", "2020-01-31"],
),
prednisolone_last_year=patients.with_these_medications(
pred_codes,
between=["2019-02-01", "2020-01-31"],
returning="number_of_matches_in_period",
),
return_expectations={
"category":{"ratios": {"0": 0.8, "1": 0.1, "2": 0.1}}
},
)
satisfying(expression, return_expectations=None, **extra_columns)
🔗
Patients who meet the criteria for one or more expressions. Used as a way of combining groups or making subgroups based on certain characteristics.
Parameters:
Name | Description | Default |
---|---|---|
expression |
a string in that links together 2 or more expressions into one statement. key variables for this expression can be defined under this statement or anywhere in study definition. |
required |
return_expectations |
a dictionary defining the rate of expected value within the population in question |
None |
Returns:
Type | Description |
---|---|
list |
of integers, either |
Examples:
This creates a study population where patients included have asthma and not copd:
population=patients.satisfying(
"""
has_asthma AND NOT
has_copd
""",
has_asthma=patients.with_these_clinical_events(
asthma_codes, between=["2017-02-28", "2020-02-29"],
has_copd=patients.with_these_clinical_events(
copd_codes, between=["2017-02-28", "2020-02-29"],
),
)
date_of(source, date_format=None, include_month=False, include_day=False, return_expectations=None)
🔗
minimum_of(*column_names, **extra_columns)
🔗
Return the minimum value over the supplied columns e.g
min_value=patients.minimum_of("some_column", "another_column")
Note: this ignores "empty values" (i.e. the values used if there is no data for a particular column, such as 0.0 for numeric values or the empty string for dates). This ensures that the minimum of a column with a defined value and one with a missing value is equal to the defined value.
Additional columns can be defined within the function call which will be used in computing the minimum but won't themselves appear in the output:
min_value=patients.minimum_of( "some_column", another_colum=patients.with_these_medications(...) )
maximum_of(*column_names, **extra_columns)
🔗
Return the maximum value over the supplied columns e.g
max_value=patients.maximum_of("some_column", "another_column")
Additional columns can be defined within the function call which will be used in computing the maximum but won't themselves appear in the output:
max_value=patients.maximum_of( "some_column", another_colum=patients.with_these_medications(...) )