PGS Catalog Data Description

This page contains information regarding the contents of the PGS Catalog, describing the curated data fields and tables extracted from PGS publications. The descriptions are based on the major data structures in the Catalog:

Publication

Each publication in the database is given a Polygenic Publication (PGP) ID so that scores and evaluations link to the same source object. When browsing by publication, the Number of PGS Developed refers to the number of newly developed PGS in the paper, and the Number of PGS Evaluated refers to the number of PGS (new and existing) that have performance metrics derived in the study. For each publication, the following information is extracted:

PubMed ID (PMID) PubMed Identification number.
Digital Object Identifier (doi) The doi of each publication is curated in addition to the PMID so that unpublished work (e.g. pre-prints) can be added to the Catalog.
Title Title of the publication.
Author(s) List of publication authors. The first author is also extracted for a shorter display.
Journal The name of the publication source.
Publication Date Date of publication (with respect to the PMID or doi upon upload to the Catalog).

Polygenic Score (PGS)

Each PGS in the database is given a unique Polygenic Score (PGS) ID. The following information is extracted and associated with each PGS in the Catalog:

Predicted Trait
Reported Trait The author-reported trait (e.g. body mass index [BMI], or coronary artery disease) that the PGS has been developed to predict.
Mapped Trait(s) The Reported Trait is mapped to Experimental Factor Ontology (EFO) terms and their respective identifiers by PGS Catalog curators. For more information about the ontology traits see the Trait section.
Score Details
PGS Name This may be the name that the authors use to refer to the PGS, or a name that a curator has assigned to identify the score during the curation process (before a PGS ID has been given).
Original Genome Build The version of the genome that the variants present in the PGS are associated with. Listed as NR (Not Reported) if unknown.
Number of Variants Number of variants used to calculate the PGS. In the future this will include a more detailed description of the types of variants present.
Number of Variant Interaction Terms Number of higher-order variant interactions included in the PGS.
PGS Development Method The name or description of the method or computational algorithm used to develop the PGS.
PGS Development Details/Relevant Parameters A description of the inputs and parameters relevant to the PGS development method/process.
PGS Catalog Publication ID (PGP ID) A PGP ID links the PGS to the publication in which it was described.
Citation External link to the original publication source.
Weight Type Indicates whether the author-supplied Variant Weight is a beta (effect size) or a ratio such as an OR/HR (odds/hazard ratio).
Terms and Licenses The PGS Catalog distributes its data according to EBI’s standard Terms of use. Some PGS have specific terms, licenses, or restrictions (e.g. non-commercial use) that we highlight in this field, if known.
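The Weight Type matters when the weights are applied. A PGS is conventionally calculated as a weighted sum of allele dosages, and ratio-type weights (OR/HR) are usually moved to the log scale before summing. The sketch below illustrates this under stated assumptions: the function names, the weight-type labels ("beta", "OR", "HR") and the dosage encoding (0, 1 or 2 copies of the effect allele) are hypothetical and are not part of the Catalog's schema.

```python
import math


def to_additive_weight(weight: float, weight_type: str) -> float:
    """Return a variant weight on the additive (log) scale.

    Assumption: weight_type is one of "beta", "OR" or "HR" (case-insensitive).
    """
    kind = weight_type.strip().lower()
    if kind == "beta":
        return weight
    if kind in {"or", "hr"}:
        if weight <= 0:
            raise ValueError("ratio-type weights must be positive")
        # Odds/hazard ratios become additive after a natural-log transform.
        return math.log(weight)
    raise ValueError(f"unrecognised weight type: {weight_type!r}")


def polygenic_score(dosages, weights, weight_type="beta"):
    """Weighted sum of effect-allele dosages for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(
        dosage * to_additive_weight(weight, weight_type)
        for dosage, weight in zip(dosages, weights)
    )
```

For example, `polygenic_score([0, 1, 2], [0.1, 0.2, 0.3])` sums the beta weights directly, whereas passing `weight_type="OR"` log-transforms each weight first.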

The following information about the PGS is captured in tables and described in subsequent sections:

Development Samples

Information about the samples used for the development of the PGS. Relevant column descriptions are in the Sample Description section.

Source of Variant Associations (GWAS) A table describing the samples used to define the variant associations/effect-sizes used in the PGS. These data are extracted from the NHGRI-EBI GWAS Catalog when a study ID (GCST) is available.
Score Development/Training A table describing the samples used to develop or train the score (i.e. not used for variant discovery, and non-overlapping with the samples used to evaluate the PGS predictive ability).
Performance Metrics A record of the performance metrics that have been reported for the PGS. Relevant column descriptions are in the Performance Metrics section.
Evaluated Samples Information about the samples used in PGS performance evaluation. These samples have a PGS Catalog Sample Set ID (PSS ID) to link them to their associated performance metrics (and across different PGS). Relevant column descriptions are in the Sample Description section.

Trait

Traits in the Catalog are displayed and grouped according to the Mapped Traits rather than the author Reported Traits to facilitate comparability, similar to the NHGRI-EBI GWAS Catalog. For a complete description of why the trait ontology is employed, please refer to the related documentation from the NHGRI-EBI GWAS Catalog. The Experimental Factor Ontology is hosted and described here: Experimental Factor Ontology (EFO). The following information is stored in the PGS Catalog for each EFO trait ID:

Trait The trait label from the ontology.
Identifier The Experimental Factor Ontology ID (EFO_ID), used to refer to traits consistently via the EFO and to link to other resources such as the NHGRI-EBI GWAS Catalog.
Description Detailed description of the trait from EFO.
Synonyms Other names for the trait.
Mapped Term(s) Includes references to terms in other databases and ontologies (e.g. ICD9/ICD10, MONDO, SNOMEDCT, etc.).
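Grouping by Mapped Trait rather than Reported Trait can be sketched as follows. The records, field names and identifiers are placeholders made up for illustration (they are not real Catalog entries), and the dictionary layout is an assumption rather than the Catalog's actual schema.

```python
from collections import defaultdict

# Hypothetical records: all identifiers below are placeholders.
scores = [
    {"pgs_id": "PGS_A", "reported_trait": "Body mass index", "mapped_traits": ["EFO_X"]},
    {"pgs_id": "PGS_B", "reported_trait": "BMI", "mapped_traits": ["EFO_X"]},
    {"pgs_id": "PGS_C", "reported_trait": "Coronary artery disease", "mapped_traits": ["EFO_Y"]},
]


def group_by_mapped_trait(records):
    """Index score IDs by each ontology trait ID they are mapped to."""
    groups = defaultdict(list)
    for record in records:
        # A score can be mapped to more than one ontology term.
        for trait_id in record["mapped_traits"]:
            groups[trait_id].append(record["pgs_id"])
    return dict(groups)


grouped = group_by_mapped_trait(scores)
```

The two scores whose authors worded the trait differently ("Body mass index" and "BMI") end up in the same group because they share a mapped identifier.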

Sample

A consistent set of fields is used to describe the samples used to develop and evaluate each PGS:

PGS Catalog Sample Set ID (PSS ID) PSS IDs are assigned to describe samples used in PGS evaluations (e.g. Performance Metrics). A PSS ID is not uniquely associated with a single PGS; multiple PGS can be evaluated on the same sample set (PSS ID).
Phenotype Definitions and Methods A description of how the phenotype was measured or defined (e.g. ICD codes used to identify cases/phenotypes in EHR data).
Participant Follow-up Time A summary of the follow-up time (mean/median, range/confidence intervals) for participants that are part of a prospective cohort/study design (used to measure disease incidence).
Study Identifiers Identifiers used to link the samples to their initial descriptions (e.g. using PubMed IDs) and if they were used for variant associations to their associated GWAS studies (using NHGRI-EBI GWAS Catalog GCST IDs).
Sample Numbers This field describes the number of individuals included in the sample, along with the number of cases and controls (if the trait is dichotomous), and the percent of participants that are male (if available). In cases where the study does not provide the exact sample size we display the value as NR (not reported) and as missing in the API/metadata downloads.
Age of Study Participants A summary (mean/median, range/confidence intervals) of study participants' ages.
Ancestry fields:

Fields describing sample ancestry are curated according to the framework used to record ancestry data from the NHGRI-EBI GWAS Catalog. See our Ancestry Documentation page for a complete description of how ancestry is represented and described in the PGS Catalog.

Broad Ancestry Category Author reported ancestry is mapped to the best matching ancestry category from the NHGRI-EBI GWAS Catalog framework (Table 1, Morales et al. (2018)).
Ancestry A more detailed description of sample ancestry that usually matches the most specific description described by the authors (e.g. French, Chinese).
Sample Ancestry This field is displayed on the website and represents a combination of the two ancestry categories (with the more specific terms in brackets).
Country of recruitment Author reported countries of recruitment (if available).
Additional Ancestry Description Any additional description not captured in the structured data (e.g. founder or genetically isolated populations, or further description of admixed samples).
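The way Sample Ancestry combines the two ancestry fields can be sketched as below. The function name, the use of round brackets and the rule that omits the detail when it adds nothing are assumptions made for illustration; the website's exact formatting may differ.

```python
from typing import Optional


def sample_ancestry_label(broad_category: str, ancestry: Optional[str] = None) -> str:
    """Combine the broad category with the more specific ancestry in brackets."""
    broad = broad_category.strip()
    detail = (ancestry or "").strip()
    # Assumption: skip the brackets when no extra detail is available.
    if not detail or detail.lower() == broad.lower():
        return broad
    return f"{broad} ({detail})"
```

With these assumptions, a broad category of "European" and a reported ancestry of "French" would be shown as "European (French)".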
Cohort(s)

A list of cohorts that collected the samples.

The initial list of common cohorts used in genetics studies that seeded these annotations is from Mills & Rahal, Communications Biology (2019).
Additional Sample/Cohort Information Any additional description about the samples and what they were used for that is not captured by the structured categories (e.g. sub-cohort information).
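The handling of unreported sample sizes described under Sample Numbers (NR on the website, missing in the API/metadata downloads) can be sketched as follows. The class, its field names and the use of `None` to represent a missing value are assumptions for illustration, not the Catalog's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SampleNumbers:
    """Hypothetical container for the Sample Numbers field."""
    n_individuals: Optional[int] = None
    n_cases: Optional[int] = None
    n_controls: Optional[int] = None
    percent_male: Optional[float] = None


def display_value(value: Optional[int]) -> str:
    """Website-style rendering: a missing value is shown as NR."""
    return "NR" if value is None else str(value)


# Made-up example: the study gave cases and controls but no exact total.
sample = SampleNumbers(n_cases=1200, n_controls=3400)
shown_on_website = display_value(sample.n_individuals)
metadata_row = asdict(sample)  # the missing total stays None here
```

Keeping the stored value empty and applying the "NR" label only at display time means downstream code never has to parse a text marker out of a numeric column.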

Performance Metrics

Each evaluation of a PGS is given a PGS Performance Metric (PPM) ID that links it to a description of the results:

PGS Catalog Sample Set ID (PSS ID) ID that links to the sample set on which the displayed PGS was evaluated.
Performance Source ID that links to the publication where the performance metrics were reported.
Trait This field displays both the Reported and Mapped Traits. The reported trait often corresponds to the test set names reported in the publication, or more specific aspects of the phenotype being tested (e.g. if the disease cases are incident vs. recurrent events).
Reported values:

The values of the performance metrics are all recorded in the same way (the estimate is recorded along with the 95% confidence interval, if supplied) and are grouped according to the type of statistic they represent:

PGS Effect Sizes (per SD change) Standardized effect sizes, per standard deviation (SD) change in PGS. Examples include regression coefficients (betas) for continuous traits, and odds ratios (OR) and/or hazard ratios (HR) for dichotomous traits, depending on the availability of time-to-event data.
PGS Classification Metrics Examples include the Area under the Receiver Operating Characteristic (AUROC) or Harrell's C-index (Concordance statistic).
Other Metrics Metrics that do not fit into the other two categories. Examples include R² (proportion of the variance explained) and reclassification metrics.
Covariates Included in PGS Model A comma-separated list of covariates used in the prediction model to evaluate the PGS. Examples include: age, sex, smoking habits, etc.
PGS Performance: Other Relevant Information Any other information relevant to the understanding of the performance metrics.
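The structure described above (one evaluation record that links a score, a sample set and a publication, with metric values grouped by type) can be sketched as a pair of data classes. All class and field names are hypothetical, the identifiers are placeholders, and the numbers are made up; this is an illustration of the layout rather than the Catalog's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricValue:
    """One reported statistic: an estimate plus an optional 95% confidence interval."""
    name: str
    estimate: float
    ci_lower: Optional[float] = None
    ci_upper: Optional[float] = None

    def __str__(self) -> str:
        if self.ci_lower is None or self.ci_upper is None:
            return f"{self.name}: {self.estimate}"
        return f"{self.name}: {self.estimate} [{self.ci_lower}, {self.ci_upper}]"


@dataclass
class PerformanceMetric:
    """One evaluation of a PGS, with values grouped by type of statistic."""
    ppm_id: str
    pgs_id: str
    pss_id: str
    publication_id: str
    effect_sizes: List[MetricValue] = field(default_factory=list)
    classification_metrics: List[MetricValue] = field(default_factory=list)
    other_metrics: List[MetricValue] = field(default_factory=list)
    covariates: List[str] = field(default_factory=list)


def parse_covariates(text: str) -> List[str]:
    """Split a comma-separated covariate string into a clean list."""
    return [item.strip() for item in text.split(",") if item.strip()]


# Placeholder identifiers and made-up numbers, for illustration only.
evaluation = PerformanceMetric(
    ppm_id="PPM_EXAMPLE",
    pgs_id="PGS_EXAMPLE",
    pss_id="PSS_EXAMPLE",
    publication_id="PGP_EXAMPLE",
    effect_sizes=[MetricValue("OR", 1.5, 1.2, 1.9)],
    classification_metrics=[MetricValue("AUROC", 0.7)],
    covariates=parse_covariates("age, sex, smoking habits"),
)

for metric in evaluation.effect_sizes + evaluation.classification_metrics:
    print(metric)
```

Because the sample set is referenced by ID rather than embedded, several evaluation records (for different PGS) can point at the same sample set.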