PGS Catalog Ancestry Documentation

This page contains information about how participant ancestry is cataloged and represented for scores and sample descriptions in the PGS Catalog.

Essential to the interpretation of PGS data is the accurate description of the ancestry of samples included in these studies. The PGS Catalog describes the ancestry of samples, in a standardized manner, according to the published framework developed by the NHGRI-EBI GWAS Catalog and endorsed in the recent Polygenic Risk Score Reporting Standards (PRS-RS). Sample ancestry is represented in two forms: (1) an ancestry category from a controlled list and (2) a more detailed ancestry name/description (if available). Detailed descriptions aim to capture accurate, informative, and comprehensive information regarding the ancestry or genealogy of each distinct sample. Category assignment reduces complexity within data sets and enables the establishment of hierarchical relationships, placing samples in context with other samples, groups, and populations.

See also the PGS Catalog Data Description page for a complete description of the metadata captured for samples in PGS studies.

Ancestry Categories

Each sample description is mapped to an ancestry category and the samples are combined if they belong to the same category. We attempt to report ancestry-specific sample descriptions (e.g. sample sizes and case/control breakdowns); however, this is not always possible and samples can be labelled with more than one broad ancestry category.
A complete list of ancestry categories used in the PGS Catalog (adapted from Table 1, Morales et al 2018) can be found below:

Ancestry categoryDisplay category (for charts)Definition
Aboriginal AustralianAdditional Diverse AncestriesIncludes individuals who either self-report or have been described by authors as Australian Aboriginal. These are expected to be descendants of early human migration into Australia from Eastern Asia and can be distinguished from other Asian populations by mtDNA and Y chromosome variation.
African American or
Afro-Caribbean
AfricanIncludes individuals who either self-report or have been described by authors as African American or Afro-Caribbean. This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap ACB or ASW populations. We note that there is likely to be significant admixture with European ancestry populations.
African unspecifiedAfricanIncludes individuals that either self-report or have been described as African, but there was not sufficient information to allow classification as African American, Afro-Caribbean or Sub-Saharan African.
Asian unspecifiedAdditional Asian Ancestries
(including Central, and South East Asian)
Includes individuals that either self-report or have been described as Asian but there was not sufficient information to allow classification as East Asian, Central Asian, South Asian, or South-East Asian.
Central AsianAdditional Asian Ancestries
(including Central, and South East Asian)
Includes individuals who either self-report or have been described by authors as Central Asian. We note that there does not appear to be a suitable reference population for this population and efforts are required to fill this gap.
East AsianEast AsianIncludes individuals who either self-report or have been described by authors as East Asian or one of the sub-populations from this region (e.g., Chinese). This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap CDX, CHB, CHS, and JPT populations.
EuropeanEuropeanIncludes individuals who either self-report or have been described by authors as European, Caucasian, white, or one of the sub-populations from this region (e.g., Dutch). This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap CEU, FIN, GBR, IBS, and TSI populations.
Greater Middle Eastern
(Middle Eastern, North African or Persian)
Greater Middle Eastern
(Middle Eastern, North African or Persian)
Includes individuals who self-report or were described by authors as Middle Eastern, North African, Persian, or one of the sub-populations from this region (e.g., Saudi Arabian). We note there is heterogeneity in this category with different degrees of admixture as well as levels of genetic isolation. We note that there does not appear to be a suitable reference population for this category and efforts are required to fill this gap.
Hispanic or Latin AmericanHispanic or Latin AmericanIncludes individuals who either self-report or are described by authors as Hispanic, Latino, Latin American, or one of the sub-populations from this region. This category includes individuals with known admixture of primarily European, African, and Native American ancestries, though some may have also a degree of Asian (e.g., Peru). We also note that the levels of admixture vary depending on the country, with Caribbean countries carrying higher levels of African admixture when compared to South American countries, for example. This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap CLM, MXL, PEL, and PUR populations.
Native AmericanAdditional Diverse AncestriesIncludes indigenous individuals of North, Central, and South America, descended from the original human migration into the Americas from Siberia. We note that there does not appear to be a suitable reference population for this category and efforts are required to fill this gap.
Not reportedAncestry Not ReportedIncludes individuals for which no ancestry or country of recruitment information is available.
OceanianAdditional Diverse AncestriesIncludes individuals that either self-report or have been described by authors as Oceanian or one of the sub-populations from this region (e.g., Native Hawaiian). We note that there does not appear to be a suitable reference population for this category and efforts are required to fill this gap.
OtherAdditional Diverse AncestriesIncludes individuals where an ancestry descriptor is known but insufficient information is available to allow assignment to one of the other categories.
Other admixed ancestryAdditional Diverse AncestriesIncludes individuals who either self-report or have been described by authors as admixed and do not fit the definition of the other admixed categories already defined (“African American or Afro-Caribbean” or “Hispanic or Latin American”).
South AsianSouth AsianIncludes individuals who either self-report or have been described by authors as South Asian or one of the sub-populations from this region (e.g., Asian Indian). This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap BEB, GIH, ITU, PJL, and STU populations.
South East AsianAdditional Asian Ancestries
(including Central, and South East Asian)
Includes individuals who either self-report or have been described by authors as South East Asian or one of the sub-populations from this region (e.g., Vietnamese). This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes KHV population. We note that East Asian and South East Asian populations are often conflated. However, recent studies indicate a unique genetic background for South East Asian populations.
Sub-Saharan AfricanAfricanIncludes individuals who either self-report or have been described by authors as Sub-Saharan African or one of the sub-populations from this region (e.g., Yoruban). This category also includes individuals who genetically cluster with reference populations from this region, for example, 1000 Genomes and/or HapMap ESN, LWK, GWD, MSL, MKK, and YRI populations.
Multi-Ancestry
(including Europeans)
Multi-Ancestry
(including Europeans)
Combined sample of multiple ancestries that includes European ancestry individuals. Used when ancestry-specific sample sizes are unknown or not being considered (e.g. in combined samples).
Multi-Ancestry
(excluding Europeans)
Multi-Ancestry
(excluding Europeans)
Combined sample of multiple ancestries that does not include any European ancestry individuals. Used when ancestry-specific sample sizes are unknown or not being considered (e.g. in combined samples).

To reduce the complexity of displaying the data some of the ancestry categories have been merged into higher-level groupings (Display categories) - these groupings represent the current breadth of data in the PGS Catalog and we plan to review and update these groupings as more data is added.

Displaying Ancestry Distributions

We provide visual summaries of participant ancestry within each stage of PGS development (source of variant associations [GWAS], and score development/training) and evaluation. Here, we outline how the ancestry distribution is calculated in each stage:

PGS Development
Study stageDistribution calculationExample
Source of Variant Associations
(GWAS)

A count of the total number of individuals in each ancestry display category across all GWAS samples used to develop the score. Samples are labelled Multi-ancestry (including or excluding Europeans) if ancestry-specific numbers are not available (see example).

The relative proportion of each ancestry group is displayed as a percentage of the total number of individuals.

  • GWAS A: 2500 individuals - European, South Asian
  • GWAS B: 1500 individuals - European
  • GWAS C: 1000 individuals - East Asian
Distribution:
Multi-Ancestry (including Europeans): 50% (2500 individuals out of 5000 individuals)
European: 30% (1500 individuals out of 5000 individuals)
East Asian: 20% (1000 individuals out of 5000 individuals)
Score Development/Training Calculated the same as above, except for samples used for Score Development/Training.
PGS Evaluation
Study stageDistribution calculationExample
PGS Evaluation

The calculation of the ancestry distribution for samples used in PGS Evaluation is different from those in development, to highlight the diversity of ancestries represented in this stage instead of their relative sample sizes. To calculate the ancestry distribution for PGS Evaluation we first assign an ancestry label to each unique Sample Set the PGS has been applied to. Samples are labelled Multi-ancestry (including or excluding Europeans) if more than one ancestry is present in the combined Sample Set (see example).

The distribution is calculated by dividing the number of Sample Sets with the same ancestry category by the total number of Sample Sets. This can be interpreted as the percentage of a scores’ performance data that is derived from a specific ancestry group; however, it should be noted that the percentage does not account for sample size differences between the sample sets for simplicity.

Example with 4 Sample Sets to evaluate the same polygenic score:

Sample Set A
  • Sample A1: European
  • Sample A2: European
  • Sample A3: East Asian
Combined ancestry:
Multi-ancestry (including European)
Sample Set B
  • Sample B1: African
  • Sample B2: African
Combined ancestry:
African
Sample Set C
  • Sample C1: East Asian
Combined ancestry:
East Asian
Sample Set D
  • Sample D1: African
Combined ancestry:
African
Distribution:
African: 50% (2 Sample Sets [B & D] out of 4 Sample Sets)
East Asian: 25% (1 Sample Set [C] out of 4 Sample Sets)
Multi-ancestry (including European): 25% (1 Sample Set [A] out of 4 Sample Sets)
Example

Here is an example of the ancestry distributions for a single score (PGS000018). You can hover over the pie charts to see the list of ancestries included and their proportions, as well as the total sample sizes.

Ancestry legend
Multi-ancestry (including European)
Multi-ancestry (excluding European)
African
East Asian
South Asian
Additional Asian Ancestries
European
Greater Middle Eastern
Hispanic or Latin American
Additional Diverse Ancestries
Not Reported
Polygenic Score ID & Name PGS Publication ID (PGP) Reported Trait Mapped Trait(s) (Ontology) Number of Variants Ancestry distribution Scoring File (FTP Link)
PGS000018
(metaGRS_CAD)
PGP000007 |
Inouye M et al. J Am Coll Cardiol (2018)
Coronary artery disease coronary artery disease 1,745,179
https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000018/ScoringFiles/PGS000018.txt.gz

Filtering PGS by Ancestry

Users may be interested in finding PGS that have been developed or evaluated with data from a particular ancestry group. To facilitate this we’ve implemented filters on the PGS browsing, Publication and Trait pages that show/hide scores based on the ancestry data and criteria you select. With this comes two display options that easily allow users to identify PGS using multi-ancestry data or scores that are ‘non-European’ (e.g. contain no European ancestry data).
The data can be filtered with two selections: (1) selecting the stage of the PGS study design you’re interested in querying, (2) selecting the ancestry category you are interested in viewing data for.

1 - Study stages

First, you need to select on which PGS stage you want to filter the ancestries (all of the data is displayed by default). The first dropdown contains the possibile stages of a PGS study that can be queried:

All Stages combined [G, D, E] Corresponds to the combination of the stages "Source of Variant Associations (GWAS)", "Score Development/Training" and "Score Evaluation".
Development [G, D] Corresponds to the combination of the stages "Source of Variant Associations (GWAS)" and "Score Development/Training"
GWAS [G] Corresponds to the stage "Source of Variant Associations (GWAS)".
Score Development/Training [D] Corresponds to the stage "Score Development/Training".
PGS Evaluation [E] Corresponds to the stage "PGS Evaluation".
2 - Ancestry selection

Second, you can show/hide PGS based on the ancestries of participants in the study stage you selected. The options are as follows:

List of ancestries includes:
("Select an ancestry" dropdown)
Filter PGS that include data of individuals from the selected ancestry at the selected study stage. This also filters the contributing ancestries within the Multi-ancestry labels (where specific sample size information is unavailable).
Display Options: Show European ancestry data [default: on] This button can be used to hide PGS with any European ancestry data to only show scores with data from other non-European ancestry groups. The button is selected by default, displaying all PGS.
Show only Multi-ancestry data [default: off] Shows only PGS that include data from multiple ancestry groups at the selected study stage, hiding PGS with data from a single ancestry group.
Example Use Cases

Here are some examples of common filter combinations/use-cases that may be relevant to users looking to match their PGS to samples based on reported ancestry labels.

References

Wand, Lambert et al.

Improving reporting standards for polygenic scores in risk prediction studies

Nature (2021) 591, 211–219