Downloads

This page contains information about the PGS Catalog downloads and FTP.

Available PGS Catalog downloads

PGS Scoring Files & Metadata
    Individual PGS variants scoring and metadata files
    View PGS Score Directories (FTP)

PGS Catalog Metadata
    Available PGS global metadata files
    Bulk Metadata Downloads (.xlsx)

PGS Catalog REST API
    Programmatic access to the PGS Catalog metadata
    REST API endpoint descriptions

PGS Catalog FTP structure

The PGS Catalog FTP provides consistent access to the bulk downloads, and is indexed by Polygenic Score (PGS) ID to allow programmatic access to score-level data. The following diagram illustrates the FTP structure:

ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs
├── metadata/
│   ├── pgs_all_metadata.xlsx
│   ├── pgs_all_metadata_[sheet_name].csv (7 files)
│   ├── pgs_all_metadata.tar.gz (xlsx + csv files)
│   ├── publications/ (metadata for large studies)
│   └── previous_releases/
└── scores/
    ├── PGS000001/
    │   ├── Metadata/
    │   │   ├── PGS000001_metadata.xlsx
    │   │   ├── PGS000001_metadata_[sheet_name].csv (7 files)
    │   │   ├── PGS000001_metadata.tar.gz (xlsx + csv files)
    │   │   └── archived_versions/
    │   └── ScoringFiles/
    │       ├── PGS000001.txt.gz
    │       └── archived_versions/
    ├── PGS000002/
    ·	├─ ···
    ·	└─ ···
    ·
    └── PGS00..../
        ├─ ···
        └─ ···
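
Because the FTP is indexed by PGS ID, the locations for a given score can be derived from its ID alone. The following is a minimal sketch based only on the layout in the diagram above; the function name `score_paths` and the dictionary keys are illustrative, not part of the PGS Catalog.

```python
# Base location taken from the diagram above.
FTP_ROOT = "ftp://ftp.ebi.ac.uk/pub/databases/spot/pgs"


def score_paths(pgs_id: str) -> dict:
    """Return the expected FTP locations for one score (hypothetical helper)."""
    score_dir = f"{FTP_ROOT}/scores/{pgs_id}"
    return {
        "scoring_file": f"{score_dir}/ScoringFiles/{pgs_id}.txt.gz",
        "metadata_xlsx": f"{score_dir}/Metadata/{pgs_id}_metadata.xlsx",
        "metadata_archive": f"{score_dir}/Metadata/{pgs_id}_metadata.tar.gz",
    }


paths = score_paths("PGS000001")
print(paths["scoring_file"])
```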

PGS Scoring Files

Each scoring file (variant information, effect alleles/weights) is a gzipped, tab-delimited text file named after its PGS Catalog Score ID (e.g. PGS000001.txt.gz).

Note: These files are composed of author-reported variants and annotations, and have only been consistently formatted to have the same column headings and data types within each column. We are currently working on methods to provide harmonized versions of each PGS in different genome builds (GRCh37 and GRCh38), ensuring each variant has a chromosomal position (either by adding it based on rsID or using liftover), and flagging potentially problematic variants (e.g. palindromic SNPs) or those that are inconsistent with the current genome build (e.g. strand flips, and variants not found in the Ensembl Variation databases).

Here is a description of the PGS Scoring Files header:

### PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
## POLYGENIC SCORE (PGS) INFORMATION
# PGS ID = PGS identifier, e.g. 'PGS000001'
# PGS Name = PGS name, e.g. 'PRS77_BC' - optional
# Reported Trait = trait, e.g. 'Breast Cancer'
# Original Genome Build = Genome build/assembly, e.g. 'GRCh38'
# Number of Variants = Number of variants listed in the PGS
## SOURCE INFORMATION
# PGP ID = PGS publication identifier, e.g. 'PGP000001'
# Citation = Information about the publication
# LICENSE = License and terms of PGS use/distribution - refers to the EMBL-EBI Terms of Use by default
rsID  chr_name  chr_position  effect_allele  reference_allele  ...

Example of PGS Scoring Files header:

### PGS CATALOG SCORING FILE - see https://www.pgscatalog.org/downloads/#dl_ftp_scoring for additional information
## POLYGENIC SCORE (PGS) INFORMATION
# PGS ID = PGS000348
# PGS Name = PRS_PrCa
# Reported Trait = Prostate cancer
# Original Genome Build = GRCh37
# Number of Variants = 72
## SOURCE INFORMATION
# PGP ID = PGP000113
# Citation = Black M et al. Prostate (2020). doi:10.1002/pros.24058
# LICENSE = Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). © 2020 Ambry Genetics.
rsID  chr_name  chr_position  effect_allele  reference_allele  effect_weight  ...
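
The header fields can be collected programmatically. The sketch below assumes, based on the example above, that each field is a comment line of the form `# key = value` and that the header ends at the first line not starting with `#`. The function name `parse_header` is illustrative.

```python
def parse_header(lines):
    """Collect '# key = value' header fields until the first non-comment line."""
    fields = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the first non-comment line is the column-name row
        text = line.lstrip("#").strip()
        if " = " in text:
            # Split on the first ' = ' only, so values may contain '='.
            key, value = text.split(" = ", 1)
            fields[key.strip()] = value.strip()
    return fields


example = [
    "## POLYGENIC SCORE (PGS) INFORMATION",
    "# PGS ID = PGS000348",
    "# Original Genome Build = GRCh37",
    "# Number of Variants = 72",
    "rsID\tchr_name\tchr_position",
]
header = parse_header(example)
print(header["PGS ID"], header["Number of Variants"])
```

All values are returned as strings; converting `Number of Variants` to an integer is left to the caller.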

Each scoring file has also been edited to have consistent column headings, based on the following schema:

| Column Header | Field Name | Field Description | Field Requirement |
| --- | --- | --- | --- |
| rsID | dbSNP Accession ID (rsID) | The SNP’s rs ID | Optional - unless both the chr_name and chr_position columns are absent. This column also contains HLA alleles in the standard notation (e.g. HLA-DQA1*0102) that aren’t always provided with chromosomal positions. |
| chr_name | Location - Chromosome | Chromosome name/number associated with the variant | Required - may be optional if an rsID for the variant is provided |
| chr_position | Location - Position within the Chromosome | Chromosomal position associated with the variant | Required - may be optional if an rsID for the variant is provided |
| effect_allele | Effect Allele | The allele whose dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant's weight ('effect_weight') when calculating the score. The effect allele is also known as the 'risk allele'. | Required |
| reference_allele | Reference Allele | The other allele(s) at the locus | Optional - but strongly recommended |
| effect_weight | Variant Weight | Value of the effect that is multiplied by the dosage of the effect allele ('effect_allele') when calculating the score. | Required |
| locus_name | Locus Name | This is kept in for loci where the variant may be referenced by the gene (e.g. APOE e4). It is also common (usually in smaller PGS) to see the variants named according to the genes they impact. | Optional |
| weight_type | Type of Weight | Whether the author-supplied Variant Weight is a beta (effect size) or something like an OR/HR (odds/hazard ratio) | Optional |
| allelefrequency_effect | Effect Allele Frequency | Reported effect allele frequency; if the associated locus is a haplotype, the haplotype frequency is extracted. | Optional |
| is_interaction | FLAG: Interaction | This is a TRUE/FALSE variable that flags whether the weight should be multiplied with the dosage of more than one variant. Interactions are demarcated with a _x_ between entries for each of the variants present in the interaction. | Optional |
| is_dominant | FLAG: Dominant Inheritance Model | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum if there is at least 1 copy of the effect allele (e.g. it is a dominant allele). | Optional |
| is_recessive | FLAG: Recessive Inheritance Model | This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum only if there are 2 copies of the effect allele (e.g. it is a recessive allele). | Optional |
| is_haplotype, is_diplotype | FLAG: Haplotype or Diplotype | This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semi-colon separated. | Optional |
| imputation_method | Imputation Method | This describes whether the variant was called with a specific imputation or variant calling method. This is mostly kept to describe HLA-genotyping methods (e.g. flag SNP2HLA, HLA*IMP) that give alleles that are not referenced by genomic position. | Optional |
| variant_description | Variant Description | This field describes any extra information about the variant (e.g. how it is genotyped or scored) that cannot be captured by the other fields. | Optional |
| inclusion_criteria | Score Inclusion Criteria | Explanation of when this variant gets included into the PGS (e.g. if it depends on the results from other variants). | Optional |
| Extra columns: OR, HR | Odds Ratio [OR], Hazard Ratio [HR] | Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given, the weight is calculated using the log(OR) or log(HR). | Optional |
| allelefrequency_effect_Ancestry | Population-specific effect allele frequency | Reported effect allele frequency in a specific population (described by the authors). | Optional |
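
The schema above describes how a score is calculated: the dosage of each effect allele is multiplied by its weight and the products are summed, with the dominant and recessive flags changing when the weight is added. The following is a minimal sketch of that arithmetic, not the PGS Catalog's own implementation. It assumes that flags are stored as the string `TRUE`, that variants are keyed by `rsID`, and that a variant missing from the genotype data has dosage 0. Interactions, haplotypes and diplotypes are not handled, and the function names are illustrative.

```python
import math


def variant_contribution(dosage, row):
    """Contribution of one variant, given the effect-allele dosage (0, 1 or 2)."""
    if row.get("effect_weight") not in (None, ""):
        weight = float(row["effect_weight"])
    else:
        # Fall back to log(OR) or log(HR), as described for the extra columns.
        ratio = row.get("OR") or row.get("HR")
        weight = math.log(float(ratio))
    if row.get("is_dominant") == "TRUE":
        return weight if dosage >= 1 else 0.0  # at least 1 copy
    if row.get("is_recessive") == "TRUE":
        return weight if dosage == 2 else 0.0  # only with 2 copies
    return dosage * weight


def polygenic_score(dosages, rows):
    """Sum contributions over all variants; `dosages` maps rsID -> dosage."""
    # Assumption: a variant absent from `dosages` counts as dosage 0.
    return sum(variant_contribution(dosages.get(r["rsID"], 0), r) for r in rows)


rows = [
    {"rsID": "rs2843152", "effect_weight": "-2.76009e-02"},
    {"rsID": "rs35465346", "effect_weight": "2.39340e-02"},
]
score = polygenic_score({"rs2843152": 2, "rs35465346": 1}, rows)
print(score)
```
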
Example of PGS Scoring Files data:

[Scoring Files header]
rsID        chr_name  chr_position  effect_allele  reference_allele  effect_weight
rs2843152   1         2245570       G              C                 -2.76009e-02
rs35465346  1         22132518      G              A                  2.39340e-02
rs28470722  1         38386727      G              A                 -1.74935e-02
rs11206510  1         55496039      T              C                  2.93005e-02
rs9970807   1         56965664      C              T                  4.70027e-02
rs61772626  1         57015668      A              G                 -2.71202e-02
rs7528419   1         109817192     A              G                  2.91912e-02
rs1277930   1         109822143     A              G                  2.60105e-02
rs11102000  1         110298166     G              C                  2.45969e-02
rs11810571  1         151762308     G              C                  2.09215e-02
rs6689306   1         154395946     G              A                 -1.97906e-02
rs72702224  1         154911689     G              A                 -2.81310e-02
rs3738591   1         155764808     C              G                  4.23731e-02
...
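
A scoring file in this layout can be read with the Python standard library. This sketch assumes the file is UTF-8 encoded, that header lines start with `#`, and that the first non-comment line holds the column names. The function name `read_scoring_file` is illustrative.

```python
import csv
import gzip


def read_scoring_file(path):
    """Return (header_lines, rows) from a gzipped, tab-delimited scoring file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Reads the whole file into memory; very large scoring files
        # may need a streaming approach instead.
        lines = handle.read().splitlines()
    header_lines = []
    data_lines = []
    for line in lines:
        if line.startswith("#"):
            header_lines.append(line)
        elif line:
            data_lines.append(line)
    # The first data line supplies the column names for each row dict.
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return header_lines, rows


# Usage, given a locally downloaded file:
# header_lines, rows = read_scoring_file("PGS000001.txt.gz")
```

Every value is returned as a string, so `effect_weight` needs converting with `float()` before use.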

PGS Catalog Metadata

The bulk download covers the metadata of the entire PGS Catalog, describing all PGS in terms of their publication source, the samples used for development/evaluation, and related performance metrics. It is distributed as pgs_all_metadata.xlsx.

The bulk download is a single Excel file with multiple sheets, one for each of the data types. The sheets are also provided as individual .csv files for easier import into analysis tools, and are available on the FTP in the metadata/ folder.

| Worksheet | Description | Download Sheet (.csv) |
| --- | --- | --- |
| Readme | PGS Catalog Release Date and Summary Information. | - |
| Publications | Lists the publication sources for the PGS and PGS evaluations in the catalog. | .csv |
| EFO Traits | Lists the ontology-mapped traits information for all PGS in the catalog. | .csv |
| Scores | Lists all PGS scores and their associated metadata. | .csv |
| Score Development Samples | Lists the samples used to create the PGS: samples used to discover the variant associations (GWAS), samples used for score development/training. | .csv |
| Performance Metrics | Lists all performance metrics and the associated PGS Scores and Publications. | .csv |
| Evaluation Sample Sets | Describes the samples used to evaluate PGS performance (referenced as Polygenic Score Sample Sets (PSS)). | .csv |
| Cohorts | Lists all the cohorts used in the different samples. | .csv |
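
The .csv sheets bundled in the metadata .tar.gz archive can be loaded without unpacking it to disk. This sketch assumes the .csv members are UTF-8 encoded and comma-separated; it makes no assumption about sheet or column names and simply returns whatever the archive contains. The function name `read_csv_sheets` is illustrative.

```python
import csv
import io
import tarfile


def read_csv_sheets(archive_path):
    """Map each .csv member name in the archive to a list of row dicts."""
    sheets = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and non-CSV members such as the .xlsx file.
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            sheets[member.name] = list(csv.DictReader(text))
    return sheets


# Usage, given a locally downloaded archive:
# sheets = read_csv_sheets("pgs_all_metadata.tar.gz")
```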