2  Data

The sclerostin pQTLs and GWAS results are processed using the scripts sourced below. The linkage disequilibrium (LD) matrix of genetic variants in the SOST region (100KB downstream and 125KB upstream of SOST; see footnote 1) is computed using data from 1000 Genomes (1000 Genomes Project Consortium et al. 2015). All of the genetic data has been aligned to build 37 (hg19) coordinates.
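As a rough illustration of how an asymmetric window around a gene can be built, the sketch below computes the region boundaries and keeps the variants inside them. The function name `region_window`, the gene coordinates and the variant positions are all hypothetical, and the strand-aware reading of "upstream" and "downstream" is an assumption rather than a description of what the scripts do.

```r
# Build a genomic window around a gene, respecting strand.
# On the minus strand, "upstream" (5') lies at higher coordinates.
region_window <- function(start, end, strand = c("+", "-"),
                          upstream = 125e3, downstream = 100e3) {
  strand <- match.arg(strand)
  if (strand == "+") {
    c(start = start - upstream, end = end + downstream)
  } else {
    c(start = start - downstream, end = end + upstream)
  }
}

# Hypothetical gene coordinates, for illustration only (not the real SOST locus)
win <- region_window(start = 1000000, end = 1005000, strand = "-")

# Toy variant table (hypothetical rsIDs and positions)
variants <- data.frame(rsid = c("rsA", "rsB", "rsC"),
                       pos = c(890000, 950000, 1140000),
                       stringsAsFactors = FALSE)

# Keep variants whose position falls inside the window
in_region <- variants[variants$pos >= win[["start"]] &
                        variants$pos <= win[["end"]], ]
```

With real data, `start` and `end` would be the build 37 coordinates of the gene and `variants` the reference panel or GWAS variant table.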

2.1 Datasets

2.1.1 Sclerostin pQTLs

The cis sclerostin pQTLs in Table 2.1 were extracted from Supplementary Table 2 in Zheng et al. (2023).

Table 2.1: cis pQTLs


README
  • rsid - rsID
  • chr - chromosome
  • pos - position (build 37)
  • ea - effect allele
  • oa - other allele
  • eaf - effect allele frequency
  • beta - effect size
  • se - standard error of effect size
  • pvalue - p-value
  • n - number of samples

2.1.2 Study information

The GWAS studies used in the analyses (see footnote 2) are presented in Table 2.2 (see footnote 3).

Table 2.2: GWAS study information
README
  • id - dataset ID
  • source - source of dataset
  • pmid - PubMed ID
  • author - author of study
  • link - link to dataset
  • trait - phenotype
  • abbr - abbreviation
  • ancestry - ancestry of study
  • n - number of samples
  • n_cases - number of cases
  • n_controls - number of controls
  • unit - unit of analysis (IRNT = inverse rank normal transformation)
  • flag - flag if the dataset (or equivalent) was used by Zheng et al. (2023) (Y = yes, N = no)

Citations: Xue et al. (2018); Mahajan et al. (2022); Hartiala et al. (2021); Van Der Harst and Verweij (2018); Aragam et al. (2022); Sudlow et al. (2015); Malik et al. (2018); Mishra et al. (2022); Malhotra et al. (2019); Morris et al. (2019)

2.2 Data processing

2.2.1 LD matrix

The LD correlation matrix of the SOST region (100KB downstream and 125KB upstream of SOST) for the European samples from 1000 Genomes Phase 3 V5b (1000 Genomes Project Consortium et al. 2015) is computed using the 01_data_ldmat.R R script located in the scripts folder.

source("./scripts/01_data_ldmat.R")

The output of the script is an Rda file saved in the data folder as 01_data_ld_mat.Rda which contains two objects:

  • ld_snp - a data.frame of variant information with the columns:
    • rsid - rsID
    • chr - chromosome
    • pos - position (build 37)
    • ref - reference allele
    • alt - alternate allele
    • af - allele frequency of the alternate allele
  • ld_mat - a matrix of variant correlations (\(r\) not \(r^2\))
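Because ld_mat stores the signed correlation \(r\) rather than \(r^2\), it needs to be squared before any \(r^2\) threshold is applied. The sketch below uses a small made-up genotype matrix (not 1000 Genomes data) to show the shape of such a matrix and how \(r^2\) can be derived from it. The variant names and genotypes are hypothetical, and `cor()` stands in for whatever the script uses to compute LD.

```r
# Toy genotypes coded as alternate allele counts (0/1/2) for 10 individuals.
# Hypothetical data; the real matrix is computed from 1000 Genomes samples.
geno <- cbind(
  rs1 = c(0, 0, 1, 1, 2, 2, 0, 1, 2, 0),
  rs2 = c(0, 0, 1, 1, 2, 2, 0, 1, 2, 1),  # differs from rs1 in one individual
  rs3 = c(2, 0, 1, 0, 1, 2, 1, 2, 0, 1)   # uncorrelated with rs1
)

# Signed correlation matrix (r), analogous in shape to ld_mat
ld_mat <- cor(geno)

# Square element-wise to obtain r^2
r2 <- ld_mat^2

# Variants other than rs1 itself with r^2 >= 0.8 with rs1
proxies <- setdiff(names(which(r2["rs1", ] >= 0.8)), "rs1")
```

The sign of \(r\) depends on which allele is counted, which is why keeping \(r\) (rather than \(r^2\)) only makes sense when the effect alleles of the summary statistics are aligned to the same allele as the LD matrix.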

2.2.2 GWAS data

SOST region

The SOST region is extracted and harmonized from the GWAS datasets using the 02_data_gwas_sos_region.R R script located in the scripts folder. The pQTL dataset is also harmonized using this script. In both cases the effect allele is aligned to the alternate allele in ld_snp.

source("./scripts/02_data_gwas_region.R")
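The allele alignment described above can be sketched as follows. This is a minimal illustration and not the script's actual code: the function name `harmonise_to_alt` and the toy data are hypothetical, the input column names follow the README lists above, and strand flips and palindromic variants are deliberately not handled.

```r
# Align summary statistics so that the effect allele matches the alternate
# allele of the LD reference (hypothetical helper, for illustration only).
harmonise_to_alt <- function(gwas, ld_snp) {
  m <- merge(gwas, ld_snp[, c("rsid", "ref", "alt")], by = "rsid")
  same <- m$ea == m$alt & m$oa == m$ref   # already aligned
  swap <- m$ea == m$ref & m$oa == m$alt   # effect allele is the reference allele
  m$beta[swap] <- -m$beta[swap]           # flip the sign of the effect
  m$eaf[swap] <- 1 - m$eaf[swap]          # frequency now refers to alt
  m <- m[same | swap, ]                   # drop variants whose alleles do not match
  data.frame(rsid = m$rsid, ref = m$ref, alt = m$alt, af = m$eaf,
             beta = m$beta, se = m$se, pvalue = m$pvalue,
             stringsAsFactors = FALSE)
}

# Toy inputs (hypothetical variants and values)
gwas <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  ea = c("A", "G", "T"), oa = c("G", "A", "C"),
  eaf = c(0.3, 0.6, 0.1), beta = c(0.1, 0.2, 0.3),
  se = c(0.01, 0.02, 0.03), pvalue = c(1e-8, 1e-6, 1e-4),
  stringsAsFactors = FALSE
)
ld_snp <- data.frame(
  rsid = c("rs1", "rs2", "rs3"),
  ref = c("G", "G", "A"), alt = c("A", "A", "G"),
  stringsAsFactors = FALSE
)

out <- harmonise_to_alt(gwas, ld_snp)
```

In this toy example rs1 is already aligned, rs2 has its effect size and allele frequency flipped, and rs3 is dropped because its alleles do not match the reference.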

The output of the script is an Rda file saved in the data folder as 02_data_gwas_sost_region.Rda which contains three objects:

  • gwas - a data.frame of GWAS results from the studies above with the columns:
    • rsid - rsID
    • chr - chromosome
    • pos - position (build 37)
    • ref - reference allele
    • alt - alternate allele (effect allele)
    • af - allele frequency of the alternate allele
    • beta - effect size
    • se - standard error
    • pvalue - p-value
  • pqtls - a data.frame of pQTL results with the columns:
    • rsid - rsID
    • chr - chromosome
    • pos - position (build 37)
    • ref - reference allele
    • alt - alternate allele (effect allele)
    • af - allele frequency of the alternate allele
    • beta - effect size
    • se - standard error
    • pvalue - p-value
  • studies - a data.frame of GWAS study information with the columns:
    • id - dataset ID
    • source - source of dataset
    • pmid - PubMed ID
    • author - author of study
    • trait - phenotype
    • abbr - abbreviation
    • ancestry - ancestry of study
    • n - number of samples
    • n_cases - number of cases
    • n_controls - number of controls
    • unit - unit of analysis (IRNT = inverse rank normal transformation)
    • flag - flag if the dataset (or equivalent) was used by Zheng et al. (2023) (Y = yes, N = no)

pQTLs

The GWAS results for the sclerostin pQTLs are extracted from 02_data_gwas_sost_region.Rda using the 03_data_gwas_pqtls.R R script located in the scripts folder. The associations of rs1107747 and rs4793023 with HDL cholesterol and triglycerides are adjusted for rs72836567 using the COJO methodology (Yang et al. 2012) to account for the effects of the CD300LG gene (see Section 3.2 for further details). No good proxy variants (\(r^2 \geq 0.8\)) were identified for the sclerostin pQTLs missing from GCST006867.

source("./scripts/03_data_gwas_pqtls.R")
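To give a sense of what adjusting one variant for another from summary statistics involves, the sketch below solves for the joint effects of two variants given their marginal effects, allele frequencies and LD correlation. It follows the general idea behind COJO (approximating \(X^\top X\) from allele frequencies and reference LD) but is a simplified two-variant illustration, not the GCTA-COJO implementation or the script's code. The function name `joint_two_snps`, the equal sample size, Hardy-Weinberg equilibrium and the residual variance `sigma2` are all assumptions.

```r
# Joint (mutually adjusted) effects for two variants from marginal summary
# statistics and their LD correlation r.
# Assumptions: both variants share the same sample size n, genotypes are in
# Hardy-Weinberg equilibrium, and the residual trait variance is close to
# sigma2 (1 for a standardised trait with small genetic effects).
joint_two_snps <- function(beta, af, r, n, sigma2 = 1) {
  d <- 2 * af * (1 - af) * n               # approximate diagonal of X'X
  off <- r * sqrt(d[1] * d[2])             # approximate off-diagonal of X'X
  xtx <- matrix(c(d[1], off, off, d[2]), nrow = 2)
  xtx_inv <- solve(xtx)
  beta_joint <- as.vector(xtx_inv %*% (d * beta))
  se_joint <- sqrt(sigma2 * diag(xtx_inv))
  data.frame(beta = beta_joint, se = se_joint)
}

# Toy example (made-up numbers): the first variant has no effect of its own,
# but it is in LD (r = 0.5) with a second variant whose true effect is 0.5,
# so its marginal effect is 0.25.
res <- joint_two_snps(beta = c(0.25, 0.5), af = c(0.3, 0.3), r = 0.5, n = 10000)
```

In the toy example the adjusted effect of the first variant is zero, because its marginal association is explained entirely by LD with the second variant. The sign of `r` must correspond to the same allele coding as the effect sizes, which is why the harmonization to the LD reference matters here.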

The output of the script is an Rda file saved in the data folder as 03_data_gwas_pqtls.Rda which contains three objects:

  • gwas - a data.frame of GWAS results from the studies above with the columns:
    • rsid - rsID
    • chr - chromosome
    • pos - position (build 37)
    • ref - reference allele
    • alt - alternate allele (effect allele)
    • af - allele frequency of the alternate allele
    • beta - effect size
    • se - standard error
    • pvalue - p-value
  • pqtls - a data.frame of pQTL results with the columns:
    • rsid - rsID
    • chr - chromosome
    • pos - position (build 37)
    • ref - reference allele
    • alt - alternate allele (effect allele)
    • af - allele frequency of the alternate allele
    • beta - effect size
    • se - standard error
    • pvalue - p-value
  • studies - a data.frame of GWAS study information with the columns:
    • id - dataset ID
    • source - source of dataset
    • pmid - PubMed ID
    • author - author of study
    • trait - phenotype
    • abbr - abbreviation
    • ancestry - ancestry of study
    • n - number of samples
    • n_cases - number of cases
    • n_controls - number of controls
    • unit - unit of analysis (IRNT = inverse rank normal transformation)
    • flag - flag if the dataset (or equivalent) was used by Zheng et al. (2023) (Y = yes, N = no)

  1. An extra 25KB was added upstream to ensure that all relevant CD300LG variants are included in the region.

  2. The ischemic and cardioembolic stroke GWAS results from METASTROKE (Malik et al. 2016) used by Zheng et al. (2023) were replaced with those from MEGASTROKE (Malik et al. 2018), and the UK Biobank hypertension GWAS results from OpenGWAS used by Zheng et al. (2023) were replaced with those from Pan-UKBB due to licensing restrictions. The GWAS of coronary artery calcification (Kavousi et al. 2022) was not available (either publicly or via application) at the time of this analysis.

  3. Since Zheng et al. (2023) use trans-ethnic GWAS results, we followed suit. In all of the trans-ethnic analyses the majority of the samples were of European ancestry. There was no material difference in the results when we restricted the analyses to European-ancestry samples where possible (see Section 8.1).