Polygenic risk score

Last updated: 2024-01-23

Checks: 6 1

Knit directory: QBS-statsgen/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20231230)

The command set.seed(20231230) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 85e59dd

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 85e59dd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/B81CBE6F/bibliography-index/
    Ignored:    .Rproj.user/B81CBE6F/ctx/
    Ignored:    .Rproj.user/B81CBE6F/pcs/
    Ignored:    .Rproj.user/B81CBE6F/presentation/
    Ignored:    .Rproj.user/B81CBE6F/profiles-cache/
    Ignored:    .Rproj.user/B81CBE6F/sources/per/
    Ignored:    .Rproj.user/B81CBE6F/tutorial/
    Ignored:    .Rproj.user/shared/notebooks/1C2AC29C-e1-gwas-power/
    Ignored:    .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/ce0r78nx8keuu/
    Ignored:    .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/csetup_chunk/
    Ignored:    .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/czxn6jf8lsykc/
    Ignored:    .Rproj.user/shared/notebooks/26ED8139-e2-prs/
    Ignored:    .Rproj.user/shared/notebooks/BC66D613-e2-lmm/
    Ignored:    .Rproj.user/shared/notebooks/FCFC3BD0-e2-finemapping/
    Ignored:    data/e2-ori/
    Ignored:    data/e2/
    Ignored:    output/

Untracked files:
    Untracked:  analysis/e2-finemapping.Rmd
    Untracked:  analysis/e2-lmm.Rmd
    Untracked:  analysis/e2-prs.Rmd
    Untracked:  code/Bayesian-linear-regression.R

Unstaged changes:
    Modified:   .Rproj.user/B81CBE6F/persistent-state
    Modified:   .Rproj.user/B81CBE6F/sources/prop/4C8B7780
    Modified:   .Rproj.user/B81CBE6F/sources/prop/BBFFB970
    Modified:   .Rproj.user/B81CBE6F/sources/prop/INDEX
    Modified:   .Rproj.user/B81CBE6F/sources/s-e0e7218a/34A40D3B
    Modified:   .Rproj.user/B81CBE6F/sources/s-e0e7218a/34A40D3B-contents
    Deleted:    .Rproj.user/B81CBE6F/sources/s-e0e7218a/6C1FFABC
    Modified:   .Rproj.user/B81CBE6F/sources/s-e0e7218a/lock_file
    Modified:   .Rproj.user/shared/notebooks/paths
    Modified:   analysis/index.Rmd
    Deleted:    temp.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Before the Class

Install GCTA

The GCTA software (current version: 1.95.1, released February 2026) can be downloaded from https://yanglab.westlake.edu.cn/software/gcta/#Download

This is the original paper for GCTA: Yang et al. (2011) GCTA: a tool for Genome-wide Complex Trait Analysis. Am J Hum Genet. 88(1): 76-82. [PubMed ID: 21167468]. This is a tool widely used to estimate and partition complex trait variation with large GWAS data sets.

Download data

From here https://rcweb.dartmouth.edu/Szhao/QBS148-statsgen/e2/.

Data used in this analysis

Genotype data The genotype data is given plink1 format: oz.fam, oz.bim, oz.bed

Pay attention to the .fam file

head data/e2/oz.fam

1400 1400t1 0 0 1 -9
1400 1400t2 0 0 2 -9
570 570t1 0 0 1 -9
570 570t2 0 0 1 -9
3413 3413t1 0 0 2 -9
3413 3413t2 0 0 1 -9
1911 1911t1 0 0 2 -9
1911 1911t2 0 0 2 -9
1403 1403t1 0 0 2 -9
1403 1403t2 0 0 1 -9

The .fam file has 6 columns: Family ID, Individual ID, Fathers ID (0=missing), Mothers ID (0=missing), Sex (1=M, 2=F), Phenotype (-9=missing).

Phenotype data

head data/e2/ozht.phen

FID IID ht
1000 1000t1 167.99
1000 1000t2 167.99
1001 1001t1 173
1001 1001t2 164.99
1002 1002t1 164.99
1002 1002t2 151.98
1004 1004t1 162.99
1004 1004t2 156.98
1005 1005t1 176.98

This file contains 3 columns: Family ID, Individual ID, Height (in cm)

Covariates file

head data/e2/ozht.covar
head data/e2/ozht.qcovar

FID IID sex
1000 1000t1 2
1000 1000t2 2
1001 1001t1 2
1001 1001t2 2
1002 1002t1 2
1002 1002t2 2
1004 1004t1 2
1004 1004t2 2
1005 1005t1 2
FID IID age PC1 PC2 PC3 PC4
1000 1000t1 23 0.0105 0.0279 -0.0088 -0.0156
1000 1000t2 23 0.0106 0.0267 -0.0101 -0.0161
1001 1001t1 17 0.0101 0.0259 -0.0089 -0.0163
1001 1001t2 17 0.0104 0.0266 -0.0096 -0.0171
1002 1002t1 20 0.0109 0.0269 -0.0104 -0.0188
1002 1002t2 20 0.0105 0.027 -0.0101 -0.0172
1004 1004t1 19 0.0084 0.0258 -0.007 -0.0153
1004 1004t2 19 0.0089 0.0256 -0.0045 -0.0159
1005 1005t1 20 0.0105 0.0256 -0.0097 -0.0209

We have two covariates file, one for categorical covariates ozht.covar This file contains 8 columns: Family ID, Individual ID, Age, Sex (1=M,2=F), and 4 genetic principle components which we will use to account for the effects of ethnicity in our analyses.

Summary statistics file

head data/e2/weights.prs

SNP CHR BP A1 A2 BETA P
SNP6438 1 995985 T C 9.22365e-03 0.000126
SNP2310 1 1299528 C A 1.32753e-02 9.11e-08
SNP3090 1 1483355 A G 9.02390e-03 3.58e-07
SNP10481 1 2065231 T G 1.02686e-02 0.0113
SNP2707 1 2118624 T C 9.23841e-03 1.93e-07
SNP886 1 2142211 C A 1.63796e-02 1.24e-18
SNP2202 1 2248368 T C 1.49619e-02 7.52e-08
SNP10484 1 2248297 C A -4.84493e-03 0.0113
SNP3705 1 2274422 G A -9.42793e-03 8.48e-07

In this file, A1 is the effect allele.

Compute PRS using the C+PT strategy.

The first method is classically denoted as “Clumping + P-value Thresholding (C+PT)”. This method is also abbreviated as P+T or C+T in certain publications. In brief, the principle of this method is compare various sets of uncorrelated SNPs (e.g., maximum squared pairwise correlation between allele counts at SNPs in the selected set is r2≤0.1 ) that are associated with the trait / disease as certain p-value threshold (e.g., p<0.01). We then select the set of SNPs that yields the largest prediction accuracy with the trait / disease of interest in a validation set. This method is broadly used because of it simplicity but may not often yield the largest accuracy. In this practical, we will use PLINK and R to determine these optimal sets of SNPs. PGS will be calculated using marginal/GWAS SNP effects as weights.

Step 1: clump and select SNPs. Run from command line:

window_kb=1000 # 1000 kb = 1 Mb window
r2_thresh=0.1  # LD threshold for clumping
pv_thresh=5e-8

plink --bfile data/e2/oz \
      --clump data/e2/weights.prs \
      --clump-kb ${window_kb} \
      --clump-p1 ${pv_thresh} \
      --clump-p2 ${pv_thresh} \
      --clump-r2 ${r2_thresh} \
      --out output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}

How many SNPs are picked?

Step 2: calculate the PGS with PLINK using the –score command (help: https://www.cog-genomics.org/plink/1.9/score).

plink --bfile data/e2/oz\
      --score data/e2/weights.prs 1 4 6 header sum \
      --extract output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}.clumped \
      --out output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}.pred

Assuming that the 1rt column is the SNP ID; 4th column is the effective allele information; the 6th column is the effect size estimate; and that the file contains a header. We use Sum of the effects. See here for plink documentation.

Take a look at the output:

head output/c-pt_rsq_0.1_p_below_5e-8.pred.profile

    FID      IID  PHENO    CNT   CNT2 SCORESUM
   1400   1400t1     -9   1902    353 -0.38628
   1400   1400t2     -9   1786    340 0.222768
    570    570t1     -9   1770    355 -0.208286
    570    570t2     -9   1770    355 -0.208286
   3413   3413t1     -9   1686    341 -0.0885919
   3413   3413t2     -9   1682    320 -0.199319
   1911   1911t1     -9   1970    399 -0.336112
   1911   1911t2     -9   1970    399 -0.336112
   1403   1403t1     -9   1774    362 -0.576168

We will use linear mixed model to assess r2 of our PRS score as our samples contains related individuals. (If all individuals are unrelated, we can just use linear regression.) First prepare covariates file, the last column is the PRS score

qcovar <- read.table("data/e2/ozht.qcovar", header = T)
prs <- read.table("output/c-pt_rsq_0.1_p_below_5e-8.pred.profile", header = T)
all=merge(qcovar, prs,  by=c("IID", "FID"))
write.table(all[,c("FID", "IID", "age", "PC1", "PC2", "PC3", "PC4", "SCORESUM")], "output/ozht.prs_rsq_0.1_p_below_5e-8.qcovar", quote = F, row.names = F)

Then run GCTA to estimate fixed effect size for PRS score.

From command line run this:

gcta --bfile data/e2/oz --make-grm --out output/ozGCTA
gcta --reml\
     --reml-est-fix\
     --grm output/ozGCTA\
     --pheno data/e2/ozht.phen\
     --qcovar output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.qcovar\
     --covar data/e2/ozht.covar\
     --out output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.reml
     
grep X_7 output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.reml.log > output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.effect.txt

Note X_7 is the variable for PRS score in this example. The order depends on the order in the --covar and --qcovar files.

Let’s calculate r2 and p value for this PRS score back in R

phen <- read.table("data/e2/ozht.phen", header = T)
all = merge(all, phen, by=c("IID", "FID"))
res <- read.table("output/ozht.prs_rsq_0.1_p_below_5e-8.effect.txt", header = F)
betahat <- res[1,2]
sehat <- res[1,3]
r2 <- (betahat/sd(all$ht, na.rm = T)*sd(all$SCORESUM, na.rm = T))**2
pval <- (1-pt(q=betahat/sehat, df = 1894, lower.tail = T))*2
cat("r2: ", r2, "; p value: ", pval)

r2:  0.009820507 ; p value:  0.0005102413

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /software/R-4.1.0-no-openblas-el7-x86_64/lib64/R/lib/libRblas.so
LAPACK: /software/R-4.1.0-no-openblas-el7-x86_64/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9       rstudioapi_0.13  knitr_1.42       magrittr_2.0.1  
 [5] workflowr_1.6.2  R6_2.5.0         rlang_1.1.0      fastmap_1.1.0   
 [9] fansi_0.5.0      stringr_1.4.0    tools_4.1.0      xfun_0.38       
[13] utf8_1.2.1       cli_3.6.1        git2r_0.28.0     jquerylib_0.1.4 
[17] htmltools_0.5.5  ellipsis_0.3.2   rprojroot_2.0.2  yaml_2.2.1      
[21] digest_0.6.27    tibble_3.1.2     lifecycle_1.0.3  crayon_1.5.2    
[25] later_1.2.0      sass_0.4.0       vctrs_0.3.8      promises_1.2.0.1
[29] fs_1.6.1         glue_1.4.2       cachem_1.0.5     evaluate_0.20   
[33] rmarkdown_2.21   stringi_1.6.2    bslib_0.4.2      compiler_4.1.0  
[37] pillar_1.6.1     jsonlite_1.7.2   httpuv_1.6.1     pkgconfig_2.0.3

Polygenic risk score

Siming Zhao

2026-04-27

Before the Class

Data used in this analysis

Compute PRS using the C+PT strategy.