Last updated: 2024-01-23
Checks: 6 1
Knit directory: QBS-statsgen/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20231230) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 85e59dd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rproj.user/B81CBE6F/bibliography-index/
Ignored: .Rproj.user/B81CBE6F/ctx/
Ignored: .Rproj.user/B81CBE6F/pcs/
Ignored: .Rproj.user/B81CBE6F/presentation/
Ignored: .Rproj.user/B81CBE6F/profiles-cache/
Ignored: .Rproj.user/B81CBE6F/sources/per/
Ignored: .Rproj.user/B81CBE6F/tutorial/
Ignored: .Rproj.user/shared/notebooks/1C2AC29C-e1-gwas-power/
Ignored: .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/ce0r78nx8keuu/
Ignored: .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/csetup_chunk/
Ignored: .Rproj.user/shared/notebooks/1EB0B2DC-e1-gwas/1/s/czxn6jf8lsykc/
Ignored: .Rproj.user/shared/notebooks/26ED8139-e2-prs/
Ignored: .Rproj.user/shared/notebooks/BC66D613-e2-lmm/
Ignored: .Rproj.user/shared/notebooks/FCFC3BD0-e2-finemapping/
Ignored: data/e2-ori/
Ignored: data/e2/
Ignored: output/
Untracked files:
Untracked: analysis/e2-finemapping.Rmd
Untracked: analysis/e2-lmm.Rmd
Untracked: analysis/e2-prs.Rmd
Untracked: code/Bayesian-linear-regression.R
Unstaged changes:
Modified: .Rproj.user/B81CBE6F/persistent-state
Modified: .Rproj.user/B81CBE6F/sources/prop/4C8B7780
Modified: .Rproj.user/B81CBE6F/sources/prop/BBFFB970
Modified: .Rproj.user/B81CBE6F/sources/prop/INDEX
Modified: .Rproj.user/B81CBE6F/sources/s-e0e7218a/34A40D3B
Modified: .Rproj.user/B81CBE6F/sources/s-e0e7218a/34A40D3B-contents
Deleted: .Rproj.user/B81CBE6F/sources/s-e0e7218a/6C1FFABC
Modified: .Rproj.user/B81CBE6F/sources/s-e0e7218a/lock_file
Modified: .Rproj.user/shared/notebooks/paths
Modified: analysis/index.Rmd
Deleted: temp.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish() to start tracking its development.
The GCTA software (current version: 1.95.1, released February 2026) can be downloaded from https://yanglab.westlake.edu.cn/software/gcta/#Download
This is the original paper for GCTA: Yang et al. (2011) GCTA: a tool for Genome-wide Complex Trait Analysis. Am J Hum Genet. 88(1): 76-82. [PubMed ID: 21167468]. This is a tool widely used to estimate and partition complex trait variation with large GWAS data sets.
From here https://rcweb.dartmouth.edu/Szhao/QBS148-statsgen/e2/.
oz.fam, oz.bim, oz.bedPay attention to the .fam file
head data/e2/oz.fam
1400 1400t1 0 0 1 -9
1400 1400t2 0 0 2 -9
570 570t1 0 0 1 -9
570 570t2 0 0 1 -9
3413 3413t1 0 0 2 -9
3413 3413t2 0 0 1 -9
1911 1911t1 0 0 2 -9
1911 1911t2 0 0 2 -9
1403 1403t1 0 0 2 -9
1403 1403t2 0 0 1 -9
The .fam file has 6 columns: Family ID, Individual ID, Fathers ID (0=missing), Mothers ID (0=missing), Sex (1=M, 2=F), Phenotype (-9=missing).
head data/e2/ozht.phen
FID IID ht
1000 1000t1 167.99
1000 1000t2 167.99
1001 1001t1 173
1001 1001t2 164.99
1002 1002t1 164.99
1002 1002t2 151.98
1004 1004t1 162.99
1004 1004t2 156.98
1005 1005t1 176.98
This file contains 3 columns: Family ID, Individual ID, Height (in cm)
head data/e2/ozht.covar
head data/e2/ozht.qcovar
FID IID sex
1000 1000t1 2
1000 1000t2 2
1001 1001t1 2
1001 1001t2 2
1002 1002t1 2
1002 1002t2 2
1004 1004t1 2
1004 1004t2 2
1005 1005t1 2
FID IID age PC1 PC2 PC3 PC4
1000 1000t1 23 0.0105 0.0279 -0.0088 -0.0156
1000 1000t2 23 0.0106 0.0267 -0.0101 -0.0161
1001 1001t1 17 0.0101 0.0259 -0.0089 -0.0163
1001 1001t2 17 0.0104 0.0266 -0.0096 -0.0171
1002 1002t1 20 0.0109 0.0269 -0.0104 -0.0188
1002 1002t2 20 0.0105 0.027 -0.0101 -0.0172
1004 1004t1 19 0.0084 0.0258 -0.007 -0.0153
1004 1004t2 19 0.0089 0.0256 -0.0045 -0.0159
1005 1005t1 20 0.0105 0.0256 -0.0097 -0.0209
We have two covariates file, one for categorical covariates
ozht.covar This file contains 8 columns: Family ID,
Individual ID, Age, Sex (1=M,2=F), and 4 genetic principle components
which we will use to account for the effects of ethnicity in our
analyses.
head data/e2/weights.prs
SNP CHR BP A1 A2 BETA P
SNP6438 1 995985 T C 9.22365e-03 0.000126
SNP2310 1 1299528 C A 1.32753e-02 9.11e-08
SNP3090 1 1483355 A G 9.02390e-03 3.58e-07
SNP10481 1 2065231 T G 1.02686e-02 0.0113
SNP2707 1 2118624 T C 9.23841e-03 1.93e-07
SNP886 1 2142211 C A 1.63796e-02 1.24e-18
SNP2202 1 2248368 T C 1.49619e-02 7.52e-08
SNP10484 1 2248297 C A -4.84493e-03 0.0113
SNP3705 1 2274422 G A -9.42793e-03 8.48e-07
In this file, A1 is the effect allele.
The first method is classically denoted as “Clumping + P-value Thresholding (C+PT)”. This method is also abbreviated as P+T or C+T in certain publications. In brief, the principle of this method is compare various sets of uncorrelated SNPs (e.g., maximum squared pairwise correlation between allele counts at SNPs in the selected set is r2≤0.1 ) that are associated with the trait / disease as certain p-value threshold (e.g., p<0.01). We then select the set of SNPs that yields the largest prediction accuracy with the trait / disease of interest in a validation set. This method is broadly used because of it simplicity but may not often yield the largest accuracy. In this practical, we will use PLINK and R to determine these optimal sets of SNPs. PGS will be calculated using marginal/GWAS SNP effects as weights.
Step 1: clump and select SNPs. Run from command line:
window_kb=1000 # 1000 kb = 1 Mb window
r2_thresh=0.1 # LD threshold for clumping
pv_thresh=5e-8
plink --bfile data/e2/oz \
--clump data/e2/weights.prs \
--clump-kb ${window_kb} \
--clump-p1 ${pv_thresh} \
--clump-p2 ${pv_thresh} \
--clump-r2 ${r2_thresh} \
--out output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}
How many SNPs are picked?
Step 2: calculate the PGS with PLINK using the –score command (help: https://www.cog-genomics.org/plink/1.9/score).
plink --bfile data/e2/oz\
--score data/e2/weights.prs 1 4 6 header sum \
--extract output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}.clumped \
--out output/c-pt_rsq_${r2_thresh}_p_below_${pv_thresh}.pred
Assuming that the 1rt column is the SNP ID; 4th column is the effective allele information; the 6th column is the effect size estimate; and that the file contains a header. We use Sum of the effects. See here for plink documentation.
Take a look at the output:
head output/c-pt_rsq_0.1_p_below_5e-8.pred.profile
FID IID PHENO CNT CNT2 SCORESUM
1400 1400t1 -9 1902 353 -0.38628
1400 1400t2 -9 1786 340 0.222768
570 570t1 -9 1770 355 -0.208286
570 570t2 -9 1770 355 -0.208286
3413 3413t1 -9 1686 341 -0.0885919
3413 3413t2 -9 1682 320 -0.199319
1911 1911t1 -9 1970 399 -0.336112
1911 1911t2 -9 1970 399 -0.336112
1403 1403t1 -9 1774 362 -0.576168
We will use linear mixed model to assess r2 of our PRS score as our samples contains related individuals. (If all individuals are unrelated, we can just use linear regression.) First prepare covariates file, the last column is the PRS score
qcovar <- read.table("data/e2/ozht.qcovar", header = T)
prs <- read.table("output/c-pt_rsq_0.1_p_below_5e-8.pred.profile", header = T)
all=merge(qcovar, prs, by=c("IID", "FID"))
write.table(all[,c("FID", "IID", "age", "PC1", "PC2", "PC3", "PC4", "SCORESUM")], "output/ozht.prs_rsq_0.1_p_below_5e-8.qcovar", quote = F, row.names = F)
Then run GCTA to estimate fixed effect size for PRS score.
From command line run this:
gcta --bfile data/e2/oz --make-grm --out output/ozGCTA
gcta --reml\
--reml-est-fix\
--grm output/ozGCTA\
--pheno data/e2/ozht.phen\
--qcovar output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.qcovar\
--covar data/e2/ozht.covar\
--out output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.reml
grep X_7 output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.reml.log > output/ozht.prs_rsq_${r2_thresh}_p_below_${pv_thresh}.effect.txt
Note X_7 is the variable for PRS score in this example. The order
depends on the order in the --covar and
--qcovar files.
Let’s calculate r2 and p value for this PRS score back in R
phen <- read.table("data/e2/ozht.phen", header = T)
all = merge(all, phen, by=c("IID", "FID"))
res <- read.table("output/ozht.prs_rsq_0.1_p_below_5e-8.effect.txt", header = F)
betahat <- res[1,2]
sehat <- res[1,3]
r2 <- (betahat/sd(all$ht, na.rm = T)*sd(all$SCORESUM, na.rm = T))**2
pval <- (1-pt(q=betahat/sehat, df = 1894, lower.tail = T))*2
cat("r2: ", r2, "; p value: ", pval)
r2: 0.009820507 ; p value: 0.0005102413
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /software/R-4.1.0-no-openblas-el7-x86_64/lib64/R/lib/libRblas.so
LAPACK: /software/R-4.1.0-no-openblas-el7-x86_64/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C
[4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=C
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 rstudioapi_0.13 knitr_1.42 magrittr_2.0.1
[5] workflowr_1.6.2 R6_2.5.0 rlang_1.1.0 fastmap_1.1.0
[9] fansi_0.5.0 stringr_1.4.0 tools_4.1.0 xfun_0.38
[13] utf8_1.2.1 cli_3.6.1 git2r_0.28.0 jquerylib_0.1.4
[17] htmltools_0.5.5 ellipsis_0.3.2 rprojroot_2.0.2 yaml_2.2.1
[21] digest_0.6.27 tibble_3.1.2 lifecycle_1.0.3 crayon_1.5.2
[25] later_1.2.0 sass_0.4.0 vctrs_0.3.8 promises_1.2.0.1
[29] fs_1.6.1 glue_1.4.2 cachem_1.0.5 evaluate_0.20
[33] rmarkdown_2.21 stringi_1.6.2 bslib_0.4.2 compiler_4.1.0
[37] pillar_1.6.1 jsonlite_1.7.2 httpuv_1.6.1 pkgconfig_2.0.3