From GWAS Data to Causal Inference
By Tayyaba Alvi
Source: mrcieu.github.io.
Harmonization is a critical step: it ensures that the alleles and their effects are aligned between the exposure and outcome datasets. The harmonise_data() function handles two main issues: strand mismatches and problematic palindromic SNPs.
Alleles match perfectly between the exposure and outcome datasets.
Exposure effect: 0.5
Outcome effect: 0.05
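Alleles are on the complementary strand and can be corrected.
Exposure effect: 0.5
Outcome effect (original): -0.05
Outcome effect (corrected): +0.05, after flipping the alleles to match the exposure strand. The sign flips because, once both datasets are on the same strand, the outcome's effect allele corresponds to the exposure's other allele; a strand correction on its own does not change the effect.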
The effect alleles match, but the other alleles do not. This cannot be resolved.
Exposure effect: 0.5
Outcome effect: 0.05
For a palindromic SNP, the allele frequencies are unambiguous (not near 0.5) and suggest a flip.
Exposure EAF: 0.11
Outcome EAF: 0.91
Inference: Because 0.11 ≈ 1 - 0.91 = 0.09, the two datasets most likely report different effect alleles. The data are harmonized by flipping the sign of the outcome effect.
For a palindromic SNP, the allele frequencies are ambiguous (near 0.5), so the correct strand is unknown.
Exposure EAF: 0.50
Outcome EAF: 0.50
Problem: It is impossible to determine whether the effect alleles are aligned, so the direction of effect is ambiguous.
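The five situations above can be sketched in a few lines of Python. This is a simplified illustration of the logic, not the harmonise_data() implementation. The function name harmonise, the tolerance value, the allele letters, and the outcome effect used for the palindromic example are all assumptions made for the sketch, because the alleles shown in the original figures are not available here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(allele):
    """Return the allele as read on the complementary strand."""
    return COMPLEMENT[allele]


def is_palindromic(a1, a2):
    """A/T and C/G SNPs look identical on both strands."""
    return flip_strand(a1) == a2


def harmonise(exp_ea, exp_oa, out_ea, out_oa, out_beta,
              exp_eaf=None, out_eaf=None, tolerance=0.08):
    """Align the outcome effect to the exposure effect allele.

    Returns (beta, status); beta is None when the SNP cannot be resolved.
    `tolerance` is an illustrative cut-off chosen for this sketch.
    """
    if is_palindromic(exp_ea, exp_oa):
        if {out_ea, out_oa} != {exp_ea, exp_oa}:
            return None, "incompatible alleles"
        if exp_eaf is None or out_eaf is None:
            return None, "palindromic, no frequencies"
        # Align by allele label first, as for any other SNP.
        if out_ea != exp_ea:
            out_beta, out_eaf = -out_beta, 1 - out_eaf
        # Frequencies near 0.5 cannot reveal the strand.
        if min(abs(exp_eaf - 0.5), abs(out_eaf - 0.5)) < tolerance:
            return None, "palindromic, ambiguous"
        # Same label but frequencies on opposite sides of 0.5: strands differ.
        if (exp_eaf < 0.5) != (out_eaf < 0.5):
            return -out_beta, "palindromic, sign flipped"
        return out_beta, "palindromic, aligned"

    # Non-palindromic: try the reported strand, then its complement.
    for ea, oa in ((out_ea, out_oa),
                   (flip_strand(out_ea), flip_strand(out_oa))):
        if (ea, oa) == (exp_ea, exp_oa):
            return out_beta, "aligned"
        if (ea, oa) == (exp_oa, exp_ea):
            return -out_beta, "sign flipped"
    return None, "incompatible alleles"


# Illustrative inputs (allele letters are hypothetical).
examples = [
    ("A", "G", "A", "G", 0.05, None, None),   # alleles match
    ("A", "G", "C", "T", -0.05, None, None),  # complementary strand
    ("A", "G", "A", "C", 0.05, None, None),   # other alleles differ
    ("A", "T", "A", "T", -0.05, 0.11, 0.91),  # palindromic, inferable
    ("A", "T", "A", "T", 0.05, 0.50, 0.50),   # palindromic, ambiguous
]
for args in examples:
    print(args, "->", harmonise(*args))
```

Returning None for unresolved SNPs keeps the decision to drop them explicit in the calling code.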
Genetic instruments (SNPs) must be independent. We use a process called linkage disequilibrium (LD) clumping, which selects the most significant SNP in a region and removes others in high LD with it, so that each instrument provides independent information. The ld_clump() function handles this.
Clumping ensures that the selected instruments (green circles) are not correlated due to LD.
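The idea behind clumping can be illustrated with a greedy pass over the SNPs, ordered by p-value. The sketch below is not how ld_clump() is implemented. The function greedy_clump, the SNP identifiers, the r² values, and the threshold and window settings are hypothetical and chosen only to show the selection logic.

```python
def greedy_clump(snps, r2, r2_threshold=0.001, window_kb=10000):
    """Keep the most significant SNP per region and drop its LD partners.

    snps: list of dicts with keys 'id', 'chr', 'pos', 'pval'.
    r2:   dict mapping frozenset({id1, id2}) to pairwise r-squared;
          missing pairs are treated as uncorrelated (r2 = 0).
    """
    kept, removed = [], set()
    for lead in sorted(snps, key=lambda s: s["pval"]):
        if lead["id"] in removed:
            continue
        kept.append(lead["id"])
        for other in snps:
            if other["id"] in removed or other["id"] in kept:
                continue
            nearby = (other["chr"] == lead["chr"]
                      and abs(other["pos"] - lead["pos"]) <= window_kb * 1000)
            pair = frozenset((lead["id"], other["id"]))
            if nearby and r2.get(pair, 0.0) > r2_threshold:
                removed.add(other["id"])
    return kept


# Hypothetical SNPs and LD values for illustration only.
snps = [
    {"id": "snp_a", "chr": 1, "pos": 1000, "pval": 1e-10},
    {"id": "snp_b", "chr": 1, "pos": 2000, "pval": 1e-8},
    {"id": "snp_c", "chr": 1, "pos": 5000, "pval": 1e-9},
    {"id": "snp_d", "chr": 2, "pos": 1000, "pval": 1e-6},
]
r2 = {
    frozenset(("snp_a", "snp_b")): 0.8,
    frozenset(("snp_b", "snp_c")): 0.6,
}
print(greedy_clump(snps, r2))
```

Here snp_b is dropped because it is in high LD with the more significant snp_a, while snp_c and snp_d are retained as independent instruments.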
Scatter plot
What it shows: The relationship between the SNP effects on the exposure and the SNP effects on the outcome. The slope of the line is the causal estimate.
Forest plot
What it shows: The causal effect estimated by each individual SNP. The combined estimate (e.g., IVW) is shown at the bottom.
Leave-one-out plot
What it shows: Checks whether a single SNP is driving the overall result. If the estimate stays consistent when each SNP is dropped in turn, the finding does not depend on any one variant.
Funnel plot
What it shows: Used to visually inspect for heterogeneity and potential directional pleiotropy, which occurs when genetic variants affect the outcome through pathways other than the exposure and can bias MR results. A symmetrical plot is expected when directional pleiotropy is absent.
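The quantities behind these plots can be computed by hand. The sketch below assumes a fixed-effect inverse-variance weighted (IVW) estimate with first-order weights, which may differ from the estimator the package uses by default. The function names and the summary statistics are hypothetical.

```python
import math


def wald_ratios(bx, by, se_y):
    """Per-SNP causal estimates and first-order standard errors."""
    return [(y / x, s / abs(x)) for x, y, s in zip(bx, by, se_y)]


def ivw(bx, by, se_y):
    """Fixed-effect IVW: weighted regression of by on bx through the origin."""
    num = sum(x * y / s ** 2 for x, y, s in zip(bx, by, se_y))
    den = sum(x ** 2 / s ** 2 for x, s in zip(bx, se_y))
    return num / den, math.sqrt(1 / den)


def leave_one_out(bx, by, se_y):
    """IVW estimate recomputed with each SNP removed in turn."""
    estimates = []
    for i in range(len(bx)):
        keep = [j for j in range(len(bx)) if j != i]
        est, _ = ivw([bx[j] for j in keep],
                     [by[j] for j in keep],
                     [se_y[j] for j in keep])
        estimates.append(est)
    return estimates


# Hypothetical summary statistics for three SNPs.
bx = [0.10, 0.20, 0.40]    # SNP effects on the exposure
by = [0.05, 0.12, 0.18]    # SNP effects on the outcome
se_y = [0.01, 0.01, 0.02]  # standard errors of the outcome effects

print("Per-SNP ratios (forest plot rows):", wald_ratios(bx, by, se_y))
print("IVW estimate and SE (scatter plot slope):", ivw(bx, by, se_y))
print("Leave-one-out estimates:", leave_one_out(bx, by, se_y))
```

A funnel plot uses the same per-SNP ratios, plotted against their precision (the inverse of the standard error).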
For years, observational studies have shown a strong correlation: higher levels of C-Reactive Protein (CRP), a marker of inflammation, are associated with a higher risk of Coronary Heart Disease (CHD).
↑ CRP → ↑ CHD ?
This raises critical questions that MR is uniquely positioned to answer:
IL-6 is a key upstream driver of CRP production through classic and trans-signaling pathways.
Now, let's test this hypothesis. The TwoSampleMR package is your primary tool. Here are the essential libraries you'll need: