Perform fast queries against a massive database of complete GWAS summary data (2024)

Perform fast queries against a massive database of complete GWAS summary data (1)

GibranHemani

Source: vignettes/guide.Rmd

guide.Rmd

The OpenGWAS databasecomprises over 50,000 curated, QC’d and harmonised complete GWAS summarydatasets and can be queried using an API. See here for documentation on theAPI itself. This R package is a wrapper to make generic calls to theAPI, plus convenience functions for specific queries.

Authentication

From 1st May 2024, most queries to the OpenGWAS API will require userauthentication. For more information on why this is necessary, see thisblogpost.

To authenticate, you need to generate a token from the OpenGWASwebsite. The token behaves like a password, and it will be used toauthorise the requests you make to the OpenGWAS API. Here are the stepsto generate the token and then have ieugwasr automaticallyuse it for your queries:

  1. Login to https://api.opengwas.io/profile/
  2. Generate a new token
  3. Add OPENGWAS_JWT=<token> to your.Renviron file. This file could be either in your homedirectory or in the working directory of your R session. You can checkthe location of your .Renviron file by runningSys.getenv("R_ENVIRON_USER") in R.
  4. Restart your R session
  5. To check that your token is being recognised, runieugwasr::get_opengwas_jwt(). If it returns a long randomstring then you are authenticated.
  6. To check that your token is working, run user(). Itwill make a request to the API for your user information using yourtoken. It should return a list with your user information. If it returnsan error, then your token is not working.

Now any query to OpenGWAS will automatically include your token toauthorise the request.

IMPORTANT NOTE: Do not share this token with othersas it is equivalent to a password. If you think your token has beencompromised, you can generate a new one from the OpenGWAS website.

Deprecated Google authentication

Note that previously we used Google OAuth2 for authentication, inorder for users to access private datasets to which they had specificaccess. This authentication method is no longer supported, and all usersshould now use the JWT token method described above. You can delete theieugwasr_oauth2 directory as it will no longer beneeded.

Allowance

Due to very high usage, we have had to limit the number of queriesthat can be made in a given time period. Every user has an allowancethat is reset periodically, and is used based on the queries that yousubmit. If this is posing an issue do get in touch and we can discusshow to manage this. See here for full details on the allowance system:https://api.opengwas.io/api/#allowance.

General API queries

The API has a number of endpoints documented here. A general way to accessthem in R is using the api_query function. There are twotypes of endpoints - GET and POST.

  • GET - you provide a single URL which includes theendpoint and query. For example, for the associationendpoint you can obtain some rsids in some studies, e.g.
    • api_query("associations/ieu-a-2,ieu-a-7/rs234,rs123")
  • POST - Here you send a “payload” to the endpoint. So,the path specifies the endpoint and you add a list of queryspecifications. This is useful for long lists of rsids being queried,for example
    • api_query("associations", query=list(rsid=c("rs234", "rs123"), id=c("ieu-a-2", "ieu-a-7")))

The api_query function returns a responseobject from the httr package. See below for a list offunctions that make the input and output to api_query moreconvenient.

Get API status

library(ieugwasr)api_status()

Get list of all available studies

gwasinfo()

Get list of a specific study

gwasinfo("ieu-a-2")

Extract particular associations from particular studies

Provide a list of variants to be obtained from a list of studies:

associations(variants=c("rs123", "7:105561135"), id=c("ieu-a-2", "ieu-a-7"))

By default this will look for LD proxies using 1000 genomes referencedata (Europeans only, the reference panel has INDELs removed and onlyretains SNPs with MAF > 0.01). This behaviour can be turned off usingproxies=0 as an argument.

Note that the queries are performed on rsids, but chromosome:positionvalues will be automatically converted. A range query can be done usinge.g.

associations(variants="7:105561135-105563135", id=c("ieu-a-2"), proxies=0)

Get the tophits from a study

The tophits can be obtained using

tophits(id="ieu-a-2")

Note that it will perform strict clumping by default (r2 = 0.001 andradius = 10000kb). This can be turned off with clump=0.

Perform PheWAS

Lookup association of specified variants across every study,returning at a particular threshold. Note that no LD proxy lookups aremade here.

phewas(variants="rs1205", pval=1e-5)

PheWAS can also be performed in only specific subsets of the data.The datasets in the IGD are organised by batch, you can see info aboutit here: https://gwas.mrcieu.ac.uk/datasets/ or get a list ofbatches and their descriptions using:

batches()

You can perform PheWAS in only specified batches using:

phewas(variants="rs1205", pval=1e-5, batch=c('ieu-a', 'ukb-b'))

By default PheWAS is performed in all batches (which is of coursesomewhat slower).

LD clumping

The API has a wrapper around plink version 1.90 andcan use it to perform clumping with an LD reference panel from 1000genomes reference data.

a <- tophits(id="ieu-a-2", clump=0)b <- ld_clump( dplyr::tibble(rsid=a$name, pval=a$p, id=a$id))

There are 5 super-populations that can be requested via thepop argument. By default this will use the Europeans subset(EUR super-population). The reference panel has INDELs removed and onlyretains SNPs with MAF > 0.01 in the selected population.

Note that you can perform the same operation locally if you provide apath to plink and a bed/bim/fam LD reference dataset. e.g.

ld_clump( dplyr::tibble(rsid=a$name, pval=a$p, id=a$id), plink_bin = "/path/to/plink", bfile = "/path/to/reference_data")

See the following vignette for more information: Running local LD operations

LD matrix

Similarly, a matrix of LD r values can be generated using

ld_matrix(b$variant)

This uses the API by default but is limited to only 500 variants. Youcan use, instead, local plink and LD reference data in the same manneras in the ld_clump function, e.g.

ld_matrix(b$variant, plink_bin = "/path/to/plink", bfile = "/path/to/reference_data")

There are 5 super-populations that can be requested via thepop argument. By default this will use the Europeans subset(EUR super-population). The reference panel has INDELs removed and onlyretains SNPs with MAF > 0.01 in the selected population.

Super-populations:

  • EUR = European
  • EAS = East Asian
  • AMR = Admixed American
  • SAS = South Asian
  • AFR = African

See the following vignette for more information: Running local LD operations

Variant information

Translating between rsids and chromosome:position, while also gettingother information, can be achieved.

The chrpos argument can accept the following

  • <chr>:<position>
  • <chr>:<start>-<end>

For example

a <- variants_chrpos(c("7:105561135-105563135", "10:44865737"))

This provides a table with dbSNP variant IDs, gene info, and variousother metadata. Similar data can be obtained from searching by rsid

b <- variants_rsid(c("rs234", "rs333"))

And a list of variants within a particular gene region can also befound. Provide a ensembl or entrez gene ID (e.g.ENSG00000123374 or1017) to the following:

c <- variants_gene("ENSG00000123374")

Extracting GWAS summary data based on gene region

Here is an example of how to obtain summary data for some datasetsfor a gene region. As an example, we’ll extract CDK2 (HGNC number 1017)from a BMI dataset (ieu-a-2)

Use the mygenebioconductor package to query the mygene.info API.

library(mygene)a <- mygene::getGene("1017", fields="genomic_pos_hg19")r <- paste0(a[[1]]$genomic_pos_hg19$chr, ":", a[[1]]$genomic_pos_hg19$start, "-", a[[1]]$genomic_pos_hg19$end)b <- ieugwasr::associations(r, "ieu-a-2")

1000 genomes annotations

The OpenGWAS database contains a database of population annotationsfrom the 1000 genomes project - the alternative allele frequencies andthe LD scores for each variant, calculated for each super populationseparately. Only variants are present if they are MAF > 1% in atleast one super population. You can access this info in differentways

  1. Look up a particular set of rsids

    ieugwasr::afl2_rsid(c("rs234", "rs123"))
  2. Look up a set of positions or regions

    ieugwasr::afl2_chrpos("1:100000-900000")
  3. Extract annotations for a list of 20k variants that are common inall super populations, and evenly spaced across the genome

    ieugwasr::afl2_list()
  4. Extract annotations for a 1.3 million HapMap3 variants

    ieugwasr::afl2_list("hapmap3")
  5. Infer the ancestry of a particular study by comparing the allelefrequencies with different super population reference frequencies

    snplist <- ieugwasr::afl2_list()eur_example <- associations(snplist$rsid, "ieu-a-2")ieugwasr::infer_ancestry(eur_example, snplist)eas_example <- associations(snplist$rsid, "bbj-a-10")ieugwasr::infer_ancestry(eur_example, snplist)
Perform fast queries against a massive database of complete GWAS summary data (2024)

References

Top Articles
Latest Posts
Article information

Author: Ray Christiansen

Last Updated:

Views: 6440

Rating: 4.9 / 5 (69 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Ray Christiansen

Birthday: 1998-05-04

Address: Apt. 814 34339 Sauer Islands, Hirtheville, GA 02446-8771

Phone: +337636892828

Job: Lead Hospitality Designer

Hobby: Urban exploration, Tai chi, Lockpicking, Fashion, Gunsmithing, Pottery, Geocaching

Introduction: My name is Ray Christiansen, I am a fair, good, cute, gentle, vast, glamorous, excited person who loves writing and wants to share my knowledge and understanding with you.