RegulomeDB is a database that provides functional context for variants or regions of interest and serves as a tool to prioritize functionally important single nucleotide variants (SNVs) located within the non-coding regions of the human genome. RegulomeDB queries a given variant by intersecting its position with genomic intervals identified as functionally active regions in the computational analysis outputs of functional genomic assays such as TF ChIP-seq and DNase-seq (from the ENCODE database), as well as with footprint and QTL data.
All the source data used in RegulomeDB v2.1 can be found on the ENCODE website using these two links under the Data button at the top of the page: Experiments and Annotations. RegulomeDB also provides further information about those hits by incorporating them into prediction scores, thereby providing a way to interpret the probability that these variants are of real functional significance.
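The intersection of a variant position with genomic intervals, as described above, can be illustrated with a minimal Python sketch. It is not RegulomeDB's implementation: the Interval type, the overlapping function, the coordinates, and the labels are all hypothetical, and the intervals are assumed to be 0-based and half-open as in BED files.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, assumed here)
    end: int    # exclusive
    label: str  # hypothetical annotation, e.g. the assay that produced the peak


def overlapping(chrom: str, pos: int, intervals: list[Interval]) -> list[Interval]:
    """Return the intervals that contain the 0-based position `pos`."""
    return [iv for iv in intervals if iv.chrom == chrom and iv.start <= pos < iv.end]


# Illustrative data only; these are not real peaks.
peaks = [
    Interval("chr1", 100, 200, "TF ChIP-seq peak"),
    Interval("chr1", 150, 300, "DNase-seq peak"),
    Interval("chr2", 100, 200, "TF ChIP-seq peak"),
]

hits = overlapping("chr1", 180, peaks)
print([iv.label for iv in hits])  # ['TF ChIP-seq peak', 'DNase-seq peak']
```

A linear scan is enough for illustration; a production system working with many intervals would more likely use an indexed structure such as an interval tree.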
Querying variants with RegulomeDB
Users can submit queries to the RegulomeDB database in the following formats (Note: one can toggle between the hg19 and GRCh38 coordinates using the toggle button above the search box):
Single nucleotide positions: expressed in BED format, i.e. chrom:chromStart-chromEnd.
Chromosomal regions: expressed in BED format, i.e. chrom:chromStart-chromEnd. In this case, all common dbSNP variants in this region with a minor allele frequency >1% are queried and returned.
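As a rough illustration of the chrom:chromStart-chromEnd form, the following Python sketch parses a query string and distinguishes a single nucleotide position from a region. It assumes the BED convention (0-based start, exclusive end), under which a single position spans exactly one base, and it assumes chromosome names carry a "chr" prefix. The names Query and parse_query and the coordinates are hypothetical and are not part of RegulomeDB.

```python
import re
from typing import NamedTuple


class Query(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention, assumed)
    end: int    # exclusive

    @property
    def is_single_position(self) -> bool:
        # Under the BED convention a single nucleotide spans exactly one base.
        return self.end - self.start == 1


# Assumed shape of a query: "chr" prefix, colon, start, hyphen, end.
_PATTERN = re.compile(r"(chr[0-9A-Za-z_]+):(\d+)-(\d+)")


def parse_query(text: str) -> Query:
    """Parse 'chrom:chromStart-chromEnd' into a Query, or raise ValueError."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not in chrom:chromStart-chromEnd form: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if end <= start:
        raise ValueError(f"chromEnd must be greater than chromStart: {text!r}")
    return Query(chrom, start, end)


# Illustrative coordinates only.
for text in ["chr1:1000-1001", "chr1:1000-2000"]:
    query = parse_query(text)
    kind = "single position" if query.is_single_position else "region"
    print(text, "->", kind)
```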
After supplying a list of search queries in the search box, and upon clicking the search button below, users are redirected to a summary table representing prediction scores for all the given query variants (see the explanation of scores in FAQ).
Users can download the search result table using the two download buttons at the top of the page, either in BED format or as a tab-separated file. Users may also explore each result individually by clicking on one of the rows in the table. Clicking on any variant of interest opens its results page, which is subdivided into six sections as seen in the next screenshots. The data sources for each of the sections are explained in FAQ.
The top-most section of any variant's page provides a summary of the results and includes key information such as the rsid of the variant, the number of peaks found intersecting that variant, its prediction rank and score values, and the allele frequency of that rsid in different populations as reported in data sources such as gnomAD, 1000 Genomes, TOPMed, and others.
Each of the six sections can be expanded individually by clicking on it.
TF binding sites (ChIP-seq): This page provides further information about the TF (transcription factor) ChIP peaks that intersected the variant of interest.
The first section on this page provides a bar plot showing the number of peaks that intersected the variant of interest, grouped by their targets. The number on each bar represents the number of biosamples in which the same TF target had a peak containing the variant position.
Below the bar chart, users can explore the underlying data in a tabular view that provides further metadata on all the assays that produced the peak files: the intersecting peak location (chromosome, start and end), the biosample (along with the organ) used in each TF ChIP-seq assay, the ENCODE file ids (ENCFF ids) that were the source of the peak information, and their corresponding dataset ids (ENCSR ids). The ENCODE file accessions and dataset accessions in this table are hyperlinked to the corresponding file objects and dataset objects on the ENCODE website for further metadata exploration.
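The relationship between the table rows and the bar heights can be sketched in Python as follows. This is only an illustration: the row layout and the function name biosamples_per_target are hypothetical, the rows are made up, and the sketch assumes that repeated experiments on the same biosample are counted once per target, which the text above does not state.

```python
from collections import defaultdict

# Hypothetical table rows: one row per peak that intersects the variant.
rows = [
    {"target": "CTCF", "biosample": "K562"},
    {"target": "CTCF", "biosample": "HepG2"},
    {"target": "CTCF", "biosample": "K562"},  # second experiment, same biosample
    {"target": "POLR2A", "biosample": "GM12878"},
]


def biosamples_per_target(rows):
    """Count the distinct biosamples with an intersecting peak for each target."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["target"]].add(row["biosample"])
    return {target: len(biosamples) for target, biosamples in seen.items()}


counts = biosamples_per_target(rows)

# Print a crude text bar chart, largest bar first.
for target, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{target:<8} {'#' * n} {n}")
```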
Chromatin accessibility: This section provides the users with a bar plot graphical representation showing the number of times the variant was found to be within peaks called from chromatin accessibility assays using each biosample.
Each of the bars on the bar plot can be further expanded to view the underlying data table by clicking the title to the left of each bar.
As on the ChIP data page, users can click on the hyperlinked ENCFF (file) or ENCSR (dataset) ids, which lead to the corresponding ENCODE pages showing further metadata for the file or dataset.
Note: where more than one DNase peak is listed for a biosample, the entries are not necessarily redundant. The DNase-seq samples can be derived from different donors and different treatment conditions. The exact underlying metadata can be explored by following the dataset links to the ENCODE portal.
TF motifs & footprints: This page provides information about the position weight matrices (PWMs) that represent TF motifs and match the sequence overlapping the variant of interest, as well as about footprints that intersect the variant of interest.
We provide a list of the biosamples whose DNase-seq peak files were used in the TRACE pipeline for predicting the footprints. (See how TF motifs and footprints are computed in FAQ.)
The biosamples list is hyperlinked to the corresponding ENCODE annotation filesets that contain the TRACE output files in BED format. The ENCODE page also provides information about the exact chromatin accessibility file used for the TRACE pipeline.
Similarly the PWM file (when available) is also listed as a hyperlinked ENCFF id above the biosamples list box and can be further explored on the ENCODE website.
The exact genome reference region that overlaps with all the output motifs is represented on the top section along with a “boxed” letter that represents the variant of interest.
Each query could match a footprint (sometimes with no significant PWM score), or match the PWM itself outside of a footprint.
eQTLs & caQTLs (chromatin accessibility QTLs): The tables in this section show information from eQTL and caQTL studies in which the query variant was identified as associated with gene expression levels or chromatin accessibility, respectively.
The caQTL data comes from curated publications, viewable on the ENCODE portal.
The corresponding ENCODE file ids and their corresponding dataset ids are also listed on the table and hyperlinked for further exploration.
The biosample information and population ethnicity information (when available) are also listed on the caQTL table and correspond to the original biosample information used for that study in the publication.
Example: rs75982468 has both biosample and population information that comes from the publication listed here: PMID:30650056.
Chromatin states:
This section shows chromatin states predicted by ChromHMM.
The variant positions are intersected with those chromatin states and displayed on an interactive human body map as well as in a tabular representation.
The body map is colored by the most active state among all biosamples in each organ. Thus, it shows the users a pictorial representation of candidate organs where the query variant is likely to be functional and within different categories of regulatory elements.
For example, if the variant is within an active enhancer region for that biosample, it might lead to changes in the gene expression that is regulated by that enhancer.
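The rule of coloring each organ by the most active state among its biosamples can be sketched in Python as below. The state names, their ordering in ACTIVITY_ORDER, the records, and the function name are all hypothetical; the ranking RegulomeDB actually uses to decide which state is "most active" is not given here.

```python
# Hypothetical activity ranking: earlier in the list means more active.
ACTIVITY_ORDER = ["Active TSS", "Active enhancer", "Weak enhancer", "Quiescent"]
RANK = {state: i for i, state in enumerate(ACTIVITY_ORDER)}

# Hypothetical records: the chromatin state at the variant in each biosample.
records = [
    {"organ": "liver", "biosample": "HepG2", "state": "Active enhancer"},
    {"organ": "liver", "biosample": "liver tissue", "state": "Quiescent"},
    {"organ": "blood", "biosample": "K562", "state": "Weak enhancer"},
    {"organ": "blood", "biosample": "GM12878", "state": "Active TSS"},
]


def most_active_state_by_organ(records):
    """For each organ, keep the highest-ranked state seen in any of its biosamples."""
    best = {}
    for rec in records:
        organ, state = rec["organ"], rec["state"]
        if organ not in best or RANK[state] < RANK[best[organ]]:
            best[organ] = state
    return best


print(most_active_state_by_organ(records))
```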
Users can use the body map diagram to filter down the search results to display only a few organs of interest. Users can also filter the search results using the list of biosamples or the various chromatin states that are listed on the panels next to the body map.
The tabular view below provides further details on the biosample, classification, organ as well as the source ENCODE datasets and files (hyperlinked to ENCODE for further metadata exploration).
Genome browser: Users can explore the genes near the variant (shown as a yellow highlight on the browser tracks). The browser shows the tracks from TF ChIP-seq and DNase-seq assays whose peaks overlap the variant.
Users can use the “Refine your search” interface located above the browser tracks to further narrow down the list of tracks as needed.
This interface allows users to select from a variety of faceting options. For example, users can filter the tracks displayed in the browser by file type (bigWig or bigBed), dataset type (ChIP-seq or DNase-seq), organ or cell type, biosample, and the targets used in the respective ChIP-seq assays.
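Faceted filtering of this kind amounts to keeping only the tracks whose metadata match every selected facet. A minimal Python sketch follows, assuming a hypothetical track record layout and a hypothetical refine function; it does not reflect how the browser is actually implemented.

```python
# Hypothetical track metadata; the field names are illustrative only.
tracks = [
    {"file_type": "bigWig", "assay": "ChIP-seq", "organ": "liver",
     "biosample": "HepG2", "target": "CTCF"},
    {"file_type": "bigBed", "assay": "ChIP-seq", "organ": "blood",
     "biosample": "K562", "target": "POLR2A"},
    {"file_type": "bigBed", "assay": "DNase-seq", "organ": "blood",
     "biosample": "K562", "target": None},
]


def refine(tracks, **facets):
    """Keep tracks whose value for every given facet is among the allowed values."""
    return [
        track for track in tracks
        if all(track.get(name) in allowed for name, allowed in facets.items())
    ]


# Select bigBed tracks from blood, regardless of assay type.
selected = refine(tracks, file_type={"bigBed"}, organ={"blood"})
print([(t["assay"], t["biosample"]) for t in selected])
```

Facets combine with a logical AND across fields and a logical OR within a field, which is a common convention for faceted search and is assumed here.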
Users can also expand the track information section using the expand button on the lower right corner of each track. This expanded view allows the users to see the underlying ENCODE file and ENCODE dataset (both of which are hyperlinked to the respective ENCODE pages).
To cite RegulomeDB:
Dong, S., Zhao, N., Spragins, E., Kagda, M. S., Li, M., Assis, P. R., Jolanki, O., Luo, Y., Michael Cherry, J., Boyle, A. P., & Hitz, B. C. (2022). Annotating and prioritizing human non-coding variants with RegulomeDB. In bioRxiv.
Boyle AP, Hong EL, Hariharan M, Cheng Y, Schaub MA, Kasowski M, Karczewski KJ, Park J, Hitz BC, Weng S, Cherry JM, Snyder M. Annotation of functional variation in personal genomes using RegulomeDB. Genome Research 2012, 22(9):1790-1797. PMID: 22955989.