Transcription Factor Graph in Prolog

From MagnetoWiki

This page describes an attempt to model transcription factor (TF) interactions in Prolog by mapping JASPAR / ENCODE TF consensus sites to gene promoters (defining the promoter region as -1000 bp to +1000 bp of the transcription start site), and then querying for TFs that bind the same genes, or genes that are bound by the same TFs.


Transcription Start Sites (TSS)

Schematic example of locating the gene promoter region from -1000 bp to +1000 bp of the TSS for 2 genes, one on the (+)-strand and one on the (-)-strand.

Chromosomal positions were tabulated for regions of human chromosomes corresponding to 1000 bp upstream and 1000 bp downstream of TSS for 21857 genes. The choice of TSSs is inspired by Jonathan Dennis's nucleosome mapping projects.

1. Start from the scg3_copy.bed.xlsx file from Jonathan Dennis (email 2021-10-27), which holds human TSS coordinates from HG19. The columns of initial interest are A, B, F and M: chromosome, start, strand, and name. "The reason we use HG19 instead of HG38 is because until recently HG19 has been much better annotated."

See the UCSC description of the BED file format for details on the position format.

TODO: make sure we are getting the boundary conditions correct, i.e. are we including 2001 bases with the TSS as the middle base, or 2000 bases with the TSS as base +1 (is there no position 0)?
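The two readings in the TODO differ by exactly one base, which the following minimal sketch makes explicit. It does not settle which reading applies to the spreadsheet; it only shows the arithmetic. It relies on the BED convention that the start coordinate is 0-based and the end coordinate is excluded. The predicate names are hypothetical, and 69090 is the TSS implied by the OR4F5 example window on this page.

```prolog
% Illustrative predicates only; they are not part of the original workflow.

% Window of +/-1000 around a TSS coordinate.
window(TSS, Low, High) :-
    Low is TSS - 1000,
    High is TSS + 1000.

% Closed interval: both Low and High are counted.
closed_length(Low, High, N) :-
    N is High - Low + 1.

% Half-open interval (BED style): Low is counted, High is not.
half_open_length(Low, High, N) :-
    N is High - Low.
```

For example, `?- window(69090, L, H), closed_length(L, H, N).` gives L = 68090, H = 70090, N = 2001, whereas `half_open_length/3` on the same bounds gives N = 2000.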

2. Find -1000 bp and +1000 bp from the TSS. Assume that the TSS is ChromStart (column B) if the gene is on the (+)-strand, and ChromEnd (column C) if the gene is on the (-)-strand.

So:

  • if on the (+)-strand, the TSS is ChromStart, and the region runs from upstream ChromStart-1000 to downstream ChromStart+1000.
  • if on the (-)-strand, the TSS is ChromEnd, and the region runs from downstream ChromEnd-1000 to upstream ChromEnd+1000 (note that the orientation is flipped).
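The strand rule above can be sketched in Prolog as follows. This is a minimal illustration, not the spreadsheet formulas actually used: the predicate names `tss/4`, `promoter_window/5` and `flanks/5` are hypothetical, rows with strand '.' are not handled, and windows near the start of a chromosome are not clipped.

```prolog
% tss(+Strand, +ChromStart, +ChromEnd, -TSS)
% The TSS is ChromStart on the (+)-strand and ChromEnd on the (-)-strand.
tss('+', ChromStart, _ChromEnd, ChromStart).
tss('-', _ChromStart, ChromEnd, ChromEnd).

% promoter_window(+Strand, +ChromStart, +ChromEnd, -Low, -High)
% Low and High are (+)-strand coordinates with Low < High on either strand.
promoter_window(Strand, ChromStart, ChromEnd, Low, High) :-
    tss(Strand, ChromStart, ChromEnd, TSS),
    Low is TSS - 1000,
    High is TSS + 1000.

% flanks(+Strand, +Low, +High, -Upstream, -Downstream)
% Labels the two ends of the window relative to the gene's orientation.
flanks('+', Low, High, Low, High).
flanks('-', Low, High, High, Low).
```

For example, `?- promoter_window('+', 69090, 70008, Low, High).` gives Low = 68090 and High = 70090; the ChromEnd value here is a placeholder and is ignored for (+)-strand genes. For a (-)-strand gene, `flanks/5` returns the higher coordinate as the upstream end.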


3. Save a copy as scg3_copy.bed.plusminus1000.xlsx with additional columns for the TSS, 1000 bp upstream and 1000 bp downstream (with positions all given on the (+)-strand).

Gene Facts Database

Generate Prolog facts for each gene promoter of the form:

gene(GeneName,position(Chrom,Upstream1000,Downstream1000,Strand)).

where

  • GeneName is the gene abbreviation, e.g. 'OR4F5'
  • Chrom is the chromosome name, 'chr1' through 'chr22', 'chrX', 'chrY'
  • Upstream1000 is the position 1000 bp upstream of the TSS, given in (+)-strand coordinates, e.g. 68090
  • Downstream1000 is the position 1000 bp downstream of the TSS, given in (+)-strand coordinates, e.g. 70090
  • Strand is '+', '-', or '.' (= no strand). Note that the orientation, and hence which side of the TSS is upstream or downstream, is flipped on the (-)-strand.

So the promoter region of the first gene on chr1 is asserted as:

gene('OR4F5',position('chr1',68090,70090,'+')).
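Facts in the gene(GeneName,position(Chrom,Upstream1000,Downstream1000,Strand)) form can be queried directly, for instance to ask which promoters contain a given chromosomal position, which is the kind of lookup needed when placing a TF site on a promoter. The sketch below is an illustration only: `promoter_contains/3` is a hypothetical helper, and treating both ends of the window as inclusive is an assumption.

```prolog
% Example fact in the four-field position form (chromosome first).
gene('OR4F5', position('chr1', 68090, 70090, '+')).

% promoter_contains(?Gene, +Chrom, +Pos)
% True if Pos on Chrom lies within Gene's promoter window.
% min/max are used because on the (-)-strand the upstream
% coordinate is the larger of the two.
promoter_contains(Gene, Chrom, Pos) :-
    gene(Gene, position(Chrom, A, B, _Strand)),
    Low is min(A, B),
    High is max(A, B),
    Pos >= Low,
    Pos =< High.
```

With only the fact above loaded, `?- promoter_contains(G, 'chr1', 69500).` binds G to 'OR4F5', and the same query with position 80000 fails.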

Facts are constructed in Excel by concatenating fields in scg3_copy.bed.plusminus1000.xlsx, then saved to gene-facts.prolog text file.
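As an alternative to concatenating fields in Excel, the same facts could be written from Prolog itself. The following is a sketch under stated assumptions, not the method used here: `write_gene_fact/5` and `write_gene_facts/1` are hypothetical names, and the `row/5` terms stand in for spreadsheet rows that have already been loaded by some other means. It uses the `~q` directive of `format/2` so that atoms are quoted where Prolog requires it.

```prolog
% write_gene_fact(+Name, +Chrom, +Up, +Down, +Strand)
% Prints one gene/2 fact, terminated by a full stop and newline.
% ~q quotes atoms when needed; ~d expects an integer.
write_gene_fact(Name, Chrom, Up, Down, Strand) :-
    format("gene(~q,position(~q,~d,~d,~q)).~n",
           [Name, Chrom, Up, Down, Strand]).

% write_gene_facts(+Rows)
% Rows is a list of row(Name, Chrom, Up, Down, Strand) terms
% (a hypothetical in-memory representation of the spreadsheet).
write_gene_facts(Rows) :-
    forall(member(row(Name, Chrom, Up, Down, Strand), Rows),
           write_gene_fact(Name, Chrom, Up, Down, Strand)).
```

Calling `write_gene_facts([row('OR4F5', 'chr1', 68090, 70090, '+')])` prints a single clause that reads back as the OR4F5 fact. Atoms that need no quoting, such as chr1, may be printed without quotes; Prolog reads them as the same atoms.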