A data mining study of G-quadruplexes and their effect on DNA replication
Abstract
G-quadruplexes are guanine-rich sequences of DNA that can form non-Watson-Crick
four-stranded structures. They have been found to exist in various regions of the genome and
are believed to play a biological role. We hypothesize that the presence of these structures
poses a barrier to DNA replication by standard DNA polymerases and thus requires the
intervention of alternative robust but error-prone polymerases for the completion of DNA
replication. To test this hypothesis in silico, we assumed that the presence of error-prone
replication could be inferred by studying the degree of variation at these sites. We analyzed
the density of single nucleotide polymorphisms in the neighborhood of potential G-quadruplex
sequences in the human genome. The analysis shows a significantly higher
density of single nucleotide polymorphisms within G-quadruplexes. Further, there is
evidence of a directional bias in the extent of error, seen as an asymmetry in the incidence of
single nucleotide polymorphisms on either side of quadruplexes. Taken together, the
evidence favors the hypothesis that G-quadruplexes have a deleterious effect on the fidelity
of DNA replication.
A secondary research goal of the thesis is to reduce the number of false positives in
the prediction of G-quadruplexes based only on sequence information. Most current
algorithms are regular expression searches based on sequences that have shown potential to
form G-quadruplexes. Using the results from our investigation of sequence variation,
predicted melting temperature, and machine learning models, we analyzed attributes derived
solely from the sequences to determine whether classification can be performed accurately. We
conclude that factors external to the sequence may be important in determining if and when
G-quadruplexes form.
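As an illustration of the regular expression searches mentioned above, the following is a minimal sketch, not the method used in the thesis. It assumes a commonly cited motif of four runs of at least three guanines separated by loops of one to seven bases; the pattern, the loop limits, and the names G4_PATTERN and find_candidate_g4 are assumptions introduced here for illustration.

```python
import re

# Assumed motif (not taken from the thesis): four runs of >= 3 guanines
# separated by three loops of 1-7 bases each.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_candidate_g4(sequence):
    """Return (start, end, matched_text) for each non-overlapping candidate.

    Coordinates are 0-based and end-exclusive. Only the given strand is
    scanned; the complementary strand would need a cytosine-based pattern.
    """
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


if __name__ == "__main__":
    for hit in find_candidate_g4("ttGGGAGGGTGGGCGGGtt"):
        print(hit)
```

A search of this kind reports every sequence that fits the pattern, whether or not a quadruplex actually forms there, which is why reducing false positives is posed as a separate goal.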
Table of Contents
Introduction -- SNP density analysis -- Melting temperature analysis -- Machine learning analysis -- Conclusion
Degree
M.S.