Permutation compression with applications to genomic data
[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT AUTHOR'S REQUEST.] High-throughput sequencing technology, which is widely used in molecular biology, generates data at an ever-increasing rate. Such technologies now produce sequences at a cost lower than that of storing them. In applications such as gene rearrangement, genomic comparison, and the extraction of phylogenetic information, the order of genes is important: comparing chromosomes gives insight into how similar or different the positions of genes are across arrangements. Because chromosomes are represented as permutations of genes, these permutations must be stored efficiently for species with large genomes, such as human. This calls for an efficient compression technique that stores permutations in a compact form of reduced size. Such a technique can lower the cost of storage and save the time and effort needed to retrieve the position of a gene. In computer science, the traditional techniques widely used for compressing integers are built around the frequency of symbols in the dataset. In a permutation, however, each integer appears exactly once, so existing frequency-based algorithms cannot effectively compress such datasets: there is no skew in symbol frequency to exploit. This opens an area for researchers to develop novel compression techniques that exploit the characteristics of data represented in permutation notation. This research presents a novel compression algorithm that exploits the unique features of genomic data.
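The point about symbol frequency can be illustrated with a small Python sketch. It is not the algorithm presented in this research. It only compares the empirical (zero-order) entropy of a permutation with that of a skewed sequence, and computes log2(n!), the number of bits needed to distinguish all permutations of n items. The gene identifiers, the skewed sequence, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def zero_order_entropy_bits(seq):
    """Empirical, frequency-based entropy of seq in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def permutation_lower_bound_bits(n):
    """log2(n!): bits needed to distinguish all permutations of n items."""
    return math.lgamma(n + 1) / math.log(2)


# Hypothetical gene order: a permutation of 8 gene identifiers.
gene_order = [3, 1, 4, 8, 5, 2, 7, 6]
# Hypothetical skewed sequence of the same length, for contrast.
skewed = [1, 1, 1, 1, 1, 1, 2, 3]

# Every symbol of the permutation occurs once, so its entropy is log2(8).
perm_entropy = zero_order_entropy_bits(gene_order)
# The skewed sequence has a dominant symbol, so its entropy is lower.
skewed_entropy = zero_order_entropy_bits(skewed)

# Fixed-width storage of the permutation versus the counting bound.
fixed_width_bits = len(gene_order) * math.ceil(math.log2(len(gene_order)))
bound_bits = permutation_lower_bound_bits(len(gene_order))

print(f"permutation entropy: {perm_entropy:.3f} bits/symbol")
print(f"skewed entropy:      {skewed_entropy:.3f} bits/symbol")
print(f"fixed width: {fixed_width_bits} bits, log2(n!): {bound_bits:.3f} bits")
```

Under these assumptions, a frequency-based coder gains nothing on the permutation, since its per-symbol entropy equals the fixed-width cost. Any saving must instead come from structure in the ordering itself, for which log2(n!) is the limit when all orderings are equally likely.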
Access to files is limited to the University of Missouri--Columbia.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.