School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, PR China

Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, PR China

Departments of Orthopedic Surgery and Basic Medical Sciences, University of Missouri - Kansas City, Kansas City, MO 64108, USA

Center of Systematic Biomedical Research, Shanghai University of Science and Technology, Shanghai 200093, PR China

College of Life Sciences and Engineering, Beijing Jiao Tong University, Beijing 100044, PR China

Abstract

Background

The recently introduced pathway-based approach is a promising way to improve the efficiency of analyzing genome-wide association scan (GWAS) data to identify disease variants, because it jointly considers variants of genes that belong to the same biological pathway. However, the currently available pathway-based approaches for analyzing GWAS data have limited power and efficiency.

Results

We proposed a new and efficient permutation strategy based on SNP randomization for determining significance in pathway analysis of GWAS. The strategy was evaluated and compared with two previously available methods, sample permutation and gene permutation, through simulation studies and a study of a real dataset. The results showed that the proposed permutation strategy is more powerful and efficient while greatly reducing the computational complexity.

Conclusion

Our findings indicate the improved performance of SNP permutation and thus render pathway-based analysis of GWAS more applicable and attractive.

Background

The genome-wide association scan (GWAS) study is becoming a popular and powerful method to identify genes underlying complex disorders/traits.

Wang and his colleagues developed an enrichment score based pathway method for GWAS

In this study, we proposed a new and efficient permutation strategy based on SNP randomization for the significance assessment in pathway-based analysis. Our approach not only dramatically reduced the computational complexity but also improved the power to detect potential pathways involving genes with joint effects on complex disorders/traits. Extensive simulations were conducted to assess the performance of the proposed strategy, the sample randomization and gene randomization strategies. We also applied the three permutation strategies to a real dataset (see

Methods

Pathway-based analysis algorithm

To make this article self-contained, we briefly describe the pathway-based analysis algorithm that was recently extended to GWAS by Wang et al. Each SNP is assigned an association statistic (e.g. χ^{2} for a case/control association test), each gene is assigned a statistic derived from its SNPs, and the gene statistics are ranked across the genome. For any given pathway/gene set S, an enrichment score (ES_{S}), which is a weighted Kolmogorov-Smirnov-like (K-S-like) running sum statistic, is then calculated.

Where _{S }will be observed.
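As an illustration of the kind of statistic described here, the sketch below computes a weighted K-S-like running sum over a ranked gene list. It is a minimal sketch, not the authors' implementation: the function name `enrichment_score`, the weight exponent `p`, and the choice to report the maximum positive deviation are assumptions borrowed from the usual gene set enrichment convention.

```python
import numpy as np

def enrichment_score(gene_stats, in_set, p=1.0):
    """Weighted K-S-like running-sum enrichment score for one gene set.

    gene_stats : 1-D array of gene-level association statistics.
    in_set     : boolean array, True where the gene belongs to the set.
    p          : weight exponent (assumed; p=0 gives the unweighted form).
    """
    gene_stats = np.asarray(gene_stats, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    # Rank all genes from the largest statistic to the smallest.
    order = np.argsort(-gene_stats, kind="stable")
    stats, hits = gene_stats[order], in_set[order]
    n_genes, n_hits = stats.size, int(hits.sum())
    if n_hits == 0 or n_hits == n_genes:
        raise ValueError("gene set must contain some, but not all, genes")
    weights = np.where(hits, np.abs(stats) ** p, 0.0)
    if weights.sum() == 0:
        raise ValueError("all statistics in the gene set are zero")
    # Step up (weighted) at genes in the set, step down elsewhere.
    step_up = weights / weights.sum()
    step_down = np.where(hits, 0.0, 1.0 / (n_genes - n_hits))
    running_sum = np.cumsum(step_up - step_down)
    return float(running_sum.max())
```

A gene set whose members all sit at the top of the ranked list reaches the maximum score of 1, whereas a set concentrated at the bottom scores close to 0.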

Permutation strategies

Permutation processes are adopted to approximate the null distribution for the test statistic of each pathway/gene set (_{S }for a given pathway S. In each permutation, this approach shuffles all SNPs across the genome and calculates the statistic for each gene. The scheme of SNP permutation process as well as the other two existing permutation processes is depicted in Fig.

**The scheme of different permutation processes**. Horizontal dashed lines denote genome-wide genotype information of a study subject. Vertical lines denote SNP positions. Black boxes represent regions in which SNPs are annotated to a specific gene.

Step 1: Perform general genome-wide association analyses to determine the SNP-phenotype association statistic for every SNP in the collected dataset.

Step 2: Shuffle all SNPs across the genome to generate a permuted GWAS dataset.

Step 3: With the permuted dataset, calculate the association statistic for each gene and compute the enrichment statistic (ES_{S}) for each pathway/gene set using Formula (1), as was done for the observed dataset.

Step 4: Repeat Steps 2 and 3 a pre-set number of times (e.g. 100,000) to obtain the null distribution of ES_{S} for each pathway/gene set.

Step 5: Based on the pool of null distributions of ES_{S} over all pathways/gene sets, determine the significance of each pathway/gene set according to the following strategy.
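Steps 1 to 4 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the gene statistic is taken as the largest statistic among the SNPs mapped to the gene, every gene is assumed to have at least one SNP, and the function names (`gene_statistics`, `running_sum_es`, `snp_permutation_null`) are hypothetical.

```python
import numpy as np

def gene_statistics(snp_stats, snp_gene, n_genes):
    """Gene statistic, assumed here to be the largest statistic among its SNPs."""
    out = np.full(n_genes, -np.inf)
    np.maximum.at(out, snp_gene, snp_stats)  # snp_gene[i] = gene index of SNP i
    return out

def running_sum_es(gene_stats, in_set, p=1.0):
    """Weighted K-S-like running-sum score (assumed form of the enrichment score)."""
    order = np.argsort(-gene_stats, kind="stable")
    hits = in_set[order]
    weights = np.where(hits, np.abs(gene_stats[order]) ** p, 0.0)
    steps = weights / weights.sum() - (~hits) / (~hits).sum()
    return float(np.cumsum(steps).max())

def snp_permutation_null(snp_stats, snp_gene, n_genes, in_set, n_perm=1000, seed=0):
    """Null distribution of the enrichment score under SNP permutation.

    snp_stats holds the SNP-level statistics from Step 1; only these values
    are shuffled, so no raw genotype data are needed.
    """
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = rng.permutation(snp_stats)                 # Step 2
        genes = gene_statistics(shuffled, snp_gene, n_genes)  # Step 3
        null[b] = running_sum_es(genes, in_set)
    return null                                               # Step 4
```

The observed score is obtained by calling `gene_statistics` and `running_sum_es` on the unshuffled statistics. Because each permutation reuses the SNP-level output of a single association scan, the cost per permutation does not depend on the number of study subjects.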

Estimating significance

The nominal P value is estimated as the fraction of permutations in which ES_{S} is greater than the observed one.

Nominal ES_{S} may not be comparable between pathways/gene sets, which usually have different numbers of genes. To make the enrichment score comparable between pathways, a normalized

Similar to general GWAS, multiple-testing adjustment is needed to correct for the large number of pathways/gene sets tested simultaneously. The false-discovery rate (FDR), a procedure frequently used to keep the expected fraction of false-positive findings below a certain threshold, is used to adjust for multiple testing and to compare the performance of the three strategies _{S}) is calculated as the ratio between the fraction of permutations over all pathways/gene sets with
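The significance estimates described in this section can be sketched as follows. Because the exact definitions are incomplete in the text, several choices here are assumptions taken from the common gene set enrichment convention: scores are normalized by the mean and standard deviation of each set's permutation scores, and the FDR at a given normalized score is the fraction of permuted scores at or above it divided by the fraction of observed scores at or above it. The function name `significance` is hypothetical.

```python
import numpy as np

def significance(observed_es, null_es):
    """Nominal p value, normalized score and FDR for each gene set.

    observed_es : shape (n_sets,)        observed enrichment scores.
    null_es     : shape (n_perm, n_sets) scores from the permutations.
    """
    observed_es = np.asarray(observed_es, dtype=float)
    null_es = np.asarray(null_es, dtype=float)
    # Fraction of permutations whose score exceeds the observed one.
    nominal_p = (null_es > observed_es).mean(axis=0)
    # Assumed normalization: centre and scale by each set's null distribution.
    mu = null_es.mean(axis=0)
    sd = null_es.std(axis=0, ddof=1)
    nes_obs = (observed_es - mu) / sd
    nes_null = (null_es - mu) / sd
    # Assumed FDR: pooled null fraction over observed fraction at each cutoff.
    fdr = np.empty_like(nes_obs)
    for k, cutoff in enumerate(nes_obs):
        null_frac = (nes_null >= cutoff).mean()
        obs_frac = (nes_obs >= cutoff).mean()  # at least 1 / n_sets
        fdr[k] = min(1.0, null_frac / obs_frac)
    return nominal_p, nes_obs, fdr
```

Pooling the normalized null scores over all gene sets is what allows a single threshold to be applied to pathways of different sizes.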

Experimental datasets

A Caucasian GWAS sample including 1,000 unrelated subjects selected from our established and expanding genetic repertoire was used for both the simulation studies and the experimental study

BioCarta pathway database

Simulation studies

Using our experimental genotype data, we carried out simulation studies to compare the proposed permutation strategy with sample randomization and gene randomization, based on the distribution and significance of _{S }obtained through the three permutation strategies under two scenarios.

Scenario 1: It aimed to demonstrate the differences in the distributions of _{S }for the three permutation approaches under the null hypothesis of no marker-phenotype association across the genome. It was simulated by randomly generating the phenotype data according to a standard normal distribution.

Scenario 2: It aimed to illustrate the differences in the distributions of _{S }for the three permutation approaches under the null hypothesis that gene-disease associations exist but no gene set is enriched with genes ranking at the top of the entire gene list in the genome. We randomly selected one gene from each of the 166 pathways. After removing duplicates, 75 unique genes remained. Phenotype data were then simulated under the assumption that each of the 75 genes accounts for 1% of the genetic variation.
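The two simulation scenarios can be sketched as below. The details are assumptions made for illustration, since the text does not give the generating model: one additive causal SNP per selected gene, polymorphic SNPs that are independent of each other, each contributing 1% of the phenotypic variance, and a normally distributed residual that supplies the rest. Function names are hypothetical.

```python
import numpy as np

def simulate_null_phenotype(n_subjects, rng):
    """Scenario 1: phenotype drawn from a standard normal, unrelated to genotype."""
    return rng.standard_normal(n_subjects)

def simulate_causal_phenotype(genotypes, var_per_snp=0.01, rng=None):
    """Scenario 2 (assumed model): each causal SNP explains var_per_snp of the variance.

    genotypes : (n_subjects, n_causal) minor-allele counts, one SNP per causal gene.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(genotypes, dtype=float)
    z = (g - g.mean(axis=0)) / g.std(axis=0)  # standardize each SNP
    n_subjects, n_causal = z.shape
    residual_var = 1.0 - n_causal * var_per_snp
    if residual_var <= 0:
        raise ValueError("causal SNPs would explain all of the variance")
    genetic = np.sqrt(var_per_snp) * z.sum(axis=1)
    noise = rng.normal(0.0, np.sqrt(residual_var), n_subjects)
    return genetic + noise
```

With 75 causal genes at 1% each, this model leaves 25% of the phenotypic variance to the residual term.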

Before general association analyses and pathway analyses, population stratification was tested and controlled in the experimental GWAS dataset. The population stratification inflation factor
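For readers unfamiliar with the inflation factor, the sketch below shows the standard genomic-control estimate: the median of the observed 1-df χ² statistics divided by the median of the χ² distribution with one degree of freedom (about 0.455). Whether this is the exact estimator used in this study is an assumption, and the function names are hypothetical.

```python
import numpy as np

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(chi2_stats):
    """Genomic-control inflation factor from 1-df association statistics."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)

def genomic_control(chi2_stats):
    """Deflate the statistics by the inflation factor when it exceeds 1."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return np.asarray(chi2_stats, dtype=float) / lam
```

A value close to 1 indicates little inflation of the test statistics from population stratification.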

To compare the _{S }distributions, 100,000 SNP permutations and 100,000 gene permutations were conducted under both simulation scenarios and for the real dataset, but only 1,000 sample permutations were performed because of the extreme computational complexity.

Results

Simulation studies

Fig._{S }value distribution of the three permutation strategies under scenario 1. Under the null hypothesis of no marker-phenotype association,

**Results of general genome-wide association analysis and pathway analysis under scenario 1**. A is the quantile-quantile plot of the general genome-wide association analysis. B is the FDR value distribution of the 166 pathways for the three permutation approaches in pathway analysis. A total of 100,000 permutations were performed for SNP and gene randomization and 1,000 permutations for sample randomization.

Fig. _{S }value distribution of the three permutation strategies under scenario 2. With simulated genetic association, we observed an excess number of SNPs in the tail of statistical distribution showing association to the phenotype (Fig. _{S }values should be uniformly distributed. Indeed, sample permutation recognized no enriched pathway. However, the gene permutation method detected most of the pathways (91.56%) as significant with a _{S }value cutoff of 0.05. The SNP permutation approach exhibited an intermediate performance with only one _{S }value less than 0.05 (Fig.

To evaluate computation efficiency, we also assessed the CPU runtime required by the three permutation strategies in the simulation studies. Computation time as well as computation resources used in the simulation studies were summarized in Table

Runtime comparison for the three permutation methods

| **Permutation method** | **Computation resource** | **Number of permutations** | **Runtime (hours), Scenario 1** | **Runtime (hours), Scenario 2** |
|---|---|---|---|---|
| Sample | One cluster of 4 nodes, each of which has 8 Intel® Pentium® P4 2.0 GHz processors, 7 GB RAM | 1,000 | 12.32 | 12.35 |
| SNP | Intel® Pentium® 4 3.4 GHz dual processors and 2.0 GB RAM | 100,000 | 2.81 | 2.81 |
| Gene | Intel® Pentium® 4 3.4 GHz dual processors and 2.0 GB RAM | 100,000 | 1.90 | 1.89 |

**Results of general genome-wide association analysis and pathway analysis under scenario 2**. A is the quantile-quantile plot of the general genome-wide association analysis. B is the FDR value distribution of the 166 pathways for the three permutation approaches. A total of 100,000 permutations were performed for SNP and gene randomization and 1,000 permutations for sample randomization.

Application to the empirical GWAS dataset

We evaluated and compared the relative performance of the three strategies by analyzing an empirical dataset whose aim was to identify osteoporosis susceptibility genes. General genome-wide association analysis for hip BMD was conducted previously _{S }values were greater than 0.10. Results obtained from gene permutation showed a high false-positive rate, since more than one hundred pathways had _{S }values less than 0.05, in sharp contrast with those reported by sample permutation (correlation coefficient of -0.16). Interestingly, signals generated by SNP permutation were analogous to those from sample permutation, with similar trends and shapes but steeper peaks. The _{S }values obtained by SNP permutation were highly correlated with those obtained by sample permutation, with a correlation coefficient of 0.87 (p < 0.001). SNP permutation identified the phospholipase C-epsilon pathway (plcePathway) as the most significantly enriched pathway after adjustment for multiple testing (_{S }≤ 0.01).

Although plcePathway is a proposed model for b2-AR- and prostanoid-receptor-mediated PLC and calcium signaling

**Pathway-based genome-wide association results for the experimental dataset**. Results for randomization of gene, sample, and SNP are colored in blue, red, and black, respectively. The X-axis shows the tested pathways. The Y-axis is the log of the observed FDR value.

Discussion

Genome-wide association analysis has become a mainstay in genomic and genetic research

The pathway-based approach for GWAS has a number of advantages. First, it integrates a group of genes belonging to the same pathway/gene set against the background of the entire gene list in a genome-wide scan. Second, it preserves gene-gene correlations among specific gene sets when testing for significance. Third, it eases the interpretation of a large-scale association study by identifying pathways or gene set processes rather than focusing on high-scoring genes, and allows researchers to refine gene subsets to elucidate biological mechanisms. Fourth, it is robust to background noise and is more likely to detect genes with moderate effects.

Permutation is a crucial process for assessing significance in pathway analysis of gene expression data

Our newly proposed permutation strategy of SNP randomization is informative and efficient. On the one hand, compared with gene permutation, SNP permutation is more rational because it assumes that existing genetic effects are randomly scattered across the genome rather than among genes. In pathway analyses, the statistic for a gene is combined from SNP-level statistics, so randomizing the integrated gene statistics ignores the variation in the number of SNPs between genes. For example (please refer to Fig. _{A}, _{B }present the gene statistics for gene A and B, separately. When we shuffle the gene statistics in a permutation, gene A may take the statistic value _{B}, which is based on 20 rather than 10 SNPs. Gene statistics constructed from different numbers of SNPs are expected to follow different distributions. As gene permutations accumulate, the number of SNPs underlying the statistic assigned to a given gene varies greatly, which introduces considerable noise into the significance determination process. This may partly explain the inflated type I error rate of gene permutation. Because SNP permutation shuffles the SNP-level statistics and recalculates the gene statistics in each permutation, it overcomes this problem. On the other hand, compared with sample randomization, SNP randomization is not only highly efficient but also maintains an acceptable accuracy level (i.e. SNP randomization is not subject to an inflation of type I error rate). Although the earlier strategy of sample permutation is well accepted, it has not been widely applied because of the huge computation required to obtain a large number of replications. Given the millions of genotyped markers in thousands of subjects in current GWAS, only a very limited number of replications (such as 1,000) of sample randomization can be obtained within a reasonable time frame.
Overall, SNP randomization as proposed in the current study inherits the merit of sample permutation, making full use of the observed data, while eliminating the problem of computational intensity. SNP randomization also shares the advantage of gene permutation, which uses the output of general GWAS instead of raw genotype data. Therefore, SNP permutation is not only powerful but also cost-effective.
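The point about genes with 10 versus 20 SNPs can be checked numerically. The sketch below assumes that the gene statistic is the largest of its SNP statistics and that SNP statistics are independent 1-df χ² values under the null; both are simplifications, and the function name is hypothetical.

```python
import numpy as np

def null_gene_stat_quantile(n_snps, q=0.95, n_draws=20000, seed=0):
    """Upper quantile of the null gene statistic for a gene with n_snps SNPs."""
    rng = np.random.default_rng(seed)
    # Independent null SNP statistics; one row per simulated gene.
    chi2 = rng.chisquare(df=1, size=(n_draws, n_snps))
    gene_stat = chi2.max(axis=1)  # assumed gene statistic: maximum over SNPs
    return float(np.quantile(gene_stat, q))

q10 = null_gene_stat_quantile(10)  # gene with 10 SNPs
q20 = null_gene_stat_quantile(20)  # gene with 20 SNPs
print(f"95% null quantile: 10 SNPs = {q10:.2f}, 20 SNPs = {q20:.2f}")
```

The quantile for the 20-SNP gene is the larger of the two, so swapping gene-level statistics between such genes mixes values drawn from different null distributions.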

One potential limitation of SNP randomization is that independent SNP sampling may not preserve the linkage disequilibrium among SNPs or the correlation structures among functionally related genes. In our experience, this potential problem can be overcome by increasing the number of randomizations. The larger the number of permutations, the more accurate the null distribution will be, and the more faithfully it reflects the chance distribution of enrichment of gene-phenotype association signals. Indeed, it can be seen from the results of our empirical dataset (see Fig. _{S }values determined from SNP permutation (100,000 randomizations) are highly correlated with those from sample permutation (1,000 randomizations). Based on our application, over 50,000 SNP permutations will produce a relatively stable null distribution for significance determination (the results of 50,000, 100,000 and 150,000 SNP permutations, not shown, were almost the same).
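How the number of permutations affects the precision of an empirical p value can be sketched with the binomial standard error. This quantifies only the sampling noise of the estimate for a given permutation scheme; the function name is hypothetical.

```python
import math

def permutation_p_standard_error(p, n_perm):
    """Binomial standard error of an empirical p value from n_perm permutations."""
    return math.sqrt(p * (1.0 - p) / n_perm)

for n_perm in (1_000, 50_000, 100_000):
    se = permutation_p_standard_error(0.05, n_perm)
    # The smallest non-zero p value that can be observed is 1 / n_perm.
    print(f"{n_perm:>7} permutations: SE at p=0.05 is {se:.5f}, resolution {1 / n_perm:.0e}")
```

Going from 1,000 to 100,000 permutations reduces the standard error by a factor of 10.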

Recently, two new algorithms were proposed for pathway analysis of GWAS

Conclusion

We report here a SNP permutation scheme that is capable of effectively approximating a comprehensive null distribution to determine statistical significance, which will greatly facilitate pathway-based analysis for genome-wide data. With the improved performance and the implementation of our new SNP permutation strategy, pathway-based GWAS approach becomes more attractive and can be more broadly applied to genome-wide association datasets. Along with single marker/gene based analysis, pathway-based GWAS will enhance our understanding of pathogenesis of complex disorders.

Authors' contributions

YG designed, conducted and analyzed the simulations and prepared a draft of this article. JL participated in project design. LZ and YC provided experimental data management and participated in project design. HD designed and coordinated the work, and participated in the interpretation of the results and the manuscript writing. All authors read and approved the final manuscript.

Acknowledgements

Investigators of this work were partially supported by grants from NIH (R01 AR050496, R21 AG027110, R01 AG026564, P50 AR055081, and R21 AA015973).