Department of Mathematics and Statistics, University of Missouri-Kansas City, Kansas City, MO 64110, USA

Department of Statistics, Hacettepe University, 06800 Beytepe-Ankara, Turkey

Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA

Departments of Orthopedic Surgery and Basic Medical Sciences, School of Medicine, University of Missouri-Kansas City, Kansas City, MO, 64108, USA

To study chromosomal aberrations that may lead to cancer formation or genetic diseases, the array-based Comparative Genomic Hybridization (aCGH) technique is often used for detecting DNA copy number variants (CNVs). Various methods have been developed for obtaining CNV information from aCGH data. However, most of these methods use only the log-intensity ratios in aCGH data and do not take advantage of other information such as the DNA probe (e.g., biomarker) positions/distances contained in the data. Motivated by the specific features of aCGH data, we developed a novel method that estimates a change point, or locus of a CNV, in aCGH data together with its associated biomarker position on the chromosome using a compound Poisson process. We used a Bayesian approach to derive the posterior probability for the estimation of the CNV locus. To detect the loci of multiple CNVs in the data, we propose a sliding window process combined with the derived Bayesian posterior probability. To evaluate the performance of the method in estimating the CNV locus, we first performed simulation studies. We then applied our approach to real data from aCGH experiments, demonstrating its applicability.

1. Introduction

Cancer progression, tumor formation, and many genetic diseases are related to aberrations in some chromosomal regions. Chromosomal aberrations are often reflected in DNA copy number changes, also known as copy number variations (CNVs) [1]. To study such chromosomal aberrations, experiments are often conducted on tumor samples from a cell line using technologies such as aCGH or SNP arrays. For instance, in aCGH experiments, a DNA test sample and a diploid reference sample are first fluorescently labeled with Cy3 and Cy5. Then, the samples are mixed and hybridized to the microarray. Finally, the image intensities from the test and reference samples are obtained for all DNA probes (biomarkers) along the chromosome [2, 3]. The log-base-2 ratios of the test and reference intensities, usually denoted as

A number of computational and statistical methods have been developed for the detection of CNVs based on aCGH data and SNP data. Examples include a finite Gaussian mixture model [6], pairwise

The above-mentioned algorithms, however, do not take advantage of other information such as the positions of the DNA probes or biomarkers along the chromosome. Recently, many researchers have begun to consider variations in the distance between biomarkers, gene density, and genomic features when identifying chromosomal regions of increased or decreased gene expression [5]. Several notable methods have emerged along this line, and we list a few of them here. Levin et al. [5] developed a scan statistic for detecting spatial clusters of genes on a chromosome based on gene positions and gene expression data, modeled by a compound Poisson process built from two independent simple Poisson processes. Daruwala et al. [20] developed a statistical algorithm for the detection of genomic aberrations in human cancer cell lines, where the location of aberrations in the copy numbers was modeled by a Poisson process. They distinguished genes as "regular" and "deviated", where the regular genes are those that have not been affected by chromosomal aberrations, while the deviated genes are those whose log-transformed expression follows a Gaussian distribution with unknown mean and variance [20]. Sun et al. [21] developed a SNP association scan statistic similar to that of Levin et al. [5] using a compound Poisson process, which accounts for the complex distribution of genome variations in chromosomal regions with significant clusters of SNP associations.

The more sophisticated models above improve on earlier work by using both the log-intensity ratios and the biomarker positions in aCGH data. The computation involved in this type of modeling is usually demanding, however, and further improvement is needed. Motivated by these existing works, we propose to use a compound Poisson process approach to model the genomic features in identifying chromosomal aberrations. We use a Bayesian approach to determine an aberration (or a change point) in the aCGH profile modeled by a compound Poisson process. In our model, the occurrences of the biomarkers are modeled by a homogeneous Poisson process and the aCGH data are modeled by a Gaussian distribution. This novel method is able to identify the aberration corresponding to the CNVs together with the associated distance between biomarkers on the chromosome. The proposed method is inspired by the scan statistic [5, 21], which is widely used for identifying chromosomal aberrations. However, our method differs from the work of Levin et al. [5] in that it uses a statistical change-point model with a compound Poisson process for the identification of CNVs.

2. Methods

2.1. Modeling aCGH Data Using a Compound Poisson Change Point Model

To describe our approach, we first describe a change-point model for a compound Poisson process in terms of the normalized log ratio

represent the distance between the

where

Note the fact that the distances

Assume that the given interval with base pair length

Given that

The problem is if there is an aberration (increase or decrease) in the sequence

where

For illustration purpose, in the following Figure

Simulated compound Poisson process data with one change: The upper panel is a plot of the simulated log ratio intensities (normally distributed) against the genomic positions, and the lower panel is a plot of the interval length against the corresponding genomic positions (distributed with Poisson)

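Data of the kind shown in the figure can be generated with a short simulation. The following is a minimal sketch under stated assumptions, not the authors' code: biomarker positions come from a homogeneous Poisson process (exponential gaps) with a single rate, the log ratios are normal with a mean that shifts at one subinterval boundary, and the function name `simulate_cpp` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_cpp(rate=0.05, length=2000.0, n_bins=20, change_bin=10,
                 mu=(0.0, 1.0), sigma=0.2, seed=0):
    """Simulate marker positions, log ratios, and per-subinterval summaries.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    # Homogeneous Poisson process: gaps between markers are exponential
    # with mean 1/rate; draw more gaps than needed, then truncate.
    gaps = rng.exponential(1.0 / rate, size=int(3 * rate * length) + 50)
    pos = np.cumsum(gaps)
    pos = pos[pos < length]
    # Split [0, length) into n_bins equal subintervals and bin the markers.
    edges = np.linspace(0.0, length, n_bins + 1)
    bin_idx = np.searchsorted(edges, pos, side="right") - 1
    # Log ratios: mean mu[0] before the change subinterval, mu[1] from it on.
    means = np.where(bin_idx < change_bin, mu[0], mu[1])
    logratio = rng.normal(means, sigma)
    # Compound Poisson summaries: marker count and log-ratio sum per bin.
    counts = np.bincount(bin_idx, minlength=n_bins)
    sums = np.bincount(bin_idx, weights=logratio, minlength=n_bins)
    return pos, logratio, counts, sums

pos, logratio, counts, sums = simulate_cpp()
```

Plotting `logratio` against `pos` gives a panel like the upper one, and `counts` per subinterval is the Poisson-distributed quantity. Allowing a different marker rate on each side of the change would need a second `rate` argument.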

2.2. A Bayesian Analysis for Locating the Change Point

The change-point model in the compound Poisson process described above can be viewed as a hypothesis testing problem. It tests the null hypothesis,

versus the alternative hypothesis

The alternative hypothesis (7) above defines a change-point model. For this model, we propose a Bayesian approach for the estimate of

The following joint prior distribution is given for

and for the common variance

Under those assumptions, the likelihood function of

The joint posterior distribution of the parameters

Integrating (12) above with respect to

for

The probability

In order to compute the probability given by (15), the occurrence rates

With these MLEs, (15) becomes

Therefore, with the Poisson probabilities given by (17),

Finally, the marginal posterior distribution of the locus

where

Based on the above theoretical results, we provide the computational implementation of our approach in the next subsection.
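To make the structure of such a posterior concrete, here is a minimal sketch under assumptions that may differ from the derivation above: flat priors on the two segment means, a prior proportional to 1/σ² on the common variance, a uniform prior on the locus, and plug-in MLEs for the two Poisson rates. Under these assumptions the Gaussian factor integrates to (M1·M2)^(-1/2)·S_k^(-(n-2)/2), where M1 and M2 are the marker totals in the two segments and S_k is the weighted residual sum of squares. The function name `changepoint_posterior` is hypothetical, and this is not claimed to reproduce equation (19).

```python
import numpy as np

def changepoint_posterior(m, Y):
    """Posterior over the locus k = 1..n-1 (change after subinterval k).

    m[i]: number of biomarkers in subinterval i.
    Y[i]: sum of log ratios in subinterval i, assumed N(m[i]*mu, m[i]*sigma^2).
    """
    m = np.asarray(m, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = len(m)
    nonempty = m > 0
    n_eff = int(nonempty.sum())
    # Sum of Y_i^2 / m_i over non-empty subintervals (independent of k).
    q = np.sum(Y[nonempty] ** 2 / m[nonempty])
    log_post = np.full(n - 1, -np.inf)
    for k in range(1, n):
        M1, M2 = m[:k].sum(), m[k:].sum()
        if M1 == 0 or M2 == 0:
            continue  # a segment without markers cannot support a mean
        # Poisson part with plug-in rates M1/k and M2/(n-k); terms that
        # do not depend on k are dropped.
        log_pois = M1 * np.log(M1 / k) + M2 * np.log(M2 / (n - k))
        # Gaussian part after integrating out both means and the variance.
        S = q - Y[:k].sum() ** 2 / M1 - Y[k:].sum() ** 2 / M2
        S = max(S, 1e-300)  # guard against log(0) on noise-free data
        log_gauss = -0.5 * np.log(M1 * M2) - 0.5 * (n_eff - 2) * np.log(S)
        log_post[k - 1] = log_pois + log_gauss
    post = np.exp(log_post - log_post.max())
    return post / post.sum()
```

The estimated locus is the index of the largest entry plus one. Working on the log scale and subtracting the maximum before exponentiating avoids underflow when the posterior is sharply peaked.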

2.3. Computational Implementation of the Bayesian Approach

To implement our above Bayesian approach to real data, it is necessary to define the number,

Although our approach was derived for the single change-point model in a compound Poisson process, it can be easily extended to multiple change points (or aberrations) by using a sliding window approach [21, 22]. Sun et al. [21] used sliding window sizes of 3 to 10 consecutive markers in their application. Our numerical experiments suggest that sliding windows of 12 to 35 subintervals are effective in searching for multiple changes in aCGH data with our proposed Bayesian approach. To avoid edge problems within each window, adjacent windows have to overlap. Many such issues were also discussed in [22]. When searching for multiple change points with the sliding window approach, a practical question is how to set the threshold for the maximum posterior probabilities associated with the windows. In our application, we used a heuristic threshold of 0.5 for the maximum posterior probabilities, so that a reported locus carries more posterior mass than all other candidate loci in its window combined.

As a summary of our method, we give the following steps to implement our proposed Bayesian approach to the compound Poisson change-point model (Bayesian-CPCM).

(1) If it is known that a chromosome has potentially one aberration region, calculate the posterior probability (19) and identify the locus

(2) If there are multiple aberration regions on a chromosome or genome, choose a total of

(3) For window

(4) Count the number of biomarkers,

(5) Compute the posterior probabilities for

(6) Convert the identified change position

(7) Repeat steps 3–6 above for

The Matlab code for the Bayesian-CPCM approach is available from the authors upon request.
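The sliding-window search with overlapping windows and a 0.5 threshold can be sketched as follows. This is an illustrative Python sketch rather than the authors' Matlab implementation. The function names are hypothetical, and `gaussian_cp_posterior` is a simplified stand-in that uses only the log ratios (flat priors on the segment means, a 1/σ² prior on the variance) and ignores biomarker positions.

```python
import numpy as np

def gaussian_cp_posterior(y):
    """Stand-in posterior over k = 1..n-1 for one mean shift in y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    log_post = np.empty(n - 1)
    for k in range(1, n):
        # Residual sum of squares around the two segment means.
        S = (np.sum((y[:k] - y[:k].mean()) ** 2)
             + np.sum((y[k:] - y[k:].mean()) ** 2))
        log_post[k - 1] = (-0.5 * np.log(k * (n - k))
                           - 0.5 * (n - 2) * np.log(max(S, 1e-300)))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def sliding_window_scan(y, window=20, step=5, threshold=0.5):
    """Return {locus: max posterior} for windows whose peak exceeds threshold.

    step < window makes adjacent windows overlap. A locus is the 0-based
    index of the first observation after the detected change.
    """
    hits = {}
    for start in range(0, len(y) - window + 1, step):
        post = gaussian_cp_posterior(y[start:start + window])
        k = int(np.argmax(post)) + 1
        if post[k - 1] > threshold:
            locus = start + k
            hits[locus] = max(hits.get(locus, 0.0), float(post[k - 1]))
    return hits
```

Because the windows overlap, the same change is usually found in several windows; keying the result on the locus merges those repeated detections and keeps the largest posterior probability.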

3. Results

3.1. Simulation Results

The proposed method provides a theoretical framework for detecting CNVs using both biomarker positions and log-intensity ratios. Since there is no suitable metric for comparing the proposed approach with all existing algorithms, we carried out simulation studies based on a commonly used approach for evaluating the estimation of a change point. We simulated sequences from independent normal distributions with moderate sample size

Simulation results

| | When | | | | When | | | |
|---|---|---|---|---|---|---|---|---|
| | 3 | 2.8870 | 0.8210 | 0.4034 | 3 | 2.8960 | 0.8630 | 0.2903 |
| 12 | 6 | 5.9710 | 0.9040 | 0.3774 | 6 | 5.9510 | 0.9070 | 0.4635 |
| | 9 | 8.7930 | 0.8560 | 1.6906 | 9 | 8.9130 | 0.8940 | 0.8038 |
| | 5 | 5.0010 | 0.9800 | 0.0230 | 5 | 5.0050 | 0.9910 | 0.0150 |
| 20 | 10 | 10.0180 | 0.9800 | 0.0200 | 10 | 10.0110 | 0.9850 | 0.0150 |
| | 15 | 15.0090 | 0.9800 | 0.0310 | 15 | 15.0130 | 0.9810 | 0.0190 |
| | 8 | 8.0070 | 0.9930 | 0.0070 | 8 | 8.0040 | 0.9960 | 0.0040 |
| 32 | 16 | 16.0020 | 0.9900 | 0.0100 | 16 | 16.0000 | 0.9980 | 0.0020 |
| | 24 | 24.0020 | 0.9960 | 0.0040 | 24 | 23.9980 | 0.9980 | 0.0020 |
| | 10 | 10.0020 | 0.9980 | 0.0020 | 10 | 10.0030 | 0.9970 | 0.0000 |
| 40 | 20 | 20.0040 | 0.9960 | 0.0040 | 20 | 20.010 | 0.9990 | 0.0010 |
| | 30 | 30.0000 | 1.0000 | 0.0040 | 30 | 30.0010 | 0.9990 | 0.0010 |
| | 20 | 20.000 | 1.0000 | 0.0000 | 20 | 20.0000 | 1.0000 | 0.0000 |
| 80 | 40 | 40.0000 | 1.0000 | 0.0000 | 40 | 40.0000 | 1.0000 | 0.0000 |
| | 60 | 60.0000 | 1.0000 | 0.0000 | 60 | 60.0000 | 1.0000 | 0.0000 |
| | 30 | 30.0030 | 0.9970 | 0.0030 | 30 | 30.0000 | 1.0000 | 0.0000 |
| 120 | 60 | 60.0000 | 1.0000 | 0.0000 | 60 | 60.0000 | 1.0000 | 0.0000 |
| | 90 | 90.0000 | 1.0000 | 0.0000 | 90 | 90.0000 | 1.0000 | 0.0000 |

In this table,

The simulation results given in Table

3.2. Applications to aCGH Datasets on 9 Fibroblast Cell lines

Several aCGH experiments were performed on 15 fibroblast cell lines and the normalized averages of the

For the 9 fibroblast cell lines analyzed in many follow-up papers to [23], we also used our posterior probabilities (19) to locate the locus (or loci) on those chromosomes where alterations had been identified. Our method identified loci of DNA copy number alterations that correspond exactly to the karyotyping results [23]. The CNVs found by our proposed Bayesian approach (with sliding windows when appropriate) are summarized in the following Tables

Results of the Bayesian approach on chromosomes with one change identified

| Cell line | Chromosome | Identified locus | Maximum posterior probability |
|---|---|---|---|
| GM01535 | chromosome 5 | 176824 | .5237 |
| GM01750 | chromosome 9 | 26000 | .9666 |
| GM01750 | chromosome 14 | 11545 | .7867 |
| GM03563 | chromosome 3 | 10524 | .8808 |
| GM03563 | chromosome 9 | 2646 | 1.000 |
| GM07081 | chromosome 7 | 57971 | .6390 |
| GM13330 | chromosome 1 | 156276 | .9994 |
| GM13330 | chromosome 4 | 173943 | .9999 |

The posterior probability shown is the maximum posterior probability for the chromosome.

Results of the Bayesian approach on chromosomes with two changes identified

| Cell line | Chromosome | Identified loci | Maximum posterior probabilities | Window size |
|---|---|---|---|---|
| GM01524 | chromosome 6 | 74205, 145965 | .9501, .7411 | 17 |
| GM03134 | chromosome 8 | 99764, 146000 | .9397, .9602 | 20 |
| GM05296 | chromosome 10 | 64187, 110412 | .7229, .8955 | 30 |
| GM05296 | chromosome 11 | 34420, 43357 | .8496, .9852 | 18 |
| GM13031 | chromosome 17 | 50231, 58122 | .9434, .7701 | 20 |

The posterior probability shown is the maximum posterior probability for the chromosome at the respective loci.

According to the posterior probability (19), we found that there was one copy number change on chromosome 5 of the cell line GM01535, chromosomes 9 and 14 of the cell line GM01750, chromosomes 3 and 9 of the cell line GM03563, chromosome 7 of the cell line GM07081, and chromosomes 1 and 4 of the cell line GM13330. No false positives were found on these chromosomes with the threshold of 0.5 for the maximum posterior probability (20). These findings are consistent with the karyotyping result of Snijders et al. [23]. In Figures

Chromosome 3 of GM03563 [23] with identified change locus and the posterior probability distribution: A red circle indicates a significant DNA copy number change point such that the segment before this red circle (inclusive of the red circle) is different from the successor segment after the red circle (exclusive of the red circle)


Chromosome 7 of GM07081 [23] with identified change locus and the posterior probability distribution: A red circle indicates a significant DNA copy number change point such that the segment before this red circle (inclusive of the red circle) is different from the successor segment after the red circle (exclusive of the red circle)


Our posterior probability function of (20) combined with the sliding window approach signals two or more possible copy number changes on chromosome 6 of GM01524, chromosome 8 of GM03134, chromosomes 10 and 11 of GM05296, and chromosome 17 of GM13031. These results were given in Table

Chromosome 6 of GM01524 [23] with identified change loci (indicated by red arrows) and the posterior probability distributions with a window size of 20


Chromosome 17 of GM13031 [23] with identified change loci (indicated by red arrows, while the green arrow indicates a false positive) and the posterior probability distributions with a window size of 20


3.3. Comparison of the Performances of the Proposed Bayesian-CPCM with CBS on the Fibroblast Cell-Lines Datasets

Many computational and statistical approaches are now available in the relevant literature for analyzing aCGH data. However, many of those approaches, CBS [4] in particular, focus on modeling the log ratio intensity in aCGH data. In this paper, we model both the gene position and the log ratio intensity in aCGH data. That is, the most distinctive feature of the proposed Bayesian-CPCM approach, compared with existing methods in the literature, is its use of both the gene positions (hence gene distances) and the log ratio intensities in the model.

Although there is no suitable metric for comparing all the existing methods for CNV data analysis, we used specificity and sensitivity as comparison metrics to evaluate the performance of our proposed method against CBS, one of the most widely used methods. The comparison results are given in the following Table

Comparison of the changes found using CBS and the proposed Bayesian-CPCM on the nine fibroblast cell lines

| Cell line/chromosome | CBS | | Bayesian-CPCM approach |
|---|---|---|---|
| GM01524/6 | Yes | Yes | Yes |
| Number of false positives | 6 | 2 | 0 |
| Specificity | 72.7% | 90.9% | 100% |
| Sensitivity | 100% | 100% | 100% |
| GM01535/5 | Yes | Yes | Yes |
| GM01535/12 | No | No | No |
| Number of false positives | 2 | 0 | 0 |
| Specificity | 90.5% | 100% | 100% |
| Sensitivity | 50% | 50% | 100% |
| GM01750/9 | Yes | Yes | Yes |
| GM01750/14 | Yes | Yes | Yes |
| Number of false positives | 1 | 0 | 0 |
| Specificity | 95.2% | 100% | 100% |
| Sensitivity | 100% | 100% | 100% |
| GM03134/8 | Yes | Yes | Yes |
| Number of false positives | 3 | 1 | 3 |
| Specificity | 86.4% | 95.5% | 97.9% |
| Sensitivity | 100% | 100% | 100% |
| GM03563/3 | Yes | Yes | Yes |
| GM03563/9 | No | No | Yes |
| Number of false positives | 8 | 5 | 0 |
| Specificity | 61.9% | 76.2% | 100% |
| Sensitivity | 50% | 50% | 100% |
| GM05296/10 | Yes | Yes | Yes |
| GM05296/11 | Yes | Yes | Yes |
| Number of false positives | 3 | 0 | 2 |
| Specificity | 88% | 100% | 99.3% |
| Sensitivity | 100% | 100% | 100% |
| GM07081/7 | Yes | Yes | Yes |
| GM07081/15 | No | No | No |
| Number of false positives | 1 | 0 | 0 |
| Specificity | 95.2% | 100% | 100% |
| Sensitivity | 50% | 50% | 100% |
| GM13031/17 | Yes | Yes | Yes |
| Number of false positives | 5 | 3 | 1 |
| Specificity | 79.2% | 87.5% | 98.8% |
| Sensitivity | 100% | 100% | 100% |
| GM13330/1 | Yes | Yes | Yes |
| GM13330/4 | Yes | Yes | Yes |
| Number of false positives | 8 | 5 | 0 |
| Specificity | 61.9% | 76.2% | 100% |
| Sensitivity | 100% | 100% | 100% |
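The specificity and sensitivity figures follow the usual definitions, which the short sketch below makes explicit. The count of 22 unaltered chromosomes for GM01524 is an assumption inferred from the reported value (16/22 = 72.7% with 6 false positives); it is not stated in the text. The second example assumes one detected and one missed alteration, which matches the rows reporting 50% sensitivity.

```python
def specificity(true_neg, false_pos):
    """Fraction of unaltered units correctly reported as unaltered."""
    return true_neg / (true_neg + false_pos)

def sensitivity(true_pos, false_neg):
    """Fraction of truly altered units that were detected."""
    return true_pos / (true_pos + false_neg)

# Assumed counts for CBS on GM01524 (first CBS column of the table).
n_unaltered = 22   # assumption: inferred from the reported 72.7%
false_pos = 6
spec_pct = round(100 * specificity(n_unaltered - false_pos, false_pos), 1)
print(spec_pct)  # 72.7

# One detected and one missed alteration gives 50% sensitivity.
sens_pct = round(100 * sensitivity(1, 1), 1)
print(sens_pct)  # 50.0
```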

From Table

It is worth noting that the CNV or aberration regions found in these 9 fibroblast cell lines by our proposed Bayesian-CPCM approach are also consistent with those identified in Olshen et al. [4], Chen and Wang [19], and Venkatraman and Olshen [24]. However, our new approach, Bayesian-CPCM, involves neither the heavy computation of the CBS algorithm of Olshen et al. [4] nor the asymptotic distribution required in our earlier work [19].

4. Conclusion

A Bayesian approach for identifying CNVs in an aCGH profile modeled by a compound Poisson process is proposed in this paper. Theoretical results of the Bayesian analysis are obtained and the algorithm has been implemented in Matlab. Applications of the proposed method to several aCGH data sets have demonstrated its effectiveness, and the simulation results indicate that the method works well across the cases considered. The most distinctive feature of the proposed Bayesian-CPCM approach, compared with existing methods in the literature, is its use of both biomarker positions (hence distances) and the log-intensity ratio information in the model. Another important aspect of the proposed approach is that it gives the posterior probability that a locus is a CNV. With a basic understanding of probability, users can judge whether there is a CNV at a locus by combining the posterior probability with their biological knowledge.

Many computational and statistical approaches are now available in the literature for analyzing aCGH data. However, those approaches, especially the CBS of Olshen et al. [4] and the MVCM of Chen and Wang [19], all focus on modeling the log ratio in aCGH data. In this paper, we have used a new approach that models both the biomarker position and the log ratio intensity in aCGH data. The size of the sliding window is very important when searching for multiple change points in a whole sequence. A criterion for choosing the optimal window size remains a topic for future work.

Acknowledgments

Part of the paper was done while A. Yi