Abstract

Motivation: In current databases, many genes have inconsistent positions between their cytogenetic annotations and their sequence map positions. However, not all inconsistencies are the same. Some are genuinely problematic and should be corrected, while others result from the imprecise nature of chromosomal banding and may be tolerable. It is therefore important to stratify cytogenetic position information into confidence groups that recognize the imprecision of cytogenetic banding.

Results: When cytogenetic annotations are plotted against sequence map positions on a 2D plane, the consistent genes tend to have a compact linear distribution, while genes with inconsistent positions are more scattered. The overlapping areas between these two groups are defined as the tolerable imprecision zones by linear regression and distance analysis. The system was implemented using sequence information from NCBI Map Viewer Build 36.3 and cytogenetic annotations from NCBI Entrez Gene. Each gene's position information is classified into one of five confidence groups: inconsistent-intolerable, inconsistent-tolerable, consistent-imprecise, consistent-precise and consistent-rough. Using information from NCBI Map Viewer Build 36.3 and NCBI Entrez Gene, the percentages of these confidence groups are 1.4%, 7.0%, 54.0%, 35.4% and 2.2%, respectively. Using information from NCBI Map Viewer Build 36.3 and NCBI Online Mendelian Inheritance in Man (OMIM), the percentages are 3.7%, 16.9%, 49.0%, 19.0% and 11.4%, respectively. Combining these two results, a confidence table of genes' position information was constructed.

Availability: The detailed results are accessible over the Internet at http://centrallab.hosp.ncku.edu.tw/imz.

Contact:  [email protected]

1 INTRODUCTION

Cytogenetic annotation and the sequence map are two different systems for determining the positions of genes on human chromosomes. The cytogenetic annotations of many genes have been determined by fluorescence in situ hybridization (FISH) (Kirsch et al., 2000; Korenberg et al., 1999), a technique based on metaphase chromosomal preparations. Similarly, many disease-associated cytogenetic regions were determined by metaphase chromosomal techniques, such as G-banding, comparative genomic hybridization (Kallioniemi et al., 1992) and spectral karyotyping (Liyanage et al., 1996; Schrock et al., 1996). A large amount of cytogenetic information is available in the public domain. For example, the Mitelman database of chromosome aberrations in cancer is a web service that systematically collects disease-specific structural abnormalities (Mitelman and Heim, 1988; Mitelman et al., 2008), while NCBI Entrez Gene is a comprehensive source for genes' cytogenetic locations (Maglott et al., 2005). NCBI Online Mendelian Inheritance in Man (OMIM) (McKusick, 1998) is a catalog of human genes and genetic disorders. On the other hand, the human genome sequence provides precise positions of genes on chromosomes. The sequence mapping of genes is stored in the NCBI RefSeq database (Pruitt et al., 2005). NCBI Map Viewer is an alignment viewer designed to integrate feature identity information with whole-genome sequencing results (Wheeler et al., 2006). It is a comprehensive resource for the sequence map of genes based on RefSeq. Many disease-associated chromosomal regions were determined by sequence-based techniques, such as digital karyotyping (Wang et al., 2002) and array CGH (Davies et al., 2005; Solinas-Toldo et al., 1997). These techniques yield very precise genome mapping and are gaining popularity.

Since each system has its own value (a large amount of existing information versus positional precision), the integration of cytogenetic annotation and sequence map information would provide the most comprehensive solution for genome research (Furey and Haussler, 2003; Knutsen et al., 2005). Recently, a numerical transformation algorithm, designated the cytoband query system (CQS), was developed to perform accurate cytoband searching using cytogenetic annotations (Yen et al., 2005). In an attempt to integrate CQS with the sequence map, we found many inconsistencies between cytogenetic and sequence information. Such inconsistency was reported in 2006 (Cuticchia et al., 2006). Using information from OMIM and Ensembl for the 6830 records with HUGO gene symbols, Cuticchia et al. found that 18% of the records were inconsistent. However, not all inconsistencies are the same. Some may be truly problematic, while others may result from the imprecise nature of human interpretation of chromosomal banding. Given the same FISH image, different researchers may assign a gene to different cytogenetic bands, which are usually close to each other. It is therefore not surprising to find genes annotated to different cytogenetic bands in different publications. Such a difference does not mean that one annotation is correct and the other wrong. It may reflect human imprecision rather than error and should therefore be tolerated. The imprecise nature of human judgment needs to be considered when using cytogenetic information. In this study, a method to define the imprecision zone was developed and the results were stratified into five groups: inconsistent-intolerable, inconsistent-tolerable, consistent-precise, consistent-imprecise and consistent-rough. Such grouping provides a practical guide for genomic scientists using cytogenetic annotations.

2 SYSTEM AND METHODS

2.1 The discordant gene pairs identified by positional order analysis

A pair of genes is expected to appear in the same order in the cytogenetic annotation as on the sequence map. That is, if gene A lies before gene B on the sequence map, then gene A should also precede gene B in the cytogenetic annotation. However, many genes in current databases violate this intuitive positional parallelism. Table 1 shows four examples from the NCBI Map Viewer Homo sapiens Build 36.3. In example 1 (Ex. 1), the sequence region of SKI (2149994, 2229316) lies before that of ABCA4 (94230981, 94359293), but the cytogenetic annotation of SKI (1q22-q24) lies after that of ABCA4 (1p22.1-p21). The other three examples in Table 1 are taken from chromosomes 2, 3 and 4, respectively.

Table 1. Four examples of discordant gene pairs

Ex.  Gene id  Gene name  seq_start  seq_stop   Cytoband
1    6497     SKI        2149994    2229316    1q22-q24
1    24       ABCA4      94230981   94359293   1p22.1-p21
2    617      BCS1L      219232623  219236410  2q33
2    33       ACADL      210760959  210798392  2q34-q35
3    5067     CNTN3      74394412   74653033   3p26
3    30       ACAA1      38139211   38153619   3p23-p22
4    11275    KLHL2      166348238  166463749  4q21.2
4    2243     FGA        155723730  155731347  4q28

Examples of position discordance between cytogenetic annotation and sequence map.

Figure 1 presents the algorithm used to find all discordant gene pairs by positional order analysis. This algorithm is used throughout the system to ensure the consistency of genes' positions in the target set.

Fig. 1. Algorithm to find all discordant pairs of genes in the database.

However, not all discordances are alike. For example, in Table 1, the cytogenetic annotation of SKI appears to be incorrect: the sequence map places SKI on the short arm of chromosome 1, but the cytogenetic annotation places the gene on the long arm. In contrast, in example 2, BCS1L and ACADL are next to each other in both the sequence map and the cytogenetic annotation. Such discordance seems tolerable once the imprecise nature of cytogenetic banding is taken into account.
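The positional order analysis can be illustrated with a minimal Python sketch. It is not a reproduction of the algorithm in Figure 1; it only assumes that each gene carries a sequence interval and a cytogenetic interval expressed on a common numeric scale. Here the cytogenetic intervals are the cyto2seq coordinates listed in Table 3, and the helper names `before` and `discordant_pairs` are hypothetical.

```python
from itertools import combinations

def before(a, b):
    """True if interval a = (start, stop) ends no later than interval b begins."""
    return a[1] <= b[0]

def discordant_pairs(genes):
    """Return gene pairs whose order on the sequence map is reversed cytogenetically.

    genes maps a gene name to (seq_interval, cyto_interval); all genes are
    assumed to lie on the same chromosome.
    """
    found = []
    for (name_a, (seq_a, cyto_a)), (name_b, (seq_b, cyto_b)) in combinations(genes.items(), 2):
        reversed_ab = before(seq_a, seq_b) and before(cyto_b, cyto_a)
        reversed_ba = before(seq_b, seq_a) and before(cyto_a, cyto_b)
        if reversed_ab or reversed_ba:
            found.append((name_a, name_b))
    return found

# Examples 1 and 2 of Table 1, with cytobands given as cyto2seq coordinates (Table 3).
chr1 = {
    "SKI":   ((2149994, 2229316), (153300000, 171200000)),
    "ABCA4": ((94230981, 94359293), (92000000, 107000000)),
}
chr2 = {
    "BCS1L": ((219232623, 219236410), (197100000, 209100000)),
    "ACADL": ((210760959, 210798392), (209100000, 221300000)),
}
print(discordant_pairs(chr1))
print(discordant_pairs(chr2))
```

Both pairs are reported as discordant, which is why a separate criterion is needed to distinguish the SKI case from the much milder BCS1L/ACADL case.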

2.2 The imprecise nature of cytogenetic banding

NCBI Entrez Gene and NCBI OMIM are two major resources of cytogenetic annotations. Many of the annotations are based on literature reviews. Since different researchers may assign a gene to different cytogenetic locations, the annotators of Entrez Gene and OMIM may draw on different publications and annotate the gene to different cytobands accordingly. Indeed, there are quite a few discrepancies between Entrez Gene and OMIM. Table 2 shows some examples selected from chromosome 1. They demonstrate the imprecise nature of cytogenetic banding and annotation.

Table 2. Examples of imprecise nature of cytogenetic banding

Ex.  Gene id  Symbol    Entrez Gene  OMIM id  OMIM
5    80263    TRIM45    1p13.1       609318   1p22
6    10542    HBXIP     1p13.3       608521   1p13.2
7    2135     EXTL2     1p21         602411   1p12-p11
8    5792     PTPRF     1p34         179590   1p32
9    8718     TNFRSF25  1p36.2       603366   1p36.3

Examples of different cytogenetic annotations in NCBI Entrez Gene and NCBI OMIM for the same genes.

2.3 Mapping the cytogenetic annotation to sequence map

The method described in Section 2.1 offers an elegant way to find all positional discordance in databases without introducing any presumption other than the assumed positional parallelism between sequence map and cytogenetic annotation. However, this method cannot determine which gene of a pair causes the discordance. To analyze the level of consistency between each gene's cytogenetic annotation and its sequence position, we employ another strategy using the ideogram of NCBI Map Viewer Build 36.3 (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/mapview/ideogram.gz), which defines the positions of cytogenetic banding junctions on the sequence map. We use the junction information in the ideogram to transform cytogenetic annotations into sequence-start and sequence-stop positions on the sequence map. For example, Table 3 shows the mapping results for the genes in Table 1. In example 1-1 (Ex. 1-1), the cytogenetic annotation of SKI (1q22-q24) is transformed into the sequence segment cyto2seq (153300000, 171200000), which does not overlap with the original sequence segment of SKI, seq (2149994, 2229316), derived from RefSeq. A gene with non-overlapping cyto2seq and seq segments is regarded as having inconsistent information between cytogenetic annotation and sequence map. In contrast, for ABCA4, cyto2seq (92000000, 107000000) overlaps with seq (94230981, 94359293). Genes with overlapping cyto2seq and seq segments are regarded as consistent.
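The consistency test described here reduces to an interval overlap check. The following Python sketch shows one way to express it, assuming closed intervals; the function names `overlaps` and `is_consistent` are hypothetical, and the coordinates are those quoted in the text for SKI and ABCA4.

```python
def overlaps(a, b):
    """True if closed intervals a = (start, stop) and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def is_consistent(cyto2seq, seq):
    """A gene is consistent when its cyto2seq and seq segments overlap."""
    return overlaps(cyto2seq, seq)

# (cyto2seq segment, seq segment) for the two genes of example 1-1.
example_1_1 = {
    "SKI":   ((153300000, 171200000), (2149994, 2229316)),
    "ABCA4": ((92000000, 107000000), (94230981, 94359293)),
}
status = {gene: is_consistent(c, s) for gene, (c, s) in example_1_1.items()}
print(status)
```

Unlike the pairwise order analysis, this check is evaluated per gene, so it attributes the inconsistency to SKI rather than to the SKI/ABCA4 pair.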

Table 3. Transform the cytogenetic annotation to the sequence map

Ex.  Gene id  cyto2seq_start  cyto2seq_stop  seq_start  seq_stop
1-1  6497     153300000       171200000      2149994    2229316
1-1  24       92000000        107000000      94230981   94359293
2-1  617      197100000       209100000      219232623  219236410
2-1  33       209100000       221300000      210760959  210798392
3-1  5067     1               8700000        74394412   74653033
3-1  30       30800000        43600000       38139211   38153619
4-1  11275    79200000        87100000       166348238  166463749
4-1  2243     124000000       139500000      155723730  155731347

Examples of position discrepancies between the cytogenetic annotation and the sequence map. cyto2seq: the sequence segment transformed from cytogenetic annotation. seq: the sequence segment derived from the RefSeq database.

2.4 Using cyto2seq and seq segments in the consistent set of genes to determine the positional trend of each chromosome

Let cyto2seq_cen be the center of the cyto2seq segment of a gene and let seq_cen be the center of its seq segment. We call (cyto2seq_cen, seq_cen) the position point of the gene on the X–Y plane. The consistent and inconsistent genes show different distribution patterns of their position points. For example, Figure 2 depicts the position points of all genes in chromosome 1. The consistent genes tend to have a compact linear distribution except for some outliers (Fig. 2A), while the inconsistent genes are more loosely distributed (Fig. 2B). We use the consistent genes to define the positional trend of each chromosome for further analysis. The positional trend for consistent genes should, in theory, be linear with a slope of 1 because the cyto2seq and seq segments are based on the same sequence map. We use the least squares method to find the linear regression. The outliers in Figure 2A are removed beforehand because they may cause errors in the regression analysis. They are genes with rough cytogenetic annotations, which usually span several cytobands or even whole chromosomal arms.

Fig. 2. The position points of genes in chromosome 1, where the x-axis is cyto2seq and the y-axis is seq. (A) All position points of consistent genes in chromosome 1. (B) The inconsistent-tolerable (shown as ‘.’) and inconsistent-intolerable (shown as ‘x’) genes. (C) All position points of consistent-non-rough genes in chromosome 1. (D) All position points of consistent-non-rough (shown as ‘.’), inconsistent-tolerable (shown as ‘o’) and inconsistent-intolerable (shown as ‘x’) genes in chromosome 1. For genes in other chromosomes, please see Supplementary Material at http://centrallab.hosp.ncku.edu.tw/imz.

To remove rough genes, we use the width of cytobands to set the cutoff criterion. Each chromosome has wider and narrower cytobands. The width (length) of each cytoband can be calculated from the ideogram of NCBI Map Viewer. Suppose that chromosome C has n cytobands in the ideogram. We sort all cytobands by length and define the widest 5% as rough bands, in addition to C itself, the p-arm and the q-arm. The number of rough cytobands is therefore floor(n×0.05)+3, where floor rounds (n×0.05) down to the nearest integer. A gene is considered to have a rough cytogenetic annotation when the length of its cyto2seq segment is greater than or equal to the minimum length of the rough cytobands on the same chromosome. Otherwise, the gene is non-rough. Figure 2C shows all position points of consistent-non-rough genes (referred to as ‘non-rough’ in figures) in chromosome 1, which show less noise and an obvious linear trend.
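A small Python sketch of the roughness cutoff follows. It rests on two assumptions that the text does not spell out: the rough bands are taken to be the longest ones after sorting, and the length of the whole chromosome is taken as the sum of its two arm lengths. The function names and the toy band lengths are hypothetical.

```python
def rough_cutoff(band_lengths, p_arm_length, q_arm_length):
    """Minimum length among the rough entries of one chromosome.

    The rough entries are the whole chromosome, the p-arm, the q-arm and the
    longest floor(n * 0.05) cytobands, i.e. floor(n * 0.05) + 3 entries in total.
    """
    n = len(band_lengths)
    k = n // 20  # floor(n * 0.05), computed with integers to avoid float rounding
    longest_bands = sorted(band_lengths, reverse=True)[:k]
    whole_chromosome = p_arm_length + q_arm_length  # assumption: arms cover the chromosome
    rough = [whole_chromosome, p_arm_length, q_arm_length] + longest_bands
    assert len(rough) == k + 3
    return min(rough)

def is_rough(cyto2seq, cutoff):
    """A gene is rough when its cyto2seq segment is at least as long as the cutoff."""
    return cyto2seq[1] - cyto2seq[0] >= cutoff

# Toy chromosome with 40 bands of lengths 1..40 (arbitrary units, not real data).
toy_bands = list(range(1, 41))
cutoff = rough_cutoff(toy_bands, p_arm_length=300, q_arm_length=520)
print(cutoff, is_rough((0, 39), cutoff), is_rough((0, 38), cutoff))
```

With 40 bands, two bands are rough in addition to the three fixed entries, so the cutoff is the length of the second-longest band.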

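The least squares method shown below is used to find the linear regression of consistent-non-rough genes, where m and b are the slope and intercept of the simple linear regression line, respectively, and (x_i, y_i), i = 1, ..., N, are the position points:

$$m = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2}, \qquad b = \frac{\sum_{i=1}^{N} y_i - m\sum_{i=1}^{N} x_i}{N}$$

We can find the linear regression equation of chromosome 1 as

$$y = 1.0038x - 288\,399$$

The slope is approximately 1, which is the theoretical value. We use the same method to analyze the position points in other chromosomes and obtain similar results (Table 4).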
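The regression step can be sketched in plain Python. The closed-form least squares expressions for the slope m and intercept b of a simple linear regression are standard; the function names `position_point` and `fit_line` and the toy points are hypothetical and are not taken from the study's data.

```python
def position_point(cyto2seq, seq):
    """(cyto2seq_cen, seq_cen): the centers of the two segments of a gene."""
    return ((cyto2seq[0] + cyto2seq[1]) / 2, (seq[0] + seq[1]) / 2)

def fit_line(points):
    """Least squares slope m and intercept b of y = m*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b

# Toy points lying exactly on y = 2x + 1, used only to check the arithmetic.
m, b = fit_line([(0, 1), (1, 3), (2, 5)])
print(m, b)
print(position_point((0, 10), (2, 4)))
```

In the setting of this section, `fit_line` would be applied per chromosome to the position points of the consistent-non-rough genes.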
Table 4. The slopes and intercepts of all chromosomes

Chromosome  Slope   Intercept   Chromosome  Slope   Intercept
1           1.0038  −288 399    13          1.0023  −108 180
2           1.0021  −369 914    14          0.9994  57 936
3           1.0036  −317 729    15          1.0042  −253 520
4           1.0022  24 409      16          1.0035  −415 309
5           1.0053  −169 830    17          1.0138  −644 569
6           1.0018  −373 321    18          1.0109  −276 035
7           1.0016  −143 082    19          1.0091  −208 275
8           1.0104  −506 096    20          1.0184  −597 812
9           1.0004  −20 072     21          1.0067  −692 491
10          1.0028  −12 779     22          0.9577  1 328 787
11          1.0078  −840 896    X           0.9999  289 892
12          0.9974  176 683     Y           1.0188  145 485

Linear regression of non-rough genes.

2.5 Defining the imprecision zone of each chromosome

The linear regression is used to define the imprecision zone. The distances of position points to the linear regression are calculated. The largest distance of consistent-non-rough genes will decide the outer limit of the imprecision zone. The smallest distance of inconsistent genes will determine the inner limit. The outer limit will divide the inconsistent genes into inconsistent-tolerable (IC-T) and inconsistent-intolerable (IC-IT). The inner limit will divide the consistent-non-rough into consistent-precise (C-P) and consistent-imprecise (C-IP). The width of the imprecision zone varies along the chromosomal length; therefore the imprecision-zone analysis should be performed in segments. We use the original landmark cytobands to divide the chromosome into regions for the analysis. These cytobands were chosen to be the landmarks to subdivide the p and q arms back in 1971 by a group of 50 human cytogenetic experts during the Fourth International Congress of Human Genetics held in Paris.

We use the region defined by the landmark cytobands as a default segment to find the imprecision zones. For every gene in a region, we calculate the distance from its position point to the regression line.

The distance from a point P(x0, y0) to a line l: Ax + By + C = 0 is:

$$d(P, l) = \frac{|Ax_0 + By_0 + C|}{\sqrt{A^2 + B^2}}$$

For the regression line y = mx + b, A = m, B = −1 and C = b. Using this equation, we calculate all distances between position points and the regression line in each chromosome. In each region, we find the maximal distance (called MaxD) of genes in the consistent-non-rough group. We use MaxD to define the maximum boundary of an imprecision zone in each region. Genes in the inconsistent group whose distances are smaller than or equal to MaxD are called inconsistent-tolerable (IC-T) genes, and those whose distances are greater than MaxD are called inconsistent-intolerable (IC-IT) genes. Figure 2D shows all position points of consistent-non-rough (shown as ‘.’), inconsistent-tolerable (shown as ‘o’) and inconsistent-intolerable (shown as ‘x’) genes in chromosome 1.

Similarly, we find the minimal distance (called MinD) of genes in the inconsistent group of each region. We use MinD to define the minimum boundary of an imprecision zone in each region. Consistent-non-rough genes whose distances are greater than or equal to MinD are called consistent-imprecise (C-IP) genes, and those whose distances are smaller than MinD are called consistent-precise (C-P) genes.
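The MaxD and MinD rules can be summarized in a short Python sketch. The distance function uses the standard perpendicular distance from a point to the line y = m*x + b. The function names are hypothetical, and the handling of an empty group (MaxD defaulting to 0 and MinD to infinity) is an assumption made for the sketch, since the text does not cover that case.

```python
import math

def regression_distance(point, m, b):
    """Perpendicular distance from point (x0, y0) to the line y = m*x + b."""
    x0, y0 = point
    return abs(m * x0 - y0 + b) / math.hypot(m, 1.0)

def classify_region(consistent_non_rough, inconsistent):
    """Assign confidence groups within one region.

    Both arguments map a gene to its regression distance.
    """
    max_d = max(consistent_non_rough.values(), default=0.0)   # assumption for empty group
    min_d = min(inconsistent.values(), default=math.inf)      # assumption for empty group
    labels = {}
    for gene, d in inconsistent.items():
        labels[gene] = "IC-T" if d <= max_d else "IC-IT"
    for gene, d in consistent_non_rough.items():
        labels[gene] = "C-IP" if d >= min_d else "C-P"
    return labels

# Toy distances (arbitrary units, not real data).
labels = classify_region({"a": 1.0, "b": 5.0}, {"c": 3.0, "d": 9.0})
print(labels)
print(regression_distance((3.0, 4.0), 0.0, 0.0))
```

In this toy region MaxD is 5 and MinD is 3, so the imprecision zone spans distances from 3 to 5 and contains one gene from each side.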

Figure 3 shows the hierarchical structure of confidence groups of cytogenetic annotations. All genes in the database are classified into five groups: consistent-rough (C-R), C-P, C-IP, IC-T and IC-IT. The C-P group is the most reliable, while the IC-IT group is expected to contain many erroneous annotations. The usability of the C-IP and IC-T groups lies between that of C-P and IC-IT. Genes in the C-R group need to be annotated more precisely in order to be useful.

Fig. 3. The hierarchical structure of confidence groups.

3 IMPLEMENTATION

The system flowchart is shown in Figure 4. Briefly, at the data retrieval stage, the raw data of cytogenetic annotations and sequence map were downloaded from NCBI Entrez Gene (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz), NCBI OMIM (ftp://ftp.ncbi.nlm.nih.gov/repository/OMIM/) and NCBI Map Viewer (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/mapview/seq_gene.md.gz). At the data preprocessing stage, cytogenetic annotations and sequence map information were integrated for each gene, ensuring that every gene in the target set has both types of information. At the inconsistency finding stage, all genes with inconsistencies were separated from consistent genes.

Fig. 4. The system flowchart of finding imprecision zones of cytogenetic banding.

In the first step of regression analysis, the consistent genes were divided into C-R and non-rough groups by their cytogenetic annotations. In the second step of regression analysis, the linear regression of each chromosome was calculated. The final stage is the imprecision zone analysis, in which the consistent-non-rough group was divided into C-IP and C-P, while the inconsistent group was divided into IC-IT and IC-T. The following paragraphs describe the data preprocessing stage in more detail.

The raw data file of cytogenetic annotation, gene_info, was downloaded from NCBI Entrez Gene dated August 5, 2008. This file, containing summary information for each gene, is tab-delimited with one line per GeneID and covers many species. The first field is tax_id, the second field is GeneID, the seventh field is chromosome and the eighth field is map location. We extracted the lines whose tax_id equals ‘9606’ (H.sapiens) and saved the results as temporary_cyto_info. There were 39 898 lines of gene information in temporary_cyto_info, of which 246 had more than one cytogenetic annotation, 37 belonged to the mitochondrial genome, 797 had no cytogenetic annotation and 41 had an undefined cytogenetic annotation. After these were excluded, a total of 38 777 lines of gene information were retained and saved as cyto_info.
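The extraction of human records from gene_info can be sketched as follows. The field positions (tax_id first, GeneID second, chromosome seventh, map location eighth) come from the description above. The function name is hypothetical, the sample lines are illustrative rather than copied from the file, and the later exclusion steps (multiple annotations, mitochondrial genes, missing or undefined annotations) are deliberately left out because their exact encoding in the file is not stated here.

```python
def human_cyto_records(lines, tax_id="9606"):
    """Keep GeneID, chromosome and map location for lines of the given tax_id."""
    records = []
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != tax_id:
            continue
        records.append({
            "gene_id": fields[1],         # 2nd field
            "chromosome": fields[6],      # 7th field
            "map_location": fields[7],    # 8th field
        })
    return records

# Illustrative tab-delimited lines; only the four fields named above matter here.
sample = [
    "#header line\n",
    "9606\t6497\tSKI\tx\tx\tx\t1\t1q22-q24\n",
    "10090\t11287\tPzp\tx\tx\tx\t6\t6 F1-G3\n",
]
records = human_cyto_records(sample)
print(records)
```

Reading from an iterable of lines keeps the sketch independent of how the downloaded file is decompressed or opened.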

Another raw data file of cytogenetic annotation, ‘genemap’, was downloaded from NCBI OMIM dated August 5, 2008. Each line in this file is a list of fields separated by the ‘|’ character. The fifth field is the map location and the 10th field is the MIM number, which needs to be matched to the gene id. The definition of each field is described in the file genemap.key, which can be downloaded from NCBI OMIM. There were 11 124 records in genemap, of which 25 pairs had duplicated MIM numbers and 12 had no cytoband location. In the end, 11 062 NCBI OMIM records were retrieved.
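A comparable sketch for the genemap file is given below. It uses only what is stated above: fields separated by ‘|’, the map location in the fifth field and the MIM number in the tenth. Dropping both members of a duplicated MIM pair is inferred from the record counts (11 124 − 2×25 − 12 = 11 062). The function name and the synthetic sample lines are hypothetical.

```python
from collections import Counter

def parse_genemap(lines):
    """Return {MIM number: map location}, dropping duplicated MIM numbers
    and records without a map location."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 10:
            continue
        rows.append((fields[9].strip(), fields[4].strip()))  # (10th field, 5th field)
    counts = Counter(mim for mim, _ in rows)
    return {mim: loc for mim, loc in rows if counts[mim] == 1 and loc}

def make_line(location, mim):
    """Build a synthetic 10-field genemap-like line for the example."""
    fields = ["x"] * 10
    fields[4], fields[9] = location, mim
    return "|".join(fields) + "\n"

sample = [
    make_line("1p36.3", "603366"),
    make_line("1p32", "179590"),
    make_line("", "100001"),        # no location: dropped
    make_line("2q33", "100002"),    # duplicated MIM number: both dropped
    make_line("2q34", "100002"),
]
omim = parse_genemap(sample)
print(omim)
```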

The raw data file of the sequence map, seq_gene.md, was downloaded from NCBI Map Viewer Build 36.3. It is a tab-delimited file that gives positions on the sequence map, with the field names in the first line. We extracted the gene records whose group_label field is ‘reference’. The fields chromosome, chrStart and chrEnd give the sequence positions on the chromosome. There were 33 202 records, of which 272 pairs had the same GeneID and 37 belonged to the mitochondrial genome. In the end, 32 621 lines of sequence map information were retrieved and the resulting file was named seq_info.

A total of 32 342 genes were present in both NCBI Entrez Gene and NCBI Map Viewer, and 9718 genes were present in both NCBI OMIM and NCBI Map Viewer. These two sets of genes formed the target sets for this study.

4 RESULTS

The result of imprecision-zone analysis of the common genes of NCBI Entrez Gene and NCBI Map Viewer is shown in Table 5. The result of the common genes of NCBI OMIM and NCBI Map Viewer is shown in Table 6. In these two tables, genes in the target database are classified into five groups: IC-IT, IC-T, C-IP, C-P and C-R.

Table 5. The statistics of gene grouping in NCBI Entrez Gene and NCBI Map Viewer

Chromosome  IC-IT  IC-T  C-IP   C-P    C-R  Sum
1           51     252   2074   615    105  3097
2           51     195   1039   770    18   2073
3           10     115   1127   351    27   1630
4           13     96    681    499    7    1296
5           23     93    643    656    7    1422
6           11     169   628    851    37   1696
7           24     73    605    967    26   1695
8           14     62    667    387    25   1155
9           16     99    1001   259    16   1391
10          41     64    771    306    71   1253
11          19     172   828    944    17   1980
12          16     105   970    365    29   1485
13          15     28    210    335    21   609
14          24     62    880    407    24   1397
15          33     66    623    372    55   1149
16          6      64    865    292    9    1236
17          24     106   843    613    82   1668
18          2      26    53     383    7    471
19          15     159   788    870    12   1844
20          5      50    457    294    15   821
21          15     45    210    94     16   380
22          10     41    533    144    31   759
X           10     97    748    621    39   1515
Y           0      35    230    55     0    320
Total       448    2274  17474  11450  696  32342

The grouping results are shown by each chromosome in NCBI Entrez Gene and NCBI Map Viewer. The field ‘Sum’ refers to the summation of these five groups.

Table 6. The statistics of gene grouping in NCBI OMIM and NCBI Map Viewer

Chromosome  IC-IT  IC-T  C-IP  C-P   C-R   Sum
1           46     182   395   225   126   974
2           30     108   225   188   64    615
3           16     83    285   68    76    528
4           15     69    221   32    31    368
5           11     89    273   71    33    477
6           14     117   211   172   50    564
7           21     53    134   183   40    431
8           13     45    187   35    52    332
9           15     69    150   90    34    358
10          33     60    132   47    74    346
11          11     117   414   29    45    616
12          14     80    306   68    63    531
13          9      30    70    36    21    166
14          17     51    135   45    51    299
15          22     47    108   42    58    277
16          4      57    193   100   28    382
17          21     92    396   21    59    589
18          0      21    48    43    14    126
19          18     122   422   17    63    642
20          7      20    122   74    21    244
21          4      21    44    42    7     118
22          4      28    81    88    48    249
X           13     82    212   107   41    455
Y           1      0     0     21    9     31
Total       359    1643  4764  1844  1108  9718

The grouping results are shown by each chromosome in NCBI OMIM and NCBI Map Viewer. The field ‘Sum’ refers to the summation of these five groups.

In Table 5, 448 (1.4%) genes are in the IC-IT group, 2274 (7.0%) in the IC-T group, 17 474 (54.0%) in the C-IP group, 11 450 (35.4%) in the C-P group and 696 (2.2%) in the C-R group. In Table 6, the corresponding values of the five groups are 359 (3.7%), 1643 (16.9%), 4764 (49.0%), 1844 (19.0%) and 1108 (11.4%), respectively.
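The percentages can be recomputed from the ‘Total’ rows of Tables 5 and 6. The short Python sketch below does so; the function name `percentages` is hypothetical, and the counts are copied from the two ‘Total’ rows.

```python
def percentages(counts):
    """Percentage of each group, rounded to one decimal place."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}

# 'Total' rows of Table 5 (Entrez Gene) and Table 6 (OMIM).
table5_total = {"IC-IT": 448, "IC-T": 2274, "C-IP": 17474, "C-P": 11450, "C-R": 696}
table6_total = {"IC-IT": 359, "IC-T": 1643, "C-IP": 4764, "C-P": 1844, "C-R": 1108}

print(sum(table5_total.values()), percentages(table5_total))
print(sum(table6_total.values()), percentages(table6_total))
```

The group totals sum to 32 342 and 9718, the sizes of the two target sets, and the resulting percentages agree with those given in the abstract.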

We further integrate the results of Tables 5 and 6 to construct a confidence table of genes' position information (Table 7). Table 8 shows five genes extracted from chromosome 1 to exemplify the five categories in Tables 5–7. Genes in the IC-IT group tend to have a longer distance between the cyto2seq segment and the seq segment, and their distances to the regression line (regression distances) are also long. For example, gene ids 159 and 1031 in Table 8 are both inconsistent genes, but the regression distance of gene id 159 (IC-IT) is much longer than that of gene id 1031 (IC-T). Genes in the C-P group have more accurate cytogenetic information (a shorter cyto2seq segment) than those in the C-IP group, and the C-IP genes have longer regression distances. For example, gene ids 864 and 93183 are both consistent genes, but the regression distance of gene id 864 (C-IP) is much longer than that of gene id 93183 (C-P). Genes in the C-R group have a large cyto2seq segment, such as gene id 6723 in Table 8. The distance information for all genes can be reviewed at http://centrallab.hosp.ncku.edu.tw/imz, where every number in Tables 5–7 can be clicked to bring up a gene list with detailed distance information.
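Combining the two classifications into a confidence table amounts to a cross-tabulation. The Python sketch below shows one possible way to build it, assuming each classification is available as a mapping from gene id to group label; the function name and the toy gene ids are hypothetical, and ‘None’ marks a gene that is absent from one of the two sources, as in Table 7.

```python
from collections import Counter

def confidence_table(entrez_groups, omim_groups):
    """Count genes for every (Entrez Gene group, OMIM group) combination."""
    genes = set(entrez_groups) | set(omim_groups)
    return Counter(
        (entrez_groups.get(gene, "None"), omim_groups.get(gene, "None"))
        for gene in genes
    )

# Toy input (not real data): gene id -> confidence group.
entrez_groups = {"g1": "C-P", "g2": "IC-T", "g3": "C-P"}
omim_groups = {"g1": "C-IP", "g3": "C-IP", "g4": "C-R"}
table = confidence_table(entrez_groups, omim_groups)
print(table)
```

Each key of the resulting counter corresponds to one cell of Table 7, with the Entrez Gene group as the row and the OMIM group as the column.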

Table 7.

The integration of gene grouping of NCBI Entrez Gene versus NCBI Map Viewer and NCBI OMIM versus NCBI Map Viewer.

           O_IC-IT   O_IC-T   O_C-IP   O_C-P   O_C-R   O_None
G_IC-IT        189       33       26       9      26      165
G_IC-T          21     1111      191      51      76      824
G_C-IP          79      298     3614     657     475    12351
G_C-P           52      162      828    1080     323     9005
G_C-R           10       13       27      14     188      444
G_None           8       26       78      33      20

The prefix ‘G’ refers to NCBI Entrez Gene versus NCBI Map Viewer and the prefix ‘O’ refers to NCBI OMIM versus NCBI Map Viewer. The postfix ‘None’ indicates that there is no such data.

Table 8.

Five examples of the genes and their chromosomal locations in the five groups in Table 7

Group   Gene id   Cytoband   Cyto2Seq segment         Seq segment              Regression distance
IC-IT   159       1cen-q12   (124300000, 142400000)   (242638419, 242682036)   76 977 923
IC-T    1031      1p32       (51300000, 60900000)     (51206955, 51212897)     3 413 425
C-IP    864       1p36       (1, 27800000)            (25098589, 25164088)     8 077 673
C-P     93183     1q23.2     (157300000, 158800000)   (158264086, 158268407)   82 985
C-R     6723      1p36-p22   (1, 94500000)            (11037236, 11042678)     25 494 214

Examples to show the characteristics of the five groups.

We further grouped the inconsistent genes according to the banding level at which the inconsistency occurs, i.e. whether the cytogenetic and sequence positions are on different chromosomal arms, regions, bands or sub-bands. The results are summarized in Table 9. The intolerable group tends to be inconsistent at the arm, region or band level. Table 10 shows four examples extracted from Table 9. Gene ids 7049 and 5784 are two genes whose cytogenetic and sequence positions differ at the level of chromosomal regions (1p3 versus 1p2, and 1q3 versus 1q4, respectively), but the regression distance of gene id 7049 is much longer than that of gene id 5784. The regression distance of gene id 7049 is longer than MaxD (defined in Section 2.5), so it is classified as IC-IT. The regression distance of gene id 5784 is smaller than MaxD, so it is classified as IC-T. Gene ids 5567 and 553115 are two further examples, whose cytogenetic and sequence positions are on different bands. The complete result is in the Supplementary Material. Table 9, together with the (regression-distance)/MaxD ratios, provides a practical guide for using inconsistent cytogenetic information. Genes inconsistent at the arm level, i.e. whose cytogenetic and sequence positions are on different chromosomal arms, are apparently problematic. Genes inconsistent at the region or band level with large (regression-distance)/MaxD ratios are likely to be problematic. In this study, we chose a ratio of 1 as the cutoff threshold: inconsistent genes with a ratio ≤1 are considered tolerable. Since we provide the ratio values of all inconsistent genes, one may choose a different threshold ratio for one's particular purposes.
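The thresholding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name classify_inconsistent and its parameters are hypothetical, and the MaxD value and regression distances are the ones listed for the four chromosome 1 genes in Table 10.

```python
# MaxD value listed in Table 10 (its definition is given in Section 2.5).
MAX_D = 9_056_541

# Regression distances of the four example genes in Table 10, keyed by gene id.
EXAMPLES = {
    7049: 27_090_851,    # TGFBR3
    5784: 3_368_679,     # PTPN14
    5567: 44_201_966,    # PRKACB
    553115: 5_971_747,   # PEF1
}


def classify_inconsistent(regression_distance, max_d=MAX_D, threshold=1.0):
    """Return (ratio, label) for a gene already known to be inconsistent.

    A ratio at or below the threshold is treated as tolerable (IC-T);
    anything above it is treated as intolerable (IC-IT).
    """
    ratio = regression_distance / max_d
    label = "IC-T" if ratio <= threshold else "IC-IT"
    return round(ratio, 2), label


if __name__ == "__main__":
    for gene_id, distance in EXAMPLES.items():
        print(gene_id, *classify_inconsistent(distance))
```

Passing a different value for threshold corresponds to choosing a stricter or looser cutoff ratio.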

Table 9.

Further classification of the inconsistent genes according to the banding levels where inconsistency occurs

           Arm   Region   Band   1st sub-band   2nd sub-band   3rd sub-band
G_IC-IT     85       83    188             83              9              0
G_IC-T       0       21    404           1080            648            121
O_IC-IT     61       80    151             60              7              0
O_IC-T       0       23    324            789            462             45

The prefix ‘G’ refers to NCBI Entrez Gene versus NCBI Map Viewer and the prefix ‘O’ refers to NCBI OMIM versus NCBI Map Viewer.

Table 10.

Four examples of the genes and their chromosomal locations in Table 9

Group   Gene id   Symbol   Cytoband   seq2cyto        Regression distance   MaxD        Ratio
IC-IT   7049      TGFBR3   1p33-p32   1p22.2-1p22.1   27 090 851            9 056 541   2.99
IC-T    5784      PTPN14   1q32.2     1q41            3 368 679             9 056 541   0.37
IC-IT   5567      PRKACB   1p36.1     1p31.1          44 201 966            9 056 541   4.88
IC-T    553115    PEF1     1p34       1p35.2          5 971 747             9 056 541   0.66

In this table the field ‘Ratio’ refers to ‘Regression distance’/‘MaxD’.
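The banding level at which two positions first differ, as tabulated in Table 9, can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the article: the helper names are hypothetical, only single-band annotations such as 1q32.2 are handled (ranges such as 1p33-p32 are not), each digit after the dot is treated as one further sub-band level, and the comparison stops at the resolution of the coarser annotation.

```python
import re

# chromosome, arm, region digit, band digit, optional sub-band digits
BAND_RE = re.compile(r"^(\d+|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")

LEVELS = ["Arm", "Region", "Band", "1st sub-band", "2nd sub-band", "3rd sub-band"]


def parse_band(band):
    """Split a single cytoband such as '1q32.2' into chromosome and level values."""
    match = BAND_RE.match(band)
    if match is None:
        raise ValueError(f"unsupported cytoband: {band!r}")
    chrom, arm, region, band_digit, sub = match.groups()
    return chrom, [arm, region, band_digit] + list(sub or "")


def first_differing_level(cytoband, seq2cyto):
    """Name the coarsest level at which two cytobands differ, or None if they agree."""
    chrom_a, parts_a = parse_band(cytoband)
    chrom_b, parts_b = parse_band(seq2cyto)
    if chrom_a != chrom_b:
        return "Chromosome"
    # zip stops at the shorter annotation, i.e. at the coarser resolution.
    for level, a, b in zip(LEVELS, parts_a, parts_b):
        if a != b:
            return level
    return None


if __name__ == "__main__":
    # Single-band examples from Table 10.
    print(first_differing_level("1q32.2", "1q41"))    # PTPN14
    print(first_differing_level("1p36.1", "1p31.1"))  # PRKACB
    print(first_differing_level("1p34", "1p35.2"))    # PEF1
```

For the three single-band examples in Table 10, this sketch reports the region level for gene id 5784 and the band level for gene ids 5567 and 553115, in line with the description in the text.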

5 DISCUSSIONS AND CONCLUSIONS

Considering the imprecise nature of cytogenetic banding, we classified the genes into five groups, which may help direct future work. Genes in the IC-IT group are most likely erroneously annotated. Their cytogenetic information should be reviewed and corrected if necessary. The C-R group needs to be annotated more precisely in order to be useful; experiments such as FISH may be performed. The C-P group is the most reliable. The C-IP and IC-T groups may be taken into consideration if one wants to include all possible genes associated with chromosomal regions of interest, especially when the regions are determined by metaphase chromosomal techniques such as G-banding, comparative genomic hybridization and spectral karyotyping. These techniques share the same imprecise nature of cytogenetic banding. We determined the imprecision zones based on two major cytogenetic databases. Since there are many mismatched cytogenetic annotations between NCBI Entrez Gene and NCBI OMIM, the integration of the two imprecision-zone analyses offers the opportunity for reciprocal corrections. For example, nine genes are IC-IT in Entrez Gene but C-P in OMIM. Their cytogenetic annotations in Entrez Gene could be corrected by using the OMIM information. Indeed, the gene SKI was annotated to 1q22-q24 by Entrez Gene, probably according to an older report (Chaganti et al., 1986). A more recent report (Colmenares et al., 2002) has remapped SKI to 1p36, which is likely the source of the OMIM annotation. SKI is also a good example of how regular database updates could contribute to the correction of inconsistencies. Only 189 genes are IC-IT in both databases. Their cytogenetic annotations need to be reviewed and perhaps corrected.

Cuticchia et al. (2006) described the phenomenon of inconsistencies between cytogenetic annotations and sequence positions but did not provide a systematic grouping method for the genes. In this study, we analyzed the inconsistent genes while taking tolerable human imprecision into account. This effort yields a more practical grouping of genes for genomic scientists who use genes' cytogenetic information.

The sizes of the cytogenetic annotations may play a role in our analysis. Larger cytogenetic annotations (bigger cyto2seq segments) are more likely to have overlapping cyto2seq and seq segments and hence to be included in the consistent group. Genes' seq segments are much smaller than their cyto2seq segments and are almost always confined within the smallest cytogenetic band. Therefore, the size effects of seq segments are overshadowed by those of cyto2seq segments.
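The size effect can be seen directly in the five chromosome 1 examples of Table 8. The Python sketch below is a hedged illustration: it assumes that two segments count as overlapping when their closed intervals share at least one base, and the names TABLE8, segments_overlap and segment_length are hypothetical.

```python
# (cyto2seq segment, seq segment) for the five example genes in Table 8, keyed by gene id.
TABLE8 = {
    159: ((124_300_000, 142_400_000), (242_638_419, 242_682_036)),    # IC-IT
    1031: ((51_300_000, 60_900_000), (51_206_955, 51_212_897)),       # IC-T
    864: ((1, 27_800_000), (25_098_589, 25_164_088)),                 # C-IP
    93183: ((157_300_000, 158_800_000), (158_264_086, 158_268_407)),  # C-P
    6723: ((1, 94_500_000), (11_037_236, 11_042_678)),                # C-R
}


def segments_overlap(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def segment_length(segment):
    """Number of positions covered by a closed interval."""
    return segment[1] - segment[0] + 1


if __name__ == "__main__":
    for gene_id, (cyto2seq, seq) in TABLE8.items():
        print(
            gene_id,
            "overlap" if segments_overlap(cyto2seq, seq) else "no overlap",
            segment_length(cyto2seq),
            segment_length(seq),
        )
```

In these examples the three consistent genes overlap and the two inconsistent genes do not, and every seq segment is orders of magnitude shorter than its cyto2seq segment.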

There are potentially five explanations for each observed inconsistency. The first is that only the cytogenetic annotation is wrong, the second is that only the sequence information is wrong and the third is that both the cytogenetic annotation and the sequence information are wrong. The sequence map is generally accepted as the ‘gold standard’ of gene positions, and the method described in this study should be useful in identifying incorrect cytogenetic annotations for data cleaning. However, we cannot exclude the possibility that some small regions of the sequence map are wrong but unnoticed. The fourth explanation is that the inconsistency is introduced by the ideogram. In this study, we assumed that the cytoband junctions defined in the ideogram of NCBI Map Viewer are correct. This assumption may or may not be true. In fact, there is no method to determine whether the ideogram is correct, since the cytobanding technology is in itself not very precise. This is why imprecision-zone analysis is critical when utilizing cytogenetic annotations. The last but most intriguing explanation is that both the cytogenetic and the sequence positions are correct, but the general belief about positional parallelism is wrong. This explanation would require some unconventional mechanism, such as periodic looping back during the chromosomal condensation process that produces the metaphase chromosome.

One may suggest using the sequence map to assign cytogenetic positions to all genes. By doing so, all inconsistencies in the database would be eliminated. There are several problems with this approach. First, it is based on the assumption that the sequence map is absolutely correct, which may or may not be true. Second, a sequence-determined cytogenetic position is not based on metaphase chromosomes and may not correlate with existing information derived from techniques using metaphase chromosomal preparations. Third, it is not possible to map the junctions of cytobands on the metaphase chromosome to the sequence map without some arbitrariness. This would affect the assignment of cytogenetic positions for genes near the cytoband junctions. Instead of forcing everything to conform to the sequence map, one should take the imprecise nature of cytogenetic banding into consideration when using cytogenetic banding information.

ACKNOWLEDGEMENTS

The authors thank Mr Isaac Ho for his editorial help on this article.

Funding: Department of Health, Taiwan (DOH-TD-B-111-004 to C.-Y Chou); National Science Council, Taiwan (NSC94-2320-B-006-002 to C.-L.H.).

Conflict of Interest: none declared.

References

Chaganti RS, et al. The cellular homologue of the transforming gene of SKV avian retrovirus maps to human chromosome region 1q22-q24, Cytogenet. Cell Genet., 1986, vol. 43 (pg. 181-186)
Colmenares C, et al. Loss of the SKI proto-oncogene in individuals affected with 1p36 deletion syndrome is predicted by strain-dependent defects in Ski, Nat. Genet., 2002, vol. 30 (pg. 106-109)
Cuticchia AJ, et al. Inconsistencies between human genetic cytolocations and those derived using genomic sequence, Cytogenet. Genome Res., 2006, vol. 112 (pg. 1-5)
Davies JJ, et al. Array CGH technologies and their applications to cancer genomes, Chromosome Res., 2005, vol. 13 (pg. 237-248)
Furey TS, Haussler D. Integration of the cytogenetic map with the draft human genome sequence, Hum. Mol. Genet., 2003, vol. 12 (pg. 1037-1044)
Kallioniemi A, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors, Science, 1992, vol. 258 (pg. 818-821)
Kirsch IR, et al. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome, Nat. Genet., 2000, vol. 24 (pg. 339-340)
Knutsen T, et al. The interactive online SKY/M-FISH & CGH database and the entrez cancer chromosomes search database: linkage of chromosomal aberrations with the genome sequence, Genes Chromosomes Cancer, 2005, vol. 44 (pg. 52-64)
Korenberg JR, et al. Human genome anatomy: BACs integrating the genetic and cytogenetic maps for bridging genome and biomedicine, Genome Res., 1999, vol. 9 (pg. 994-1001)
Liyanage M, et al. Multicolour spectral karyotyping of mouse chromosomes, Nat. Genet., 1996, vol. 14 (pg. 312-315)
Maglott D, et al. Entrez Gene: gene-centered information at NCBI, Nucleic Acids Res., 2005, vol. 33 (pg. D54-D58)
McKusick VA. Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders, 1998, 12th edn. Baltimore: Johns Hopkins University Press
Mitelman F, Heim S. Consistent involvement of only 71 of the 329 chromosomal bands of the human genome in primary neoplasia-associated rearrangements, Cancer Res., 1988, vol. 48 (pg. 7115-7119)
Mitelman F, et al. Mitelman database of chromosome aberrations in cancer, 2008
Pruitt KD, et al. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts, and proteins, Nucleic Acids Res., 2005, vol. 33 (pg. D501-D504)
Schrock E, et al. Multicolor spectral karyotyping of human chromosomes, Science, 1996, vol. 273 (pg. 494-497)
Solinas-Toldo S, et al. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances, Genes Chromosomes Cancer, 1997, vol. 20 (pg. 399-407)
Wang T-L, et al. Digital karyotyping, Proc. Natl Acad. Sci. USA, 2002, vol. 99 (pg. 16156-16161)
Wheeler DL, et al. Database resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2006, vol. 34 (pg. D173-D180)
Yen K-H, et al. A precise and scalable method for querying genes in chromosomal banding regions based on cytogenetic annotations, Bioinformatics, 2005, vol. 21 (pg. 3469-3474)

Author notes

Associate Editor: John Quackenbush