Downloadable Structure Files of NCI Open Database Compounds
Release 3 Files - September 2003
September 2003 SD File of Combined DTP Releases, 2D/3D, with Canonical
Properties Added
The most complete collection of Open NCI Database compounds as of September
2003 that we are aware of. These are 260,071 structures, combined
from DTP releases from Oct. 1999,
Aug. 2000, Feb. 2003, and Sep. 2003. All the identifier-type information
that we were able to associate with the structures is included in this
file:
- NSC numbers
- DTP names for ~53,000 records (including some WLN strings)
- Unique SMILES, calculated by CACTVS according to Daylight's original (1989) canonicalization rules
- the new IUPAC/NIST ICHI chemical identifier, calculated with [beta] version 0.932 of NIST's program (see Note 3.)
- IUPAC names, calculated with ACD/Labs' program ACD/Name Batch
- eight different CACTVS hash codes, including a tautomer-invariant but stereochemistry-, multifragment-, charge- and isotope-sensitive hash code that is essentially a unique, calculable identifier for any (small-molecule) chemical.
Additional properties, some of them helpful to categorize structures when
dealing with several databases simultaneously, are explained in the
Technical Notes.
The 2003 DTP releases now have many structures with at least some, if not
full, stereochemistry specification. This allowed 3D coordinates of reliable
stereoisomers to be calculated in many cases. Where such 3D structures would
have potentially shown the wrong chemical, or would otherwise have been
doubtful, 2D coordinates were kept. See the
Technical Notes for more details.
Also be aware of the fact that for a very large number of entries (on the
order of 100,000), the structure shown in the 2003 DTP releases is slightly
different from that shown in previous releases. In the vast majority
of those cases, the structure is now represented as a different tautomer.
Notes as of April 2007:
- This version of the NCI Open Database, which adds ~10,000 new structures, is
not (yet) included in our Enhanced
NCI Database Browser web service. An update of that service is
underway.
- The IUPAC/NIST chemical identifier has been renamed several times: from
ICHI to INChI and now to InChI.
260,071 structures in SDF format, 2D or 3D (see
Technical Notes for more explanations). Beta version/update
no. 2, 25-Nov-03. WARNING: This is a 214 MB gzip'ed file that
uncompresses to about 1.6 GB! Use
the "Save Link As..." (Netscape/Firefox) or "Save Target As..." (IE) option of
your web browser to download the file.
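Because the uncompressed file is so large, it can be convenient to read it straight from the gzip'ed download instead of unpacking it first. The following Python sketch counts the records in such a file by streaming through it. It relies only on the SD file convention that every record ends with a line reading `$$$$`; the function name `count_sdf_records` and the file name in the usage note are hypothetical, chosen for illustration.

```python
import gzip


def count_sdf_records(path):
    """Count the records in a gzip-compressed SD file without unpacking it to disk."""
    count = 0
    # "rt" yields decoded text lines; latin-1 maps every byte, so decoding never fails.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            # Each SD record is terminated by a line consisting of "$$$$".
            if line.rstrip() == "$$$$":
                count += 1
    return count
```

A call such as `count_sdf_records("downloaded_file.sdf.gz")` should then return the number of structures stated for that file, which is a quick way to check that a download is complete.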
Release 2 Files - August 2000
August 2000 2D File
The "raw" structure data that were used to build Release 2 of the Enhanced NCI Database Browser. These are 250,251 2D structures calculated with CACTVS.
Attention:
Stereochemistry assigned by CACTVS according to default rules due to lack
of stereochemical information in the original NCI data. The SMILES string
and the CAS RN (where available) are also included for each structure.
250,251
2D structures in SDF format. WARNING: This is a 90 MB file that
uncompresses to about 982 MB! Use the "Save Link As..." (Netscape)
or "Save Target As..." (IE) option of your web browser to download the
file. It has the name NCI_aug00_2D.sdz. To uncompress, rename the file
to something like "NCI_aug00_2D.sdf.gz" and gunzip it.
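On systems without gunzip, the same rename-and-uncompress step can be done with Python's standard library. This is a minimal sketch, not an official tool: the function name `unpack_sdz` is hypothetical, and the default output name simply swaps the `.sdz` extension for `.sdf`.

```python
import gzip
import shutil
from pathlib import Path


def unpack_sdz(src, dest=None):
    """Decompress a gzip'ed .sdz download into a plain file and return its path."""
    src = Path(src)
    # Default: same name with the extension replaced, e.g. NCI_aug00_2D.sdf
    dest = Path(dest) if dest else src.with_suffix(".sdf")
    with gzip.open(src, "rb") as packed, open(dest, "wb") as plain:
        # Copy in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(packed, plain)
    return dest
```

For example, `unpack_sdz("NCI_aug00_2D.sdz")` would write `NCI_aug00_2D.sdf` next to the download. Pass `dest` explicitly for files that are not SD files.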
August 2000 SMILES Strings
A SMILES version of the 250,251 August 2000 structures. These are Unique
SMILES (USMILES) strings, calculated according to Daylight's original (1989)
canonicalization rules. (These rules have been changed in the meantime,
but are not published.)
250,251 structures in USMILES format. Caution: This is a 2.9 MB file that
uncompresses to about 13 MB. Use the "Save Link As..."
(Netscape) or "Save Target As..." (IE) option of your web browser to download
the file. It has the name NCI_aug00_SMI.sdz. To uncompress, rename the
file to something like "NCI_aug00_SMI.gz" and gunzip it.
New Structures Only
These are 1,170 structures that were not in the previous (October
1999) release. This file may be most interesting for those who have already
downloaded the previous structure file(s) and only need the difference
set. It contains 3D coordinates calculated by the program CORINA.
Please note the same warning regarding stereochemistry as for the large
3D file (see below).
1,170
new 3D structures in SDF format. Note: This is a 0.8 MB file that
uncompresses to about 5.7 MB. Use the "Save Link As..." (Netscape)
or "Save Target As..." (IE) option of your web browser to download the
file. It has the name NCI_new_oct99-aug00_3D.sdz. To uncompress, rename
the file to something like "NCI_new_oct99-aug00_3D.sdf.gz" and gunzip it.
New in August 2006: A 3D version of the 0D file with some properties
added. Their values are the
same as those shown in the Enhanced NCI Database Browser. This file contains 250,250
structures as of August 2000 (one is missing for technical reasons). 3D
coordinates have been calculated by Corina 3.0 and are available for 248,574 structures.
The following properties are included:
- NSC Number
- Molecular weight
- Name (ACD)
- Formula
- CAS Registry Number
- SMILES string
- KOW logP
- Experimental logP
- ACD logP
- Drug Likeness (std)
- Drug Likeness (neg)
Release 1 Files - October 1999
"0D"
The "raw" structure data that were used to build the previous version of the Enhanced NCI Database Browser, plus
about 2,900 new structures. These are 249,081 "0D" structures (i.e. all
coordinates set to 0.0) as of October 1999 in SDF format, in one file compressed
with the widely available program gzip.
249,081 0D
structures in SDF format. Caution: This is a 16.5 MB file that uncompresses
to about 380 MB! Use the "Save Link As..." (Netscape) or "Save
Target As..." (IE) option of your web browser to download the file.
SMILES
A SMILES version of the structures (i.e. the above "0D" dataset) that
were used to build this service, plus about 2,900 new structures. These
are 249,081 structures as of October 1999 in
SMILES
format, in one file compressed with the widely available program gzip.
SMILES strings were generated with the help of CACTVS.
(This is a newly generated dataset and therefore not guaranteed to contain
SMILES strings identical, for each compound, with those in previous SMILES
string files, such as downloadable data from DTP.)
249,081 structures
in SMILES format. Caution: This is a 3.2 MB file that uncompresses
to about 18.5 MB. Use the "Save Link As..." (Netscape)
or "Save Target As..." (IE) option of your web browser to download the
file.
2D
2D version of NCI Open Database compounds as of October 1999. 2D
coordinates (essentially structure drawings) calculated with
CACTVS.
Attention:
Stereochemistry assigned by CACTVS according to default rules due to lack
of stereochemical information in the original NCI data. (See also the 3D
section.)
249,081 2D
structures in SDF format. WARNING: This is a 40 MB file that uncompresses
to about 527 MB! Use the "Save Link As..." (Netscape) or "Save Target
As..." (IE) option of your web browser to download the file.
2D + Biological Data
2D versions of NCI Open Database compounds as of October 1999, with
biological test data added. These data are publicly available from
the DTP Human
Tumor Cell Line Screen and/or the DTP
AIDS Antiviral Screen. 2D coordinates (essentially structure
drawings) calculated with CACTVS.
Attention:
Stereochemistry assigned by CACTVS according to default rules due to lack
of stereochemical information in the original NCI data. (See also the 3D
section.)
3D
A 3D version of the 0D file, containing 249,071 structures as of October
1999. The program
CORINA
v. 1.7 was used to generate the 3D coordinates. Please note that, just
as with the 3D results provided by the Enhanced NCI Database Browser, stereochemistry
of chiral compounds is not guaranteed to be correct due to the lack of
stereochemical information in the original data. This is not a shortcoming
of CORINA. Please also note that, as of now, the 3D structures in this
bulk file were not generated with the same version of CORINA as is used
in the Browser, the latter being somewhat newer. This file is the result
of a one-time conversion; no efforts have been undertaken to compare the
conformations in it with those you obtain from the Browser (although we
don't necessarily expect huge differences).
249,071 3D structures in SDF format. WARNING: This is a 127 MB file that uncompresses
to about 574 MB! Use the "Save Link As..." (Netscape) or "Save Target
As..." (IE) option of your web browser to download the file.
Notes:
All these files are based on the publicly and freely available data
from NCI's Developmental Therapeutics
Program (DTP). We collected the structures and biological data from
DTP, combined them where applicable, and generated SMILES and MDL SD files
from this information.
These files were compressed with the program gzip. This program is available
for many platforms, and comes preloaded on most of the recent versions
of many major varieties of Unix. In order to prevent possible problems
with web browsers trying to uncompress "on the fly", and display on your
screen (!), a file with the extension ".gz", the names of the downloadable
files were changed to NCInDA99.sdz (n = 0, 2, 3 for the 0D, 2D, 3D file,
respectively; "A99" stands for October 1999 [with hexadecimal notation
for the month]), CAN2DA99.sdz, AID2DA99.sdz etc.
You may have to rename them to NCInDA99.sdf.gz etc. before gunzip'ing
them. If you (have to) rename them to NCInDA99.gz, gunzip will uncompress
them to a file named NCInDA99, unless you use the gunzip option "-N", which
will restore the name NCInDA99.sdf. (These file names were chosen to conform
to the 8.3 file name convention for those users that may download, e.g.,
to DOS-type FAT 16 file systems. This practice may be discontinued in future.)
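The 8.3 naming scheme described above can be decoded mechanically. The Python sketch below is for illustration only: the function name `decode_name`, the regular expression, and the assumption that two-digit years belong to the 1900s (true for the 1997 and 1999 files mentioned on this page) are not part of the original naming convention.

```python
import re

# Three-letter prefix, dimensionality digit, "D", one hexadecimal month digit,
# two-digit year, and an optional extension such as ".sdz".
_NAME = re.compile(r"^([A-Z]{3})(\d)D([1-9A-C])(\d{2})(?:\..*)?$")


def decode_name(filename):
    """Split a name such as NCI0DA99.sdz into its components."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not an 8.3-style release file name: {filename!r}")
    prefix, dimension, month_hex, year = match.groups()
    return {
        "prefix": prefix,
        "dimension": int(dimension),   # 0, 2 or 3 for the 0D, 2D, 3D files
        "month": int(month_hex, 16),   # hexadecimal month: A = 10 = October
        "year": 1900 + int(year),      # assumption: all years are 19xx
    }


print(decode_name("NCI0DA99.sdz"))
# {'prefix': 'NCI', 'dimension': 0, 'month': 10, 'year': 1999}
```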
All files (after decompression) are in MDL's SDFile format with two
identification fields:
- NSC - the NCI's internal identification number of the database entry
- CAS_RN - the CAS Registry Number. Present with a value other than 999-99-9
(dummy value) only for those compounds for which it was entered in the
NCI database. (A CAS_RN of 999-99-9 does not necessarily mean that the
compound has no CAS Registry Number; it just was not entered in the
NCI database.)
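The two identification fields can be read with a few lines of Python. The sketch below assumes the usual SD file layout: data items follow the `M  END` line, each starts with a header line such as `> <NSC>`, its value lines run up to the next blank line, and a `$$$$` line closes the record. The names `iter_sdf_fields` and `cas_rn_or_none` are hypothetical helpers, not part of any toolkit mentioned here.

```python
import re

DUMMY_CAS_RN = "999-99-9"
_FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def iter_sdf_fields(lines):
    """Yield one dict per SD record, mapping field name to its list of value lines."""
    fields, name, in_data = {}, None, False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":          # end of record
            yield fields
            fields, name, in_data = {}, None, False
        elif line.startswith("M  END"):     # end of the structure block
            in_data = True
        elif in_data:
            header = _FIELD_HEADER.match(line)
            if header:
                name = header.group(1)
                fields[name] = []
            elif name is not None:
                if line.strip():
                    fields[name].append(line)
                else:
                    name = None             # a blank line ends the data item


def cas_rn_or_none(fields):
    """Return the CAS Registry Number, or None if absent or the dummy value."""
    values = fields.get("CAS_RN", [])
    value = values[0].strip() if values else None
    return None if value in (None, DUMMY_CAS_RN) else value
```

Treating 999-99-9 as "not recorded" rather than as a real number keeps the dummy value out of any later matching against other databases.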
In the 2D files with biological data, you'll find the following additional
fields (not necessarily present in all files for all compounds):
- NLOGGI50 - Log GI50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGGI50, INDN, TOTN
- NLOGTGI - Log TGI data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGTGI, INDN, TOTN
- NLOGLC50 - Log LC50 data, comprising the following columns:
CONCUNIT, LCONC, PANEL, CELL, PANELNBR, CELLNBR, NLOGLC50, INDN, TOTN
- NCI_AIDS_Antiviral_Screen_Conclusion - AIDS Screening result (CI = Confirmed
Inactive, CM = Confirmed Moderate[ly active], CA = Confirmed Active)
- NCI_AIDS_Antiviral_Screen_EC50 - AIDS EC50 result with five columns:
HiConc, ConcUnit, Flag, EC50, NumExp.
Note that for some compounds, the EC50 has been measured more than
once.
- NCI_AIDS_Antiviral_Screen_IC50 - AIDS IC50 result with five columns:
HiConc, ConcUnit, Flag, IC50, NumExp.
Note that for some compounds, the IC50 has been measured more than
once.
For more explanation on these data, in particular the meaning of the column
headings, please see the Web pages of the DTP
Human Tumor Cell Line Screen and/or the DTP
AIDS Antiviral Screen.
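Once the value lines of one of these multi-column fields have been read from a record, they can be turned into labeled rows as sketched below. This is a hedged illustration: the column names come from the list above, but the column separator inside the files is not documented on this page, so `sep` is an assumption to be checked against the actual data (`None` means "split on whitespace"). The name `rows_to_dicts` is hypothetical.

```python
GI50_COLUMNS = [
    "CONCUNIT", "LCONC", "PANEL", "CELL", "PANELNBR",
    "CELLNBR", "NLOGGI50", "INDN", "TOTN",
]


def rows_to_dicts(value_lines, columns, sep=None):
    """Map each value line of a multi-column field onto the given column names.

    sep is an ASSUMPTION about the file layout; verify it on real records.
    """
    rows = []
    for line in value_lines:
        cells = [cell.strip() for cell in line.split(sep)]
        if len(cells) != len(columns):
            # Fail loudly rather than silently misaligning columns.
            raise ValueError(
                f"expected {len(columns)} columns, got {len(cells)}: {line!r}"
            )
        rows.append(dict(zip(columns, cells)))
    return rows
```

Raising an error on a column-count mismatch is deliberate: panel or cell line names that contain the separator would otherwise shift every following value into the wrong column.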
Please also note that no editing of the biological test data has been
performed. This means that all DTP results for which the chemical
structure is available have been included. This includes data from
"non-production" cell lines, i.e. cell lines that were used only a short
time during test phases, as well as data from those ten cell lines that
were replaced by a new block of ten around 1992. It is up to the user to
do their own evaluation, statistics, and, if necessary, (pre-)processing,
of these data before using them for any purpose.
In the 3D file, hydrogens were added by CORINA, whereas they are not
present in the 0D and 2D files. In the 2D files, the stereochemistry
shown is in fact meaningless, since it was decided upon at random. This
cannot easily be changed.
In previous versions of this page, the 0D information was called "2D".
This has been changed to avoid confusion with the new 2D information added.
The file that was previously called NCI2D397.sdz is therefore mostly identical
with the new file NCI0DA99.sdz with the exception of the newly added compounds.
The sizes listed for the uncompressed files are in "real" MB, i.e. 1024
x 1024 bytes.
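Since the sizes use 1024 x 1024 bytes per MB, the byte counts are somewhat larger than the MB figures might suggest. The short Python sketch below converts a listed size to bytes and checks whether a directory has enough free space for it; the function names `mb_to_bytes` and `has_room_for` are hypothetical.

```python
import shutil

BYTES_PER_MB = 1024 * 1024


def mb_to_bytes(megabytes):
    """Convert a size in "real" MB (1024 x 1024 bytes) to bytes."""
    return int(megabytes * BYTES_PER_MB)


def has_room_for(megabytes, directory="."):
    """Return True if the file system holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= mb_to_bytes(megabytes)


print(mb_to_bytes(380))  # 398458880 bytes for the uncompressed 0D file
```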
Our 249,081 structure set is a combination of three sets:
1) the March 1997 set, still downloadable here as NCI3D397.sdz
2) 689 supplemental structures selected from the DTP
Human Tumor Cell Line Screen 3D SD files as of August 1999
3) 2,212 supplemental structures selected from
DTP
AIDS Antiviral Screen 3D SD files as of October 1999.
Our 2D files with AIDS data contain 2 more structures than the file
available at the DTP
AIDS Antiviral Screen.
The DTP
Human Tumor Cell Line Screen biological data file contains cancer screen
data for 370 more entries for which we don't have the structure (these
structures are not available on the DTP site).
For 10 out of the 249,081 structures, the 3D generation process failed.
Acknowledgments
All the SD files were prepared with the help of the SDF_toolkit.
Thanks to Bruno Bienfait for both the toolkit and this work.
We gratefully acknowledge Prof. Gasteiger's group at the Computer
Chemistry Center (CCC), Institute of Organic Chemistry, University
of Erlangen-Nuremberg, Germany, for providing us with their program CORINA,
and help with the database conversion.
Note to Windows users: While downloading with Netscape on Unix platforms
usually works flawlessly, we've received reports (and have confirmed in very
limited tests) that in Windows, using Netscape with the "Save Link As..." option
may produce corrupted binary files. In these cases, you may want to try newer
versions of Mozilla/Firefox, or Internet Explorer with the option "Save Target
As..." for downloading. Because of the file name/extension used for some of the
files (.sdz), you may have to either rename the downloaded binary file or open it
manually with a program such as WinZip or similar.
Last change: M. C. Nicklaus,
2007-04-27