In my last post, I doubted the accuracy of fingerprint based substructure search and pointed out sometimes fingerprint loosed hits. In fact, something went wrong in my code. As I was reading from SDF directly, while IteratingMDLReader does not percieve atom type or detect aromaticity automatically. This cause the incorrect matchings of UniversalIsomorphismTester, sorry for the incorrect post. I’ve run the test again, using the SMILES provided by Guha as input. The groovy script is also attached here.
| # | Query | Subgraph Isomorphism | Entended CDK | Missing | Extra |
| 1 | ![]() |
20 | 24 | 0 | 4 |
| 2 | ![]() |
7 | 103 | 0 | 96 |
| 3 | ![]() |
69 | 100 | 0 | 31 |
| 4 | ![]() |
6 | 10 | 0 | 4 |
| 5 | ![]() |
31 | 41 | 0 | 10 |
| 6 | ![]() |
23 | 23 | 0 | 0 |
| 7 | ![]() |
7 | 75 | 0 | 68 |
Well, CDK fingerprint is OK.
Tags: CDK, Chemoinformatics, Fingerprint
The test code used in this post has fatal error, which caused the test result to be completely incorrect. Please see details here.
Abstract: Doubting on the accuracy of fingerprint based molecule substructure search, I did a test between different fingerprints implemented in CDK and subgraph isomorphism. The result is very interesting, CDK fingerprints should never be used alone in substructure search, but combine CDK fingerprints and subgraph isomorphism, we can have a balance between speed and accuracy.
Guha wrote a post about benchmarking of different type of fingerprints with benchmarking strategies described by Bender & Glen and Godden et al. The benchmark is based on Tanimoto similarity, which is the foundation of most chemistry database’s similar structure search. Another impo
rtant feature of molecule structure search is substructure, currently, subgraph isomorphism and fingerprint are both used in substructure search. Adel Golovin and Kim Henrick’s article Chemical Substructure Search in SQL provides a pure SQL subgraph isomorphism strategies, Rich Apodaca’s post fingerprint based MySQL substructure search in MySQL also found a solution with limited binary operation in MySQL.
Fingerprint is obviously faster as it’s much less consuming than subgraph isomorphism. But the real question is, does the fingerprint method really find all the substructures? Are there any hits misjudged as substructure?
Using DrugBank small molecule drugs as test dataset, several hand-draw structures of different types as queries, I performed substructure search using subgraph isomorphism and fingerprints implemented in CDK. If a hit in search results of fingerprint method is also found in search results of subgraph isomorphism method, I count this hit as a correct hit, other wise the hit will be counted as a incorrect hit.
All fingerprints implemented in CDK are tested, generated using Fingerprinter, ExtendedFingerprinter, GraphOnlyFingerprinter, SubstructureFingerprinter, MACCSFingerprinter and EStateFingerprinter. All parameters are kept default.
To test the accuracy of queries with different complexity, I drew several structures, as listed below.







The result is listed below. The result is listed in the format of “A/B”. A represents current hits, i.e. hits also founded by subgraph isomorphism method. B represents incurrent hits. For example, 31/49 stands for 31 current hits and 49 incurrent hits are found. Higher A and lower B is better.
| # | Query | Subgraph Isomorphism | EState | MACCS | Standard CDK | Entended CDK | Graphonly CDK FIngerpint | Substructure Fingerprint |
| 1 | ![]() |
31 | 0/0 | 0/43 | 31/49 | 31/54 | 31/574 | 0/0 |
| 2 | ![]() |
54 | 0/0 | 0/1 | 21/95 | 18/84 | 54/1512 | 11/30 |
| 3 | ![]() |
29 | 0/0 | 0/20 | 29/74 | 29/83 | 29/1793 | 0/5 |
| 4 | ![]() |
3 | 0/0 | 0/85 | 3/6 | 3/3 | 3/36 | 0/4 |
| 5 | ![]() |
31 | 0/0 | 29/93 | 31/14 | 31/13 | 31/1593 | 27/53 |
| 6 | ![]() |
23 | 0/0 | 0/0 | 23/0 | 23/0 | 23/23 | 0/0 |
| 7 | ![]() |
9 | 0/0 | 0/0 | 8/83 | 8/63 | 9/237 | 5/40 |
As we can see from the table, MACCS, Estate and Substructure Fingerprint perform very bad, they found very little hit, sometimes no hit at all. They are not designed to do this task, it’s not amazing to see this result.
For standard and extended CDK fingerprint, sometimes standard one works better(Maybe I should use longer extended fingerprint rather than the default length, as discussed in Guha’s post). On queries of complex ring system, extended CDK fingerprint works better, but not a obvious advantage.
But I wonder why hashed fingerprint still miss some result, (see query 2 and query 7)? Why the superstructure doesn’t share the same bits with substructure? Is this because the structure is too simple?
As many incurrent hits are found, CDK fingerprints should never be used alone in substructure search. But please consider combine CDK fingerprints and subgraph isomorphism, do fingerprint search first, we can avoid performing the consuming subgraph isomorphism match on all targets, thus we can have a balance between speed and accuracy in that way.
Tags: Chemoinformatics, Fingerprint, search, substructure
Recently, I wrote a post named Faster Fingerprint Search with Java & CDK . It’s fast enough, with a response time of 300 ms for a database of 100000 compounds as you can see from the chart above. If we do some simple improvement on it, it could be even faster.
As we all know, when we search for similar structures, our judgement of similarity is based on Tanimoto coefficient. If variable ‘a’ stands for the number of all TRUE bits in one fingerprint, ‘b’ stands for another, and ‘c’ stands for the number of TRUE bits they both have, we can define Tanimoto coefficient as c/(a+b-c). If we want to find some fingerprints with a minimum Tanimoto coefficient λ, we are saying c/(a+b-c) > λ. As c is the number of TRUE bits they have in common, c is absolutely not greater than a or b. Then we get b*λ<a<b/λ and a*λ<b<a/λ.
With this inequality in hand, we don’t need to iterate all the fingerprints to do a similar structure search. If we sort all the fingerprints in their number of all TRUE bits, we can significantly reduce the range of database we need to screen.
Here is the distribution of fingerprint darkness of my database of 80000 commercial compounds.
And here is the search time after new search method is applied.
Extremely fast!
Tags: CDK, Cheminformatics, Chemoinformatics, Fingerprint
Rich Apodaca wrote a great serious posts named Fast Substructure Search Using Open Source Tools providing details on substructure search with MySQL. But, however, poor binary data operation functions of MySQL limited the implementation of similar structure search which typically depends on the calculation of Tanimato coefficient. We are going to use Java & CDK to add this feature.
As default output of CDK fingerprint, java.util.BitSet with Serializable interface is perfect data format of fingerprint data storage. Java itself provides several collections such as ArrayList, LinkedList, Vector class in package java.util. To provide web access to the search engine, thread unsafe ArrayList and LinkedList have to be kicked out. How about Vector? Once all the fingerprint data is well prepared, the collection function we need to do similarity search is just iteration. No add, no delete. So, a light weight array is enough.
Most of the molecule information is stored in MySQL database, so we are going to map fingerprint to corresponding row in data table. Here is the MolDFData class, we use a long variable to store corresponding primary key in data table.
public class MolDFData implements Serializable {
private long id;
private BitSet fingerprint;
public MolDFData(long id, BitSet fingerprint) {
this.id = id;
this.fingerprint = fingerprint;
}
public long getId() {
return id;
}
public void setId(long id) {
this.id = id;
}
public BitSet getFingerprint() {
return fingerprint;
}
public void setFingerprint(BitSet fingerprint) {
this.fingerprint = fingerprint;
}
}
This is how we storage our fingerprints.
private MolFPData[] arrayData;
No big deal with similarity search. Just calculate the Tanimoto coefficient, if it’s bigger than minimal similarity you set, add this one into result.
public List searchTanimoto(BitSet bt, float minSimlarity) {
List resultList = new LinkedList();
int i;
for (i = 0; i < arrayData.length; i++) {
MolDFData aListData = arrayData[i];
try {
float coefficient = Tanimoto.calculate(aListData.getFingerprint(), bt);
if (coefficient > minSimlarity) {
resultList.add(new SearchResultData(aListData.getId(), coefficient));
}
} catch (CDKException e) {
}
Collections.sort(resultList);
}
return resultList;
}
Pretty ugly code? Maybe. But it really works, at a acceptable speed. Tests were done using the code blow on a macbook(Intel Core Due 1.83 GHz, 2G RAM).
long t3 = System.currentTimeMillis();
List<SearchResultData> listResult = se.searchTanimoto(bs, 0.8f);
long t4 = System.currentTimeMillis();
System.out.println("Thread: Search done in " + (t4 - t3) + " ms.");
In my database of 87364 commercial compounds, it takes 335 ms.
Tags: CDK, Cheminformatics, Chemoinformatics, Fingerprint, Java