Recently, I wrote a post named Faster Fingerprint Search with Java & CDK . It’s fast enough, with a response time of 300 ms for a database of 100000 compounds as you can see from the chart above. If we do some simple improvement on it, it could be even faster.
As we all know, when we search for similar structures, our judgement of similarity is based on Tanimoto coefficient. If variable ‘a’ stands for the number of all TRUE bits in one fingerprint, ‘b’ stands for another, and ‘c’ stands for the number of TRUE bits they both have, we can define Tanimoto coefficient as c/(a+b-c). If we want to find some fingerprints with a minimum Tanimoto coefficient λ, we are saying c/(a+b-c) > λ. As c is the number of TRUE bits they have in common, c is absolutely not greater than a or b. Then we get b*λ<a<b/λ and a*λ<b<a/λ.
With this inequality in hand, we don’t need to iterate all the fingerprints to do a similar structure search. If we sort all the fingerprints in their number of all TRUE bits, we can significantly reduce the range of database we need to screen.
Here is the distribution of fingerprint darkness of my database of 80000 commercial compounds.
And here is the search time after new search method is applied.
Extremely fast!
Tags: CDK, Cheminformatics, Fingerprint, 化学信息学