ChemHack.com Chinese Edition

31 Mar 09 CDK, Gasteiger, Open Source

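Gasteiger spent a day with our group yesterday. I'll skip the blow-by-blow and just note the interesting parts. First I showed him my jsMolEditor, which went fairly well. Later, I don't remember how, Steinbeck came up (I wrote his name in Chinese characters; since I've mentioned CDK, everyone knows who I mean, and this way he won't Google his way here and find me bad-mouthing him). Old G said: oh, this Steinbeck, I know him all too well, and I could list a pile of problems with the CDK he wrote. Awkward. After that I didn't mention that my structure search uses CDK fingerprints. Then I got to the last slide of my jsMolEditor talk, on Open Source & Open Access. Old G spoke up again: Open Access is enough, don't go Open Source; just let people use it, and don't worry about whether they can learn from your code. Then he went on to pick CDK apart once more. My palms were sweating.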

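Old G said: isn't this thing of yours an editor? We have one too. So he started showing it to me, and partway through the demo he got stuck. He hadn't written it himself and hadn't learned how to use it yet. Awkward again. He was going to show it to me this morning instead, but Dr. Xu whisked him away before he got the chance...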

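Then came Old G's more entertaining material, going from molecular models to human anatomy. Fragment encodings such as fingerprints are like a dismembered body: the head might end up next to the calf, so they are unreliable. Topological structure is like saying your finger is as thick as your thigh bone, which is also unreliable. A 3D structure is roughly a dried specimen with only the skeleton left. People have skin, so a molecule's surface has to be considered too. He then went through similarity comparison methods one by one, from 2D structure to molecular surface, and slipped in an ad along the way: you don't need to work out how to implement any of this, just buy our software.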

Finally he gave the students of our School of Pharmacy a lecture on BioPath, which looked an awful lot like a software sales pitch...

That's all.


13 Nov 08 More Faster Fingerprint Search with Java & CDK

I recently wrote a post titled Faster Fingerprint Search with Java & CDK. That approach is already fast, with a response time of 300 ms for a database of 100000 compounds, as you can see from the chart above. With a simple improvement, it can be even faster.


 

As we all know, when we search for similar structures, our measure of similarity is the Tanimoto coefficient. If ‘a’ is the number of TRUE bits in one fingerprint, ‘b’ is the number of TRUE bits in the other, and ‘c’ is the number of TRUE bits they have in common, the Tanimoto coefficient is c/(a+b-c). Searching for fingerprints above a minimum Tanimoto coefficient λ means requiring c/(a+b-c) > λ. Since c counts the bits the two fingerprints share, c cannot exceed a or b, so the coefficient can never be larger than the smaller of a and b divided by the larger. It follows that b*λ<a<b/λ and, equivalently, a*λ<b<a/λ.
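The step from the threshold to the bounds can be written out as follows. This is only a spelled-out version of the argument above, using the same symbols; the shorthand T for the Tanimoto coefficient is introduced here for convenience, and λ is assumed to lie strictly between 0 and 1.

```latex
\begin{align*}
T &= \frac{c}{a+b-c}, \qquad c \le \min(a,b) \\
T &\le \frac{\min(a,b)}{a+b-\min(a,b)} = \frac{\min(a,b)}{\max(a,b)} \\
T > \lambda &\;\Longrightarrow\; \frac{\min(a,b)}{\max(a,b)} > \lambda
  \;\Longrightarrow\; \lambda b < a < \frac{b}{\lambda}
\end{align*}
```

For example, for a query fingerprint with b = 100 TRUE bits and λ = 0.8, only fingerprints with 80 < a < 125 TRUE bits can possibly qualify.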

With this inequality in hand, we don’t need to iterate over all the fingerprints to run a similarity search. If we sort the fingerprints by their number of TRUE bits, we can significantly reduce the portion of the database we need to screen.
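A minimal sketch of how such a pruned search might look is given here. It is not the code behind this post: the class name PrunedFingerprintIndex and its methods are made up for illustration, the Tanimoto coefficient is computed directly on java.util.BitSet rather than through CDK, and λ is assumed to satisfy 0 < λ ≤ 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

/** Sketch of a bit-count-pruned Tanimoto search; all names here are hypothetical. */
class PrunedFingerprintIndex {
    private final BitSet[] sorted;  // fingerprints ordered by number of TRUE bits
    private final int[] bitCounts;  // bitCounts[i] == sorted[i].cardinality()

    PrunedFingerprintIndex(List<BitSet> fingerprints) {
        sorted = fingerprints.toArray(new BitSet[0]);
        Arrays.sort(sorted, Comparator.comparingInt(BitSet::cardinality));
        bitCounts = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            bitCounts[i] = sorted[i].cardinality();
        }
    }

    /** Tanimoto coefficient c / (a + b - c), computed on plain BitSets. */
    static double tanimoto(BitSet x, BitSet y) {
        BitSet common = (BitSet) x.clone();
        common.and(y);
        int c = common.cardinality();
        int union = x.cardinality() + y.cardinality() - c;
        return union == 0 ? 0.0 : (double) c / union;
    }

    /** Index of the first entry whose bit count is >= value (binary search). */
    private int lowerBound(double value) {
        int lo = 0;
        int hi = bitCounts.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (bitCounts[mid] < value) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Returns the fingerprints whose Tanimoto coefficient with the query exceeds lambda. */
    List<BitSet> search(BitSet query, double lambda) {
        List<BitSet> hits = new ArrayList<>();
        int b = query.cardinality();
        double maxBits = b / lambda;
        // Only entries with b*lambda <= a <= b/lambda can pass, so the rest is never touched.
        for (int i = lowerBound(b * lambda); i < sorted.length && bitCounts[i] <= maxBits; i++) {
            if (tanimoto(sorted[i], query) > lambda) {
                hits.add(sorted[i]);
            }
        }
        return hits;
    }
}
```

The sort happens once when the index is built, so each query only pays for a binary search plus a scan of the window of candidate bit counts.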

Here is the distribution of fingerprint darkness (the number of TRUE bits per fingerprint) in my database of 80000 commercial compounds.

And here is the search time after the new search method is applied.

Extremely fast!


11 Nov 08 Faster Fingerprint Search with Java & CDK

Rich Apodaca wrote a great series of posts named Fast Substructure Search Using Open Source Tools, providing details on substructure search with MySQL. However, MySQL's poor functions for operating on binary data limit the implementation of similar structure search, which typically depends on calculating the Tanimoto coefficient. We are going to use Java & CDK to add this feature.

The default output of a CDK fingerprint is a java.util.BitSet, which implements the Serializable interface and is therefore a perfect format for storing fingerprint data. Java itself provides several collections in package java.util, such as ArrayList, LinkedList, and Vector. Because the search engine will be accessed from the web, the thread-unsafe ArrayList and LinkedList have to be ruled out. How about Vector? Once all the fingerprint data is prepared, the only collection operation a similarity search needs is iteration. No adds, no deletes. So a lightweight array is enough.

Most of the molecule information is stored in a MySQL database, so we map each fingerprint to its corresponding row in the data table. Here is the MolDFData class, which uses a long variable to store the primary key of that row.

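import java.io.Serializable;
import java.util.BitSet;

// Pairs a fingerprint with the primary key of its row in the MySQL data table.
public class MolDFData implements Serializable {
    private long id;            // primary key of the corresponding row
    private BitSet fingerprint; // fingerprint of the molecule

    public MolDFData(long id, BitSet fingerprint) {
        this.id = id;
        this.fingerprint = fingerprint;
    }

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public BitSet getFingerprint() {
        return fingerprint;
    }

    public void setFingerprint(BitSet fingerprint) {
        this.fingerprint = fingerprint;
    }
}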

This is how we store our fingerprints.

private MolDFData[] arrayData;
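Because both the record class and BitSet are Serializable, an array like this could be written out and read back with standard Java serialization. The sketch below is an illustration only, not part of the original code: FingerprintStore and its nested Entry class are hypothetical names, with Entry standing in for MolDFData so the example compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.BitSet;

/** Hypothetical helper that round-trips a fingerprint array through Java serialization. */
class FingerprintStore {
    /** Minimal stand-in for MolDFData so this sketch is self-contained. */
    static class Entry implements Serializable {
        private static final long serialVersionUID = 1L;
        final long id;
        final BitSet fingerprint;

        Entry(long id, BitSet fingerprint) {
            this.id = id;
            this.fingerprint = fingerprint;
        }
    }

    /** Serializes the whole array into a byte array (a file stream would work the same way). */
    static byte[] save(Entry[] entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(entries);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back an array previously produced by save(). */
    static Entry[] load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Entry[]) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Loading the array once at startup and then only reading from it is what makes a plain array sufficient for concurrent searches.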

The similarity search itself is no big deal. Just calculate the Tanimoto coefficient, and if it is bigger than the minimal similarity you set, add the fingerprint to the results.

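import java.util.BitSet;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.similarity.Tanimoto;

    // Returns every entry whose Tanimoto coefficient with bt exceeds minSimilarity.
    public List<SearchResultData> searchTanimoto(BitSet bt, float minSimilarity) {
        List<SearchResultData> resultList = new LinkedList<SearchResultData>();
        for (int i = 0; i < arrayData.length; i++) {
            MolDFData aListData = arrayData[i];
            try {
                float coefficient = Tanimoto.calculate(aListData.getFingerprint(), bt);
                if (coefficient > minSimilarity) {
                    resultList.add(new SearchResultData(aListData.getId(), coefficient));
                }
            } catch (CDKException e) {
                // Skip fingerprints that CDK cannot compare with the query.
            }
        }
        // Sort once, after the scan; SearchResultData must implement Comparable.
        Collections.sort(resultList);
        return resultList;
    }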
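The SearchResultData class is not shown in this post. Since the search method passes its results to Collections.sort, the class has to implement Comparable; one possible shape is sketched below. Everything in it is an assumption, including the getter names and the choice to put the most similar hit first.

```java
/** Hypothetical shape of SearchResultData; the original class is not shown in the post. */
class SearchResultData implements Comparable<SearchResultData> {
    private final long id;            // primary key of the matching row
    private final float coefficient;  // Tanimoto coefficient against the query

    SearchResultData(long id, float coefficient) {
        this.id = id;
        this.coefficient = coefficient;
    }

    long getId() {
        return id;
    }

    float getCoefficient() {
        return coefficient;
    }

    /** Assumed ordering: highest similarity first. */
    @Override
    public int compareTo(SearchResultData other) {
        return Float.compare(other.coefficient, this.coefficient);
    }
}
```

With this ordering, the first element of the sorted list is the closest match to the query.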

Pretty ugly code? Maybe. But it really works, at an acceptable speed. Tests were done using the code below on a MacBook (Intel Core Duo 1.83 GHz, 2 GB RAM).

                long t3 = System.currentTimeMillis();
                List<SearchResultData> listResult = se.searchTanimoto(bs, 0.8f);
                long t4 = System.currentTimeMillis();
                System.out.println("Thread: Search done in " + (t4 - t3) + " ms.");

In my database of 87364 commercial compounds, it takes 335 ms.
