<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ChemHack &#187; CDK</title>
	<atom:link href="http://chemhack.com/tag/cdk/feed/" rel="self" type="application/rss+xml" />
	<link>http://chemhack.com</link>
	<description>Hacking the chemistry world.</description>
	<lastBuildDate>Sat, 18 Dec 2010 18:07:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Does the CDK Fingerprint Work? Something Went Wrong.</title>
		<link>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/</link>
		<comments>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/#comments</comments>
		<pubDate>Mon, 16 Mar 2009 16:11:38 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Fingerprint]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=282</guid>
		<description><![CDATA[In my last post, I doubted the accuracy of fingerprint-based substructure search and claimed that the fingerprint sometimes lost hits. In fact, something was wrong in my code. I was reading directly from SDF, and IteratingMDLReader does not perceive atom types or detect aromaticity automatically. This caused UniversalIsomorphismTester to report incorrect matches; sorry for the incorrect post. I&#8217;ve run the test again, [...]]]></description>
			<content:encoded><![CDATA[<p> </p>
<p>In my <a href="http://chemhack.com/archives/2009/03/255/">last post</a>, I doubted the accuracy of fingerprint-based substructure search and claimed that the fingerprint sometimes lost hits. In fact, something was wrong in my code. I was reading directly from SDF, and IteratingMDLReader does not perceive atom types or detect aromaticity automatically. This caused UniversalIsomorphismTester to report incorrect matches; sorry for the incorrect post. I&#8217;ve run the test again, using the <a href="http://chemhack.com/wp-content/uploads/2009/03/junk.smi">SMILES</a> provided by <a href="http://blog.rguha.net/?p=133" target="_blank">Guha</a> as input. The Groovy script is attached <a href="http://chemhack.com/wp-content/uploads/2009/03/substructure.groovy">here</a>.</p>
<table border="1" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td valign="top"><strong>#</strong></td>
<td valign="top"><strong>Query</strong></td>
<td valign="top"><strong>Subgraph Isomorphism</strong></td>
<td valign="top"><strong>Extended CDK</strong></td>
<td valign="top"><strong>Missing</strong></td>
<td valign="top"><strong>Extra</strong></td>
</tr>
<tr>
<td valign="top">1</td>
<td valign="top"><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /></td>
<td valign="top">20</td>
<td valign="top">24</td>
<td valign="top">0</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">2</td>
<td valign="top"><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /></td>
<td valign="top">7</td>
<td valign="top">103</td>
<td valign="top">0</td>
<td valign="top">96</td>
</tr>
<tr>
<td valign="top">3</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /></td>
<td valign="top">69</td>
<td valign="top">100</td>
<td valign="top">0</td>
<td valign="top">31</td>
</tr>
<tr>
<td valign="top">4</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /></td>
<td valign="top">6</td>
<td valign="top">10</td>
<td valign="top">0</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">5</td>
<td valign="top"><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /></td>
<td valign="top">31</td>
<td valign="top">41</td>
<td valign="top">0</td>
<td valign="top">10</td>
</tr>
<tr>
<td valign="top">6</td>
<td valign="top"><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /></td>
<td valign="top">23</td>
<td valign="top">23</td>
<td valign="top">0</td>
<td valign="top">0</td>
</tr>
<tr>
<td valign="top">7</td>
<td valign="top"><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></td>
<td valign="top">7</td>
<td valign="top">75</td>
<td valign="top">0</td>
<td valign="top">68</td>
</tr>
</tbody>
</table>
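<p>So the CDK fingerprint is OK: the screen returns extra candidates for most queries, but it never misses a hit found by subgraph isomorphism.</p>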
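<p>The Missing and Extra columns are set differences between the two hit lists. A minimal sketch of that bookkeeping, assuming each hit list is kept as a BitSet indexed by molecule number; the class and method names are made up for illustration, and the attached Groovy script may do this differently.</p>

```java
import java.util.BitSet;

// Hypothetical helper (not part of CDK): compares the hits of the exact
// subgraph isomorphism test with the hits of the fingerprint screen.
public class ScreenCheck {

    // Molecules found by subgraph isomorphism but rejected by the screen.
    public static BitSet missing(BitSet exactHits, BitSet screenHits) {
        BitSet m = (BitSet) exactHits.clone(); // clone so the inputs stay untouched
        m.andNot(screenHits);
        return m;
    }

    // Molecules passed by the screen that the exact test does not confirm.
    public static BitSet extra(BitSet exactHits, BitSet screenHits) {
        BitSet e = (BitSet) screenHits.clone();
        e.andNot(exactHits);
        return e;
    }

    public static void main(String[] args) {
        BitSet exact = new BitSet();
        exact.set(1);
        exact.set(4);
        BitSet screen = new BitSet();
        screen.set(1);
        screen.set(4);
        screen.set(7);
        System.out.println("missing=" + missing(exact, screen).cardinality()
                + " extra=" + extra(exact, screen).cardinality());
    }
}
```

<p>A screen is safe for substructure search as long as Missing stays at zero; Extra only costs additional isomorphism tests.</p>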
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Even Faster Fingerprint Search with Java &amp; CDK</title>
		<link>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/</link>
		<comments>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 22:11:24 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Cheminformatics]]></category>
		<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Fingerprint]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=137</guid>
		<description><![CDATA[Recently, I wrote a post named Faster Fingerprint Search with Java &#38; CDK. It&#8217;s fast enough, with a response time of 300 ms for a database of 100000 compounds, as you can see from the chart below. With a simple improvement, it could be even faster.   As we all know, when we [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I wrote a post named <a href="http://chemhack.com/archives/2008/11/110/">Faster Fingerprint Search with Java &amp; CDK</a>. It&#8217;s fast enough, with a response time of 300 ms for a database of 100000 compounds, as you can see from the chart below. With a simple improvement, it could be even faster.</p>
<p><span><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-121.png"><img class="alignnone size-full wp-image-140" title="Darkness" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-121.png" alt="" width="452" height="450" /></a><br />
</span></p>
<p>As we all know, when we search for similar structures, our judgement of similarity is based on the Tanimoto coefficient. If &#8216;a&#8217; is the number of TRUE bits in one fingerprint, &#8216;b&#8217; is the number of TRUE bits in the other, and &#8216;c&#8217; is the number of TRUE bits they have in common, the Tanimoto coefficient is c/(a+b-c). Requiring a minimum Tanimoto coefficient λ means requiring c/(a+b-c) &gt; λ. Because c counts the bits the two fingerprints share, c cannot be greater than a or b, so a+b-c is at least as large as the larger of a and b, and the coefficient can never exceed the smaller count divided by the larger one. It follows that b*λ &lt; a &lt; b/λ and, equivalently, a*λ &lt; b &lt; a/λ.</p>
<p>With this inequality in hand, we don&#8217;t need to iterate over all the fingerprints to do a similarity search. If we sort the fingerprints by their number of TRUE bits, we only have to screen those whose bit count falls inside this window, which significantly reduces the part of the database we need to scan.</p>
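<p>A minimal sketch of this pruning step, assuming the bit counts of all fingerprints have already been sorted in ascending order. The class and method names are hypothetical and not part of CDK; the method only selects the slice where a*λ &lt; b &lt; a/λ can hold, and the Tanimoto coefficient still has to be calculated for every fingerprint inside that slice.</p>

```java
import java.util.Arrays;

// Hypothetical helper (not part of CDK): finds the slice of a sorted array
// of bit counts that can still reach a Tanimoto coefficient above lambda.
public class DarknessWindow {

    // Returns {from, to}: the indices [from, to) of sortedCounts whose value b
    // satisfies a*lambda < b < a/lambda, where a is the bit count of the query.
    public static int[] candidateRange(int[] sortedCounts, int a, double lambda) {
        double low = a * lambda;
        double high = a / lambda;
        int from = 0;
        // Skip fingerprints with too few TRUE bits.
        while (from < sortedCounts.length && sortedCounts[from] <= low) {
            from++;
        }
        int to = from;
        // Stop at the first fingerprint with too many TRUE bits.
        while (to < sortedCounts.length && sortedCounts[to] < high) {
            to++;
        }
        return new int[] {from, to};
    }

    public static void main(String[] args) {
        int[] counts = {10, 40, 85, 90, 100, 120, 130, 300};
        // Query with 100 TRUE bits and lambda = 0.8: only counts between 80 and 125 qualify.
        int[] range = candidateRange(counts, 100, 0.8);
        System.out.println(Arrays.toString(range));
    }
}
```

<p>The linear scan keeps the sketch short; on a large sorted array, a binary search for both ends of the window would be the natural replacement.</p>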
<p>Here is the distribution of fingerprint darkness (the number of TRUE bits per fingerprint) in my database of 80000 commercial compounds.</p>
<p><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-3.png"><img class="alignnone size-full wp-image-141" title="Darkness" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-3.png" alt="" width="500" height="507" /></a></p>
<p>And here is the search time after the new search method is applied.</p>
<p><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-9.png"><img class="alignnone size-full wp-image-142" title="Search Time" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-9.png" alt="" width="500" height="430" /></a></p>
<p>Extremely fast!</p>
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Faster Fingerprint Search with Java &amp; CDK</title>
		<link>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/</link>
		<comments>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/#comments</comments>
		<pubDate>Tue, 11 Nov 2008 10:54:27 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Cheminformatics]]></category>
		<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Fingerprint]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=110</guid>
		<description><![CDATA[Rich Apodaca wrote a great series of posts named Fast Substructure Search Using Open Source Tools, providing details on substructure search with MySQL. However, MySQL&#8217;s limited functions for operating on binary data make it hard to implement similarity search, which typically depends on calculating the Tanimoto coefficient. We are going to use Java &#38; CDK to add this feature. As default [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://depth-first.com/">Rich Apodaca</a> wrote a great series of posts named <em>Fast Substructure Search Using Open Source Tools</em>, providing details on substructure search with MySQL. However, MySQL&#8217;s limited functions for operating on binary data make it hard to implement similarity search, which typically depends on calculating the Tanimoto coefficient. We are going to use Java &amp; CDK to add this feature.</p>
<p>The default output of a CDK fingerprint is a <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/BitSet.html">java.util.BitSet</a>, which implements the <a title="interface in java.io" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/Serializable.html">Serializable</a> interface, so it is a convenient format for storing fingerprint data. Java itself provides several collections in package java.util, such as <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html">ArrayList</a>, <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/LinkedList.html">LinkedList</a> and <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/Vector.html">Vector</a>. To provide web access to the search engine, the thread-unsafe ArrayList and LinkedList have to be ruled out. How about Vector? Once all the fingerprint data has been prepared, the only collection operation a similarity search needs is iteration: no adds, no deletes. So a lightweight array is enough.</p>
<p>Most of the molecule information is stored in a MySQL database, so we map each fingerprint to the corresponding row in the data table. Here is the MolDFData class; it uses a long field to store the primary key of the corresponding row.</p>
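<pre lang="java">import java.io.Serializable;
import java.util.BitSet;

// Pairs a fingerprint with the primary key of its row in the MySQL data table.
public class MolDFData implements Serializable {
    private long id;            // primary key of the corresponding row
    private BitSet fingerprint; // fingerprint bits of the molecule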

    public MolDFData(long id, BitSet fingerprint) {
        this.id = id;
        this.fingerprint = fingerprint;
    }
    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public BitSet getFingerprint() {
        return fingerprint;
    }

    public void setFingerprint(BitSet fingerprint) {
        this.fingerprint = fingerprint;
    }
}</pre>
<p>This is how we store our fingerprints.</p>
<pre lang="java">private MolDFData[] arrayData;</pre>
<p>Similarity search is no big deal. Just calculate the Tanimoto coefficient for each fingerprint and, if it&#8217;s bigger than the minimum similarity you set, add that compound to the result.</p>
<pre lang="java">    public List&lt;SearchResultData&gt; searchTanimoto(BitSet bt, float minSimilarity) {
        List&lt;SearchResultData&gt; resultList = new LinkedList&lt;SearchResultData&gt;();
        for (int i = 0; i &lt; arrayData.length; i++) {
            MolDFData aListData = arrayData[i];
            try {
                float coefficient = Tanimoto.calculate(aListData.getFingerprint(), bt);
                if (coefficient &gt; minSimilarity) {
                    resultList.add(new SearchResultData(aListData.getId(), coefficient));
                }
            } catch (CDKException e) {
                // Skip fingerprints that CDK cannot compare with the query.
            }
        }
        // Sort once, after all hits have been collected.
        Collections.sort(resultList);
        return resultList;
    }</pre>
<p>Pretty ugly code? Maybe. But it really works, at an acceptable speed. Tests were done using the code below on a MacBook (Intel Core Duo 1.83 GHz, 2 GB RAM).</p>
<pre lang="java">long t3 = System.currentTimeMillis();
List&lt;SearchResultData&gt; listResult = se.searchTanimoto(bs, 0.8f);
long t4 = System.currentTimeMillis();
System.out.println("Thread: Search done in " + (t4 - t3) + " ms.");</pre>
<p>On my database of 87364 commercial compounds, the search takes 335 ms.</p>
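<p>The Tanimoto coefficient used in the search above is the number of TRUE bits two fingerprints share, divided by the number of bits set in either of them. A plain-JDK sketch of that formula, for illustration only: it is not CDK&#8217;s implementation, the class name BitSetTanimoto is made up, and returning 0 for two empty fingerprints is an assumption.</p>

```java
import java.util.BitSet;

// Illustration only: a Tanimoto coefficient on java.util.BitSet,
// written without CDK. Not CDK's own implementation.
public class BitSetTanimoto {

    public static float tanimoto(BitSet x, BitSet y) {
        BitSet common = (BitSet) x.clone(); // clone so the inputs stay untouched
        common.and(y);
        int c = common.cardinality();                      // bits set in both
        int union = x.cardinality() + y.cardinality() - c; // bits set in either
        // Assumption: two empty fingerprints get similarity 0.
        return union == 0 ? 0f : (float) c / union;
    }

    public static void main(String[] args) {
        BitSet x = new BitSet(8);
        x.set(0);
        x.set(1);
        x.set(2);
        BitSet y = new BitSet(8);
        y.set(1);
        y.set(2);
        y.set(3);
        // 2 shared bits, 4 bits set in either fingerprint.
        System.out.println(tanimoto(x, y));
    }
}
```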
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

