<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ChemHack &#187; substructure</title>
	<atom:link href="http://chemhack.com/tag/substructure/feed/" rel="self" type="application/rss+xml" />
	<link>http://chemhack.com</link>
	<description>Hacking the chemistry world.</description>
	<lastBuildDate>Mon, 22 Mar 2010 22:18:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Does the CDK Fingerprints Work? Substructure search</title>
		<link>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/</link>
		<comments>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 17:41:14 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Fingerprint]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[substructure]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=255</guid>
		<description><![CDATA[The test code used in this post has a fatal error, which caused the test results to be completely incorrect. Please see details here.
Abstract: Doubting the accuracy of fingerprint-based molecule substructure search, I tested the different fingerprints implemented in CDK against subgraph isomorphism. The result is very interesting: CDK fingerprints should never [...]]]></description>
			<content:encoded><![CDATA[<p><span style="color: #ff0000;">The test code used in this post has a fatal error, which caused the test results to be completely incorrect. Please see details <a href="http://chemhack.com/archives/2009/03/282/">here</a>.</span></p>
<p>Abstract: Doubting the accuracy of fingerprint-based molecule substructure search, I tested the different fingerprints implemented in CDK against subgraph isomorphism. The result is very interesting: CDK fingerprints should never be used alone in substructure search, but by combining CDK fingerprints with subgraph isomorphism we can strike a balance between speed and accuracy.</p>
<h2>Background</h2>
<p><a href="http://blog.rguha.net/">Guha</a> wrote a <a href="http://blog.rguha.net/?p=29">post</a> about benchmarking different types of fingerprints, using the benchmarking strategies described by <a href="http://dx.doi.org/10.1021/ci0500177">Bender &amp; Glen</a> and <a href="http://dx.doi.org/10.1021/ci050276w">Godden</a> et al. The benchmark is based on Tanimoto similarity, which is the foundation of similar-structure search in most chemistry databases.</p>
<p>Another important feature of molecule structure search is substructure search. Currently, subgraph isomorphism and fingerprints are both used for it. Adel Golovin and Kim Henrick’s article <a href="http://pubs.acs.org/doi/abs/10.1021/ci8003013">Chemical Substructure Search in SQL</a> provides a pure SQL subgraph isomorphism strategy, and <a href="http://depth-first.com">Rich Apodaca</a>’s <a href="http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases">post</a> on fingerprint-based substructure search in MySQL found a solution that works with the limited binary operations available in MySQL.</p>
<p>Fingerprint search is obviously faster, as it is much less time-consuming than subgraph isomorphism. But the real question is: does the fingerprint method really find all the substructures? Are any hits misjudged as substructures?</p>
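<p>As a minimal sketch of how a fingerprint screen decides on a hit, the Python snippet below treats a fingerprint as the set of its on-bit positions and reports a target as a candidate when every query bit is also set in the target. The bit positions and the function name <code>passes_screen</code> are made up for illustration; they are not CDK output or CDK API.</p>

```python
def passes_screen(query_bits, target_bits):
    """Return True if every bit set in the query is also set in the target."""
    return set(query_bits).issubset(target_bits)


# Hypothetical on-bit positions, not produced by any real fingerprinter.
query = {3, 17, 42}
target_with_all_bits = {3, 8, 17, 42, 99}
target_missing_a_bit = {3, 8, 42, 99}

print(passes_screen(query, target_with_all_bits))  # True
print(passes_screen(query, target_missing_a_bit))  # False
```

<p>A target that passes is only a candidate: different fragments can set the same bits, so passing the screen does not by itself prove that the query is a substructure.</p>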
<h2>Method</h2>
<p>Using <a href="http://www.drugbank.ca">DrugBank</a> small molecule drugs as the test dataset and several hand-drawn structures of different types as queries, I performed substructure searches using subgraph isomorphism and the fingerprints implemented in CDK. If a hit from the fingerprint method is also found in the results of the subgraph isomorphism method, I count it as a correct hit; otherwise it is counted as an incorrect hit.</p>
<p>All fingerprints implemented in CDK were tested, generated using Fingerprinter, ExtendedFingerprinter, GraphOnlyFingerprinter, SubstructureFingerprinter, MACCSFingerprinter and EStateFingerprinter. All parameters were kept at their defaults.</p>
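<p>The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the test code used for this post; the function name <code>score_hits</code> and the molecule identifiers are hypothetical.</p>

```python
def score_hits(fingerprint_hits, isomorphism_hits):
    """Return (correct, incorrect) counts for the fingerprint hits."""
    fingerprint_hits = set(fingerprint_hits)
    isomorphism_hits = set(isomorphism_hits)
    # Correct: also found by subgraph isomorphism.
    correct = len(fingerprint_hits.intersection(isomorphism_hits))
    # Incorrect: reported by the fingerprint only.
    incorrect = len(fingerprint_hits.difference(isomorphism_hits))
    return correct, incorrect


# Hypothetical molecule identifiers, used only to illustrate the counting.
isomorphism_hits = {"mol-1", "mol-2", "mol-3"}
fingerprint_hits = {"mol-2", "mol-3", "mol-7", "mol-9"}

correct, incorrect = score_hits(fingerprint_hits, isomorphism_hits)
print(f"{correct}/{incorrect}")  # 2/2
```

<p>Here mol-1 is found by subgraph isomorphism but missed by the fingerprint. A miss like this is not counted separately; it only shows up as a correct count lower than the subgraph isomorphism total.</p>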
<p>To test the accuracy of queries with different complexity, I drew several structures, as listed below.</p>
<p><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></p>
<h2>Result</h2>
<p>The results are listed below in the format &#8220;A/B&#8221;. A is the number of correct hits, i.e. hits also found by the subgraph isomorphism method. B is the number of incorrect hits. For example, 31/49 means that 31 correct hits and 49 incorrect hits were found. A higher A and a lower B are better.</p>
<table border="1" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td valign="top"><strong>#</strong></td>
<td valign="top"><strong>Query</strong></td>
<td valign="top"><strong>Subgraph</strong><strong> Isomorphism</strong></td>
<td valign="top"><strong>EState</strong></td>
<td valign="top"><strong>MACCS</strong></td>
<td valign="top"><strong>Standard CDK</strong></td>
<td valign="top"><strong>Extended CDK</strong></td>
<td valign="top"><strong>Graph-only CDK Fingerprint</strong></td>
<td valign="top"><strong>Substructure Fingerprint</strong></td>
</tr>
<tr>
<td valign="top">1</td>
<td valign="top"><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /></td>
<td valign="top">31</td>
<td valign="top">0/0</td>
<td valign="top">0/43</td>
<td valign="top"><span style="color: red;">31/49</span></td>
<td valign="top">31/54</td>
<td valign="top">31/574</td>
<td valign="top">0/0</td>
</tr>
<tr>
<td valign="top">2</td>
<td valign="top"><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /></td>
<td valign="top">54</td>
<td valign="top">0/0</td>
<td valign="top">0/1</td>
<td valign="top"><span style="color: red;">21/95</span></td>
<td valign="top">18/84</td>
<td valign="top">54/1512</td>
<td valign="top">11/30</td>
</tr>
<tr>
<td valign="top">3</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /></td>
<td valign="top">29</td>
<td valign="top">0/0</td>
<td valign="top">0/20</td>
<td valign="top"><span style="color: red;">29/74</span></td>
<td valign="top">29/83</td>
<td valign="top">29/1793</td>
<td valign="top">0/5</td>
</tr>
<tr>
<td valign="top">4</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /></td>
<td valign="top">3</td>
<td valign="top">0/0</td>
<td valign="top">0/85</td>
<td valign="top">3/6</td>
<td valign="top"><span style="color: red;">3/3</span></td>
<td valign="top">3/36</td>
<td valign="top">0/4</td>
</tr>
<tr>
<td valign="top">5</td>
<td valign="top"><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /></td>
<td valign="top">31</td>
<td valign="top">0/0</td>
<td valign="top">29/93</td>
<td valign="top">31/14</td>
<td valign="top"><span style="color: red;">31/13</span></td>
<td valign="top">31/1593</td>
<td valign="top">27/53</td>
</tr>
<tr>
<td valign="top">6</td>
<td valign="top"><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /></td>
<td valign="top">23</td>
<td valign="top">0/0</td>
<td valign="top">0/0</td>
<td valign="top"><span style="color: red;">23/0</span></td>
<td valign="top"><span style="color: red;">23/0</span></td>
<td valign="top">23/23</td>
<td valign="top">0/0</td>
</tr>
<tr>
<td valign="top">7</td>
<td valign="top"><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></td>
<td valign="top">9</td>
<td valign="top">0/0</td>
<td valign="top">0/0</td>
<td valign="top">8/83</td>
<td valign="top"><span style="color: red;">8/63</span></td>
<td valign="top">9/237</td>
<td valign="top">5/40</td>
</tr>
</tbody>
</table>
<h2>Conclusions &amp; Questions</h2>
<p>As we can see from the table, the MACCS, EState and Substructure fingerprints perform very badly: they found very few hits, and sometimes none at all. They are not designed for this task, so this result is not surprising.</p>
<p>Between the standard and extended CDK fingerprints, the standard one sometimes works better. (Maybe I should use a longer extended fingerprint rather than the default length, as discussed in Guha&#8217;s <a href="http://blog.rguha.net/?p=29">post</a>.) On queries with complex ring systems the extended CDK fingerprint works better, but the advantage is not obvious.</p>
<p>But I wonder why the hashed fingerprints still miss some results (see query 2 and query 7). Why doesn&#8217;t the superstructure share the same bits as the substructure? Is it because the structure is too simple?</p>
<p>As many incorrect hits are found, CDK fingerprints should never be used alone in substructure search. But consider combining CDK fingerprints with subgraph isomorphism: by doing the fingerprint search first, we avoid performing the time-consuming subgraph isomorphism match on every target, and thus strike a balance between speed and accuracy.</p>
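<p>A rough sketch of this two-stage idea in Python is shown below. It uses toy stand-ins rather than CDK: plain strings play the role of molecules, the set of characters in a string plays the role of its fingerprint, and a substring test plays the role of subgraph isomorphism. All names (<code>two_stage_search</code>, <code>toy_bits</code>, <code>toy_exact_match</code>) are hypothetical.</p>

```python
def two_stage_search(query_bits, query, targets, exact_match):
    """Screen targets by fingerprint, then confirm survivors with exact_match.

    targets is an iterable of (target_id, target_bits, target) tuples.
    exact_match(query, target) stands in for a subgraph isomorphism test.
    """
    query_bits = set(query_bits)
    hits = []
    for target_id, target_bits, target in targets:
        # Cheap screen: skip targets that lack any of the query's bits.
        if not query_bits.issubset(target_bits):
            continue
        # Expensive check, run only on the targets that passed the screen.
        if exact_match(query, target):
            hits.append(target_id)
    return hits


# Toy stand-ins (assumptions for illustration, not chemistry):
# the "fingerprint" of a string is the set of its characters.
def toy_bits(text):
    return set(text)


# The "exact match" is a plain substring test.
def toy_exact_match(query, target):
    return query in target


names = ["cab", "bca", "cdc", "abd"]
targets = [(name, toy_bits(name), name) for name in names]
print(two_stage_search(toy_bits("ab"), "ab", targets, toy_exact_match))
# ['cab', 'abd']
```

<p>In this toy run, "cdc" is dropped by the screen without ever reaching the exact check, while "bca" passes the screen but is rejected by the exact check. The latter is the kind of incorrect hit the second stage is meant to remove.</p>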
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
