<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>ChemHack &#187; Chemoinformatics</title>
	<atom:link href="http://chemhack.com/tag/chemoinformatics/feed/" rel="self" type="application/rss+xml" />
	<link>http://chemhack.com</link>
	<description>Hacking the chemistry world.</description>
	<lastBuildDate>Sat, 18 Dec 2010 18:07:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>JME on .Net Framework &#8211; A Workaround</title>
		<link>http://chemhack.com/2009/08/jme-on-net-framework-a-workaround/</link>
		<comments>http://chemhack.com/2009/08/jme-on-net-framework-a-workaround/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 06:55:59 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Work]]></category>
		<category><![CDATA[.Net]]></category>
		<category><![CDATA[Applet]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[JME]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=311</guid>
		<description><![CDATA[Introduction JME is the most popular molecule structure editor on the web. We can integrate it into any web page or Java application. Now, we&#8217;re going to integrate JME into .NET applications via IKVM.NET. Method Although JME can be obtained at no cost, it&#8217;s still proprietary software, as it&#8217;s not open source. Any [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<h2><img class="alignright size-full wp-image-318" title="2" src="http://chemhack.com/wp-content/uploads/2009/08/21.png" alt="2" width="283" height="282" /></h2>
<p><a href="http://www.molinspiration.com/jme/" target="_blank">JME</a> is the most popular molecule structure editor on the web. We can integrate it into any web page or Java application. Now, we&#8217;re going to integrate JME into .NET applications via <a href="http://www.ikvm.net/" target="_blank">IKVM.NET</a>.</p>
<h2>Method</h2>
<p>Although JME can be obtained at no cost, it&#8217;s still proprietary software, as it&#8217;s not open source. Porting it at the source-code level is therefore not possible; the only way is to work with JME&#8217;s Java bytecode.</p>
<p>IKVM.NET is an implementation of Java for <a href="http://www.go-mono.org/" target="_blank">Mono</a> and the <a href="http://msdn.microsoft.com/netframework/" target="_blank">Microsoft .NET Framework</a>. In brief, IKVM.NET is a JVM for .NET: with it, we can run Java bytecode on the .NET Framework. Note that IKVM.NET&#8217;s support for AWT and Swing is incomplete.</p>
<p>With the few simple steps listed below, you can integrate JME into your VB.NET or C# applications.</p>
<ol>
<li>Download IKVM.NET and compile JME.jar into a .NET assembly.</li>
<li>Wrap the JME applet in a Java Swing container, JWindow.</li>
<li>Pop up the JWindow.</li>
</ol>
<p>Let&#8217;s have a look at how we can do it.</p>
<h2>Compiling JME into a .NET assembly</h2>
<p>Run from the directory containing the IKVM.NET executables, the command below converts JME from Java bytecode into .NET IL code.</p>
<pre lang="shell">C:\ikvm-0.40.0.1\bin&gt;ikvmc.exe -target:library JME.jar
Note IKVMC0002: output file is "JME.dll"</pre>
<p>Add JME.dll and the other IKVM.NET dependencies to your Visual Studio project as references, and you can then access the JME class from C#.</p>
<h2>Displaying JME</h2>
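<pre lang="c#">// IKVM maps Java packages to .NET namespaces, so at the top of the file add:
//   using java.awt;     // GridLayout
//   using javax.swing;  // JWindow

// Host the applet in an undecorated Swing window
JME applet = new JME();
JWindow floatingWindow = new JWindow();
floatingWindow.setSize(300, 300);
floatingWindow.setLayout(new GridLayout(1, 1));
floatingWindow.add(applet);

// Run the applet lifecycle by hand, then show the window
applet.init();
applet.start();
floatingWindow.setVisible(true);</pre>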
<p>With this code snippet you can display JME in C# applications and invoke the applet&#8217;s Java methods in C#.</p>
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/08/jme-on-net-framework-a-workaround/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Chemistry Software Everywhere: Handheld Calculator</title>
		<link>http://chemhack.com/2009/03/chemistry-software-everywhere/</link>
		<comments>http://chemhack.com/2009/03/chemistry-software-everywhere/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 04:48:56 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[calculator]]></category>
		<category><![CDATA[TI-Nspire]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=293</guid>
		<description><![CDATA[In the Google Group tinspire, I saw a very interesting Chemistry Library for the TI-Nspire, a TI calculator. I installed it on my handheld, and it looks nice.]]></description>
			<content:encoded><![CDATA[<p>In the Google Group <a href="http://groups.google.com/group/tinspire">tinspire</a>, I saw a very interesting <a href="http://nelsonsousa.pt/index.php?lang=en&amp;cat=2&amp;subcat=3&amp;article=33">Chemistry Library</a> for the TI-Nspire, a TI calculator. I installed it on my handheld, and it looks nice.</p>
<p><img class="screenshots" src="http://nelsonsousa.pt/images/chemistry1.en.jpg" alt="" /></p>
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/03/chemistry-software-everywhere/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Does the CDK Fingerprint Work? Something Went Wrong.</title>
		<link>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/</link>
		<comments>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/#comments</comments>
		<pubDate>Mon, 16 Mar 2009 16:11:38 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Fingerprint]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=282</guid>
		<description><![CDATA[In my last post, I doubted the accuracy of fingerprint-based substructure search and pointed out that fingerprints sometimes lost hits. In fact, something was wrong in my code: I was reading directly from an SDF file, and IteratingMDLReader does not perceive atom types or detect aromaticity automatically. This caused UniversalIsomorphismTester to produce incorrect matches. Sorry for the incorrect post. I&#8217;ve run the test again, [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://chemhack.com/archives/2009/03/255/">last post</a>, I doubted the accuracy of fingerprint-based substructure search and pointed out that fingerprints sometimes lost hits. In fact, something was wrong in my code: I was reading directly from an SDF file, and IteratingMDLReader does not perceive atom types or detect aromaticity automatically. This caused UniversalIsomorphismTester to produce incorrect matches. Sorry for the incorrect post. I&#8217;ve run the test again, using the <a href="http://chemhack.com/wp-content/uploads/2009/03/junk.smi">SMILES</a> provided by <a href="http://blog.rguha.net/?p=133" target="_blank">Guha</a> as input. The Groovy script is attached <a href="http://chemhack.com/wp-content/uploads/2009/03/substructure.groovy">here</a>.</p>
<table border="1" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td valign="top"><strong>#</strong></td>
<td valign="top"><strong>Query</strong></td>
<td valign="top"><strong>Subgraph Isomorphism</strong></td>
<td valign="top"><strong>Extended CDK</strong></td>
<td valign="top"><strong>Missing</strong></td>
<td valign="top"><strong>Extra</strong></td>
</tr>
<tr>
<td valign="top">1</td>
<td valign="top"><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /></td>
<td valign="top">20</td>
<td valign="top">24</td>
<td valign="top">0</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">2</td>
<td valign="top"><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /></td>
<td valign="top">7</td>
<td valign="top">103</td>
<td valign="top">0</td>
<td valign="top">96</td>
</tr>
<tr>
<td valign="top">3</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /></td>
<td valign="top">69</td>
<td valign="top">100</td>
<td valign="top">0</td>
<td valign="top">31</td>
</tr>
<tr>
<td valign="top">4</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /></td>
<td valign="top">6</td>
<td valign="top">10</td>
<td valign="top">0</td>
<td valign="top">4</td>
</tr>
<tr>
<td valign="top">5</td>
<td valign="top"><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /></td>
<td valign="top">31</td>
<td valign="top">41</td>
<td valign="top">0</td>
<td valign="top">10</td>
</tr>
<tr>
<td valign="top">6</td>
<td valign="top"><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /></td>
<td valign="top">23</td>
<td valign="top">23</td>
<td valign="top">0</td>
<td valign="top">0</td>
</tr>
<tr>
<td valign="top">7</td>
<td valign="top"><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></td>
<td valign="top">7</td>
<td valign="top">75</td>
<td valign="top">0</td>
<td valign="top">68</td>
</tr>
</tbody>
</table>
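<p>Well, the CDK fingerprint is OK: it misses no hits in any of the seven queries, and the extra hits are false positives that a subsequent subgraph isomorphism check can remove.</p>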
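<p>The Missing and Extra columns are plain set differences between the two hit lists. A minimal sketch in JavaScript, assuming each search returns an array of molecule identifiers; the function name <code>compareHits</code> and the sample IDs are made up for illustration and are not part of the attached Groovy script:</p>

```javascript
// Compare fingerprint hits against subgraph isomorphism hits (the reference).
// Both arguments are arrays of molecule identifiers; the IDs below are invented.
function compareHits(isomorphismHits, fingerprintHits) {
    const reference = new Set(isomorphismHits);
    const candidates = new Set(fingerprintHits);
    return {
        // true substructure matches that the fingerprint search dropped
        missing: isomorphismHits.filter(id => !candidates.has(id)),
        // fingerprint hits that are not real substructure matches
        extra: fingerprintHits.filter(id => !reference.has(id))
    };
}

const report = compareHits(["m1", "m2", "m3"], ["m1", "m2", "m3", "m7"]);
console.log(report.missing.length, report.extra.length);   // prints: 0 1
```

<p>A row with Missing equal to 0 means the fingerprint search is safe to use as a pre-filter for that query; Extra only tells us how much work is left for the exact match.</p>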
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/03/does-the-cdk-fingeprint-works-something-went-wrong/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Do the CDK Fingerprints Work? Substructure search</title>
		<link>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/</link>
		<comments>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/#comments</comments>
		<pubDate>Thu, 05 Mar 2009 17:41:14 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Fingerprint]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[substructure]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=255</guid>
		<description><![CDATA[The test code used in this post has a fatal error, which made the test results completely incorrect. Please see details here. Abstract: Doubting the accuracy of fingerprint-based molecule substructure search, I ran a test comparing the different fingerprints implemented in CDK against subgraph isomorphism. The result is very interesting: CDK fingerprints should [...]]]></description>
			<content:encoded><![CDATA[<p><span style="color: #ff0000;">The test code used in this post has a fatal error, which made the test results completely incorrect. Please see details <a href="http://chemhack.com/archives/2009/03/282/">here</a>.</span></p>
<p>Abstract: Doubting the accuracy of fingerprint-based molecule substructure search, I ran a test comparing the different fingerprints implemented in CDK against subgraph isomorphism. The result is very interesting: CDK fingerprints should never be used alone in substructure search, but by combining CDK fingerprints with subgraph isomorphism we can strike a balance between speed and accuracy.</p>
<h2>Background</h2>
<p><a href="http://blog.rguha.net/">Guha</a> wrote a <a href="http://blog.rguha.net/?p=29">post</a> about benchmarking different types of fingerprints, using the benchmarking strategies described by <a href="http://dx.doi.org/10.1021/ci0500177">Bender &amp; Glen</a> and <a href="http://dx.doi.org/10.1021/ci050276w">Godden</a> et al. The benchmark is based on Tanimoto similarity, which is the foundation of similar-structure search in most chemistry databases.</p>
<p>Another important feature of molecule structure search is substructure search. Currently, both subgraph isomorphism and fingerprints are used for it. Adel Golovin and Kim Henrick’s article <a href="http://pubs.acs.org/doi/abs/10.1021/ci8003013">Chemical Substructure Search in SQL</a> provides a pure-SQL subgraph isomorphism strategy, and <a href="http://depth-first.com">Rich Apodaca</a>’s <a href="http://depth-first.com/articles/2008/10/02/fast-substructure-search-using-open-source-tools-part-1-fingerprints-and-databases">post</a> on fingerprint-based substructure search in MySQL found a solution using MySQL’s limited binary operations.</p>
<p>Fingerprint search is obviously faster, as it is much less expensive than subgraph isomorphism. But the real question is: does the fingerprint method really find all the substructure matches? And are any hits wrongly reported as substructure matches?</p>
<h2>Method</h2>
<p>Using <a href="http://www.drugbank.ca">DrugBank</a> small-molecule drugs as the test dataset and several hand-drawn structures of different types as queries, I performed substructure searches using subgraph isomorphism and the fingerprints implemented in CDK. If a hit from the fingerprint method is also found by the subgraph isomorphism method, I count it as a correct hit; otherwise it is counted as an incorrect hit.</p>
<p>All fingerprints implemented in CDK were tested, generated using Fingerprinter, ExtendedFingerprinter, GraphOnlyFingerprinter, SubstructureFingerprinter, MACCSFingerprinter and EStateFingerprinter. All parameters were left at their defaults.</p>
<p>To test the accuracy on queries of different complexity, I drew several structures, as listed below.</p>
<p><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></p>
<h2>Results</h2>
<p>The results are listed below in the format &#8220;A/B&#8221;. A is the number of correct hits, i.e. hits also found by the subgraph isomorphism method; B is the number of incorrect hits. For example, 31/49 means that 31 correct hits and 49 incorrect hits were found. A higher A and a lower B are better.</p>
<table border="1" cellspacing="0" cellpadding="0" align="center">
<tbody>
<tr>
<td valign="top"><strong>#</strong></td>
<td valign="top"><strong>Query</strong></td>
<td valign="top"><strong>Subgraph Isomorphism</strong></td>
<td valign="top"><strong>EState</strong></td>
<td valign="top"><strong>MACCS</strong></td>
<td valign="top"><strong>Standard CDK</strong></td>
<td valign="top"><strong>Extended CDK</strong></td>
<td valign="top"><strong>Graph-only CDK Fingerprint</strong></td>
<td valign="top"><strong>Substructure Fingerprint</strong></td>
</tr>
<tr>
<td valign="top">1</td>
<td valign="top"><img class="size-thumbnail wp-image-258 alignnone" title="e59bbee78987-101" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-101-150x150.png" alt="e59bbee78987-101" width="120" height="120" /></td>
<td valign="top">31</td>
<td valign="top">0/0</td>
<td valign="top">0/43</td>
<td valign="top"><span style="color: red;">31/49</span></td>
<td valign="top">31/54</td>
<td valign="top">31/574</td>
<td valign="top">0/0</td>
</tr>
<tr>
<td valign="top">2</td>
<td valign="top"><img class="size-full wp-image-256 alignnone" title="e59bbee78987-9" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-9.png" alt="e59bbee78987-9" width="150" height="121" /></td>
<td valign="top">54</td>
<td valign="top">0/0</td>
<td valign="top">0/1</td>
<td valign="top"><span style="color: red;">21/95</span></td>
<td valign="top">18/84</td>
<td valign="top">54/1512</td>
<td valign="top">11/30</td>
</tr>
<tr>
<td valign="top">3</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-259" title="e59bbee78987-11" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-11-150x150.png" alt="e59bbee78987-11" width="135" height="135" /></td>
<td valign="top">29</td>
<td valign="top">0/0</td>
<td valign="top">0/20</td>
<td valign="top"><span style="color: red;">29/74</span></td>
<td valign="top">29/83</td>
<td valign="top">29/1793</td>
<td valign="top">0/5</td>
</tr>
<tr>
<td valign="top">4</td>
<td valign="top"><img class="alignnone size-thumbnail wp-image-260" title="e59bbee78987-12" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-12-150x150.png" alt="e59bbee78987-12" width="135" height="135" /></td>
<td valign="top">3</td>
<td valign="top">0/0</td>
<td valign="top">0/85</td>
<td valign="top">3/6</td>
<td valign="top"><span style="color: red;">3/3</span></td>
<td valign="top">3/36</td>
<td valign="top">0/4</td>
</tr>
<tr>
<td valign="top">5</td>
<td valign="top"><img class="alignnone size-full wp-image-261" title="e59bbee78987-13" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-13.png" alt="e59bbee78987-13" width="151" height="81" /></td>
<td valign="top">31</td>
<td valign="top">0/0</td>
<td valign="top">29/93</td>
<td valign="top">31/14</td>
<td valign="top"><span style="color: red;">31/13</span></td>
<td valign="top">31/1593</td>
<td valign="top">27/53</td>
</tr>
<tr>
<td valign="top">6</td>
<td valign="top"><img class="alignnone size-full wp-image-262" title="e59bbee78987-15" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-15.png" alt="e59bbee78987-15" width="85" height="112" /></td>
<td valign="top">23</td>
<td valign="top">0/0</td>
<td valign="top">0/0</td>
<td valign="top"><span style="color: red;">23/0</span></td>
<td valign="top"><span style="color: red;">23/0</span></td>
<td valign="top">23/23</td>
<td valign="top">0/0</td>
</tr>
<tr>
<td valign="top">7</td>
<td valign="top"><img class="alignnone size-full wp-image-263" title="e59bbee78987-16" src="http://chemhack.com/wp-content/uploads/2009/03/e59bbee78987-16.png" alt="e59bbee78987-16" width="134" height="102" /></td>
<td valign="top">9</td>
<td valign="top">0/0</td>
<td valign="top">0/0</td>
<td valign="top">8/83</td>
<td valign="top"><span style="color: red;">8/63</span></td>
<td valign="top">9/237</td>
<td valign="top">5/40</td>
</tr>
</tbody>
</table>
<h2>Conclusions &amp; Questions</h2>
<p>As we can see from the table, the MACCS, EState and Substructure fingerprints perform very badly: they found very few hits, and sometimes none at all. They were not designed for this task, so the result is not surprising.</p>
<p>Between the standard and extended CDK fingerprints, the standard one sometimes works better. (Maybe I should use a longer extended fingerprint rather than the default length, as discussed in Guha&#8217;s <a href="http://blog.rguha.net/?p=29">post</a>.) On queries with complex ring systems the extended CDK fingerprint works better, but the advantage is not obvious.</p>
<p>But I wonder why the hashed fingerprints still miss some results (see query 2 and query 7). Why doesn&#8217;t the superstructure share the same bits as the substructure? Is it because the structure is too simple?</p>
<p>Since many incorrect hits are found, CDK fingerprints should never be used alone in substructure search. But consider combining CDK fingerprints with subgraph isomorphism: by doing the fingerprint search first, we avoid running the expensive subgraph isomorphism match on all targets, and so strike a balance between speed and accuracy.</p>
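<p>The combined approach can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not CDK code: fingerprints are modelled as arrays of set-bit positions, the screen is assumed never to drop a true match, and <code>isSubgraph</code> is a hypothetical callback standing in for a real subgraph isomorphism test. The toy matcher in the usage part is plain string containment, which is not chemistry.</p>

```javascript
// Two-stage substructure search: cheap fingerprint screen, then exact check.
// Fingerprints are modelled as arrays of set-bit positions; `isSubgraph` is a
// hypothetical stand-in for a real subgraph isomorphism test.
function passesScreen(queryBits, targetBits) {
    const target = new Set(targetBits);
    // a substructure may only set bits that the superstructure also sets
    return queryBits.every(bit => target.has(bit));
}

function substructureSearch(query, targets, isSubgraph) {
    const candidates = targets.filter(t => passesScreen(query.bits, t.bits));
    // only the survivors of the screen pay for the expensive exact match
    return candidates.filter(t => isSubgraph(query, t));
}

// Toy data: the bit positions are invented for the example.
const query = { smiles: "CC", bits: [1, 4] };
const targets = [
    { smiles: "CCC", bits: [1, 4, 9] },   // passes the screen, real match
    { smiles: "CNC", bits: [1, 4, 7] },   // passes the screen, false positive
    { smiles: "O",   bits: [2] }          // rejected by the screen
];
const toyMatcher = (q, t) => t.smiles.includes(q.smiles);   // NOT real chemistry
const hits = substructureSearch(query, targets, toyMatcher);
console.log(hits.map(t => t.smiles));   // only "CCC" survives both stages
```

<p>In this example the exact matcher runs on two targets instead of three; on a real database the saving depends on how selective the fingerprint is for the query.</p>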
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/03/does-the-cdk-fingerprints-work-substructure-search/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Building Search Engine for All RDBMSs: Data Synchronization</title>
		<link>http://chemhack.com/2009/02/building-search-engine-for-all-rdbmss-data-synchronization/</link>
		<comments>http://chemhack.com/2009/02/building-search-engine-for-all-rdbmss-data-synchronization/#comments</comments>
		<pubDate>Fri, 27 Feb 2009 19:18:06 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[RDBMS]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Synchronize]]></category>
		<category><![CDATA[trigger]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=246</guid>
		<description><![CDATA[As discussed in my last post, Structure Search Engine for All Major RDBMSs, if you want to build a structure search engine for all RDBMSs, the best way is to build it outside the RDBMS. If the structure search index is stored outside the RDBMS, you have to find a way to synchronize the SQL [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://chemhack.com/wp-content/uploads/2009/02/sync_how_to-300x290.jpg" alt="Sync" title="Sync" width="300" height="290" class="alignright size-medium wp-image-251" />As discussed in my last post, <a href="http://chemhack.com/archives/2009/02/240/">Structure Search Engine for All Major RDBMSs</a>, if you want to build a structure search engine for all RDBMSs, the best way is to build it outside the RDBMS. If the structure search index is stored outside the RDBMS, you have to find a way to synchronize the SQL table and the search index. Most modern RDBMSs provide triggers to monitor modifications of a SQL table, so we can easily implement a one-way synchronization from the SQL table to the structure search index.</p>
<p>OK, let’s see how this works.</p>
<p>NOTE: I chose MySQL as the development platform, so the SQL statements may need small changes to work on other RDBMSs. SQL statements for other platforms will be included in the final release, as my goal is to support all major RDBMSs, including MySQL, PostgreSQL, Microsoft SQL Server, Oracle and IBM DB2 (if possible).</p>
<p>Suppose we have a database named “a_chemical_database” and a table named “molecules”.</p>
<pre lang="SQL">mysql> use a_chemical_database;
mysql> SELECT * FROM molecules;
+----+-----------+-----------+-----------+
| id | smiles    | property1 | property2 |
+----+-----------+-----------+-----------+
|  1 | Cc1ccccc1 | 84        | liquid    |
|  2 | CCC       | 36        | gas       |
+----+-----------+-----------+-----------+
2 rows in set (0.00 sec)
</pre>
<p>The “molecules” table holds the structure information (the “smiles” column) alongside other properties. Nothing distinguishes it from an ordinary SQL table, i.e. you don’t need to design your SQL tables specially to do structure search.</p>
<p>We want our search engine to know where a modification occurred when someone changes the data. It’s possible to poll the “molecules” table directly at intervals, but that is a very expensive task if the table is really big. With triggers, we know exactly which kind of modification (INSERT, UPDATE or DELETE) was performed on which row. Before we can create the triggers, we need to create a table to log the modifications.</p>
<pre lang="SQL">mysql> CREATE TABLE `syncs` (
    ->   `id` int(11) NOT NULL auto_increment,
    ->   `mod_action` varchar(10) default NULL,
    ->   `prim_key` int(11) default NULL,
    ->   PRIMARY KEY  (`id`)
    -> );
</pre>
<p>In the “syncs” table we store the type of modification (column “mod_action”) and the affected row (column “prim_key”).</p>
<p>If data in the “molecules” table is changed, we expect a new record to be inserted into the “syncs” table, for example:</p>
<pre lang="SQL">mysql> SELECT * FROM syncs ;
+----+------------+----------+
| id | mod_action | prim_key |
+----+------------+----------+
|  1 | INSERT     |        2 |
|  2 | UPDATE     |        2 |
|  3 | DELETE     |        1 |
+----+------------+----------+
</pre>
<p>Now we create triggers.</p>
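<pre lang="SQL">
-- Log every inserted row.
CREATE TRIGGER molecules_insert AFTER INSERT ON molecules
FOR EACH ROW INSERT INTO syncs(mod_action,prim_key) VALUES('INSERT',NEW.id);

-- Log an update only when the structure itself changed.
-- The mysql client needs a different delimiter for the BEGIN ... END body.
DELIMITER //
CREATE TRIGGER molecules_update AFTER UPDATE ON molecules
FOR EACH ROW
BEGIN
  -- <=> is MySQL's NULL-safe equality; unlike LIKE it has no % and _ wildcards.
  IF NOT (OLD.smiles <=> NEW.smiles) THEN
    INSERT INTO syncs(mod_action,prim_key) VALUES('UPDATE',NEW.id);
  END IF;
END//
DELIMITER ;

-- Log every deleted row.
CREATE TRIGGER molecules_delete AFTER DELETE ON molecules
FOR EACH ROW INSERT INTO syncs(mod_action,prim_key) VALUES('DELETE', OLD.id);
</pre>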
<p>That’s it. Let’s modify the “molecules” table and see what happens.</p>
<pre lang="SQL">mysql> INSERT INTO molecules(smiles) VALUES("CC=CCN");
mysql> UPDATE molecules SET smiles='CC1CCC1' WHERE id=2;
mysql> DELETE FROM molecules WHERE id=3;

mysql> SELECT * FROM syncs;
+----+------------+----------+
| id | mod_action | prim_key |
+----+------------+----------+
|  6 | DELETE     |        3 |
|  5 | UPDATE     |        2 |
|  4 | INSERT     |        3 |
+----+------------+----------+
3 rows in set (0.00 sec)
</pre>
<p>The triggers successfully recorded the modifications.</p>
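<p>On the search-engine side, the rows of “syncs” can then be replayed against the external index. The following JavaScript sketch only illustrates the idea and is not the engine’s actual code: the index is modelled as a plain Map from primary key to SMILES, and <code>applySyncLog</code> and <code>fetchSmiles</code> are hypothetical names, the latter standing in for a lookup of the current row in “molecules”.</p>

```javascript
// Replay rows of the `syncs` table against an external index.
// `index` is a plain Map from primary key to SMILES; `fetchSmiles` is a
// hypothetical lookup that reads the current row from the `molecules` table
// and returns undefined when the row no longer exists.
function applySyncLog(index, logRows, fetchSmiles) {
    // replay in the order the modifications happened
    const ordered = [...logRows].sort((x, y) => x.id - y.id);
    for (const row of ordered) {
        if (row.mod_action === "DELETE") {
            index.delete(row.prim_key);
        } else {
            // INSERT and UPDATE both mean: (re)index the current structure
            const smiles = fetchSmiles(row.prim_key);
            if (smiles !== undefined) index.set(row.prim_key, smiles);
        }
    }
    // highest log id handled, so that the next run can skip older rows
    return ordered.length ? ordered[ordered.length - 1].id : null;
}

// State after the three statements above: row 3 was inserted and deleted again.
const index = new Map([[1, "Cc1ccccc1"], [2, "CCC"]]);
const moleculesTable = new Map([[1, "Cc1ccccc1"], [2, "CC1CCC1"]]);
const log = [
    { id: 6, mod_action: "DELETE", prim_key: 3 },
    { id: 5, mod_action: "UPDATE", prim_key: 2 },
    { id: 4, mod_action: "INSERT", prim_key: 3 }
];
const lastId = applySyncLog(index, log, key => moleculesTable.get(key));
console.log(lastId, index.get(2), index.has(3));   // prints: 6 CC1CCC1 false
```

<p>Because the log stores only the primary key, the replay always indexes the current structure of a row, so a row that was inserted and deleted between two runs is simply skipped.</p>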
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/02/building-search-engine-for-all-rdbmss-data-synchronization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JavaScript for Cheminformatics:JavaScript Molecule Editor and 3D Structure Viewer</title>
		<link>http://chemhack.com/2009/01/javascript-for-cheminformaticsjavascript-molecule-editor-and-3d-structure-viewer/</link>
		<comments>http://chemhack.com/2009/01/javascript-for-cheminformaticsjavascript-molecule-editor-and-3d-structure-viewer/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 09:15:01 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[jsMolEditor]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=217</guid>
		<description><![CDATA[Very happy to read Rich&#8217;s post. It was one month ago that I released a demo of a molecule structure renderer. Sorry for disappearing from the Internet for so long, but I had to cope with final exams for 9 courses. However, I passed all of them and am now enjoying my winter vacation for 35 [...]]]></description>
			<content:encoded><![CDATA[<p>Very happy to read <a href="http://depth-first.com/articles/2009/01/06/javascript-for-cheminformatics-cross-compiling-java-to-javascript-with-gwt-revisited">Rich&#8217;s post</a>. It was one month ago that I released a <a href="http://chemhack.com/mx-gwt/demo-molecule-structure-rendering/">demo</a> of a molecule structure renderer. Sorry for disappearing from the Internet for so long, but I had to cope with final exams for 9 courses. However, I passed all of them and am now enjoying a 35-day winter vacation, long enough to make the title of this post come true.</p>
<p>Today, I have something to show you.</p>
<p>Have a look at the renderer and editor <a href="http://chemhack.com/gwt/com.chemhack.jsMolEditor.Editor/Editor.html" target="_blank">demo here</a>.</p>
<p>You can click the first two buttons to load a demo structure at different sizes, and you can read in a demo molecule. Move your mouse over atoms, and drag on atoms and in white space, to see what happens. If you&#8217;re using IE, especially IE6, dragging may not be very smooth, as IE is famous for its super slow JavaScript engine.</p>
<p>OK, so what happens after you click the buttons? The buttons that load the editor have an onClick attribute containing the JavaScript below.</p>
<pre lang="javascript">initEditor('editor1',500,300);</pre>
<p>Behind this function is:</p>
<pre lang="javascript">function initEditor(divID, width, height){
    if(window.__initEditor){
        // the GWT module is loaded: clear the placeholder and start the editor
        document.getElementById(divID).innerHTML="";
        __initEditor(divID, width, height);
    }else{
        // not loaded yet: show a sized placeholder and retry in one second
        document.getElementById(divID).style.width=width+"px";
        document.getElementById(divID).style.height=height+"px";
        document.getElementById(divID).innerHTML="Loading...";
        setTimeout(function(){initEditor(divID, width, height);}, 1000);
    }
}</pre>
<p>And this is the GWT Java code:</p>
<pre lang="java">private static native void injectJSMethods()/*-{
    $wnd.__readMolFile = function(divID, fileContent){
        @com.chemhack.jsMolEditor.client.Editor::readMolFile(Ljava/lang/String;Ljava/lang/String;)(divID, fileContent);
    };

    $wnd.__initEditor = function(divID, width, height){
        @com.chemhack.jsMolEditor.client.Editor::initEditor(Ljava/lang/String;II)(divID, width, height);
    };
}-*/;</pre>
<p>I think Rich&#8217;s problem has been partly solved, as this shows how we can cross the boundary between hand-written JavaScript and GWT-generated JavaScript. If you&#8217;d like to expose the whole library to the JavaScript world, just write a wrapper for each Java method you&#8217;d like to call. GWT&#8217;s code generator may be a good helper here.</p>
<p>I call this molecule editor jsMolEditor, and I plan to release its first fully functional alpha version in two or three weeks. jsMolEditor will be released under the GPL license, as its code mainly comes from MX-GWT and JChemPaint.</p>
<p>So how about Jmol in JavaScript? This <a href="http://www.redbrick.dcu.ie/~noel/blog/molecproc/">demo</a> shows that it&#8217;s not mission impossible, but we have a long way to go.</p>
<p>I promise to release the first alpha in two or three weeks, and to keep you informed of how the work is going with at least two posts per week.  <img src='http://chemhack.com/wp-includes/images/smilies/icon_lol.gif' alt=':dsadsad:' class='wp-smiley' /> </p>
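<p>On the hand-written side, each wrapper has to cope with the GWT module not being loaded yet, just as initEditor does. A small sketch of a reusable helper, assuming only that GWT exports its functions as properties of window; <code>callWhenReady</code> and the <code>readMolFile</code> wrapper are hypothetical names, and the scheduler is passed in as a parameter only so that the helper can also run outside a browser:</p>

```javascript
// Call a GWT-exported function once it exists, polling until then.
// `scope` plays the role of `window`; `schedule` is a setTimeout-like function.
function callWhenReady(scope, name, args, schedule, delayMs) {
    if (typeof scope[name] === "function") {
        scope[name].apply(scope, args);
        return true;            // called immediately
    }
    schedule(function () {
        callWhenReady(scope, name, args, schedule, delayMs);
    }, delayMs);
    return false;               // deferred, will retry
}

// A hand-written wrapper around the exported __readMolFile (browser only).
function readMolFile(divID, fileContent) {
    callWhenReady(window, "__readMolFile", [divID, fileContent], setTimeout, 1000);
}
```

<p>One such wrapper per exported method keeps the polling logic in a single place.</p>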
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2009/01/javascript-for-cheminformaticsjavascript-molecule-editor-and-3d-structure-viewer/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>More Faster Fingerprint Search with Java &amp; CDK</title>
		<link>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/</link>
		<comments>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 22:11:24 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Cheminformatics]]></category>
		<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Fingerprint]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=137</guid>
		<description><![CDATA[Recently, I wrote a post named Faster Fingerprint Search with Java &#38; CDK . It&#8217;s fast enough, with a response time of 300 ms for a database of 100000 compounds as you can see from the chart above. If we do some simple improvement on it, it could be even faster.   As we all know, when we [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I wrote a post named <a href="http://chemhack.com/archives/2008/11/110/">Faster Fingerprint Search with Java &amp; CDK</a>. It&#8217;s fast enough, with a response time of 300 ms for a database of 100000 compounds, as you can see from the chart below. With a simple improvement, it can be even faster.</p>
<p><span><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-121.png"><img class="alignnone size-full wp-image-140" title="Darkness" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-121.png" alt="" width="452" height="450" /></a><br />
</span></p>
<p>As we all know, when we search for similar structures, our judgement of similarity is based on the Tanimoto coefficient. If &#8216;a&#8217; is the number of TRUE bits in one fingerprint, &#8216;b&#8217; is the number of TRUE bits in the other, and &#8216;c&#8217; is the number of TRUE bits they have in common, the Tanimoto coefficient is c/(a+b-c). Searching for fingerprints with a minimum Tanimoto coefficient λ means requiring c/(a+b-c) &gt; λ. Because c counts the common TRUE bits, it can never be greater than a or b, so c/(a+b-c) is at most a/b when a ≤ b and at most b/a when b ≤ a. Both ratios must therefore exceed λ, which gives b*λ&lt;a&lt;b/λ, or equivalently a*λ&lt;b&lt;a/λ.</p>
<p>With this inequality in hand, we don&#8217;t need to iterate over all the fingerprints to do a similar structure search. If we sort the fingerprints by their number of TRUE bits, we can significantly reduce the range of the database we need to screen.</p>
<p>Here is the distribution of fingerprint darkness (the number of TRUE bits per fingerprint) in my database of 80000 commercial compounds.</p>
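<p>The screening step can be sketched in plain Java as follows. This is only a minimal illustration, not the code used for the measurements in this post: the class name PrescreenSketch and its methods are hypothetical, and it computes the Tanimoto coefficient directly on java.util.BitSet instead of calling CDK. It sorts the fingerprints by their number of TRUE bits once, then compares the query only against fingerprints whose bit count b satisfies a*λ&lt;b&lt;a/λ.</p>
<pre lang="java">import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pre-screen fingerprints by their TRUE-bit count.
public class PrescreenSketch {
    private final BitSet[] sorted; // fingerprints, ascending by TRUE-bit count
    private final int[] counts;    // counts[i] == sorted[i].cardinality()

    public PrescreenSketch(BitSet[] fingerprints) {
        sorted = fingerprints.clone();
        Arrays.sort(sorted, Comparator.comparingInt(BitSet::cardinality));
        counts = new int[sorted.length];
        for (int i = 0; i &lt; sorted.length; i++) {
            counts[i] = sorted[i].cardinality();
        }
    }

    // Tanimoto coefficient c/(a+b-c); two empty fingerprints score 0 here.
    public static double tanimoto(BitSet x, BitSet y) {
        BitSet common = (BitSet) x.clone();
        common.and(y);
        int c = common.cardinality();
        int union = x.cardinality() + y.cardinality() - c;
        return union == 0 ? 0.0 : (double) c / union;
    }

    // Binary search: index of the first entry whose bit count exceeds bound.
    private int firstAbove(double bound) {
        int lo = 0;
        int hi = counts.length;
        while (lo &lt; hi) {
            int mid = (lo + hi) / 2;
            if (counts[mid] &gt; bound) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    // Returns the fingerprints whose Tanimoto coefficient with query exceeds lambda.
    public List&lt;BitSet&gt; search(BitSet query, double lambda) {
        int a = query.cardinality();
        double upper = a / lambda; // bit counts at or above this cannot match
        List&lt;BitSet&gt; hits = new ArrayList&lt;&gt;();
        for (int i = firstAbove(a * lambda); i &lt; sorted.length &amp;&amp; counts[i] &lt; upper; i++) {
            if (tanimoto(sorted[i], query) &gt; lambda) {
                hits.add(sorted[i]);
            }
        }
        return hits;
    }
}</pre>
<p>For a query with a = 100 TRUE bits and λ = 0.8, only fingerprints with more than 80 and fewer than 125 TRUE bits are compared; everything else is skipped without computing a coefficient.</p>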
<p><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-3.png"><img class="alignnone size-full wp-image-141" title="Darkness" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-3.png" alt="" width="500" height="507" /></a></p>
<p>And here is the search time after the new search method is applied.</p>
<p><a href="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-9.png"><img class="alignnone size-full wp-image-142" title="Search Time" src="http://chemhack.com/wp-content/uploads/2008/11/e59bbee78987-9.png" alt="" width="500" height="430" /></a></p>
<p>Extremely fast!</p>
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2008/11/more-faster-fingerprint-search-with-java-cdk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Faster Fingerprint Search with Java &amp; CDK</title>
		<link>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/</link>
		<comments>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/#comments</comments>
		<pubDate>Tue, 11 Nov 2008 10:54:27 +0000</pubDate>
		<dc:creator>Duan Lian</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[CDK]]></category>
		<category><![CDATA[Cheminformatics]]></category>
		<category><![CDATA[Chemoinformatics]]></category>
		<category><![CDATA[Fingerprint]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://chemhack.com/?p=110</guid>
		<description><![CDATA[Rich Apodaca wrote a great series of posts named Fast Substructure Search Using Open Source Tools providing details on substructure search with MySQL. However, poor binary data operation functions of MySQL limited the implementation of similar structure search which typically depends on the calculation of Tanimoto coefficient. We are going to use Java &#38; CDK to add this feature. As default [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://depth-first.com/">Rich Apodaca</a> wrote a great series of posts named <em>Fast Substructure Search Using Open Source Tools</em>, providing details on substructure search with MySQL. However, MySQL&#8217;s weak functions for operating on binary data limit the implementation of similar structure search, which typically depends on calculating the Tanimoto coefficient. We are going to use Java &amp; CDK to add this feature.</p>
<p>As the default output of the CDK fingerprinter, <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/BitSet.html">java.util.BitSet</a>, which implements the <a title="interface in java.io" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/Serializable.html">Serializable</a> interface, is a convenient format for storing fingerprint data. Java itself provides several collections in package java.util, such as <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html">ArrayList</a>, <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/LinkedList.html">LinkedList</a> and <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/Vector.html">Vector</a>. To provide web access to the search engine, the thread-unsafe ArrayList and LinkedList have to be ruled out. How about Vector? Once all the fingerprint data is prepared, the only collection operation a similarity search needs is iteration. No adds, no deletes. So a lightweight array is enough.</p>
<p>Most of the molecule information is stored in a MySQL database, so we map each fingerprint to its corresponding row in the data table. Here is the MolDFData class; a long field stores the primary key of the corresponding row.</p>
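<pre lang="java">import java.io.Serializable;
import java.util.BitSet;

// Pairs a fingerprint with the primary key of its row in the MySQL table.
public class MolDFData implements Serializable {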
    private long id;
    private BitSet fingerprint;

    public MolDFData(long id, BitSet fingerprint) {
        this.id = id;
        this.fingerprint = fingerprint;
    }
    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public BitSet getFingerprint() {
        return fingerprint;
    }

    public void setFingerprint(BitSet fingerprint) {
        this.fingerprint = fingerprint;
    }
}</pre>
<p>This is how we store our fingerprints.</p>
<pre lang="java">private MolDFData[] arrayData;</pre>
<p>The similarity search itself is no big deal. Just calculate the Tanimoto coefficient for each fingerprint and, if it&#8217;s bigger than the minimum similarity you set, add that fingerprint to the result.</p>
<pre lang="java">    public List&lt;SearchResultData&gt; searchTanimoto(BitSet bt, float minSimilarity) {
        List&lt;SearchResultData&gt; resultList = new LinkedList&lt;SearchResultData&gt;();
        for (int i = 0; i &lt; arrayData.length; i++) {
            MolDFData aListData = arrayData[i];
            try {
                float coefficient = Tanimoto.calculate(aListData.getFingerprint(), bt);
                if (coefficient &gt; minSimilarity) {
                    resultList.add(new SearchResultData(aListData.getId(), coefficient));
                }
            } catch (CDKException e) {
                // Skip fingerprints that CDK cannot compare with the query.
            }
        }
        // Sort once, after all hits are collected, instead of on every iteration.
        Collections.sort(resultList);
        return resultList;
    }</pre>
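<p>The SearchResultData class is not shown in this post. Since the result list is passed to Collections.sort, it has to implement Comparable; a minimal sketch under that assumption could look like this (the field names, the getters and the descending sort order are assumptions for illustration, not necessarily the original implementation):</p>
<pre lang="java">// Hypothetical sketch of the result holder used by searchTanimoto.
public class SearchResultData implements Comparable&lt;SearchResultData&gt; {
    private final long id;           // primary key of the matching row
    private final float coefficient; // Tanimoto coefficient against the query

    public SearchResultData(long id, float coefficient) {
        this.id = id;
        this.coefficient = coefficient;
    }

    public long getId() {
        return id;
    }

    public float getCoefficient() {
        return coefficient;
    }

    // Assumed ordering: higher coefficients first, so the best matches lead the list.
    @Override
    public int compareTo(SearchResultData other) {
        return Float.compare(other.coefficient, this.coefficient);
    }
}</pre>
<p>With this ordering, the first element of the sorted result list is the most similar compound.</p>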
<p>Pretty ugly code? Maybe. But it really works, at an acceptable speed. Tests were done using the code below on a MacBook (Intel Core Duo 1.83 GHz, 2 GB RAM).</p>
<pre lang="java">                long t3 = System.currentTimeMillis();
                List&lt;SearchResultData&gt; listResult = se.searchTanimoto(bs, 0.8f);
                long t4 = System.currentTimeMillis();
                System.out.println("Thread: Search done in " + (t4 - t3) + " ms.");</pre>
<p>In my database of 87364 commercial compounds, it takes 335 ms.</p>
]]></content:encoded>
			<wfw:commentRss>http://chemhack.com/2008/11/faster-fingerprint-search-with-java-cdk/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

