Chemoinformatics or cheminformatics? Well, maybe a meaningless question.
Google has just released the Books Ngram Viewer, which lets you query word frequencies in digitized books. It seems that in the 21st century “chemoinformatics” is the preferred spelling in publications. Does anyone know which books introduced these two words?
G.cn used to be the shortest domain for reaching Google China (google.cn), but it is now the shortest way to Google Hong Kong. Finally, and just as expected, Google is moving out of mainland China. All requests to google.cn and g.cn are redirected to google.com.hk, with a line at the bottom of the page that says “Welcome to our new home in China”. Well, Google, you’re right. You’re still in China, in a SAR that Chinese shitizens need to pass passport control to enter.
Farewell, Google China. Fare well, 谷歌.
My last post was already half a year ago. I am now doing an internship in Chris’s group. EBI is a really nice place to work. Yesterday, in a Thai restaurant, Chris, Kalaivani, Leonid, Mark, Paula and I took a photo together. Paula is about to give birth, so best wishes to her and her baby.
Although JME can be obtained at no cost, it is still proprietary software because it is not open source. Any porting method that works at the source-code level will not work for JME. The only way is to work with the JME Java bytecode.
IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. In brief, IKVM.NET is a JVM for .NET. With IKVM.NET we can run Java bytecode on the .NET Framework. Note that IKVM.NET’s support for AWT and Swing is incomplete.
With several simple steps listed below, you’re able to integrate JME into your VB.Net or C# applications.
Let’s have a look at how we can do it.
From the directory containing the IKVM.NET executables, the command below converts JME from Java bytecode into .NET IL code.
    C:\ikvm-0.40.0.1\bin>ikvmc.exe -target:library JME.jar
    Note IKVMC0002: output file is "JME.dll"
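Add JME.dll and the IKVM.NET dependencies it needs to your Visual Studio project as references; you can then access the JME class from C#.
    using java.awt;      // GridLayout (Java packages appear as .NET namespaces under IKVM.NET)
    using javax.swing;   // JWindow

    // Host the JME applet in a floating Swing window.
    JME applet = new JME();
    JWindow floatingWindow = new JWindow();
    floatingWindow.setSize(300, 300);
    floatingWindow.setLayout(new GridLayout(1, 1));
    floatingWindow.add(applet);
    // Run the applet life cycle by hand, since there is no browser to do it.
    applet.init();
    applet.start();
    floatingWindow.setVisible(true);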
With this code snippet you can display JME in C# applications and invoke the applet’s Java methods in C#.
In my last post I doubted the accuracy of fingerprint-based substructure search and pointed out that fingerprints sometimes lost hits. In fact, something was wrong in my code. I was reading directly from SDF, and IteratingMDLReader does not perceive atom types or detect aromaticity automatically. This caused UniversalIsomorphismTester to produce incorrect matches. Sorry for the incorrect post. I’ve run the test again, using the SMILES provided by Guha as input. The Groovy script is also attached here.
| # | Query | Subgraph Isomorphism | Extended CDK | Missing | Extra |
Well, CDK fingerprint is OK.
The test code used in this post has a fatal error, which made the test results completely incorrect. Please see the details here.
Abstract: Doubting the accuracy of fingerprint-based molecular substructure search, I ran a test comparing the different fingerprints implemented in CDK against subgraph isomorphism. The result is very interesting: CDK fingerprints should never be used alone in substructure search, but by combining CDK fingerprints with subgraph isomorphism we can strike a balance between speed and accuracy.
Guha wrote a post benchmarking different types of fingerprints, using the benchmarking strategies described by Bender & Glen and Godden et al. The benchmark is based on Tanimoto similarity, which is the foundation of similar-structure search in most chemistry databases. Another important feature of molecular structure search is substructure search, for which both subgraph isomorphism and fingerprints are currently used. Adel Golovin and Kim Henrick’s article Chemical Substructure Search in SQL provides a pure-SQL subgraph isomorphism strategy, and Rich Apodaca’s post on fingerprint-based substructure search in MySQL found a solution that works within MySQL’s limited binary operations.
Fingerprint screening is obviously faster, as it is much less expensive than subgraph isomorphism. But the real question is: does the fingerprint method really find all the substructure matches? Are any hits misjudged as substructure matches?
Using DrugBank small-molecule drugs as the test dataset and several hand-drawn structures of different types as queries, I performed substructure searches using subgraph isomorphism and the fingerprints implemented in CDK. If a hit from the fingerprint method is also found by the subgraph isomorphism method, I count it as a correct hit; otherwise it is counted as an incorrect hit.
All fingerprints implemented in CDK were tested, generated using Fingerprinter, ExtendedFingerprinter, GraphOnlyFingerprinter, SubstructureFingerprinter, MACCSFingerprinter and EStateFingerprinter. All parameters were left at their defaults.
To test the accuracy of queries with different complexity, I drew several structures, as listed below.
The results are listed below in the format “A/B”. A is the number of correct hits, i.e. hits also found by the subgraph isomorphism method. B is the number of incorrect hits. For example, 31/49 means that 31 correct hits and 49 incorrect hits were found. A higher A and a lower B are better.
| # | Query | Subgraph Isomorphism | EState | MACCS | Standard CDK | Extended CDK | GraphOnly CDK Fingerprint | Substructure Fingerprint |
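The A/B bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the Groovy script used for the test; the function names and the extra "missed" count are my own additions, and hits are assumed to be identified by some unique key such as a DrugBank ID.

```python
def score_hits(fingerprint_hits, isomorphism_hits):
    """Return (correct, incorrect, missed) counts for one query.

    Both arguments are iterables of molecule identifiers (hypothetical keys).
    """
    fp = set(fingerprint_hits)
    iso = set(isomorphism_hits)
    correct = len(fp & iso)    # A: fingerprint hits confirmed by isomorphism
    incorrect = len(fp - iso)  # B: fingerprint hits that isomorphism rejects
    missed = len(iso - fp)     # isomorphism hits the fingerprint did not return
    return correct, incorrect, missed


def format_cell(fingerprint_hits, isomorphism_hits):
    """Format one table cell in the "A/B" style used in the post."""
    correct, incorrect, _ = score_hits(fingerprint_hits, isomorphism_hits)
    return f"{correct}/{incorrect}"
```

Tracking the missed count separately is useful because the A/B cell alone does not show how many true matches a fingerprint lost.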
As the table shows, the MACCS, EState and Substructure fingerprints perform very badly: they find very few hits, and sometimes none at all. They were not designed for this task, so the result is not surprising.
Between the standard and extended CDK fingerprints, the standard one sometimes works better (maybe I should use a longer extended fingerprint rather than the default length, as discussed in Guha’s post). On queries with complex ring systems the extended CDK fingerprint works better, but the advantage is not obvious.
But I wonder why the hashed fingerprints still miss some results (see query 2 and query 7). Why doesn’t the superstructure share the same bits as the substructure? Is it because the structure is too simple?
As many incorrect hits are found, CDK fingerprints should never be used alone in substructure search. But consider combining CDK fingerprints with subgraph isomorphism: by running the fingerprint search first, we avoid performing the expensive subgraph isomorphism match on every target, and so strike a balance between speed and accuracy.
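The screen-then-verify idea can be sketched as follows. This is a toy model, not CDK: a molecule is assumed to be just a set of feature strings, the "fingerprint" is a tiny CRC-based hash fold, and a plain subset test stands in for the expensive subgraph isomorphism check. All names here are hypothetical.

```python
import zlib

FP_BITS = 16  # deliberately tiny so that hash collisions are likely


def fingerprint(features):
    """Fold a set of feature strings into an FP_BITS-wide bit mask."""
    bits = 0
    for feature in features:
        bits |= 1 << (zlib.crc32(feature.encode()) % FP_BITS)
    return bits


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def exact_match(query_features, target_features):
    """Toy stand-in for the expensive subgraph isomorphism test."""
    return query_features <= target_features


def build_index(molecules):
    """Precompute fingerprints: name -> (features, fingerprint)."""
    return {name: (feats, fingerprint(feats)) for name, feats in molecules.items()}


def search(query_features, index):
    """Screen with fingerprints first, then verify only the survivors."""
    query_fp = fingerprint(query_features)
    hits = []
    verified = 0  # how many times the expensive test actually ran
    for name, (target_features, target_fp) in index.items():
        if not passes_screen(query_fp, target_fp):
            continue  # cheap rejection, no exact test needed
        verified += 1
        if exact_match(query_features, target_features):
            hits.append(name)
    return hits, verified
```

In this toy model the screen can only produce extra candidates, never lose a true match, because every feature of the query sets a bit that the target also sets. The verification step then removes the extra candidates.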
As discussed in my last post, Structure Search Engine for All Major RDBMSs, if you want to build a structure search engine for all RDBMSs, the best way is to build it outside the RDBMS. If the structure search index is stored outside the RDBMS, you have to find a way to synchronize the SQL table with the search index. Most modern RDBMSs provide triggers to monitor modifications to SQL tables, so we can easily implement a one-way synchronization from the SQL table to the structure search index.
OK. Let’s see how this works.
NOTE: I chose MySQL as the development platform, so the SQL statements may need small changes to work on other RDBMSs. SQL statements for other platforms will be included in the final release, as my goal is to support all major RDBMSs including MySQL, PostgreSQL, Microsoft SQL Server, Oracle and IBM DB2 (if possible).
Suppose we have a database named “a_chemical_database” and a table named “molecules”.
    mysql> use a_chemical_database;
    mysql> SELECT * FROM molecules;
    +----+-----------+-----------+-----------+
    | id | smiles    | property1 | property2 |
    +----+-----------+-----------+-----------+
    |  1 | Cc1ccccc1 |        84 | liquid    |
    |  2 | CCC       |        36 | gas       |
    +----+-----------+-----------+-----------+
    2 rows in set (0.00 sec)
The “molecules” table holds the structure information (the “smiles” column) alongside other properties. It is no different from an ordinary SQL table, i.e. you don’t need to design your SQL tables specially to do structure search.
We want our search engine to know where a modification occurred when someone changes the data. It is possible to poll the “molecules” table directly at intervals, but that is very expensive if the table is really big. With triggers, we know exactly which kind of modification (INSERT, UPDATE or DELETE) was performed on which row. Before we can create the triggers, we need to create a table to log the modifications.
    mysql> CREATE TABLE `syncs` (
        ->   `id` int(11) NOT NULL auto_increment,
        ->   `mod_action` varchar(10) default NULL,
        ->   `prim_key` int(11) default NULL,
        ->   PRIMARY KEY (`id`)
        -> );
In the “syncs” table we store the type of modification (column “mod_action”) and the affected row (column “prim_key”).
If data in the “molecules” table changes, we expect a new record to be inserted into the “syncs” table, for example:
    mysql> SELECT * FROM syncs;
    +----+------------+----------+
    | id | mod_action | prim_key |
    +----+------------+----------+
    |  1 | INSERT     |        2 |
    |  2 | UPDATE     |        2 |
    |  3 | DELETE     |        1 |
    +----+------------+----------+
Now we create triggers.
    -- Use a temporary delimiter so the UPDATE trigger body can contain ";".
    DELIMITER //

    CREATE TRIGGER molecules_insert AFTER INSERT ON molecules
    FOR EACH ROW
      INSERT INTO syncs(mod_action, prim_key) VALUES('INSERT', NEW.id)//

    CREATE TRIGGER molecules_update AFTER UPDATE ON molecules
    FOR EACH ROW
    BEGIN
      -- Log only when the structure changed; <=> is MySQL's NULL-safe equality.
      IF NOT (OLD.smiles <=> NEW.smiles) THEN
        INSERT INTO syncs(mod_action, prim_key) VALUES('UPDATE', NEW.id);
      END IF;
    END//

    CREATE TRIGGER molecules_delete AFTER DELETE ON molecules
    FOR EACH ROW
      INSERT INTO syncs(mod_action, prim_key) VALUES('DELETE', OLD.id)//

    DELIMITER ;
We’re done. Let’s make some changes to the “molecules” table and see what happens.
    mysql> INSERT INTO molecules(smiles) VALUES("CC=CCN");
    mysql> UPDATE molecules SET smiles='CC1CCC1' WHERE id=2;
    mysql> DELETE FROM molecules WHERE id=3;
    mysql> SELECT * FROM syncs;
    +----+------------+----------+
    | id | mod_action | prim_key |
    +----+------------+----------+
    |  6 | DELETE     |        3 |
    |  5 | UPDATE     |        2 |
    |  4 | INSERT     |        3 |
    +----+------------+----------+
    3 rows in set (0.00 sec)
The trigger successfully monitored the modifications.
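On the search engine side, the logged rows then have to be replayed against the external index. The post does not show that part, so the following Python sketch is only one possible shape for it. It assumes the index is a plain dict from primary key to SMILES, and that `fetch_smiles` is a hypothetical callback that looks up the current “smiles” value of a row (returning None if the row no longer exists).

```python
def apply_sync_log(index, sync_rows, fetch_smiles):
    """Replay rows of the syncs table against an in-memory index.

    index        -- dict: primary key -> SMILES (stand-in for the real index)
    sync_rows    -- iterable of (id, mod_action, prim_key) tuples
    fetch_smiles -- callable(prim_key) -> current SMILES, or None if the row is gone
    Returns the highest sync id processed (0 if there were no rows), so the
    caller knows which log rows can be deleted or skipped next time.
    """
    last_id = 0
    # Process in id order, since SELECT without ORDER BY guarantees no order.
    for sync_id, action, key in sorted(sync_rows):
        if action == "DELETE":
            index.pop(key, None)
        elif action in ("INSERT", "UPDATE"):
            smiles = fetch_smiles(key)
            if smiles is None:
                index.pop(key, None)  # row was deleted after this log entry
            else:
                index[key] = smiles   # the real engine would re-fingerprint here
        last_id = sync_id
    return last_id
```

With the three log rows from the session above, the entry for row 3 is inserted and later deleted, so the replay leaves only rows 1 and 2 in the index, with row 2 carrying its updated SMILES.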
Many methods of doing substructure search directly in SQL have been reported recently: Adel Golovin and Kim Henrick’s Chemical Substructure Search in SQL, Rich Apodaca’s fingerprint-based substructure search in MySQL, and Charlie Zhu’s Microsoft SQL Server based substructure search with SMARTS support.
Doing this inside the RDBMS does have a number of advantages, “including platform independency, simplicity, flexibility, integrity, robustness and single point of failure”, as Adel and Kim describe. But some lightweight RDBMSs such as MySQL and PostgreSQL, the most widely used open source ones, provide very limited SQL programming functionality, so a pure SQL solution may be impossible.
Plugins have been developed to extend that functionality. For MySQL there is an open source project called mychem. For PostgreSQL there is pgchem:tigress, which is also open source. Both of them are based on OpenBabel, a C++ chemoinformatics library.
On the Oracle platform there are the CambridgeSoft Oracle Cartridge, Symyx Direct, JChem Cartridge (which may be free for academic or non-commercial use), etc. As Oracle is a commercial platform, none of the above is free.
When I was developing chemsoso.com, a Chinese chemical supplier database, structure search was an important problem to solve. The database contains 90,000 different chemicals in total and is still growing, so performance needs to be handled carefully.
For speed, fingerprints are obviously the best choice. It takes time to generate fingerprints, but in the search stage bit operations are much less expensive than graph matching. My initial idea was to generate fingerprints in Java and do the bit operations in MySQL. Unfortunately, MySQL restricts bit operations to a maximum of 64 bits. In Rich’s solution the fingerprint is split across multiple fields to satisfy MySQL’s requirement. Substructure search is possible with this method, but similarity search, where the Tanimoto coefficient needs to be calculated, is still impossible, as MySQL lacks the additional bit operation functions it needs.
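To make the two operations concrete, here is a small Python sketch. It is not Rich’s SQL or my Java code: fingerprints are assumed to be plain Python integers, and the function names are made up for illustration. The first part shows the 64-bit split and the per-word substructure screen; the second shows the Tanimoto coefficient, defined as the number of bits set in both fingerprints divided by the number of bits set in either.

```python
WORD_BITS = 64  # the per-operation limit described in the post


def split_into_words(fp, total_bits):
    """Split a fingerprint into 64-bit words, lowest bits first."""
    mask = (1 << WORD_BITS) - 1
    n_words = (total_bits + WORD_BITS - 1) // WORD_BITS
    return [(fp >> (i * WORD_BITS)) & mask for i in range(n_words)]


def is_substructure_candidate(query_words, target_words):
    """Per-word screen: every query bit must also be set in the target."""
    return all((q & t) == q for q, t in zip(query_words, target_words))


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |A and B| / |A or B| over the set bits."""
    common = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    # Convention chosen here: two empty fingerprints score 0.0.
    return common / total if total else 0.0
```

The substructure screen only needs AND and a comparison per word, which is why it survives the split into fields. Tanimoto additionally needs a bit count over the AND and OR of the whole fingerprint.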
In my final solution, an in-memory fingerprint index is created outside MySQL. Molecular structure information (SMILES or mol file) is stored in MySQL, and my search engine synchronizes data between the in-memory index and the MySQL table. Structure searching is performed directly on the in-memory index, which guarantees the performance. On a MacBook with a 1.83GHz CPU, it takes only about 50ms to do a substructure search on chemsoso.com’s 90,000 structures. Similar structure search takes about 300ms, and full structure search takes less than 10ms. If search boundaries are set according to the similarity requirement, we can get another 4X to 100X performance improvement, depending on the complexity of the query structure.
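One way to set such a search boundary is by bit count; the details are not given above, so treat this as an assumed illustration rather than the engine’s actual code. Since the Tanimoto coefficient of two fingerprints can never exceed the ratio of the smaller bit count to the larger one, a query with q bits set and a threshold t can only match targets whose bit count lies between q·t and q/t. The class and method names below are hypothetical.

```python
import bisect
import math


def popcount(fp):
    return bin(fp).count("1")


def tanimoto(fp_a, fp_b):
    total = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / total if total else 0.0


class SimilarityIndex:
    """In-memory index sorted by bit count, so a search scans only one slice."""

    def __init__(self, fingerprints):
        # fingerprints: dict of key -> fingerprint (int); sorted once up front.
        self._entries = sorted((popcount(fp), key, fp) for key, fp in fingerprints.items())
        self._counts = [entry[0] for entry in self._entries]

    def search(self, query_fp, threshold):
        """Return ([(key, score), ...], number of fingerprints compared)."""
        if not 0 < threshold <= 1:
            raise ValueError("threshold must be in (0, 1]")
        q = popcount(query_fp)
        # Rounded outward so floating-point error can never drop a valid hit.
        low = math.floor(q * threshold)
        high = math.ceil(q / threshold)
        start = bisect.bisect_left(self._counts, low)
        stop = bisect.bisect_right(self._counts, high)
        hits = []
        for _, key, fp in self._entries[start:stop]:
            score = tanimoto(query_fp, fp)
            if score >= threshold:
                hits.append((key, score))
        return hits, stop - start
```

The higher the threshold, the narrower the slice, so the gain depends on both the threshold and how many bits the query sets.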
Several days ago, Charlie Zhu talked to me, wondering how chemsoso.com’s structure search engine works. As the engine was mainly built from open source libraries, I decided to make the search engine open source to ease the work of building chemistry-related databases. With the search engine released, developers can focus on their database’s own functionality instead of dealing with structure search.
Before I can release the search engine, I have to find a way to plug it into users’ systems. Including the source code directly in a user’s project may be the fastest way to add structure search functionality, but the fact that my code is in Java does not mean everyone’s project is in Java. I want the search engine to work not only with all major RDBMSs, but also with all major OSs and all major programming languages. Besides a Java API, a command-line API and an HTTP API will also be provided, to make sure the search engine works with multiple programming languages and in network environments where server clusters exist.