Many methods of doing substructure search directly in SQL has been reported recently, Adel Golovin and Kim Henrick’s Chemical Substructure in SQL, Rich Apodaca’s fingerprint based MySQL substructure search in MySQL, and Charlie Zhu’s Microsoft SQL Server based substructure search with SMARTS support.
Doing this in RDBMSs do have a number of advantages, “including platform independency, simplicity, flexibility, integrity, robustness and single point of failure”, as Adel and Kim describes. But some light weight RDBMSs such as MySQL and PostgreSQL, the most widely used open source ones, provide very limited SQL programming function, a pure SQL based solution may be impossible.
Plugins are developed to enhance the functionality. For MySQL, there’s an open source project called mychem. For PostgreSQL, there’s pgchem:tigress which is also open source. Both of them is based on OpenBabel, a C++ chemoinformatics library.
On Oracle platform, there’re CambridgeSoft Oracle Cartridge, Symyx Direct, JChem Cartridge(may be free for academic or non-commercial use), etc.. As Oracle is a commercial platform, not of these above is free.
When I was developing chemsoso.com, a Chinese chemical supplier database, structure search feature is an important problem to be solved. The database contains 90,000 different chemicals in total and still growing, performance needs to be carefully dealt with.
In consideration of speed, fingerprint is obviously the best choice. It takes time to generate fingerprints, but in the search stage, bit operation are much less consuming than graph matching. My initial idea is to generate fingerprint in Java and do bit operation in MySQL. Unfortunately, MySQL has restrictions on bit operation, it limit the maximum range to 64 bits. In Rich’s solution, fingerprint is separated into multi fileds to satisfy MySQL’s requirement. Substructure search is possible in this method. But similar search where Tanimoto coefficient needs to be calculated is still impossible, as more bit operation function is missing in MySQL.
In my final solution, a in-memory fingerprint index outside MySQL is created. Molecule structure information(SMILES or mol file) is stored in MySQL, my search engine synchronize data between the in-memory index and MySQL table. Structure searching is performed directly on the in-memory index, this guarantee the performance. On a MacBook with 1.83GHz CPU, it only takes about 50ms to do substructure search on chemsoso.com’s 90,000 structures. For similar structure search, it takes about 300ms, for full structure search, the time is less than 10ms. If search boundary if set according to similarity requirement, we can have another 4X to 100X performance improvement depends on the complexity of the query molecule structure.
Several days ago, Charlie Zhu talked to me, wondering how chemsoso.com’s structure search engine works. As the structure was mainly built from open source libraries, I decide to make the search engine open source to ease the work of building chemistry related database. With the search engine released, developers can focus the database’s own functionality, instead of dealing with structure search.
Before I can release the search engine, I have to find a way to cut in users’ system in the form of plugin. Including source code directly into users’ project may be the fastest way to add structure search functionality, but my code is in Java does not means everyone’s project is in Java. I want the search engine works not only with all major RDBMSs, but also all major OSs and all major programming languages. Besides Java API, command-line API and HTTP API will also be provided to make sure the search engine works with multi programming language and network environment where server clusters exists.