Scalable structural clustering of local RNA secondary structures

Reference. Fabrizio Costa, Steffen Heyne, Dominic Rose and Rolf Backofen. 2014. Scalable structural clustering of local RNA secondary structures. Proceedings of 1st workshop on Computational Methods for Structural RNAs (CMSR'14). isbn:978-2-9550187-0-5. pp. 59-61. doi:10.15455/CMSR.2014.0006
Abstract. Here, we propose an alignment-free approach for clustering RNA sequences according to sequence and structure information. We extend a fast graph kernel technique that we have developed for chemoinformatics applications and adapt it to detect similarities between RNA secondary structures. The key novelties are twofold: (1) we represent multiple folding hypotheses associated with a single RNA sequence in a flexible graph format; and (2) we efficiently convert the graph encoding into very high-dimensional sparse vectors. The first strategy allows us to compensate for the inaccuracies of the minimum free energy solution. The second strategy allows us to use locality-sensitive hashing methods to identify clusters with a complexity that is linear in the number of sequences N, thereby avoiding the quadratic complexity arising from pairwise similarity computations. We have integrated the approach into a ready-to-use pipeline for large-scale clustering of putative ncRNAs. The method has been evaluated on known ncRNA classes and compared against existing approaches such as LocARNA and RNASOUP. We show that not only do we obtain clusters of high quality, but we also achieve striking speedups: from years to days for serial computation, down to hours when considering the parallel implementation. We applied our method to six heterogeneous large-scale data sets containing more than 220,000 sequence fragments in total. We have analyzed predicted short ncRNAs that lacked reliable class assignments, and we have searched for local structural elements specific to experimentally validated lincRNAs. In the latter case, we found enriched GO terms for lincRNAs containing predicted local motifs, which suggest a connection to vital processes of the human nervous system.
Presented at CMSR'14 (Strasbourg, France) on September 7th 2014 at 15:20 by Fabrizio Costa.
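
Illustration. The second strategy named in the abstract (sparse feature vectors plus locality-sensitive hashing) can be sketched in a few lines. The sketch below is not the authors' pipeline: it assumes MinHash with banding as the hashing scheme (the abstract does not say which scheme is used), and it replaces the graph-kernel features with plain sequence k-mers as a hypothetical stand-in. All function names and parameters are illustrative.

```python
import hashlib
from collections import defaultdict


def feature_set(seq, k=3):
    """Hypothetical stand-in for graph-kernel features: the set of k-mers.

    Assumes len(seq) >= k so that the set is non-empty.
    """
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(feature, seed):
    """Deterministic 64-bit hash of a feature under a given seed."""
    data = f"{seed}:{feature}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_hashes=16):
    """One minimum per seeded hash function; similar sets share many minima."""
    return tuple(min(_hash(f, seed) for f in features)
                 for seed in range(num_hashes))


def lsh_candidates(sequences, num_hashes=16, band_size=4, k=3):
    """Return candidate pairs of names that share at least one signature band.

    Each sequence is hashed once, so building the buckets is linear in the
    number of sequences; only items that land in the same bucket are paired.
    """
    buckets = defaultdict(set)
    for name, seq in sequences.items():
        sig = minhash_signature(feature_set(seq, k), num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].add(name)

    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    toy = {
        "a": "ACGUACGUACGUACGU",
        "b": "ACGUACGUACGUACGU",  # identical to "a"
        "c": "GGGGGGGGGGGGGGGG",  # shares no 3-mer with "a"
    }
    print(sorted(lsh_candidates(toy)))
```

Each sequence is hashed once and compared only with the sequences that fall into the same bucket, which is how the all-against-all similarity computation is avoided. The number of hash functions and the band size trade recall against the number of candidate pairs; the values used here are arbitrary.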

License information

The proceedings of CMSR'14 are distributed under the terms of a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Ownership of the copyright for the articles is retained by their authors. They allow anyone to download, reuse, reprint, distribute, and/or copy articles for any non-commercial purposes, provided that the original authors and source are appropriately cited.