Optimizing the Research of DNA Sequences in a NoSQL Document Database: A Preliminary Study

Celesti, F.; Celesti, A.; Galletta, A.; Fazio, M.; Villari, M.

doi:10.1109/ISCC47284.2019.8969697

The study of DNA sequences has become indispensable for basic biological research and for numerous applied fields, such as comparative genomics, evolutionary biology, pan-genomics, genetics of disease, regulation of gene expression, oncology, and many others, all supported by bioinformatics. In the era of Cloud computing, federating the Cloud systems of different genetics research organisations paves the way towards a new era of data sharing and new mashup services and applications. However, due to the huge amount of genomics data (genomics Big Data) that has to be managed, a parallel, distributed NoSQL DataBase Management System (DBMS) approach becomes fundamental. Specifically, due to the textual nature of genomics data, a NoSQL DBMS appears to be the most suitable solution. In this paper, considering the whole human genome, we present a preliminary study that compares such a NoSQL solution, MongoDB, with a SQL-like database solution, MySQL, for searching DNA sequences. Moreover, in order to optimize the search for genomic codes, we adopt hash functions that map nucleotide sequences of arbitrary size onto data of a fixed, smaller size. Experiments show that MongoDB, apart from simplifying the management of genomics data, provides better performance.
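
The hashing idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash function, window size, or storage schema the authors use, so everything below is an assumption made for illustration: SHA-1 from the Python standard library stands in for the hash function, a plain in-memory dictionary stands in for an indexed database field, and the names `sequence_hash`, `build_index`, and `find` are hypothetical.

```python
import hashlib


def sequence_hash(seq):
    """Map a nucleotide string of any length to a fixed-size hex digest.

    SHA-1 is an assumption; the abstract does not name the hash function.
    """
    return hashlib.sha1(seq.upper().encode("ascii")).hexdigest()


def build_index(genome, window):
    """Index every window of `window` nucleotides by its hash.

    Returns a dict mapping digest -> list of start positions. The dict is a
    stand-in for an indexed field in a database collection or table.
    """
    index = {}
    for start in range(len(genome) - window + 1):
        key = sequence_hash(genome[start:start + window])
        index.setdefault(key, []).append(start)
    return index


def find(index, query):
    """Look up a query sequence by hash.

    The query must have the same length as the window used to build the index.
    """
    return index.get(sequence_hash(query), [])


# Toy example: a 12-nucleotide "genome" indexed with 4-nucleotide windows.
genome = "ACGTACGTTAGC"
index = build_index(genome, 4)
positions = find(index, "ACGT")
```

Under these assumptions, a lookup compares short fixed-size keys instead of scanning long text fields, which is the general motivation for hashing given in the abstract. A real deployment would also need to handle hash collisions and queries whose length differs from the indexed window.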