CRBC News

MetaGraph: ETH Zurich’s 'Google for DNA' Lets Scientists Search Nearly 600 Million Sequences

ETH Zurich researchers have built MetaGraph, an open-source "Google for DNA" that consolidates nearly 600 million sequences (≈21 million GB) into a single searchable index. The system converts raw reads into error-corrected graphs and achieves an average compression of about 300×; some very large datasets shrink far more, from ~100 TB to ~10 GB. MetaGraph lets scientists query vast collections without downloading raw files, making searches fast and inexpensive. Roughly half of public sequencing data is already indexed, with the rest expected by the end of 2025.

DNA sequencing has revolutionized our understanding of cancer, neurodegenerative disorders and many other conditions, but it has also produced a flood of data. Public archives now contain petabytes of raw reads, making large-scale search and comparative analysis slow, expensive and technically challenging. To tackle this problem, researchers at ETH Zurich have developed MetaGraph, a searchable index that consolidates vast DNA and RNA datasets into a single, efficient resource.

What MetaGraph is

MetaGraph is an open-source, full-text searchable index that brings together nearly 600 million distinct sequences and roughly 21 million gigabytes (~21 PB) of sequencing data. Described by Professor Gunnar Rätsch of ETH Zurich as a "Google for DNA," the project is presented in a paper in Nature and aims to make massive sequence collections quickly queryable without requiring users to download terabytes of raw files.

How it works

The system converts raw read data into error-corrected, refined graphs and merges them into a unified index. By organizing sequence data and metadata in graph structures and removing redundancies, MetaGraph achieves dramatic compression: about 300× on average, with some datasets reduced far more (for example, the team reports compressing certain ~100 TB collections down to ~10 GB, a ratio of roughly 10,000×). The index stays fully searchable while requiring far less storage.
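To make the idea concrete, here is a minimal toy sketch of turning reads into a cleaned graph and querying it. The article does not spell out how the graphs are built, so this sketch assumes a k-mer (de Bruijn-style) graph, a common choice for sequence indexes, and uses a simple abundance threshold as a crude stand-in for error correction. All function names, the reads and the parameters are hypothetical and are not MetaGraph's API.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(reads, k, min_count=2):
    """Build a toy k-mer graph.

    Assumption: k-mers seen fewer than min_count times are treated as
    sequencing errors and dropped. Real error correction is more involved.
    """
    counts = count_kmers(reads, k)
    nodes = {kmer for kmer, n in counts.items() if n >= min_count}
    # An edge links a k-mer to each retained k-mer that overlaps it by k-1 bases.
    edges = {
        kmer: sorted(kmer[1:] + base for base in "ACGT" if kmer[1:] + base in nodes)
        for kmer in nodes
    }
    return nodes, edges


def query_fraction(nodes, query, k):
    """Return the fraction of the query's k-mers that are present in the graph."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    if not kmers:
        return 0.0
    return sum(kmer in nodes for kmer in kmers) / len(kmers)


# Hypothetical reads; the last one carries a simulated single-base error.
reads = ["ACGTAC", "ACGTAC", "CGTACG", "ACGTTC"]
nodes, edges = build_graph(reads, k=4)

print(sorted(nodes))                        # k-mers that survived filtering
print(query_fraction(nodes, "ACGTAC", 4))   # query matching the clean reads
print(query_fraction(nodes, "ACGTTC", 4))   # query containing the error
```

Because overlapping reads share k-mers, each distinct k-mer is stored once no matter how often it was sequenced, which is where much of the redundancy removal in such a scheme comes from. A production index would use compressed data structures instead of Python sets.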

What’s included

The indexed material spans viruses, bacteria, fungi and other microbes, plants, metazoans (animals) and humans, and includes human gut metagenomes and other raw metagenomic datasets. About half of the world’s publicly available sequencing data is already indexed, and the team expects the remainder of public collections to be online by the end of 2025.

Practical benefits

Instead of downloading large datasets before searching them, researchers can query the compressed index directly. This reduces time, bandwidth and storage costs: an individual query can cost as little as a few cents, and the full public index can fit on a handful of hard drives, with estimated infrastructure costs on the order of $2,500. MetaGraph is designed to scale so that search performance remains high as the archive grows.
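A quick back-of-the-envelope calculation shows why "a handful of hard drives" is plausible. The sketch below uses only the figures quoted in the article (about 21 million GB of raw data, roughly 300× average compression, and the ~100 TB to ~10 GB example). It assumes decimal units and a hypothetical 20 TB drive, and it naively applies the average ratio to the whole archive, so the result is an illustration rather than the project's reported index size.

```python
import math

GB_PER_TB = 1_000        # decimal units assumed
GB_PER_PB = 1_000_000

raw_gb = 21_000_000      # "roughly 21 million gigabytes" of sequencing data
avg_ratio = 300          # "on average about 300x" compression

raw_pb = raw_gb / GB_PER_PB                    # raw archive size in petabytes
index_tb = raw_gb / avg_ratio / GB_PER_TB      # naive index size in terabytes

# Best-case example from the article: ~100 TB reduced to ~10 GB.
best_case_ratio = (100 * GB_PER_TB) / 10

DRIVE_TB = 20            # hypothetical drive capacity, not from the article
drives = math.ceil(index_tb / DRIVE_TB)

print(f"raw archive: {raw_pb:.0f} PB")
print(f"index at {avg_ratio}x: about {index_tb:.0f} TB")
print(f"best-case ratio: {best_case_ratio:,.0f}x")
print(f"drives needed at {DRIVE_TB} TB each: {drives}")
```

Under these assumptions the compressed index comes to tens of terabytes rather than petabytes, which is small enough to fit on a few commodity drives.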

Who will use it and why it matters

MetaGraph is open source and intended for a broad audience: academic researchers, pharmaceutical companies, educators and potentially private users. As Dr. André Kahles of ETH Zurich’s Biomedical Informatics Group noted, search engines often find unexpected uses. As sequencing becomes cheaper and more routine, tools like MetaGraph could enable everyday applications such as quickly identifying plant species or tracking antimicrobial-resistance genes.

Examples and next steps

Faster, cheaper search could accelerate workflows that rely on large-scale comparisons, from mapping viral genomes (as in SARS-CoV-2 surveillance) to evolutionary studies. The MetaGraph project provides an Open Data repository and web examples that allow users to try queries and view visualizations of proteins and resistance genes.

Bottom line

MetaGraph lowers the barrier to working with enormous sequencing archives by compressing data into a searchable index that preserves utility while cutting cost and time. By making these resources easier to explore, the platform could speed discovery across genetics, infectious disease research and biodiversity studies.
