Geniml: A machine learning and BED file analysis tool for genomic intervals

Overview

Geniml is a Python toolkit for machine learning on genomic interval data (BED files). It provides unsupervised methods for training embeddings of genomic regions, single cells, and metadata labels, which can then be used for similarity search, clustering, and downstream ML tasks.

Use Cases

scATAC-seq single-cell analysis

Perform cell-type clustering, annotation, and dimensionality reduction on single-cell ATAC-seq data. The resulting cell embeddings can be stored in an AnnData object and used directly in scanpy workflows.

Bulk genomic data feature extraction

Process BED files from bulk sequencing data such as ChIP-seq and ATAC-seq; train region embeddings with Region2Vec for region similarity analysis and downstream supervised learning.

Metadata-aware genomic search

When experiments have metadata labels like cell type, tissue, or condition, use BEDspace to build a joint embedding space of regions and labels, enabling cross-modal queries.

Core Features

1. Region2Vec genomic region embeddings

Uses a word2vec-style unsupervised learning approach to convert genomic regions into low-dimensional vector representations. Suitable for dimensionality reduction of BED file collections, region similarity analysis, and constructing feature vectors for downstream ML tasks.
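The word2vec analogy can be made concrete with a small sketch: each BED file is treated as a "sentence" whose "words" are regions mapped onto a shared reference set (the universe). This is an illustration in plain Python, not Geniml's API. The toy universe, the helper names tokenize and make_sentences, and the choice to shuffle tokens (on the assumption that regions in a file have no meaningful order) are all assumptions made for this example.

```python
import random

# Hypothetical toy universe: (chrom, start, end) intervals, half-open.
UNIVERSE = [
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr1", 500, 600),
    ("chr2", 50, 150),
]


def tokenize(regions, universe=UNIVERSE):
    """Map each query region to the universe intervals it overlaps."""
    tokens = []
    for chrom, start, end in regions:
        for u_chrom, u_start, u_end in universe:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == u_chrom and start < u_end and u_start < end:
                token = f"{u_chrom}:{u_start}-{u_end}"
                if token not in tokens:  # keep each token once per file
                    tokens.append(token)
    return tokens


def make_sentences(bed_files, n_shuffles=2, seed=0):
    """Turn each BED file into several shuffled token lists ("sentences")."""
    rng = random.Random(seed)
    sentences = []
    for regions in bed_files:
        tokens = tokenize(regions)
        for _ in range(n_shuffles):
            shuffled = tokens[:]
            rng.shuffle(shuffled)
            sentences.append(shuffled)
    return sentences


# Two made-up BED files, each a list of (chrom, start, end) regions.
bed_a = [("chr1", 120, 180), ("chr1", 350, 520)]
bed_b = [("chr2", 0, 60), ("chr1", 590, 700)]
sentences = make_sentences([bed_a, bed_b])
```

The resulting token lists could then be passed to any word2vec implementation to learn one vector per universe region. The linear scan over the universe keeps the sketch short; a real tokenizer would use an interval index.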

2. scEmbed single-cell embeddings

An embedding training tool designed for single-cell ATAC-seq data, capable of generating cell-level embedding vectors that directly integrate with scanpy for clustering, visualization, and cell-type annotation.

3. Consensus peak (Universe) construction

Builds a reference peak set (the "universe") from a collection of BED files. Four statistical methods are available: CC (coverage cutoff), CCF (coverage cutoff flexible), ML (maximum likelihood), and HMM (hidden Markov model). The resulting universe provides a standardized set of reference features for tokenization.
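The idea behind a coverage cutoff can be sketched in a few lines: count how many input files cover each position and keep the stretches where that count reaches a threshold. The following is a minimal illustration under stated assumptions, not Geniml's implementation. The function name coverage_cutoff_universe is hypothetical, the cutoff is taken as an absolute number of files, and each file is assumed to contain no self-overlapping regions.

```python
from collections import defaultdict


def coverage_cutoff_universe(bed_files, cutoff):
    """Return intervals covered by at least `cutoff` of the input files.

    bed_files: list of files, each a list of half-open (chrom, start, end).
    Assumes regions within a single file do not overlap each other.
    """
    # Net change in coverage at every start (+1) and end (-1) position.
    deltas = defaultdict(lambda: defaultdict(int))
    for regions in bed_files:
        for chrom, start, end in regions:
            deltas[chrom][start] += 1
            deltas[chrom][end] -= 1

    universe = []
    for chrom in sorted(deltas):
        depth = 0
        open_start = None
        # Sweep left to right, tracking how many files cover the current position.
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= cutoff and open_start is None:
                open_start = pos  # coverage just reached the cutoff
            elif depth < cutoff and open_start is not None:
                universe.append((chrom, open_start, pos))  # coverage dropped below it
                open_start = None
    return universe


# Three made-up BED files.
files = [
    [("chr1", 100, 200), ("chr1", 300, 400)],
    [("chr1", 150, 250), ("chr1", 350, 450)],
    [("chr1", 180, 220), ("chr2", 10, 50)],
]
print(coverage_cutoff_universe(files, cutoff=2))
# prints [('chr1', 150, 220), ('chr1', 350, 400)]
```

Raising the cutoff yields fewer, narrower consensus regions; lowering it to 1 returns the union of all input regions.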

Frequently Asked Questions

What types of genomic data does Geniml support?

Geniml primarily handles genomic interval data in BED format. This includes bulk datasets such as ChIP-seq and ATAC-seq, single-cell scATAC-seq data, and more generally any learning task built on region-level genomic features. All input files should use the same reference genome assembly; a collection of such files can also be used to construct a tokenization universe.

How should I choose between Region2Vec and BEDspace?

If you only need to analyze similarity of regions themselves and there are no metadata labels, choose Region2Vec. When experiments include metadata such as cell type, tissue, or condition and you need cross-modal queries (e.g., "which cell types do these regions belong to?"), choose BEDspace. BEDspace builds a joint embedding space of regions and labels.
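A cross-modal query in a joint embedding space amounts to comparing region vectors with label vectors. The sketch below illustrates the idea with cosine similarity in plain Python. It does not use BEDspace itself: the 3-dimensional vectors are invented for illustration, rank_labels is a hypothetical helper, and averaging the region vectors to form the query is an assumption of this example.

```python
import math

# Hypothetical embeddings living in one shared 3-dimensional space.
label_vectors = {
    "T cell": [0.9, 0.1, 0.0],
    "B cell": [0.1, 0.9, 0.0],
    "hepatocyte": [0.0, 0.1, 0.9],
}
region_vectors = {
    "chr1:100-200": [0.8, 0.2, 0.1],
    "chr1:300-400": [0.7, 0.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def rank_labels(query_regions):
    """Rank metadata labels by similarity to the averaged query regions."""
    query = mean_vector([region_vectors[r] for r in query_regions])
    scores = {label: cosine(query, vec) for label, vec in label_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_labels(["chr1:100-200", "chr1:300-400"])
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

With these toy vectors the query regions rank closest to "T cell". The same comparison can be run in the other direction, starting from a label vector and ranking regions.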

Can Geniml integrate with existing single-cell analysis workflows?

Yes. The cell embeddings generated by scEmbed can be stored as an obsm entry of an AnnData object (e.g., adata.obsm['scembed_X']) and then used in standard scanpy downstream steps such as neighborhood graph construction, clustering, and UMAP visualization. Geniml also supports integration with ecosystems like BEDbase and Hugging Face.
