geniml
This skill should be used when working with genomic interval data (BED files) for machine learning tasks. Use for training region embeddings (Region2Vec, BEDspace), single-cell ATAC-seq analysis (scEmbed), building consensus peaks (universes), or any ML-based analysis of genomic regions. Applies to BED file collections, scATAC-seq data, chromatin accessibility datasets, and region-based genomic feature learning.
Category: AI Skill Development
Geniml: Machine Learning Toolkit for Genomic Intervals
Overview
Geniml is a Python toolkit specialized for machine learning analysis of genomic interval data (BED files). It provides unsupervised methods to train embeddings of genomic regions, single cells, and metadata, supporting similarity search, clustering, and downstream ML tasks.
Use Cases
Perform cell-type clustering, annotation, and dimensionality reduction on single-cell ATAC-seq data. The generated embeddings can be seamlessly integrated into scanpy workflows.
Process BED files from bulk sequencing data such as ChIP-seq and ATAC-seq; train region embeddings with Region2Vec for region similarity analysis and downstream supervised learning.
When experiments have metadata labels like cell type, tissue, or condition, use BEDspace to build a joint embedding space of regions and labels, enabling cross-modal queries.
Core Features
1. Region2Vec genomic region embeddings
Uses a word2vec-style unsupervised learning approach to convert genomic regions into low-dimensional vector representations. Suitable for dimensionality reduction of BED file collections, region similarity analysis, and constructing feature vectors for downstream ML tasks.
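To make the word2vec analogy concrete, the following is a minimal sketch in plain Python, not the geniml API. It assumes that each BED file is treated as a "document" whose "words" are the universe regions it overlaps, and that the token order is shuffled because genomic order carries no sentence-like context. The toy universe, the token naming scheme, and the helper names (`tokenize`, `make_sentences`) are all hypothetical.

```python
import random

# Hypothetical toy universe: (chrom, start, end) tuples acting as the vocabulary.
UNIVERSE = [
    ("chr1", 100, 200),
    ("chr1", 500, 700),
    ("chr2", 50, 150),
]


def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize(regions, universe):
    """Map each query region to the universe regions it overlaps, as string tokens."""
    tokens = []
    for region in regions:
        for u in universe:
            if overlaps(region, u):
                tokens.append(f"{u[0]}_{u[1]}_{u[2]}")
    return tokens


def make_sentences(tokens, n_shuffles, seed=0):
    """Return several shuffled copies of the token list to use as training sentences."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_shuffles):
        sentence = tokens[:]
        rng.shuffle(sentence)
        sentences.append(sentence)
    return sentences


# One toy "BED file"; the chr3 region has no universe match and yields no token.
bed_file_regions = [("chr1", 150, 180), ("chr1", 650, 900), ("chr3", 10, 20)]
tokens = tokenize(bed_file_regions, UNIVERSE)
sentences = make_sentences(tokens, n_shuffles=3)
```

Sentences of this shape are what a word2vec-style trainer would consume. The training step itself is left out here because it depends on the library used.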
2. scEmbed single-cell embeddings
An embedding training tool designed for single-cell ATAC-seq data, capable of generating cell-level embedding vectors that directly integrate with scanpy for clustering, visualization, and cell-type annotation.
3. Consensus peak (Universe) construction
Builds a reference peak set from a collection of BED files. Four methods are available: CC (coverage cutoff), CCF (flexible cutoff), ML (maximum likelihood), and HMM (hidden Markov model). The resulting universe provides the standardized reference features used for tokenization.
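The coverage cutoff idea can be illustrated with a short sweep over interval boundaries. This is a sketch of the general technique only, not geniml's implementation. It assumes half-open intervals, no overlapping regions within a single file, and a hypothetical function name `coverage_cutoff_universe`. The CCF, ML, and HMM methods are not shown.

```python
from collections import defaultdict


def coverage_cutoff_universe(bed_collections, cutoff):
    """Return intervals covered by at least `cutoff` of the input files.

    bed_collections: list of files, each a list of (chrom, start, end) tuples.
    Assumes regions within one file do not overlap each other.
    """
    # Net change in coverage depth at each boundary position, per chromosome.
    deltas = defaultdict(lambda: defaultdict(int))
    for regions in bed_collections:
        for chrom, start, end in regions:
            deltas[chrom][start] += 1
            deltas[chrom][end] -= 1

    universe = []
    for chrom in sorted(deltas):
        depth = 0
        open_start = None
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= cutoff and open_start is None:
                open_start = pos  # coverage just reached the cutoff
            elif depth < cutoff and open_start is not None:
                universe.append((chrom, open_start, pos))  # coverage dropped below it
                open_start = None
    return universe


# Three toy "BED files" (made-up coordinates).
file_a = [("chr1", 100, 300)]
file_b = [("chr1", 200, 400)]
file_c = [("chr1", 250, 500), ("chr2", 10, 20)]

universe = coverage_cutoff_universe([file_a, file_b, file_c], cutoff=2)
```

A lower cutoff yields a broader, more permissive universe, while a higher cutoff keeps only regions supported by many files.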
Frequently Asked Questions
What types of genomic data does Geniml support?
Geniml primarily handles genomic interval data in BED format, including peak sets from ChIP-seq and ATAC-seq experiments, scATAC-seq single-cell data, and any learning task based on genomic region features. All input files should use the same reference genome assembly, and a collection of BED files can be used to construct a tokenization universe.
How should I choose between Region2Vec and BEDspace?
If you only need to analyze similarity of regions themselves and there are no metadata labels, choose Region2Vec. When experiments include metadata such as cell type, tissue, or condition and you need cross-modal queries (e.g., "which cell types do these regions belong to?"), choose BEDspace. BEDspace builds a joint embedding space of regions and labels.
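A cross-modal query in a joint embedding space amounts to a nearest-neighbor search between region vectors and label vectors. The sketch below shows that idea in plain Python and does not use the BEDspace API. The three-dimensional vectors are invented for illustration, and averaging the region vectors into one query vector is an assumption, as are the helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_labels(query, label_vectors):
    """Return (label, similarity) pairs sorted from most to least similar."""
    scored = [(label, cosine(query, vec)) for label, vec in label_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical label embeddings living in the same space as the regions.
label_vectors = {
    "T cell": [0.9, 0.1, 0.0],
    "hepatocyte": [0.0, 0.2, 0.9],
    "neuron": [0.1, 0.9, 0.1],
}

# Hypothetical embeddings for the query regions.
region_vectors = [[0.8, 0.2, 0.1], [1.0, 0.0, 0.0]]

ranked = rank_labels(mean_vector(region_vectors), label_vectors)
```

The top-ranked label answers the question "which cell type do these regions most resemble?" for this toy data. The reverse query, from a label to its nearest regions, follows the same pattern.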
Can Geniml integrate with existing single-cell analysis workflows?
Yes. The cell embeddings generated by scEmbed can be stored as an obsm entry of an AnnData object (e.g., adata.obsm['scembed_X']) and used in downstream scanpy workflows, including neighborhood graph construction, clustering, and UMAP visualization. Geniml also supports integration with ecosystems such as BEDbase and Hugging Face.