geniml

This skill should be used when working with genomic interval data (BED files) for machine learning tasks. Use for training region embeddings (Region2Vec, BEDspace), single-cell ATAC-seq analysis (scEmbed), building consensus peaks (universes), or any ML-based analysis of genomic regions. Applies to BED file collections, scATAC-seq data, chromatin accessibility datasets, and region-based genomic feature learning.

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-geniml&locale=en&source=copy

Geniml: Machine Learning Toolkit for Genomic Intervals

Overview

Geniml is a Python toolkit specialized for machine learning analysis of genomic interval data (BED files). It provides unsupervised methods to train embeddings of genomic regions, single cells, and metadata, supporting similarity search, clustering, and downstream ML tasks.

Use Cases

  • scATAC-seq single-cell analysis: Perform cell-type clustering, annotation, and dimensionality reduction on single-cell ATAC-seq data. The generated embeddings can be seamlessly integrated into scanpy workflows.

  • Bulk genomic data feature extraction: Process BED files from bulk sequencing data such as ChIP-seq and ATAC-seq; train region embeddings with Region2Vec for region similarity analysis and downstream supervised learning.

  • Metadata-aware genomic search: When experiments have metadata labels like cell type, tissue, or condition, use BEDspace to build a joint embedding space of regions and labels, enabling cross-modal queries.

    Core Features

    1. Region2Vec genomic region embeddings


    Uses a word2vec-style unsupervised learning approach to convert genomic regions into low-dimensional vector representations. Suitable for dimensionality reduction of BED file collections, region similarity analysis, and constructing feature vectors for downstream ML tasks.
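
    To make the word2vec analogy concrete, the sketch below treats the regions of one BED file as the "words" of a sentence and emits (center, context) training pairs from a shuffled copy. Shuffling before drawing context windows is an assumption about how co-occurrence is defined here, and the function name `context_pairs` and its parameters are hypothetical. This is not geniml's implementation, only an illustration of the idea.

```python
import random

def context_pairs(tokens, window=2, seed=0):
    """Return (center, context) pairs drawn from a shuffled copy of `tokens`."""
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)  # genomic order is not treated as sentence order
    pairs = []
    for i, center in enumerate(shuffled):
        lo = max(0, i - window)
        hi = min(len(shuffled), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, shuffled[j]))
    return pairs

# Hypothetical region tokens from a single BED file
bed_file_tokens = ["chr1:100-200", "chr1:500-650", "chr2:40-90", "chr3:10-60"]
pairs = context_pairs(bed_file_tokens, window=1)
```

    A word2vec-style model would then be trained on such pairs so that regions appearing in the same files end up close together in the embedding space.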

    2. scEmbed single-cell embeddings


    An embedding training tool designed for single-cell ATAC-seq data, capable of generating cell-level embedding vectors that directly integrate with scanpy for clustering, visualization, and cell-type annotation.

    3. Consensus peak (Universe) construction


    Builds a reference peak set (a "universe") from a collection of BED files. Four statistical methods are offered: CC (coverage cutoff), CCF (coverage cutoff flexible), ML (maximum likelihood), and HMM (hidden Markov model). The resulting universe serves as the standardized reference vocabulary for tokenization.
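
    As an illustration of the coverage-cutoff idea, the sketch below keeps every stretch of the genome covered by at least `cutoff` input files. It assumes half-open (start, end) coordinates and non-overlapping regions within each file. The function name and signature are hypothetical, and geniml's own CC method may differ in its details.

```python
def coverage_cutoff_universe(bed_files, cutoff):
    """bed_files: list of region lists, each region a (chrom, start, end) tuple.

    Returns merged regions where at least `cutoff` files overlap.
    """
    # Per chromosome, record how coverage changes at each coordinate
    deltas = {}
    for regions in bed_files:
        for chrom, start, end in regions:
            chrom_deltas = deltas.setdefault(chrom, {})
            chrom_deltas[start] = chrom_deltas.get(start, 0) + 1
            chrom_deltas[end] = chrom_deltas.get(end, 0) - 1

    universe = []
    for chrom in sorted(deltas):
        depth = 0
        region_start = None
        for pos in sorted(deltas[chrom]):
            depth += deltas[chrom][pos]
            if depth >= cutoff and region_start is None:
                region_start = pos  # coverage rose to the cutoff
            elif depth < cutoff and region_start is not None:
                universe.append((chrom, region_start, pos))  # coverage fell below it
                region_start = None
    return universe

# Three hypothetical BED files
file_a = [("chr1", 100, 200), ("chr1", 400, 500)]
file_b = [("chr1", 150, 250), ("chr1", 420, 480)]
file_c = [("chr1", 180, 300)]
universe = coverage_cutoff_universe([file_a, file_b, file_c], cutoff=2)
# universe == [("chr1", 150, 250), ("chr1", 420, 480)]
```

    Raising the cutoff yields fewer, narrower regions; lowering it to 1 yields the union of all inputs.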

    Frequently Asked Questions

    What types of genomic data does Geniml support?


    Geniml primarily handles genomic interval data in BED format. This includes peaks from bulk assays such as ChIP-seq (protein-DNA binding) and ATAC-seq (chromatin accessibility), scATAC-seq single-cell data, and any learning task based on region-level genomic features. All files in a collection should use the same reference genome assembly as the universe they are tokenized against; a collection of BED files can itself be used to construct that universe.
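
    Tokenization against a universe can be pictured as follows: each input region is replaced by the universe region(s) it overlaps, and regions with no overlap are dropped. The overlap rule and the function name `tokenize` are assumptions made for illustration, not geniml's tokenizer API.

```python
def tokenize(regions, universe):
    """Map each (chrom, start, end) region to the universe regions it overlaps."""
    tokens = []
    for chrom, start, end in regions:
        for u_chrom, u_start, u_end in universe:
            # Half-open intervals overlap when each starts before the other ends
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.append((u_chrom, u_start, u_end))
    return tokens

# Hypothetical universe and query regions on the same genome assembly
universe = [("chr1", 150, 250), ("chr1", 420, 480), ("chr2", 10, 90)]
regions = [("chr1", 200, 230), ("chr1", 300, 350), ("chr2", 80, 120)]
tokens = tokenize(regions, universe)
# tokens == [("chr1", 150, 250), ("chr2", 10, 90)]
```

    The nested loop is quadratic and only suitable for small examples; a practical tokenizer would use an interval index. The example also shows why assemblies must match: coordinates from a different assembly would overlap the wrong universe regions or none at all.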

    How should I choose between Region2Vec and BEDspace?


    If you only need to analyze similarity of regions themselves and there are no metadata labels, choose Region2Vec. When experiments include metadata such as cell type, tissue, or condition and you need cross-modal queries (e.g., "which cell types do these regions belong to?"), choose BEDspace. BEDspace builds a joint embedding space of regions and labels.
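
    A cross-modal query in a joint embedding space reduces to comparing a region-derived vector with the label vectors. The sketch below uses made-up three-dimensional vectors, averages the query regions into one vector (an assumption), and ranks labels by cosine similarity. None of the names come from BEDspace itself.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_labels(query_vec, label_vecs):
    """Return label names ordered from most to least similar to the query."""
    return sorted(label_vecs, key=lambda name: cosine(query_vec, label_vecs[name]), reverse=True)

# Toy label embeddings (illustrative values only)
label_vecs = {
    "T cell": [0.9, 0.1, 0.0],
    "hepatocyte": [0.0, 0.2, 0.9],
    "neuron": [0.1, 0.9, 0.1],
}
# Toy embeddings of the query regions, averaged into one query vector
region_vecs = [[0.8, 0.2, 0.1], [1.0, 0.0, 0.0]]
query = [sum(col) / len(region_vecs) for col in zip(*region_vecs)]
ranking = rank_labels(query, label_vecs)
# ranking[0] == "T cell"
```

    The same comparison works in the other direction: starting from a label vector and ranking region vectors answers "which regions are associated with this cell type?"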

    Can Geniml integrate with existing single-cell analysis workflows?


    Yes. The cell embeddings generated by scEmbed can be stored in the obsm slot of an AnnData object (e.g., adata.obsm['scembed_X']) and then used by scanpy's downstream steps, including neighborhood graph construction, clustering, and UMAP visualization. Geniml also integrates with BEDbase and Hugging Face.
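
    The neighborhood graph is the first downstream step that consumes the embeddings. In scanpy this would typically be `sc.pp.neighbors(adata, use_rep="scembed_X")` once the matrix is stored in `adata.obsm`. The standard-library sketch below shows what that step amounts to on a toy embedding matrix; the function `knn_graph` is hypothetical and uses a brute-force Euclidean search rather than scanpy's approximate method.

```python
import math

def knn_graph(embeddings, k):
    """For each cell (row), return the indices of its k nearest other cells."""
    neighbors = []
    for i, vec in enumerate(embeddings):
        dists = [
            (math.dist(vec, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy cell embeddings: two tight groups of two cells each
cell_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
graph = knn_graph(cell_embeddings, k=1)
# graph == [[1], [0], [3], [2]]
```

    Clustering and UMAP then operate on this graph, which is why well-separated groups in the embedding space translate into distinct clusters.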