gtars

High-performance toolkit for genomic interval analysis in Rust with Python bindings. Use when working with genomic regions, BED files, coverage tracks, overlap detection, tokenization for ML models, or fragment analysis in computational genomics and machine learning applications.

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-gtars&locale=en&source=copy

Gtars: High-Performance Genomic Interval Analysis Toolkit

Overview

Gtars is a high-performance toolkit for genomic interval analysis, written in Rust and exposed through Python bindings and a command-line interface. It covers BED file processing, interval overlap detection, coverage track generation, and data preprocessing for machine learning.

Use Cases

1. Genomic Region Overlap Analysis

When you need to compare different sets of genomic features, gtars can quickly detect overlaps between regions. For example: find overlaps between ChIP-seq peaks and promoters, annotate the functional elements where variant sites reside, or identify regulatory regions shared across samples. Overlap queries are built on the IGD (Integrated Genome Database) index structure, which groups intervals into fixed-size genomic bins so that a query only inspects the bins it touches; this is designed to be faster than traditional interval tree methods when searching large collections of interval sets.

2. Sequencing Coverage Analysis

Generate coverage tracks from ATAC-seq, ChIP-seq, or RNA-seq fragments or reads for genome browser visualization or downstream quantitative analysis. Output can be written in WIG or BigWig format at a custom resolution, which makes it suitable for processing large-scale sequencing data efficiently.

3. Genomic Machine Learning Data Preprocessing

Prepare genomic input data for deep learning models by discretizing genomic intervals into tokens, with seamless integration into the geniml library. Suitable for training transformer-based genomic models, creating positional encodings, or building custom genomic machine learning pipelines.

Core Features

1. Interval Overlap Detection and IGD Index

Use the IGD data structure for fast genomic interval overlap queries: build the index once, then run bulk queries against it. Suitable for feature annotation, peak set comparison, regulatory element identification, and similar scenarios where the same reference set is queried many times, which is where an index pays off compared with rescanning the intervals for each query.
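To make the build-once, query-many idea concrete, here is a minimal pure-Python sketch of a bin-based interval index. It does not call gtars and is not the IGD implementation; the function names build_index and query, the BIN_SIZE value, and the sample promoters are all assumptions chosen for illustration.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width for this sketch, not a gtars setting


def build_index(intervals, bin_size=BIN_SIZE):
    """Map (chrom, bin) -> list of (start, end, label) for half-open intervals."""
    index = defaultdict(list)
    for chrom, start, end, label in intervals:
        # Register the interval in every bin it touches.
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            index[(chrom, b)].append((start, end, label))
    return index


def query(index, chrom, start, end, bin_size=BIN_SIZE):
    """Return sorted labels of indexed intervals overlapping [start, end)."""
    hits = set()  # a set removes duplicates from intervals spanning several bins
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        for s, e, label in index.get((chrom, b), ()):
            if s < end and start < e:  # half-open overlap test
                hits.add(label)
    return sorted(hits)


# Hypothetical promoter annotations (chrom, start, end, label).
promoters = [
    ("chr1", 1_000, 2_000, "geneA"),
    ("chr1", 30_000, 31_000, "geneB"),
    ("chr2", 500, 900, "geneC"),
]
index = build_index(promoters)
peak_hits = query(index, "chr1", 1_500, 30_500)  # a peak spanning two bins
```

Only the bins touched by the query are inspected, so the cost of a query depends on local interval density rather than on the total size of the indexed set.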

2. Coverage Track Generation

Generate normalized coverage files from sequencing data via the uniwig module, supporting multiple output formats and resolution settings. Can be used for accessibility profiling (ATAC-seq), binding signal visualization (ChIP-seq), expression quantification (RNA-seq), and similar applications.
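As a rough illustration of what a coverage track at a given resolution contains, the following pure-Python sketch computes mean depth per fixed-width bin and formats it as fixedStep WIG text. It is not the uniwig algorithm and does not call gtars; coverage_bins, to_wig, and the toy fragments are assumptions for this example.

```python
def coverage_bins(fragments, chrom_length, step):
    """Mean per-base depth in consecutive bins of width `step`.

    `fragments` are (start, end) pairs, 0-based and half-open.
    """
    diff = [0] * (chrom_length + 1)  # difference array: +1 at start, -1 at end
    for start, end in fragments:
        if start >= chrom_length:
            continue  # fragment lies entirely beyond the chromosome end
        diff[start] += 1
        diff[min(end, chrom_length)] -= 1

    bins, depth, total = [], 0, 0
    for pos in range(chrom_length):
        depth += diff[pos]  # running sum gives the depth at this base
        total += depth
        if (pos + 1) % step == 0 or pos == chrom_length - 1:
            width = pos % step + 1  # the last bin may be shorter than `step`
            bins.append(total / width)
            total = 0
    return bins


def to_wig(chrom, bins, step):
    """Format bin values as a fixedStep WIG block (WIG positions are 1-based)."""
    header = f"fixedStep chrom={chrom} start=1 step={step} span={step}"
    return "\n".join([header] + [f"{value:g}" for value in bins])


# Two overlapping toy fragments on a 20 bp chromosome, 5 bp resolution.
bins = coverage_bins([(0, 10), (5, 15)], chrom_length=20, step=5)
wig_text = to_wig("chr1", bins, step=5)
```

A smaller step gives a higher-resolution track at the cost of a larger file; normalization (for example, scaling by library size) would be applied to the bin values before writing.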

3. Genomic Tokenization

Convert genomic regions into discrete token representations usable by machine learning models: each region is mapped to the entries of a fixed vocabulary of reference regions that it overlaps. Multiple tokenization strategies are offered, such as TreeTokenizer. Tokenization is a foundational component of the geniml ecosystem, supporting positional encoding generation and transformer model training.
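The idea of region tokenization can be sketched without gtars as a lookup from query regions into a fixed vocabulary of reference regions. The code below is an illustrative assumption, not the TreeTokenizer implementation: build_vocab, tokenize, the "<unk>" token, and the requirement that vocabulary regions do not overlap each other are all choices made for this example.

```python
from bisect import bisect_right

UNK = "<unk>"  # hypothetical token for regions that match nothing in the vocabulary


def build_vocab(universe):
    """Assign an integer id to every reference region, reserving 0 for UNK."""
    vocab = {UNK: 0}
    for chrom, start, end in universe:
        vocab[f"{chrom}:{start}-{end}"] = len(vocab)
    return vocab


def tokenize(regions, universe, vocab):
    """Map each query region to the ids of the reference regions it overlaps.

    Assumes reference regions on the same chromosome do not overlap each
    other, so sorting by start also sorts by end.
    """
    by_chrom = {}
    for chrom, start, end in sorted(universe):
        starts, ends = by_chrom.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)

    ids = []
    for chrom, start, end in regions:
        starts, ends = by_chrom.get(chrom, ([], []))
        i = bisect_right(ends, start)  # first reference region ending after `start`
        hits = []
        while i < len(starts) and starts[i] < end:
            hits.append(vocab[f"{chrom}:{starts[i]}-{ends[i]}"])
            i += 1
        ids.extend(hits or [vocab[UNK]])
    return ids


universe = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
vocab = build_vocab(universe)
token_ids = tokenize(
    [("chr1", 50, 250), ("chr1", 100, 200), ("chr2", 10, 20)], universe, vocab
)
```

The resulting integer ids play the same role as word ids in a text model: they can be fed to an embedding layer, with the vocabulary of reference regions fixed before training.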

4. Reference Sequence Management

Handle reference genome sequences following the GA4GH refget protocol, supporting sequence extraction, integrity checks, and digest computation. Suitable for cross-reference version comparisons, sequence validation, and related use cases.
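For digest computation, the GA4GH scheme used by refget is a SHA-512 hash truncated to 24 bytes and base64url-encoded. The sketch below reproduces that with the Python standard library only; it does not call gtars, and the helper names, the "SQ." prefix handling, and the uppercase normalization step are assumptions made for this example.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sequence_digest(sequence: str) -> str:
    """Digest of a sequence, normalized to uppercase (assumed normalization)."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def verify_sequence(sequence: str, expected_digest: str) -> bool:
    """Integrity check: recompute the digest and compare with the expected one."""
    return sequence_digest(sequence) == expected_digest


# Two copies of the same sequence, differing only in case, share one digest,
# while a single-base change produces a different one.
reference = sequence_digest("ACGTACGT")
same = verify_sequence("acgtacgt", reference)
changed = verify_sequence("ACGTACGA", reference)
```

Because the digest depends only on sequence content, two assemblies that name a chromosome differently can still be recognized as carrying the same sequence.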

5. Single-Cell Fragment Processing

Tools specialized for fragment analysis of single-cell ATAC-seq data, supporting splitting fragment files by cell barcode or cluster. Useful for quality control and downstream clustering analyses.
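Splitting a fragment file by cluster amounts to routing each line according to its barcode column. The following standard-library sketch shows the idea on an in-memory example; it does not use gtars, and split_fragments, the barcode-to-cluster mapping, and the sample rows are hypothetical. It assumes the common five-column fragment layout (chrom, start, end, barcode, read count).

```python
import csv
import io
from collections import defaultdict


def split_fragments(lines, barcode_to_cluster):
    """Group fragment lines by the cluster assigned to their cell barcode.

    Fragments whose barcode has no cluster assignment are dropped.
    """
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header comments
        cluster = barcode_to_cluster.get(row[3])  # column 4 holds the barcode
        if cluster is not None:
            groups[cluster].append("\t".join(row))
    return dict(groups)


# Hypothetical fragment file contents and cluster assignments.
fragment_text = (
    "# id=sample1\n"
    "chr1\t100\t250\tAAAC-1\t2\n"
    "chr1\t300\t420\tTTTG-1\t1\n"
    "chr2\t50\t180\tAAAC-1\t1\n"
    "chr2\t60\t190\tGGGG-1\t1\n"
)
clusters = {"AAAC-1": "cluster_0", "TTTG-1": "cluster_1"}
per_cluster = split_fragments(io.StringIO(fragment_text), clusters)
```

In a real workflow each group would be streamed to its own output file rather than held in memory, and the per-cluster files could then be turned into pseudobulk coverage tracks.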

Frequently Asked Questions

What data formats does gtars support?

gtars supports standard formats in genomics: BED files (genomic intervals, 3-column or extended formats), WIG/BigWig (coverage tracks), FASTA (reference sequences), and Fragment TSV (single-cell fragment files with cell barcodes). These formats cover most input/output needs of genomic analysis workflows.
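For reference, a 3-column or extended BED file can be read with a few lines of plain Python. This sketch is independent of gtars; BedRecord and parse_bed are names invented for this example, and only the first four columns are kept.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None


def parse_bed(text):
    """Parse tab-separated BED text, skipping comment, track, and browser lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        start, end = int(fields[1]), int(fields[2])
        if start > end:
            raise ValueError(f"start is greater than end: {line!r}")
        name = fields[3] if len(fields) > 3 else None
        records.append(BedRecord(fields[0], start, end, name))
    return records


bed_text = (
    "track name=peaks\n"
    "chr1\t100\t200\tpeak1\t960\t+\n"
    "# a comment\n"
    "chr2\t0\t50\n"
)
records = parse_bed(bed_text)
widths = [r.end - r.start for r in records]
```

BED coordinates are 0-based and half-open, so the width of a region is simply end minus start.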

How is gtars different from bedtools?

gtars is implemented in Rust, with multithreading support and zero-copy NumPy integration aimed at large-scale data. Where bedtools is primarily a command-line tool, gtars provides a native Python API in addition to its command-line tools, which makes it easier to integrate into Python analysis pipelines. It is a good fit for scenarios that require high-performance computing or integration with machine learning workflows.

How do I use gtars in Python?

After installing with pip install gtars, you can import and use the gtars Python bindings. For example, use from gtars.tokenizers import TreeTokenizer to create a tokenizer, or gtars.igd.build_index() to build an interval index; module and function names can change between releases, so check them against the version you have installed. The Python API is designed to be concise and interoperable with common data science libraries like NumPy and Pandas, making it suitable for building custom analysis workflows.

What types of genomic data is gtars suitable for?

gtars is suitable for most interval-based genomic data analyses, including but not limited to: ChIP-seq peak analysis, ATAC-seq accessibility analysis, RNA-seq coverage quantification, single-cell chromatin accessibility data, variant annotation, and regulatory element identification. If your data can be represented as intervals on chromosomes (for example, in BED format), gtars can process it efficiently.