pysam
Genomic file toolkit. Read and write SAM/BAM/CRAM alignments and VCF/BCF variants, read FASTA/FASTQ sequences, extract regions, and calculate coverage for NGS data processing pipelines.
Category: File Management
Download and extract to your skills directory
Copy command and send to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-pysam&locale=en&source=copy
Pysam - Python tool for genomic data processing
Overview of capabilities
Pysam is a Python module for reading, manipulating, and writing genomic datasets, providing efficient handling of SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequence files.
Use cases
1. Analysis of sequencing alignment files
When you need to process large-scale NGS sequencing data, Pysam can efficiently read BAM/SAM/CRAM alignment files. It supports extracting reads by genomic region, computing coverage statistics, performing pileup analysis, and filtering reads by quality scores, alignment flags, and other criteria. It is suitable for downstream analyses of whole-genome sequencing, exome sequencing, RNA-seq, and other sequencing types.
2. Variant detection and analysis
Pysam provides full read/write support for VCF/BCF files and can be used for variant filtering, annotation, and statistics in variant-calling workflows. It supports region-based variant queries, extracting sample genotype data, and filtering variants by quality or allele frequency, making it suitable for GWAS, population genetics studies, and clinical variant interpretation.
3. Reference sequence and gene region extraction
Using Pysam's FastaFile class, you can randomly access reference genomes and quickly extract sequences for specific genes or regions. Combined with alignment and variant files, you can extract variant context sequences, validate reference alleles, and retrieve gene sequences, which makes it a core component for building variant annotation pipelines and visualization tools.
Core features
1. Multi-format genomic file I/O
Supports reading and writing SAM/BAM/CRAM alignment files and VCF/BCF variant files, and reading FASTA/FASTQ sequence files. Provides a Pythonic interface to the htslib library, supports efficient region queries via index files such as tabix, and exposes the samtools and bcftools commands as Python functions.
2. Efficient region queries and analysis
Implements fast random access to genomic coordinates using index files, supporting extraction of reads, variants, or sequences by chromosome region. Provides pileup analysis for per-site coverage calculations. Because indexed queries on separate regions are independent of each other, they can also be processed in parallel, for example with one open file handle per worker.
3. Integrated bioinformatics workflows
Supports integrating multiple file types for complex analyses, including variant validation, coverage annotation, sequence extraction, and quality control. Suitable for building complete NGS data processing pipelines, enabling end-to-end automation from raw data QC to variant calling and result annotation.
Frequently asked questions
What's the difference between Pysam and samtools?
Pysam is a Python wrapper around the htslib C library, while samtools is a command-line tool built on the same library. Pysam's advantage is seamless integration with Python code, making it suitable for building custom analysis workflows and complex data processing logic; samtools is better suited for simple one-off operations and shell scripts. Pysam includes interfaces to call samtools and bcftools, and the two can be used together.
Do I need to create index files to use Pysam?
Most region-query operations require index files. BAM files need a .bai index, CRAM files need a .crai index, FASTA files need a .fai index, and compressed VCF files need a .tbi (tabix) index. If no index is available, you can perform sequential reading with fetch(until_eof=True). Index files can be created using functions like pysam.index(), pysam.faidx(), or pysam.tabix_index().
Does Pysam use 0-based or 1-based coordinates?
Pysam follows Python conventions and uses a 0-based, half-open coordinate system, meaning start positions begin at 0 and end positions are exclusive. However, note that region strings passed to the fetch() method through its region argument (e.g., region="chr1:1000-2000") follow the samtools convention and use 1-based, inclusive coordinates. The VCF file format itself is 1-based, but the VariantRecord.start attribute returns a 0-based coordinate, while VariantRecord.pos keeps the 1-based position, so be careful to distinguish them when using these different interfaces.