scikit-bio

scikit-bio - Python bioinformatics and microbiome analysis toolkit

Overview of capabilities

scikit-bio is a comprehensive Python bioinformatics library for handling and analyzing biological sequence data, building phylogenetic trees, calculating microbial diversity metrics, and performing ecological statistical tests and ordination analyses.

Use cases

1. Microbiome and ecological community analysis

Suited to microbiome studies such as 16S rRNA amplicon surveys and metagenomics: compute alpha diversity (Shannon, Simpson, Faith's PD) and beta diversity (Bray-Curtis, UniFrac distances), then run statistical tests such as PERMANOVA and ANOSIM to assess differences in community structure.

2. Biological sequence processing and analysis

Suited to reading, editing, and converting DNA, RNA, and protein sequences. Supports more than 19 biological file formats, including FASTA, FASTQ, and GenBank, along with sequence alignment, motif search, transcription, and translation.

3. Phylogenetics and evolutionary analysis

Useful for constructing phylogenetic trees from distance matrices (methods like NJ, UPGMA), pruning and re-rooting trees, comparing trees (Robinson-Foulds distance), and calculating patristic and cophenetic distances.
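To make the tree-comparison idea concrete, here is a minimal from-definition sketch of a Robinson-Foulds style distance. It does not call scikit-bio: trees are represented as hypothetical nested tuples of leaf names, and the function counts clades present in one rooted tree but not the other. This is the rooted, clade-based variant; scikit-bio's own implementation may differ in rooting and normalization.

```python
def clades(tree):
    """Return (leaves under this node, set of clades at internal nodes).

    A tree is a nested tuple; a leaf is a plain string (illustrative format).
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def rooted_rf(tree_a, tree_b):
    """Number of clades found in exactly one of the two rooted trees."""
    _, clades_a = clades(tree_a)
    _, clades_b = clades(tree_b)
    return len(clades_a ^ clades_b)


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
rf_same = rooted_rf(t1, t1)  # identical topologies share every clade
rf_diff = rooted_rf(t1, t2)  # {A,B}, {C,D} vs {A,C}, {B,D} differ
```

Both trees share only the root clade {A, B, C, D}, so the two cherries on each side all count toward the distance.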

Core features

1. Diversity analysis

Compute common microbial ecology metrics, including alpha diversity (richness, Shannon entropy, Simpson index, Pielou evenness, Faith's PD) and beta diversity (Bray-Curtis, Jaccard, weighted/unweighted UniFrac), with support for rarefaction and subsampling.
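As a rough illustration of what two of these metrics measure, the sketch below computes Shannon entropy and Bray-Curtis dissimilarity directly from their textbook definitions. It uses only the standard library and is not scikit-bio's implementation; the function names, the log base of 2, and the sample counts are assumptions made for the example.

```python
import math


def shannon(counts, base=2):
    """Shannon entropy H = -sum(p_i * log(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Hypothetical count vectors: one row per sample, one entry per taxon.
sample_a = [10, 10, 10, 10]  # four equally abundant taxa
sample_b = [40, 0, 0, 0]     # a single dominant taxon

shannon_a = shannon(sample_a)            # maximal evenness for four taxa
shannon_b = shannon(sample_b)            # no uncertainty, entropy of zero
bc_ab = bray_curtis(sample_a, sample_b)  # 60 / 80
```

An even community scores higher on Shannon entropy than a community dominated by one taxon, and Bray-Curtis ranges from 0 (identical counts) to 1 (no shared taxa).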

2. Sequence operations and alignment

Provides DNA, RNA, and Protein classes for sequence operations (reverse complement, transcription, translation, motif search), supports global and local sequence alignment, and uses TabularMSA for handling multiple sequence alignments.
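The sequence operations named above can be sketched in plain Python to show what each one does to the underlying string. This is not the scikit-bio API: the helper names are hypothetical, the codon table is deliberately partial (just enough for the example), and ambiguity codes are not handled.

```python
# Complement mapping for the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Partial codon table, for illustration only; a real table has 64 entries.
CODONS = {"AUG": "M", "GCC": "A", "UUU": "F", "UAA": "*"}


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """Treat the input as the coding strand and replace T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate complete codons in frame 1; '*' marks a stop codon."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODONS[rna[i:i + 3]] for i in range(0, usable, 3))


dna = "ATGGCCTTTTAA"
rc = reverse_complement(dna)
rna = transcribe(dna)
protein = translate(rna)
```

Here the twelve bases form the codons AUG, GCC, UUU, and UAA after transcription, which translate to methionine, alanine, phenylalanine, and a stop.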

3. Statistical tests and ordination

Provides ecological statistical methods such as PERMANOVA, ANOSIM, and Mantel tests; supports ordination analyses like PCoA, CA, CCA, and RDA; can handle distance matrices and biological tables (BIOM format).
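The permutation logic behind a Mantel test can be sketched as follows, assuming a one-sided test on the Pearson correlation between the upper triangles of two distance matrices. This standard-library version is for illustration only and is not scikit-bio's implementation; the function names, the seed, and the toy matrices are assumptions.

```python
import math
import random


def upper_triangle(dm):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(dm)
    return [dm[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(dm_x, dm_y, permutations=199, seed=0):
    """One-sided Mantel test: permute sample labels of dm_y, recompute r."""
    rng = random.Random(seed)
    n = len(dm_x)
    x = upper_triangle(dm_x)
    observed = pearson(x, upper_triangle(dm_y))
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Reorder rows and columns together so the matrix stays symmetric.
        permuted = [[dm_y[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(x, upper_triangle(permuted)) >= observed:
            hits += 1
    # Count the observed statistic itself so the p-value is never zero.
    return observed, (hits + 1) / (permutations + 1)


# Toy symmetric distance matrices; dm_y is dm_x scaled by two.
dm_x = [[0, 1, 4, 6], [1, 0, 3, 5], [4, 3, 0, 2], [6, 5, 2, 0]]
dm_y = [[2 * value for value in row] for row in dm_x]
r, p_value = mantel(dm_x, dm_y)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations, which is why an ordinary correlation test is not appropriate here.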

Frequently Asked Questions

What is scikit-bio? What is it suitable for?

scikit-bio is a Python library for biological data processing, particularly suited for microbiome analysis, biological sequence processing, phylogenetic tree construction, and ecological statistical analysis. It integrates with the QIIME 2 ecosystem and supports common formats like BIOM and Newick.

What's the difference between scikit-bio and Biopython?

Both are Python bioinformatics libraries, but they have different focuses. Biopython is more general-purpose, covering sequence parsing, structural biology, access to online databases, and more. scikit-bio focuses on microbiome analysis and ecological statistics, and provides a broader set of community analysis tools, including diversity metrics, UniFrac, and PERMANOVA.

How do I compute microbial diversity with scikit-bio?

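Use skbio.diversity.alpha_diversity() to calculate alpha diversity and skbio.diversity.beta_diversity() to calculate beta diversity; both take the metric name (e.g., unweighted_unifrac) as an argument. Before calculation, prepare an abundance matrix of integer counts (not relative abundances), with samples as rows and taxa as columns. Phylogenetic metrics such as Faith's PD and UniFrac also require a phylogenetic tree and a list of OTU IDs that maps the columns of the matrix to tips of the tree.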
