imaging-data-commons

Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.

name: imaging-data-commons
description: Query and download public cancer imaging data from NCI Imaging Data Commons using idc-index. Use for accessing large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Query by metadata, visualize in browser, check licenses.
license: This skill is provided under the MIT License. IDC data itself has individual licensing (mostly CC-BY, some CC-NC) that must be respected when using the data.
metadata:
  version: 1.2.0
  skill-author: Andrey Fedorov, @fedorov
  idc-index: "0.11.7"
  repository: https://github.com/ImagingDataCommons/idc-claude-skill

Imaging Data Commons

Overview

Use the idc-index Python package to query and download public cancer imaging data from the National Cancer Institute Imaging Data Commons (IDC). No authentication required for data access.

Primary tool: idc-index (GitHub)

Check current data scale for the latest version:

from idc_index import IDCClient
client = IDCClient()

# Get IDC data version
print(client.get_idc_version())

# Get collection count and total series
stats = client.sql_query("""
SELECT
COUNT(DISTINCT collection_id) as collections,
COUNT(DISTINCT analysis_result_id) as analysis_results,
COUNT(DISTINCT PatientID) as patients,
COUNT(DISTINCT StudyInstanceUID) as studies,
COUNT(DISTINCT SeriesInstanceUID) as series,
SUM(instanceCount) as instances,
SUM(series_size_MB)/1000000 as size_TB
FROM index
""")
print(stats)

Core workflow:

  • Query metadata → client.sql_query()

  • Download DICOM files → client.download_from_selection()

  • Visualize in browser → client.get_viewer_URL(seriesInstanceUID=...)

When to Use This Skill

  • Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images

  • Selecting image subsets by cancer type, modality, anatomical site, or other metadata

  • Downloading DICOM data from IDC

  • Checking data licenses before use in research or commercial applications

  • Visualizing medical images in a browser without local DICOM viewer software

IDC Data Model

    IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance):

  • collection_id: Groups patients by disease, modality, or research focus (e.g., tcga_luad, nlst). A patient belongs to exactly one collection.

  • analysis_result_id: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections.

    Use collection_id to find original imaging data, which may include annotations deposited along with the images; use analysis_result_id to find AI-generated or expert annotations.
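One way to see the two grouping levels side by side is to count, per collection, how many series are original images and how many belong to an analysis result. The sketch below runs such a query with Python's built-in sqlite3 against a few made-up rows so that it works offline. It assumes analysis_result_id is NULL for series that are not part of an analysis result, which is worth confirming against the column description in client.indices_overview. TOY_ROWS, SPLIT_QUERY, and split_counts are illustrative names, not part of idc-index.

```python
import sqlite3

# Made-up rows for illustration only:
# (collection_id, analysis_result_id, SeriesInstanceUID)
TOY_ROWS = [
    ("collection_a", None, "1.2.3.1"),
    ("collection_a", None, "1.2.3.2"),
    ("collection_a", "result_x", "1.2.3.3"),
    ("collection_b", "result_x", "1.2.3.4"),
]

# Assumption: analysis_result_id is NULL for original (non-derived) series.
SPLIT_QUERY = """
SELECT
    collection_id,
    SUM(CASE WHEN analysis_result_id IS NULL THEN 1 ELSE 0 END) AS original_series,
    SUM(CASE WHEN analysis_result_id IS NOT NULL THEN 1 ELSE 0 END) AS derived_series
FROM "index"
GROUP BY collection_id
ORDER BY collection_id
"""


def split_counts(rows):
    """Run SPLIT_QUERY against an in-memory stand-in for the index table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        'CREATE TABLE "index" '
        "(collection_id TEXT, analysis_result_id TEXT, SeriesInstanceUID TEXT)"
    )
    con.executemany('INSERT INTO "index" VALUES (?, ?, ?)', rows)
    result = con.execute(SPLIT_QUERY).fetchall()
    con.close()
    return result


print(split_counts(TOY_ROWS))
```

Under the same assumption, the SELECT statement should also be usable with client.sql_query() to get the split for the real index table.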

    Key identifiers for queries:

    | Identifier | Scope | Use for |
    |---|---|---|
    | collection_id | Dataset grouping | Filtering by project/study |
    | PatientID | Patient | Grouping images by patient |
    | StudyInstanceUID | DICOM study | Grouping of related series, visualization |
    | SeriesInstanceUID | DICOM series | Grouping of related instances, downloads, visualization |

    Index Tables

    The idc-index package provides multiple metadata index tables, accessible via SQL or as pandas DataFrames.

    Important: Use client.indices_overview to get current table descriptions and column schemas. This is the authoritative source for available columns and their types — always query it when writing SQL or exploring data structure.

    Available Tables

    | Table | Row Granularity | Loaded | Description |
    |---|---|---|---|
    | index | 1 row = 1 DICOM series | Auto | Primary metadata for all current IDC data |
    | prior_versions_index | 1 row = 1 DICOM series | Auto | Series from previous IDC releases; for downloading deprecated data |
    | collections_index | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions |
    | analysis_results_index | 1 row = 1 analysis result collection | fetch_index() | Metadata about derived datasets (annotations, segmentations) |
    | clinical_index | 1 row = 1 clinical data column | fetch_index() | Dictionary mapping clinical table columns to collections |
    | sm_index | 1 row = 1 slide microscopy series | fetch_index() | Slide Microscopy (pathology) series metadata |
    | sm_instance_index | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level (SOPInstanceUID) metadata for slide microscopy |
    | seg_index | 1 row = 1 DICOM Segmentation series | fetch_index() | Segmentation metadata: algorithm, segment count, reference to source image series |

    Auto = loaded automatically when IDCClient() is instantiated
    fetch_index() = requires client.fetch_index("table_name") to load

    Joining Tables

    Join keys are not explicitly labeled in the table schemas; the columns below are a subset that can be used in joins.

    | Join Column | Tables | Use Case |
    |---|---|---|
    | collection_id | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data |
    | SeriesInstanceUID | index, prior_versions_index, sm_index, sm_instance_index | Link series across tables; connect to slide microscopy details |
    | StudyInstanceUID | index, prior_versions_index | Link studies across current and historical data |
    | PatientID | index, prior_versions_index | Link patients across current and historical data |
    | analysis_result_id | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) |
    | source_DOI | index, analysis_results_index | Link by publication DOI |
    | crdc_series_uuid | index, prior_versions_index | Link by CRDC unique identifier |
    | Modality | index, prior_versions_index | Filter by imaging modality |
    | SeriesInstanceUID | index, seg_index | Link segmentation series to its index metadata |
    | segmented_SeriesInstanceUID | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) |

    Note: Subjects, Updated, and Description appear in multiple tables but have different meanings (counts vs identifiers, different update contexts).

    Example joins:

    from idc_index import IDCClient
    client = IDCClient()

    # Join index with collections_index to get cancer types
    client.fetch_index("collections_index")
    result = client.sql_query("""
    SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE i.Modality = 'MR'
    LIMIT 10
    """)

    # Join index with sm_index for slide microscopy details
    client.fetch_index("sm_index")
    result = client.sql_query("""
    SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf
    FROM index i
    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID
    LIMIT 10
    """)

    # Join seg_index with index to find segmentations and their source images
    client.fetch_index("seg_index")
    result = client.sql_query("""
    SELECT
    s.SeriesInstanceUID as seg_series,
    s.AlgorithmName,
    s.total_segments,
    src.collection_id,
    src.Modality as source_modality,
    src.BodyPartExamined
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE s.AlgorithmType = 'AUTOMATIC'
    LIMIT 10
    """)

    Accessing Index Tables

    Via SQL (recommended for filtering/aggregation):

    from idc_index import IDCClient
    client = IDCClient()

    # Query the primary index (always available)
    results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10")

    # Fetch and query additional indices
    client.fetch_index("collections_index")
    collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index")

    client.fetch_index("analysis_results_index")
    analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5")

    As pandas DataFrames (direct access):

    # Primary index (always available after client initialization)
    df = client.index

    # Fetch and access on-demand indices
    client.fetch_index("sm_index")
    sm_df = client.sm_index

    Discovering Table Schemas (Essential for Query Writing)

    The indices_overview dictionary contains complete schema information for all tables. Always consult this when writing queries or exploring data structure.

    DICOM attribute mapping: Many columns are populated directly from DICOM attributes in the source files. The column description in the schema indicates when a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or references a DICOM tag). This allows leveraging DICOM knowledge when querying — standard DICOM attribute names like PatientID, StudyInstanceUID, Modality, BodyPartExamined work as expected.

    from idc_index import IDCClient
    client = IDCClient()

    # List all available indices with descriptions
    for name, info in client.indices_overview.items():
        print(f"\n{name}:")
        print(f" Installed: {info['installed']}")
        print(f" Description: {info['description']}")

    # Get complete schema for a specific index (columns, types, descriptions)
    schema = client.indices_overview["index"]["schema"]
    print(f"\nTable: {schema['table_description']}")
    print("\nColumns:")
    for col in schema['columns']:
        desc = col.get('description', 'No description')
        # Description indicates if column is from DICOM attribute
        print(f" {col['name']} ({col['type']}): {desc}")

    # Find columns that are DICOM attributes (check description for "DICOM" reference)
    dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()]
    print(f"\nDICOM-sourced columns: {dicom_cols}")

    Alternative: use get_index_schema() method:

    schema = client.get_index_schema("index")

    # Returns the same schema dict: {'table_description': ..., 'columns': [...]}

    Key Columns in Primary index Table

    Most common columns for queries (use indices_overview for complete list and descriptions):

    | Column | Type | DICOM | Description |
    |---|---|---|---|
    | collection_id | STRING | No | IDC collection identifier |
    | analysis_result_id | STRING | No | If applicable, indicates which analysis results collection the series is part of |
    | source_DOI | STRING | No | DOI linking to dataset details; use for learning more about the content and for attribution (see citations below) |
    | PatientID | STRING | Yes | Patient identifier |
    | StudyInstanceUID | STRING | Yes | DICOM Study UID |
    | SeriesInstanceUID | STRING | Yes | DICOM Series UID; use for downloads/viewing |
    | Modality | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) |
    | BodyPartExamined | STRING | Yes | Anatomical region |
    | SeriesDescription | STRING | Yes | Description of the series |
    | Manufacturer | STRING | Yes | Equipment manufacturer |
    | StudyDate | STRING | Yes | Date study was performed |
    | PatientSex | STRING | Yes | Patient sex |
    | PatientAge | STRING | Yes | Patient age at time of study |
    | license_short_name | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) |
    | series_size_MB | FLOAT | No | Size of series in megabytes |
    | instanceCount | INTEGER | No | Number of DICOM instances in series |

    DICOM = Yes: Column value extracted from the DICOM attribute with the same name. Refer to the DICOM standard for numeric tag mappings. Use standard DICOM knowledge for expected values and formats.
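As an example of applying standard DICOM knowledge to a column value: the DICOM Age String (AS) value representation encodes age as three digits followed by a unit letter (D, W, M, or Y), for example 061Y. The helper below is a minimal sketch that converts such strings to approximate years. It assumes PatientAge is stored in that raw form, which should be checked against actual values in the index; parse_dicom_age is a hypothetical helper, not part of idc-index.

```python
import re

# DICOM Age String: three digits followed by D (days), W (weeks), M (months), or Y (years)
_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")

# Approximate unit lengths in days, used only to express every age in years
_DAYS_PER_UNIT = {"D": 1.0, "W": 7.0, "M": 30.4375, "Y": 365.25}


def parse_dicom_age(value):
    """Convert a DICOM Age String such as '061Y' to approximate years.

    Returns None for missing or malformed values instead of raising.
    """
    if not isinstance(value, str):
        return None
    match = _AGE_PATTERN.match(value.strip())
    if match is None:
        return None
    number, unit = match.groups()
    return int(number) * _DAYS_PER_UNIT[unit] / 365.25


for raw in ["061Y", "018M", "bad value", None]:
    print(raw, "->", parse_dicom_age(raw))
```

With a query result in a DataFrame, the helper could be applied as results["PatientAge"].map(parse_dicom_age) to get a numeric column suitable for filtering.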

    Clinical Data Access

    # Fetch clinical index (also downloads clinical data tables)
    client.fetch_index("clinical_index")

    # Query clinical index to find available tables and their columns
    tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index")

    # Load a specific clinical table as DataFrame
    clinical_df = client.get_clinical_table("table_name")

    See references/clinical_data_guide.md for detailed workflows including value mapping patterns and joining clinical data with imaging.

    Data Access Options

    | Method | Auth Required | Best For |
    |---|---|---|
    | idc-index | No | Key queries and downloads (recommended) |
    | IDC Portal | No | Interactive exploration, manual selection, browser-based download |
    | BigQuery | Yes (GCP account) | Complex queries, full DICOM metadata |
    | DICOMweb proxy | No | Tool integration via DICOMweb API |
    | Cloud storage (S3/GCS) | No | Direct file access, bulk downloads, custom pipelines |

    Cloud storage organization

    IDC maintains all DICOM files in public cloud storage buckets mirrored between AWS S3 and Google Cloud Storage. Files are organized by CRDC UUIDs (not DICOM UIDs) to support versioning.

    | Bucket (AWS / GCS) | License | Content |
    |---|---|---|
    | idc-open-data / idc-open-data | No commercial restriction | >90% of IDC data |
    | idc-open-data-two / idc-open-idc1 | No commercial restriction | Collections with potential head scans |
    | idc-open-data-cr / idc-open-cr | Commercial use restricted (CC BY-NC) | ~4% of data |

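    Files are stored as <crdc_series_uuid>/<instance UUID>.dcm, that is, one folder per series (named by its CRDC UUID) containing one .dcm file per instance. Access is free (no egress fees) via AWS CLI, gsutil, or s5cmd with anonymous access. Use the series_aws_url column from the index for S3 URLs; GCS uses the same path structure.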

    See references/cloud_storage_guide.md for bucket details, access commands, UUID mapping, and versioning.
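Because GCS uses the same path structure as S3, an S3 series URL can be translated to its GCS counterpart by swapping the scheme and bucket name. The following is a minimal sketch of that translation using only the bucket pairs listed in the table above; s3_to_gcs_url and AWS_TO_GCS_BUCKET are hypothetical names introduced for illustration, and the mapping should be rechecked against references/cloud_storage_guide.md if buckets change.

```python
# AWS bucket -> GCS bucket, taken from the bucket table in this section
AWS_TO_GCS_BUCKET = {
    "idc-open-data": "idc-open-data",
    "idc-open-data-two": "idc-open-idc1",
    "idc-open-data-cr": "idc-open-cr",
}


def s3_to_gcs_url(s3_url):
    """Translate an IDC s3:// URL to the equivalent gs:// URL."""
    prefix = "s3://"
    if not s3_url.startswith(prefix):
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    # Split "bucket/rest/of/path" into the bucket name and the remaining path
    bucket, _, path = s3_url[len(prefix):].partition("/")
    if bucket not in AWS_TO_GCS_BUCKET:
        raise ValueError(f"unknown IDC bucket: {bucket!r}")
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}/{path}"


print(s3_to_gcs_url("s3://idc-open-data-cr/cb09464a-c5cc-4428-9339-d7fa87cfe837/"))
```

Raising on an unknown bucket is a deliberate choice here: silently passing the name through could produce a URL that looks valid but points nowhere.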

    DICOMweb access

    IDC data is available via DICOMweb interface (Google Cloud Healthcare API implementation) for integration with PACS systems and DICOMweb-compatible tools.

    | Endpoint | Auth | Use Case |
    |---|---|---|
    | Public proxy | No | Testing, moderate queries, daily quota |
    | Google Healthcare | Yes (GCP) | Production use, higher quotas |

    See references/dicomweb_guide.md for endpoint URLs, code examples, supported operations, and implementation details.

    Installation and Setup

    Required (for basic access):

    pip install --upgrade idc-index

    Important: Each new IDC data release triggers a new version of idc-index. Always use the --upgrade flag when installing, unless an older version is needed for reproducibility.

    Tested with: idc-index 0.11.7 (IDC data version v23)
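To check which idc-index version is installed relative to the tested version above, the standard library's importlib.metadata can be used. This is a minimal sketch: installed_version and version_tuple are hypothetical helpers, and the comparison only looks at the leading numeric part of each version component, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

TESTED_VERSION = "0.11.7"  # version this guide was tested with


def installed_version(package="idc-index"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '0.11.7' into (0, 11, 7), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


found = installed_version()
if found is None:
    print("idc-index is not installed; run: pip install --upgrade idc-index")
elif version_tuple(found) < version_tuple(TESTED_VERSION):
    print(f"idc-index {found} is older than the tested version {TESTED_VERSION}")
else:
    print(f"idc-index {found} is installed")
```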

    Optional (for data analysis):

    pip install pandas numpy pydicom

    Core Capabilities

    1. Data Discovery and Exploration

    Discover what imaging collections and data are available in IDC:

    from idc_index import IDCClient

    client = IDCClient()

    # Get summary statistics from primary index
    query = """
    SELECT
    collection_id,
    COUNT(DISTINCT PatientID) as patients,
    COUNT(DISTINCT SeriesInstanceUID) as series,
    SUM(series_size_MB) as size_mb
    FROM index
    GROUP BY collection_id
    ORDER BY patients DESC
    """
    collections_summary = client.sql_query(query)

    # For richer collection metadata, use collections_index
    client.fetch_index("collections_index")
    collections_info = client.sql_query("""
    SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData
    FROM collections_index
    """)

    # For analysis results (annotations, segmentations), use analysis_results_index
    client.fetch_index("analysis_results_index")
    analysis_info = client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities
    FROM analysis_results_index
    """)

    collections_index provides curated metadata per collection: cancer types, tumor locations, species, subject counts, and supporting data types — without needing to aggregate from the primary index.

    analysis_results_index lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities.

    2. Querying Metadata with SQL

    Query the IDC mini-index using SQL to find specific datasets.

    First, explore available values for filter columns:

    from idc_index import IDCClient

    client = IDCClient()

    # Check what Modality values exist
    modalities = client.sql_query("""
    SELECT DISTINCT Modality, COUNT(*) as series_count
    FROM index
    GROUP BY Modality
    ORDER BY series_count DESC
    """)
    print(modalities)

    # Check what BodyPartExamined values exist for MR modality
    body_parts = client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count
    FROM index
    WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined
    ORDER BY series_count DESC
    LIMIT 20
    """)
    print(body_parts)

    Then query with validated filter values:

    # Find breast MRI scans (use actual values from exploration above)
    results = client.sql_query("""
    SELECT
    collection_id,
    PatientID,
    SeriesInstanceUID,
    Modality,
    SeriesDescription,
    license_short_name
    FROM index
    WHERE Modality = 'MR'
    AND BodyPartExamined = 'BREAST'
    LIMIT 20
    """)

    # Iterate over the results (returned as a pandas DataFrame)
    for idx, row in results.iterrows():
        print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}")

    To filter by cancer type, join with collections_index:

    client.fetch_index("collections_index")
    results = client.sql_query("""
    SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality
    FROM index i
    JOIN collections_index c ON i.collection_id = c.collection_id
    WHERE c.CancerTypes LIKE '%Breast%'
    AND i.Modality = 'MR'
    LIMIT 20
    """)

    Available metadata fields (use client.indices_overview for complete list):

  • Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID

  • Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName

  • Clinical: PatientAge, PatientSex, StudyDate

  • Descriptions: StudyDescription, SeriesDescription

  • Licensing: license_short_name

    Note: Cancer type is in collections_index.CancerTypes, not in the primary index table.

    3. Downloading DICOM Files

    Download imaging data efficiently from IDC's cloud storage:

    Download entire collection:

    from idc_index import IDCClient

    client = IDCClient()

    # Download small collection (RIDER Pilot ~1GB)
    client.download_from_selection(
        collection_id="rider_pilot",
        downloadDir="./data/rider"
    )

    Download specific series:

    # First, query for series UIDs
    series_df = client.sql_query("""
    SELECT SeriesInstanceUID
    FROM index
    WHERE Modality = 'CT'
    AND BodyPartExamined = 'CHEST'
    AND collection_id = 'nlst'
    LIMIT 5
    """)

    # Download only those series
    client.download_from_selection(
        seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
        downloadDir="./data/lung_ct"
    )

    Custom directory structure:

    Default dirTemplate: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID

    # Simplified hierarchy (omit StudyInstanceUID level)
    client.download_from_selection(
        collection_id="tcga_luad",
        downloadDir="./data",
        dirTemplate="%collection_id/%PatientID/%Modality"
    )
    # Results in: ./data/tcga_luad/TCGA-05-4244/CT/

    # Flat structure (all files in one directory)
    client.download_from_selection(
        seriesInstanceUID=list(series_df['SeriesInstanceUID'].values),
        downloadDir="./data/flat",
        dirTemplate=""
    )
    # Results in: ./data/flat/*.dcm
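To preview the folder layout a dirTemplate would produce before starting a download, the substitution can be imitated locally. The sketch below illustrates the %-placeholder idea only; it is not how idc-index implements dirTemplate, and preview_dir plus the UID values in the sample row are hypothetical.

```python
DEFAULT_TEMPLATE = "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"


def preview_dir(template, row):
    """Substitute %-prefixed column names in template with values from row."""
    path = template
    # Replace longer names first so a name that is a prefix of another
    # cannot clobber the longer placeholder
    for key in sorted(row, key=len, reverse=True):
        path = path.replace(f"%{key}", str(row[key]))
    return path


# Sample metadata for one series (UID values are made up for illustration)
sample_row = {
    "collection_id": "tcga_luad",
    "PatientID": "TCGA-05-4244",
    "StudyInstanceUID": "1.2.3",
    "SeriesInstanceUID": "1.2.3.4",
    "Modality": "CT",
}

print(preview_dir(DEFAULT_TEMPLATE, sample_row))
print(preview_dir("%collection_id/%PatientID/%Modality", sample_row))
```

Explicit key replacement is used instead of a regular expression such as %(\w+) because \w also matches the underscore, which would misread %Modality_ in the default template.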

    Command-Line Download

    The idc download command provides command-line access to download functionality without writing Python code. Available after installing idc-index.

    Auto-detects input type: manifest file path, or identifiers (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid).

    # Download entire collection
    idc download rider_pilot --download-dir ./data

    # Download specific series by UID
    idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data

    # Download multiple items (comma-separated)
    idc download "tcga_luad,tcga_lusc" --download-dir ./data

    # Download from manifest file (auto-detected)
    idc download manifest.txt --download-dir ./data

    Options:

    | Option | Description |
    |---|---|
    | --download-dir | Output directory (default: current directory) |
    | --dir-template | Directory hierarchy template (default: %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID) |
    | --log-level | Verbosity: debug, info, warning, error, critical |

    Manifest files:

    Manifest files contain S3 URLs (one per line) and can be:

  • Exported from the IDC Portal after cohort selection

  • Shared by collaborators for reproducible data access

  • Generated programmatically from query results

    Format (one S3 URL per line):

    s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/*
    s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/*
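Before passing a manifest to idc download, it can help to confirm that every line has the expected shape. The sketch below assumes the plain one-URL-per-line format shown above, with a series-level UUID folder and an optional trailing wildcard; parse_manifest is a hypothetical helper, and manifests exported in other formats would need a different pattern.

```python
import re

# s3://<bucket>/<36-character UUID>/ with an optional trailing "*"
_LINE = re.compile(r"^s3://(?P<bucket>[a-z0-9.-]+)/(?P<uuid>[0-9a-f-]{36})/\*?$")


def parse_manifest(text):
    """Return (bucket, series_uuid) pairs; raise ValueError on malformed lines."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"line {number}: not an S3 series URL: {line!r}")
        entries.append((match["bucket"], match["uuid"]))
    return entries


manifest_text = """\
s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/
s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/
"""

for bucket, series_uuid in parse_manifest(manifest_text):
    print(bucket, series_uuid)
```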

    Example: Generate manifest from Python query:

    from idc_index import IDCClient

    client = IDCClient()

    # Query for series URLs
    results = client.sql_query("""
    SELECT series_aws_url
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    """)

    # Save as manifest file
    with open('ct_manifest.txt', 'w') as f:
        for url in results['series_aws_url']:
            f.write(url + '\n')

    Then download:

    idc download ct_manifest.txt --download-dir ./ct_data

    4. Visualizing IDC Images

    View DICOM data in browser without downloading:

    from idc_index import IDCClient
    import webbrowser

    client = IDCClient()

    # First query to get valid UIDs
    results = client.sql_query("""
    SELECT SeriesInstanceUID, StudyInstanceUID
    FROM index
    WHERE collection_id = 'rider_pilot' AND Modality = 'CT'
    LIMIT 1
    """)

    # View single series
    viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID'])
    webbrowser.open(viewer_url)

    # View all series in a study (useful for multi-series exams like MRI protocols)
    viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID'])
    webbrowser.open(viewer_url)

    The method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is useful when a DICOM Study contains multiple Series (e.g., T1, T2, DWI sequences from a single MRI session).

    5. Understanding and Checking Licenses

    Check data licensing before use (critical for commercial applications):

    from idc_index import IDCClient

    client = IDCClient()

    # Check licenses for all collections
    query = """
    SELECT DISTINCT
    collection_id,
    license_short_name,
    COUNT(DISTINCT SeriesInstanceUID) as series_count
    FROM index
    GROUP BY collection_id, license_short_name
    ORDER BY collection_id
    """

    licenses = client.sql_query(query)
    print(licenses)

    License types in IDC:

  • CC BY 4.0 / CC BY 3.0 (~97% of data) - Allows commercial use with attribution

  • CC BY-NC 4.0 / CC BY-NC 3.0 (~3% of data) - Non-commercial use only

  • Custom licenses (rare) - Some collections have specific terms (e.g., NLM Terms and Conditions)

    Important: Always check the license before using IDC data in publications or commercial applications. Each DICOM file is tagged with its specific license in metadata.
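A license check can also be applied in Python after a query, for example to flag rows in a DataFrame. The helper below is a conservative sketch: it treats only CC BY licenses without the NC clause as commercially usable and returns False for everything else, including custom licenses that need manual review. allows_commercial_use is a hypothetical name, and the string matching assumes license_short_name values follow the forms listed above.

```python
def allows_commercial_use(license_short_name):
    """Return True only for CC BY licenses without the NC clause.

    Unknown, missing, or custom licenses return False so that they are
    reviewed by a person rather than assumed to be permissive.
    """
    if not isinstance(license_short_name, str):
        return False
    name = license_short_name.strip().upper()
    return name.startswith("CC BY") and "NC" not in name


for name in ["CC BY 4.0", "CC BY 3.0", "CC BY-NC 4.0", "NLM Terms and Conditions", None]:
    print(name, "->", allows_commercial_use(name))
```

With a query result, licenses["license_short_name"].map(allows_commercial_use) would give a boolean column for filtering.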

    Generating Citations for Attribution

    The source_DOI column contains DOIs linking to publications describing how the data was generated. To satisfy attribution requirements, use citations_from_selection() to generate properly formatted citations:

    from idc_index import IDCClient

    client = IDCClient()

    # Get citations for a collection (APA format by default)
    citations = client.citations_from_selection(collection_id="rider_pilot")
    for citation in citations:
        print(citation)

    # Get citations for specific series
    results = client.sql_query("""
        SELECT SeriesInstanceUID FROM index
        WHERE collection_id = 'tcga_luad' LIMIT 5
    """)
    citations = client.citations_from_selection(
        seriesInstanceUID=list(results['SeriesInstanceUID'].values)
    )

    # Alternative format: BibTeX (for LaTeX documents)
    bibtex_citations = client.citations_from_selection(
        collection_id="tcga_luad",
        citation_format=IDCClient.CITATION_FORMAT_BIBTEX
    )

    Parameters:

  • collection_id: Filter by collection(s)

  • patientId: Filter by patient ID(s)

  • studyInstanceUID: Filter by study UID(s)

  • seriesInstanceUID: Filter by series UID(s)

  • citation_format: Use IDCClient.CITATION_FORMAT_* constants:

      - CITATION_FORMAT_APA (default) - APA style
      - CITATION_FORMAT_BIBTEX - BibTeX for LaTeX
      - CITATION_FORMAT_JSON - CSL JSON
      - CITATION_FORMAT_TURTLE - RDF Turtle

    Best practice: When publishing results using IDC data, include the generated citations to properly attribute the data sources and satisfy license requirements.

    6. Batch Processing and Filtering

    Process large datasets efficiently with filtering:

    from idc_index import IDCClient
    import pandas as pd

    client = IDCClient()

    # Find chest CT scans from GE scanners
    query = """
    SELECT
    SeriesInstanceUID,
    PatientID,
    collection_id,
    ManufacturerModelName
    FROM index
    WHERE Modality = 'CT'
    AND BodyPartExamined = 'CHEST'
    AND Manufacturer = 'GE MEDICAL SYSTEMS'
    AND license_short_name = 'CC BY 4.0'
    LIMIT 100
    """

    results = client.sql_query(query)

    # Save manifest for later
    results.to_csv('lung_ct_manifest.csv', index=False)

    # Download in batches to avoid timeout
    batch_size = 10
    for i in range(0, len(results), batch_size):
        batch = results.iloc[i:i+batch_size]
        client.download_from_selection(
            seriesInstanceUID=list(batch['SeriesInstanceUID'].values),
            downloadDir=f"./data/batch_{i//batch_size}"
        )

    7. Advanced Queries with BigQuery

    For queries requiring full DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires GCP account with billing enabled.

    Quick reference:

  • Dataset: bigquery-public-data.idc_current.*

  • Main table: dicom_all (combined metadata)

  • Full metadata: dicom_metadata (all DICOM tags)

  • Private elements: OtherElements column (vendor-specific tags like diffusion b-values)

    See references/bigquery_guide.md for setup, table schemas, query patterns, private element access, and cost optimization.
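For orientation, a query against the dicom_all table might look like the sketch below. It assumes the google-cloud-bigquery package is installed, that application default credentials and a billing-enabled project are configured, and that dicom_all exposes collection_id, Modality, and SeriesInstanceUID columns as the idc-index tables do. MODALITY_COUNT_SQL and modality_counts are hypothetical names; references/bigquery_guide.md remains the authoritative source for setup and schemas.

```python
# Count series per modality within one collection, using a named query parameter
MODALITY_COUNT_SQL = """
SELECT Modality, COUNT(DISTINCT SeriesInstanceUID) AS series_count
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE collection_id = @collection_id
GROUP BY Modality
ORDER BY series_count DESC
"""


def modality_counts(collection_id, project=None):
    """Run MODALITY_COUNT_SQL and return a list of (Modality, series_count)."""
    # Imported lazily so the module loads without google-cloud-bigquery installed
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("collection_id", "STRING", collection_id)
        ]
    )
    rows = client.query(MODALITY_COUNT_SQL, job_config=job_config).result()
    return [(row["Modality"], row["series_count"]) for row in rows]
```

Calling modality_counts("rider_pilot") would run the query and incur BigQuery usage against the configured project. Passing the collection as a query parameter, rather than formatting it into the SQL string, avoids quoting problems.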

    8. Tool Selection Guide

    | Task | Tool | Reference |
    |---|---|---|
    | Programmatic queries & downloads | idc-index | This document |
    | Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ |
    | Complex metadata queries | BigQuery | references/bigquery_guide.md |
    | 3D visualization & analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser |

    Default choice: Use idc-index for most tasks (no auth, easy API, batch downloads).

    9. Integration with Analysis Pipelines

    Integrate IDC data into imaging analysis workflows:

    Read downloaded DICOM files:

    import pydicom
    import os

    # Read DICOM files from downloaded series
    series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..."

    dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir)
                   if f.endswith('.dcm')]

    # Load first image
    ds = pydicom.dcmread(dicom_files[0])
    print(f"Patient ID: {ds.PatientID}")
    print(f"Modality: {ds.Modality}")
    print(f"Image shape: {ds.pixel_array.shape}")

    Build 3D volume from CT series:

    import pydicom
    import numpy as np
    from pathlib import Path

    def load_ct_series(series_path):
        """Load CT series as 3D numpy array"""
        files = sorted(Path(series_path).glob('*.dcm'))
        slices = [pydicom.dcmread(str(f)) for f in files]

        # Sort by slice position along the patient z-axis
        slices.sort(key=lambda x: float(x.ImagePositionPatient[2]))

        # Stack into 3D array
        volume = np.stack([s.pixel_array for s in slices])

        return volume, slices[0]  # Return volume and first slice for metadata

    volume, metadata = load_ct_series("./data/lung_ct/series_dir")
    print(f"Volume shape: {volume.shape}")  # (z, y, x)

    Integrate with SimpleITK:

    import SimpleITK as sitk

    # Read DICOM series
    series_path = "./data/ct_series"
    reader = sitk.ImageSeriesReader()
    dicom_names = reader.GetGDCMSeriesFileNames(series_path)
    reader.SetFileNames(dicom_names)
    image = reader.Execute()

    # Apply processing (CurvatureFlow expects a floating-point image, so cast first)
    image_float = sitk.Cast(image, sitk.sitkFloat32)
    smoothed = sitk.CurvatureFlow(image1=image_float, timeStep=0.125, numberOfIterations=5)

    # Save as NIfTI
    sitk.WriteImage(smoothed, "processed_volume.nii.gz")

    Common Use Cases

    Use Case 1: Find and Download Lung CT Scans for Deep Learning

    Objective: Build training dataset of lung CT scans from NLST collection

    Steps:

    from idc_index import IDCClient

    client = IDCClient()

    # 1. Query for lung CT scans with specific criteria
    query = """
    SELECT
    PatientID,
    SeriesInstanceUID,
    SeriesDescription
    FROM index
    WHERE collection_id = 'nlst'
    AND Modality = 'CT'
    AND BodyPartExamined = 'CHEST'
    AND license_short_name = 'CC BY 4.0'
    ORDER BY PatientID
    LIMIT 100
    """

    results = client.sql_query(query)
    print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients")

    # 2. Download data organized by patient
    client.download_from_selection(
        seriesInstanceUID=list(results['SeriesInstanceUID'].values),
        downloadDir="./training_data",
        dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID"
    )

    # 3. Save manifest for reproducibility
    results.to_csv('training_manifest.csv', index=False)

    Use Case 2: Query Brain MRI by Manufacturer for Quality Study

    Objective: Compare image quality across different MRI scanner manufacturers

    Steps:

    from idc_index import IDCClient
    import pandas as pd

    client = IDCClient()

    # Query for brain MRI grouped by manufacturer
    query = """
    SELECT
    Manufacturer,
    ManufacturerModelName,
    COUNT(DISTINCT SeriesInstanceUID) as num_series,
    COUNT(DISTINCT PatientID) as num_patients
    FROM index
    WHERE Modality = 'MR'
    AND BodyPartExamined LIKE '%BRAIN%'
    GROUP BY Manufacturer, ManufacturerModelName
    HAVING num_series >= 10
    ORDER BY num_series DESC
    """

    manufacturers = client.sql_query(query)
    print(manufacturers)

    # Download sample from each manufacturer for comparison
    for _, row in manufacturers.head(3).iterrows():
        mfr = row['Manufacturer']
        model = row['ManufacturerModelName']

        query = f"""
        SELECT SeriesInstanceUID
        FROM index
        WHERE Manufacturer = '{mfr}'
        AND ManufacturerModelName = '{model}'
        AND Modality = 'MR'
        AND BodyPartExamined LIKE '%BRAIN%'
        LIMIT 5
        """

        series = client.sql_query(query)
        client.download_from_selection(
            seriesInstanceUID=list(series['SeriesInstanceUID'].values),
            downloadDir=f"./quality_study/{mfr.replace(' ', '_')}"
        )

    Use Case 3: Visualize Series Without Downloading

    Objective: Preview imaging data before committing to download

    from idc_index import IDCClient
    import webbrowser

    client = IDCClient()

    series_list = client.sql_query("""
    SELECT SeriesInstanceUID, PatientID, SeriesDescription
    FROM index
    WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT'
    LIMIT 10
    """)

    # Preview each in browser
    for _, row in series_list.iterrows():
        viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID'])
        print(f"Patient {row['PatientID']}: {row['SeriesDescription']}")
        print(f" View at: {viewer_url}")
        # webbrowser.open(viewer_url)  # Uncomment to open automatically

    For additional visualization options, see the IDC Portal getting started guide or SlicerIDCBrowser for 3D Slicer integration.

    Use Case 4: License-Aware Batch Download for Commercial Use

    Objective: Download only CC-BY licensed data suitable for commercial applications

    Steps:

    from idc_index import IDCClient

    client = IDCClient()

    # Query ONLY for CC BY licensed data (allows commercial use with attribution)
    query = """
    SELECT
    SeriesInstanceUID,
    collection_id,
    PatientID,
    Modality
    FROM index
    WHERE license_short_name LIKE 'CC BY%'
    AND license_short_name NOT LIKE '%NC%'
    AND Modality IN ('CT', 'MR')
    AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN')
    LIMIT 200
    """

    cc_by_data = client.sql_query(query)

    print(f"Found {len(cc_by_data)} CC BY licensed series")
    print(f"Collections: {cc_by_data['collection_id'].unique()}")

    # Download with license verification
    client.download_from_selection(
    seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values),
    downloadDir="./commercial_dataset",
    dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID"
    )

    # Save license information
    cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False)
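
    Before shipping a dataset built this way, it can help to re-check the saved manifest independently of the SQL filter. The sketch below mirrors the WHERE clause above in plain Python (names starting with "CC BY" and not containing "NC"). It assumes each manifest row carries a license_short_name value; rows without one are rejected rather than assumed safe. The functions is_commercial_friendly and audit_manifest are hypothetical helpers, not part of idc-index.

```python
from collections import Counter


def is_commercial_friendly(license_short_name):
    """Mirror the SQL filter: starts with 'CC BY' and has no 'NC' clause."""
    if not license_short_name:
        return False  # a missing license is treated as not cleared
    return license_short_name.startswith("CC BY") and "NC" not in license_short_name


def audit_manifest(rows):
    """Summarize licenses in a manifest and list series that fail the check.

    rows: iterable of dicts with 'SeriesInstanceUID' and, ideally,
    'license_short_name' keys (for example from csv.DictReader).
    Returns (license_counts, rejected_series_uids).
    """
    rows = list(rows)
    counts = Counter(row.get("license_short_name") for row in rows)
    rejected = [
        row["SeriesInstanceUID"]
        for row in rows
        if not is_commercial_friendly(row.get("license_short_name"))
    ]
    return counts, rejected
```

    An empty rejected list is the expected outcome for a manifest produced by the query above; anything else signals that the manifest and the filter have drifted apart.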

    Best Practices

  • Check licenses before use - Always query the license_short_name field and respect licensing terms (CC BY vs CC BY-NC)

  • Generate citations for attribution - Use citations_from_selection() to get properly formatted citations from source_DOI values; include these in publications

  • Start with small queries - Use LIMIT clause when exploring to avoid long downloads and understand data structure

  • Use mini-index for simple queries - Only use BigQuery when you need comprehensive metadata or complex JOINs

  • Organize downloads with dirTemplate - Use meaningful directory structures like %collection_id/%PatientID/%Modality

  • Cache query results - Save DataFrames to CSV files to avoid re-querying and ensure reproducibility

  • Estimate size first - Check the total size of your selection before downloading; some collections are terabytes in size

  • Save manifests - Always save query results with Series UIDs for reproducibility and data provenance

  • Read documentation - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/

  • Use IDC forum - Search existing questions and answers, or ask the IDC maintainers and users, at https://discourse.canceridc.dev/

    Troubleshooting

    Issue: ModuleNotFoundError: No module named 'idc_index'

  • Cause: idc-index package not installed

  • Solution: Install with pip install --upgrade idc-index

    Issue: Download fails with connection timeout

  • Cause: Network instability or large download size

  • Solution:
    - Download smaller batches (e.g., 10-20 series at a time)
    - Check network connection
    - Use dirTemplate to organize downloads by batch
    - Implement retry logic with delays
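
    The batching and retry advice above can be sketched as a small wrapper. This is a minimal illustration, not an idc-index feature: chunked and download_in_batches are hypothetical helpers, and the sketch assumes that a failed download surfaces as a Python exception from the callable you pass in.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_in_batches(series_uids, download_batch, batch_size=10,
                        max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call download_batch(batch) per batch, retrying with exponential backoff.

    download_batch: any callable that takes a list of SeriesInstanceUIDs
    and raises an exception on failure (an assumption of this sketch).
    Returns the list of batches that still failed after max_attempts.
    """
    failed = []
    for batch in chunked(list(series_uids), batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                download_batch(batch)
                break  # batch succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    failed.append(batch)
                else:
                    # wait 5 s, 10 s, 20 s, ... between attempts
                    sleep(base_delay * 2 ** (attempt - 1))
    return failed
```

    With idc-index, download_batch could be a lambda that calls client.download_from_selection(seriesInstanceUID=batch, downloadDir="./data"). Any batches returned in the failed list can be retried later or reported.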

    Issue: BigQuery quota exceeded or billing errors

  • Cause: BigQuery requires billing-enabled GCP project

  • Solution: Use idc-index mini-index for simple queries (no billing required), or see references/bigquery_guide.md for cost optimization tips

    Issue: Series UID not found or no data returned

  • Cause: Typo in UID, data not in current IDC version, or wrong field name

  • Solution:
    - Check if data is in current IDC version (some old data may be deprecated)
    - Use LIMIT 5 to test query first
    - Check field names against metadata schema documentation

    Issue: Downloaded DICOM files won't open

  • Cause: Corrupted download or incompatible viewer

  • Solution:
    - Check DICOM object type (Modality and SOPClassUID attributes) - some object types require specialized tools
    - Verify file integrity (check file sizes)
    - Use pydicom to validate: pydicom.dcmread(file, force=True)
    - Try different DICOM viewer (3D Slicer, Horos, RadiAnt, QuPath)
    - Re-download the series
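
    As a quick first pass before reaching for pydicom, the integrity check above can be approximated with the standard library alone. DICOM Part 10 files begin with a 128-byte preamble followed by the four bytes "DICM", so an empty file or one missing that marker is a likely candidate for re-download. The sketch below assumes downloaded files use the .dcm extension; has_dicom_preamble and find_suspect_files are hypothetical helper names.

```python
from pathlib import Path


def has_dicom_preamble(path):
    """Return True if the file has a 128-byte preamble followed by b'DICM'."""
    with open(path, "rb") as fh:
        header = fh.read(132)
    return len(header) == 132 and header[128:132] == b"DICM"


def find_suspect_files(download_dir, pattern="*.dcm"):
    """Return files under download_dir that are empty or lack the DICM marker.

    Assumption: downloaded files match `pattern`; adjust it if your
    download layout uses a different extension.
    """
    suspects = []
    for path in sorted(Path(download_dir).rglob(pattern)):
        if path.stat().st_size == 0 or not has_dicom_preamble(path):
            suspects.append(path)
    return suspects
```

    This check only catches truncated or non-DICOM files. A file that passes can still fail to open in a viewer that does not support its object type, which is where the Modality and SOPClassUID check applies.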

    Common SQL Query Patterns

    Quick reference for common queries. For detailed examples with context, see the Core Capabilities section above.

    Discover available filter values


    # What modalities exist?
    client.sql_query("SELECT DISTINCT Modality FROM index")

    # What body parts for a specific modality?
    client.sql_query("""
    SELECT DISTINCT BodyPartExamined, COUNT(*) as n
    FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL
    GROUP BY BodyPartExamined ORDER BY n DESC
    """)

    # What manufacturers for MR?
    client.sql_query("""
    SELECT DISTINCT Manufacturer, COUNT(*) as n
    FROM index WHERE Modality = 'MR'
    GROUP BY Manufacturer ORDER BY n DESC
    """)

    Find annotations and segmentations

    Note: Not all image-derived objects belong to analysis result collections. Some annotations are deposited alongside original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of collection type.

    # Find ALL segmentations and structure sets by DICOM Modality
    # SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy Structure Set
    client.sql_query("""
    SELECT collection_id, Modality, COUNT(*) as series_count
    FROM index
    WHERE Modality IN ('SEG', 'RTSTRUCT')
    GROUP BY collection_id, Modality
    ORDER BY series_count DESC
    """)

    # Find segmentations for a specific collection (includes non-analysis-result items)
    client.sql_query("""
    SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id
    FROM index
    WHERE collection_id = 'tcga_luad' AND Modality = 'SEG'
    """)

    # List analysis result collections (curated derived datasets)
    client.fetch_index("analysis_results_index")
    client.sql_query("""
    SELECT analysis_result_id, analysis_result_title, Collections, Modalities
    FROM analysis_results_index
    """)

    # Find analysis results for a specific source collection
    client.sql_query("""
    SELECT analysis_result_id, analysis_result_title
    FROM analysis_results_index
    WHERE Collections LIKE '%tcga_luad%'
    """)

    # Use seg_index for detailed DICOM Segmentation metadata
    client.fetch_index("seg_index")

    # Get segmentation statistics by algorithm
    client.sql_query("""
    SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count
    FROM seg_index
    WHERE AlgorithmName IS NOT NULL
    GROUP BY AlgorithmName, AlgorithmType
    ORDER BY seg_count DESC
    LIMIT 10
    """)

    # Find segmentations for specific source images (e.g., chest CT)
    client.sql_query("""
    SELECT
    s.SeriesInstanceUID as seg_series,
    s.AlgorithmName,
    s.total_segments,
    s.segmented_SeriesInstanceUID as source_series
    FROM seg_index s
    JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID
    WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST'
    LIMIT 10
    """)

    # Find TotalSegmentator results with source image context
    client.sql_query("""
    SELECT
    seg_info.collection_id,
    COUNT(DISTINCT s.SeriesInstanceUID) as seg_count,
    SUM(s.total_segments) as total_segments
    FROM seg_index s
    JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID
    WHERE s.AlgorithmName LIKE '%TotalSegmentator%'
    GROUP BY seg_info.collection_id
    ORDER BY seg_count DESC
    """)

    Query slide microscopy data


    # sm_index has detailed metadata; join with index for collection_id
    client.fetch_index("sm_index")
    client.sql_query("""
    SELECT i.collection_id, COUNT(*) as slides,
    MIN(s.min_PixelSpacing_2sf) as min_resolution
    FROM sm_index s
    JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID
    GROUP BY i.collection_id
    ORDER BY slides DESC
    """)

    Estimate download size


    # Size for specific criteria
    client.sql_query("""
    SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count
    FROM index
    WHERE collection_id = 'nlst' AND Modality = 'CT'
    """)
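
    The total_mb value returned by the query above is easier to act on once converted to readable units and compared with free disk space. The sketch below is one way to do that, assuming series_size_MB uses decimal megabytes (1 TB = 1,000,000 MB, consistent with the size_TB calculation in the Overview). format_size_mb and fits_on_disk are hypothetical helpers, and the 20% safety margin is an arbitrary choice.

```python
import math


def format_size_mb(total_mb):
    """Format a size given in megabytes using decimal units."""
    if total_mb is None or (isinstance(total_mb, float) and math.isnan(total_mb)):
        return "unknown"  # SUM() over zero rows yields NULL/NaN
    if total_mb >= 1_000_000:
        return f"{total_mb / 1_000_000:.2f} TB"
    if total_mb >= 1_000:
        return f"{total_mb / 1_000:.2f} GB"
    return f"{total_mb:.2f} MB"


def fits_on_disk(total_mb, free_bytes, safety_factor=1.2):
    """Return True if the download, padded by safety_factor, fits in free_bytes."""
    return total_mb * 1_000_000 * safety_factor <= free_bytes
```

    The free_bytes argument can be taken from shutil.disk_usage(".").free for the target download directory.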

    Link to clinical data


    client.fetch_index("clinical_index")

    # Find collections with clinical data and their tables
    client.sql_query("""
    SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns
    FROM clinical_index
    GROUP BY collection_id, table_name
    ORDER BY collection_id
    """)

    See references/clinical_data_guide.md for complete patterns including value mapping and patient cohort selection.

    Related Skills

    The following skills complement IDC workflows for downstream analysis and visualization:

    DICOM Processing


  • pydicom - Read, write, and manipulate downloaded DICOM files. Use for extracting pixel data, reading metadata, anonymization, and format conversion. Essential for working with IDC radiology data (CT, MR, PET).

    Pathology and Slide Microscopy


  • histolab - Lightweight tile extraction and preprocessing for whole slide images. Use for basic slide processing, tissue detection, and dataset preparation from IDC slide microscopy data.

  • pathml - Full-featured computational pathology toolkit. Use for advanced WSI analysis including multiplexed imaging, nucleus segmentation, and ML model training on pathology data downloaded from IDC.

    Metadata Visualization


  • matplotlib - Low-level plotting for full customization. Use for creating static figures summarizing IDC query results (bar charts of modalities, histograms of series counts, etc.).

  • seaborn - Statistical visualization with pandas integration. Use for quick exploration of IDC metadata distributions, relationships between variables, and categorical comparisons with attractive defaults.

  • plotly - Interactive visualization. Use when you need hover info, zoom, and pan for exploring IDC metadata, or for creating web-embeddable dashboards of collection statistics.

    Data Exploration


  • exploratory-data-analysis - Comprehensive EDA on scientific data files. Use after downloading IDC data to understand file structure, quality, and characteristics before analysis.

    Resources

    Schema Reference (Primary Source)

    Always use client.indices_overview for current column schemas. This ensures accuracy with the installed idc-index version:

    # Get all column names and types for any table
    schema = client.indices_overview["index"]["schema"]
    columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']]
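
    When a table has many columns, searching the schema by keyword is often quicker than reading the full list. The sketch below assumes only the structure used above (a "columns" list whose entries have "name", "type", and an optional "description"). find_columns is a hypothetical helper, and sample_schema is made-up illustration data rather than the real index schema.

```python
def find_columns(schema, keyword):
    """Return (name, type, description) for columns matching keyword.

    Matching is case-insensitive and checks both the column name and
    its description.
    """
    keyword = keyword.lower()
    matches = []
    for col in schema["columns"]:
        description = col.get("description", "") or ""
        if keyword in col["name"].lower() or keyword in description.lower():
            matches.append((col["name"], col["type"], description))
    return matches


# Illustration data only; replace with client.indices_overview["index"]["schema"]
sample_schema = {
    "columns": [
        {"name": "collection_id", "type": "STRING", "description": "collection identifier"},
        {"name": "license_short_name", "type": "STRING", "description": "short name of the license"},
        {"name": "series_size_MB", "type": "FLOAT"},
    ]
}

for name, col_type, description in find_columns(sample_schema, "license"):
    print(f"{name} ({col_type}): {description}")
```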

    Reference Documentation

  • clinical_data_guide.md - Clinical/tabular data navigation, value mapping, and joining with imaging data

  • cloud_storage_guide.md - Direct cloud bucket access (S3/GCS), file organization, CRDC UUIDs, versioning, and reproducibility

  • cli_guide.md - Complete idc-index command-line interface reference (idc download, idc download-from-manifest, idc download-from-selection)

  • bigquery_guide.md - Advanced BigQuery usage guide for complex metadata queries

  • dicomweb_guide.md - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details

  • indices_reference - External documentation for index tables (may be ahead of the installed version)

    External Links

  • IDC Portal: https://portal.imaging.datacommons.cancer.gov/explore/

  • Documentation: https://learn.canceridc.dev/

  • Tutorials: https://github.com/ImagingDataCommons/IDC-Tutorials

  • User Forum: https://discourse.canceridc.dev/

  • idc-index GitHub: https://github.com/ImagingDataCommons/idc-index

  • Citation: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180

    Skill Updates

    This skill version is available in skill metadata. To check for updates:

  • Visit the releases page

  • Watch the repository on GitHub (Watch → Custom → Releases)