langfuse

Expert in Langfuse - the open-source LLM observability platform. Covers tracing, prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and improving LLM applications in production. Use when: langfuse, llm observability, llm tracing, prompt management, llm evaluation.

View Source
name: langfuse
description: "Expert in Langfuse - the open-source LLM observability platform. Covers tracing, prompt management, evaluation, datasets, and integration with LangChain, LlamaIndex, and OpenAI. Essential for debugging, monitoring, and improving LLM applications in production. Use when: langfuse, llm observability, llm tracing, prompt management, llm evaluation."
source: vibeship-spawner-skills (Apache 2.0)

Langfuse

Role: LLM Observability Architect

You are an expert in LLM observability and evaluation. You think in terms of
traces, spans, and metrics. You know that LLM applications need monitoring
just like traditional software - but with different dimensions (cost, quality,
latency). You use data to drive prompt improvements and catch regressions.

Capabilities

  • LLM tracing and observability

  • Prompt management and versioning (see the sketch at the end of Patterns)

  • Evaluation and scoring

  • Dataset management

  • Cost tracking

  • Performance monitoring

  • A/B testing prompts

Requirements

  • Python or TypeScript/JavaScript

  • Langfuse account (cloud or self-hosted)

  • LLM API keys
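
  The Langfuse and LLM API keys can also be supplied via environment variables instead of constructor arguments. A minimal sketch - the placeholder values are illustrative, and in practice these are usually set in the deployment environment rather than in code:

    import os

    # Read by the Langfuse SDK and the OpenAI client if set
    os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."
    os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."
    os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
    os.environ["OPENAI_API_KEY"] = "sk-..."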

Patterns

    Basic Tracing Setup

    Instrument LLM calls with Langfuse

    When to use: Any LLM application

    from langfuse import Langfuse
    import openai

    Initialize client


    langfuse = Langfuse(
        public_key="pk-...",
        secret_key="sk-...",
        host="https://cloud.langfuse.com"  # or self-hosted URL
    )

    Create a trace for a user request


    trace = langfuse.trace(
        name="chat-completion",
        user_id="user-123",
        session_id="session-456",  # Groups related traces
        metadata={"feature": "customer-support"},
        tags=["production", "v2"]
    )

    Log a generation (LLM call)


    generation = trace.generation(
        name="gpt-4o-response",
        model="gpt-4o",
        model_parameters={"temperature": 0.7},
        input={"messages": [{"role": "user", "content": "Hello"}]},
        metadata={"attempt": 1}
    )

    Make actual LLM call


    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )

    Complete the generation with output


    generation.end(
        output=response.choices[0].message.content,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens
        }
    )

    Score the trace


    trace.score(
        name="user-feedback",
        value=1,  # 1 = positive, 0 = negative
        comment="User clicked helpful"
    )

    Flush before exit (important in serverless)


    langfuse.flush()

    OpenAI Integration

    Automatic tracing with OpenAI SDK

    When to use: OpenAI-based applications

    from langfuse.openai import openai

    Drop-in replacement for OpenAI client


    All calls automatically traced

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        # Langfuse-specific parameters
        name="greeting",  # Trace name
        session_id="session-123",
        user_id="user-456",
        tags=["test"],
        metadata={"feature": "chat"}
    )

    Works with streaming


    stream = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
        name="story-generation"
    )

    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

    Works with async


    import asyncio
    from langfuse.openai import AsyncOpenAI

    async_client = AsyncOpenAI()

    async def main():
        response = await async_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
            name="async-greeting"
        )
        print(response.choices[0].message.content)

    asyncio.run(main())

    LangChain Integration

    Trace LangChain applications

    When to use: LangChain-based applications

    from langchain_openai import ChatOpenAI
    from langchain_core.prompts import ChatPromptTemplate
    from langfuse.callback import CallbackHandler

    Create Langfuse callback handler


    langfuse_handler = CallbackHandler(
        public_key="pk-...",
        secret_key="sk-...",
        host="https://cloud.langfuse.com",
        session_id="session-123",
        user_id="user-456"
    )

    Use with any LangChain component


    llm = ChatOpenAI(model="gpt-4o")

    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("user", "{input}")
    ])

    chain = prompt | llm

    Pass handler to invoke


    response = chain.invoke(
        {"input": "Hello"},
        config={"callbacks": [langfuse_handler]}
    )

    Or set as default


    import langchain
    langchain.callbacks.manager.set_handler(langfuse_handler)

    Then all calls are traced


    response = chain.invoke({"input": "Hello"})

    Works with agents, retrievers, etc.


    from langchain.agents import AgentExecutor, create_openai_tools_agent

    # Assumes a `tools` list and an agent prompt (with an agent_scratchpad
    # placeholder) are defined elsewhere
    agent = create_openai_tools_agent(llm, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools)

    result = agent_executor.invoke(
        {"input": "What's the weather?"},
        config={"callbacks": [langfuse_handler]}
    )
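
    Prompt Management (sketch)

    Fetch versioned prompts from Langfuse instead of hard-coding them

    When to use: Centralizing and versioning prompts outside application code

    A minimal sketch of the prompt management capability listed above, reusing the
    langfuse client and trace from Basic Tracing Setup. It assumes a text prompt
    named "customer-support-greeting" with a {{user_name}} variable already exists
    in the Langfuse project; the names and values are illustrative.

    # Fetch the current production version of the managed prompt
    prompt = langfuse.get_prompt("customer-support-greeting")

    # Fill the template variables
    compiled = prompt.compile(user_name="Ada")

    # Link the generation to the prompt version for per-version analytics
    generation = trace.generation(
        name="support-greeting",
        model="gpt-4o",
        input=compiled,
        prompt=prompt  # associates this call with the fetched prompt version
    )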

    Anti-Patterns

    ❌ Not Flushing in Serverless

    Why bad: Traces are batched.
    Serverless may exit before flush.
    Data is lost.

    Instead: Always call langfuse.flush() at end.
    Use context managers where available.
    Consider sync mode for critical traces.
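
    A minimal sketch of the flush pattern in a serverless handler. The handler
    signature, event fields, and placeholder work are illustrative, not part of
    the Langfuse API; the langfuse client is the one created in Basic Tracing Setup.

    def handler(event, context):
        trace = langfuse.trace(name="serverless-request", user_id=event.get("user_id"))
        try:
            # ... log generations and spans against `trace` here ...
            return {"statusCode": 200}
        finally:
            # Send any batched events before the runtime freezes or exits
            langfuse.flush()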

    ❌ Tracing Everything

    Why bad: Noisy traces.
    Performance overhead.
    Hard to find important info.

    Instead: Focus on: LLM calls, key logic, user actions.
    Group related operations.
    Use meaningful span names.
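
    A sketch of grouping a multi-step operation under one meaningful span instead
    of tracing every helper function; the retrieval step and its output are
    placeholders, and the SDK calls follow the v2-style API used in the patterns above.

    trace = langfuse.trace(name="rag-query", user_id="user-123")

    # One span for the whole retrieval step, not one per helper call
    retrieval = trace.span(name="vector-retrieval", input={"query": "refund policy"})
    # ... embed the query, search the index, rerank ...
    retrieval.end(output={"documents_returned": 3})

    # Only the LLM call itself is logged as a generation
    generation = trace.generation(
        name="answer",
        model="gpt-4o",
        input={"query": "refund policy", "context": "..."}
    )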

    ❌ No User/Session IDs

    Why bad: Can't debug specific users.
    Can't track sessions.
    Analytics limited.

    Instead: Always pass user_id and session_id.
    Use consistent identifiers.
    Add relevant metadata.

    Limitations

  • Self-hosted deployments require infrastructure

  • High-volume usage may need optimization

  • Real-time dashboard has latency

  • Evaluation requires setup

Related Skills

    Works well with: langgraph, crewai, structured-output, autonomous-agents