
Automate Code Optimization with Weco and Langfuse

This notebook provides a step-by-step guide on integrating Weco with Langfuse to automatically optimize LLM application code using Langfuse datasets, code evaluators, and managed LLM-as-a-Judge evaluators.

What is Weco? Weco (GitHub) is a code optimization platform. Given a source file and an evaluation function, Weco's optimizer iteratively edits the code, re-evaluates, and keeps the version that scores best. It works with any measurable metric — accuracy, latency, cost, or a custom composite score.

What is Langfuse? Langfuse is an open-source LLM engineering platform. It offers tracing and monitoring capabilities for AI applications. Langfuse helps developers debug, analyze, and optimize their AI systems by providing detailed insights and integrating with a wide array of tools and frameworks through native integrations, OpenTelemetry, and dedicated SDKs.

How It Works

Weco connects to Langfuse as an evaluation backend. On each optimization step, Weco:

  1. Edits your source code (e.g. prompts, parsing logic)
  2. Runs your target function against every item in a Langfuse dataset
  3. Collects scores from local code evaluators and/or managed evaluators (LLM-as-a-Judge) configured in the Langfuse UI
  4. Combines scores into a single metric and keeps the best-performing version

Each iteration creates a new experiment run in Langfuse so you can compare all variants side-by-side in the Langfuse dashboard.
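The loop above can be sketched in a few lines of Python. This is an illustrative simplification, not Weco's actual implementation, and the function names are hypothetical:

```python
def evaluate_variant(target, dataset_items, evaluators, metric):
    """One evaluation pass: run the target on every dataset item,
    score each output, and average the combined metric."""
    per_item = []
    for item in dataset_items:
        output = target(item["input"])
        scores = {
            name: fn(input=item["input"], output=output,
                     expected_output=item.get("expected_output"))
            for name, fn in evaluators.items()
        }
        per_item.append(metric(scores))
    return sum(per_item) / len(per_item)


# Tiny stub run showing the shape of the pieces involved:
items = [{"input": {"question": "2+2?"}, "expected_output": {"expected_answer": "4"}}]
target = lambda inputs: {"answer": "4"}
evaluators = {
    "exact": lambda *, input, output, expected_output:
        1.0 if output["answer"] == expected_output["expected_answer"] else 0.0
}
metric = lambda scores: scores["exact"]
print(evaluate_variant(target, items, evaluators, metric))  # 1.0
```

Weco then compares this single number across code variants and keeps the best one.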

Getting Started

Let's walk through a practical example of using Weco with Langfuse to optimize a simple QA function.

Step 1: Install Dependencies

For this example we'll install the weco client in a virtual environment. For global installation instructions and usage with agent skills, please refer to Weco's docs.

!pip install "weco[langfuse]" langfuse openai -q

Authenticate with Weco:

!weco login

Step 2: Configure Langfuse SDK

Set up your Langfuse API keys. You can get these keys by signing up for a free Langfuse Cloud account or by self-hosting Langfuse. These environment variables are essential for the Langfuse client to authenticate and send data to your Langfuse project.

import os

# Get keys for your project from the project settings page: https://cloud.langfuse.com
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***"
os.environ["LANGFUSE_BASE_URL"] = "https://us.cloud.langfuse.com"  # 🇺🇸 US region
#os.environ["LANGFUSE_BASE_URL"] = "https://cloud.langfuse.com"  # 🇪🇺 EU region

# Your OpenAI key
os.environ["OPENAI_API_KEY"] = "sk-proj-***"

With the environment variables set, we can now initialize the Langfuse client. get_client() initializes the Langfuse client using the credentials provided in the environment variables.

from langfuse import get_client

langfuse = get_client()

# Verify connection
if langfuse.auth_check():
    print("Langfuse client is authenticated and ready!")
else:
    print("Authentication failed. Please check your credentials and host.")

Step 3: Create a Dataset in Langfuse

Weco evaluates your code against a Langfuse dataset. Each dataset item has an input dict and an optional expected_output dict. Let's create a small QA dataset.

dataset = langfuse.create_dataset(name="weco-demo-qa")

qa_pairs = [
    {
        "input": {"question": "What is the capital of France?"},
        "expected_output": {"expected_answer": "Paris"},
    },
    {
        "input": {"question": "What is the largest planet in our solar system?"},
        "expected_output": {"expected_answer": "Jupiter"},
    },
    {
        "input": {"question": "Who wrote Romeo and Juliet?"},
        "expected_output": {"expected_answer": "William Shakespeare"},
    },
    {
        "input": {"question": "What is the boiling point of water in Celsius?"},
        "expected_output": {"expected_answer": "100 degrees Celsius"},
    },
    {
        "input": {"question": "What year did the Berlin Wall fall?"},
        "expected_output": {"expected_answer": "1989"},
    },
]

for pair in qa_pairs:
    langfuse.create_dataset_item(
        dataset_name="weco-demo-qa",
        input=pair["input"],
        expected_output=pair["expected_output"],
    )

langfuse.flush()
print(f"Created dataset 'weco-demo-qa' with {len(qa_pairs)} items.")

Step 4: Write the Target Function

The target function is the code Weco will optimize. It receives an inputs dict from the dataset and returns a dict of outputs. Langfuse calls this function once per dataset item during each evaluation run.

Save this as agent.py in your working directory:

%%writefile agent.py
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a helpful assistant. Answer the question concisely in one sentence."


def answer_question(inputs: dict) -> dict:
    question = inputs.get("question", "")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return {"answer": response.choices[0].message.content}

Weco will iteratively edit agent.py — modifying SYSTEM_PROMPT, response parsing, or other logic — to improve the evaluation metric.
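Before running the optimization, it can help to confirm that the target function honors the inputs-dict → outputs-dict contract. The check below uses a hypothetical stub in place of the real OpenAI call so it runs offline:

```python
def stub_answer_question(inputs: dict) -> dict:
    # Offline stand-in for agent.answer_question (no OpenAI call).
    return {"answer": f"Stub answer to: {inputs.get('question', '')}"}


def satisfies_target_contract(target) -> bool:
    # Weco expects: dict in, dict out, with values the evaluators can score.
    out = target({"question": "What is the capital of France?"})
    return isinstance(out, dict) and isinstance(out.get("answer"), str)


print(satisfies_target_contract(stub_answer_question))  # True
```

Once this passes with the stub, swapping in the real `answer_question` only changes where the answer text comes from, not the contract.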

Step 5: Write an Evaluator and Metric Function

Code evaluators are Python functions that score each target function output. They receive keyword arguments and return a Langfuse Evaluation object.

The metric function combines all evaluator scores into a single number that Weco optimizes.

Save this as evaluators.py:

%%writefile evaluators.py
from langfuse import Evaluation


def answer_quality(*, input, output, expected_output=None, **kwargs):
    """Check that the answer is non-empty and reasonably concise."""
    answer = (output or {}).get("answer", "")
    if not answer:
        return Evaluation(name="answer_quality", value=0.0, comment="Empty answer")
    word_count = len(answer.split())
    score = 1.0 if word_count <= 50 else max(0.0, 1.0 - (word_count - 50) / 100)
    return Evaluation(
        name="answer_quality",
        value=score,
        comment=f"{word_count} words",
    )


def qa_metric(scores: dict) -> float:
    """Combine evaluator scores into a single optimization target.

    Multiplies answer_quality by Correctness (a managed evaluator
    configured in the Langfuse UI) so that only correct, concise
    answers score well.
    """
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)
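A quick way to sanity-check the scoring before involving Weco is to replay the word-count rule and the metric on plain values. The snippet below re-implements the same logic without the `Evaluation` wrapper so it runs standalone:

```python
def concise_score(answer: str) -> float:
    # Mirrors answer_quality's word-count rule.
    if not answer:
        return 0.0
    n = len(answer.split())
    return 1.0 if n <= 50 else max(0.0, 1.0 - (n - 50) / 100)


def qa_metric(scores: dict) -> float:
    # Mirrors the combined metric from evaluators.py.
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)


print(concise_score("Paris"))                 # 1.0 (1 word, well under the limit)
print(round(concise_score("word " * 60), 2))  # 0.9 (60 words -> 1 - 10/100)
print(qa_metric({"answer_quality": 1.0, "Correctness": 0.0}))  # 0.0: wrong answers score zero
```

Because the metric is a product, an answer must be both concise and judged correct to score well; either factor at zero zeroes the whole metric.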

Step 6: Configure a Managed Evaluator in Langfuse

Managed evaluators are LLM-as-a-Judge evaluators that run server-side on experiment traces. Set one up in the Langfuse UI:

  1. Go to your project in Langfuse
  2. Navigate to Evaluation → LLM-as-a-Judge
  3. Click + Set up evaluator

Create a Correctness evaluator:

  • Name: Correctness
  • Score: 0 or 1 (binary factual accuracy)
  • Variable mappings:
    • {{input}} → $.input.question
    • {{output}} → $.output.answer
    • {{expected_output}} → $.expected_output.expected_answer

Important: Evaluator names are case-sensitive. The name in the Langfuse UI (e.g. Correctness) must exactly match the name passed to --langfuse-managed-evaluators and the key used in your metric function (scores.get("Correctness")).

Managed evaluators run asynchronously after each experiment. Weco automatically polls for their scores (up to 15 minutes by default). Adjust the timeout with --langfuse-managed-evaluator-timeout.

Step 7: Run the Optimization

With the dataset, target function, evaluators, and metric function in place, run Weco from the command line. Weco uses Langfuse as the evaluation backend, running your target function against the dataset on each iteration and tracking progress as experiment runs in Langfuse.

!weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset weco-demo-qa \
  --langfuse-target agent:answer_question \
  --langfuse-evaluators evaluators:answer_quality \
  --langfuse-managed-evaluators Correctness \
  --langfuse-metric-function evaluators:qa_metric \
  --metric qa_metric --goal maximize --steps 5 \
  --output plain  # For Jupyter-friendly formatting

Step 8: View Results

Each optimization step creates a new experiment run in Langfuse. Navigate to Datasets → weco-demo-qa → Runs to compare inputs, outputs, and evaluator scores across all variants side-by-side.

You can also track iteration-by-iteration progress in the Weco dashboard, which shows metric scores and the exact code changes Weco made at each step. When the run completes, you'll be prompted to apply the best-performing version to your source file.

Full Example: End-to-End QA Optimization

A complete working example with a richer dataset, multiple evaluators, and holdout validation is available in the Weco CLI repository:

git clone https://github.com/WecoAI/weco-cli.git
cd weco-cli/examples/langfuse-zeph-hr-qa

This example optimizes an HR QA agent against a dataset of policy questions, using both local code evaluators and managed LLM-as-a-Judge evaluators to measure correctness and helpfulness. See the full tutorial in the Weco docs for a detailed walkthrough including dataset setup, evaluator configuration, and holdout validation.

Troubleshooting

No experiment runs appear in Langfuse

  • Verify that LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_BASE_URL are exported in the same shell session where you run weco.
  • Check that the dataset name passed to --langfuse-dataset matches an existing dataset in your Langfuse project.

Managed evaluator scores never arrive

  • Confirm the evaluator name passed to --langfuse-managed-evaluators exactly matches the name in the Langfuse UI (case-sensitive).
  • Verify variable mappings in the evaluator setup by using the live preview on a historical trace.
  • Increase the polling timeout with --langfuse-managed-evaluator-timeout if evaluators are slow.

Metric stays at 0

  • Print intermediate evaluator outputs to check score scales and key names.
  • Confirm that the keys used in your metric function (scores.get("Correctness")) match the evaluator names exactly.
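One low-effort way to do that is a drop-in debug variant of the metric that prints whatever keys it receives. This assumes the `qa_metric` from Step 5; substitute your own evaluator names:

```python
def qa_metric_debug(scores: dict) -> float:
    # Log the raw scores so missing keys or wrong scales are obvious.
    print("evaluator scores:", scores)
    return scores.get("answer_quality", 0.0) * scores.get("Correctness", 0.0)


# A typo in a key name silently yields 0:
qa_metric_debug({"answer_quality": 1.0, "correctness": 1.0})  # lowercase "c" -> metric is 0.0
```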

Auth errors

  • Re-check API keys and confirm the base URL matches your region (EU vs US).
  • Verify project-level key permissions in Langfuse project settings.
