Integrate Preclinical into your CI/CD pipeline to automatically test your AI agents before deployment.

Overview

Running Preclinical tests in CI/CD allows you to:
  • Catch regressions before they reach production
  • Enforce quality gates based on pass rates
  • Track agent performance over time
  • Block deployments that don’t meet safety standards

GitHub Actions

Basic Example

Create .github/workflows/preclinical.yml:
name: AI Agent Testing

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Run Preclinical Tests
        env:
          PRECLINICAL_API_KEY: ${{ secrets.PRECLINICAL_API_KEY }}
          AGENT_ID: ${{ secrets.PRECLINICAL_AGENT_ID }}
        run: |
          # Start test run
          RESPONSE=$(curl -s -X POST https://app.preclinical.dev/api/v1/runs \
            -H "Authorization: Bearer $PRECLINICAL_API_KEY" \
            -H "Content-Type: application/json" \
            -d "{\"agent_id\": \"$AGENT_ID\", \"test_mode\": \"demo\"}")

          RUN_ID=$(echo "$RESPONSE" | jq -r '.id')
          echo "Started test run: $RUN_ID"

          # Poll for completion
          while true; do
            STATUS_RESPONSE=$(curl -s "https://app.preclinical.dev/api/v1/runs/$RUN_ID" \
              -H "Authorization: Bearer $PRECLINICAL_API_KEY")

            STATUS=$(echo "$STATUS_RESPONSE" | jq -r '.status')
            PASS_RATE=$(echo "$STATUS_RESPONSE" | jq -r '.pass_rate // 0')

            echo "Status: $STATUS, Pass Rate: $PASS_RATE%"

            if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ] || [ "$STATUS" = "canceled" ]; then
              break
            fi

            sleep 10
          done

          # Check pass rate threshold
          THRESHOLD=80
          if [ $(echo "$PASS_RATE < $THRESHOLD" | bc -l) -eq 1 ]; then
            echo "❌ Pass rate ($PASS_RATE%) is below threshold ($THRESHOLD%)"
            exit 1
          fi

          echo "✅ Pass rate ($PASS_RATE%) meets threshold ($THRESHOLD%)"
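The `while true` loop above has no upper bound, so a stuck run keeps the job alive until the runner's own timeout. A bounded variant is safer; this is a sketch with illustrative names (`poll_until_done` and `check_status` are not part of the Preclinical API):

```shell
# Bounded polling: give up after a fixed number of attempts instead of
# looping until the CI job's own timeout kills it.
poll_until_done() {
  check_cmd=$1            # command that prints the run's current status
  max_attempts=${2:-60}   # 60 attempts x 10s interval = 10-minute ceiling
  interval=${3:-10}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$("$check_cmd")
    case $status in
      completed|failed|canceled)
        echo "$status"
        return 0
        ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 1
}
```

In the workflow, `check_status` would wrap the curl-and-jq status fetch; a `timeout` result (nonzero exit) then fails the job quickly instead of hanging.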

With Reusable Script

Create scripts/run-preclinical-tests.sh:
#!/bin/bash
set -e

API_KEY=${PRECLINICAL_API_KEY:?API key required}
AGENT_ID=${1:?Agent ID required}
THRESHOLD=${2:-80}
TEST_MODE=${3:-demo}

BASE_URL="https://app.preclinical.dev/api/v1"

echo "🚀 Starting Preclinical test run..."

# Start test run
RESPONSE=$(curl -s -X POST "$BASE_URL/runs" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"agent_id\": \"$AGENT_ID\", \"test_mode\": \"$TEST_MODE\"}")

RUN_ID=$(echo "$RESPONSE" | jq -r '.id')

if [ -z "$RUN_ID" ] || [ "$RUN_ID" = "null" ]; then
  echo "❌ Failed to start test run"
  echo "$RESPONSE" | jq .
  exit 1
fi

echo "📋 Test run started: $RUN_ID"

# Poll for completion
while true; do
  RESULT=$(curl -s "$BASE_URL/runs/$RUN_ID" \
    -H "Authorization: Bearer $API_KEY")

  STATUS=$(echo "$RESULT" | jq -r '.status')
  PASS_RATE=$(echo "$RESULT" | jq -r '.pass_rate // 0')
  PASSED=$(echo "$RESULT" | jq -r '.passed_count // 0')
  FAILED=$(echo "$RESULT" | jq -r '.failed_count // 0')
  TOTAL=$(echo "$RESULT" | jq -r '.total_scenarios // 0')

  echo "  Status: $STATUS | Passed: $PASSED/$TOTAL | Pass Rate: $PASS_RATE%"

  case $STATUS in
    completed|failed|canceled)
      break
      ;;
  esac

  sleep 10
done

# Final results
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "📊 Final Results"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo "  Pass Rate: $PASS_RATE%"
echo "  Passed: $PASSED"
echo "  Failed: $FAILED"
echo "  Threshold: $THRESHOLD%"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"

# Check threshold
if [ $(echo "$PASS_RATE < $THRESHOLD" | bc -l) -eq 1 ]; then
  echo "❌ FAILED: Pass rate below threshold"
  exit 1
fi

echo "✅ PASSED: All quality gates met"
Make the script executable (chmod +x scripts/run-preclinical-tests.sh) and commit it, then call it from your workflow:
name: AI Agent Testing

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Preclinical Tests
        run: ./scripts/run-preclinical-tests.sh ${{ secrets.PRECLINICAL_AGENT_ID }} 85 demo
        env:
          PRECLINICAL_API_KEY: ${{ secrets.PRECLINICAL_API_KEY }}

Using Webhooks for Async Testing

For longer test runs, use webhooks instead of polling:
name: AI Agent Testing (Async)

on:
  push:
    branches: [main]

jobs:
  start-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Start Preclinical Tests
        run: |
          curl -X POST https://app.preclinical.dev/api/v1/runs \
            -H "Authorization: Bearer ${{ secrets.PRECLINICAL_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{
              "agent_id": "${{ secrets.PRECLINICAL_AGENT_ID }}",
              "test_mode": "full"
            }'
Then configure a webhook to notify your GitHub repository when tests complete.
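One way to wire this up, assuming you point the Preclinical webhook at a small relay (or an endpoint of your own) that forwards the completion payload to GitHub as a repository_dispatch event. The event type and client_payload fields below are illustrative, not a documented Preclinical payload:

```yaml
name: Preclinical Results

on:
  repository_dispatch:
    types: [preclinical-complete]

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - name: Check pass rate
        run: |
          PASS_RATE='${{ github.event.client_payload.pass_rate }}'
          echo "Reported pass rate: $PASS_RATE%"
          # Fail the job when the reported rate is below 90%
          awk -v r="$PASS_RATE" 'BEGIN { exit (r < 90) ? 1 : 0 }'
```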

GitLab CI

# .gitlab-ci.yml
stages:
  - test

preclinical-test:
  stage: test
  image: alpine:latest
  before_script:
    - apk add --no-cache curl jq bc
  script:
    - |
      # Start test run
      RESPONSE=$(curl -s -X POST https://app.preclinical.dev/api/v1/runs \
        -H "Authorization: Bearer $PRECLINICAL_API_KEY" \
        -H "Content-Type: application/json" \
        -d "{\"agent_id\": \"$AGENT_ID\", \"test_mode\": \"demo\"}")

      RUN_ID=$(echo "$RESPONSE" | jq -r '.id')
      echo "Started test run: $RUN_ID"

      # Poll for completion
      while true; do
        RESULT=$(curl -s "https://app.preclinical.dev/api/v1/runs/$RUN_ID" \
          -H "Authorization: Bearer $PRECLINICAL_API_KEY")

        STATUS=$(echo "$RESULT" | jq -r '.status')
        PASS_RATE=$(echo "$RESULT" | jq -r '.pass_rate // 0')

        echo "Status: $STATUS, Pass Rate: $PASS_RATE%"

        if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ] || [ "$STATUS" = "canceled" ]; then
          break
        fi

        sleep 10
      done

      # Check threshold
      if [ $(echo "$PASS_RATE < 80" | bc -l) -eq 1 ]; then
        echo "Pass rate below threshold"
        exit 1
      fi
  variables:
    PRECLINICAL_API_KEY: $PRECLINICAL_API_KEY
    AGENT_ID: $PRECLINICAL_AGENT_ID

CircleCI

# .circleci/config.yml
version: 2.1

jobs:
  preclinical-test:
    docker:
      - image: cimg/base:stable
    steps:
      - run:
          name: Run Preclinical Tests
          command: |
            # Start test run
            RESPONSE=$(curl -s -X POST https://app.preclinical.dev/api/v1/runs \
              -H "Authorization: Bearer $PRECLINICAL_API_KEY" \
              -H "Content-Type: application/json" \
              -d "{\"agent_id\": \"$AGENT_ID\", \"test_mode\": \"demo\"}")

            RUN_ID=$(echo "$RESPONSE" | jq -r '.id')

            # Poll and check (simplified)
            sleep 60  # Wait for demo tests

            RESULT=$(curl -s "https://app.preclinical.dev/api/v1/runs/$RUN_ID" \
              -H "Authorization: Bearer $PRECLINICAL_API_KEY")

            PASS_RATE=$(echo "$RESULT" | jq -r '.pass_rate // 0')

            if [ $(echo "$PASS_RATE < 80" | bc -l) -eq 1 ]; then
              exit 1
            fi

workflows:
  test:
    jobs:
      - preclinical-test

Best Practices

Use Demo Mode for PRs

Run demo mode (20 scenarios) on pull requests for faster feedback; use full mode on the main branch.
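In GitHub Actions this can be a single step that picks the mode from the triggering event; a sketch reusing the script from above:

```yaml
      - name: Run Preclinical Tests
        env:
          PRECLINICAL_API_KEY: ${{ secrets.PRECLINICAL_API_KEY }}
          # demo for PRs, full for pushes to main
          TEST_MODE: ${{ github.event_name == 'pull_request' && 'demo' || 'full' }}
        run: ./scripts/run-preclinical-tests.sh "${{ secrets.PRECLINICAL_AGENT_ID }}" 80 "$TEST_MODE"
```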

Store Secrets Securely

Never commit API keys. Use your CI provider’s secret management.

Set Appropriate Thresholds

Start with 80% pass rate and adjust based on your agent’s maturity level.
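Note that pass_rate can be fractional (e.g. 87.5), which is why the examples above compare with bc rather than the integer-only `[ -lt ]`. If bc isn't available in your CI image, awk works too; a small helper (the function name is illustrative):

```shell
# Decimal-safe threshold check without bc: exits 0 when rate >= threshold.
meets_threshold() {
  rate=$1
  threshold=$2
  awk -v r="$rate" -v t="$threshold" 'BEGIN { exit (r < t) ? 1 : 0 }'
}

# Example: gate the job on an 80% threshold.
if meets_threshold "87.5" "80"; then
  echo "PASS"
else
  echo "FAIL"
fi
```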

Use Webhooks for Long Runs

For full test runs (100+ scenarios), use webhooks instead of polling to avoid timeouts.

Environment Variables

Variable               Description
PRECLINICAL_API_KEY    Your API key (store as a secret)
PRECLINICAL_AGENT_ID   UUID of the agent to test

Quality Gates

Example threshold configurations:
Environment   Test Mode   Threshold   Rationale
PR checks     demo        75%         Quick validation
Staging       demo        85%         Pre-production gate
Production    full        90%         High safety standard

Troubleshooting

Tests Timing Out

Increase the timeout in your CI configuration or use webhooks for notification instead of polling.
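In GitHub Actions, the per-job ceiling is set with timeout-minutes (the default is 360); an explicit lower value also makes stuck runs fail fast:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    timeout-minutes: 30   # fail the job if tests take longer than 30 minutes
    steps:
      # ... test steps as above ...
```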

Rate Limiting

If you see 429 errors, add delays between API calls or reduce concurrent test runs.
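A generic retry-with-exponential-backoff wrapper is one way to absorb occasional 429s; this is a sketch, with illustrative names and delays:

```shell
# Retry a command with exponential backoff (1s, 2s, 4s, ...).
retry_with_backoff() {
  max_attempts=$1; shift
  delay=1
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

Using curl -sf (rather than -s alone) makes HTTP errors such as 429 exit nonzero, so the wrapper knows to retry, e.g. `retry_with_backoff 5 curl -sf "$BASE_URL/runs/$RUN_ID" -H "Authorization: Bearer $API_KEY"`.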

Pass Rate Fluctuations

AI agents can have variable behavior. Consider:
  • Running multiple test iterations
  • Using rolling averages
  • Setting slightly lower thresholds with manual review for borderline results
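For the rolling-average approach, one lightweight option is to append each run's pass rate to a file kept in your CI cache or as an artifact, then gate on the mean of the last few entries; rates.log and the function name here are illustrative:

```shell
# Average the last N pass rates (one number per line in the history file).
rolling_average() {
  file=$1
  window=${2:-5}
  tail -n "$window" "$file" | awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}
```

For example, `rolling_average rates.log 5` after appending the latest run's pass_rate, then compare the result against your threshold instead of the single-run value.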