llm-reliability

Here are 13 public repositories matching this topic...

Eatosin / Structura

Turn Chaos Into Structure. A Type-Safe AI Agent that extracts valid JSON from unstructured data using PydanticAI, FastHTML, and Gemini 2.5.

data-extraction fasthtml ai-agents unstructured-data type-safe-builder gemini-ai pydanticai self-healing-ai llm-reliability json-extraction

Updated Jan 10, 2026
Python

North-Shore-AI / crucible_examples

Interactive Phoenix LiveView demonstrations of the Crucible Framework, showcasing ensemble voting, request hedging, statistical analysis, and more with mock LLMs.

Updated Apr 23, 2026
Elixir

TianbaoZhang001 / OpenCAAF

Reference implementation of CAAF — three-pillar agent framework with monotonic convergence.

constraint-satisfaction industrial-ai agent-framework llm-agents agentic-ai llm-reliability deterministic-ai

Updated Apr 28, 2026
Python

lokesh75-kank / agenteval

Reliability and audit-evidence testing for LLM agents: wrap any agent, assert behavior, measure determinism, check grounding, emit an audit-grade report.

typescript mcp llm ai-testing llm-eval rag-evaluation agent-evaluation llm-reliability

Updated Jun 21, 2026
TypeScript

sarmishra / CHARM-agentic-rag

Official implementation of CHARM: Cascading Hallucination Aware Resolution and Mitigation for multi-step agentic RAG pipelines.

ai-safety nli rag cascade-detection langchain hallucination-detection agentic-ai llm-reliability multi-hop-qa

Updated May 21, 2026
Python

assafkip / ai-reliability-recon

Map where your bolted-on AI feature breaks before customers do. A free Claude Code tool: fragility map, reliability score across six dimensions, ranked gaps, and a 30-day plan. Built by a threat-intel practitioner.

hallucination rag prompt-engineering ai-testing llmops ai-quality llm-evaluation claude-code llm-reliability ai-evals ai-reliability claude-code-plugin

Updated Jun 15, 2026

North-Shore-AI / crucible_framework

CrucibleFramework: A scientific platform for LLM reliability research on the BEAM

documentation machine-learning elixir otp research ai reproducible-research beam reliability ai-research ensemble-methods statistical-testing research-framework experiment-framework llm llm-testing llm-reliability nshkr-crucible

Updated Apr 4, 2026
Elixir

hsieh89t-cloud / legal-agent-reliability-benchmark

Reliability and hallucination mitigation research for tool-augmented legal AI agents using QC-Sentinel verification architecture.

benchmark ai-safety ai-agents legal-ai openai-api prompt-engineering llm-reliability

Updated Mar 6, 2026

elyngved / failmodes-taxonomy

Collection of LLM failure modes used on failmodes.com.

ai llms llm-security llm-evaluation llm-reliability ai-reliability llm-failure-modes llm-observation

Updated Jun 14, 2026
JavaScript

MukundaKatta / lightweight-agent-eval-paper

Public artifact bundle for the preprint 'Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents'.

benchmarking ai-agents preprint agent-evaluation llm-reliability research-artifacts

Updated Jun 13, 2026
Python

shashidharReddy866 / llm-evaluation-system

Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.

nlp json-schema nextjs model-evaluation hono structured-output few-shot-learning ai-evaluation prompt-engineering anthropic llm-evaluation hallucination-detection llm-reliability eval-harness prompt-comparison

Updated May 1, 2026
TypeScript

jpoindexter / self-insight-agent-skills

AI agent metacognition skills distilled from David Dunning's Self-Insight (2005): calibration, Dunning–Kruger awareness, the outside view, and feedback discipline for Claude Code, Codex, and compatible AI agents.

Updated Jun 13, 2026
Shell

MukundaKatta / lightweight-eval-scorecards-paper

Preprint paper package — Lightweight Evaluation and Operational Scorecards for Tool-Using AI Agents (Zenodo DOI 10.5281/zenodo.20034550)

research open-science ai-agents preprint tool-use agent-evaluation llm-reliability workflow-evaluation artifact-paper operational-scorecards

Updated Jun 13, 2026
Python
