add-evaluation-tool

Tech Stack

Frontend

Dart + Flutter Web: Single codebase for web UI with fast iteration and robust widget system; built‑in Material 3 support aligns with modern UX needs.
http (Dart package): Lightweight HTTP client for straightforward REST calls to the backend.
Nginx (static hosting & SPA routing): Serves compiled Flutter assets efficiently and handles client‑side routing with try_files; also proxies API requests to the backend.

Backend

Python 3.11: Modern, stable runtime with wide ecosystem and team familiarity.
FastAPI: High‑performance async web framework with type hints, automatic OpenAPI generation, and clear developer experience.
Uvicorn (ASGI server): Production‑grade async server well‑suited for FastAPI.
Pydantic v2: Typed data validation and serialization for request/response models.

API & Contract

OpenAPI (auto‑generated by FastAPI): Up‑to‑date interactive API docs at /docs with minimal extra work.
JSON over HTTP: Simple, widely supported format for client‑server communication.

Database & Storage

PostgreSQL for reliability, transactions, and broad tooling support.

Containerization & Orchestration

Docker: Reproducible environments for both backend and frontend.
Docker Compose: Simple local orchestration for multi‑service setup (frontend + backend).

Testing & Quality

Backend: pytest: Widely used, expressive testing framework for Python.
Backend: ruff + mypy: Fast linting and consistent formatting (ruff) plus static type checking (mypy) to maintain code quality.
Frontend: flutter_test + flutter_lints: Built‑in testing and lint rules to keep UI code reliable and idiomatic.

Testing

We standardized on pytest with the pytest-cov plugin for backend unit testing. Pytest already underpins our existing tests, offers first-class fixtures and monkeypatching for FastAPI services, and integrates seamlessly with Poetry. The pytest-cov plugin adds coverage reporting and lets us gate on coverage thresholds for the packages we select without bolting on another runner, so a single, fast command validates both correctness and coverage.
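
A minimal sketch of the fixture and monkeypatch pattern mentioned above. DiagramParser and count_components are hypothetical stand-ins invented for this example, not names from the codebase.

```python
import pytest


class DiagramParser:
    """Hypothetical stand-in for a slow or external parsing dependency."""

    def parse(self, source: str) -> list[str]:
        raise RuntimeError("real parser is not available in unit tests")


def count_components(source: str) -> int:
    # Code under test: delegates to the parser, then counts the result.
    return len(DiagramParser().parse(source))


@pytest.fixture
def fake_parser(monkeypatch: pytest.MonkeyPatch) -> None:
    # Replace the parse method for the duration of one test only;
    # pytest restores the original automatically afterwards.
    monkeypatch.setattr(
        DiagramParser, "parse", lambda self, source: source.split()
    )


def test_count_components(fake_parser: None) -> None:
    assert count_components("Frontend Backend Database") == 3
```

A run such as `pytest --cov=app --cov-fail-under=80` would then report coverage and fail below the threshold in the same command; the package name and the 80% figure are placeholders.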

Static analysis

Ruff (Python linter + formatter): One tool enforces the Python style guide (ruff format --check) and runs lint rules in a single, ultra-fast pass, so contributors don’t have to juggle black/isort/flake8 separately.
Mypy (Python type checker): Catches interface mismatches and optional/None bugs in our dynamically typed backend before runtime, which is critical for service stability.
Dart formatter: dart format --output=none --set-exit-if-changed keeps the Flutter widget tree readable and guarantees identical whitespace regardless of the contributor’s IDE.
Dart analyzer: dart analyze --fatal-infos --fatal-warnings fails the check on any analyzer error, warning, or info-level finding, so type errors, null-safety violations, and lint findings are caught before a build is produced.
markdownlint-cli2: Enforces consistent Markdown headings, tables, and fenced blocks so architecture docs stay reviewable even as they scale.
Lychee link checker: Automated dead-link detection prevents stale references in docs/ and README, which is vital for contributors onboarding through documentation first.

Documentation & Diagramming

Markdown (docs/): Plain‑text documentation that’s easy to review and version.
PlantUML: Standardized component/spec diagrams aligned with the project’s architecture focus.
Mermaid: Lightweight sequence/context diagrams embedded in Markdown for quick iterations.

Security & Networking

Nginx reverse proxy: Fronts the web app and proxies /api to the backend.
CORS configuration: Explicit origins in production to restrict browser access.
HTTPS termination (future): TLS will be handled by the reverse proxy or the platform provider in production.

Analytics

Instrumentation Tools

OpenTelemetry SDK (Python): Manual instrumentation for business metrics (diagram uploads, parsing operations, evaluation completions). Chosen because it provides fine-grained control over what we measure, allows custom attributes for business context, and integrates seamlessly with our existing Python codebase. We instrument key user actions like diagram uploads, parsing success/failure, and evaluation cycle completions to track our North Star metric.
Grafana Cloud (Prometheus): Metrics storage and querying backend. Chosen for its 15-month retention, PromQL query language, and tight integration with Grafana dashboards. Stores both HTTP metrics from Beyla and custom business metrics from OpenTelemetry SDK.
Grafana: Visualization and dashboard platform. Chosen for its powerful query builder, flexible dashboard creation, alerting capabilities, and unified interface for metrics, traces, and logs. Enables us to create business-focused dashboards tracking evaluation cycle completion rates and user engagement metrics.

Why These Tools

We chose OpenTelemetry SDK for analytics because it allows us to track business-specific events (like evaluation completions) that aren’t captured by generic HTTP metrics. The SDK’s counter and histogram instruments let us measure user workflows end-to-end, which is essential for understanding if users are finding value in the tool. Grafana Cloud provides the storage and querying infrastructure we need without managing our own Prometheus instance, and Grafana’s dashboards make it easy for stakeholders to understand business metrics at a glance.

Observability

Instrumentation Tools

Grafana Beyla: eBPF-based auto-instrumentation for HTTP metrics and traces. Chosen because it requires zero code changes, automatically captures all HTTP traffic, and adds little overhead. Because Beyla observes traffic at the kernel level, endpoints are covered even when nobody remembers to instrument them by hand. It’s language-agnostic and works with our FastAPI backend without modification.
OpenTelemetry SDK (Python): Manual instrumentation for custom metrics (parsing duration histograms, error tracking) and distributed tracing. Chosen because it provides detailed control over what we instrument, supports custom attributes for debugging context, and allows us to track business logic that happens outside HTTP requests (like PlantUML parsing operations). The SDK integrates with our existing Python logging and provides structured telemetry data.
Grafana Cloud: Unified observability platform providing Prometheus (metrics), Tempo (traces), and Loki (logs). Chosen because it eliminates the operational overhead of running our own observability stack, provides long retention periods (15 months for metrics, 30 days for logs), and offers a single interface for all three pillars of observability. Exporting over OTLP keeps our telemetry vendor-agnostic.
Grafana: Visualization, alerting, and dashboard platform. Chosen for its PromQL query support, flexible dashboard creation, built-in alerting rules, and ability to correlate metrics, traces, and logs in a single interface. Enables us to create SLO dashboards, set up alerting based on service level objectives, and investigate issues by jumping from metrics to traces to logs.

Why These Tools

We chose Grafana Beyla as our primary instrumentation tool because it provides automatic, zero-code observability for all HTTP traffic. This means we get comprehensive metrics and traces without modifying application code, reducing the risk of missing instrumentation and ensuring consistent coverage. The eBPF-based approach has minimal performance overhead and works regardless of the application framework.

OpenTelemetry SDK complements Beyla by allowing us to instrument business logic that doesn’t involve HTTP requests (like PlantUML parsing) and to add custom attributes that provide business context. The SDK’s histogram instruments are essential for tracking parsing performance against our SLO (p95 < 3 seconds).
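
To make the p95 < 3 seconds target concrete, the sketch below checks a list of raw parsing durations against it. In Grafana this would normally be a PromQL query over histogram buckets (for example with histogram_quantile) rather than application code; the Python version only illustrates the arithmetic, and the function names are hypothetical.

```python
import statistics

SLO_P95_SECONDS = 3.0  # threshold quoted in the text above


def p95(durations: list[float]) -> float:
    """Return the 95th percentile of the given parsing durations."""
    if len(durations) < 2:
        raise ValueError("need at least two samples to estimate a percentile")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(durations, n=100)[94]


def meets_slo(durations: list[float]) -> bool:
    # The SLO holds when 95% of parses finish faster than the threshold.
    return p95(durations) < SLO_P95_SECONDS
```

With 100 samples of one second each the check passes; if ten of them take ten seconds, the 95th percentile lands in the slow tail and the check fails.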

Grafana Cloud was chosen over self-hosted solutions because it eliminates operational complexity while providing enterprise-grade features like long retention, high availability, and automatic scaling. The unified platform means we don’t need to manage separate systems for metrics, traces, and logs.

Grafana provides the visualization layer that makes our observability data actionable. Its PromQL support allows us to create complex queries for SLO monitoring, and its alerting system integrates with our notification channels (PagerDuty, Slack, Email) to ensure we respond quickly to issues.

CI/CD (Planned)

GitHub Actions: Automated lint/build/test for backend and frontend; container image builds for repeatable deployments.
