01 Intended purpose and capability
The system is a structured voice-based competency assessment for governance, risk, and compliance (GRC) professionals. It asks the candidate twenty-five questions across seven GRC domains (Governance & Strategy, Risk Management, Compliance & Regulatory, Information Security, Audit & Assurance, Business Continuity, AI Governance), transcribes the spoken answers, grades each against a rubric, aggregates to an overall score, and produces a tier band (Expert / Proficient / Developing / Foundation) plus a written summary of strengths and areas for improvement.
The tier is intended as a candidate-controlled signal — shareable on LinkedIn, useful as a self-assessment baseline. It is not a credential, a pre-employment screen, or a hiring decision. Any employer-driven use must respect the human-oversight protocol documented in the companion page.
02 Architecture
The pipeline runs in three stages:
- Live conversation: a LiveKit room hosts a voice agent (LiveKit Agents 1.5 framework). The agent runs speech-to-text (Deepgram Nova-3 via LiveKit Inference), an LLM (OpenAI GPT-5.3 via LiveKit Inference), and text-to-speech (Cartesia Sonic-3 via LiveKit Inference). The LLM is constrained by a system prompt that includes the question bank, persona framing (it adapts to candidate seniority), and structured turn management. Raw audio is not persisted past session close.
- Persistence: at session end, the agent writes the full transcript and a per-question response array to Postgres (Supabase). The run's status flips to completed.
- Scoring: the web app's /api/score-assessment route handler is called by the candidate's browser when the status flips. For each answered question it sends the question text, the golden-answer rubric, and the candidate's answer to Google Gemini 2.5 Pro (via Vertex AI) with a JSON-mode prompt; the model returns a 0–10 score and one sentence of judge feedback. A second Gemini call synthesises the per-question scorecard into three strengths, three improvements, and a summary. Aggregations (overall mean, per-domain mean, tier band) happen in app code.
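The aggregation step can be illustrated with a short Python sketch. It is not the production route handler: the function names and row fields are hypothetical, and the tier thresholds in particular are placeholders, since the actual cut-offs between bands are not stated on this page.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean

# ASSUMPTION: illustrative cut-offs on the 0-10 scale, not the real tier boundaries.
TIER_THRESHOLDS = [(8.5, "Expert"), (7.0, "Proficient"), (5.0, "Developing")]


def tier_for(score: float) -> str:
    """Map an overall mean score to a tier band using the assumed thresholds."""
    for minimum, tier in TIER_THRESHOLDS:
        if score >= minimum:
            return tier
    return "Foundation"


def aggregate(scored: list[dict]) -> dict:
    """Compute the overall mean, per-domain means, and tier from per-question scores."""
    if not scored:
        raise ValueError("no scored questions to aggregate")
    by_domain = defaultdict(list)
    for row in scored:
        by_domain[row["domain"]].append(row["score"])
    overall = mean(row["score"] for row in scored)
    return {
        "overall": round(overall, 2),
        "per_domain": {d: round(mean(s), 2) for d, s in by_domain.items()},
        "tier": tier_for(overall),
    }


result = aggregate([
    {"domain": "Risk Management", "score": 8.0},
    {"domain": "Risk Management", "score": 6.0},
    {"domain": "AI Governance", "score": 9.0},
])
```

Keeping the arithmetic in plain app code rather than asking the model for it means the same per-question scores always produce the same tier.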
03 Models in use
- Deepgram Nova-3 (speech-to-text). General-purpose ASR, English. Provider: Deepgram, via LiveKit Inference.
- OpenAI GPT-5.3 chat-latest (live examiner LLM). Constrained by a structured system prompt; no fine-tuning on user data. Provider: OpenAI, via LiveKit Inference.
- Cartesia Sonic-3 (text-to-speech). Voice ID 228fca29-3a0a-435c-8728-5cb483251068 ("Kiefer," male). Provider: Cartesia, via LiveKit Inference.
- Google Gemini 2.5 Pro (LLM-judge scorer + synthesis). Used in JSON-mode against a fixed rubric. Provider: Google Cloud Vertex AI Express.
- OpenAI text-embedding-3-small (rubric embedding index). 1536-dim vectors over the golden-answer corpus. Used only for offline retrieval experiments in development; not on the critical scoring path today.
- Google Gemini 2.5 Flash (career-pathway generator and consultant chat). Provider: Google Cloud Vertex AI Express.
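Because the judge model is used in JSON mode, each reply presumably needs validating before it is stored. The sketch below shows one way to do that in Python. The field names `score` and `feedback`, the `JudgeResult` type, and the `parse_judge_reply` function are assumptions made for illustration; the actual response schema is not documented here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    score: float
    feedback: str


def parse_judge_reply(raw: str) -> JudgeResult:
    """Parse and range-check one judge reply; raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("judge reply must be a JSON object")
    score = payload.get("score")        # hypothetical field name
    feedback = payload.get("feedback")  # hypothetical field name
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 10:
        raise ValueError("score must be within 0-10")
    if not isinstance(feedback, str) or not feedback.strip():
        raise ValueError("feedback must be a non-empty string")
    return JudgeResult(score=float(score), feedback=feedback.strip())
```

Rejecting a malformed reply outright, rather than coercing it, leaves the row unscored so that the failure stays visible instead of silently skewing the means.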
04 Training data posture
ConnectGRC does not train any foundation model. The hosted models listed above are pre-trained by their respective providers; we use them via inference-only API contracts that explicitly prohibit the providers from training on our prompts or our users' data.
The rubric corpus (the golden-answer texts in the questions table) was authored by ConnectGRC subject-matter staff. It is reviewed quarterly and on any material framework update (ISO 27001 revision, new EU AI Act guidance, NIST AI RMF update).
05 Performance benchmarks and validation
Production scoring is monitored continuously via the admin RAG analytics dashboard, which surfaces:
- Tier distribution across all completed runs.
- Per-domain mean score — a domain that drifts low or high is usually a rubric problem, not a candidate-population problem; the drift is the alarm.
- Unscored-but-completed row count (pipeline-failure proxy).
- Index coverage (% of approved questions with up-to-date embeddings).
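Three of the signals listed above (tier distribution, per-domain mean, and the unscored-but-completed count) reduce to simple aggregations over run records. The Python sketch below illustrates them together with a basic drift check. The record fields (`status`, `tier`, `per_domain`), the baseline, and the one-point tolerance are all hypothetical and do not describe the dashboard's actual queries.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from statistics import mean


def dashboard_metrics(runs: list[dict]) -> dict:
    """Summarise completed runs: tier distribution, per-domain means, unscored count."""
    completed = [r for r in runs if r["status"] == "completed"]
    scored = [r for r in completed if r.get("tier") is not None]
    domain_scores = defaultdict(list)
    for run in scored:
        for domain, score in run["per_domain"].items():
            domain_scores[domain].append(score)
    return {
        "tier_distribution": dict(Counter(r["tier"] for r in scored)),
        "per_domain_mean": {d: mean(s) for d, s in domain_scores.items()},
        # Completed rows that never received a score point at a pipeline failure.
        "unscored_completed": len(completed) - len(scored),
    }


def drifting_domains(per_domain_mean: dict, baseline: dict, tolerance: float = 1.0) -> list:
    """Return domains whose mean moved more than `tolerance` points from a baseline."""
    return sorted(
        d for d, m in per_domain_mean.items()
        if d in baseline and abs(m - baseline[d]) > tolerance
    )
```

A drift check of this kind only flags a domain for review; deciding whether the rubric or the candidate population changed still needs a human look.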
A formal offline benchmark — held-out human-graded answer set, inter-rater agreement, calibration curve — will be published here within 90 days of public launch and re-run quarterly.
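The agreement statistic for that benchmark is not specified here. As one illustration, Cohen's kappa is a common chance-corrected measure of agreement between two raters, and it could be applied to human and judge-model labels once scores are banded into categories. The Python sketch below assumes that setup; the banding and the choice of kappa are assumptions, not the published methodology.

```python
from __future__ import annotations

from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical banded labels for four answers.
human = ["high", "high", "low", "low"]
judge = ["high", "low", "low", "low"]
kappa = cohens_kappa(human, judge)
```

Unweighted kappa treats every disagreement as equally severe. For ordered bands, a weighted variant that penalises near-misses less may be the better fit.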
06 Known limitations
The current system has known limitations:
- Voice-recognition bias: Deepgram Nova-3 transcription accuracy varies by accent. We do not yet have an internal accuracy-by-accent benchmark; this is on the bias-audit roadmap.
- Language: currently English-only. Non-English speakers cannot meaningfully use the voice flow; we offer a typed-input fallback for accessibility.
- Knowledge cutoff: the assessor LLM's training data has a knowledge cutoff; answers about regulatory changes after that cutoff may not be scored accurately.
- Single-shot scoring: the judge model scores each answer independently, with no cross-question context. A candidate who paraphrases a great answer to question 1 inside their answer to question 5 will not be credited for it at question 1.
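No accuracy-by-accent benchmark exists yet, so the following is only a sketch of how one could be computed. It uses word error rate (WER), the standard ASR metric, pooled per accent group. The sample fields (`accent`, `reference`, `transcript`) and the group labels are hypothetical, and a real benchmark would also need text normalisation and a representative, consented sample.

```python
from __future__ import annotations

from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing from hypothesis
                current[j - 1] + 1,      # insertion: extra word in hypothesis
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1], len(ref)


def wer_by_group(samples: list[dict]) -> dict:
    """Pool errors per accent group and return WER = errors / reference words."""
    errors, words = defaultdict(int), defaultdict(int)
    for s in samples:
        e, n = word_errors(s["reference"], s["transcript"])
        errors[s["accent"]] += e
        words[s["accent"]] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Pooling errors and reference words per group before dividing keeps short utterances from dominating the group figure.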
07 Logging
Every assessment run is logged with: input transcript, model versions used, scoring outputs, token consumption, wall-clock duration, and timestamps. Logs are retained for one year (security-log retention, see Privacy Policy). Aggregated logs feed the admin RAG dashboard; raw logs are not exposed outside the admin role.
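The logged fields and the retention window can be pictured as a single record type. The Python sketch below is illustrative only: the `AssessmentLog` class, its field names, and the 365-day reading of "one year" are assumptions, not the production schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

# ASSUMPTION: "one year" is read as 365 days for this illustration.
RETENTION = timedelta(days=365)


@dataclass
class AssessmentLog:
    run_id: str
    transcript: str
    model_versions: dict    # e.g. {"judge": "...", "stt": "..."}
    scoring_outputs: list   # per-question scores and feedback
    tokens_used: int
    duration_seconds: float
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def is_expired(self, now: Optional[datetime] = None) -> bool:
        """True once the record is older than the retention window."""
        now = now or datetime.now(timezone.utc)
        return now - self.created_at > RETENTION


log = AssessmentLog(
    run_id="run-001",
    transcript="...",
    model_versions={"judge": "example-version"},
    scoring_outputs=[],
    tokens_used=1200,
    duration_seconds=842.5,
    created_at=datetime(2026, 5, 1, tzinfo=timezone.utc),
)
```

Storing the model versions with each run makes it possible to tell later which scores came from which judge version.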
08 Change log
- v1.0.0 (May 2026) — initial publication. Pipeline stack as documented above; 25-question bank seeded across the seven GRC domains.