Overview
I am a prospective domestic PhD candidate in Australia. This page outlines the broad research areas I want to explore.
Trust in Emerging AI: a normative model, an evaluation framework, and human-centred benchmarks for LLMs
Summary
Large language models are often delivered through chat experiences that can feel strikingly human. This believability may change how people rely on and judge systems, and it may be that traditional computer trust ideas themselves no longer suffice. This project proposes to (1) develop and test a reframed, cross-disciplinary view of trust for GPT-style LLMs; (2) design an Evaluation Framework for Trust in AI that examines where benchmarking helps and where it does not; and (3) prototype crowd-supported, human-centred benchmarking to explore whether reliability, risk coverage and decision usefulness can be improved.
Problem and motivation
LLMs can speak fluently and confidently, sometimes sounding authoritative when they are incorrect. Existing evaluations tend to emphasise technical metrics, may under-specify stakeholder needs and domains, and may only partially detect conversational failure modes such as confident hallucinations. There is a need to test clearer concepts tailored to human-like dialogue, a coherent evaluation framework, and benchmarks that incorporate human judgement.
Aims, research questions and hypotheses
- Aim 1 — Normative foundation for LLM trust
What should trust mean for LLMs across domains and cultures, given human-like dialogue, learning mechanisms such as token prediction and reinforcement learning from human feedback, and careful parallels with neural processing? - Aim 2 — Evaluation framework with benchmarking
What belongs in an Evaluation Framework for Trust in AI, and where does benchmarking fit in value, limits and reporting, considering stakeholder groups, stages of trust and domains? - Aim 3 — Human-centred improvement to benchmarking
Can crowd-supported approaches strengthen reliability and risk coverage beyond current benchmarks, especially for conversational failure modes?
Hypotheses to be tested
- H1 Coverage gap: Without a GPT-era trust model, current LLM evaluations may miss important trust dimensions, especially those that arise in human-like dialogue.
- H2 Human-supported improvement: Adding crowd-supported human interactions to benchmarking improves reliability and detection of risks such as hallucinations, prompt injection and privacy leakage compared with technical benchmarks alone.
Existing conceptual trust frameworks to build on
Trust theory (individual → organisational → institutional)
- Mayer–Davis–Schoorman (1995) — Ability, Benevolence, Integrity (ABI).
- Lewis & Weigert (1985) — cognitive vs emotional trust.
- McKnight & Chervany (2001) — trusting beliefs, intentions, behaviours.
- Lewicki & Bunker (1995) — developmental stages of trust.
- Hardin (2002) — encapsulated interest (mutual-benefit trust).
- Bryk & Schneider (2002) — relational trust in organisations.
- Luhmann (1979) — systems/institutional trust.
Trust evaluation and risk frameworks (risk and regulation focus)
- NIST AI Risk Management Framework (AI RMF 1.0). & ISO/IEC 23894 (AI risk management).
- EU AI Act (risk-based obligations).
- Australia: Zero-trust security principles (10-pillar style guidance for defensible architecture).
- Emerging regulations in the state of California
Crowdsourced benchmarks and evaluation approaches
- Benchmarks/platforms: BIG-bench, LMSYS Chatbot Arena, Dynabench, OpenAI Evals (community).
- Curation models: Wikipedia-style collaborative editing; GitHub community datasets/leaderboards.
- Community challenges: Kaggle-style data and evaluation tasks.
- Ambient feedback sources: forum/Q&A communities (e.g., Stack Overflow) and structured issue reporting.
Proposed approach
- Phase 1 — Normative view of trust for chat LLMs
Compare trust definitions across domains; examine historical tech shifts; relate model mechanisms to trust-relevant properties; use careful neuro analogies; consider identity and provenance in chat products. Outputs: proposed normative trust model tailored to LLMs in chat. - Phase 2 — Benchmarking within an Evaluation Framework for Trust in AI
Set scope for trust claims across stakeholders, stages and domains; scan standards and LLM evaluation suites; map methods to coverage and gaps; position benchmarking’s role and consider where human involvement is most needed. Outputs: A summary gap view, and roposed Evaluation Framework for Trust in AI. - Phase 3 — Crowdsourced support for benchmarking trust in LLMs
Adapt practices from Wikipedia-like platforms for governance, moderation and potentially gamifying contributions; build a focused crowd dataset and pipeline that targets conversational risks; evaluate LLMs and report reliability and any incremental benefit over baselines. Outputs: pilot crowd-maintained benchmark pipeline, dataset with reliability statistics and change logs, and a brief stakeholder guide.
Intended contribution / deliverables
- Conceptual: A normative trust model that separates believability from trustworthiness and links mechanisms to trust-relevant properties.
- Methodological: An evaluation framework that positions benchmarking among complementary methods and sets pragmatic reporting standards.
- Practical: A human-centred benchmarking prototype that, if successful, could improve reliability and risk coverage for real decision contexts.
About me and my fit for this research
Preparation and capability
Across roles in consulting, startups and government, I’ve framed complex problems, searched widely for evidence, synthesised large bodies of information, and communicated insights for decision-makers.
Research and synthesis
- Researched and authored Open Data 101 (www.opendata101.com), reviewing 700+ academic and practitioner sources and interviewing domain experts.
- Scoped the field, identified patterns across diverse materials, and produced a structured, accessible narrative.
Technical foundation
- IT degree and early software development roles; designed, tested and iterated technology for real-world use.
- Built automation and dashboard solutions linked to executive decision contexts.
- Helped organisations adopt technology frameworks (COBIT Manual; Data Management Body of Knowledge).
Governance, risk and community leadership
- Analysed how organisations manage uncertainty and accountability across program management, audit and data governance; applied governance/risk frameworks and recommended oversight improvements.
- Advisor, Wikimedia Foundation Audit Committee (2016–2025).
- National Board Member, GovHack Australia (2018).
- Speaker at multiple conferences on the benefits of open data and public-interest technology.
Education and certifications
- Project Management: PMP and PRINCE2.
- MBA, Melbourne Business School.
- Certified Data Management Professional (CDMP); Certified Information Systems Auditor (CISA).
- Neural Networks and Deep Learning, DeepLearning.AI.