Generative AI and assessment of AE

18 November 2025

Educational project

Higher education increasingly educates students to take up complex challenges. For this, students need to develop adaptive expertise (AE): the ability to excel when facing changing tasks and methods. Current assessment methods for AE consist mostly of self-report measurement scales, which are unsuitable because respondents tend to give socially desirable answers. Here, we created and tested scenarios as an alternative assessment approach, which we named the Peradex (performance-based adaptive expertise). These scenarios present real-world, open-ended problems for students to solve. Using GPT-4, we created the scenarios and presented them to a representative sample of 540 students and professionals. We assessed the quality of their answers using local large language models (LLMs; Llama 3.1 and Phi-4). Our results show that scenarios generated with GPT-4 can be a valid and reliable way to assess performance-based adaptive expertise.

Background information

Higher education has the responsibility to educate students and professionals to take up grand societal challenges (GSCs; Bayou et al., 2020), such as climate change or threats to global health. GSCs are complex and uncertain in terms of both processes and outcomes. To help solve GSCs, students and professionals need to develop adaptive expertise (AE): the ability to perform at a high level when facing changing job tasks and work methods. People with adaptive expertise are resilient, innovative, and creative problem-solvers, which allows them to deal with the complexity and uncertainty of GSCs.

However, educators struggle to measure AE in an unbiased manner. Various questionnaires exist that measure adaptive expertise, but these tools are self-assessments, which can introduce bias. Thus, there is currently no unbiased assessment tool for AE. A promising direction lies in designing scenarios with GenAI. Here, the team developed a fair and valid assessment tool that is applicable to students and professionals alike.

Project description

The team used GPT-4 to develop scenarios for eight different educational domains. This generator can be used to develop new scenarios, but the recommendation is that teachers perform a manual check to assess whether the scenarios suit the target population.
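As an illustration, a scenario generator along these lines might look like the minimal sketch below, assuming the OpenAI Python SDK; the prompt wording, the example domains, and the helper name `generate_scenario` are our own illustrative choices, not the team's actual implementation.

```python
# Sketch of a GPT-4 scenario generator (assumed setup: the OpenAI Python SDK
# with OPENAI_API_KEY set in the environment).
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Write a realistic, open-ended problem scenario in the domain of {domain}. "
    "The scenario should force the solver to adapt familiar methods to an "
    "unfamiliar situation, and it should have no single correct answer."
)

def generate_scenario(domain: str) -> str:
    """Ask GPT-4 for one draft assessment scenario in the given domain."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(domain=domain)}],
    )
    return response.choices[0].message.content

# Drafts are raw material only: teachers still check manually whether each
# scenario suits the target population, as recommended above.
for domain in ["earth sciences", "public health"]:
    print(generate_scenario(domain))
```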

The team assessed the quality of the responses from the 540 students and professionals by first removing all nonsense responses. Next, they used LLMs to rank the quality of the answers per task: all answers were compared pairwise, and the LLM selected the better answer of each pair. Winning these pairwise comparisons more often was taken as an indicator of higher performance-based adaptive expertise. Finally, they checked whether these scores were related to self-assessed adaptive expertise from the questionnaires.
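A minimal sketch of this pairwise-ranking step is shown below, assuming the `ollama` Python client with a locally served Llama 3.1 model; the judge prompt, the reply parsing, and the win-count scoring are illustrative assumptions, not the team's exact pipeline.

```python
# Sketch of LLM-based pairwise ranking of answers (assumed setup: the
# `ollama` Python client talking to a local Llama 3.1 model).
from itertools import combinations

import ollama

JUDGE_PROMPT = (
    "Task: {task}\n\nAnswer A: {a}\n\nAnswer B: {b}\n\n"
    "Which answer solves the task better? Reply with exactly one letter: A or B."
)

def pick_better(task: str, a: str, b: str) -> str:
    """Ask the local LLM to judge one pair of answers; returns 'A' or 'B'."""
    reply = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, a=a, b=b)}],
    )
    verdict = reply["message"]["content"].strip().upper()
    return "A" if verdict.startswith("A") else "B"

def rank_answers(task: str, answers: dict[str, str]) -> dict[str, int]:
    """Count pairwise wins per respondent; more wins indicate higher
    performance-based adaptive expertise on this task."""
    wins = {respondent: 0 for respondent in answers}
    for id_a, id_b in combinations(answers, 2):
        winner = id_a if pick_better(task, answers[id_a], answers[id_b]) == "A" else id_b
        wins[winner] += 1
    return wins
```

In practice each pair would be judged in both orders and the verdicts aggregated, because LLM judges are sensitive to the position of the answers; this is one source of the consistency problems described below.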

The team ran a series of reliability analyses and estimated regression models to determine whether the Peradex is a valid approach for measuring performance-based AE. The results show that this assessment approach, namely generating scenarios to test AE, is reliable and valid at an acceptable level. They found that the further the content of a task lies from a student's own area of expertise, the lower the performance-based AE. In addition, respondents with higher self-assessed AE also showed higher performance-based AE. Surprisingly, they found no influence of task complexity (i.e., the difficulty level of the task) on performance-based AE. Furthermore, there is room for improvement regarding human–AI agreement in assessing the answers and the complexity of the tasks. The biggest challenge was getting the LLM to produce consistent replies when assessing the quality of respondents' answers. The LLM also struggled with responses of very poor quality.
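As an illustration of such a validity check, the sketch below regresses pairwise win counts on self-assessed AE and task complexity with statsmodels; the column names and the data are invented for illustration and do not come from the project.

```python
# Sketch of the validity check: does self-assessed AE predict
# performance-based AE (pairwise wins), controlling for task complexity?
# All data and column names below are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "pairwise_wins": [12, 7, 15, 3, 9, 11],              # performance-based AE score
    "self_assessed_ae": [4.1, 3.2, 4.5, 2.8, 3.9, 4.0],  # questionnaire mean
    "task_complexity": [1, 2, 3, 1, 2, 3],               # difficulty level of the task
})

model = smf.ols("pairwise_wins ~ self_assessed_ae + task_complexity", data=df).fit()
print(model.summary())  # inspect coefficients and fit
```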

Results

The team is developing a website, a paper, and a book chapter to share their results with the wider community. They are currently exploring opportunities to implement the Peradex in education. For this, they are primarily looking at embedding it in transdisciplinary education. The team looks forward to collaborating with others who wish to use the Peradex, and they are also eager to hear from colleagues who have applied it in practice.

Lessons learned

  • GenAI is a promising tool to help generate scenarios for the assessment of AE.
  • GenAI may also help in assessing the answers that students and professionals give to these scenarios.
  • Be careful with the outputs of LLMs; use manual checks!
  • Getting an LLM to do exactly what you want is quite labour-intensive and takes a lot of time.
  • Check whether you are allowed to use LLMs for assessing skills and developing scenarios.
  • Use LLMs for formative assessment for now, not for summative assessment, because human oversight is needed.
  • Make sure that students who complete the Peradex items are sufficiently motivated and take the task seriously. Low-quality answers lead to a biased assessment of AE.
  • Ensure that the tasks are sufficiently related to the students’ background knowledge so that they can meaningfully apply their own knowledge to the task.


Central AI policy

All AI-related activities on this page must be implemented in line with Utrecht University’s central AI policy and ethical code.
Responsibility for appropriate tool choice, data protection, transparency, and assessment use remains with the instructor.


You are free to share and adapt this material, provided you give appropriate credit and use it non-commercially. More on Creative Commons
