Understand your Large Language Model (LLM) with a Psychology Experiment
For the course Cognition and Emotion, an educational intervention was devised. The course is designed for students with a computational background: it introduces psychology concepts to help computer science students incorporate human cognition into interface design.
In this year’s course, students were asked to systematically evaluate the possibilities and limitations of generative language models (i.e., GenAI models) by comparing them to the limits of human cognition. They did this by adapting existing experiments designed to test human cognition and emotion, and running the adapted experiments on the LLMs.
In this way, the students deepened their knowledge of the psychological principles behind the experimental set-ups and learned to apply these set-ups in a new type of research. In addition, they learned how to probe and interpret the black box of GenAI models, and to identify and use the similarities and differences between GenAI and the human cognitive apparatus.
Background information
The starting point of the original assignment was to introduce a new way of applying what the students had learned in the course, i.e., to use a psychology experiment outside its intended context. This is one of the main aims of the course.
The reason why we decided to expand the assignment with a GenAI experiment is the following: AI has a long history closely connected with human cognition; one goal of generative AI is to create an artificial mind that reaches the same level of cognitive functions and abilities as a human mind and body, and AI is also used to understand the human mind. More recently, with the introduction of large language models (LLMs) and similar architectures based on advanced deep neural networks, AI models have reached such complexity that their way of reasoning has become very hard to understand. Explainable AI (XAI) makes AI systems more transparent and helps us understand how these models make decisions. There are several XAI approaches in the literature. One of them is to tap into psychological approaches originally devised to understand human cognition. In other words, experiments devised to assess human behaviour are adapted to explain an AI model’s behaviour.
GenAI models are already assessed for their performance in relation to human performance, for example, whether GenAI-generated text is deemed more creative or persuasive than human-generated text. Browsing this literature, in combination with the XAI literature on using psychological experiment set-ups as an explanation method for black-box AI architectures, we decided both to lecture our students on this literature and to motivate them to learn further and apply their knowledge by testing LLMs in the same manner as they would test human cognition. We shared several well-designed experiments in this area with our students for analysis and discussion. Using an LLM within this context brought challenges:
- The students had to design strategies to interact with an LLM in a consistent manner,
- The application of a psychology experiment and the evaluation of its results created a different way of assessing/interpreting the performance of an LLM, and created a space for students to discuss the ‘humanness’ of the generative chatbots.
The students were required to reflect on the relevance of psychological constructs and experiments to LLMs, and to consider how these experiments could be adapted to align with the specific capabilities and tendencies of LLMs.
Project description
We purposefully let students choose a model themselves, or choose more than one model and compare their performance. We wanted the students to make informed choices, taking the models’ (advertised/reported) strengths and weaknesses into account as part of their experiment set-up. For example, GPT models are reported to be very supportive and to show high empathy; not including such a model in an affect assessment test would raise questions.
There were plenty of current models that we advised students to use: DeepSeek, (Google) Gemini, OpenAI GPT mini or GPT-4o, Llama 4, Claude, Perplexity, etc. Note that the market changes rapidly, and it is important to check the latest models when guiding the students.
We also gave the students a list of psychological experiments that would be a good fit (for example a personality test, a creativity test, or an intelligence test). By using a well-known psychological experiment set-up such as an IQ test, the students had the chance to compare their results to human performance. They were also expected to test the following points:
- How/if the model’s answers changed over time; this checks the reliability of the model.
- The impact of prompting variations: in order to run a psychology test with an LLM (keeping the IQ test as our example), the students were expected to develop a mapping strategy. Will they simply ask the LLM the IQ test questions and note/score the answers? Will they instruct the LLM to act like a human while answering the same questions? Will they tell the LLM that it is taking an IQ test?
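A simple harness covering both points above (repeated trials for reliability, and several prompt framings) can be sketched in a few lines. This is a minimal, hypothetical sketch: `query_llm` is a stand-in for a real chat-API call, and the question, framings, and repeat count are illustrative only.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for a real chat-API call (e.g. via an LLM provider's
# SDK); here it just returns a random canned answer so the sketch runs.
def query_llm(prompt: str) -> str:
    return random.choice(["A", "B"])

QUESTION = "Which figure completes the pattern: A or B?"  # one IQ-style item

# Three framings of the same item, mirroring the prompting variations above.
FRAMINGS = {
    "plain": QUESTION,
    "act_human": "Answer as a human test-taker would. " + QUESTION,
    "told_iq": "You are taking an IQ test. " + QUESTION,
}

N_REPEATS = 10  # repeated trials per framing, to check reliability

answers = defaultdict(list)
for framing, prompt in FRAMINGS.items():
    for _ in range(N_REPEATS):
        answers[framing].append(query_llm(prompt))

# A simple reliability measure: the share of trials giving the modal answer.
for framing, replies in answers.items():
    modal_share = max(replies.count(a) for a in set(replies)) / len(replies)
    print(f"{framing}: consistency = {modal_share:.2f}")
```

Students can swap the stub for a real API call (ideally logging the model name and sampling settings), and then compare consistency across framings.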
To prepare the students, we made room in a lecture to explain how AI models work and where they can make biased decisions. We gave them an example scenario of engaging with ChatGPT to test whether they could detect such biases in a short chat. We also prepared guidelines on what questions they needed to raise in their assignment (see Appendix A).
We also made room for the students to ask questions and to bring their ideas for discussion. This was done by reserving a lecture hour for it, as well as by letting them ask questions via Teams over the course of a week. We discussed their ideas with them in one-to-one sessions.
Note that our students are also taught how to set up human-computer interaction experiments (i.e. user studies) and psychology experiments. They knew how to build a hypothesis and how to do a robust statistical analysis.
The assignment was designed as a take-home study. Submission was done via Blackboard. The students were asked to design and run an experiment, collect the data, analyze the results, and write a 2-page report about it. We did not ask the students to submit the data, but the analysis needed to be detailed in the report. Some students submitted their data anyway, and their reports usually received higher grades, as they were also more detailed in the experiment set-up and analysis.
Outcomes
To understand students’ reactions and performance, we did a quick comparison with the previous year’s version of the assignment and saw that this year’s students used statistical testing in more instances, and did so more thoroughly. There was also more variability in the chosen topics (i.e. models) and the psychological experiments used. We specifically looked at the chosen experiment topics and the statistical testing carried out, as these two are the most important aspects for the general aims of the course. Note that this is a first impression and a thorough check still needs to be done.
The questions ‘what are the limits of your own experiment set-up?’ and ‘what are the limits of LLMs when tested with a psychology experiment?’ created in-depth engagement and answers, triggered good discussion, and opened a space to reflect. For example, using creativity assessment tests on LLMs, some students found the LLMs scoring higher than humans. These results were not accepted as is, but rather prompted critical assessment of their experiment adaptation and suggestions on how to further fine-tune the experiment.
As such, the learning goals of the assignment were successfully met. We think the intervention was a success since the students were more creative, i.e. explored a wider variety of psychological experiments in comparison to the previous year. This was an unexpected outcome.
What is next
Next year, we will run the same assignment, but will adjust for some of the sub-goals:
We asked the students to assess the ethical side of LLM use, which created a bit of confusion. Ethics is important to discuss, but we decided that within the limits of the current assignment there is not enough room to prepare the students for this sub-goal and to expect an in-depth treatment of the topic within the focus of the assignment. Thus, we will leave this sub-goal out.
The format of the assignment was a take-home report, and there were a few reports where GenAI use in writing the report was obvious. Next year, we need to find a way to have students work on a similar exercise during a WC or an exam setting.
We will also ask them to prepare their data in a FAIR way, and submit their data along with the report. FAIR (Findable, Accessible, Interoperable, Reusable) is a data preparation/publication standard, and it will enable us to check their resulting data with ease.
To understand the students’ take on working with GenAI, we will also run a short survey and ask specifically whether, and in what ways, they found the assignment engaging.
Lessons learned & tips
- What worked well:
- The students were motivated, and many used statistical tools such as power analysis to decide how many answers they’d need to collect from an LLM.
- The experiment prompted in-depth reflection on LLMs’ limitations and differences compared to human cognition.
- What did not work as expected: the guidelines themselves were not clear enough for the students. We made space during one lecture for them to experiment with LLMs, and they were encouraged to ask specific questions about the set-up they had in mind, but not all students took this opportunity. Giving them one clear, simple student example would be very helpful.
- Our perspective on assessment: To avoid GenAI use in the reports, we have to find a better way to structure the assignment and the way of preparing/delivering it.
- Our perspective on teaching: It is a fun exercise to share with your students, and it definitely motivates them to apply what they have learned in the course – they become genuinely curious about the results as they want to learn more about how GenAI compares to human cognition.
It is good for the CHI (Computer-Human Interaction) students to think about the difference between human cognition and LLMs’ ways of working, as they need to take these differences into account when they design systems in which AI and humans collaborate. But we also believe these differences are important to discuss with all students, and with society at large. Current AI systems show human traits that make them more believable, even when they share false information. Learning about the fallacies of AI in comparison to the human mind is a good reminder that AI also has shortcomings.
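The power analysis that students used to decide how many answers to collect from an LLM can be done without specialised software. Below is a minimal sketch, assuming a two-sample t-test design and the usual normal approximation for the sample-size formula; the chosen effect size, alpha, and power are the conventional illustrative defaults, not values from our course.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Sample size per group for a two-sample t-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# To detect a medium effect (Cohen's d = 0.5) with 80% power at alpha = .05,
# roughly 63 answers per condition are needed.
print(n_per_group(0.5))  # → 63
```

In practice this tells a student, before querying the model, how many repeated answers each prompting condition needs to make a comparison meaningful.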
How you can adapt our exercise to a different topic:
- This exercise can be applied to any topic, to test the reliability of an LLM, or to compare the performance of LLMs.
- For example, to see if an LLM can be used for advice in a legal case study, you can create a set of prompts and assess the LLM’s results (how much it deviates for the same prompt, how much the prompt structure affects the results, etc.).
- Training your students on prompt engineering, as well as on how to statistically assess differences in answers, is crucial. The experiment set-up uses statistical tests to compare human and LLM performance, and good knowledge of statistics is crucial in order to assess the experiment’s results correctly. The students should also be able to do a power analysis to strengthen their experiment set-up.
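As a minimal illustration of such a statistical comparison, the sketch below computes Welch’s t statistic and Cohen’s d for two entirely hypothetical sets of creativity scores collected under two prompt framings; the numbers are made up for the example, not results from our students.

```python
import math
from statistics import mean, variance

# Hypothetical creativity scores from two framings of the same test.
scores_plain = [12, 14, 13, 15, 14]  # LLM asked the questions directly
scores_human = [10, 11, 12, 10, 11]  # LLM told to answer "as a human"

m1, m2 = mean(scores_plain), mean(scores_human)
v1, v2 = variance(scores_plain), variance(scores_human)  # sample variances
n1, n2 = len(scores_plain), len(scores_human)

# Welch's t statistic: does the framing shift the mean score?
t_stat = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Cohen's d (pooled SD): how large is the shift in standard-deviation units?
cohens_d = (m1 - m2) / math.sqrt((v1 + v2) / 2)

print(f"t = {t_stat:.2f}, d = {cohens_d:.2f}")
```

Students would then look up (or compute) the p-value for the t statistic against their chosen alpha, and report the effect size alongside it.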