A.I. Grading COPILOT

Mixed-Methods Case Study

 GENERATIVE    STRATEGIC    QUALITATIVE    QUANTITATIVE 

Generative foundational research to understand what problems A.I. could solve in university science courses and what constraints exist on the design space.

The outcome was a roadmap for an A.I. grading "copilot" that both supports TA grading efficiency and provides students with individualized feedback.

 

Student frustrated by the unclear criteria that the A.I. is scoring her work against

"I am an extremely hard-working, motivated student...[BUT] I am still being given dreadful grades...by a COMPUTER."

University student

 

Challenge

Dane setting out to tackle the big challenge of A.I. in university science education

This gave me my most important, unique business challenge to date:

Can I translate user insights into roadmap priorities by understanding:

Objective

Project Outline

This is an ongoing research project that started in Summer 2023 and will culminate in a proof-of-concept implementation of A.I. in the Labflow product.

📏 Scope

📦 Deliverables

👥 Roles

Mixed-Methods UX Researcher

 RESEARCH QUESTION 

How do people's conceptions of ML/A.I. shape their expectations of, and desired behaviors toward, an A.I.-powered grading system?

Research Objectives

Research Methods & Findings

1. Stakeholder Interviews

Participant explains in her own words how ML/A.I. work, displaying elements of the ELIZA effect

"[Artificial Intelligence] is a way over time to teach a computer how to answer questions for you and learn [what you need]."

 — Carmen, internal stakeholder

Synthesizing across all of the internal and external stakeholder interviews led to insights on which areas of the Labflow product A.I. could improve.

🗒️
Admin tasks

Repetitive operations like setting dates and course initialization could be made easier with A.I.

✍️
Content authoring

Report-grading code, PDFs, and graphics with alt text could be generated by an LLM to provide a useful starting point for the content team

👨‍🎓
Student assessment

Student work could be graded by A.I. to speed up the feedback process

👩‍🏫
Grader coaching

TAs often get little guidance on how to be good graders. A.I. could scale up the ability to provide good professional development to graders

📊
Enhanced analytics

Instructors have access to course analytics in Data Insights, but they have to seek them out. A.I. could make finding insights in data more automated

2. Analysis of User Experience in Rival Product

Faculty member who adopted market rival product explains what issues he thought A.I. could help him solve

"We recruit undergraduate TAs to run first year chem labs. Their content knowledge is not the same as senior graduate students, but I don't have enough time to train them properly.

I still want to be able to ask rigorous questions and know responses will get graded correctly for all students."

 — Chemistry Professor & Coordinator of Undergraduate Laboratories

Students who used the rival product were also given a questionnaire about their experience, with both fixed-choice and open-ended questions.

Dane a bit skeptical that A.I. grading is even a good approach, given how strongly students detested it

To put it bluntly, students had strong feelings about A.I. grading.

Some felt it was a double standard ("Why can you use it but I can't?"). Others felt it cheapened the value of their university degree ("I'm paying to be taught by people!").

A.I. grading felt like it might be a risky path.

Q: For graded work within [competitor], do you think you were fairly graded?

YES: 43% · NO: 47% · UNDECIDED: 10%

Qualitative analysis of student responses for reasons why A.I. did not grade them fairly

Underlying reasons why A.I. feedback was helpful or not
(Positive = reason it helped; negative = reason it didn't help)

One "Aha!" moment from this data: A.I. was not universally negative for students. Specific aspects of A.I. drove both negative and positive sentiment.

Negative:

Positive:
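As a rough sketch of how coded open-ended responses can be tallied into positive and negative sentiment drivers like those above (the codes and counts here are hypothetical illustrations, not the study's actual data):

```python
from collections import Counter

# Hypothetical qualitative codes applied to open-ended responses;
# the sign marks whether the code was a reason A.I. feedback helped (+1) or not (-1).
coded_responses = [
    ("instant feedback", +1),
    ("unclear criteria", -1),
    ("unclear criteria", -1),
    ("consistent grading", +1),
    ("no human judgment", -1),
]

# Sum sentiment per code to find the net drivers.
tally = Counter()
for code, sentiment in coded_responses:
    tally[code] += sentiment

# List drivers from most negative to most positive.
for code, score in sorted(tally.items(), key=lambda kv: kv[1]):
    print(f"{code}: {score:+d}")
```

A net-negative score flags a reason A.I. grading didn't help; a net-positive score flags a reason it did.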

Research Impact

Example of A.I. concepts ranked against a business prioritization rubric

The prioritization activity yielded some clear winners for an A.I. proof of concept in Labflow. 

The insights I distilled from interview and questionnaire data pointed to an A.I. assistant that helps the TA grade, rather than replacing them, as having the highest value.
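A prioritization rubric like the one pictured can be sketched as a simple weighted score. The criteria, weights, and ratings below are illustrative assumptions, not the actual business rubric:

```python
# Hypothetical criteria and weights for ranking A.I. concepts.
WEIGHTS = {"user_value": 0.4, "feasibility": 0.3, "business_impact": 0.3}

# Illustrative 1-5 ratings for a few of the candidate concepts.
concepts = {
    "A.I. grading assistant": {"user_value": 5, "feasibility": 4, "business_impact": 5},
    "Admin task automation": {"user_value": 3, "feasibility": 5, "business_impact": 2},
    "Enhanced analytics": {"user_value": 4, "feasibility": 3, "business_impact": 3},
}

def score(ratings):
    # Weighted sum of each criterion's rating.
    return sum(WEIGHTS[criterion] * value for criterion, value in ratings.items())

ranked = sorted(concepts, key=lambda name: score(concepts[name]), reverse=True)
print(ranked)  # concepts ordered from highest to lowest priority
```

Under these assumed weights, the grading-assistant concept comes out on top, mirroring the outcome of the actual prioritization activity.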

For now, the implementation specifics are under wraps, but we're excited to share something soon!

A.I. grading assistant

Dane getting excited to build some A.I. goodness

The next step is to roll up our sleeves in someone's garage and build this thing!

Stay tuned...