Paper: Working with AI: Measuring the Occupational Implications of Generative AI


Today I was going over a paper by a Microsoft Research team on how AI is impacting professional work. The paper was published in July 2025. The authors analyzed 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot to understand how generative AI affects different occupations and work activities.

They separated the analysis into two distinct perspectives:

  • User Goals: What people are trying to accomplish with AI assistance
  • AI Actions: What work activities the AI actually performs

They used the O*NET database’s 332 Intermediate Work Activities (IWAs) as the basis of their classification. One of the surprising findings of the paper is that in 40% of conversations, user goals and AI actions were completely different – the AI often acts as a coach or advisor rather than directly performing the user’s task.

They also list the occupations with the highest AI applicability, such as translators, sales representatives, customer service representatives, and writers.

As per their study, AI currently augments human work rather than fully automating it. Most occupations have some AI applicability, but none are fully automated. They also mention that the impact is uneven – some work activities are highly affected, others barely at all. Even successful AI assistance typically covers only a moderate portion of an occupation's work activities.

I found the paper interesting and relevant because it also shares the detailed prompts they used for the analysis.

They used a two stage pipeline approach:

  • Stage 1: Generate summaries and variations
  • Stage 2: Binary classification against specific categories

Below is the generation prompt they used:

<|Instruction|>
# Task overview
You will be given a conversation between a User and an AI chatbot.
You have two primary goals:
(1) Summarize the main goal that the user is trying to accomplish in the style of an O*NET Intermediate Work Activity (IWA).
(2) Summarize the action that the bot is performing in the conversation in the style of an O*NET IWA.

For example, if the user asks for help with a computer issue and the bot provides suggestions to resolve the issue, the user's IWA is "Resolve computer problems" and the bot's IWA is "Advise others on the design or use of technologies."

Sometimes, the user intent and bot action may be the same.
For instance, if the user asks the bot to spellcheck a research paper and the bot corrects a few misspelled words, the user's IWA is "Edit written materials or documents" and the bot's IWA is also "Edit written materials or documents."

For both the user and bot IWA summaries, you will generate several variations of the summary to capture the same intent using different wordings.
To aid your analysis, you will also summarize the conversation.
Finally, you will also determine whether the User is a student trying to do homework.

# Task details
Your task is to fill out the following fields:

summary: Summarize User's queries in 3 sentences or fewer in **English**.

user_iwa: Summarize the task the user is trying to accomplish in the style of an O*NET IWA. Ensure that the summary accurately describes the goal of the User as directly evidenced in the conversation. Ensure that the summary matches the level of generality of an O*NET IWA: it should be general enough to be an activity performed in a large number of occupations across multiple job families, but specific enough to capture the essence of the User's goal. Provide exactly one succinct IWA-style summary.

user_iwa_variations: Generate 4 variations of the user IWA summary that capture the same intent using different wordings.

bot_iwa: Summarize the task that the bot is performing in the style of an O*NET IWA. Ensure that the summary matches the level of generality of an O*NET IWA: it should be general enough to be an activity performed in a large number of occupations across multiple job families, but specific enough to capture the essence of the bot's actions. Provide exactly one succinct IWA-style summary.

bot_iwa_variations: Generate 4 variations of the bot IWA summary that capture the same action using different wordings.

is_homework_explanation: Determine whether the User is a student trying to do homework. This may be obvious if they have pasted in assignment instructions, or it may be clear from the type of question they are asking. Explain in one sentence.

is_homework: Based on your explanation, provide the label 0 (not homework) or 1 (homework).

# Hints
Provide your answers in **English** using the given structured output format.
<|end Instruction|>

<|Conversation between User and AI|>
{convo}
<|end Conversation|>
<|end of prompt|>

At first glance, this looks like a standard extraction task: structured extraction from the user-AI conversation. But what they are actually doing is creatively transforming the conversation content into a completely different format and vocabulary. For example, they convert “Can you help me fix my printer?” into O*NET-style work activity language: *”Resolve equipment malfunctions and operational issues”*. They use the LLM to:

  • Interpret the underlying work activity concept
  • Translate it into O*NET’s specific professional taxonomy
  • Generate variations using different professional terminology

They also used temperature 1 instead of 0 for generation. I think they did this because they need diverse, creative variations, like those shown below.

user_iwa: "Resolve computer problems"
user_iwa_variations: [
    "Troubleshoot technical equipment issues",
    "Diagnose and repair system malfunctions", 
    "Address hardware and software problems",
    "Fix technological operational failures"
]
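
To make the structured-output setup concrete, here is a minimal sketch of what such a stage-1 call could look like with the OpenAI Python SDK and a Pydantic schema. This is my own illustration based on the paper's description, not the authors' code; the schema name, the GENERATION_PROMPT placeholder, and the model choice are assumptions.

# Sketch only: the shape of a stage-1 generation call with structured outputs.
# Field names mirror the prompt above; everything else is an assumption.
from openai import OpenAI
from pydantic import BaseModel

GENERATION_PROMPT = "..."  # the full instruction block shown above, with a {convo} placeholder

class ConversationIWAs(BaseModel):   # hypothetical schema name
    summary: str
    user_iwa: str
    user_iwa_variations: list[str]
    bot_iwa: str
    bot_iwa_variations: list[str]
    is_homework_explanation: str
    is_homework: int

client = OpenAI()

def extract_iwas(convo: str) -> ConversationIWAs:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(convo=convo)}],
        response_format=ConversationIWAs,  # structured output instead of free-form text
        temperature=1,                     # temperature 1 to get diverse IWA variations
    )
    return completion.choices[0].message.parsed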

They have also shared the prompts for classification and completion, which you can take a look at.

There are also other prompting best practices that you can learn from their prompts:

  • They broke down the complex task into a multi-stage pipeline. This reduces ambiguity and makes each step more reliable.
  • They added detailed task definitions, background context, and examples in their prompt. This helps improve task accuracy.
  • They ask the model to first provide an explanation and then the classification. You don’t have to ask for long, detailed reasoning; a short rationale is good enough. What matters is that the LLM generates the reasoning before the classification, as in the schema below.
  # Schema for the stage-2 bot IWA analysis; shown here as a Pydantic model
  # so it can be used with OpenAI structured outputs.
  from enum import Enum
  from pydantic import BaseModel

  class IWAAutomationLevel(Enum):
      NONE = "none"
      MINIMAL = "minimal"
      LIMITED = "limited"
      MODERATE = "moderate"
      SIGNIFICANT = "significant"

  class BotIWAAnalysis(BaseModel):
      iwa: str
      iwa_explanation: str
      is_match_explanation: str   # rationale comes first...
      is_match: bool              # ...then the label
      automation_level_explanation: str
      automation_level: IWAAutomationLevel

As shown above, they list automation_level_explanation before automation_level.

  • They used OpenAI structured outputs instead of free-form output.
  • Prompts included explicit instructions for ambiguous cases
  • They found GPT-4o could handle 20 classifications per prompt without degrading accuracy, but more led to worse performance (see the sketch after this list).
  • They used both GPT-4o and GPT-4o-mini: GPT-4o for the complex, ambiguous IWA classification, and GPT-4o-mini for the simpler task completion assessment.
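
Putting a few of these practices together, below is a rough sketch of what the stage-2 matching step might look like: a batch of up to 20 candidate IWAs scored against the bot's action summary in a single structured-output call, with each explanation field placed before its label. This is my reconstruction under assumptions (the prompt wording, schema, and function names are mine), not the paper's actual code.

# Sketch only: batched binary matching of candidate IWAs against a bot action summary.
from openai import OpenAI
from pydantic import BaseModel

class IWAMatch(BaseModel):
    iwa: str
    is_match_explanation: str   # short rationale generated first...
    is_match: bool              # ...then the binary label

class IWAMatchBatch(BaseModel):
    matches: list[IWAMatch]

client = OpenAI()

def classify_batch(bot_summary: str, candidate_iwas: list[str]) -> list[IWAMatch]:
    # The paper reports accuracy holding up to about 20 candidates per prompt.
    assert len(candidate_iwas) <= 20
    prompt = (
        "For each candidate O*NET IWA below, briefly explain whether it matches the AI "
        "assistant's action, then give a true/false is_match label.\n\n"
        f"AI action summary: {bot_summary}\n\nCandidates:\n"
        + "\n".join(f"- {iwa}" for iwa in candidate_iwas)
    )
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",      # the heavier model for the ambiguous IWA matching step
        messages=[{"role": "user", "content": prompt}],
        response_format=IWAMatchBatch,
        temperature=0,       # deterministic labels for classification
    )
    return completion.choices[0].message.parsed.matches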

One thing that they have not covered is the latency of these structured output tasks. I was helping a customer recently who was running LLM-based evaluation on customer service agent conversations. Their structured object had 6 properties – three explanation-classification pairs. The complete task was taking between 4-6 seconds. To reduce the latency, we broke it down into three separate classification tasks and ran them in parallel. This reduced the latency at the cost of spending more input tokens. Another optimization we made was to constrain the length of the explanation text.
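
The parallel fan-out is straightforward with the async OpenAI client. The sketch below illustrates the idea; the rubrics and schema are invented for illustration and are not the customer's actual setup.

# Sketch only: three short explanation+label judgements fanned out concurrently
# instead of one structured call with six fields. Rubrics and schema are illustrative.
import asyncio
from openai import AsyncOpenAI
from pydantic import BaseModel

class LabeledJudgement(BaseModel):
    explanation: str   # kept deliberately short to cut output tokens and latency
    label: str

client = AsyncOpenAI()

async def judge(conversation: str, rubric: str) -> LabeledJudgement:
    completion = await client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{rubric}\n\nConversation:\n{conversation}"}],
        response_format=LabeledJudgement,
        temperature=0,
    )
    return completion.choices[0].message.parsed

async def evaluate(conversation: str) -> list[LabeledJudgement]:
    rubrics = [
        "Classify the agent's tone (polite / neutral / rude). Give a one-sentence explanation first.",
        "Classify whether the customer's issue was resolved (yes / no). Give a one-sentence explanation first.",
        "Classify policy compliance (compliant / violation). Give a one-sentence explanation first.",
    ]
    # The three classifications run concurrently, so latency is roughly that of the slowest single call.
    return await asyncio.gather(*(judge(conversation, r) for r in rubrics))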

Overall I enjoyed reading this paper. The paper provides optimistic but realistic evidence about AI’s workforce impact. While AI demonstrates broad applicability across occupations, it typically augments rather than replaces human work.

