Paper: Working with AI: Measuring the Occupational Implications of Generative AI

Today I was going over a paper by the Microsoft Research team on how AI is affecting professional work. The paper was published in July 2025. The authors analyzed 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot to understand how generative AI impacts different occupations and work activities.

They separated the analysis into two distinct perspectives:

  • User Goals: What people are trying to accomplish with AI assistance
  • AI Actions: What work activities the AI actually performs

They used the O*NET database’s 332 Intermediate Work Activities (IWAs) as the basis of their classification. One of the surprising findings of the paper is that in 40% of conversations, user goals and AI actions were completely different: the AI often acts as a coach or advisor rather than directly performing the user’s task.
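To make the goal-versus-action comparison concrete, here is a toy Python sketch (my own illustration, not the paper’s actual pipeline) that measures how often a conversation’s user-goal IWAs and AI-action IWAs do not overlap at all:

```python
# Toy sketch: fraction of conversations whose goal IWAs and action IWAs
# share no labels. The records and IWA names below are hypothetical.

def divergence_rate(conversations):
    """Return the fraction of conversations where goal and action IWAs are disjoint."""
    diverged = 0
    for conv in conversations:
        goal_iwas = set(conv["goal_iwas"])      # IWAs classified from the user's request
        action_iwas = set(conv["action_iwas"])  # IWAs classified from the AI's responses
        if goal_iwas and action_iwas and not (goal_iwas & action_iwas):
            diverged += 1
    return diverged / len(conversations)

# Example: the user wants code fixed (goal), but the AI teaches/advises (action).
sample = [
    {"goal_iwas": ["debug software"], "action_iwas": ["teach others"]},
    {"goal_iwas": ["write material"], "action_iwas": ["write material"]},
]
print(divergence_rate(sample))  # 0.5
```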

They also list the occupations with the highest AI applicability, such as translators, sales representatives, customer service representatives, and writers.

According to their study, AI currently augments human work rather than fully automating it. Most occupations have some AI applicability, but none are fully automated. They also note that the impact is uneven: some work activities are highly affected, others not at all. Even successful AI assistance typically covers only a moderate portion of an occupation’s work activities.

Continue reading “Paper: Working with AI: Measuring the Occupational Implications of Generative AI”

Paper: Expect the Unexpected: FailSafe Long Context QA for Finance

I am always looking for practical, real-world papers that can help in my work. I provide AI and LLM-related consultancy to multiple clients, most of whom are in the financial domain. One of the first things I do in a consulting engagement is create test datasets that help me baseline and improve AI systems. This usually requires spending time with business/domain folks and reading and analyzing a lot of data. Today, I stumbled upon a paper, Expect the Unexpected: FailSafe Long Context QA for Finance, that details how the authors created a realistic dataset specific to the financial domain. For each record in the dataset, they introduced query and context perturbations and evaluated how different models perform, benchmarking both reasoning and non-reasoning models (a small sketch of such perturbations follows the list below). The paper covers two main aspects:

  • Testing how well LLMs handle real-world variations in queries and document quality when processing financial information
  • Focusing on long-context scenarios (like 10-K reports) where accuracy is crucial
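To give a flavor of what such perturbations can look like, here is a hedged Python sketch; the helper names and perturbation choices are my own illustration, not the paper’s exact taxonomy or code:

```python
import random

def misspell(query: str, rate: float = 0.1) -> str:
    """Simulate typos by randomly swapping adjacent letters in the query."""
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def truncate_context(context: str, keep: float = 0.8) -> str:
    """Mimic degraded context (OCR loss, cut-off filings) by dropping the tail."""
    return context[: int(len(context) * keep)]

query = "What was the company's net revenue in fiscal year 2023?"
print(misspell(query))  # e.g. "What was the compnay's net reveneu in fiscal year 2023?"
```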
Continue reading “Paper: Expect the Unexpected: FailSafe Long Context QA for Finance”

Paper: Using Large Language Models to Promote Health Equity

I enjoyed reading this paper from the NEJM AI journal. This short (3-page) paper takes a positive view of large language models (LLMs) and discusses three potential LLM use cases that can make healthcare more equitable. As the paper notes, 85% of articles and papers on the equity-related impacts of LLMs focus on the harm LLMs can cause; only 15% focus on equity opportunities.

NEJM AI is a new journal on medical artificial intelligence and machine learning from the NEJM Group. The New England Journal of Medicine (NEJM), published by NEJM Group, is one of the oldest and most prestigious medical journals in the world.

In this paper, the authors discuss three LLM use cases:

  1. LLMs can improve the detection of bias.
  2. LLMs can create structured datasets relevant to health equity.
  3. LLMs can improve access to health information.

I won’t discuss the third use case because everyone understands the importance of improving access to health information. However, people often underestimate the effort required to build a useful and reliable AI assistant.

Having spent the last two years working with LLMs, I’ve frequently encountered discussions about their inherent biases from training data. This paper presents an innovative counter-perspective: using LLMs to detect human biases in healthcare settings. While traditional methods relied on simple keyword searches, LLMs can analyze clinical notes to identify subtle linguistic patterns, sentiment, and stereotypes that may indicate bias in patient care. For example, research has shown that doctors more frequently describe Black patients as ‘difficult’ compared to white patients. LLMs can help systematically identify these kinds of biased language patterns by treating medical texts as ‘artifacts’ of a biased healthcare ecosystem.

Similarly, if LLMs can be used to promote propaganda, they can also be powerful tools for detecting it. Their deep understanding of language patterns can be turned from a potential liability into an asset for identifying harmful biases.
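As a concrete illustration of the first use case, here is a minimal Python sketch of prompting an LLM to flag potentially stigmatizing language in a clinical note. The prompt wording and model name are my assumptions; the paper does not prescribe a specific implementation:

```python
# Sketch: LLM-based bias detection in clinical notes (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are auditing clinical notes for potentially stigmatizing or biased "
    "language (e.g., describing a patient as 'difficult' or 'non-compliant'). "
    "List each flagged phrase and briefly explain why it may signal bias.\n\n"
    "Note:\n{note}"
)

def flag_bias(note: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
    )
    return response.choices[0].message.content

print(flag_bias("Patient is a difficult historian and was non-compliant with meds."))
```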

The second use case, creating structured datasets, is already happening. Features like structured output have made it much easier to extract data from unstructured documents. However, I still see people writing custom scripts and tools to do this. I believe that, just as the coding experience in chat applications improved with features like Canvas and Artifacts, structured text extraction will also get better: the workflow will become more UI-driven rather than relying on Python code execution with a code interpreter.
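For reference, here is what schema-driven extraction can look like with the OpenAI Python SDK’s structured-output parsing; this is a sketch, and the schema fields are invented for illustration:

```python
# Sketch: extracting equity-relevant fields from an unstructured note
# via structured outputs. Field names are hypothetical.
from openai import OpenAI
from pydantic import BaseModel

class EncounterRecord(BaseModel):
    patient_language: str
    insurance_status: str
    transportation_barrier: bool

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Extract the fields from this note: <note text here>"}],
    response_format=EncounterRecord,
)
record = completion.choices[0].message.parsed  # a typed EncounterRecord instance
print(record)
```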

RouteLLM Paper

Paper Link: https://arxiv.org/pdf/2406.18665v2

Paper Title: RouteLLM: Learning to Route LLMs with Preference Data

With the growing capabilities of large language models (LLMs), efficiently utilizing them becomes crucial. LLM routing emerges as a promising solution. It directs user queries to the most suitable LLM based on factors like complexity and domain. This approach aims to optimize response quality while minimizing costs. However, optimal routing presents a challenge: the router model needs to understand the query’s intent, complexity, and domain, along with the capabilities of candidate LLMs. Additionally, it should be economical, fast, and adaptable to new, improved models.
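RouteLLM learns these routing decisions from preference data; the toy heuristic below is only meant to illustrate the routing interface, with placeholder model names and "hardness" signals:

```python
# Toy router sketch: this is NOT RouteLLM's trained router, just the shape
# of the decision it makes. Model names and signals are placeholders.

STRONG_MODEL = "gpt-4o"       # expensive, more capable
WEAK_MODEL = "gpt-4o-mini"    # cheap, fast

def route(query: str) -> str:
    """Send hard-looking queries to the strong model, everything else to the weak one."""
    hard_signals = ("prove", "derive", "step by step", "legal", "medical")
    looks_hard = len(query.split()) > 50 or any(s in query.lower() for s in hard_signals)
    return STRONG_MODEL if looks_hard else WEAK_MODEL

print(route("What is the capital of France?"))               # gpt-4o-mini
print(route("Derive the closed form of this recurrence."))   # gpt-4o
```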

Continue reading “RouteLLM Paper”

Practical Takeaways from “APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets” Paper

A recent paper by the Salesforce AI research team describes a method for generating function-calling datasets for Large Language Models (LLMs). Function calling enables LLMs to interact with external systems, like remote APIs, databases, or in-process code. This equips LLMs with tools to perform specific actions, such as retrieving weather information, booking reservations, or fetching stock data from APIs.

If you’re unfamiliar with function calling, refer to the OpenAI docs to learn more.
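For a quick taste, here is a minimal function-calling sketch in the OpenAI style; get_weather is a made-up tool for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Describe the tool so the model knows when and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# The model replies with a tool call (it may also answer directly).
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))  # get_weather {'city': 'Paris'}
```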

This post explores practical takeaways for developers building LLM applications.

Continue reading “Practical Takeaways from “APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets” Paper”