ARCAS Systems
Chapter 10

Measuring AI Impact

The reality

A founder runs a 35 person professional services business and has been spending AED 14,000 (USD 3,810) per year on AI tools across the team. Asked at a quarterly review whether the AI spend has paid back, the founder answers "yes, definitely, the team is faster." Asked how many hours the team is faster by, the founder cannot answer. Asked whether the quality of output has improved, stayed the same, or declined, the founder cannot answer. Asked whether the team is actually using the tools, the founder estimates "most of them, I think." The honest answer is that the founder does not know whether the AI spend has paid back. Felt impact and measured impact live in different places. A felt impact that is positive may still be negative on the actual numbers, because the team's enthusiasm for new tools can mask the time spent fighting the tools, the quality issues that get cleaned up downstream, and the low adoption rates that mean only one or two people are getting the leverage. Measuring AI impact in 2026 separates AI as marketing from AI as operating leverage. Not optional.

Read this if

  • The business is spending more than AED 5,000 (USD 1,360) per year on AI tools without measured return
  • The founder cannot name how many hours per week the team is saving with AI
  • AI tool subscriptions are renewed automatically without a measured review
  • A team member's "AI is helping me" is the highest level of evidence the founder has on AI return
  • The quality of AI-assisted output has not been compared against the quality of pre-AI output
  • The business does not have a baseline (hours per week, accuracy rate) from before the AI deployment

What dysfunction costs

Quality cost. AI-assisted output that goes out without quality comparison may be reaching clients with subtle drops in accuracy, tone, or detail that the team has not noticed. The cost surfaces later as client complaints or rework that was not flagged as AI-related.

Strategic cost. A business that does not measure its AI deployment cannot communicate the return to potential clients, investors, or buyers. The investment becomes a sunk cost with no documented capability behind it.

What success looks like

When AI impact is measured:

  • Every deployed AI use case has a written baseline (hours per week, accuracy rate, output quality) from before the deployment
  • A defined measurement period runs for at least 90 days post-deployment, with weekly check-ins and a structured review
  • The team's adoption rate is tracked (percentage of intended users actively using the tool, frequency of use)
  • AI-assisted output quality is compared against pre-AI output quality with a defined sample
  • A quarterly AI review covers each tool: spend, measured hours saved, quality, adoption, decision (expand, hold, kill)
  • The founder can answer "did the AI spend pay back?" with a number on the table

The framework

AI impact measurement runs as four layers. Each layer captures a different dimension of return.

Layer 1: Hours saved

The most concrete AI return is hours saved. The measurement requires a baseline (how many hours did the workflow take before AI) and a post-deployment count (how many hours does it take now). The difference, multiplied by the rate of the team member doing the work, produces a value of saved time.

The baseline is the part most often skipped. A workflow that has been running with AI for six months has no useful baseline because nobody recorded it before. The discipline is to capture the baseline in the two weeks before the AI deployment goes live. Two weeks of timekeeping on the workflow gives a reasonable average. Without it, the measurement is not real.

The behaviour to adopt this week: for the next AI deployment, schedule the two week baseline capture. Time the workflow daily for two weeks before going live. Average. Hold this number.
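The arithmetic behind Layer 1 can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the function name, the five day working week, the hourly rate of AED 250, and every timing in the lists are assumptions made up for the example.

```python
from statistics import mean

def weekly_saving(baseline_daily_hours, post_daily_hours, hourly_rate, days_per_week=5):
    """Return (hours saved per week, value of that time per week)."""
    # Average daily time before and after, then take the difference.
    saved_per_day = mean(baseline_daily_hours) - mean(post_daily_hours)
    hours_per_week = saved_per_day * days_per_week
    return hours_per_week, hours_per_week * hourly_rate

# Hypothetical timings, in hours per working day on one workflow.
baseline = [2.0, 2.5, 1.5, 2.0, 2.0, 2.5, 2.0, 1.5, 2.0, 2.0]  # two weeks before go-live
post = [1.0, 1.5, 1.0, 1.5, 1.0]                                # one week after go-live

hours_saved, value = weekly_saving(baseline, post, hourly_rate=250.0)
print(f"{hours_saved:.1f} hours saved per week, worth about AED {value:,.0f}")
```

A spreadsheet does the same job. What matters is that the baseline list exists before the tool goes live.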

Layer 2: Quality

Hours saved is not enough on its own. AI may save time and produce worse output, in which case the cost appears downstream. Quality is measured by comparing a sample of AI-assisted output against a sample of pre-AI output, scored against the same rubric.

For most service business workflows, the rubric covers accuracy (did the output get the facts right), completeness (did it cover what it needed to), tone (did it match the business's voice), and revision time (how much did the senior reviewer have to change). A senior person scores 10 to 20 samples from each period.

When quality is not compared, slightly worse AI-assisted output may be going out to clients, with the team's awareness lagging behind the actual quality drift.

The behaviour to adopt this week: for the active AI use case, score 10 samples of pre-AI output and 10 samples of post-AI output. Compare. The result is the quality measurement.
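One way to run the comparison is sketched below. It assumes each rubric dimension is scored on a 1 to 5 scale where higher is better (for revision, 5 means the reviewer changed very little); the scale, the field names, and the sample scores are all hypothetical. Two samples per period are shown to keep the example short. The real comparison uses 10 to 20.

```python
from statistics import mean

RUBRIC = ("accuracy", "completeness", "tone", "revision")

def rubric_averages(samples):
    """Average each rubric dimension across a list of scored samples."""
    return {dim: mean(s[dim] for s in samples) for dim in RUBRIC}

def quality_delta(pre_samples, post_samples):
    """Post-AI average minus pre-AI average, per dimension."""
    pre = rubric_averages(pre_samples)
    post = rubric_averages(post_samples)
    return {dim: round(post[dim] - pre[dim], 2) for dim in RUBRIC}

# Hypothetical scores from the same senior reviewer.
pre_ai = [
    {"accuracy": 4, "completeness": 4, "tone": 5, "revision": 3},
    {"accuracy": 5, "completeness": 3, "tone": 4, "revision": 3},
]
post_ai = [
    {"accuracy": 4, "completeness": 5, "tone": 4, "revision": 4},
    {"accuracy": 4, "completeness": 4, "tone": 4, "revision": 4},
]

delta = quality_delta(pre_ai, post_ai)
for dim, change in delta.items():
    print(f"{dim}: {change:+.2f}")
```

A negative number on any dimension is the quality drift the layer is designed to catch, even when the hours saved look good.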

Layer 3: Adoption

A tool that two of ten intended users are actively using is not a tool deployed. Adoption is measured by counting the percentage of intended users who used the tool at least three times in the last seven days. Below 50 percent adoption, the tool is not delivering its potential return regardless of what the active users say. (An earlier draft used 30 percent as the threshold; experience across rollouts has shifted the bar to 50 percent. Below half the team using the tool weekly, the friction is real and needs to be addressed before any expansion decision.)

Adoption is the diagnostic for whether the deployment was a success or whether the team is working around the tool. Low adoption surfaces the integration question (was the training adequate, is the workflow fit right, is there fear or resistance) that needs to be addressed before the tool is killed.

The behaviour to adopt this week: pull the usage data on each deployed AI tool. Count active users. Compare against intended users. Notice which tools have low adoption.
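The adoption count can be sketched as follows, assuming the tool's usage data can be exported as a list of (user, date) events. The export format, the names, and the dates are hypothetical. The three uses in seven days rule and the 50 percent bar come from the text above.

```python
from datetime import date, timedelta

ADOPTION_THRESHOLD = 0.50  # below this, the tool is not delivering its potential return

def adoption_rate(usage_log, intended_users, today, min_uses=3, window_days=7):
    """Share of intended users with at least min_uses in the last window_days."""
    cutoff = today - timedelta(days=window_days)
    counts = {user: 0 for user in intended_users}
    for user, used_on in usage_log:
        # Count only intended users and only events inside the window.
        if user in counts and cutoff < used_on <= today:
            counts[user] += 1
    active = [user for user, n in counts.items() if n >= min_uses]
    return len(active) / len(intended_users)

# Hypothetical export from a tool's admin panel.
intended = ["amal", "ben", "chen", "dina"]
today = date(2026, 3, 31)
log = [
    ("amal", date(2026, 3, 25)), ("amal", date(2026, 3, 27)), ("amal", date(2026, 3, 30)),
    ("ben", date(2026, 3, 26)), ("ben", date(2026, 3, 31)),
    ("chen", date(2026, 3, 10)), ("chen", date(2026, 3, 28)), ("chen", date(2026, 3, 29)),
]

rate = adoption_rate(log, intended, today)
print(f"Adoption: {rate:.0%}, flagged: {rate < ADOPTION_THRESHOLD}")
```

In this made-up log only one of four intended users clears the bar, so the tool is flagged for the integration conversation before any expand or kill decision.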

Layer 4: The quarterly review

Once a quarter, every deployed AI tool gets a structured review. Spend, hours saved, quality comparison, adoption rate, decision (expand, hold, kill). The review takes 60 minutes for the senior team. The output is a documented decision per tool.

The discipline is the cadence. A review that happens once produces one set of decisions and then the system drifts back to instinct. A review every quarter keeps the AI portfolio honest. Tools that drift below threshold get caught.

The behaviour to adopt this week: schedule the first quarterly AI review. 60 minutes. Standing template: tool, spend, hours saved, quality score, adoption rate, decision.
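The standing template can live in a spreadsheet or a document. As a sketch of its structure, the record below holds one row per tool. The class name, field names, and example figures are hypothetical. The decision is entered by the senior team, not computed. The only automated check is the 50 percent adoption bar from Layer 3.

```python
from dataclasses import dataclass

DECISIONS = ("expand", "hold", "kill")
ADOPTION_THRESHOLD = 0.50

@dataclass
class ToolReview:
    tool: str
    annual_spend_aed: float
    hours_saved_per_week: float
    quality_note: str
    adoption_rate: float
    decision: str

    def __post_init__(self):
        # Force a documented decision from the agreed set.
        if self.decision not in DECISIONS:
            raise ValueError(f"decision must be one of {DECISIONS}")

    def as_row(self):
        flag = " (below threshold)" if self.adoption_rate < ADOPTION_THRESHOLD else ""
        return (f"{self.tool} | AED {self.annual_spend_aed:,.0f} | "
                f"{self.hours_saved_per_week:g} h/week | {self.quality_note} | "
                f"{self.adoption_rate:.0%}{flag} | {self.decision}")

# Hypothetical review entry.
review = ToolReview("Drafting assistant", 4000, 3, "slightly higher", 0.35, "hold")
print(review.as_row())
```

Rejecting any decision outside expand, hold, or kill keeps the review from ending in "let's see how it goes."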

A founder you might recognise

A founder runs a 28 person legal services firm. AED 11M (USD 3M) last year. Through 2025 he had deployed three AI tools (a meeting transcription service, a document drafting workflow, and a client-facing chatbot for routine queries) at a total annual cost of AED 16,000 (USD 4,355). At the end of 2025 he could not say whether the spend had paid back.

In Q1 2026 he ran the measurement reset. For each tool he reconstructed a rough baseline using project records and team interviews. He scored 10 samples of pre and post AI output for each workflow with a senior associate. He pulled usage data from each tool's admin panel. The results were uneven.

Meeting transcription: 80 percent adoption, 4 hours per week saved across the senior team, quality unchanged. Decision: expand, add training so junior team also uses it.

Document drafting: 35 percent adoption, 3 hours per week saved among adopters, quality slightly higher (the AI surfaced clauses junior associates were missing). Decision: hold, address the adoption gap with structured training.

Client-facing chatbot: 92 percent of routine queries handled, but 18 percent of those responses required follow-up correction by a human, including one case where the chatbot misstated the availability of a service and the client relied on it. Decision: kill the autonomous version, reposition as a human-reviewed draft assist.

Total spend after the changes was AED 9,000 (USD 2,450) annually. Total measured hours saved was 9 per week, or roughly 450 per year, valued at roughly AED 112,500 (USD 30,635) in senior time. The cost of the measurement reset had been roughly six hours of senior team time. The output had been a clear answer to the question that had not had one before: yes, AI was paying back. By a factor of roughly 12 to 1.
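The payback figure can be reproduced from the stated totals. Two inputs below are not given directly in the case and are inferred from its numbers: a 50 week working year (9 hours per week giving roughly 450 per year) and a senior rate of AED 250 per hour (AED 112,500 divided by 450 hours). Treat both as assumptions.

```python
hours_saved_per_week = 9        # stated total
working_weeks_per_year = 50     # assumption inferred from "roughly 450 per year"
senior_rate_aed = 250           # assumption inferred from AED 112,500 / 450 hours
annual_spend_aed = 9_000        # stated spend after the changes

annual_hours = hours_saved_per_week * working_weeks_per_year
annual_value_aed = annual_hours * senior_rate_aed
payback_ratio = annual_value_aed / annual_spend_aed

print(annual_hours, annual_value_aed, payback_ratio)  # 450 112500 12.5
```

The ratio compares the value of saved time against tool spend only. It leaves out the six hours of senior time the reset itself took.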

Working through it

  1. Capture the baseline before the next deployment. Two weeks of timekeeping on the workflow before AI goes live. The baseline is the comparison point everything hangs on.

  2. Score quality with a defined rubric. Accuracy, completeness, tone, revision time. 10 to 20 samples pre and post. The same senior reviewer scores both.

  3. Track adoption. Percentage of intended users actively using the tool at least three times in the last seven days. Below 50 percent is a flag.

  4. Run the structured review per tool. Spend, hours saved, quality, adoption, decision. The output is a documented decision in writing.

  5. Schedule the quarterly review. 60 minutes. Every deployed AI tool. The cadence is the discipline.

Common mistakes

  • Skipping the baseline. A workflow that has been running with AI for six months has no useful baseline. The two week capture before deployment is the cheap discipline that prevents the expensive uncertainty later.
  • Measuring hours saved without measuring quality. Time saved that comes with a drop in quality is not real return. Both measurements are needed.
  • Treating "the team likes it" as evidence of return. The team can like a tool that is not delivering measured return, especially in the early adoption period when the novelty masks the actual numbers.
  • Letting subscriptions auto-renew without review. Auto-renewal is the mechanism by which AI spend grows quietly. The quarterly review is what catches the drift.
  • Killing tools that have low adoption without addressing the integration question. Low adoption may be a tool problem, a training problem, or a team integration problem. Diagnose before killing.

Self-assessment

Y or N for each.

  1. Does every deployed AI use case have a written baseline from before deployment?
  2. Is quality compared between pre-AI and post-AI output with a defined rubric and sample?
  3. Is adoption tracked (percentage of intended users actively using the tool) for each deployed AI tool?
  4. Does a quarterly review cover spend, hours saved, quality, adoption, and decision per tool?
  5. Can the founder answer "did AI spend pay back this year?" with a number on the table?
  6. Has at least one AI tool been killed or expanded in the last year based on measured return?
  7. Are AI tool subscriptions reviewed before renewal, with auto-renew turned off?

Five or more "yes" answers means AI impact measurement is doing the work it is supposed to do. Three or four is the band where the structure exists in part but felt impact is still substituting for measured impact. Two or fewer means the next AI tool will be added on instinct, the existing tools will renew without review, and the founder's answer to "did it pay back?" will continue to be "yes, I think so."
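The scoring bands can be written as a small function. This is a sketch only: the function name and the short band labels ("measured", "partial", "instinct") are shorthand invented for the example, not terms from the assessment.

```python
def assessment_band(answers):
    """Map seven Y/N answers to the band described in the self-assessment."""
    if len(answers) != 7:
        raise ValueError("expected one answer for each of the seven questions")
    yes_count = sum(1 for a in answers if a.strip().upper() == "Y")
    if yes_count >= 5:
        return "measured"   # measurement is doing its work
    if yes_count >= 3:
        return "partial"    # structure exists in part, felt impact still substitutes
    return "instinct"       # tools are added and renewed without review

# Hypothetical set of answers.
print(assessment_band(["Y", "Y", "N", "Y", "Y", "Y", "N"]))
```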

Measuring AI Impact: Core Work

Working page for Measuring AI Impact.