The question I wanted to answer

I wanted to know, concretely, whether AI tools were actually making me more productive. My usual answer was a shrug and some handwaving — obviously yes, but hard to quantify — and I didn’t like that. I wanted a number.

None of the obvious candidate metrics convinced me. The AI-specific ones especially — token usage tells me what I spent, not whether anything downstream actually got better.

The reframe that worked for me

The reframe that finally clicked: stop measuring productivity, and start measuring what I actually produce. Not an aggregate number — a specific artifact over time. Whatever my role outputs most often.

For me right now, that’s tickets. For someone else it maps differently:

  • Engineer → PRs: size, description completeness, tests added.
  • Architect → design docs: depth, alternatives considered, decisions recorded.
  • On-call → incident writeups: root cause clarity, timeline, action items.
  • Product owner → tickets: background, acceptance criteria, scope.

Artifacts have properties I liked — they’re dated, versioned, searchable, role-specific. No universal metric to invent, just the one that matches the work I actually do.

My case: tickets

My title is Co-CTO at a large corp, but the work splits across several hats: solution architecture, system design for new projects, design reviews — and, the biggest slice lately, product ownership for one of our core platform teams. So my highest-volume output right now isn’t a PR, it’s a Jira ticket.

Honest confession from the engineering side: for most of my career I hated producing these artifacts by hand. Detailed tickets, proposals, architecture docs — it felt like bureaucracy, and I’d rather be building something than spend a day specifying an epic. Predictably, that produced underdefined tickets, engineers coming back with clarifying questions, rework cycles — a quiet tax on the team I was supposed to be enabling. It had to change, but it also couldn’t cost me the day.

For the last ~4 months, since I wired up the Atlassian, Grafana, and Slack MCP servers to Claude, I’ve rarely written anything non-trivial by hand. Tickets get drafted through the model, grounded with Jira context, refined in conversation, filed.

Two windows

  • Before: Sep–Oct 2025 — tickets I reported, written by hand.
  • After: Feb–Apr 2026 — tickets I reported, drafted through Claude + Atlassian MCP.

Same person, same projects (mostly one backend project on both sides). Three axes: volume, structure, semantic completeness.

The numbers

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Tickets created | 35 | 50 | +43% |
| % with meaningful description (>50 chars) | 74% | 94% | +20 pp |
| Avg description length | 321 chars / 38 words | 2,169 chars / 302 words | ~7× |
| Median description length | 223 chars | 1,500 chars | ~6.7× |
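The volume numbers are straightforward to reproduce. A minimal sketch, assuming tickets exported as dicts with a `description` field (the field name and the >50-char threshold are my reading of the table above; adapt to your export):

```python
from statistics import mean, median

MEANINGFUL_CHARS = 50  # ">50 chars" threshold from the table above

def volume_profile(tickets):
    """Count, completeness rate, and length stats for one window of tickets."""
    descs = [(t.get("description") or "") for t in tickets]
    lengths = [len(d) for d in descs]
    meaningful = [n for n in lengths if n > MEANINGFUL_CHARS]
    return {
        "tickets": len(descs),
        "pct_meaningful": round(100 * len(meaningful) / len(descs)) if descs else 0,
        "avg_chars": round(mean(lengths)) if lengths else 0,
        "median_chars": round(median(lengths)) if lengths else 0,
        "avg_words": round(mean(len(d.split()) for d in descs)) if descs else 0,
    }
```

Run it once per window and diff the two dicts; that is the whole first axis.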

Length alone is a weak signal — a wall of text isn’t quality. So I also looked at structure:

| Signal | Before | After |
| --- | --- | --- |
| Markdown headers | 0% | 62% |
| Bulleted lists | 6% | 78% |
| Tables | 0% | 26% |
| Code blocks | 6% | 28% |
| Bold emphasis | 0% | 52% |
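Each structure signal is just a regex run over the raw description text. The patterns below are my approximation of what gets matched, not the exact ones used here:

```python
import re

# One regex per signal, run in MULTILINE mode against the raw description.
STRUCTURE_SIGNALS = {
    "headers":     r"^#{1,6}\s",        # markdown headings
    "bullets":     r"^\s*[-*+]\s",      # bulleted list items
    "tables":      r"^\s*\|.*\|",       # pipe-table rows
    "code_blocks": r"^```",             # fenced code blocks
    "bold":        r"\*\*[^*]+\*\*",    # **bold** emphasis
}

def structure_profile(descriptions):
    """Percentage of descriptions containing each structural signal."""
    n = len(descriptions) or 1
    return {
        name: round(100 * sum(bool(re.search(pat, d, re.MULTILINE))
                              for d in descriptions) / n)
        for name, pat in STRUCTURE_SIGNALS.items()
    }
```

If your tickets use Jira wiki markup rather than markdown, swap the patterns (`h2.`, `*bold*`, `{code}`) but keep the same shape.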

And semantic completeness (keyword-matched sections):

| Section | Before | After |
| --- | --- | --- |
| Background / context | 6% | 44% |
| Acceptance criteria | 3% | 60% |
| Scope / out-of-scope | 0% | 18% |
| Implementation / approach | 0% | 34% |
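"Keyword-matched sections" means exactly that: a ticket counts as having a section if any of a small keyword family appears in it. A sketch with illustrative keyword lists (the lists are my assumption; tune them to your artifact):

```python
# Keyword families per section. A description "has" a section if any keyword
# appears, case-insensitively. These lists are illustrative; adapt them.
SECTION_KEYWORDS = {
    "background":          ["background", "context"],
    "acceptance_criteria": ["acceptance criteria", "definition of done"],
    "scope":               ["out of scope", "in scope"],
    "approach":            ["implementation", "approach", "proposed solution"],
}

def semantic_profile(descriptions):
    """Percentage of descriptions containing each expected section."""
    n = len(descriptions) or 1
    def has_section(desc, keywords):
        low = desc.lower()
        return any(k in low for k in keywords)
    return {
        name: round(100 * sum(has_section(d, kws) for d in descriptions) / n)
        for name, kws in SECTION_KEYWORDS.items()
    }
```

Crude substring matching over-counts a little, but it is consistent across both windows, which is all the comparison needs.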

Issue-type mix also shifted — Story went from 0 to 20 in the recent window, suggesting more deliberate decomposition rather than flat Task entries.

If you want to try it on yourself

Nothing about this is Jira-specific. What I did, roughly:

  1. Pick the artifact — the thing you produce most weeks. Mine was tickets; yours might be PRs, design docs, incident reports.
  2. Pick two equal windows — I used 2 months before heavy AI adoption and 2 months after. Equal length matters more than the exact dates.
  3. Pull the set — filter by author = you in the system of record. Jira has JQL by reporter; Git has --author; most doc tools expose an author filter.
  4. Score on three axes: volume (length, completeness rate), structure (headers, lists, tables — formatting signals deliberate thought), semantic completeness (does the artifact contain the sections it should: background, acceptance criteria, scope, approach — adapt the list).
  5. Compare profiles, not single numbers — the interesting part for me was where the artifact changed and where it didn’t.
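For steps 2 and 3 against Jira, a small helper that builds the JQL for one window is enough; `currentUser()`, `created`, and `ORDER BY` are standard JQL, and the git equivalent is `git log --author=you --since/--until`. The function name is mine:

```python
from datetime import date

def jql_for_window(start: date, end: date) -> str:
    """JQL that pulls everything you reported in one window."""
    return (
        f'reporter = currentUser() '
        f'AND created >= "{start.isoformat()}" AND created <= "{end.isoformat()}" '
        f'ORDER BY created ASC'
    )

# One query per window; paste into Jira's issue search or the REST API.
before_q = jql_for_window(date(2025, 9, 1), date(2025, 10, 31))
after_q  = jql_for_window(date(2026, 2, 1), date(2026, 4, 30))
```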

If the profile is flat between windows, whatever tooling you adopted probably isn’t doing much. If it shifted meaningfully, you’ll see exactly where.

What I took from it

It doesn’t prove I’m shipping faster, or the team is, or the tickets are better in the reader’s eye — a long structured ticket can still be wrong. What it does show me is that the artifact I hand to the team is meaningfully different from what it was six months ago — more complete, more structured, with the sections a ticket should have. Anyone who’s picked up a three-line ticket at 10pm knows why that matters.

The Unix-y version of the idea I’m keeping: one measurement, one job, done well. For me that measurement is ticket quality over time, and I’m leaving it as a personal check. The moment it becomes a team KPI, Goodhart’s Law kicks in — once a measure becomes a target, it stops being a good measure — and it’ll stop meaning anything.

Credit where it’s due: the stats above were also gathered by Claude, over the Atlassian MCP, in a few minutes of conversation. I didn’t write the jq. That feels like part of the point.