Ayhan Sipahi

Ways of Working, Written Down: The Documents a Mature Engineering Team Owns

A guide to the team documents a mature engineering team owns: onboarding, working agreements, Definition of Done, on-call, knowledge transfer, and what makes each one good.

Most teams keep their ways of working in people’s heads. The result is predictable: every new joiner gets a different, lossy version of “how we do things,” on-call is stressful because the expectations were never written, and a key person leaving takes months of context with them. The fix is not “write more docs”; it is to treat ways-of-working as a small, named set of living documents, each with a single owner, stored next to the work, with onboarding delivered as a tracked, checklisted program rather than a buddy ritual. What follows is a taxonomy of the specific documents a healthy team owns, with a stance on what makes each one good versus its common failure mode.

This is not about API docs or architecture references; that knowledge-permanence layer is covered in documentation as infrastructure. The layer on top is the human operating manual for how a team works, joins, and hands off.

The Document Map

A mature team owns a handful of named documents, not a sprawling wiki. Each one answers a different question, and each has an owner. The map below is the whole post in one picture: six document types, the question each answers, and who keeps it alive.

Team's Ways of Working (document map):
Onboarding Program, owner: team lead
Working Agreements, owner: the team
Definition of Done, owner: tech lead
On-Call Expectations, owner: on-call lead
Decision Records, owner: author
Knowledge Transfer, owner: manager

The rest of this post walks each one: what it is, what the good version looks like, and the failure mode it replaces.

Onboarding as a Tracked Program

The default that works is onboarding implemented as tracked tickets, not a person to shadow. Picture a parent epic with one sub-ticket per knowledge area, each time-boxed and assigned: intro to the team, architecture overview, ways of work (meetings and boards), team agreements, Definition of Done, development lifecycle, deployment and monitoring, and access requests. Named roles carry it: a team lead owns the program, a mentor answers questions, and the new joiner (the mentee) works through the list. The program produces a measurable “ready” state instead of a vague feeling that someone has settled in.

Onboarding Epic (diagram), sub-tickets leading to the Ready Gate:
Intro to Team
Architecture
Ways of Work
Team Agreements
Definition of Done
Dev Lifecycle
Deploy & Monitoring
Request Access
Ready Gate
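To make the "measurable ready state" concrete, here is a minimal sketch of the epic as data. The class names, the `timebox_days` default, and the rule that the gate opens only when every sub-ticket is closed are illustrative assumptions, not a description of any particular ticketing tool.

```python
from dataclasses import dataclass, field

# Knowledge areas taken from the onboarding epic described above.
AREAS = [
    "Intro to Team", "Architecture", "Ways of Work", "Team Agreements",
    "Definition of Done", "Dev Lifecycle", "Deploy & Monitoring",
    "Request Access",
]


@dataclass
class OnboardingTicket:
    """One time-boxed sub-ticket of the epic (hypothetical model)."""
    area: str
    assignee: str
    timebox_days: int
    done: bool = False


@dataclass
class OnboardingEpic:
    mentee: str
    mentor: str
    owner: str  # the team lead who owns the program
    tickets: list = field(default_factory=list)

    def remaining(self) -> list:
        """Knowledge areas the mentee has not completed yet."""
        return [t.area for t in self.tickets if not t.done]

    def is_ready(self) -> bool:
        """Ready gate (assumed rule): all sub-tickets are closed."""
        return bool(self.tickets) and not self.remaining()


def new_epic(mentee: str, mentor: str, owner: str, timebox_days: int = 2) -> OnboardingEpic:
    """Create an epic with one sub-ticket per knowledge area."""
    tickets = [OnboardingTicket(area, mentee, timebox_days) for area in AREAS]
    return OnboardingEpic(mentee, mentor, owner, tickets)
```

The point of the sketch is that "ready" becomes a question the tracker can answer (`epic.is_ready()`), and `epic.remaining()` tells the mentor exactly what is still open.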

The failure mode is the buddy ritual: “follow someone around for two weeks.” It depends on who is free, it is different every time, and it has no completion signal. Nobody can say whether the new joiner is actually ready, because nothing tracked what “ready” means.

There is also a timing trap. Google’s re:Work guidance frames onboarding as something that “takes place over months, not days,” and recommends checking in more than once: surveying new hires at 30, 90, and 365 days across three areas, “1) technology and tools, 2) productivity and skills, and 3) culture and connection.” A first-week checklist that ends on day five misses most of what onboarding is. Structure the program to span the first quarter, and make the new joiner’s first pull request an improvement to the onboarding doc itself. That single habit keeps the document alive, because the person most aware of its gaps fixes it while the gaps are fresh.

Trade-off. A tracked program costs authoring and maintenance time up front, and a stale checklist is worse than none because it teaches the wrong thing with authority. The “first PR improves onboarding” rule is what pays that cost down; without an owner and a feedback loop, skip the ceremony and keep a short README instead.

Team Working Agreements

Working agreements are the explicit norms a team sets for itself: meeting cadence, core hours, code-review response time, which channel is for what, and how decisions get made. The good version is short, specific, and observable. “We review open pull requests within one working day” is an agreement; “we value collaboration” is a poster.

The non-obvious property is ownership. Agreements have to be authored by the team, not handed down. Atlassian’s Working Agreements play is explicit that they are “simply a starting point that can and should be updated over time,” to be revisited quarterly, when onboarding new members, during reorgs, or “when an agreement can no longer be upheld,” with the team voting to keep or change each one. The moment an agreement is imposed top-down, it stops describing how the team works and starts describing how someone wishes it did.

Code-review norms deserve their own line in this document, because review is where a lot of “how we work” actually lives; the difference between nitpicking and knowledge sharing is a cultural choice worth making explicit, covered in code review culture.

Trade-off. Revisiting agreements on a cadence is real recurring work, and a team that never revisits ends up with a document that contradicts its own behavior. By contrast, over-specifying every interaction calcifies into process theater. Keep the list to the handful of norms that actually cause friction when they are unwritten.

Definition of Done

The Definition of Done is the shared, written checklist for what “done” means: tests written, reviewed, docs updated, deployed, observable. The Scrum Guide 2020 calls it “a formal description of the state of the Increment when it meets the quality measures required for the product,” and notes that it “creates transparency by providing everyone a shared understanding of what work was completed,” with developers “required to conform to the Definition of Done.” Even teams that do not run Scrum benefit from the framing: a single, visible, canonical “done” removes a per-engineer negotiation that otherwise leaks bugs past review.

The failure mode is an implicit Definition of Done, which means “done” is whatever each engineer decides under deadline pressure. One person’s “done” is merged; another’s is deployed and verified in production. The gap between those two definitions is where regressions live.
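The gap between "merged" and "deployed and verified" can be illustrated with a small sketch. The checklist items come from the list above; the function names and the idea of representing completed work as a set of strings are assumptions made for illustration.

```python
# The written, shared checklist. Items mirror the ones named in the text.
DEFINITION_OF_DONE = (
    "tests written",
    "reviewed",
    "docs updated",
    "deployed",
    "observable",
)


def missing_from_done(completed: set) -> list:
    """Return the checklist items a piece of work has not met, in order."""
    return [item for item in DEFINITION_OF_DONE if item not in completed]


def is_done(completed: set) -> bool:
    """Work is done only when nothing on the checklist is missing."""
    return not missing_from_done(completed)


# One engineer's "done": merged after tests and review.
merged_only = {"tests written", "reviewed"}
print(missing_from_done(merged_only))
# → ['docs updated', 'deployed', 'observable']
```

With a single canonical checklist, the disagreement becomes visible as a list of missing items instead of a per-engineer judgment call.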

There is a productive tension worth naming. Scrum treats the Definition of Done as a formal commitment developers must conform to; Kanban and continuous-flow teams often keep a lighter, evolving “done.” Both camps agree the definition must be written and shared. They disagree only on formality, and that choice should follow the team’s maturity, not dogma. A new team benefits from the discipline of a strict checklist; a high-trust team may safely run a lighter one.

Trade-off. A strict Definition of Done can slow a team that has outgrown it, turning a quality gate into a bureaucratic stamp. The fix is to let the document evolve with the team rather than freezing the version written on day one.

On-Call and Incident Expectations

On-call usually exists as a schedule. What is missing is the expectations: severity levels, escalation path, response-time targets, handoff procedure, runbooks, and a blameless postmortem template. The good version writes those down before the incident, so nobody negotiates severity at the worst possible moment.

Google’s SRE book sets a useful anchor for response times: they are “agreed to by the team and the business system owners,” with “typical values” of “5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.” PagerDuty’s incident-response guidance is just as blunt about the human reality, noting that team alert escalation “happens within 5 minutes” and that an on-call engineer is “expected to be able to respond to issues, even at 2am.” Writing these expectations down is what makes the rotation sustainable instead of heroic; the SRE book is explicit that reducing on-call stress keeps engineers in deliberate rather than panicked decision-making.
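Written response-time targets can be as simple as a small table that a tool or a human can check against. The sketch below uses the "typical values" quoted above; the service-class names and the function are hypothetical, and a real team would substitute the targets it has agreed with its business owners.

```python
from datetime import datetime, timedelta

# Targets from the SRE book's "typical values"; the keys are made-up labels.
RESPONSE_TARGETS = {
    "user_facing": timedelta(minutes=5),
    "less_time_sensitive": timedelta(minutes=30),
}


def responded_in_time(service_class: str, paged_at: datetime,
                      acknowledged_at: datetime) -> bool:
    """True if the page was acknowledged within the written target."""
    return acknowledged_at - paged_at <= RESPONSE_TARGETS[service_class]


paged = datetime(2024, 1, 1, 2, 0)  # a 2am page
ack = paged + timedelta(minutes=6)
print(responded_in_time("user_facing", paged, ack))          # → False
print(responded_in_time("less_time_sensitive", paged, ack))  # → True
```

The same six-minute acknowledgement is a miss for one service class and fine for another, which is exactly the negotiation the written expectations remove from the incident itself.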

The postmortem half matters just as much. Google’s guidance is worth quoting directly: “Removing blame from a postmortem gives people the confidence to escalate issues without fear,” and “an atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug.” It also stresses defining postmortem criteria before an incident occurs, “so that everyone knows when a postmortem is necessary.” A blame-free, pre-defined process is the difference between learning from an incident and hiding it.

The failure mode is on-call as “whoever is around,” with no runbooks and a blame culture. It is hero-dependent, it burns people out, and it guarantees the same incident recurs because nothing was written down afterward.

Trade-off. Runbooks and postmortems cost time that, during a calm week, feels like overhead. That cost is paid back the first time a 2am page resolves in minutes because the runbook existed. The risk to avoid is runbook rot: an out-of-date runbook actively misleads, so assign each one an owner and review it after it is used.

Knowledge Transfer for Joining and Leaving

Onboarding has a mirror image that most teams skip: offboarding. The same checklist discipline that brings someone in should carry their context out when they leave. A good knowledge-transfer document lists the systems a departing person owned, transfers access deliberately, records the tacit context that lives only in their head, and names the new owner for each responsibility.

Team Topologies offers a useful lens here through the idea of a “team API,” the documented surface of how a team is consumed. Its model holds that “there are only three ways in which a team should interact: collaboration, X-as-a-service, and facilitation.” A team that has written down how it is consumed survives a member leaving, because the interface outlives the individual. (For how team boundaries shape that interface, see team autonomy structures.)

The failure mode is the bus-factor risk realized: offboarding as a farewell message, with six months of context walking out the door. The transfer mechanics belong in this document; the broader risk, and how to measure and reduce it, is the subject of bus factor and knowledge management.

Trade-off. A full transfer takes time the departing person may not have, especially on a short notice period. Prioritize: document the systems with the highest cost to rediscover first, and accept that the volatile, easily-relearned details can be left out.
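One way to apply that prioritization is to score each item and sort. This is a rough sketch under stated assumptions: the `TransferItem` type, the 1-to-5 `rediscovery_cost` score, and the `volatile` flag are all hypothetical, chosen only to show the ordering rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferItem:
    """Something the departing person owned or knows (hypothetical model)."""
    name: str
    rediscovery_cost: int          # assumed scale: 1 (cheap) to 5 (expensive)
    volatile: bool = False         # changes so often that notes go stale
    new_owner: Optional[str] = None


def transfer_order(items: list) -> list:
    """Drop volatile items, then put the costliest-to-rediscover first."""
    stable = [item for item in items if not item.volatile]
    return sorted(stable, key=lambda item: item.rediscovery_cost, reverse=True)


def unassigned(items: list) -> list:
    """Names of items that still have no named new owner."""
    return [item.name for item in items if item.new_owner is None]
```

On a short notice period, the departing person works down `transfer_order(...)` from the top, and `unassigned(...)` shows which responsibilities still lack a named owner.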

Decision Records

Significant technical decisions need a written record of their context and reasoning, so the “why” survives the meeting it was made in. This is one entry in the taxonomy that already has a dedicated guide, so rather than re-teach the mechanics, treat decision records (ADRs and RFCs) as a first-class team document and lean on the existing write-up in the anatomy of a technical RFC. The good version captures the why and keeps an immutable history close to the code; the failure mode is decisions buried in chat threads, their context lost the moment the thread scrolls away.

What Makes Any Team Doc Good

The six documents above share the same properties, and those properties are the real argument here. A team document earns its place only when it is:

Property | What it means | The failure it prevents
Living | Has an owner and a review cadence; staleness is treated as a bug | The wiki graveyard nobody trusts
Owned | A named person, not “the team” | Diffused responsibility, so no one updates it
Actionable | A reader can do something, not just nod | Aspirational posters
Close to the work | In the repo or tool people already use | The forgotten wiki on a separate system
Self-serve | A new joiner needs no synchronous human to start | Onboarding that blocks on someone’s calendar
Structured | Each doc knows its job (tutorial, how-to, reference, explanation) | Documents that mix modes and confuse readers

That last property comes from Diátaxis, which splits documentation into four modes by user need: tutorial, how-to, reference, and explanation. Mixing them in one page is a common reason docs feel confusing; a “how do I deploy” how-to and a “why we deploy this way” explanation serve different readers and belong in different places.

The single-source-of-truth principle ties the rest together. The GitLab Handbook is direct about it: instead of repeating content, cross-link it, because duplicate content creates more maintenance, and a single source of truth lives in only one system. Cross-linking, not copying, is why these documents point at each other rather than restating each other.

GitLab also models the living-document stance at its most aggressive: “everything is in draft at GitLab and subject to change, this includes our handbook.” Treating every document as a draft anyone can improve via a merge request is what keeps a large handbook from rotting. That is the handbook-first end of the spectrum, and it is not free, which leads to the central tension.

When Not to Document

Handbook-first culture, in the GitLab style, pushes toward documenting nearly everything as a single source of truth. The lean counter-argument is that volatile documents go stale faster than anyone maintains them, and over-documentation calcifies a team’s ability to change. Both are right within their context, and the resolution is a rule about what to write down: document the stable and the high-cost-to-rediscover, and skip the volatile. Staleness is a bug an owner fixes, not a reason to never write.

Decision flow (diagram): Setting up team docs?
New or greenfield team? Yes: start small (onboarding + DoD + agreements).
In pain on incidents? Yes: on-call expectations + runbooks first.
Distributed or async? Yes: push toward handbook-first.
Small, co-located, high-trust? Yes: README-in-repo may be enough.

The override cases follow from the team’s situation. A new or greenfield team should start with the smallest viable set: an onboarding checklist, a Definition of Done, and working agreements. A team in pain on incidents should write on-call expectations and runbooks before anything else. A distributed or async team gains the most from pushing toward handbook-first, because asynchronous work cannot fall back on tapping a shoulder. A small, co-located, high-trust team may find a README in the repo is genuinely enough, and forcing more process on it is the over-documentation failure in a different costume.
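The override cases can be read as a short decision function. The sketch below is one possible encoding: the `TeamSituation` fields are hypothetical names, the order of the checks follows the order in which the cases are listed, and the final fallback to the small starter set is an assumption based on the post's stated default rather than something the flow spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSituation:
    """Yes/no answers to the four questions (hypothetical field names)."""
    greenfield: bool = False
    incident_pain: bool = False
    distributed_or_async: bool = False
    small_colocated_high_trust: bool = False


START_SMALL = "Start small: onboarding + DoD + agreements"


def first_docs(team: TeamSituation) -> str:
    """Recommend where a team should start with its documents."""
    if team.greenfield:
        return START_SMALL
    if team.incident_pain:
        return "On-call expectations + runbooks first"
    if team.distributed_or_async:
        return "Push toward handbook-first"
    if team.small_colocated_high_trust:
        return "README-in-repo may be enough"
    # Assumed fallback: no override applies, so use the default starter set.
    return START_SMALL


print(first_docs(TeamSituation(incident_pain=True)))
# → On-call expectations + runbooks first
```

Because the checks run in order, a team that is both distributed and in pain on incidents gets the on-call recommendation first; a team could reasonably reorder the checks to match its own priorities.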

Common Pitfalls

A few failure patterns recur often enough to name directly. Treating the wiki as the goal rather than the source of truth is the most common: measure success by “did the new joiner self-serve,” not by page count. Writing working agreements top-down produces a document the team ignores, because they never agreed to it. Documenting everything, including the volatile, calcifies the team and buries the stable docs that matter under noise. An onboarding doc that is never updated drifts into being actively wrong, which is why the “first PR improves onboarding” rule exists. And on-call written as a schedule but not as expectations leaves the rotation as stressful as it was before anyone wrote anything.

Closing

Treat ways-of-working as a small set of living, owned documents stored next to the work, and deliver onboarding as a tracked program rather than a buddy ritual. That default holds for almost any team past its first few engineers. Reach for the handbook-first extreme when the team is distributed and asynchronous, and reach for a lean README-only setup when the team is small, co-located, and high-trust enough that the cost of maintaining more would exceed its value. The single next step is the cheapest one with the highest return: turn your onboarding from a person to shadow into a tracked checklist, and make improving it the first task you hand a new joiner.
