On Biological Data Generation (1/n): More Is Different, and So Is the Data
A biological data factory should not be defined by volume. It should be defined by whether it makes the right variables measurable at the right layer.
I have been trying to get more precise about what we mean when we say “biology.”
Or, more precisely, soft condensed matter self-replicating systems. Same thing, but more fun.
I have also been trying to get more precise about what we mean by “data factories.”
“More biological data” is a true answer. It is also an unsatisfying, easy, intellectually lazy one.
More of what?
Measured at which layer?
At what time resolution?
With what perturbations?
Against what notion of state?
The default answer is usually: whichever instruments we have available.
That is not good enough.
We undersample habitually. We undersample because of technical constraints in the soft condensed matter sciences. We quickly neck down to something human-manageable. And we often treat “state” as an afterthought.
One paper I keep coming back to is Philip Anderson’s 1972 essay, “More Is Different” (annotated version at https://fermatslibrary.com/p/146f107c). I do not read Anderson as making an anti-reductionist argument. The lower-level physics still matters. Molecules still obey physical law. Cells are built from molecules. Tissues are built from cells. Turtles all the way up.
But Anderson’s deeper point is that, as systems become organized, new variables become useful and sometimes necessary. Knowing the parts does not automatically give you the right description of the whole.
That feels like the right starting point for thinking about biological data factories.
The hierarchy is obvious. Measurement is not.
Biology is organized across layers:
Molecular
Subcellular
Cellular
Multicellular
Organ
Organism
That hierarchy sounds obvious. From a measurement perspective, it is not at all obvious. What are we measuring in each layer? How? How much? What for?
At the molecular layer, we may care about chemical identity, binding, conformation, energy landscapes, reaction dynamics, and modifications.
At the subcellular layer, the question shifts toward organization: organelles, membranes, ribosomes, vesicles, trafficking, cytoskeleton, and localization.
At the cellular layer, we start talking about “cell state.”
That phrase is doing a lot of compression.
A cell’s internal state includes transcripts, proteins, post-translational modifications, metabolites, chromatin, organelles, energy state, stress state, membrane composition, recent history, and a lot of things we do not routinely measure.
Single-cell RNA-seq gives us one projection of that state. It does not give us the state itself.
Maybe if we measured everything, we would have a better approximation of cell state. But would that tell us how two cells behave together?
Not yet.
N ≥ 2 changes the data problem.
The world outside a cell does not see the cell’s full internal state.
Neighboring cells do not inspect each other’s transcriptomes, proteomes, metabolomes, chromatin states, organelle states, and histories.
They see a different information set:
Receptors
Ligands
Glycans
Secreted factors
Metabolites
Ions
Mechanical forces
Electrical cues
Extracellular matrix remodeling
Physical contact
Other things I am probably missing
A useful way to think about a cell is as a high-dimensional hidden state with a lower-dimensional exchange surface.
Each cell contains much more information than it can realistically export. The channel capacity is much lower than the internal state space.
And what it exports is not a neutral summary. It is a compressed, context-dependent projection of itself.
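A minimal sketch of that picture. Every number and name here (D_INTERNAL, D_SURFACE, the sigmoid gate) is invented for illustration; the point is the shape, a big hidden vector squeezed through a small, context-dependent interface.

```python
import numpy as np

rng = np.random.default_rng(0)

D_INTERNAL = 10_000  # internal state variables (transcripts, proteins, metabolites, ...)
D_SURFACE = 50       # what the exchange surface can realistically export

# The exchange surface: a fixed, lossy map from inside to outside.
W = rng.normal(size=(D_SURFACE, D_INTERNAL)) / np.sqrt(D_INTERNAL)

def export(internal_state: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Compressed, context-dependent projection of the internal state."""
    gate = 1.0 / (1.0 + np.exp(-context))  # timing, position, history gate what is exposed
    return W @ (gate * internal_state)     # 10,000 dims in, 50 dims out

cell = rng.normal(size=D_INTERNAL)                  # neighbors never see this
signal = export(cell, rng.normal(size=D_INTERNAL))  # all they ever see is this
```

The same `cell` vector yields different 50-dimensional signals under different contexts, which is exactly what makes the receiver’s job ambiguous.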
That creates an information problem:
How much of the internal state of a cell is visible to its neighbors?
How does that visibility change?
How much depends on timing, spatial position, and prior history?
How much is lost because the interface is limited?
How much is inferred by the receiver?
The same signal can mean proliferation, migration, activation, exhaustion, tolerance, or death depending on the receiving cell and its local context.
So, at the multicellular layer (N ≥ 2), we are no longer just measuring the interiors of many cells.
We are measuring partially hidden systems exchanging compressed signals over space and time.
That is a different object.
The multicellular state is not the sum of cell states.
Everyone agrees that multicellular state ≠ sum(cell states).
But that does not get us very far.
A better approximation might be: multicellular state ≈ cell states + interfaces + topology + time
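As a data structure, a sketch of that approximation; `TissueSnapshot` and its fields are invented names, not a proposed schema.

```python
from dataclasses import dataclass

@dataclass
class TissueSnapshot:
    """One observation of a multicellular system at a single timepoint."""
    t: float                                          # time
    cell_states: dict[str, list[float]]               # cell_id -> internal state vector
    interfaces: dict[tuple[str, str], list[float]]    # (sender, receiver) -> exchange vector
    positions: dict[str, tuple[float, float, float]]  # spatial topology

# multicellular state ≈ cell states + interfaces + topology + time:
# a trajectory of snapshots, not a bag of cell vectors.
trajectory: list[TissueSnapshot] = []
```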
That is where Anderson’s argument comes back.
If each layer has its own useful variables, then measuring the lower layer more thoroughly may still not yield the right data for the layer above.
Bulk genomic sequencing was, in some sense, the easy case of biological data extraction. DNA gave us something unusually friendly to industrialization: a stable molecule, a mostly linear alphabet, amplification, and a scalable readout.
Most of biology is not like that.
Most of biology is hidden state, partial observability, compressed exchange, spatial organization, and feedback.
Data factories should be layer-appropriate.
A biological data factory should not be defined as a machine that generates more data.
It should be defined as a machine that generates layer-appropriate data.
That means the design has to change layer by layer.
Molecular layer: capture fast physical states at massive scale.
Subcellular layer: capture spatial and temporal organization, probably with some intelligent coarse-graining.
Cellular layer: measure internal state plus mass transfer to and through the surface.
Multicellular layer: measure exchange, topology, perturbation, and time.
Organ layer: capture function, flow, architecture, innervation, perfusion, and control.
Organism layer: capture longitudinal history.
Those are not the same data problem.
They are different observability problems.
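One way to make “layer-appropriate” concrete is a per-layer capture spec. A sketch only, with every key and value improvised from the list above:

```python
# Illustrative, not a standard: what each layer's factory would prioritize.
FACTORY_SPECS = {
    "molecular":     {"capture": ["conformation", "binding", "dynamics"],      "timescale": "ns-ms"},
    "subcellular":   {"capture": ["localization", "trafficking"],              "timescale": "s-min"},
    "cellular":      {"capture": ["internal_state", "surface_flux"],           "timescale": "min-h"},
    "multicellular": {"capture": ["exchanges", "topology", "perturbations"],   "timescale": "h-days"},
    "organ":         {"capture": ["function", "flow", "perfusion", "control"], "timescale": "days"},
    "organism":      {"capture": ["longitudinal_history"],                     "timescale": "years"},
}
```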
A rough mental model I am playing with is:
biological bandwidth ≈ entities × state variables × exchange variables / time constant
Low in the hierarchy, biology can be fast, physical, and high-dimensional.
High in the hierarchy, biology can be slower, history-dependent, and difficult to perturb.
In the middle, where cells become tissues, the problem seems especially hard because internal state, exposed interface, spatial topology, and time all matter at once.
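Plugging invented, order-of-magnitude numbers into that formula, just to compare layers:

```python
def biological_bandwidth(entities, state_vars, exchange_vars, time_constant_s):
    """entities × state variables × exchange variables / time constant."""
    return entities * state_vars * exchange_vars / time_constant_s

# All inputs below are placeholders, not measurements.
molecular = biological_bandwidth(1e9, 10, 5, 1e-6)    # fast, physical, high-dimensional
tissue    = biological_bandwidth(1e5, 1e4, 50, 60)    # the middle: everything at once
organism  = biological_bandwidth(1, 1e3, 20, 86_400)  # slow, history-dependent

print(f"{molecular:.1e}  {tissue:.1e}  {organism:.1e}")  # 5.0e+16  8.3e+08  2.3e-01
```

Toy numbers, but they show how each layer trades entity count against state dimensionality against time.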
That middle layer may be where many AI-for-biology conversations are still underdeveloped. The model can only learn the variables that the measurement system exposes. Yes, some of the missing variables may be inferable. I do not think we are there yet.
Change my mind.
The engineering question
The deeper engineering question is: What would we have to build so that the right variables become measurable?
A purpose-built biological data factory would turn some layer of biology into training data by selecting perturbations, measuring states, capturing exchanges, preserving spatial structure, tracking time, and using what was learned to decide what to measure next.
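A hypothetical shape for that loop. Every function below is a stub standing in for real hardware and assays, and the names are mine, not an existing API; the point is that the loop closes.

```python
import random
import time

def choose_perturbation(history):
    # Active learning would live here: pick the probe expected to reduce
    # the most uncertainty given everything measured so far.
    return random.choice(["dose_A", "dose_B", "knockout_X", "control"])

def apply_perturbation(p):          # stub for the wet-lab step
    return {"perturbation": p}

def measure_states(sample):         # stub: internal variables
    return [random.random() for _ in range(4)]

def measure_exchanges(sample):      # stub: interface variables
    return [random.random() for _ in range(2)]

def capture_topology(sample):       # stub: spatial structure
    return [(random.random(), random.random()) for _ in range(3)]

def run_factory(budget=10):
    observations = []
    for _ in range(budget):
        sample = apply_perturbation(choose_perturbation(observations))  # select perturbation
        observations.append({
            "perturbation": sample["perturbation"],
            "states": measure_states(sample),        # measure states
            "exchanges": measure_exchanges(sample),  # capture exchanges
            "topology": capture_topology(sample),    # preserve spatial structure
            "t": time.time(),                        # track time
        })
    return observations  # each round feeds the next choice of perturbation
```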
This is only Level 1 of the argument.
But I think Anderson gives us the right starting point.
Before asking how much biological data we need, we should ask which layer we are trying to understand.
Because more is different.
Reference: Philip W. Anderson, “More Is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science,” Science 177, no. 4047 (1972): 393–396.

Very well said.
Our thesis is that functional physiological measures are underutilized in clinical trials.
High-resolution molecular data ends up anchored to reductionist clinical parameters (e.g., a metabolomic signature to predict biological age; what does age actually tell you about a person’s capabilities?).
Datasets that combine high-resolution molecular information with functional physiology (e.g., peak exercise capacity, insulin sensitivity via hyperinsulinemic-euglycemic clamps) give you meaningful functional benchmarks to anchor omics to.
Would love to chat about some of the prospective clinical trials we are taking on.
Biologist who did business intelligence at a hedge fund for nearly a decade here.
The data we buy in finance has a price tag weighed against the value of information it contains.
Biologists don’t often think of the value of information in financial terms, but this is crucial. In finance, we can ballpark the value of information from the total volume of arbitrage opportunities it creates (not just a % change, but the actual number of dollars you could make trading on that information).
For bio, we’re not developing trading strategies, but data implicitly drives decisions, so we need to refocus on the decisions. Are we making a diagnosis? Preparing medical managers for a surge in demand? Choosing a medicine with a more specific estimate of safety and efficacy? Deciding on warfighter behavioral or protocol changes in light of an outbreak in the unit? Making a new drug?
From a business intelligence perspective, the decisions we’re trying to make should inform the “request for data/information,” phrased in terms of the quantity we’re trying to estimate better.
That can inform which data provide the most valuable information, creating a better economic exchange of information on biological systems.