Building Computed Fields in a Biological Database

Somak Das
Benchling Engineering
12 min read · Oct 18, 2017


We recently added computed fields to Benchling’s Bioregistry. With this feature, you can store a biological entity associated with fields you set, and define computed fields that are functions over those fields. We will automatically compute those fields for you. Because biological entities are frequently interlinked with other biological entities in the Bioregistry, forming an entity graph, a computed field can end up depending on thousands of fields for its value. Here’s how we did it.

Background

Bioregistry

The Bioregistry empowers scientists to track biological entities in a structured manner. For example, a cell line researcher at a biotech company can define a Cell Line schema that has a Parent field that links to another Cell Line. They can then create an entity in the Bioregistry that has that schema: E Coli 2, with its Parent field set to E Coli 1. Essentially, it’s a biological database for users with a background in biology (instead of in databases): schemas are tables, fields are columns, and entities are rows.

Cell Line entities E Coli 1 and its child E Coli 2

Similar to a traditional database, the Bioregistry supports auto-incrementing or UUID primary keys like CL001, foreign keys, unique constraints, non-nullable columns, and column types. However, unlike a traditional database, it also supports bio-aware constraints such as part links (the linked entity must be a valid DNA part contained in this entity) and translation links (the linked entity must be a valid amino acid translation of this entity). In this post, we will discuss another useful addition to the Bioregistry for Benchling’s users: computed fields.

Architecture

This feature needed to integrate seamlessly with the rest of the Bioregistry. Users interact with the Bioregistry in three ways: the Web app, the API, and the warehouse.

First and foremost, we have the Web app: a friendly UI to write to and read from the Bioregistry. It supports listing and searching over the entities in the Bioregistry, importing them in bulk from files, and exporting them all to files. At a biotech company, entities are often created within the context of complex R&D workflows, so Benchling also supports creating entities directly from a workflow. This means that scientists can fill out lab notebook entries containing data tables, and with a click of a button, Benchling will parse their tables into entities structured for the Bioregistry and create them.

Cell Line entity E Coli 2 (CL002), with parent E Coli 1 (CL001)
Bioregistry search to find Cell Line entities with parent E Coli 1 (CL001)
Lab notebook entry to create Cell Line entity inline

For the biotech company with an in-house IT team or tech-savvy scientists, the second and third ways are the API and warehouse. The API has individual and bulk GET, POST, PATCH, and DELETE endpoints for entities that user scripts can call to programmatically access the Bioregistry. The warehouse is a PostgreSQL or Redshift database mirror of the Bioregistry that users and their existing data visualization apps can query.

Internally, we back the Bioregistry with a traditional database and sync its contents to the warehouse and a search index. Specifically, Benchling’s back-end is a Python Flask server with SQLAlchemy ORM connected to a PostgreSQL database, a PostgreSQL or Redshift warehouse, and an Elasticsearch index. It also employs Celery workers to run asynchronous tasks, like bulk imports and exports.

Motivation

Scientists want to perform analysis and answer questions about their data. The cell line researcher needs to know the properties of the current cell line before experimenting with it. These properties are determined from its traits, both new and inherited from its ancestors. An example property is the set of antibiotics that the cell line is resistant to, whether inherited, acquired through genetic mutation, or introduced via genetic modification. Similarly, an antibody researcher needs to know which antibody (a protein complex that helps the immune system fight disease) is a good candidate for therapy. To do so, she needs to know its biochemical properties, like molecular weight, aggregated over its individual protein chains.

Antibody entity Ab1 with chains Chain A and Chain B

Calculating these values used to be difficult: researchers had to export the entities to other software, like Microsoft Excel, and calculate them there with formulas. Plus, the calculations for each entity depend on many other entities. A cell line’s antibiotic resistance is its new resistance plus the resistance inherited from its parent, grandparent, and so on; this lineage can easily span hundreds of ancestors. An antibody’s molecular weight depends on the molecular weights of its chains and their configuration within the protein complex.

This motivated us to build computed fields. Computed fields are fields that are computed by Benchling as functions of the entity’s other fields or fields of any linked entity. They are auto-updated when any dependency field changes. We want to empower Bioregistry users to make decisions using computed field values, so they should be able to view, list, and search by them anywhere in the Bioregistry: Web app, API, and warehouse. For example, the antibody researcher can search for antibodies with a molecular weight < 150 kg/mol, and the cell line researcher can search for all cell lines resistant to the antibiotic ampicillin.

Our first design goal is that the computed field values are available for all entities at time of viewing, listing, or searching in the Web app, API, and warehouse. At a biotech company, the Bioregistry can quickly grow to tens of thousands of entities, a scale that precludes computing them on the fly. At this scale, we still don’t want to mislead users by showing them stale computed field values, so if a dependency changes, Benchling should simultaneously invalidate the existing value and auto-update to its new value. Finally, computed fields must support a variety of functions, like aggregating over parents recursively (the Cell Line example, with a Parent cell line) or calculating biochemical properties over children (the Antibody protein complex example, with protein Chains).

System design

To recap, the design goals of computed fields are to

  • be available across the Bioregistry including listing and searching
  • never be stale, though it’s OK to clear a stale value and replace with “Recomputing…” before the new value finishes computing
  • support computations across an entity’s fields and linked entities’ fields

We divided the system design into three modules: the database models, so computed field definitions and values persist in Benchling; invalidation, so values are never stale; and computation, to actually compute new values.

Previous work

Computed fields appear across various platforms. For example, Drupal has a computed field module, Ember has computed properties, and Salesforce has formula fields and roll-up summary fields.

These example implementations have a variety of guarantees as well as limitations. Drupal’s computed fields can be computed when any other field of the same content type is saved. Ember’s computed properties compute when they are accessed.

Setup

The case studies are the cell line and antibody examples introduced above.

Note that there is both the schema, which is the template for an entity, and the entity, which is the instantiation of the schema. Since “field” can be an ambiguous word, we use the term field definition to refer to the field in the schema and field value to refer to the field in the entity. So the entity E Coli 2 with schema Cell Line has field value Parent: E Coli 1.

Because entities are interlinked with other entities in the Bioregistry, like E Coli 1 & 2 in the example, they form an entity graph where the nodes are entities and edges are field values (E Coli 2 -Parent-> E Coli 1; Ab1 -Chains-> Chain A, Chain B). Similarly, the schemas form a schema graph where the nodes are schemas and edges are field definitions (Cell Line -Parent-> Cell Line; Antibody -Chains-> Chain). It is a template for the entity graph. These concepts will be useful when working with field definitions and values for computed fields.
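
To make the two graphs concrete, here is a minimal sketch in Python (the names and shapes are illustrative, not Benchling’s actual data structures):

    # Schema graph: nodes are schemas, edges are field definitions.
    schema_graph = {
        "Cell Line": {"Parent": "Cell Line"},  # Cell Line -Parent-> Cell Line
        "Antibody": {"Chains": "Chain"},       # Antibody -Chains-> Chain
        "Chain": {},
    }

    # Entity graph: nodes are entities, edges are field values. It is an
    # instantiation of the schema graph above.
    entity_graph = {
        "E Coli 1": {"schema": "Cell Line", "fields": {}},
        "E Coli 2": {"schema": "Cell Line", "fields": {"Parent": ["E Coli 1"]}},
        "Ab1": {"schema": "Antibody", "fields": {"Chains": ["Chain A", "Chain B"]}},
        "Chain A": {"schema": "Chain", "fields": {}},
        "Chain B": {"schema": "Chain", "fields": {}},
    }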

Database models

Computed field definitions: Benchling is backed with a traditional database, which already stores field definitions: name, type, etc. To augment those field definition models for computed fields, we additionally store the function and dependencies. We had to choose a balance between hardcoding computed fields and persisting full configuration for them.

Since we defined computed fields as a function over dependency fields, we found it a natural division to store them separately. That way, the function and the inputs to the function can be edited independently. The intuition is that there are many ways to configure computed fields because users set up their schemas differently, but they share the same functions. One biotech company may have Antibody and Chain schemas, while another has Protein Complex and Protein schemas, but both can reuse the same molecular weight function that takes amino acid sequences as an input array of strings.

Since the server is written in Python, the computed field functions are written in Python as well and referred to by a unique function name, such as molecular_weight, which is persisted to the database. Each computed field dependency is stored as a function parameter name plus the path along the schema graph to the dependency field.
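
As a sketch of what this could look like with SQLAlchemy (the model, registry, and formula here are simplified stand-ins, not Benchling’s actual code):

    from sqlalchemy import JSON, Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    # Registry mapping persisted function names to Python functions.
    COMPUTED_FIELD_FUNCTIONS = {}

    def computed_field_function(fn):
        """Register a function under its name so definitions can refer to it."""
        COMPUTED_FIELD_FUNCTIONS[fn.__name__] = fn
        return fn

    @computed_field_function
    def molecular_weight(amino_acid_sequences):
        # Deliberately oversimplified: a rough average residue mass per
        # amino acid, summed over all chains.
        AVG_RESIDUE_MASS = 110.0  # daltons
        return sum(len(seq) * AVG_RESIDUE_MASS for seq in amino_acid_sequences)

    class ComputedFieldDefinition(Base):
        __tablename__ = "computed_field_definition"
        id = Column(Integer, primary_key=True)
        field_definition_id = Column(Integer)  # the field definition it augments
        function_name = Column(String, nullable=False)  # e.g. "molecular_weight"
        # Parameter name -> path along the schema graph to the dependency field,
        # e.g. {"amino_acid_sequences": ["Chains", "Amino Acid Sequence"]}
        dependencies = Column(JSON, nullable=False)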

Computed field values: Similarly, the database already stores field values: text, number, linked entity, list of linked entities, etc. To augment those field value models for computed fields, we additionally store the computation status so that Benchling knows if the computed field value is queued for computation, computing, successfully computed, or failed with an error.

We decided against the alternative of storing computed field values in a separate cache. Since we store them in the database the same way as other field values, we get the full functionality for free: availability across the Web app, API, and warehouse that the Bioregistry already supports. This way of modeling is agnostic of computation strategy, so we can still compute on save like Drupal, on load like Ember, or on dependency change like the design goals prescribe.
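
Continuing the sketch, the field value model might carry its value alongside a status column (again, hypothetical names):

    import enum
    from sqlalchemy import Column, Enum, Integer, String

    class ComputationStatus(enum.Enum):
        QUEUED = "queued"
        COMPUTING = "computing"
        SUCCEEDED = "succeeded"
        FAILED = "failed"

    class ComputedFieldValue(Base):  # Base from the previous sketch
        __tablename__ = "computed_field_value"
        id = Column(Integer, primary_key=True)
        entity_id = Column(Integer)            # the entity this value belongs to
        definition_id = Column(Integer)        # its ComputedFieldDefinition
        value = Column(String, nullable=True)  # null while (re)computing
        status = Column(Enum(ComputationStatus), nullable=False,
                        default=ComputationStatus.QUEUED)
        error_message = Column(String, nullable=True)  # set when FAILED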

Invalidation

If any dependency changes, we want to invalidate the computed field values that depend on it. The design decisions are where and how to do so. We decided to log all field value changes and process them in a pre-commit hook.

A simple way to never show stale values is to invalidate them in the same database transaction as the dependency change. We could invalidate asynchronously, but that approach wouldn’t have the same guarantees. So when a field value changes, we synchronously log the change as a possible dependency change. Before the server commits the transaction, we determine whether the changed field values are real dependencies: we query for any computed field definitions whose dependencies contain those fields in their path. For example, if Chain A’s Amino Acid Sequence changes, we know Amino Acid Sequence is in Molecular Weight’s dependencies.
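
A sketch of what those hooks could look like with SQLAlchemy session events (the event API is real SQLAlchemy; the models and helpers are hypothetical):

    from sqlalchemy import event, inspect
    from sqlalchemy.orm import Session

    @event.listens_for(Session, "before_flush")
    def log_field_value_changes(session, flush_context, instances):
        """Synchronously log every changed field value as a possible
        dependency change."""
        for obj in session.dirty:
            if isinstance(obj, FieldValue):  # hypothetical field value model
                if inspect(obj).attrs.value.history.has_changes():
                    session.info.setdefault("changed_field_values", []).append(obj)

    @event.listens_for(Session, "before_commit")
    def invalidate_dependents(session):
        """Before committing, invalidate computed field values that truly
        depend on the logged changes (traversal sketched below)."""
        for field_value in session.info.pop("changed_field_values", []):
            for cfv in find_dependent_computed_values(session, field_value):
                cfv.value = None
                cfv.status = ComputationStatus.QUEUED
                compute_field_value.delay(cfv.id)  # enqueue the Celery task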

For each such dependency, we need to find exactly which computed field values depend on it and invalidate them. To answer that, we traverse the entity graph in reverse, using the computed field dependency’s schema graph as a template. In the antibody example, many entities may link to Chain A in different ways, but we specifically want to find the Antibody that links to it via the Antibody -Chains-> Chain schema graph and invalidate its Molecular Weight field value. By traversing the graph in reverse, finding the Antibody entities whose Chains contain Chain A, we find the right one: Ab1. Invalidation nulls the previous Molecular Weight, sets its status to queued, and queues it for computation.
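
A sketch of that reverse traversal (helpers like entities_linking_via stand in for the real graph queries and are hypothetical):

    def find_dependent_computed_values(session, changed_field_value):
        """Walk the entity graph in reverse, using each computed field
        dependency's schema path as a template, to find affected values."""
        changed_field = changed_field_value.field_definition.name
        dependents = []
        for definition in definitions_depending_on(session, changed_field):
            for param_name, path in definition.dependencies.items():
                if path[-1] != changed_field:
                    continue
                # Follow the path in reverse. For ["Chains", "Amino Acid
                # Sequence"], find the Antibody entities whose Chains contain
                # the changed entity: Chain A -> Ab1.
                entities = {changed_field_value.entity}
                for link_name in reversed(path[:-1]):
                    entities = {
                        linker
                        for entity in entities
                        for linker in entities_linking_via(session, link_name, entity)
                    }
                for entity in entities:
                    dependents.append(computed_value_on(session, entity, definition))
        return dependents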

Computation

At some point a Celery worker receives a Celery task to compute a field value. We chose to compute asynchronously in case computation takes too long or runs into a fatal error. Computation looks up the function corresponding to the stored function name, gathers the dependency field values, and runs the function with those values keyed by parameter name as input.

To gather the dependencies, we do the opposite graph traversal from invalidation. We have each computed field dependency and just need to traverse the entity graph, using its schema graph as a template, out to the leaf entities. In the antibody example, by finding the Chain entities that Ab1 links to, then finding their Amino Acid Sequence field values, we gather the amino acid sequences of the right ones: Chain A and Chain B. Those are passed to molecular_weight, and finally the computation stores the result in the Molecular Weight field value with status succeeded.
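
A sketch of that forward traversal, mirroring the reverse one above (helpers hypothetical, as before):

    def gather_along_path(session, entity, path):
        """Follow one dependency path out from an entity to the leaf field
        values. For ["Chains", "Amino Acid Sequence"], this walks
        Ab1 -> {Chain A, Chain B} and returns their sequences."""
        entities = {entity}
        for link_name in path[:-1]:
            entities = {
                linked
                for e in entities
                for linked in linked_entities(session, e, link_name)
            }
        return [field_value_of(session, e, path[-1]) for e in entities]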

If the computation comes across expected errors, like “Chain has empty amino acid sequence,” those are recorded in the field value’s computation status. However, it can also run into unexpected errors, like infinite loops. Since Celery tasks support a timeout, we can catch runaway computations and set the computation status to failed with “Unexpected error.” We can later investigate and teach the function to expect the previously unexpected error, for example by detecting input sizes it cannot handle.
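
Putting computation and error handling together, the Celery task might look roughly like this (soft_time_limit and SoftTimeLimitExceeded are real Celery features; ComputationError and the session plumbing are assumptions):

    from celery import shared_task
    from celery.exceptions import SoftTimeLimitExceeded

    @shared_task(soft_time_limit=300)  # cap runaway computations at 5 minutes
    def compute_field_value(computed_field_value_id):
        """Look up the stored function, gather its dependencies, run it, and
        record the result or failure on the computed field value."""
        cfv = session.get(ComputedFieldValue, computed_field_value_id)
        fn = COMPUTED_FIELD_FUNCTIONS[cfv.definition.function_name]
        try:
            kwargs = {
                param: gather_along_path(session, cfv.entity, path)
                for param, path in cfv.definition.dependencies.items()
            }
            cfv.value = fn(**kwargs)
            cfv.status = ComputationStatus.SUCCEEDED
        except ComputationError as exc:  # expected, e.g. empty amino acid sequence
            cfv.status = ComputationStatus.FAILED
            cfv.error_message = str(exc)
        except SoftTimeLimitExceeded:    # unexpected, e.g. an infinite loop
            cfv.status = ComputationStatus.FAILED
            cfv.error_message = "Unexpected error"
        session.commit()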

This is how computed fields appear to users in the Web app:

E Coli 2 with All Resistances field computed from it and its cell line ancestors
Bioregistry search to find Cell Line entities that are resistant to antibiotic Ampicillin
Ab1 with Molecular Weight field computed from its protein chains

Other considerations

  • Recursive functions: To make sure computing a Cell Line’s All Resistances field doesn’t require traversing its potentially hundreds of ancestors, we define it to use just its direct parent’s All Resistances field (see the sketch after this list). To guarantee it’s never stale, when we invalidate a computed field value, we also invalidate all dependent computed field values. The intuition is that an old ancestor’s fields rarely change, so it’s OK to traverse its many descendants when they do, but not to traverse every ancestor on each computation.
  • Race conditions: One transaction invalidating a field value can overlap with another transaction computing it. We acquire read locks on the dependencies so that these transactions are serialized.
  • Just-in-time computation: If there’s a large queue of computations, the user can be left viewing an entity whose computed field values seemingly never finish computing. In this case we fire off high-priority computations for the currently open entity and have the Web app automatically reload the values when they finish. This is equivalent to the compute-on-load strategy.
  • Legacy data: Users asked if they could set computed field values themselves for legacy data. However, this breaks the idempotent nature of computed fields, which is useful because it enables freely retrying computations. Our solution is to define a separate Legacy Molecular Weight field and adjust the Molecular Weight function to return Legacy Molecular Weight if defined, or molecular_weight(amino_acid_sequences) otherwise (also sketched below).
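
A sketch of both function shapes, reusing the registry from the earlier sketch (the functions and their parameter passing are illustrative, not Benchling’s actual code):

    @computed_field_function
    def all_resistances(new_resistances, parent_all_resistances):
        """Recursive aggregation: depend on the direct parent's already
        computed All Resistances (None for a root cell line) instead of
        walking the entire lineage on every computation."""
        inherited = parent_all_resistances or []
        return sorted(set(new_resistances) | set(inherited))

    @computed_field_function
    def molecular_weight_with_legacy(legacy_molecular_weight, amino_acid_sequences):
        """Legacy fallback: prefer a user-set Legacy Molecular Weight field,
        keeping the computation itself idempotent and freely retryable."""
        if legacy_molecular_weight is not None:
            return legacy_molecular_weight
        return molecular_weight(amino_acid_sequences)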

Future work

User configuration

In the future we want to enable users to define their computed field functions. Currently they must be preexisting Python functions in Benchling. There are several options to accomplish this, but each has its pros and cons.

Users are most familiar with Excel-like formulas, which they already define on spreadsheets in lab notebook entries, but supporting them requires adding many custom functions like LOOKUP_FIELD_VALUE, MOLECULAR_WEIGHT, etc. and adding support for recursive functions, like Cell Line’s All Resistances computed field.

Alternatively, we can support JavaScript or Python snippets that access the entity’s fields as if it were an object. While this is the most extensible solution, there are security implications for executing user-inputted code.

We’re still investigating the right solution for user configuration. On the bright side, the database models we already defined are reusable: while the computed field function will be replaced with the user-configured function, the dependencies still apply. This means that we don’t have to rewrite the invalidation and computation modules.

More integrations

Before computed fields, Benchling was a platform to design experiments in silico, run them, and capture their biological data in one place. But if users wanted to run further analysis on that data, they had to export it or query the warehouse. Computed fields empower them to interact with their data right inside Benchling, because they can define dynamically computed state as functions over that data. Computed fields are a user-friendly way to have these computations done for you, without the need for ETL processes or other software.

As written, computed fields apply to entities in the Bioregistry. However, Benchling supports storing a much broader range of biological data, including sample tracking (physical plates, tubes, vials, etc.), user-inputted experimental results from a workflow, and machine-generated results. We look forward to integrating computed fields in the Bioregistry with the other sample tracking and results management modules, so that users can quickly analyze all sources of data and make the best decision.

Finally, if you’re interested in helping solve these sorts of problems, we’re hiring full-stack engineers to join the team!
