Creating Projects

SSSOM Curator supports creating a project with sssom_curator init.

Target Directory

SSSOM Curator will create a project in the working directory, or, in a target directory by providing a name, e.g., sssom_curator init -d foo. If there’s already a project in the target directory, e.g., if there’s already a positives.sssom.tsv file, SSSOM Curator will exit with an error.

$ sssom_curator init -d example-repo
initialized SSSOM project `example-repo` at `/path/to/example-repo`

Contents

The project includes a configuration file sssom-curator.json, a script (main.py), a readme, a license (CC0 by default), and SSSOM data files.

$ cd example-repo
$ tree example-repo
├── LICENSE
├── README.md
├── main.py
├── data
│   ├── negative.sssom.tsv
│   ├── positive.sssom.tsv
│   ├── predictions.sssom.tsv
│   └── unsure.sssom.tsv
└── sssom-curator.json

The sssom-curator.json file contains metadata described by the sssom_curator.Repository class.

{
   "predictions_path": "source/predictions.sssom.tsv",
   "positives_path": "source/positive.sssom.tsv",
   "negatives_path": "source/negative.sssom.tsv",
   "unsure_path": "source/unsure.sssom.tsv",
   "mapping_set": {
     "mapping_set_id": "https://example.org/test.tsv",
     "mapping_set_confidence": null,
     "mapping_set_description": null,
     "mapping_set_source": null,
     "mapping_set_title": "Test",
     "mapping_set_version": null,
     "see_also": null,
     "comment": null,
     "license": "spdx:CC0-1.0",
     "creator_id": null
   },
   "purl_base": "https://example.org/",
}

The main.py contains boilerplate for loading the configuration JSON and running a CLI via sssom_curator.Repository.run_cli(). It contains PEP 723 compliant inline metadata and an appropriate shebang so it can be run like:

Via uv with uv run main.py
As a script with ./main.py
As a plain Python module with python main.py (requires manual environment construction, not recommended)

Usage with Git and GitHub

Based on the Open Data, Open Code, Open Infrastructure (O3) guidelines, we suggest using git as a version control system in combination with GitHub as a web interface with the following steps:

Create an account on GitHub and sign in
Create a repository on GitHub
Clone the repository to your local system

If your repository is called owner/repository and you’re using the console, then you can run the following commands to clone the repository locally, cd into it, initialize it, then commit/push it.

$ git clone https://github.com/owner/repository.git
$ cd repository
$ sssom_curator init
$ git add --all
$ git commit -m "initialized SSSOM project"
$ git push

Making Predictions

After initialization, you can generate predicted semantic mappings using the predict command in the CLI, e.g., between Medical Subject Headings (MeSH) and the Medical Actions Ontology (MaxO) with:

$ uv run main.py predict lexical mesh maxo

Note

This is a nested command to make it possible to register additional commands in the future

Making New Resources Available

This workflow accepts two _prefixes_ for resources corresponding to records in the Bioregistry (bioregistry) as a standard. Note that despite its name, the Bioregistry (despite the “bio-” name) is domain-agnostic and contains prefixes for ontologies, controlled vocabularies, databases, and other resources that mint identifiers in other domains such as engineering, cultural heritage, digital humanities, and more. Bioregistry records that contain links to OWL, OBO, or SKOS ontologies can be readily used in the SSSOM-Curator workflow. If the Bioregistry contains such an ontology link, then the workflow uses pyobo to parse them. Otherwise, it looks in pyobo.sources for a custom import module.

If you want to use this interface to predict mappings to/from a resource that is not available in the Bioregistry, consider submitting a new prefix request on the Bioregistry’s issue tracker. If the resource you want to use already has a Bioregistry record, but does not have an ontology artifact, then request a new source module on the PyOBO issue tracker or submit a pull request implementing one.

Creating Custom Mapping Generators

Any custom workflows that produce predicted mappings can be added to the project via sssom_curator.Repository.append_predicted_mappings() like in the following, which exploits the structure of names of human proteins in MeSH to map them back to HGNC, then UniProt:

import re

import pyobo
from curies import NamedReference
from curies.vocabulary import exact_match, lexical_matching_process
from pyobo.struct import has_gene_product
from sssom_curator import Repository
from sssom_pydantic import MappingTool, SemanticMapping

from main import repository

MESH_PROTEIN_RE = re.compile(r"^(.+) protein, human$")
MAPPING_TOOL = MappingTool(name="mesh-uniprot-mapper")

grounder = pyobo.get_grounder("hgnc")
hgnc_id_to_uniprot_id = pyobo.get_relation_mapping(
    "hgnc", relation=has_gene_product, target_prefix="uniprot"
)
mappings = []
for mesh_id, mesh_name in pyobo.get_id_name_mapping("mesh").items():
    match = MESH_PROTEIN_RE.match(mesh_name)
    if not match:
        continue
    gene_name = match.groups()[0]
    for gene_reference in grounder.get_matches(gene_name):
        uniprot_id = hgnc_id_to_uniprot_id.get(gene_reference.identifier)
        if not uniprot_id or "," in uniprot_id:
            continue
        mappings.append(
            SemanticMapping(
                subject=NamedReference(
                    prefix="mesh", identifier=mesh_id, name=mesh_name
                ),
                predicate=exact_match.curie,
                object=NamedReference(
                    prefix="uniprot", identifier=uniprot_id, name=gene_reference.name
                ),
                justification=lexical_matching_process,
                confidence=gene_reference.score,
                mapping_tool=mapping_tool,
            )
        )
repository.append_predicted_mappings(mappings)

For example, you might want to implement a graph machine learning-based method for predicting mappings or implement a wrapper around some of the tricky existing mapping tools (like LogMap).

Importing Mappings

As an alternative to predicting mappings directly, SSSOM Curator exposes ways of importing mappings from other sources.

OntoPortal

OntoPortal is a generic web-based ontology catalog. It predicts mappings between its indexed ontologies through an ensemble of methods such lexical matches via LOOM and inferred mappings via the UMLS. It stores these mappings in a custom format which is missing many key metadata (e.g., predicate, mapping justification), making them a good target for processing and then curation in SSSOM.

SSSOM Curator implements a workflow for consuming mappings from an OntoPortal instance’s API:

$ uv run main.py import ontoportal snomed aero

By default, this command uses BioPortal, the flagship instance of OntoPortal which covers biological and biomedical ontologies. Other portals can be selected with the --instance flag.

See this blog post for more information on how processing is done to produce the SSSOM for curation.

Note

This command accepts Bioregistry prefixes, which are internally mapped to the appropriate OntoPortal instance’s prefixes.

SeMRA

The SeMRA Raw Mappings Database can be imported and filtered to mappings that haven’t already been curated with high precision. You need to specify two or more prefixes using the -p flag.

$ uv run main.py import semra -p mesh -p hgnc

Note, this takes about five minutes to download and twenty minutes to process due to the size of the SeMRA Raw Mappings Database.

Curation

Finally, after making predictions, a local, web-based curation application can be run with the following command. It has integrations with git to manage making commits and pushes during curation.

$ uv run main.py web

Project Maintenance

Format/lint the mappings with:

$ uv run main.py lint

Test the integrity of mappings with:

$ uv run main.py test

This can easily be incorporated in a GitHub Actions workflow like in the following:

name: Tests
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v3
      - name: Test SSSOM integrity
        run: uv run main.py test