################### Creating Projects ################### SSSOM Curator supports creating a project with ``sssom_curator init``. ****************** Target Directory ****************** SSSOM Curator will create a project in the working directory, or, in a target directory by providing a name, e.g., ``sssom_curator init -d foo``. If there's already a project in the target directory, e.g., if there's already a ``positives.sssom.tsv`` file, SSSOM Curator will exit with an error. .. code-block:: console $ sssom_curator init -d example-repo initialized SSSOM project `example-repo` at `/path/to/example-repo` ********** Contents ********** The project includes a configuration file ``sssom-curator.json``, a script (``main.py``), a readme, a license (CC0 by default), and SSSOM data files. .. code-block:: console $ cd example-repo $ tree example-repo ├── LICENSE ├── README.md ├── main.py ├── data │ ├── negative.sssom.tsv │ ├── positive.sssom.tsv │ ├── predictions.sssom.tsv │ └── unsure.sssom.tsv └── sssom-curator.json The ``sssom-curator.json`` file contains metadata described by the :class:`sssom_curator.Repository` class. .. code-block:: json { "predictions_path": "source/predictions.sssom.tsv", "positives_path": "source/positive.sssom.tsv", "negatives_path": "source/negative.sssom.tsv", "unsure_path": "source/unsure.sssom.tsv", "mapping_set": { "mapping_set_id": "https://example.org/test.tsv", "mapping_set_confidence": null, "mapping_set_description": null, "mapping_set_source": null, "mapping_set_title": "Test", "mapping_set_version": null, "see_also": null, "comment": null, "license": "spdx:CC0-1.0", "creator_id": null }, "purl_base": "https://example.org/", } The ``main.py`` contains boilerplate for loading the configuration JSON and running a CLI via :meth:`sssom_curator.Repository.run_cli`. It contains `PEP 723 `_ compliant inline metadata and an appropriate shebang so it can be run like: 1. Via uv with ``uv run main.py`` 2. As a script with ``./main.py`` 3. As a plain Python module with ``python main.py`` (requires manual environment construction, not recommended) *************************** Usage with Git and GitHub *************************** Based on the `Open Data, Open Code, Open Infrastructure (O3) `_ guidelines, we suggest using git as a version control system in combination with GitHub as a web interface with the following steps: 1. `Create an account `_ on GitHub and sign in 2. `Create a repository `_ on GitHub 3. `Clone the repository `_ to your local system If your repository is called ``owner/repository`` and you're using the console, then you can run the following commands to clone the repository locally, ``cd`` into it, initialize it, then commit/push it. .. code-block:: console $ git clone https://github.com/owner/repository.git $ cd repository $ sssom_curator init $ git add --all $ git commit -m "initialized SSSOM project" $ git push ******************** Making Predictions ******************** After initialization, you can generate predicted semantic mappings using the ``predict`` command in the CLI, e.g., between Medical Subject Headings (MeSH) and the Medical Actions Ontology (MaxO) with: .. code-block:: console $ uv run main.py predict lexical mesh maxo .. note:: This is a nested command to make it possible to register additional commands in the future Making New Resources Available ============================== This workflow accepts two _prefixes_ for resources corresponding to records in `the Bioregistry `_ (:mod:`bioregistry`) as a standard. Note that despite its name, the Bioregistry (despite the "bio-" name) is domain-agnostic and contains prefixes for ontologies, controlled vocabularies, databases, and other resources that mint identifiers in other domains such as engineering, cultural heritage, digital humanities, and more. Bioregistry records that contain links to OWL, OBO, or SKOS ontologies can be readily used in the SSSOM-Curator workflow. If the Bioregistry contains such an ontology link, then the workflow uses :mod:`pyobo` to parse them. Otherwise, it looks in :mod:`pyobo.sources` for a custom import module. If you want to use this interface to predict mappings to/from a resource that is not available in the Bioregistry, consider submitting a `new prefix request `_ on the Bioregistry's issue tracker. If the resource you want to use already has a Bioregistry record, but does not have an ontology artifact, then request a `new source module `_ on the PyOBO issue tracker or submit a pull request implementing one. Creating Custom Mapping Generators ================================== Any custom workflows that produce predicted mappings can be added to the project via :meth:`sssom_curator.Repository.append_predicted_mappings` like in the following, which exploits the structure of names of human proteins in MeSH to map them back to HGNC, then UniProt: .. code-block:: python import re import pyobo from curies import NamedReference from curies.vocabulary import exact_match, lexical_matching_process from pyobo.struct import has_gene_product from sssom_curator import Repository from sssom_pydantic import MappingTool, SemanticMapping from main import repository MESH_PROTEIN_RE = re.compile(r"^(.+) protein, human$") MAPPING_TOOL = MappingTool(name="mesh-uniprot-mapper") grounder = pyobo.get_grounder("hgnc") hgnc_id_to_uniprot_id = pyobo.get_relation_mapping( "hgnc", relation=has_gene_product, target_prefix="uniprot" ) mappings = [] for mesh_id, mesh_name in pyobo.get_id_name_mapping("mesh").items(): match = MESH_PROTEIN_RE.match(mesh_name) if not match: continue gene_name = match.groups()[0] for gene_reference in grounder.get_matches(gene_name): uniprot_id = hgnc_id_to_uniprot_id.get(gene_reference.identifier) if not uniprot_id or "," in uniprot_id: continue mappings.append( SemanticMapping( subject=NamedReference( prefix="mesh", identifier=mesh_id, name=mesh_name ), predicate=exact_match.curie, object=NamedReference( prefix="uniprot", identifier=uniprot_id, name=gene_reference.name ), justification=lexical_matching_process, confidence=gene_reference.score, mapping_tool=mapping_tool, ) ) repository.append_predicted_mappings(mappings) For example, you might want to implement a graph machine learning-based method for predicting mappings or implement a wrapper around some of the tricky existing mapping tools (like LogMap). ******************** Importing Mappings ******************** As an alternative to predicting mappings directly, SSSOM Curator exposes ways of importing mappings from other sources. OntoPortal ========== `OntoPortal `_ is a generic web-based ontology catalog. It predicts mappings between its indexed ontologies through an ensemble of methods such lexical matches via `LOOM `_ and inferred mappings via the UMLS. It stores these mappings in a custom format which is missing many key metadata (e.g., predicate, mapping justification), making them a good target for processing and then curation in SSSOM. SSSOM Curator implements a workflow for consuming mappings from an OntoPortal instance's API: .. code-block:: console $ uv run main.py import ontoportal snomed aero By default, this command uses `BioPortal `_, the flagship instance of OntoPortal which covers biological and biomedical ontologies. Other portals can be selected with the ``--instance`` flag. See `this blog post `_ for more information on how processing is done to produce the SSSOM for curation. .. note:: This command accepts Bioregistry prefixes, which are internally mapped to the appropriate OntoPortal instance's prefixes. SeMRA ===== The `SeMRA Raw Mappings Database `_ can be imported and filtered to mappings that haven't already been curated with high precision. You need to specify two or more prefixes using the ``-p`` flag. .. code-block:: console $ uv run main.py import semra -p mesh -p hgnc Note, this takes about five minutes to download and twenty minutes to process due to the size of the SeMRA Raw Mappings Database. ********** Curation ********** Finally, after making predictions, a local, web-based curation application can be run with the following command. It has integrations with ``git`` to manage making commits and pushes during curation. .. code-block:: console $ uv run main.py web ********************* Project Maintenance ********************* Format/lint the mappings with: .. code-block:: console $ uv run main.py lint Test the integrity of mappings with: .. code-block:: console $ uv run main.py test This can easily be incorporated in a GitHub Actions workflow like in the following: .. code-block:: yaml name: Tests on: push: branches: [ main ] pull_request: branches: [ main ] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: astral-sh/setup-uv@v3 - name: Test SSSOM integrity run: uv run main.py test