31 January 2008

The Nouns of Names on NEXUS

Programming is a mystery to most folks. They see a bunch of overpunctuated gobbledygook with words strewn about here and there and it's completely opaque. They know that it somehow translates into the functionality of the applications, games, websites, etc. that they use. But they have no inroads to understanding how on Earth that works.

I will now attempt a (very) partial explanation for the phylogenetics-literate crowd.

One thing people don't understand is that object-oriented computer languages (which is what I primarily use) are actually designed to be compatible with how humans think. Or at least, they're a sort of compromise between how computers think and how humans think. Natural languages, of course, are totally biased toward how humans think, while machine codes (and their slightly dressed-up cousins, assembly languages) are totally biased toward how computers think. (There are also procedural languages, which are slightly more computer-biased than object-oriented languages.)

Like natural languages, object-oriented languages have nouns, except they're called objects. They also have verbs, except they're called methods. Methods are usually (but not always) attached to objects. Objects can have attributes which are themselves other objects—these are called fields, and they can work a bit like adjectives (although that's not a perfect analogy).
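To make the grammar analogy concrete, here is a minimal sketch in Python (used only for illustration). The class name `ScientificName`, its two fields, and its `render` method are all made up for this example; they are not taken from the actual project.

```python
class ScientificName:
    """A noun: the template for scientific-name objects."""

    def __init__(self, orthography: str, italicized: bool):
        # Fields: attributes of the object, a bit like adjectives.
        self.orthography = orthography
        self.italicized = italicized

    def render(self) -> str:
        """A verb: something the object can do."""
        if self.italicized:
            return f"*{self.orthography}*"
        return self.orthography


# Creating an object from the class and calling its method.
name = ScientificName("Tyrannosaurus", italicized=True)
print(name.render())
```

Here `name` is the object, `orthography` and `italicized` are its fields, and `render` is a method attached to it.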

One of the first things I do as a programmer when approaching a new project is to figure out what the project's nouns are. These become the basis for classes, which are the templates from which objects (and their methods and fields) are created.

So let's use Names on NEXUS as an example. This is my project, hinted at in my paper, to relate the data in NEXUS files (Maddison et al. 1997) to definitions of names as governed by the PhyloCode. So my first step is to come up with lists of nouns (i.e., class candidates) for each side of the equation:

PhyloCode (nomenclature): scientific name (or nomen), uninomen, binomen, prenomen, genus name, clade name, phylonym, definition
PhyloCode (specification): specifier, species, specimen, specimen collection, specimen accession, apomorphy, definition
NEXUS: NEXUS file, tree, tree element, tree node, tree terminus, character state
Shared: phylogeny, citation, piece of literature, calendar date, URI


The goal of this project is to translate a PhyloCode definition (associated with a phylonym) into a list of NEXUS taxa (i.e., operational taxonomic units) using a NEXUS tree. For that to happen, there need to be some additional nouns that help relate NEXUS entities to PhyloCode entities:

Names on NEXUS: character state specifier, taxon specifier, character state link, taxon link


The next step is to figure out how these nouns—these classes—relate to each other. Typically, this involves statements of the form "X is a Y" (which has to do with class hierarchy) and the forms "X has a Z", "X has one or more Zs", "X has zero or more Zs", etc. (which have to do with fields). I'll also translate these nouns into capitalized "camel-humped" format, the standard format for class names in the languages I use. Lower-case "camel-humped" nouns are of primitive types (numbers, strings, Booleans) which I don't need to make a class for.

Literature
  • A LiteraturePiece has a CalendarDate, one or more authorNames, and zero or more URIs.
  • A Citation has a LiteraturePiece and zero or more authorNames.
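As a rough illustration of how "has a" statements turn into fields, the two Literature statements could be sketched in Python like this. The snake_case field names, the use of `datetime.date` for CalendarDate, plain strings for URIs, and the example values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class LiteraturePiece:
    calendar_date: date                             # "has a CalendarDate"
    author_names: List[str]                         # "one or more authorNames"
    uris: List[str] = field(default_factory=list)   # "zero or more URIs"

    def __post_init__(self):
        # "One or more" means an empty list is not allowed.
        if not self.author_names:
            raise ValueError("a LiteraturePiece needs at least one author name")


@dataclass
class Citation:
    literature_piece: LiteraturePiece                       # "has a LiteraturePiece"
    author_names: List[str] = field(default_factory=list)   # "zero or more authorNames"


# Made-up values, purely for illustration.
piece = LiteraturePiece(date(2000, 1, 1), ["A. Author"])
citation = Citation(piece)
```

"One or more" becomes a required list that is checked for emptiness, while "zero or more" becomes a list that defaults to empty.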


PhyloCode: Nomenclature
  • A Nomen has a Citation, an orthography, and zero or more URIs.
  • A Uninomen is a Nomen.
  • A Binomen is a Nomen and a Phylonym, and has a Prenomen and a Uninomen.
  • A GenusName is a Uninomen and a Prenomen.
  • A CladeName is a Uninomen, a Phylonym, and a Prenomen.
  • A PhyloDefinition has a Citation, a Phylonym, one or more Specifiers, a prose statement, and a mathML statement (see my paper for details on the last one).
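The "is a" statements above describe a class hierarchy in which some classes have more than one parent. A minimal Python sketch follows, under several assumptions: Phylonym and Prenomen are treated as empty marker classes (the list gives them no fields of their own), `Citation` is a bare stand-in, and a Binomen's orthography is assumed to be its two parts joined by a space.

```python
class Citation:
    """Stand-in for the real Citation class; it only holds some text."""

    def __init__(self, text: str):
        self.text = text


class Nomen:
    def __init__(self, orthography: str, citation: Citation, uris=()):
        self.orthography = orthography
        self.citation = citation
        self.uris = list(uris)


class Phylonym:
    """Marker class (assumption): a name that can carry a definition."""


class Prenomen:
    """Marker class (assumption): a name usable as the first part of a binomen."""


class Uninomen(Nomen):
    pass


class GenusName(Uninomen, Prenomen):
    pass


class CladeName(Uninomen, Phylonym, Prenomen):
    pass


class Binomen(Nomen, Phylonym):
    def __init__(self, prenomen, uninomen: Uninomen, citation: Citation, uris=()):
        # Assumption: the spelling is the two parts joined by a space, and
        # every Prenomen passed in is also a Nomen with an orthography.
        spelling = f"{prenomen.orthography} {uninomen.orthography}"
        super().__init__(spelling, citation, uris)
        self.prenomen = prenomen
        self.uninomen = uninomen


source = Citation("hypothetical source")
genus = GenusName("Tyrannosaurus", source)
species_name = Binomen(genus, Uninomen("rex", source), source)
print(species_name.orthography)
```

Each "X is a Y" statement becomes a parent class, so `isinstance(genus, Prenomen)` and `isinstance(species_name, Phylonym)` both hold, while the "has a" parts of the Binomen statement become ordinary fields.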


PhyloCode: Specification
  • A Specifier has zero or more URIs.
  • An Apomorphy is a CharStateSpecifier, and has a description and a Citation.
  • A Specimen is a TaxonSpecifier, and has one or more SpecimenAccessions.
  • A SpecimenAccession has a code and a SpecimenCollection.
  • A SpecimenCollection has a code, a name, and zero or more URIs.
  • A Species is a TaxonSpecifier, and has one or more Binomens (binomina) and one or more Specimens (name-bearing types).
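The specimen side of these statements can be sketched the same way. This is only a partial illustration in Python: it covers Specifier, TaxonSpecifier (introduced in the Names on NEXUS list further down), Specimen, SpecimenAccession, and SpecimenCollection, and leaves out Species and Apomorphy, which would follow the same pattern. The field names are my own, and the example values are for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Specifier:
    uris: List[str] = field(default_factory=list)   # "zero or more URIs"


@dataclass
class TaxonSpecifier(Specifier):
    pass


@dataclass
class SpecimenCollection:
    code: str
    name: str
    uris: List[str] = field(default_factory=list)


@dataclass
class SpecimenAccession:
    code: str
    collection: SpecimenCollection


@dataclass
class Specimen(TaxonSpecifier):
    accessions: List[SpecimenAccession] = field(default_factory=list)

    def __post_init__(self):
        # "One or more SpecimenAccessions": reject an empty list.
        if not self.accessions:
            raise ValueError("a Specimen needs at least one SpecimenAccession")


collection = SpecimenCollection("FMNH", "Field Museum of Natural History")
specimen = Specimen(accessions=[SpecimenAccession("PR 2081", collection)])
```

Because Specimen inherits from TaxonSpecifier, which inherits from Specifier, the specimen gets its `uris` field without declaring it again.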


NEXUS
  • A NexusFile has textData, zero or one Citations, zero or more URIs, a numTaxa amount, a numChars amount, two or more CharStates, zero or more Trees, zero or more CharStateLinks, and zero or more TaxonLinks.
  • A CharState has a character index and a character scoring.
  • A Tree has a TreeNode.
  • A TreeNode is a TreeElement and has two or more TreeElements.
  • A TreeTerminus is a TreeElement and has a taxonIndex.
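The tree statements describe a recursive structure: a node contains elements, and each element is either another node or a terminus. Here is a minimal Python sketch. The `children` and `root` field names and the `taxon_indices` helper are my own additions for illustration, not part of the list above.

```python
from dataclasses import dataclass
from typing import List, Set


class TreeElement:
    """Anything that can sit in a tree: a node or a terminus."""


@dataclass
class TreeTerminus(TreeElement):
    taxon_index: int


@dataclass
class TreeNode(TreeElement):
    children: List[TreeElement]

    def __post_init__(self):
        # "Has two or more TreeElements."
        if len(self.children) < 2:
            raise ValueError("a TreeNode needs at least two TreeElements")


@dataclass
class Tree:
    root: TreeNode   # "A Tree has a TreeNode."


def taxon_indices(element: TreeElement) -> Set[int]:
    """Collect every taxon index at or below the given element."""
    if isinstance(element, TreeTerminus):
        return {element.taxon_index}
    if isinstance(element, TreeNode):
        indices: Set[int] = set()
        for child in element.children:
            indices |= taxon_indices(child)
        return indices
    raise TypeError(f"unexpected tree element: {element!r}")


# The tree ((1,2),3), written out as objects.
tree = Tree(TreeNode([TreeNode([TreeTerminus(1), TreeTerminus(2)]), TreeTerminus(3)]))
print(taxon_indices(tree.root))
```

Splitting TreeElement into TreeNode and TreeTerminus lets one recursive function walk the whole tree by asking each element which kind it is.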


Names on NEXUS
  • A CharStateSpecifier is a Specifier.
  • A TaxonSpecifier is a Specifier.
  • A CharStateLink has a CharState and a CharStateSpecifier.
  • A TaxonLink has a taxonIndex and a TaxonSpecifier.


Now I can describe the core functionality of Names on NEXUS. Taking a NexusFile, the user selects one of its Trees. Next, the application finds all PhyloDefinitions whose Specifiers are each referred to by one of the NexusFile's CharStateLinks or TaxonLinks. Using the Tree and each PhyloDefinition's mathML statement, it correlates the PhyloDefinition's Phylonym to a set of taxon indices in the NexusFile.
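The flow described above can be sketched in a heavily simplified form. The Python below does not evaluate mathML statements. As a stand-in, it treats every definition as "the smallest clade containing all of its specifiers", which is only one kind of definition. Specifiers are reduced to plain strings, TaxonLinks to a dictionary from specifier to taxon index, and all class names, function names, and example data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Terminus:
    taxon_index: int


@dataclass(frozen=True)
class Node:
    children: Tuple   # two or more Terminus or Node objects


def taxon_indices(element) -> Set[int]:
    """Collect every taxon index at or below the given element."""
    if isinstance(element, Terminus):
        return {element.taxon_index}
    indices: Set[int] = set()
    for child in element.children:
        indices |= taxon_indices(child)
    return indices


def smallest_clade(element, targets: Set[int]) -> Optional[Set[int]]:
    """Taxon indices of the smallest clade holding all targets, or None."""
    indices = taxon_indices(element)
    if not targets <= indices:
        return None
    if isinstance(element, Node):
        for child in element.children:
            found = smallest_clade(child, targets)
            if found is not None:
                return found   # a smaller clade already contains everything
    return indices


@dataclass
class Definition:
    phylonym: str
    specifiers: List[str]   # stand-ins for Specifier objects


def applicable(definitions: List[Definition], links: Dict[str, int]) -> List[Definition]:
    """Keep the definitions whose specifiers are all linked to taxa in the file."""
    return [d for d in definitions if all(s in links for s in d.specifiers)]


def resolve(definition: Definition, links: Dict[str, int], root) -> Optional[Set[int]]:
    """Correlate a definition's phylonym with a set of taxon indices."""
    targets = {links[s] for s in definition.specifiers}
    return smallest_clade(root, targets)


# The tree ((1,2),3) and three made-up definitions.
root = Node((Node((Terminus(1), Terminus(2))), Terminus(3)))
links = {"specifier A": 1, "specifier B": 2, "specifier C": 3}
definitions = [
    Definition("Clade X", ["specifier A", "specifier B"]),
    Definition("Clade Y", ["specifier A", "specifier C"]),
    Definition("Clade Z", ["specifier A", "specifier D"]),   # D is not linked
]
results = {d.phylonym: resolve(d, links, root) for d in applicable(definitions, links)}
print(results)
```

"Clade Z" drops out in the first step because one of its specifiers has no link in the file, and the other two names end up correlated with sets of taxon indices from the selected tree.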

Of course, this is not all the application will do. (In fact, I've been done with that part of the programming for a while now.) There will also need to be a lot of programming for saving these data permanently in a database, presenting the data to the user, and making it easier for the user to enter data (for example, by creating methods for coming up with specifier suggestions based on definition statements). This may take a while....

2 comments:

  1. A quick course on software engineering :D!

  2. Yeah, if you already understand another arcane discipline. ;P

    Oh well, maybe I'll do a more generally-appealing version later.
