Tuesday, June 28, 2011


The nescent.phylogeoref.reader package is now almost complete. Here is an overview of what's in it.

Basically there are two kinds of readers in here: "TreeReader" and "MetadataReader", for reading trees and metadata respectively.

Here is the class hierarchy:


  1. MultiFormatReader
  2. UniversalTreeReader
  3. NeXMLReader

  1. CSVMetadataReader
  2. TextMetadataReader
  3. UniversalMetadataReader

I have written a tutorial on using the GrandUnifiedReader on the github wiki.
Understanding the PhylogenyKitchen class may be a bit tricky (I myself would have to go through it again to explain it to you), but using it is very easy. I have also included enough comments to make it clear.


Tuesday, June 21, 2011

A Schematic Overview

I have prepared a schematic overview of the work in a simple, easy-to-understand format.

Here is a link to it.

Schematic Overview

This should make most of it clear. If you have further questions, contact me!

Sunday, June 19, 2011

A Change in Plan

So there is a change in plan here.

What I had thought earlier:
Earlier I thought that the nwk and csv files had to be completely replaced by a single NeXML file. While this may be the ultimate goal, it is not possible at present. The reasons are as follows:

1) There are no examples in TreeBase of NeXML with location metadata attached to nodes.
2) There is no preferred convention yet for how coordinate metadata is likely to be attached to nodes in future.

So here is what the current plan is.

  1. INPUT 1: A nexml file with whatever metadata it has to offer.
  2. INPUT 2: A second file with additional metadata that you might want to attach to the nodes.
  3. Having taken both these files, construct a crude Phylogeny tree from the nexml file. By crude I mean that this phylogeny is incomplete: it has the basic tree structure but may not have all the essential metadata.
  4. Now extract the additional metadata from the second file and embed it in the phylogeny.
  5. You now get a fully fledged Phylogeny with all the metadata.
In the earlier plan I was trying to stuff everything into the nexml first and then use this nexml to prepare the phylogeny tree. This had the overhead of first creating a modified nexml file with all the metadata and then preparing the phylogeny from it.
I will write the code so that the user can choose whether or not to supply this second input file.

Preparing a fully fledged phylogenetic tree from NeXML alone would be utopia. What we do at present is prepare a partial phylogenetic tree, which I call a crude phylogenetic tree, and then patch it with metadata from the other file.
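The patching step can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: SketchNode and MetadataPatcher are hypothetical stand-ins (not the forester or phylogeoref classes), and the second file is assumed to be a CSV with a header line and rows of the form label,latitude,longitude.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a phylogeny node; NOT the forester PhylogenyNode class.
class SketchNode {
    final String label; // null for unlabelled internal nodes
    final List<SketchNode> children = new ArrayList<>();
    final Map<String, String> metadata = new HashMap<>();

    SketchNode(String label) {
        this.label = label;
    }
}

class MetadataPatcher {
    // Reads rows of the assumed form "label,latitude,longitude"; the first line is a header.
    static Map<String, Map<String, String>> parseCsv(List<String> lines) {
        Map<String, Map<String, String>> byLabel = new HashMap<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] cols = lines.get(i).split(",");
            if (cols.length < 3) {
                continue; // skip malformed rows
            }
            Map<String, String> row = new HashMap<>();
            row.put("latitude", cols[1].trim());
            row.put("longitude", cols[2].trim());
            byLabel.put(cols[0].trim(), row);
        }
        return byLabel;
    }

    // Walks the crude tree and copies matching metadata onto each labelled node.
    // Returns the number of nodes that were patched.
    static int patch(SketchNode node, Map<String, Map<String, String>> byLabel) {
        int patched = 0;
        Map<String, String> extra = (node.label == null) ? null : byLabel.get(node.label);
        if (extra != null) {
            node.metadata.putAll(extra);
            patched++;
        }
        for (SketchNode child : node.children) {
            patched += patch(child, byLabel);
        }
        return patched;
    }

    public static void main(String[] args) {
        // Crude tree: an unlabelled root with two labelled tips.
        SketchNode root = new SketchNode(null);
        SketchNode a = new SketchNode("A");
        SketchNode b = new SketchNode("B");
        root.children.add(a);
        root.children.add(b);

        List<String> csv = Arrays.asList(
                "label,latitude,longitude",
                "A,43.177874,-0.011702",
                "B,42.865584,-6.129627");

        int patched = patch(root, parseCsv(csv));
        System.out.println(patched + " nodes patched; A latitude = " + a.metadata.get("latitude"));
    }
}
```

Matching on the node label is just one possible key; if the second file is optional, the patch step is simply skipped and the crude tree is used as it is.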

Preparing a fully functional NeXML is now at the bottom of my priority list.

Sample NeXML file

This is a sample NeXML file I have prepared for testing. It demonstrates how coordinates are attached to a node. Color has not been attached yet, as I have not decided how to attach it. I should be able to get a phylogenetic tree out of this NeXML.

<nex:nexml about="#nex_nexml1" generator="Bio::Phylo::Project v.0.36_1660" version="0.9" xsi:schemaLocation="http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd">
<meta content="2011-05-22T09:19:18" datatype="xsd:date" id="meta18" property="dc:date" xsi:type="nex:LiteralMeta"/>
<otus id="otus19">

<otu id="otu21" label="A">
<meta content="835" datatype="xsd:long" id="meta1710" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="2038" datatype="xsd:long" id="meta1709" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta content="Alligator mississippiensis" datatype="xsd:string" id="meta1708" property="skos:prefLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/8496" id="meta1707" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta content="Alligator mississipiensis" datatype="xsd:string" id="meta1706" property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2813218" id="meta1705" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta1704" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>

<otu id="otu20" label="B">
<meta content="11782" datatype="xsd:long" id="meta1716" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="28069" datatype="xsd:long" id="meta1715" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/94835" id="meta1714" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2539857" id="meta1713" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta1712" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>

<otu id="otu23" label="C"/>
<otu id="otu22" label="D"/>
<otu id="otu24" label="E"/>
<otu id="otu25" label="F"/>
<otu id="otu26" label="G"/>
<otu id="otu27" label="H"/>
<trees id="trees2" otus="otus19">
<tree id="tree3" xsi:type="nex:FloatTree">
<node id="node4" root="true"/>

<node id="node5" label="B" otu="otu20" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="-6.129627" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="42.865584" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node6"/>
<node id="node11"/>

<node id="node7" label="A" otu="otu21" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="-0.011702" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="43.177874" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node8"/>

<node id="node12" label="D" otu="otu22" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="14.106559" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="41.798603" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node13"/>

<node id="node9" label="C" otu="otu23" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="9.869384" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="45.786672" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node10" label="E" otu="otu24" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="20.602788" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="40.217594" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node14" label="F" otu="otu25" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="21.094837" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="42.583038" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node15"/>

<node id="node16" label="G" otu="otu26" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="15.048859" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="46.898472" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<node id="node17" label="H" otu="otu27" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="16.6239034" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="44.723789" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>

<edge id="edge5" length="1.67" source="node4" target="node5"/>
<edge id="edge6" source="node4" target="node6"/>
<edge id="edge11" source="node4" target="node11"/>
<edge id="edge7" length="2.56" source="node6" target="node7"/>
<edge id="edge8" source="node6" target="node8"/>
<edge id="edge12" length=".34" source="node11" target="node12"/>
<edge id="edge13" source="node11" target="node13"/>
<edge id="edge9" length=".66" source="node8" target="node9"/>
<edge id="edge10" length=".56" source="node8" target="node10"/>
<edge id="edge14" length="1.67" source="node13" target="node14"/>
<edge id="edge15" source="node13" target="node15"/>
<edge id="edge16" length="4.23" source="node15" target="node16"/>
<edge id="edge17" length="1.2" source="node15" target="node17"/>
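As a rough check that coordinates attached this way can be read back, here is a minimal sketch that uses the plain JDK DOM parser rather than the NeXML Java API. The class and method names are made up for illustration, and the embedded XML is a small, well-formed fragment in the style of the sample (closing tags included), not the sample itself.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of phylogeoref or the NeXML Java API.
class NodeCoordinateReader {
    // Returns node id -> {latitude, longitude} for every node carrying both predicates.
    static Map<String, double[]> readCoordinates(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, double[]> result = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("node");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                Double lat = null;
                Double lon = null;
                NodeList metas = node.getElementsByTagName("meta");
                for (int j = 0; j < metas.getLength(); j++) {
                    Element meta = (Element) metas.item(j);
                    String property = meta.getAttribute("property");
                    if ("dwc:decimalLatitude".equals(property)) {
                        lat = Double.valueOf(meta.getAttribute("content"));
                    } else if ("dwc:decimalLongitude".equals(property)) {
                        lon = Double.valueOf(meta.getAttribute("content"));
                    }
                }
                if (lat != null && lon != null) {
                    result.put(node.getAttribute("id"), new double[] {lat, lon});
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not read coordinates", e);
        }
    }

    public static void main(String[] args) {
        // Single-quoted attributes are valid XML and avoid escaping inside Java strings.
        String xml =
                "<tree xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
              + " xmlns:nex='http://www.nexml.org/2009'"
              + " xmlns:dwc='http://rs.tdwg.org/dwc/terms/'>"
              + "<node id='node4' root='true'/>"
              + "<node id='node5' label='B'>"
              + "<meta id='m1' property='dwc:decimalLongitude' content='-6.129627'"
              + " xsi:type='nex:LiteralMeta' datatype='xsd:double'/>"
              + "<meta id='m2' property='dwc:decimalLatitude' content='42.865584'"
              + " xsi:type='nex:LiteralMeta' datatype='xsd:double'/>"
              + "</node>"
              + "</tree>";
        for (Map.Entry<String, double[]> e : readCoordinates(xml).entrySet()) {
            System.out.println(e.getKey() + " lat=" + e.getValue()[0] + " lon=" + e.getValue()[1]);
        }
    }
}
```

Nodes without both predicates (such as the root here) are simply left out of the result, which matches the idea that internal nodes may carry no location.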

Tuesday, June 14, 2011

Mapping of Data

One major obstacle here is mapping the NeXML info into the phylogenetic tree. Since the forester libraries were built with the PhyloXML format in mind, a perfect mapping will not be possible until the forester libraries are changed.
Here are the attributes that can be mapped directly. The others will be left unmapped!

-- PhylogenyNode

              -- BranchData (almost everything in branch data will be used)
                              -- BranchColor
                              -- BranchWidth
                              -- Confidence (iff confidence values are provided in nexml)

              --_node_name and _distance_parent have already been used.

              -- NodeData

                              -- Distribution (will be used for attaching the lat/long)
                              -- PropertiesMap (will be used for various properties)

I had used Identifier as the node id. I don't think this is correct, and it should be changed. Christian pointed me to this page to gain insight into how the forester library complies with the phyloXML format.

I think these values are sufficient for display on the map for now. :)

Monday, June 13, 2011

Grabbing NeXML data: Part II

In this post I will show how metadata can be extracted from an OTU element.

<otu id="otu9" label="Alligator_mississippiensis">
<meta content="835" datatype="xsd:long" id="meta82796" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="2038" datatype="xsd:long" id="meta82795" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta content="Alligator mississippiensis" datatype="xsd:string" id="meta82794" property="skos:prefLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/8496" id="meta82793" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta content="Alligator mississipiensis" datatype="xsd:string" id="meta82792" property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2813218" id="meta82791" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta82790" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>

The above OTU is attached to some node.
You can get the OTU attached to a node as follows:

OTU otu = node.getOTU();

Then you can extract the set of annotations as follows:

Set<Annotation> s1 = otu.getAnnotations("skos:closeMatch");
Set<Object> s2 = otu.getAnnotationValues("tb:identifier.taxon");

The values are easy to extract after this. Since there are no NeXML files with geographic coordinates attached to nodes, and manually editing a NeXML file is cumbersome, I am thinking of creating a utility that takes a NeXML file and a CSV file storing the lat/long and other metadata as input, and attaches the latitude/longitude metadata to the NeXML file. I still have to discuss this with my mentor!
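One small piece of such a utility might look like the sketch below: turning a single CSV row into the two meta elements that would be placed inside the matching node. Everything here is an assumption for illustration: the MetaLineBuilder class, the label,latitude,longitude column order, the generated ids, and the choice of the dwc:decimalLatitude and dwc:decimalLongitude predicates.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of phylogeoref or the NeXML Java API.
class MetaLineBuilder {
    // Builds the latitude and longitude <meta> lines for one "label,latitude,longitude" row.
    // firstId seeds the generated ids so that each meta element gets a distinct id.
    static List<String> metaLines(String csvRow, int firstId) {
        String[] cols = csvRow.split(",");
        if (cols.length < 3) {
            throw new IllegalArgumentException("expected label,latitude,longitude but got: " + csvRow);
        }
        String latitude = cols[1].trim();
        String longitude = cols[2].trim();
        // Fail fast on non-numeric values; the original text is kept so no digits are lost.
        Double.parseDouble(latitude);
        Double.parseDouble(longitude);
        return Arrays.asList(
                meta("meta" + firstId, "dwc:decimalLatitude", latitude),
                meta("meta" + (firstId + 1), "dwc:decimalLongitude", longitude));
    }

    private static String meta(String id, String property, String content) {
        return "<meta id=\"" + id + "\" property=\"" + property + "\" content=\"" + content
                + "\" xsi:type=\"nex:LiteralMeta\" datatype=\"xsd:double\"/>";
    }

    public static void main(String[] args) {
        for (String line : metaLines("A,43.177874,-0.011702", 1)) {
            System.out.println(line);
        }
    }
}
```

A real utility would still need to find the node whose label matches the first column and insert these lines inside it, ideally through an XML API rather than string handling.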

Grabbing NeXML data: Part I

NeXML is fundamentally just XML with a special schema. Viewed in this light, it should not be difficult to extract any information from it. Here is an explanation of the code snippet used to grab the metadata attached to the nodes.
One short note before that: there are no trees in TreeBase that have geographic coordinates attached to their nodes. There are trees with metadata, but none with geographic coordinates.

Here is what Rutger suggested for attaching and accessing DarwinCore predicates for lat/lon coordinates.

<!-- surrounded by rest of tree description -->
<node id="uid1" label="Some taxon" about="#uid1">
<meta id="uid2" property="dwc:DecimalLongitude" content="45.65" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid3" property="dwc:DecimalLatitude" content="37.21" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>
<!-- surrounded by rest of tree description -->

1) To attach a Darwin Core coordinate to a node, you would do:
          URI nameSpaceURI = URI.create("http://rs.tdwg.org/dwc/dwcore/");
          Annotation longitude = node.addAnnotationValue("dwc:DecimalLongitude", nameSpaceURI, new Double("45.65"));
          Annotation latitude = node.addAnnotationValue("dwc:DecimalLatitude", nameSpaceURI, new Double("37.21"));

2) Conversely, to read one:
          // a Set is returned because multiple annotations with the same predicate can exist,
          // e.g. for multiple authors on the same study
          Set<Object> longitudes = node.getAnnotationValues("dwc:DecimalLongitude");
          Double longitude = (Double) longitudes.iterator().next();
          Set<Object> latitudes = node.getAnnotationValues("dwc:DecimalLatitude");
          Double latitude = (Double) latitudes.iterator().next();

Information can be extracted from an OTU in the same way; I will show that in the next post.

Wednesday, June 8, 2011

Building a reader for NeXML

Here is what has been done over the last few days. The code is in the packages
  1. org.nescent.phylogeoref.nexml
  2. org.nescent.phylogeoref.nexml.utility
First is the class NeXMLReader.
The centerpiece here is the method parseNetwork(File networkFile).

Here is a detailed explanation of it.
        Document document = DocumentFactory.parse(networkFile);
        List<TreeBlock> treeList = document.getTreeBlockList();

Basically, you parse the File object networkFile, which wraps a NeXML file, and extract its TreeBlocks as a list.

The next thing we do is ask the engine to construct a Phylogeny object from a network object.

           phylogenies[index] = engine.constructPhylogenyFromNetwork(network);

The NeXMLEngine class does all the computational work of constructing a Phylogeny object from a Network object.

The PhyloUtility class provides various kinds of commonly used utility methods. All the methods are static and properly documented, so you can have a look at them.

Last is the PhylogenyFactory class, which is simply a factory for new Phylogeny objects.

So it is now possible to read very simple NeXML files and construct the corresponding Phylogeny objects. However, the metadata in the file has not yet been attached to the nodes. This will be a challenge, as I'll have to find ways of grabbing this information from the document. The NeXML schema is currently undergoing a lot of simultaneous development.

The code is up on github. There is a utility main method in the class NeXMLReader. You can run it to see the reader in action. There are some sample files in samples that you can choose from. Again, a reminder that I am running this on a Windows machine.

Thursday, June 2, 2011

Windows + Git: An Unnecessary Trudge

This ate up a bit of my time, so I decided to post it so that others are aware of the issue. The problem is that git cannot directly rename magic.java to Magic.java on a case-insensitive filesystem such as Windows. The solution is to rename in two steps through a temporary name:

git mv magic.java magic.xyz
git mv magic.xyz Magic.java

I have changed my plan a little. Since NeXML is taking time to sink in, I will meanwhile implement the Roderic Page technique for making the tree edges follow the curvature of the earth.