Tuesday, August 23, 2011

The End of Coding Period

Hi all,
 I could not blog during the last few weeks because I had to move back to my college and could not find extra time to write.
The recent weeks were fun like all the others. I was able to prepare a sufficiently lucid and simple manual for the project, which is kept in the manuals folder. I was also able to build a simple GUI for running the project, which I believe will be a great help for non-Java users.
Using the library is very easy; even a Java beginner should be able to generate KMLs after reading and following the instructions in the manual.

This Summer of Code was a most remarkable experience for me; I hope to continue contributing to the open-source community in the future.


Thursday, July 28, 2011

Sign of 180 Resolution

This is for the case when maxDistance (the distance between the maximally separated nodes) is 180.

Then there arises an ambiguity in choosing globalMaxAngle as +180 or -180 in the transformed positions, and it does make a difference to the answer.

Consider for example the case of 3 positions: 10, -170 and -110.

The maximally separated nodes are 10 and -170. Suppose 10 is chosen as origin and clockwise direction is positive. Then

Case 1: -170 ----> 180         (where ----> denotes transformation)

mean = (-120 + 180 + 0 )/3 = 20

Case 2: -170 ----> -180

mean = (-120 + -180 + 0)/3 = -100

-100 seems to be the better choice because migration distances are smaller this way, and it adheres to my policy of keeping the parent node at a position within the bounds of the end nodes.

To achieve this effect, I can count which hemisphere has the greater number of nodes. If the negative hemisphere has more nodes then the antipodal point is taken as -180, and vice versa.

How to find whether a point is in the positive or negative hemisphere was covered in the last post.
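The hemisphere-counting rule can be sketched as follows. The class and method names are illustrative only (not from the phyloGeoRef source), and positions are assumed to be longitudes in degrees in [-180, 180], with zero counted as positive:

```java
public class AntipodeSign {

    // Returns -180.0 when the negative hemisphere holds more positions than the
    // positive one, else +180.0 (zero is counted as positive here).
    static double resolveAntipode(double[] positions) {
        int negative = 0;
        for (double p : positions) {
            if (p < 0) negative++;
        }
        return negative > positions.length - negative ? -180.0 : 180.0;
    }

    public static void main(String[] args) {
        // Example from the post: of 10, -170 and -110, two positions are negative,
        // so the antipodal node is mapped to -180.
        System.out.println(resolveAntipode(new double[]{10, -170, -110}));  // -180.0
    }
}
```

With the example positions 10, -170 and -110, two of the three lie in the negative hemisphere, so the antipodal node is mapped to -180, which leads to the preferred mean of -100.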

Wednesday, July 27, 2011

Transformation Rules

Once I find the two positions which are maximally separated, namely globalAngleZero and globalAngleMax, I need to move the origin to globalAngleZero, with the positive sense along the shortest-path direction towards globalAngleMax.

maxDistance is the distance between these two positions.

Case 1: maxDistance = 180

For each position in posVector

         let L = minimumAngularDistance(position, globalAngleZero)
         if (position >= globalAngleZero OR position <= globalAngleMax)
                do nothing
         else if (position < globalAngleZero OR position > globalAngleMax)
                L = -L (invert the sign)

         sigma += L

meanPos = sigma/n (where n is the length of posVector)

The actual mean position is obtained by adding meanPos to globalAngleZero in a positive sense.

Case 2: maxDistance < 180

For each position in posVector

        let L1 = minimumAngularDistance(position, globalAngleZero)
        let L2 = minimumAngularDistance(position, globalAngleMax)
        if (L1 + L2 = maxDistance) (this point lies inside the region bounded by the end points)
              sigma = sigma + L1
        else
              sigma = sigma - L1

meanPos = sigma/n (where n is the length of posVector)

The actual mean position can be obtained by adding meanPos to globalAngleZero in the clockwise and anti-clockwise senses; the position is chosen for which the L1 + L2 = maxDistance condition is satisfied.
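Case 2 of the pseudocode above can be sketched in Java roughly as follows. The variable names follow the pseudocode, but the class itself is hypothetical, and a small tolerance replaces the exact L1 + L2 = maxDistance comparison to be safe with floating point:

```java
public class TransformedMean {

    // Shortest angular distance between two longitudes, in degrees, range [0, 180].
    static double minimumAngularDistance(double a, double b) {
        double d = Math.abs(a - b) % 360.0;
        return d > 180.0 ? 360.0 - d : d;
    }

    // Normalize an angle to (-180, 180].
    static double normalize(double a) {
        a = ((a % 360.0) + 360.0) % 360.0;
        return a > 180.0 ? a - 360.0 : a;
    }

    // Case 2 (maxDistance < 180): positions inside the arc bounded by
    // globalAngleZero and globalAngleMax contribute +L1, the rest -L1.
    static double meanPosition(double[] posVector, double globalAngleZero, double globalAngleMax) {
        double maxDistance = minimumAngularDistance(globalAngleZero, globalAngleMax);
        double sigma = 0;
        for (double position : posVector) {
            double l1 = minimumAngularDistance(position, globalAngleZero);
            double l2 = minimumAngularDistance(position, globalAngleMax);
            // tolerance instead of exact equality, to be safe with floating point
            if (Math.abs(l1 + l2 - maxDistance) < 1e-9) {
                sigma += l1;   // inside the region bounded by the end points
            } else {
                sigma -= l1;   // outside it
            }
        }
        double meanPos = sigma / posVector.length;
        // Add meanPos to globalAngleZero in both senses and keep the candidate
        // that satisfies the L1 + L2 = maxDistance condition.
        double cand1 = normalize(globalAngleZero + meanPos);
        double d1 = minimumAngularDistance(cand1, globalAngleZero)
                  + minimumAngularDistance(cand1, globalAngleMax);
        return Math.abs(d1 - maxDistance) < 1e-9 ? cand1 : normalize(globalAngleZero - meanPos);
    }

    public static void main(String[] args) {
        // Positions -80, -170, 160; maximally separated pair (-80, 160).
        System.out.println(meanPosition(new double[]{-80, -170, 160}, -80, 160));  // -150.0
    }
}
```

For example, the positions -80, -170 and 160 with the maximally separated pair (-80, 160) give a mean of -150, which lies between the child nodes.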

Tuesday, July 26, 2011

This is hard !!!

So I encountered some problems in my centroid-calculation algorithm, which I am trying to correct. But first I wanted to see why I am implementing the algorithm mentioned two posts back rather than the naive one.

The new algorithm gives a time complexity of O(d log d) per node against the naive O(d^2), where d is the number of children. The naive algorithm would certainly have been easier to code, but I think this will save appreciable computational cycles.

Assuming that we are working with 4,000 nodes, and taking the average number of children per node to be 8, here is the comparison of the total number of comparisons.

Naive algorithm:
8^2 × 4000 = 64 × 4000 = 256,000

New algorithm:
8 log2(8) × 4000 = 24 × 4000 = 96,000

Here are the steps in the new centroid calculation.

  1. Get the maximally separated nodes.
    (This is the part where I really used some clever techniques.)
  2. Choose one of them as zero and the other's direction as the positive x-axis.
  3. Transform the coordinates.
  4. Calculate the normal mean on the transformed coordinates.
  5. Transform back.

This is difficult to code and requires extreme patience; there are so many cases. I need to finish it fast.

Saturday, July 23, 2011

Failure of a simple weighted mean algorithm

I will demonstrate this with an example. What I was trying to do was make the distances exactly proportional to the edge lengths. This is not always possible.

Take a very simple example in 2D, i.e. a plane.

Suppose you have 4 points A, B, C and D on the plane and you want to find a central point such that the distances of these points from it are in the given proportions.
It is easy to see that this is not always possible. To start, you can choose any 3 points and draw a circle through them, but now you have a problem: it is not necessary that D lies on the same circle.
Spherical geometry makes things even worse.

I need some other idea to use/represent this information. Had it been an animation then things would have been different.

Wednesday, July 20, 2011

Things Done and Things Undone !

Implemented: denotes what is already implemented in phyloGeoRef.
TODO: denotes things that must be done.
Planned: denotes a change in implementation or the addition of extra features.
  1. TREE FORMATs (Files containing the tree data)

    Implemented: newick (.nwk) , nexml (.xml) , phyloxml (.xml) , nexus (.nex), (.tolxml)

  2. METADATA FORMATs (Files which contain the spatial data)

    Implemented: csv, txt


    Implemented: Simple Mean
    Planned: Weighted Mean for trees with edge lengths
    Bug: Incorrect positioning algorithm. Modify it and make it efficient.

    Posted: https://github.com/dapurv5/phyloGeoRef/wiki/Tutorial:-Using-the-GrandUnifiedReader,-adding-new-properties.
    Planned: Tutorial on writing the main class. (1/2 day)

    Implemented: Error checking for implausible lat/long
    Implemented: Error due to absence of information given in the tree but not in the metadata file.
    Planned: Checking corner cases like when all the child nodes of a parent node have missing locations.
    Planned: Checking clade consistency.

    Implemented: Display of tip nodes on the map and the skeletal structure of the tree.
    Implemented: Levelwise slicing of the tree in folders.
    Planned: Hierarchical slicing of the tree into folders.
    TODO: HTML balloons associated with each node.
    Planned: Associating regions with HTUs also.
    Planned: Animation.
    Planned: Compressing the kml file as kmz

    Planned: Taking images and other rich data from external web services.
    Planned: Preparing a local cache for this data since downloading data for thousands of species on each run of the program would be a time wasting process.
    Planned: Writing a command line option parser for the main class.

Pseudo Code for the new Reconstruction Algorithm

We traverse the tree post-order to assign coordinates to each of the internal nodes. Do the following for each internal node.

Create 4 buckets, one for each quadrant of a circle.
Also create four flags, one for each quadrant. A true value of a flag indicates the presence of at least one node in that quadrant.

Step 1:

For each child node of the parent node:
   put the node in the appropriate bucket depending on its quadrant, and
   mark the flag true for that quadrant.

Step 2:

Sort each bucket so that the minimum values are towards the top, i.e. the beginning of the bucket. Use the sorting algorithm discussed in the previous post. If you have to sort negative numbers, sort their absolute values and then reverse the order.
It is to be noted that since you are sorting only within a bucket, all elements within the same bucket have the same sign. Sorting takes O(d log d) time, where d is the maximum degree of any node in the phylogeny.
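The sorting trick for the negative values can be sketched as below. The class name is illustrative; for same-sign values this is equivalent to a plain ascending sort, and the point is only to mirror the described procedure:

```java
import java.util.Arrays;
import java.util.Collections;

public class BucketSort {

    // Sort a bucket of negative values ascending by the post's trick:
    // sort their absolute values, reverse the order, then restore the sign.
    static Double[] sortNegatives(Double[] bucket) {
        Double[] abs = new Double[bucket.length];
        for (int i = 0; i < bucket.length; i++) abs[i] = Math.abs(bucket[i]);
        Arrays.sort(abs);                        // ascending absolute values
        Collections.reverse(Arrays.asList(abs)); // reverse -> descending absolute values
        for (int i = 0; i < abs.length; i++) abs[i] = -abs[i]; // restore the sign
        return abs;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortNegatives(new Double[]{-10.0, -170.0, -90.0})));
        // [-170.0, -90.0, -10.0]
    }
}
```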

Step 3:

def distance (bucket a, bucket b):
   the maximum distance between any two nodes in these buckets.

maxDistance = distance(bucket 1, bucket 3)
maxDistance = max { maxDistance, distance(bucket 2, bucket 4) }

maxDistance = max { maxDistance, distance(bucket 1, bucket 2)}
maxDistance = max { maxDistance, distance(bucket 2, bucket 3)}
maxDistance = max { maxDistance, distance(bucket 3, bucket 4)}
maxDistance = max { maxDistance, distance(bucket 4, bucket 1)}

I was trying to find a general formula in terms of variables that saves all this, but my attempts in that direction didn't yield any feasible results. So now I believe that no single universal formula exists which can capture the centroid location for all the nodes on the globe; hence these elaborations.

Once we find the 2 nodes between which we have the maximum distance, we can transform the coordinates so as to make the coordinate of one of these nodes zero.
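Steps 1 and 3 can be sketched as follows. The class is hypothetical: distance(bucket a, bucket b) is translated literally from the definition above (so the clever sorted-extremes shortcut, and hence step 2, is omitted here), and for completeness each bucket is also compared with itself, which the pair list above does not include:

```java
import java.util.ArrayList;
import java.util.List;

public class MaxSeparation {

    // Shortest angular distance between two longitudes, in degrees, range [0, 180].
    static double minimumAngularDistance(double a, double b) {
        double d = Math.abs(a - b) % 360.0;
        return d > 180.0 ? 360.0 - d : d;
    }

    // Bucket index for a longitude in [-180, 180): quadrants [0,90), [90,180),
    // [-180,-90) and [-90,0), stored 0-indexed.
    static int quadrant(double p) {
        if (p >= 0) return p < 90 ? 0 : 1;
        return p < -90 ? 2 : 3;
    }

    // "def distance(bucket a, bucket b)" translated literally: the maximum angular
    // distance between any node of a and any node of b.
    static double distance(List<Double> a, List<Double> b) {
        double max = 0;
        for (double x : a)
            for (double y : b)
                max = Math.max(max, minimumAngularDistance(x, y));
        return max;
    }

    static double maxDistance(double[] positions) {
        List<List<Double>> buckets = new ArrayList<>();
        for (int i = 0; i < 4; i++) buckets.add(new ArrayList<>());
        for (double p : positions) buckets.get(quadrant(p)).add(p);  // Step 1
        // Step 3: the six bucket pairs from the post, plus each bucket against
        // itself so that a single occupied bucket is still handled.
        int[][] pairs = {{0, 2}, {1, 3}, {0, 1}, {1, 2}, {2, 3}, {3, 0},
                         {0, 0}, {1, 1}, {2, 2}, {3, 3}};
        double max = 0;
        for (int[] pair : pairs)
            max = Math.max(max, distance(buckets.get(pair[0]), buckets.get(pair[1])));
        return max;
    }

    public static void main(String[] args) {
        // For the positions -80, -170 and 160 the maximum separation is 120,
        // attained between -80 and 160 measured across the antimeridian.
        System.out.println(maxDistance(new double[]{-80, -170, 160}));  // 120.0
    }
}
```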

Calculating the Centroids: Correction

This is due to David, and what he does is really very clever. There was anomalous behavior in just taking the mean (plus some additional details) for the coordinate of the parent node.

Let me first describe what the problem is and how Dave's solution comes to the rescue.

So suppose you have 3 points A, B, C on the equator with longitudes -80, -170 and 160 respectively. Their simple mean comes out to be -30.
Take the point diametrically opposite this point on the globe; that comes out to be 150. Now 150 and -30 are two potential candidates for the mean. We choose 150 because the sum of its distances to A, B and C along the equator is smaller.
Notice that neither 150 nor -30 lies between the child nodes on the map. This is weird, and in fact it is wrong.

So here is what Dave does. Take the three points and find the pair with the maximum separation between them. This maximum separation is always at most 180, for otherwise we could have measured it from the other side of the globe.

Finding this pair of maximally separated points is computationally expensive via the naive method of pairwise comparison, which is O(n^2).

In our example the pair of maximally separated points is -80 and 160, the distance between them being 120.
Now transform the axis so that -80 is the new origin. The transformed coordinates become 0, 90 and 120; these are the distances from the new origin to the respective points A, B and C.

Once this is done calculate the mean. The mean is 70. Transforming back you get -150.
This is between the child nodes.

Take another example, a rather difficult one: (-20, -180, 30). Let's apply the new method to it.
Take the maximally separated nodes, -20 and -180. Now transform -20 -> 0, -180 -> 160 and 30 -> -50.
Take the mean:
110/3 ≈ 36.67
Transforming back we get -56.67.

Indeed the required angle.

In planar geometry this point would be the centroid of the polygon formed by the child nodes. I believe, though it is only my intuition, that this is also the centroid of the spherical polygon formed on the surface of the earth by the tip nodes.
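The whole procedure — find the maximally separated pair (here by the naive pairwise search), transform so that one of them becomes the origin with the positive sense along the shortest path to the other, average the transformed coordinates, and transform back — can be sketched as follows. The class and method names are illustrative, and the sign-of-180 ambiguity discussed in a separate post is not handled:

```java
public class CircularCentroid {

    // Normalize an angle to (-180, 180].
    static double normalize(double a) {
        a = a % 360.0;
        if (a <= -180.0) a += 360.0;
        if (a > 180.0) a -= 360.0;
        return a;
    }

    // Mean longitude via the method described above: pick the maximally separated
    // pair, measure every position as a signed offset from one of them along the
    // shortest path towards the other, average, and transform back.
    static double mean(double[] positions) {
        // naive O(n^2) search for the maximally separated pair
        int zi = 0, mi = 0;
        double maxDist = -1;
        for (int i = 0; i < positions.length; i++)
            for (int j = i + 1; j < positions.length; j++) {
                double d = Math.abs(normalize(positions[i] - positions[j]));
                if (d > maxDist) { maxDist = d; zi = i; mi = j; }
            }
        double zero = positions[zi];
        // +1 or -1 depending on the direction of the shortest path from the new
        // origin to the other node of the pair
        double sense = normalize(positions[mi] - zero) >= 0 ? 1.0 : -1.0;
        double sigma = 0;
        for (double p : positions) sigma += sense * normalize(p - zero);  // transform
        return normalize(zero + sense * (sigma / positions.length));      // transform back
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[]{-80, -170, 160}));  // -150.0
        System.out.println(mean(new double[]{-20, -180, 30}));   // about -56.67
    }
}
```

On the two worked examples above this reproduces -150 and approximately -56.67.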

Monday, July 18, 2011

The Midterm: On a philosophical note

I have reached the point in my project which is officially called the midterm evaluation. The mentors seem to be as satisfied with my work as I have been.

Here is what has gone by in the project. The first and foremost step was to capture the data in a kind of abstract entity, and this abstract entity was a Phylogeny.
The Phylogeny has been modeled as a Java class, perhaps one of the boons of the object-oriented design principle. The data used in the construction of such a phylogeny comes from various sources in diverse formats, so it became necessary to design UniversalReaders for both the spatial and tree data files, and then to further unify both into a GrandUnifiedReader which reads everything and constructs a Phylogeny out of it.
It can happen, and it does happen, that not all the data finds a spot in the design of the Phylogeny class in the forester library, so a kind of container or external mould called "PhylogenyMould" was associated with each entity capable of storing information.

So far so good. The next task ventured into the bio-geographical reconstruction of the tree, i.e. the processing and validation of the tree. After all, we only know the positions of the species that exist today; the interior hypothetical taxa need to somehow be assigned their coordinates.
The method employed here was a simple one, namely taking the mean (later to be changed to a weighted mean for trees with time ratios given), with due consideration of taking the mean position that offers a minimum distance of migration from the child nodes.
This method, however, does present some weird patterns and can be further improved at the expense of the computational cycles needed to complete the reconstruction. Indeed, there is scope for further improvement.

All the data has been abstractly embedded into a Phylogeny. Next, in the visualization portion, comes the part that is visually pleasing to the eyes: drawing the tree as a KML.
I employed many KML features that lead to better visualization: separating clades via colors, adding a level-of-detail feature via regions, tessellating the lines even at an altitude above the earth, etc.

All this has begun to give a tangible form to the original idea from where the project began.
As far as my personal opinion is concerned, the major portion of the project is over and I only need to polish things up, patch a few undone things here and there, and then incrementally improve it as much as time permits.

Perhaps the most elusive addition that could be made to this project is animation, but I don't know exactly how and where it can be placed to create the magic effect.

Altogether it has been a great experience till now.

Friday, July 1, 2011

Recent Log

So here is what has been going on recently in my project. I had already been able to read a phylogeny; now I have also completed the PhylogenyProcessor, so I have a drawable phylogeny at hand.

The parent nodes have been assigned the mean coordinates of their children. I am thinking of modifying this to a weighted average so that the edge lengths correspond to the time spans.
If clades are specified then each clade is assigned a new random color. Any parent node's color is the arithmetic mean of the colors of its children.
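The parent-color rule can be sketched as a channel-wise arithmetic mean. This is an illustrative helper, not the actual phyloGeoRef code:

```java
import java.awt.Color;

public class CladeColor {

    // Parent color as the channel-wise arithmetic mean of the children's colors.
    static Color meanColor(Color[] children) {
        int r = 0, g = 0, b = 0;
        for (Color c : children) {
            r += c.getRed();
            g += c.getGreen();
            b += c.getBlue();
        }
        int n = children.length;
        return new Color(r / n, g / n, b / n);
    }

    public static void main(String[] args) {
        // Channel-wise mean of red (255,0,0) and blue (0,0,255) -> (127, 0, 127).
        Color parent = meanColor(new Color[]{Color.RED, Color.BLUE});
        System.out.println(parent.getRed() + "," + parent.getGreen() + "," + parent.getBlue());
    }
}
```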

The code is up on github so you can have a look at it.

My current efforts are directed at coding the KmlWriter classes that will draw the phylogeny into the KML. Presenting 4,000 nodes all at once leads to clutter and chaos on the map, and I am thinking of ways to prevent such a catastrophe. Ideas are welcome !
Some ideas I have had are as follows.

  1. Build each level of the tree into a separate folder; this way the user can select what they want to see and what they don't.
  2. The second was to display it clade-wise. But this seems to be a futile idea because the concept of a clade begins to fade out as we move up the tree; which clade an HTU belongs to when its children belong to different clades is indeed a vague question. So I have dropped this idea.
  3. The third is to put the placemarks at various different levels in the map.

Tuesday, June 28, 2011


The nescent.phylogeoref.reader package is now almost complete. Here is an overview of what's in it.

Basically there are two kinds of readers in here: "TreeReader" and "MetadataReader", for reading trees and metadata respectively.

Here is the class hierarchy


TreeReaders:
  1. MultiFormatReader
  2. UniversalTreeReader
  3. NeXMLReader

MetadataReaders:
  1. CSVMetadataReader
  2. TextMetadataReader
  3. UniversalMetadataReader

I have given a tutorial on using the GrandUnifiedReader on the github wiki.
Understanding the PhylogenyKitchen class may be a bit tricky, because I myself would have to go through it again to explain it; however, using it is very easy. Further, I have included sufficient comments to make it clear.


Tuesday, June 21, 2011

A Schematic Overview

I have prepared a schematic overview of the work in a simple easy to understand format.

Here is a link to it.

Schematic Overview

This would make most of the part clear. If you have further questions contact me !!!

Sunday, June 19, 2011

A Change in Plan

So there is a change in plan here.

What I had thought earlier:
Earlier I thought that the nwk and csv files had to be completely replaced by a single NeXML file. While this may be the ultimate goal, at present it is not possible. The reasons are as follows.

1) There are no examples with location metadata attached to nodes in NeXML in TreeBase.
2) There is no preferred, agreed-upon manner in which coordinate metadata is likely to be attached to nodes in the future.

So here is what the current plan is.

  1. INPUT 1: A nexml file with whatever metadata it has to offer.
  2. INPUT 2: A second file with additional metadata that you might want to attach to the nodes.
  3. Having taken both these files, construct a crude Phylogeny tree from the nexml file. By crude I mean that this phylogeny is incomplete: it has the basic tree structure but may not have all the essential metadata.
  4. Now extract the additional metadata from the second file and embed it in the phylogeny.
  5. You now get a full fledged Phylogeny with all the metadata.
In the earlier plan I was trying to stuff everything into the nexml first and then use this nexml to prepare the phylogeny tree. This had the overhead of first creating a modified nexml file with all the metadata and then preparing the phylogeny from it.
I will write the code so that the user has the option of choosing whether this second input file is taken as input or not.

Preparing a fully fledged phylogenetic tree from NeXML alone would be a utopia. What we do at present is prepare a partial phylogenetic tree, which I call a crude phylogenetic tree, and then patch it with metadata from the other file.

Preparing a fully functional NeXML is now at the bottom of my priority list.

Sample NeXML file

This is a sample NeXML file I have prepared for testing. It demonstrates how coordinates are to be attached to a node. Color has still not been attached; I have not yet decided how to attach it. I should be able to get a phylogenetic tree out of this nexml.

<nex:nexml about="#nex_nexml1" generator="Bio::Phylo::Project v.0.36_1660" version="0.9" xsi:schemaLocation="http://www.nexml.org/2009 http://www.nexml.org/2009/nexml.xsd">
<meta content="2011-05-22T09:19:18" datatype="xsd:date" id="meta18" property="dc:date" xsi:type="nex:LiteralMeta"/>
<otus id="otus19">

<otu id="otu21" label="A">
<meta content="835" datatype="xsd:long" id="meta1710" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="2038" datatype="xsd:long" id="meta1709" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta content="Alligator mississippiensis" datatype="xsd:string" id="meta1708" property="skos:prefLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/8496" id="meta1707" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta content="Alligator mississipiensis" datatype="xsd:string" id="meta1706" property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2813218" id="meta1705" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta1704" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>
</otu>

<otu id="otu20" label="B">
<meta content="11782" datatype="xsd:long" id="meta1716" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="28069" datatype="xsd:long" id="meta1715" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/94835" id="meta1714" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2539857" id="meta1713" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta1712" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>
</otu>

<otu id="otu23" label="C"/>
<otu id="otu22" label="D"/>
<otu id="otu24" label="E"/>
<otu id="otu25" label="F"/>
<otu id="otu26" label="G"/>
<otu id="otu27" label="H"/>
</otus>

<trees id="trees2" otus="otus19">
<tree id="tree3" xsi:type="nex:FloatTree">
<node id="node4" root="true"/>

<node id="node5" label="B" otu="otu20" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="-6.129627" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="42.865584" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node6"/>
<node id="node11"/>

<node id="node7" label="A" otu="otu21" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="-0.011702" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="43.177874" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node8"/>

<node id="node12" label="D" otu="otu22" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="14.106559" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="41.798603" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node13"/>

<node id="node9" label="C" otu="otu23" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="9.869384" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="45.786672" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node10" label="E" otu="otu24" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="20.602788" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="40.217594" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node14" label="F" otu="otu25" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="21.094837" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="42.583038" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node15"/>

<node id="node16" label="G" otu="otu26" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="15.048859" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="46.898472" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<node id="node17" label="H" otu="otu27" xmlns:dwc="http://rs.tdwg.org/dwc/terms/">
<meta id="uid2" property="dwc:decimalLongitude" content="16.6239034" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
<meta id="uid2" property="dwc:decimalLatitude" content="44.723789" xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>

<edge id="edge5" length="1.67" source="node4" target="node5"/>
<edge id="edge6" source="node4" target="node6"/>
<edge id="edge11" source="node4" target="node11"/>
<edge id="edge7" length="2.56" source="node6" target="node7"/>
<edge id="edge8" source="node6" target="node8"/>
<edge id="edge12" length=".34" source="node11" target="node12"/>
<edge id="edge13" source="node11" target="node13"/>
<edge id="edge9" length=".66" source="node8" target="node9"/>
<edge id="edge10" length=".56" source="node8" target="node10"/>
<edge id="edge14" length="1.67" source="node13" target="node14"/>
<edge id="edge15" source="node13" target="node15"/>
<edge id="edge16" length="4.23" source="node15" target="node16"/>
<edge id="edge17" length="1.2" source="node15" target="node17"/>
</tree>
</trees>
</nex:nexml>

Tuesday, June 14, 2011

Mapping of Data

One major obstacle here is mapping the nexml info into the phylogenetic tree. Since the forester libraries were made with the PhyloXML format in mind, a perfect mapping will not be possible until the forester libraries are changed.
Here are the attributes that can be mapped directly. The others will be left unmapped!

-- PhylogenyNode

              -- BranchData (almost everything in branch data will be used)
                              -- BranchColor
                              -- BranchWidth
                              -- Confidence (iff confidence values are provided in nexml)

              --_node_name and _distance_parent have already been used.

              -- NodeData

                              -- Distribution (will be used for attaching the lat/long)
                              -- PropertiesMap (will be used for various properties)

I had used Identifier as the node id. I don't think this is correct and it should be changed. Christian suggested this page to me to gain an insight into how the phyloXML format has been complied with while building the forester library.

I think for now these values will be sufficient to display on the map. :)))

Monday, June 13, 2011

Grabbing NeXML data: Part II

In this post I will show how metadata can be extracted from an OTU element.

<otu id="otu9" label="Alligator_mississippiensis">
<meta content="835" datatype="xsd:long" id="meta82796" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta content="2038" datatype="xsd:long" id="meta82795" property="tb:identifier.taxonVariant" xsi:type="nex:LiteralMeta"/>
<meta content="Alligator mississippiensis" datatype="xsd:string" id="meta82794" property="skos:prefLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/8496" id="meta82793" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta content="Alligator mississipiensis" datatype="xsd:string" id="meta82792" property="skos:altLabel" xsi:type="nex:LiteralMeta"/>
<meta href="http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2813218" id="meta82791" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.org/phylo/treebase/phylows/study/TB2:S2108" id="meta82790" rel="rdfs:isDefinedBy" xsi:type="nex:ResourceMeta"/>
</otu>

The above otu is attached to some node.
You can get the OTU attached to a node as follows.

OTU otu = node.getOTU();

Then you can extract the set of annotations as follows.

Set<Annotation> s1 = otu.getAnnotations("skos:closeMatch");
Set<Object> s2 = otu.getAnnotationValues("tb:identifier.taxon");

The values are easy to extract after this. Since there are no NeXML files with geographic coordinates attached to nodes, and manually editing a NeXML file is a cumbersome process, I am thinking of creating a utility which takes a NeXML file and a csv file storing the lat/long and other metadata as input, and attaches the latitude/longitude metadata to the NeXML file. I still have to discuss this with my mentor!

Grabbing NeXML data: Part I

NeXML is fundamentally just a type of XML with a special schema. Viewed in this light, it should not be difficult to extract information from it. Here is the explanation of the code snippet used to grab the metadata attached to the nodes.
One short note before that: there are no trees in TreeBase that have geographic coordinates attached to their nodes. There are trees with metadata, but none with geographic coordinates.

Here is what was suggested by Rutger to attach and access DarwinCore predicates for lat/lon coordinates.

<!-- surrounded by rest of tree description -->
<node id="uid1" label="Some taxon" about="#uid1">
  <meta id="uid2" property="dwc:DecimalLongitude" content="45.65"
        xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
  <meta id="uid2" property="dwc:DecimalLatitude" content="37.21"
        xsi:type="nex:LiteralMeta" datatype="xsd:double"/>
</node>
<!-- surrounded by rest of tree description -->

1) To attach a darwin core coordinate to a node, you would do:
          URI nameSpaceURI = URI.create("http://rs.tdwg.org/dwc/dwcore/");
          Annotation longitude = node.addAnnotationValue("dwc:DecimalLongitude", nameSpaceURI, new Double("45.65"));
          Annotation latitude = node.addAnnotationValue("dwc:DecimalLatitude",nameSpaceURI, new Double("37.21"));

2) Conversely, to read one:
          // a Set is returned because multiple annotations with the same predicate
          // can exist, e.g. for multiple authors on the same study
          Set<Object> longitudes = node.getAnnotationValues("dwc:DecimalLongitude");
          Double longitude = (Double) longitudes.iterator().next();
          Set<Object> latitudes = node.getAnnotationValues("dwc:DecimalLatitude");
          Double latitude = (Double) latitudes.iterator().next();

Likewise, information can be extracted from an OTU; I will show that in the next post.

Wednesday, June 8, 2011

Building a reader for NeXML

So here is what has happened over the last few days; the code for it is in the packages
  1. org.nescent.phylogeoref.nexml
  2. org.nescent.phylogeoref.nexml.utility
First is the class NeXMLReader.
The cynosure here is the method parseNetwork(File networkFile).

Here is a detailed explanation of it.
        Document document = DocumentFactory.parse(networkFile);
        List<TreeBlock> treeList = document.getTreeBlockList();

Basically you parse the File object, which wraps a NeXML file inside it, and extract the TreeBlocks as a list.

The next thing we do is ask the engine to construct a Phylogeny object from a network object.

           phylogenies[index] = engine.constructPhylogenyFromNetwork(network);

The NeXMLEngine class does all the computational work of constructing a Phylogeny object from a Network object.

The PhyloUtility class provides various kinds of commonly used utility methods. All the methods are static and properly documented, so you can have a look at them.

Then last is the PhylogenyFactory class which is nothing but a factory for new Phylogeny objects.

So it is now possible to read a very simple NeXML file and construct the corresponding Phylogeny object. However, the metadata attached to the nodes has still not been attached to the Phylogeny's nodes. This will be some challenge, as I'll have to discover methods of grabbing this information from the document. Currently the NeXML schema is undergoing a lot of simultaneous development.

The code is up on github. There is a utility main method in the class NeXMLReader; you can run it and see it in action. There are some sample files in the samples folder to choose from. Again, a reminder that I am running this on a Windows machine.

Thursday, June 2, 2011

Windows + Git: An Unnecessary Trudge

This ate up a bit of my time, so I decided to post it so that people may be aware of the issue in the future. The problem is that git cannot rename magic.java to Magic.java on a case-insensitive filesystem like Windows. Here is the solution:

git mv magic.java magic.xyz
git mv magic.xyz Magic.java

I have changed my plan a little. Since NeXML is taking time to get into my head, I will meanwhile implement Roderic Page's technique for making the tree edges follow the curvature of the earth.

Saturday, May 28, 2011

Support for NeXML

Let me tell you what the problem is.
I can always add extra columns to the csv file and add this information to the kml. So far so good, but if support for NeXML has to be provided, I should know the exact format of the NeXML data. It is important to note that both the csv file and the nwk file have to be replaced by a single NeXML file. From the NeXML manual page I learned how a basic nwk file is represented in NeXML format. I used the online newick-to-nexml converter, which can be found at nexml.org.

Meanwhile, I have received my welcome package from Google, but there's a mistake in my name on the card. So a bit of extra work here. :P

Sunday, May 22, 2011


What are cladograms?
Wikipedia is always the best friend for venturing into a totally unknown field: Cladogram. A cladogram is a diagram used in cladistics which shows ancestral relations between organisms. You can find plenty of them on Wikipedia; I have collected some, though I don't remember where from.

Here is a page on Phylogenetic trees which I had already looked at while preparing my proposal.

There is also another format called phyloXML, on which a lot of work has already been done. Here is a sample phyloXML file, taken from the same place.

What's a clade? A clade is a grouping of an organism and all its descendants.
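The definition translates directly into a recursive traversal: a clade is a node plus everything below it. A small sketch, using a toy parent-to-children map invented purely for illustration:

```java
import java.util.*;

// A clade is a node together with all of its descendants.
// The toy tree below (parent -> children) is invented for illustration:
//        root
//        /  \
//       A    B
//           / \
//          C   D
public class CladeDemo {
    static final Map<String, List<String>> CHILDREN = Map.of(
        "root", List.of("A", "B"),
        "B", List.of("C", "D"));

    // Collect the node itself plus everything below it.
    public static Set<String> clade(String node) {
        Set<String> result = new TreeSet<>();
        result.add(node);
        for (String child : CHILDREN.getOrDefault(node, List.of())) {
            result.addAll(clade(child));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clade("B")); // the clade rooted at B: [B, C, D]
    }
}
```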

GeoPhylo Engine is very similar to what I am building. It is different in the sense that it takes a phyloXML file and not a NeXML file to generate the kml tree.

Saturday, May 21, 2011


I went through the following tutorials in the last few days. The project is still forked from last year's repository. I will call this copy v1.0, aka the starting point of my project.
Here is some good information on .nwk files.

Please don't overlook the See Also and References sections at the bottom of the page; they also have pretty good material. There is also an example of a large phylogram with its Newick format representation.
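As a quick feel for the format, here is a sketch that pulls leaf names out of a Newick string by splitting on its structural characters. This is deliberately naive (it ignores quoting and comments, and would also pick up internal node labels if they were present), so it is only an illustration, not a real parser:

```java
import java.util.*;

// Naive sketch: extract leaf names from a Newick string by splitting on
// the format's structural characters ( ) , ; — real parsers also handle
// quoted labels, comments and internal node labels.
public class NewickLeaves {
    public static List<String> leaves(String newick) {
        List<String> names = new ArrayList<>();
        for (String token : newick.split("[(),;]")) {
            // Each non-empty token is a label, optionally followed
            // by :branchLength, e.g. "A:0.1".
            String label = token.split(":")[0].trim();
            if (!label.isEmpty()) names.add(label);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(leaves("(A:0.1,(B:0.2,C:0.3):0.4);")); // [A, B, C]
    }
}
```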

As far as the kml is concerned, here is what I went through. The kml is actually quite easy to learn. Now I have to focus on how JAK can be used to create kmls with the various kinds of rich features that kml supports.
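Independent of JAK, the structure a single placemark boils down to is small. A plain-Java sketch of it, with no external jars and with the names and coordinates invented for illustration (JAK builds the same structure through its object model):

```java
// Plain-Java sketch of the KML that one placemark boils down to.
// No external jars; JAK produces the same structure via its object model.
// Note KML coordinates are longitude,latitude (in that order).
public class KmlSketch {
    public static String placemark(String name, double lon, double lat) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n"
            + "  <Placemark>\n"
            + "    <name>" + name + "</name>\n"
            + "    <Point><coordinates>" + lon + "," + lat + "</coordinates></Point>\n"
            + "  </Placemark>\n"
            + "</kml>\n";
    }

    public static void main(String[] args) {
        System.out.print(placemark("rootNode", 77.1, 28.7));
    }
}
```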

Sunday, May 15, 2011

First Run & Cleanup

First Run:

I have been successful in getting the project to run. The program accepts command line arguments, so you can either enter the filenames as command line parameters or temporarily hard-code the names of the files in testMain.java.
On a windows platform you need to specify the paths as follows.

String intreeFile = "src\\testTree.nwk";
String coordFile = "src\\testCoords.csv";
String metadata = "n";

If you wish to use the metadata file, you will need to specify "y" in metadata and use the file testCoordsMeta.csv.
When you run the program after this, a kml file testfile.kml is generated.
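The filename resolution described above can be sketched like this: take the files from the command line when given, otherwise fall back to the hard-coded defaults. The paths and the "y"/"n" metadata flag mirror the post; the class and method names are illustrative, not the real testMain:

```java
// Sketch of the argument handling described above: command line values
// win, otherwise the hard-coded Windows-style defaults from the post.
// Names are illustrative stand-ins, not the real testMain.java.
public class ArgsSketch {
    public static String[] resolve(String[] args) {
        String intreeFile = args.length > 0 ? args[0] : "src\\testTree.nwk";
        String coordFile  = args.length > 1 ? args[1] : "src\\testCoords.csv";
        String metadata   = args.length > 2 ? args[2] : "n"; // "y" enables metadata
        return new String[] { intreeFile, coordFile, metadata };
    }

    public static void main(String[] args) {
        String[] cfg = resolve(args);
        System.out.println(cfg[0] + " " + cfg[1] + " metadata=" + cfg[2]);
    }
}
```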

Cleanup of the folder:

So finally I cleaned up the project folder. The following are the exact changes that I did. I will call this version 1.0 aka the starting point for my project.

  1. cd phyloGeoRef
  2. git rm -r .metadata (This will remove the metadata folder from version control)
  3. Now you can safely delete this folder.
  4. Modify the README, with the new instructions.
  5. git add . (This will stage all changes)
  6. Initially I was trying to add the extra packages as separate jars. But now I think that it is a good idea to include the relevant packages in the library itself so that the end user has minimum overhead of downloading extra jars. The jars required have been mentioned along with the sources in the README.
  7. cd phyloGeoRef (This is actually the netbeans project folder)
  8. git rm -r .settings/ (Deletes the .settings folder)
  9. git rm -r bin/        (Deletes the bin folder)
  10. cd src/
  11. git rm -r jak/        (This folder is not needed because we'll be including the JavaAPIforKml.jar)
  12. git rm -r javapaiforkml
  13. git rm -r NeXML/         (Will add the relevant packages instead of the whole folder)
  14. git rm -r opencsv-2.2/   (same as above)
  15. Now copy the following folders into src
  16. Copy the nexml folder to src/org (This package will provide support for nexml parsing)
  17. And copy folder au to src/ (This package will provide support for opencsv parsing)
  18. git add . (Stage the changes)
  19. git commit -m "first commit of the cleaned up folder"  (Do the first commit.)
  20. git tag -a v1.0 -m "This is the version 1.0"
  21. Then push the changes to the github repo.

Now the folder looks a bit cleaner. And it's still forked from last year's project. So a bit of relief for me now!

Saturday, May 7, 2011

Hello World phyloGeoRef

Hi all,
I am an undergraduate Computer Science student at IIT Ropar - India.
So this was the beginning of my GSoC project "phyloGeoRef". Before coding formally I had to get acquainted with github. My mentor had sent me a couple of good links for getting going with github easily. Here they are.

I forked the original repository, cloned it to my local machine and had a working copy of the project ready. There were some dependency problems which still held me back from running the project.