Thursday, July 28, 2011

Sign of 180 Resolution

This is for the case when maxDistance (the distance between maximally separated nodes is 180).

Then there arises an ambiguity is choosing the globalMaxAngle as +180 or -180 in the transformed positions. And it does make a difference to the answer.

Consider for example the case of 3 positions. 10, -170 and -110.

The maximally separated nodes are 10 and -170. Suppose 10 is chosen as origin and clockwise direction is positive. Then

Case 1: -170 ----> 180         (where ----> denotes transformation)

mean = (-120 + 180 + 0 )/3 = 20


Case 2: -170 ----> -180

mean = (-120 + -180 + 0)/3 = -100

-100 seems to be a better choice because migration distances are smaller this way and it adheres to my policy of "keep the parent node at a position within the bounds of the end nodes"

To achieve this effect, I can count which hemisphere has more number of nodes. If the negative hemisphere has more nodes then its -180 and vice versa.

How to find whether a point is in the +ive or -ive hemisphere, that is given in the last post.

Wednesday, July 27, 2011

Transformation Rules

Once I find the two positions which are maximally separated namely globalAngleZero and globalAngleMax, I need to transform the origin to globalAngleZero with positive sense towards the shortest path direction towards globalAngleMax.

maxDistance is the distance between these two positions.

Case 1: maxDistance = 180

For all position in posVector

         let L = minimumAngularDistance(position, globalAngleZero)
        
         if (position >= globalAngleZero OR position <=globalAngleMax)
                do nothing              

         if(position < globalAngleZero OR position > globalAngleMax)
               L = -L (invert the sign)

         sigma += L

meanPos = sigma/n (where n is the length of posVector)

The actual mean position is obtained by adding meanPos to globalAngleZero in a positive sense.


Case 2: maxDistance < 180

For all position in posVector

        let L1 = minimumAngularDistance(position, globalAngleZero)
        let L2 = minimumAngularDistance(position, globalAngleMax)
         
        if (L1 + L2 = maxDistance) (if this point lies inside the region bounded by end points)
              sigma = sigma + L1

        else
             sigma = sigma - L1

meanPos = sigma/n (where n is the length of posVector)

The actual mean position can be obtained by adding meanPos to globalAngleZero in clockwise and anti-clockwise sense and that position is choosen for which L1+L2= maxDistance condition is satisfied.

Tuesday, July 26, 2011

This is hard !!!

So I encountered some problems in my centroid calculation algorithm, which I am trying to correct but I really wanted to see why am I implementing the algorithm that I have mentioned in the last to last post rather than the naive algorithm.

This gives a time complexity of O(n log n) over naive O(n^2) which would have certainly been easier. But I think this would save appreciable computational cycles.

Assuming that we are working with 4000nodes. And considering the average number of children a node has to be 8. Then here are the comparisons.

Naive Algorithm.
64 X 4000 = 256,000


New algorithm.
24 X 4000 =  96,000

Here are the steps in the new in-centroid calculation.

  1. Get the maximally separated nodes.
    (This is the part where I really used some clever techniques.)
  2. Choose one of them as zero and the other's direction as the positive x-axis.
  3. Transform the coordinates.
  4. Calculate the normal mean on the transformed coordinates.
  5. Transform back.

This  is difficult to code, requires extreme patience and there are so many cases ; I need to finish it fast.

Saturday, July 23, 2011

Failure of a simple weighted mean algorithm

I will demonstrate this with an example. The thing that I was trying to do was to make the distances exactly proportional to the edge lengths. This is not possible always.

Take a very simple example in 2D, i.e a plane.

Suppose you have 4 points A,B,C and D on the plane and you want to find out a point such that the distances of these points from that central point is in the same proportions.
It is very easy to see that this is not always possible. Well to start you can choose any 3 points and draw a circle through them but now you got yourself in a problem. It is not necessary that D is also concyclic to the same circle.
Further things in spherical geometry make things even worse.

I need some other idea to use/represent this information. Had it been an animation then things would have been different.

Wednesday, July 20, 2011

Things Done and Things Undone !

Implemented denotes what is already implemented in phyloGeoRef.
TODO: demotes that things that are must to be done.
Planned: Denotes a change in implementation or adding additional features.
  1. TREE FORMATs (Files containing the tree data)

    Implemented: newick (.nwk) , nexml (.xml) , phyloxml (.xml) , nexus (.nex), (.tolxml)


  2. METADATA FORMATs (Files which contain the spatial data)

    Implemented: csv, txt

  3. NODE POSITIONING

    Implemented: Simple Mean
    Planned: Weighted Mean for trees with edge lengths
    Bug: Incorrect positioning algorithm. Modify it and make it efficient.

  4. TUTORIAL
    Posted: https://github.com/dapurv5/phyloGeoRef/wiki/Tutorial:-Using-the-GrandUnifiedReader,-adding-new-properties.
    Planned: Tutorial on writing the main class. (1/2 day)

  5. ERROR CHECKING:
    Implemented: Error checking for implausible lat/long
    Implemented: Error due to absence of information given in the tree but not in the metadata file.
    Planned: Checking corner cases like when all the child nodes of a parent node have missing locations.
    Planned: Checking clade consistency.

  6. WRITING KML
    Implemented: Display of tip nodes on the map and the skeletal structure of the tree.
    Implemented: Levelwise slicing of the tree in folders.
    Planned: Hierarchical slicing of the tree into folders.
    TODO: HTML balloons associated with each node.
    Planned: Associating regions with HTUs also.
    Planned: Animation.
    Planned: Compressing the kml file as kmz

  7. EXTRA DATA
    Planned: Taking images and other rich data from external web services.
    Planned: Preparing a local cache for this data since downloading data for thousands of species on each run of the program would be a time wasting process.
    Planned: Writing a command line option parser for the main class.

Pseudo Code for the new Reconstruction Algorithm

When we traverse the tree post order to assign coordinates to each of the internal nodes. Do this for each of the internal nodes.

Create 4 buckets 1 for each quadrant on a circle.
Also create four flags 1 for each quadrant. A true value of a flag for a quadrant indicates the presence of atleast one node in that quadrant.

Step 1:

for each child node, child of parent node.
   Putting the node in the appropriate bucket depending on its quadrant.
   Mark the flag true for that quadrant.


Step 2:

Sort each bucket so that the minimum values are towards the top or the beginning in the bucket. Use the sorting algorithm discussed in the previous post. If you have to sort negative numbers sort mod of them and then reverse the order.
It is to be noted that since you are sorting only within a bucket all elements within the same bucket have the same sign. This takes O(d) time where d is the maximum degree of any node in the phylogeny.


Step 3:

def distance (bucket a, bucket b):
   the maximum distance between any two nodes in these buckets.

maxDistance = distance(bucket 1, bucket 3)
maxDistance = max { maxDistance, distance(bucket 2, bucket 4) }



maxDistance = max { maxDistance, distance(bucket 1, bucket 2)}
maxDistance = max { maxDistance, distance(bucket 2, bucket 3)}
maxDistance = max { maxDistance, distance(bucket 3, bucket 4)}
maxDistance = max { maxDistance, distance(bucket 4, bucket 1)}

I was trying to find out a general formula interms of variables that saves all this but my attempts in that direction didn't yield any feasible results. So now I believe that no one universal formula exists which can capture the centroid location for all the nodes on the globe so therefore these elaborations.

Once we find the 2 nodes between which we have the maximum distance we can transform the coordinates so as to make the coordinate of the one of these nodes zero. 

Calculating the Centroids: Correction

This is due to David and what he does is really very clever. There seemed to be an anomalous behavior in just taking the mean (+ some additional details) for the coordinate of the parent node.

Let me first describe what the problem is and how Dave's solution comes to the rescue.

So suppose you have 3 points A, B, C on the equator with longitudes -80, -170 & 160 respectively. If you take their simple mean that comes out to be -30.
Take the point diametrically opposite this point on the globe, that comes out to be 150. Now 150 and -30 are two potential candidates for being chosen as the mean. We choose 150 because the sum of its distances to A, B and C along the equator is minimum.
Notice that neither of 150 and -30 are between the child nodes on the map. This is weird and in fact this is wrong.

So here is what Dave does. Take the three points and find the points which have the maximum separation between them. This maximum separation is always less than 180 for otherwise we could have reached from the other side of the globe.

Finding this pair of maximally separated points is computationally expensive via the naive method of pairwise comparison. O(n^2)

In our example the pair of maximally separated points are -80 and 160, the distance between them being 130.
Now transform the axis such that -80 is the new origin. The transformed coordinates become 0, 90 and 120. These are obtained by adding to 0 the distances to the respective points A, B and C

Once this is done calculate the mean. The mean is 70. Transforming back you get -150.
This is between the child nodes.

Take another example, rather a difficult one. (-20, -180, 30). Now let's apply this new method on it.
Take the maximally separated nodes. -20 and -180. Now transform -20 -> 0, -180 -> 160 and 30 -> -50.
Take the mean.
110/3 = 36.66
Transforming back we get, -56.66.

Indeed the required angle.


In planar geometry this point would be the centroid of the polygon formed by the child nodes. I believe though it is only my intuition that this is also the centroid of the spherical polygon formed on the surface of the earth by the tip nodes.

Monday, July 18, 2011

The Midterm: On a philosophical note

I have reached a point in my project which is officially called as midterm evaluation of the project. The mentors seem to be as much as satisfied with my work as I have been.


Here is what has gone by in the project. The first and foremost step towards this construction was to capture the data in a kind of abstract entity. And this abstract entity was a Phylogeny.
The Phylogeny has been modeled as a Java class. Perhaps one of the boons  of the object oriented design principle. Now the data which was to be used in the construction of such a phylogeny comes from various sources in a diverse formats. So it became customary to design a UniversalReaders for both the spatial and tree data files and then further unify both into a GrandUnifiedReader which would read everything and construct a Phylogeny out of it.
It could happen and it does happen that not all the data finds a spot in the design of the Phylogeny class in the forester library so a kind of containers or External moulds called "PhlyogenyMould" were associated with each entity capable of storing information.

So far so good. The next task ventured into the bio-geographical reconstruction of the tree or the processing and validation of the tree. After all we only know the position of the species that exist today. The interior hypothetical taxa need somehow be assigned their  coordinates.
The method employed here was a simple one namely; taking the mean, (would be later changed to weighted mean for trees with time ratios given) with due consideration of taking the mean position that offers a minimum distance of migration from the child nodes.
This method however does present some weird patterns and can be further improved at the expense of computational cycles needed to complete the reconstruction. Indeed a scope for further improvement.

All the data has been abstractly embedded into a Phylogeny. Now in the visualization portion comes the part of something visually pleasing to the eyes, the drawing of the tree as a kml.
There are many kml features that I employed that lead to better visualization. Separating clades via colors, adding level of detail feature via regions, tessellating the lines even at an altitude above the earth. etc.

All this has begun to give a tangible form to the original idea from where the project began.
As far as my personal opinion is concerned the major portion of the project is over and I only need to polish things up. Patch a few undone things here and there. And then go on to incrementally improve it as much as time permits.

Perhaps of the most elusive incorporation's that can be done in this project are the animations but I don't know exactly how and where can they be placed to create the magic effect.

Altogether it has been a great experience till now.


Friday, July 1, 2011

Recent Log

So here is what is going through recently into my project. I had been able to read a phylogeny. Now I have also completed PhylogenyProcessor.  So now I have at hand a drawable phylogeny.

The parent nodes have been assigned the mean coordinates of their children. I am thinking to modify it to a weighted average so that the length of the edges correspond to the time span.
If clades are specified then each clade is assigned a new random color. Any parent node's color is the arithmetic mean of the colors of its children.

The code is up on github so you can have a look at it.

My current efforts are directed in coding the KmlWriter classes that will draw the phylogeny onto the kml. Presenting 4000 nodes all at once leads to clutter and chaos on the map. I am thinking of ways to prevent such a catastrophe. Ideas are welcome !
Some ideas I had are as follows.


  1. Build each level of the tree into a separate folder, this way user can select what stuff he wants to see and what he/she does not want to.
  2. The second was to display it cladewise. But this seems to be a futile idea because the cocept of clade begins to fade out as we move up the tree. So the HTU's belong to which clades when their children belong to different clades is indeed a vague question. So I have dropped this idea.
  3. The third is to put the placemarks in the map at various different levels in the map.