Although abstraction is the best way to manage the complexity of computing systems, implementation details sometimes have to be considered. In on-chip interconnection networks in particular, underestimating the specifics of the underlying system can have a non-negligible impact on performance, cost, or correctness. This paper presents an efficient router designed to address the particularities of cache-coherent chip multiprocessors in a balanced way. Employing the same principles of packet rotation structures as in the Rotary Router, we present a router configuration with the following novel features: (1) reduced buffering requirements, (2) an optimized pipeline under contention-free conditions, (3) a more efficient deadlock-avoidance mechanism, and (4) an optimized in-order delivery guarantee. Taken together, our proposal provides a combination of features that, to the best of our knowledge, no other router has achieved: (1’) low implementation cost, (2’) low pass-through latency under low load, (3’) improved resource utilization through adaptive routing and a buffering scheme free of head-of-line blocking, (4’) guaranteed coherence protocol correctness via end-to-end deadlock avoidance and in-order delivery, and (5’) improved coherence protocol responsiveness through adaptive in-network multicast support. We conduct a thorough evaluation that includes hardware cost estimation and performance evaluation under a wide spectrum of realistic workloads and coherence protocols. Compared with VCTM, an optimized state-of-the-art wormhole router, our proposal requires 50% less area, reduces the energy-delay product of the on-chip cache hierarchy by 20% on average, and improves the performance of cache-coherent chip multiprocessors under realistic working conditions by up to 20%.