A scalable, parallel library for solving dense linear systems with the Conjugate Gradient method, designed to be NUMA-aware and cache-aware.
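The sketch below shows the core of a dense Conjugate Gradient solve in C with OpenMP. It assumes a symmetric positive-definite matrix stored row-major, which CG requires. The function names (`cg_solve`, `matvec`, `dot`) are hypothetical and are not this library's API. Two ideas are illustrated. NUMA awareness comes from first-touch placement: each work vector is initialized in parallel under the same `schedule(static)` used by later loops, so every thread's slice lands on its own memory node. Cache friendliness comes only from unit-stride row traversal. A production implementation would likely also block the matrix-vector product and first-touch the matrix itself the same way. Without `-fopenmp` the pragmas are ignored and the code runs serially.

```c
#include <stdlib.h>

/* y = A x for a dense, row-major n x n matrix A. With OpenMP enabled, rows
   are split across threads by a static schedule; otherwise the loop is serial. */
static void matvec(int n, const double *A, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        const double *row = A + (size_t)i * n;  /* contiguous row: unit-stride access */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += row[j] * x[j];
        y[i] = s;
    }
}

/* Parallel dot product using an OpenMP sum reduction. */
static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Hypothetical CG driver for symmetric positive-definite A. On entry x holds the
   initial guess; on exit, the approximate solution. Returns the iteration count,
   or -1 on allocation failure or if ||r|| > tol * ||b|| after max_iter steps. */
int cg_solve(int n, const double *A, const double *b, double *x,
             double tol, int max_iter) {
    double *r = malloc((size_t)n * sizeof *r);
    double *p = malloc((size_t)n * sizeof *p);
    double *Ap = malloc((size_t)n * sizeof *Ap);
    if (!r || !p || !Ap) { free(r); free(p); free(Ap); return -1; }

    /* First touch: each thread writes the slice it will use later (same static
       schedule), so the OS places those pages on that thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) { r[i] = 0.0; p[i] = 0.0; Ap[i] = 0.0; }

    /* r = b - A x, p = r */
    matvec(n, A, x, Ap);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }

    double bb = dot(n, b, b);
    if (bb == 0.0) bb = 1.0;            /* b = 0: fall back to an absolute test */
    double rr = dot(n, r, r);
    int k = 0;
    /* Stop when ||r|| <= tol * ||b||, compared in squared form (no sqrt needed). */
    while (k < max_iter && rr > tol * tol * bb) {
        matvec(n, A, p, Ap);
        double alpha = rr / dot(n, p, Ap);     /* step length along p */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;             /* keeps new direction A-conjugate */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        k++;
    }

    int converged = rr <= tol * tol * bb;
    free(r); free(p); free(Ap);
    return converged ? k : -1;
}
```

For example, with `A = {4, 1, 1, 3}` (the 2x2 matrix [[4, 1], [1, 3]]), `b = {1, 2}`, and a zero initial guess, `cg_solve(2, A, b, x, 1e-12, 50)` yields `x` close to (1/11, 7/11). In exact arithmetic CG reaches this in at most n = 2 iterations. Thread pinning, for instance via the standard `OMP_PROC_BIND` and `OMP_PLACES` environment variables, is what makes first-touch placement pay off, because it keeps each thread on the node that owns its slice.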