Machine Learning (ML), and especially its subfield Deep Learning (DL), has seen remarkable advances in recent years and may lead to technological breakthroughs used by billions of people. Software development in this field is changing fast, with a great number of open-source packages coming from academia, industry, start-ups, and open-source communities. As a new computing model, DL with GPU support is changing how software is developed and how it runs. Nowadays, ML algorithms learn from huge amounts of real-world examples in a variety of formats. DL is about designing and training NNs. After a computationally expensive training phase, NNs can be deployed in data centers to infer, predict, and classify from new incoming data presented to them. Trained NNs can also be deployed into intelligent IoT devices to interpret the world around them. The deployment of trained NNs requires smaller computational resources than the development phase. This new computing model in the Big Data era requires massive data processing and massive parallelism support, capable of scaling computation effectively and efficiently according to actual need.
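To make the train-once, infer-many pattern concrete, the following minimal sketch (assuming PyTorch as the framework; the toy model, random data, and file name are illustrative, not taken from this document) trains a small NN and then reloads the saved weights for lightweight inference:

    import torch
    import torch.nn as nn

    # Illustrative toy model and data; a real workload would be far larger.
    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    x = torch.randn(256, 4)          # stand-in for a large training set
    y = torch.randn(256, 1)

    # Training: the computationally expensive phase, typically run on GPUs.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()              # backpropagation dominates the cost
        optimizer.step()

    torch.save(model.state_dict(), "model.pt")   # ship only the trained weights

    # Deployment: inference only, no gradients, far cheaper to run,
    # e.g. in a data-center service or on an IoT device.
    deployed = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    deployed.load_state_dict(torch.load("model.pt"))
    deployed.eval()
    with torch.no_grad():
        prediction = deployed(torch.randn(1, 4))  # classify/predict new data

The asymmetry is visible in the sketch itself: the training loop repeats forward and backward passes over the whole dataset, while deployment is a single forward pass over each new input.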
ML and DM are research areas of computer science undergoing fast-evolving development due to the advances in data analysis research in the Big Data era. While the number of ML algorithms is extensive and growing, the number of their realizations in ML/NN/DL frameworks and libraries is extensive and growing too. A short summary of the outcomes of this document follows.
It should be noted that using these kinds of tools is not the only way to build compute-intensive applications or to do data analytics and data mining. Self-made code can do the same job. The price of this approach is, of course, the time and effort spent on code development and maintenance.
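As an illustration of this trade-off, a from-scratch alternative to a single library call might look like the following sketch (a minimal, hypothetical logistic-regression trainer using only NumPy; all names and parameters here are illustrative), with the author then owning, testing, and maintaining every line of it:

    import numpy as np

    def train_logistic_regression(X, y, lr=0.1, epochs=1000):
        """Hand-rolled logistic regression via batch gradient descent."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            z = X @ w + b
            p = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
            grad_w = X.T @ (p - y) / len(y)     # gradient of the log-loss
            grad_b = np.mean(p - y)
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # Toy usage: two roughly separable clusters.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
    y = np.concatenate([np.zeros(50), np.ones(50)])
    w, b = train_logistic_regression(X, y)
    accuracy = np.mean(((X @ w + b) > 0) == y)  # should approach 1.0

The equivalent with an established library (for example, scikit-learn's LogisticRegression().fit(X, y)) is a single call; the custom version trades that convenience for full control, at the development and maintenance cost noted above.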
There is a challenge in managing multiple tools and multiple approaches from divergent ML/DL user communities in different application areas. The challenge is hard because it requires exposing a unified, comprehensive, efficient, and coherent platform that is capable of scaling computation dynamically and on demand. The combined impact of new computing resources and techniques, together with an increasing avalanche of large datasets, is transforming many research areas. This evolution has many different faces, components, and contexts, and our project DEEP Hybrid DataCloud will combine some of them to propose a new e-infrastructure framework able to address relevant challenges in research.