The Rise of Data Science Notebooks
Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala, Python and more. Although Apache Zeppelin is still an incubator project, I expect a serious boost of notebooks like Apache Zeppelin on top of data processing (like Apache Spark) and data storage (like HDFS, NoSQL and also RDBMS) solutions. Read more in my previous post.
Splice Machine replaces traditional RDBMSs
Splice Machine delivers an open-source database solution that incorporates the proven scalability of Hadoop, the standard ANSI SQL and ACID transactions of an RDBMS, and the in-memory performance of Apache Spark.
Machine Learning as a Service (MLaaS)
Machine Learning is the subfield of computer science that “gives computers the ability to learn without being explicitly programmed”. Within the field of data analytics, Machine Learning is a method used to devise complex models and algorithms that lend themselves to prediction; this is known as predictive analytics. The rise of the Machine Learning as a Service (MLaaS) model is good news for the market, because it reduces the complexity and time required to implement Machine Learning and opens the door to wider adoption. One of the companies that provides MLaaS is Microsoft, with Azure ML.
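To make "learning without being explicitly programmed" concrete, here is a minimal sketch of predictive analytics in plain Python: an ordinary least-squares line is fitted to a handful of observations and then used to forecast an unseen value. The data and the names `fit_line` and `predict` are made up for illustration; an MLaaS offering hides this kind of step behind a hosted service and is not implemented this way.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: advertising spend vs. sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

model = fit_line(spend, sales)   # the "learning" step
forecast = predict(model, 5.0)   # the "prediction" step
print(model, forecast)
```

Nobody wrote the rule "sales = 2 * spend + 1" into the program; it was recovered from the examples, which is the essence of the approach.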
Apache Spark on Kubernetes with Red Hat OpenShift
OpenShift is Red Hat’s Platform-as-a-Service (PaaS) that allows developers to quickly develop, host, and scale applications in a cloud environment. OpenShift is a perfect platform for building data-driven applications with microservices. Apache Spark can be made natively aware of Kubernetes with OpenShift by implementing a Spark scheduler backend that runs Spark application Drivers and bare Executors in Kubernetes pods. See more on the OpenShift Commons Big Data SIG #2 blog.
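As a rough illustration of what submitting to a Kubernetes-aware Spark looks like, the sketch below builds a `spark-submit` command line in Python. It assumes the native Kubernetes support shipped in upstream Apache Spark 2.3 and later (the `k8s://` master URL and the `spark.kubernetes.container.image` setting), which may differ from the scheduler backend discussed in the SIG. The API server address, image name, jar path and class name are all hypothetical placeholders.

```python
def spark_submit_args(api_server, image, executors, main_class, app_jar):
    """Build a spark-submit command for Spark's Kubernetes scheduler backend."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # Kubernetes (or OpenShift) API server
        "--deploy-mode", "cluster",          # the Driver itself runs in a pod
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Hypothetical values, for illustration only.
args = spark_submit_args(
    api_server="https://openshift.example.com:8443",
    image="example/spark:latest",
    executors=3,
    main_class="com.example.App",
    app_jar="local:///opt/app/app.jar",
)
print(" ".join(args))
```

The resulting list can be handed to `subprocess.run` on a machine that has a Spark distribution installed and credentials for the cluster.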
MapR-FS instead of HDFS
If you’re familiar with the HDFS architecture, you’ll know about the NameNode concept: a separate server process that tracks the locations of files within your cluster. MapR-FS doesn’t have such a concept, because that information is embedded in the data nodes themselves, so it’s distributed across the cluster. The second architectural difference is that MapR is written in native code and talks directly to disk. HDFS (written in Java) runs in the JVM and then talks to a Linux file system before it talks to disk, so there are a few extra layers that impact performance and scalability. Read more differences in the MapR-FS vs. HDFS blog.
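The first difference can be pictured with a toy model. The sketch below contrasts a single index that answers every "where is this file?" question with an index whose entries are spread over the nodes. It is only an illustration of centralized versus distributed metadata: the class names are invented, and the hash-based sharding is an assumption chosen for brevity, not how HDFS or MapR-FS actually locate data.

```python
import hashlib


class CentralIndex:
    """NameNode-style: one process holds the whole file -> node map."""

    def __init__(self):
        self.locations = {}

    def put(self, path, node):
        self.locations[path] = node

    def locate(self, path):
        return self.locations[path]


class DistributedIndex:
    """Every node holds only a shard of the file -> node map."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.shards = {node: {} for node in self.nodes}

    def _owner(self, path):
        # Assumption: pick the shard owner by hashing the path.
        digest = hashlib.sha256(path.encode()).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, path, node):
        self.shards[self._owner(path)][path] = node

    def locate(self, path):
        return self.shards[self._owner(path)][path]


nodes = ["node-a", "node-b", "node-c"]
files = {f"/data/part-{i}": nodes[i % 3] for i in range(9)}

central, spread = CentralIndex(), DistributedIndex(nodes)
for path, node in files.items():
    central.put(path, node)
    spread.put(path, node)

# Both answer the same question; only where the answer lives differs.
print(central.locate("/data/part-4"), spread.locate("/data/part-4"))
```

In the central variant every lookup goes through one process, which is both a bottleneck and a single point of failure; in the distributed variant the lookup load is shared by all nodes.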