Enterprise Data Management



By: Sean Mikha (smikha@gmail.com), February 7th, 2013


In writing this article, I began my research with the goal of understanding the ‘Big Data’ landscape and of covering some of the paradigms, such as NoSQL and In-Memory, that are being introduced as new technologies. However, as I thought about it more and more, I concluded that ‘Big Data’ really means nothing more than a catchphrase for the new technology architectures and tools that will be needed to solve tomorrow’s business challenges and stay competitive in the marketplace. What we are really talking about, then, and what we have always been talking about, is not 'Big Data' but the future of 'Enterprise Data Management'.


The terms ‘Enterprise Data Management’ and ‘Big Data’ are used interchangeably throughout this article; however, I am of the opinion that 'Enterprise Data Management' describes the real subject matter at hand, and that 'Big Data' is only a subset of it.


Moving away from industry buzzwords, I had the opportunity to focus my attention on the subject itself. Thus my goal for this research article was to start by properly defining 'Enterprise Data Management' (EDM). I came to the conclusion that, in its simplest form, EDM is the act of solving a problem.


When considering EDM, there are two key factors to solving a problem: 1) the input to the problem, or the data that needs to be captured (i.e., enterprise data), and 2) what needs to be done with that data in terms of processing, which is what we consider ‘management’.




What is Enterprise Data Management?


Enterprise Data Management refers to an organization’s ability to capture and process data that is relevant to their business.


I have defined EDM throughout this article in terms of two specific dimensions of scale:

   

CAPTURE:  Refers to the size of the data or input that is relevant to the business and needs to be captured.

  

PROCESS:  Refers to the complexity of the problem that is relevant to the business and needs to be answered through processing.




Dimensions of EDM



To set the stage for the Enterprise Data Management Solution Model, we must first define the dimensions of scale on which we will measure each solution. In its most basic form, any solution can be summed up in two key dimensions:


Size of Data / Input: The size of data or input that the solution must ‘act upon’. 


With Enterprise Data Management in mind, all data that is captured in the enterprise can be valuable in meeting the strategic needs of an organization. In this dimension we measure scale in bytes, and have chosen the magnitude of the scale to be in the terabyte-to-petabyte range, which most closely matches the enterprise today. Notice that as we move into the future these numbers shift accordingly, and we slowly move into the exabyte and zettabyte ranges, just as in the past we moved from the megabyte to the gigabyte range.


Complexity of the Problem: The complexity of a problem is measured by the amount of computational processing required, as well as by the algorithm used to solve it.


After the size of the input to the business question has been identified, we must understand the complexity of the problem and what is needed in terms of processing to make sense of that data in a meaningful way. In this dimension of scale there are two measurements that must both be considered in terms of the resources required to solve the problem. On one hand, we have the sheer unit of work used to process a complex problem, measured in floating-point operations (and, expressed as a rate, in Floating Point Operations Per Second, or FLOPS): essentially the number of steps required to solve the problem. On the other hand, we have the algorithm used to solve the problem, which can vary wildly and can achieve great leaps in reducing the number of steps needed to solve the exact same problem, as the sketch below illustrates.
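
To make the algorithm side of this dimension concrete, here is a minimal Python sketch (not tied to any vendor or tool discussed in this article) that answers the same question over the same input in two different ways; the size of the input is identical, but the number of steps differs by several orders of magnitude.

```python
# A minimal sketch of how the choice of algorithm changes the number of
# steps needed to answer the exact same question: "is x in this sorted list?"

def linear_search(sorted_values, target):
    """Scan every element: roughly n steps for n values."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            return True, steps
    return False, steps

def binary_search(sorted_values, target):
    """Halve the search space each step: roughly log2(n) steps."""
    steps = 0
    lo, hi = 0, len(sorted_values)
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return True, steps
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, steps

data = list(range(1_000_000))          # one million sorted values
print(linear_search(data, 999_999))    # (True, 1000000) -- one million steps
print(binary_search(data, 999_999))    # (True, 19)      -- about twenty steps
```

This is why the complexity dimension must account for both the raw work performed and the cleverness of the algorithm chosen to perform it.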


To put the scale into perspective, we identify two key areas on this graph:


    1) In the extreme case of data size, we have identified the study published in the journal Science, which estimates all of humanity's stored information to be in the range of ~295 exabytes (2007, BBC). 


    2) In the extreme case of complexity of problem, we have identified the Titan supercomputer, which has the capability to process up to 20 petaFLOPS (2012, TITAN). This power is used to solve the most complex of problems, from climate change, to new pharmaceutical drugs where the relationships of complex molecules are modeled in a virtual environment, to the creation of new types of efficient fuels. 





Business Domains



Once the dimensions for the Enterprise Data Management (EDM) Solution Model have been set, we are able to define the 4 quadrants that represent the business domains:


1) Enterprise Transactions: The lower-left, and starting, quadrant for the business includes the key technology enablers for running the business. These are highly structured systems and processes that are needed to run a successful business. These tools serve as the foundation for running many enterprise business processes, from financials to ERP and more.


2) Enterprise Analytics: From the lower-left quadrant of Enterprise Transactions we move over to Enterprise Analytics. These are the key technology enablers that are used to process and understand the relationship with the customer, as well as how we are running the business. The business areas interested in this quadrant range from marketing to finance, and span the entire enterprise organization. Generally we find leaders using tools in this space that involve advanced analytical practices, such as predictive models for risk and fraud, and behavioral analytics to understand the customer decision-making process.


3) Enterprise Interactions: From the second quadrant, analytics, we move to the interactions quadrant, which has only recently taken off, driven by the proliferation of online consumerism. This refers to the many new forms of interaction taking place between an enterprise and its customers. These can come from shopping carts, emails, call center detail records, and web logs. The key channel that has driven a large portion of this data is known as the ‘metadata’ channel and is primarily rooted in the mobile medium. Not only are the interactions between customers and employees driving this space; we are also seeing the interactions between customers and the different points of an enterprise’s services, and the machines that support those services, used for analysis. For example, think of the data generated and passed over a Wi-Fi network for healthcare patients: many new healthcare devices are sending out status checks of their systems across the network, which can be mined by a central system to improve the efficiency of service and operations.


4) Competitive Advantage: Finally, moving to the fourth and final quadrant within the solution model, we see Competitive Advantage, where Enterprise Transactions, Enterprise Analytics, and Enterprise Interactions are combined to create the edge in a competitive marketplace. In effect, we move from capturing enterprise interactions to managing enterprise interactions, by capturing those interactions and processing them through advanced analytics. We also move from processing the relationship with the customer to truly understanding the relationship with the customer, when we are able to apply advanced analytics to a larger data set that involves multiple points of interaction with the customer. Finally, when we are able to manage our relationship with the customer and truly understand it (as well as understand ourselves through an efficient business, i.e., quadrant one), we are able to shape the outcome of that relationship such that satisfaction is mutually achieved.



Business Drivers


Taking the business domains defined in the previous section, let’s tie those concepts to the business drivers in each domain:


Enterprise Transactions: Driven by an increase in productivity through business process, business collaboration, and efficiency in all business operations.


Enterprise Interactions: Driven by an increase in sales, typically achieved by managing customer interactions through cross-sell, up-sell, and churn management, and by capturing the many different channels and touch points that occur between an organization and its customers. This quadrant is dominated by the CMO of the enterprise, and is evolving toward a combined CMO / CIO role as the marketing budget for IT and analytics rises to surpass the traditional IT budget. In this quadrant we will see the capture of digital channels from email, telephone/call detail records, Point of Sale (POS), social, and mobile. With this, organizations will have the ability to achieve a 360-degree view of the customer, which, when coupled with enterprise analytics, will give organizations the competitive advantage they need in the marketplace.


Enterprise Analytics: Within the Enterprise Analytics quadrant we see the use of algorithmic approaches and tools to create and design strategic initiatives that generally involve a reduction in costs. These can range from predictive models that determine risk and fraud, to behavioral analytics that help us understand the customer relationship better. They are not limited to reducing cost, however; both the analytics and interactions quadrants are involved in driving an increase in sales and a reduction in costs, often required simultaneously.


Competitive Advantage: Finally, we have the last quadrant, whose business drivers result in competitive advantage: the combination of an increase in sales with a reduction in costs, and at times a unique customer experience that cannot be matched by any competitor in the industry. Here we again see the emergence of the CMO / CIO, as the future demands the ability to execute successfully on a number of fronts: customer segmentation, a 360-degree view, managing the digital channel with one voice to the customer, and the combination of behavioral analytics and actionable insight, such that we can interact with our customers through intelligent human interaction.




Industry Adoption


The industry adoption research closely matches what we see in other models of IT technology adoption:

In general, growth is spurred by research and the sciences. From there, new breakthroughs are commercialized by vendor start-ups that are able to define the technology in an industry use case. Next, the start-up vendor’s technology is fueled by projects deployed at the national and governmental level, and finally many of the technologies are adopted and expanded by industry leaders. In other cases (as with Hadoop), the industry leaders themselves, such as financial and .COM organizations, are the ones creating these technologies.

The adoption of these new technologies is then found in commercial use cases deployed by industry leaders in their segment. After being adopted by industry leaders, these technologies and approaches become traditional tools of the enterprise. Notice that there is a natural gap between science and the industry leaders it feeds into. Also notice that the gap between science and industry leaders on the horizontal axis (complexity of problem) is greater than the gap between the two on the vertical axis (size of data input).



Real-World Use Cases



Considering real-world use cases within the first quadrant (lower left), we often see the cited example of the Integrated Data Warehouse. In this day and age, building a complete Enterprise Data Warehouse (EDW) has become a pipe dream, because, generally speaking, EDWs are costly to create and require an ungodly amount of agreement amongst multiple lines of business (LOBs).


However, the Integrated Data Warehouse (a subset of the EDW) is a key enabler that allows organizations to become industry leaders by integrating core subject areas such as membership, sales, marketing, and financials to achieve efficiency, lower costs, and better sales strategies in the marketplace.


The second quadrant (lower right) provides the analytical models many industry leaders have been utilizing for over a decade. Within this space we see tools such as SAS being deployed to create predictive models that determine a customer's propensity to take either profitable or costly actions.


Within the 3rd quadrant (upper left) we find some of the technologies pioneered by the .COM companies that are now essential to life in a post brick-and-mortar world. Capturing and processing web logs has become crucial to understanding customer interaction. Within this quadrant we mostly see Hadoop and NoSQL installations deployed to comb weblogs and identify customers, in order to provide effective sales through cross-sell, up-sell, market-basket affinity, and recommendations. More recently, with the onslaught of social media and viral product marketing, we are starting to see sentiment analysis used to understand customer satisfaction with current marketing campaigns and with the products and services offered by the organization.


Finally, in the fourth quadrant, where scientific research and industry leaders reign supreme, we see a combination of all three previously stated quadrants used to penetrate the market and increase user satisfaction and experience. A primary example is IBM Watson. Watson is based on a foundation of Linux and Apache Hadoop, with an application rich in artificial intelligence and analytics. The IBM Watson platform has penetrated the healthcare market, where it is used to understand a patient's medical condition by combining EMR, pharmaceutical data, and doctors' notes to derive a holistic view of the patient and deliver a better quality of care at a lower cost.


In its most extreme case, we have all of these technologies combined yet again to support scientific endeavors such as those being computed by the Large Hadron Collider project. Among this project’s many goals, it is tasked with proving the existence of particles such as the Higgs boson (the 'God particle').




Architectural Approaches


Within the landscape of the quadrants for business domains and drivers, we devote the next section to researching the architectural approaches that have been adopted by the industry and are currently being used to effectively address the business needs in each area:

 

Relational

In the first quadrant we find Relational, which has been the dominant architectural approach to running a business for decades. Relational systems provide a structured representation for defining business rules and processes. In this quadrant we find the business problems that involve moderate size in terms of data input and moderate complexity in terms of the processing required to solve the problem. The core technologies we find in this quadrant, also known as enterprise transactions, are operational database systems that have been implemented for Online Transaction Processing (OLTP) access on Symmetric Multi-Processing (SMP, meaning a single server) architectures.

 

Operational

Traditional relational database systems are used to run many lines of business and business processes within an organization. Multiple applications have been developed on top of traditional OLTP systems and are engineered to work quite well as systems for 'getting data in' and as data marts for simple reporting. These technologies do not do well in decision support environments, which require complex query join analysis. Additionally, these systems are not traditionally built to scale: they are built on SMP architectures and follow a SCALE UP approach with heavy indexing for performance. A minimal sketch of this access pattern follows.
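
As a rough illustration of the operational pattern described above, the following Python sketch uses the in-memory SQLite engine purely as a stand-in for a production OLTP database; the schema, index, and values are hypothetical.

```python
# A minimal OLTP-style sketch: short, indexed, transactional writes and
# single-record lookups. SQLite stands in for a production operational DBMS;
# the table, index, and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL    NOT NULL,
        status      TEXT    NOT NULL
    )
""")
# Heavy indexing for fast record-level access, typical of SCALE UP OLTP systems.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# A short transaction: the basic unit of work in an operational system.
with conn:
    conn.execute(
        "INSERT INTO orders (customer_id, amount, status) VALUES (?, ?, ?)",
        (42, 19.99, "NEW"),
    )

# Indexed point lookup: 'getting data in' and pulling single records back out.
row = conn.execute(
    "SELECT order_id, amount FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(row)
```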

 

Analytical

Analytical relational systems are built to run Online Analytical Processing (OLAP) and have been implemented as either SMP or MPP (Massively Parallel Processing) architectures. SMP systems scale through SCALE UP methodologies: increasing a single server's RAM, the number and speed of its cores, and its disk I/O subsystem (HDD, SSD, SAN, direct-attached, etc.). MPP architectures, by contrast, SCALE OUT by adding more nodes/servers, and are typically linearly scalable because they implement a shared-nothing architecture. However, MPP systems require costly software to implement the architecture, primarily dependent on a messaging protocol between the nodes. In addition, MPP architectures generally require an advanced OPTIMIZER to use the SCALE OUT node technology efficiently.

New quasi-MPP architectures have also entered the arena quite recently, with newcomers such as Oracle Exadata, which deploys an MPP (SOUTH) side for the disk I/O subsystem but still contains inherent scale issues on its SMP (NORTH) side, which is Oracle RAC. Other new technologies also utilize MPP paradigms with new architectures such as columnar. In the case of columnar we gain strong capabilities for complex analytics that involve aggregations, with the tradeoff of reduced performance for ETL, transactional processing, and row-oriented operations, as the sketch below illustrates.
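
To make the columnar tradeoff concrete, here is a small, self-contained Python sketch (not modeled on any particular vendor) contrasting a row-oriented layout with a column-oriented layout of the same hypothetical sales data: the aggregation touches only one column in the columnar layout, while appending a new record is simpler in the row layout.

```python
# Row-oriented vs column-oriented layouts of the same hypothetical sales data.

# Row store: each record kept together -- cheap to append, but an aggregate
# over one field still walks every whole record.
rows = [
    {"region": "WEST", "product": "A", "amount": 120.0},
    {"region": "EAST", "product": "B", "amount": 75.5},
    {"region": "WEST", "product": "A", "amount": 60.25},
]
total_row = sum(r["amount"] for r in rows)

# Column store: each field kept together -- the same aggregate touches only
# the 'amount' column, but appending a record means updating every column.
columns = {
    "region": ["WEST", "EAST", "WEST"],
    "product": ["A", "B", "A"],
    "amount": [120.0, 75.5, 60.25],
}
total_col = sum(columns["amount"])

def append_record(cols, record):
    """Row-at-a-time insert against a columnar layout: one write per column."""
    for name, values in cols.items():
        values.append(record[name])

append_record(columns, {"region": "EAST", "product": "A", "amount": 33.0})
print(total_row, total_col, len(columns["amount"]))
```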

Analytical relational systems are typically built for high-concurrency, analytic processing of complex queries with large joins, and are designed for "getting data out" within decision support and strategic environments that can span multiple subject areas in the enterprise.

 

 

Algorithmic

Next in the 2nd quadrant we find the Enterprise Analytics area where we see the approach is to use Algorithmic solutions to solve the different business challenges that arise. When we refer to an algorithmic solution we are signifying that there are many ways to solve a business problem and in this arena it is the unique and innovative approach that results in analytical advantage.

 

In-Memory

With In-Memory we refer to technologies that have been implemented as in-memory databases or other in-memory applications. These can range from relational implementations (SAP HANA) to non-relational and analytical components (TIBCO Spotfire, SAP HANA), all with the goal of exploiting the low-latency, ultra-high-speed performance of memory-resident hardware.
Architectures for these systems typically rely on high compression ratios to reduce the cost associated with in-memory storage, and have typically been implemented on NUMA architectures that provide some, but limited, scalability.

 

Statistical

For the next tool in the Enterprise Analytics quadrant we have the 800 lb. gorilla, which will remain so for the foreseeable future: statistical tools. Here we refer to the ever-present SAS and SPSS (along with R).

The need for high-end algorithms and statistical tools developed by SAS, and executed through the processing of data sets across the organization, will not be going away anytime soon. However, you will see a move to high-end appliances as SAS and its competitors look to gain additional sources of revenue and solve ever-changing business problems, as they have done with the move to the HPA in-memory appliance.
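
As a flavor of the kind of computation these statistical tools perform, here is a minimal pure-Python sketch of an ordinary least-squares fit of one hypothetical variable (marketing spend) against another (sales); real statistical packages such as SAS, SPSS, or R add vastly more, but the underlying arithmetic is of this sort.

```python
# A minimal ordinary least-squares fit, the building block behind many of the
# predictive models built in statistical tools. Data values are hypothetical.

spend = [1.0, 2.0, 3.0, 4.0, 5.0]      # e.g. marketing spend (millions)
sales = [2.1, 3.9, 6.2, 8.1, 9.8]      # e.g. resulting sales (millions)

n = len(spend)
mean_x = sum(spend) / n
mean_y = sum(sales) / n

# slope = covariance(x, y) / variance(x); intercept follows from the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, sales)) / \
        sum((x - mean_x) ** 2 for x in spend)
intercept = mean_y - slope * mean_x

def predict(x):
    """Score a new observation with the fitted model."""
    return intercept + slope * x

print(f"sales ~= {intercept:.2f} + {slope:.2f} * spend")
print(predict(6.0))   # predicted sales at a spend of 6.0
```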

 

Visual

Next we have the set of Visualization tools that provide a new algorithmic and user approach to solving enterprise business questions. They typically deploy a wide range of architectures from in-memory capabilities, to push-down optimization with the leading database vendors. These tools are geared toward business problems that involve a geospatial representation.

 

MapReduce

Finally, we end the quadrant with one of the lowest-level, and also one of the most powerful, analytic tools: MapReduce. MapReduce is a programming framework, typically implemented through Java, used to execute code in parallel across many nodes of commodity hardware. Within the model you will notice that on the data-size dimension MapReduce has a lower rating, because it is not until MapReduce is coupled with a distributed storage platform (such as HDFS) that it has the ability to process large amounts of data. When we compare MapReduce alone to the other approaches in the algorithmic quadrant, we see that the visual and statistical tools both incorporate their own mechanisms for storing the data they process. A minimal sketch of the programming model follows.
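
The following single-process Python sketch illustrates the map and reduce phases on a toy word-count problem; in a real deployment the map calls run in parallel on many nodes against data in distributed storage, but the shape of the computation is the same.

```python
# Word count expressed in the MapReduce shape: map each record to (key, value)
# pairs, shuffle/group by key, then reduce each group. Runs in one process
# here; a cluster framework parallelizes exactly these phases.
from collections import defaultdict

documents = [
    "big data is enterprise data",
    "enterprise data management",
]

def map_phase(doc):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in doc.split()]

# Shuffle: group all emitted values by key across the mapped records.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 1, 'data': 3, 'is': 1, 'enterprise': 2, 'management': 1}
```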

 

 

NoSQL (non-relational, semi-relational)

Moving to the 3rd quadrant, for Enterprise Interactions, we have the NoSQL movement, or architectural approach:

Here we find a broad set of database management systems that do not adhere to the relational model. NoSQL databases are generally not built on tables and do not use SQL as the language for interaction (although variations thereof exist). We define NoSQL as 'Not Only SQL'; it does not mean 'no SQL'. The purpose of the different types of NoSQL implementations is to address the variations we see in types of data, affectionately known as the 3 V's (coined by Philip Russom of TDWI, referring to the characteristics of Big Data: Variety, Volume, and Velocity). In this space of customer interactions, many of the new data sets being introduced to the enterprise today do not fit the traditional relational model of rows and columns.

In fact, with the onset of mobile, many of the touch points now occurring across the enterprise, its services, and its interactions with customers do not fit any relational paradigm at all, and are often ever-changing. As such, a new set of tools has been developed that can store this data as it arrives (at high velocity and in high volume, two of Philip Russom's three V's) and can store it in multiple forms, the third V, variety (from weblogs, to surveys, customer emails, instant messages, etc.).

NoSQL databases are known for fast record retrieval and writing, with high performance and scalability. NoSQL databases are also highly specialized, without general-purpose functionality and with little to no BI capability. These systems are used for large quantities of data that do not need a relational model for representation. 

NoSQL is not considered MPP; it does not include the management and inter-communication framework of the traditional data warehouse vendors. At its most basic level the architecture uses sharding, and because of this, joins between data sets or objects across nodes are not normally allowed in NoSQL databases. NoSQL databases are distributed and highly available but generally do not adhere to ACID; instead they rely on eventual consistency, which means that, given enough time, all updates will propagate through the system as if the operations had been applied serially rather than concurrently. A minimal sketch of the sharding idea follows.
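
As a rough illustration of why cross-node joins are awkward in this model, here is a minimal Python sketch of hash-based sharding: each record lands on a shard chosen purely by its key, so single-key reads and writes stay local to one shard, while any operation that spans keys must touch every shard. The shard count and key names are hypothetical.

```python
# Minimal hash-sharding sketch: records are routed to a shard by key alone,
# so key lookups are local to one shard, while cross-key operations (joins,
# global scans) must fan out across all shards.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # each dict stands in for a node

def shard_for(key):
    """Pick a shard deterministically from the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("customer:1001", {"name": "Ada", "segment": "gold"})
put("customer:2002", {"name": "Grace", "segment": "silver"})

print(get("customer:1001"))                 # single-shard lookup
# A 'join-like' question must scan every shard:
gold = [v for s in shards for v in s.values() if v["segment"] == "gold"]
print(len(gold))
```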

Some of the more mainstream types of NoSQL implementations are based on how the data is represented, such as:
Document
Graph
Key-value
BigTable


Now for a deeper analysis of each:


NoSQL, Document

For NoSQL Document databases we see that this type of database uses the notion of data being stored as a document. Encodings in use include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

An advantage of a NoSQL document store is that you can retrieve documents based on their key, or on metadata defined through tags about the document. In addition, with search capabilities, you can retrieve documents based on the contents and fields within each document. Since no two documents need be the same, fields may or may not be present, which lends itself well to this non-relational model, as opposed to a row/column database that requires a fixed set of fields.

Documents can be considered analogous to a record (or row) of information, and a collection of documents can be considered analogous to a table.
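
Here is a small Python sketch of the document idea, using plain dictionaries as JSON-like documents in a hypothetical in-memory collection; note that the two documents carry different fields, and the query simply skips documents that lack the requested field.

```python
# A toy document collection: JSON-like documents with differing fields,
# retrievable by key or by the contents of any field they happen to have.
import json

collection = {}   # key -> document (a plain dict standing in for JSON/BSON)

collection["cust:1"] = {"name": "Ada", "email": "ada@example.com",
                        "tags": ["newsletter", "gold"]}
collection["cust:2"] = {"name": "Grace", "phone": "555-0100"}   # no email field

def find(field, value):
    """Return documents whose 'field' equals (or contains) 'value'.
    Documents without that field are simply skipped."""
    results = []
    for key, doc in collection.items():
        present = doc.get(field)
        if present == value or (isinstance(present, list) and value in present):
            results.append((key, doc))
    return results

print(collection["cust:1"])              # fetch by key
print(find("tags", "gold"))              # fetch by a field only some docs have
print(json.dumps(collection["cust:2"]))  # documents serialize naturally to JSON
```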

 

NoSQL, Graph

For NoSQL Graph databases we see that these kinds of databases are designed for data whose relations are well represented as a graph (elements interconnected by an undetermined number of relations between them). This kind of data can range from social relations, to public transport links, to disease tracking, and other areas where nodes and edges provide a good representation of the data, as the sketch below illustrates.
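
A minimal Python sketch of the graph model: nodes and edges kept as adjacency lists, plus a breadth-first traversal answering a relationship question (a hypothetical "how are these two people connected?" query over made-up names).

```python
# Nodes and relationships as an adjacency list, plus a breadth-first search
# for "how is A connected to B?" -- the kind of question graph databases
# are optimized to answer. Names and edges are hypothetical.
from collections import deque

graph = {
    "Ada":     ["Grace", "Linus"],
    "Grace":   ["Ada", "Ken"],
    "Linus":   ["Ada"],
    "Ken":     ["Grace", "Barbara"],
    "Barbara": ["Ken"],
}

def shortest_path(start, goal):
    """Breadth-first search: returns the shortest chain of relationships."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("Ada", "Barbara"))   # ['Ada', 'Grace', 'Ken', 'Barbara']
```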



NoSQL, Key-Value
For the NoSQL Key-Value database we see technologies built as scalable, distributed database systems to solve large-scale problems on data sets that are not represented well or efficiently in a traditional relational database management system (RDBMS).

Also known as an associative array or dictionary, a key-value database implements an abstract data type that stores a record of information for each key, and each key is unique. The database can store, delete, modify, or look up any key-value pair based on its key, or through other functionality such as key-value search or search on metadata associated with each record. Typically this is implemented with a hash map data structure; however, depending on the use case, it can be implemented in other forms, such as a B-tree (for selective, ranged, or ordered access). A minimal sketch follows.
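
The operations above map directly onto a few lines of Python. The sketch below wraps a built-in hash map (dict) in the classic put/get/delete interface, with a tiny tag index standing in for the metadata-search capability mentioned; all names and values are illustrative.

```python
# A toy key-value store: a hash map behind put/get/delete, plus a small
# tag index standing in for 'search on metadata associated with each record'.
class KeyValueStore:
    def __init__(self):
        self._data = {}           # key -> value (the hash map)
        self._tags = {}           # tag -> set of keys (metadata index)

    def put(self, key, value, tags=()):
        self._data[key] = value   # one record per unique key
        for tag in tags:
            self._tags.setdefault(tag, set()).add(key)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)
        for keys in self._tags.values():
            keys.discard(key)

    def find_by_tag(self, tag):
        return [self._data[k] for k in self._tags.get(tag, set())]

store = KeyValueStore()
store.put("session:abc", {"user": "ada", "cart": 3}, tags=["active"])
print(store.get("session:abc"))
print(store.find_by_tag("active"))
store.delete("session:abc")
```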

 

NoSQL, BigTable

Next we move to the NoSQL BigTable implementation, which is also a key-value store, and which originated at Google as a data structure engine. This implementation has some similarities with the previously mentioned key-value store, but also differs in some unique ways: all keys are kept in sorted (lexicographic) order in the BigTable map. The core principles of the BigTable architecture are the following: map, persistence, distribution, sorted order, multi-dimensionality. A minimal sketch of the sorted, multi-dimensional map follows.
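
To illustrate "sorted" and "multi-dimensional" concretely, here is a tiny Python sketch of a map keyed by (row key, column, timestamp) and kept in sorted key order, so that range scans over row-key prefixes are natural; the table contents are made up and no real BigTable API is implied.

```python
# A toy BigTable-style map: values addressed by (row_key, column, timestamp),
# kept sorted by key so that range scans over row-key prefixes are natural.
import bisect
import time

table = []   # sorted list of ((row_key, column, timestamp), value)

def put(row_key, column, value, ts=None):
    key = (row_key, column, ts if ts is not None else time.time())
    bisect.insort(table, (key, value))     # keep entries in sorted key order

def scan_prefix(prefix):
    """Return all cells whose row key starts with 'prefix', in key order."""
    return [(k, v) for k, v in table if k[0].startswith(prefix)]

put("com.example/index", "contents:html", "<html>...</html>", ts=1)
put("com.example/index", "contents:html", "<html>v2</html>", ts=2)   # new version
put("com.example/about", "anchor:home",   "About us", ts=1)

for key, value in scan_prefix("com.example"):
    print(key, "->", value)
```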

 

HDFS

(see Hadoop in next section)

 

Object

For our final technology in this quadrant we move to NoSQL object-oriented databases, which had their hype wave back in the early to mid '90s. An object database management system is not relational; data is stored within it as objects (much as in an object-oriented programming language or framework, e.g., Java). Within this implementation there is a tighter integration between how data is represented (the object format) and how we act upon that data through object-oriented programming, whereas in a traditional RDBMS there is a clearer division between the representation (the schema) and its access (through SQL). A minimal sketch follows.
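
A rough Python illustration of the idea, using the standard library's shelve module as a stand-in for an object database: application objects are persisted and retrieved directly, without being decomposed into rows and columns. The Customer class and file name are hypothetical.

```python
# Persisting application objects directly, without mapping them to tables.
# 'shelve' (a pickle-backed key/object store) stands in for a real ODBMS.
import shelve
from dataclasses import dataclass, field

@dataclass
class Customer:
    name: str
    orders: list = field(default_factory=list)

    def add_order(self, item):
        self.orders.append(item)

with shelve.open("objects.db") as db:
    ada = Customer("Ada")
    ada.add_order("laptop")
    db["customer:1"] = ada          # the object graph is stored as-is

with shelve.open("objects.db") as db:
    restored = db["customer:1"]     # comes back as a Customer, not as rows
    restored.add_order("monitor")
    db["customer:1"] = restored
    print(restored.name, restored.orders)
```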

 

 

Competitive Advantage

Finally, moving to the 4th and final quadrant, Competitive Advantage, we find ourselves in the most demanding quadrant in terms of both size of data and complexity of problem. The architectural approaches in this area are still being defined; however, due to the nature of the space, we see that the principles of parallelism and scalability are both required to understand and solve the problems that arise here.

 

NewSQL

The first technology we see in this quadrant is NewSQL, which refers to the many new vendors that have started to create varied flavors of relational database management systems that follow the typical relational model, with a schema and the SQL language for access. However, outside of those precepts, the NewSQL vendors have implemented a varied mix of creative architectures and value propositions. In general, the common thread amongst these vendors is being cloud capable and going after niche data management markets with MPP architectures and a claim on performance.

 

Hadoop

Next, we move to the heart and soul of Big Data, the pioneering technology that is leading the movement: Hadoop. Hadoop is a technology created at Yahoo by the team that is now at Hortonworks and its creator, Doug Cutting (now at Cloudera). Originally created to help sort and index the web for Yahoo, it is now widely used in a varied set of industry use cases. Hadoop can be considered somewhat of an extension to the Linux operating system, in that it allows multiple nodes to be deployed in an MPP-style fashion. We say MPP-style and not MPP because Hadoop contains many of the core functionalities of an MPP platform, but does not provide all of the necessary elements to truly be MPP (namely a robust, intelligent, and efficient interconnect, and a brain/optimizer). These missing pieces make it extremely difficult, and costly, for organizations to derive actionable intelligence from the platform. On the other hand, given Hadoop’s low cost of hardware and almost zero cost of software licensing, it becomes a great fit as a factory for data storage, retrieval, refinement, and movement across the enterprise.

Hadoop combines two architectures to deliver the MPP-style platform:

HDFS - Hadoop Distributed File System, which is used to distribute data (stored as a file/object) across multiple nodes.

MapReduce - A programming framework typically implemented through Java, and used to execute code in parallel across the nodes, and data, in a shared-nothing fashion, allowing for linear scalability on commodity hardware.

Hadoop is currently distributed as open source under the Apache license, and has multiple distributors that have added their own value propositions and support/services to help make Hadoop enterprise ready.
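
To connect the two components above, here is a hedged sketch of a word-count job written in the Hadoop Streaming style, where the mapper and reducer are ordinary programs that read from stdin and write tab-separated key/value pairs to stdout; the exact submission command varies by distribution, so the usage shown simply pipes the phases together locally.

```python
# wordcount_streaming.py -- a word-count mapper and reducer in the Hadoop
# Streaming style: the mapper reads raw lines from stdin and emits
# tab-separated (key, value) pairs; the framework sorts them by key before
# the reducer sees them and sums per key. Runnable locally with a pipe.
import sys

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```

Run locally, the pipeline "cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce" mimics what the framework does at scale: distribute the map over HDFS blocks, sort by key, then reduce.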

 

Analytical

Finally we move to the last technology in the quadrant, and what may be the most complex in terms of potential usage: the analytical platforms and tool sets in the competitive advantage quadrant.

In the competitive advantage quadrant, the set of analytical tools looks most like hybrids of what we see in the other quadrants. The leaders in the field have developed platforms that utilize advances in all areas to quickly derive business value. These tools/platforms are known to provide a more business-friendly version of the core tools and technologies that are the architectural principles of both Hadoop/MapReduce and the analytical systems.

In this category we see vendors like AsterData, which provides a MapReduce framework executed through out-of-the-box SQL-MR functions that allow business users to run SQL code with MapReduce. The architecture combines this analytical capability of MapReduce with a traditional RDBMS, allowing access through all of the traditional BI tools. AsterData also includes database structures for performance enhancement, a cost-based optimizer, workload management, and concurrency support.

As you can see, the list of technologies in each quadrant can be exhausting, and it will not be slowing down with increased competition in the marketplace.





Current Vendor Landscape


In this section we attempt to describe what the current vendor landscape looks like in 2013 with respect to each architectural approach.

 

Within the model we see that there are many players in the space, and many different markets to play in. At times, the lines between solutions can become blurred. In general we see that the established enterprise IT vendors are still active, and have entered the marketplace renewed after a few years of aggressive mergers and acquisitions (M&A) in 2010 and 2011. Namely, we saw the analytical relational systems being acquired (EMC-Greenplum, Microsoft-DatAllegro, IBM-Netezza, Oracle-Exadata(SUN)).

 

However, given the state of affairs with many of the existing technologies we see up-and-coming vendors playing in an all-out land grab mode as they figure out their value-proposition to the industry and where they fit. Outside of the core systems, we will see a movement throughout the industry as we shift to analysis on semi-structured and unstructured data sets, with an emphasis on simple analytic tools with powerful capabilities.

 

Taking Microsoft as an example of one of the established enterprise IT vendors, we see that the corporate strategy at Microsoft is betting big on this market. Looking at the 4 quadrants we see that Microsoft either has products, or research in the works, to address every quadrant with at least 2 different solutions (highlighted above in RED).

 

Some examples of current implementations of technologies:

 

Document/Cloudant (CouchDB) - originates from MIT Particle Physics work ~ 100 Petabytes

Graph/InfiniteGraph on Objectivity has exceeded a 1-petabyte database ~ 1 Petabyte installations in Telecommunications

Key-Value/Amazon DynamoDB - Used internally and throughout Amazon at multi-petabyte scale ~ 3-5 Petabytes

BigTable/Google BigTable - Implementation of world wide web indexing ~ 10 Petabytes Storage, 20-30 Petabytes processed per day

Object/Intersystems Cache - European Space Agency, Spain (Chart a 3D map of the Milky Way) ~ 1 Petabyte

Operational/OLTP/Many references of multi-Terabytes (10s of Terabytes installations, traditional to the enterprise) ~ 10s Terabytes

Analytical/OLAP/Teradata, Netezza, multiple installations cited and reported ~ 10-100s of Terabytes

Collaboration/Typical systems are Enterprise wide and can range in the 10s to 100s of Terabytes in some instances.

HDFS/HDFS is used as the core file system for Hadoop and can range in implementations of 100s of Terabytes to 10s of Petabytes.

NewSQL/Amazon RDS - implementations in the Terabytes range ~ 10 to 100s of Terabytes

Hadoop/Hadoop has many use cases in the multi-petabyte (10s to 100s) and 42,000-node range (Yahoo)




Where Does Big Data FIT?


Finally, as an exercise, we take the model we built to map EDM, and we draw a line around the platforms and tools that are considered part of the ‘Big Data’ sphere. By doing so we come to several key observations:

 

Within the 3rd quadrant (Enterprise Interactions / Non-Relational) we have the biggest area under the shape and interpret this as the area that is driving the most ‘Big Data’ projects. Only a sliver of the area under the shape is approaching the traditional enterprise and we see this in the analytical relational systems area.

 

Note that both ETL (Extract/Transform/Load) and Business Intelligence (BI) tools have been included in the model; however, they are not included in the area underneath the curve. This is because many tools are being introduced to replace these two standards outside of the core enterprise. Namely, we have Hadoop giving Informatica and other ETL vendors a run for their money (and all but providing the last nail in Ab Initio’s coffin). On the Business Intelligence front we have a full-on attack from the visualization vendors on traditional BI, as well as friction from upstarts as they figure out ways to bypass ETL and BI completely and derive analytics straight from Hadoop.

 

Using the solution model above and mapping out Big Data, we are also able to see a pattern in the model around which areas of “Big Data” should see the earliest enterprise adoption, because they require neither an extreme size of data nor an extreme complexity of problem, thus:

 

QUESTION: Which three ‘architectural approaches’ will see the earliest traditional/enterprise leader adoption?

ANSWER: Analytical/Relational/OLAP, Collaborative, In-Memory



The End


Links / References:

  • http://www-01.ibm.com/software/data/infosphere/hadoop/hbase/
  • http://en.wikipedia.org/wiki/NoSQL
  • http://en.wikipedia.org/wiki/Relational_database
  • http://hypertable.org/
  • http://en.wikipedia.org/wiki/Edgar_F._Codd
  • http://en.wikipedia.org/wiki/Relational_model
  • http://en.wikipedia.org/wiki/FLOPS
  • http://en.wikipedia.org/wiki/Titan_(supercomputer)
  • http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/
  • http://strata.oreilly.com/2012/02/nosql-non-relational-database.html
  • http://en.wikipedia.org/wiki/Mongodb#Use_cases_and_production_deployments
  • http://en.wikipedia.org/wiki/SQL_Azure
  • http://en.wikipedia.org/wiki/NewSQL
  • http://www.akiban.com/operational-big-data
  • http://blogs.the451group.com/information_management/2011/04/06/what-we-talk-about-when-we-talk-about-newsql/
  • http://en.wikipedia.org/wiki/Graph_database
  • http://static.usenix.org/event/usenix99/full_papers/olson/olson.pdf
  • http://www.dbms2.com/2012/11/29/notes-on-microsoft-sql-server/
  • http://nosql-database.org/
  • http://www.informationweek.com/software/information-management/cloudera-debuts-real-time-hadoop-query/240009673
  • http://www.informationweek.com/software/information-management/hadoop-meets-near-real-time-data/232601984
  • http://www.informationweek.com/software/information-management/13-big-data-vendors-to-watch-in-2013/
  • http://www.informationweek.com/software/information-management/mapr-promises-a-better-hbase/240009608
  • http://www.informationweek.com/software/information-management/maprs-google-deal-marks-second-big-data/240003121
  • http://www.informationweek.com/software/information-management/mongodb-upgrade-fills-nosql-analytics-vo/240006437
  • http://www.informationweek.com/software/information-management/amazon-dynamodb-big-datas-big-cloud-mome/232500104
  • http://www.informationweek.com/software/information-management/amazon-debuts-low-cost-big-data-warehous/240142712
  • http://www.informationweek.com/software/business-intelligence/with-hadoop-big-data-analytics-challenge/240001922
  • http://www.capgemini.com/technology-blog/2012/09/big-data-vendors-technologies/
  • http://www.informationweek.com/software/business-intelligence/splunk-answers-business-demand-for-big-d/232400148
  • http://www.capgemini.com/technology-blog/2012/06/nosql-hadoop/
  • http://www.itworld.com/big-datahadoop/251912/big-data-tools-and-vendors?page=0,0
  • http://www.idc.com/getdoc.jsp?containerId=IDC_P23177#.UPompR1lGSo
  • http://wikibon.org/blog/navigating-the-big-data-vendor-landscape/
  • http://www.information-management.com/news/40-Vendors-We-Are-Watching-2012-10023168-1.html?zkPrintable=1&nopagination=1
  • http://en.wikipedia.org/wiki/Key-value_database
  • http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
  • http://en.wikipedia.org/wiki/VoltDB
  • http://www.voltdb.com
  • http://en.wikipedia.org/wiki/Doug_Cutting