What is Unstructured Data

Digital data can be broken down into structured digital data and unstructured digital data.
Structured data is best known as relational data, but it is any text-based data stored in a way that enables it to be accessed and queried according to an agreed standard.
Relational data is stored in a well-defined mathematical structure with official rules and standards for accessing and manipulating it. Other types of databases that store text data exist and conform to different standards.
Any data that is not stored in a well-defined structured format can by default be seen as unstructured. The traditional view is that unstructured data is just any binary data.
There is a fuzzy area between structured and unstructured; it is more accurate to say that there are degrees of structure, with a lot of overlap.
It's possible to store unstructured data in a column in a relational table, which is structured.
Conversely, the physical database files containing structured data are binary, are stored in a proprietary format without publicly defined rules, and so can themselves be considered unstructured.

Data stored in NoSQL or XML can be considered to be stored in a semi-structured format. For XML there are rules for accessing and querying it, but the data itself and its structure can vary. It can conform to agreed standards or be stored in a raw format.

Just saying that text data is structured and binary data is unstructured is not sufficient, as a text file (created in Notepad or vi) can contain a random set of characters with no definition, no rules and no conformance to any standard.

Unstructured data can be broken down into different groups. A well-known group is multimedia, or rich media. It includes types like digital image, audio, video and document (though there are more in this list). Some of these types are well defined and can contain embedded XML (or other metadata) that conforms to an agreed set of standards. The format of the binary data can also follow agreed rules. The digital image format JPEG is an open standard. For video, MPEG is also an open standard. Multimedia would be a category of unstructured data that is well defined. The category is fluid, changes as technology changes, and is unlikely ever to conform to the mathematically well-proven relational structure.
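
As a rough illustration of binary data that follows agreed rules, many media formats begin with a fixed byte signature. The sketch below is an assumption-laden example, not part of any particular product: the helper name `sniff_format` and the small signature table are hypothetical, and a real detector would check far more than the first few bytes.

```python
# Hypothetical helper: recognise a few well-defined binary formats by the
# fixed signature bytes at the start of the data. The table is illustrative
# and deliberately incomplete.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",        # JPEG start-of-image marker plus next marker prefix
    b"\x89PNG\r\n\x1a\n": "PNG",    # eight-byte PNG signature
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
}


def sniff_format(data: bytes) -> str:
    """Return a format name if the data starts with a known signature."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return "unknown"


# Placeholder payloads: only the leading bytes matter for this sketch.
jpeg_like = b"\xff\xd8\xff\xe0" + b"\x00" * 16
random_blob = b"\x01\x02\x03\x04"

print(sniff_format(jpeg_like))
print(sniff_format(random_blob))
```

Data that matches a signature is at least claiming to follow a standard; data that matches nothing falls back into the plain unstructured group.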

So we can now define all data as:

1. Structured - Any data stored in a well-defined, non-proprietary system. Data is primarily text based. Typically conforms to ACID.
2. Semi-structured - Any data stored in a system that conforms to some rules and can be proprietary. Data is primarily text based. Does not have to conform to ACID.
3. Well-defined unstructured - Binary data that is well defined and conforms mostly to an agreed standard.
4. Unstructured - Binary data that is proprietary.

The challenge is that even with this definition, some data falls across more than one category.
This is typical of what one encounters when dealing with unstructured data. There is no concise and easy-to-use definition. The temptation is to say that unstructured data is just any data that is not structured, and then to try to fit NoSQL, XML and a multitude of other storage structures into the structured group because they feel as if they should belong there. In that case, is HTML structured or unstructured? HTML looks like a subset of XML (and its XHTML variant is an XML application), but errors are tolerated in HTML and its tag names are not case sensitive, whereas XML is case sensitive. A raw text file can be labelled as HTML and a browser will still display it, but you can't do the same with XML. An XML file with one syntax error in it is not XML, because it doesn't conform to the XML rule set.
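
The difference in strictness can be seen with a small sketch using only the Python standard library. The function names `is_well_formed_xml` and `html_tags` and the class `TagCollector` are made up for this illustration; the point is only that the XML parser rejects input that the HTML parser accepts.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case
        self.tags.append(tag)


def is_well_formed_xml(text: str) -> bool:
    """True only if the text parses as XML without a single syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_tags(text: str) -> list:
    """Parse the text as HTML, tolerating errors, and list the start tags."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags


sloppy = "<P>Hello <b>world</P>"       # the <b> element is never closed
raw_text = "just some raw text"

print(is_well_formed_xml(sloppy))      # False: mismatched tag
print(html_tags(sloppy))               # ['p', 'b']: accepted, names lower-cased
print(is_well_formed_xml(raw_text))    # False: no root element
print(html_tags(raw_text))             # []: accepted as plain character data
```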

A well-known joke asks, "What do you call a boomerang that doesn't return?" The answer: a stick. Except that when one looks at the true history of boomerangs, most were designed not to return. Yet we think of a boomerang as any object that returns when thrown. Such an object can have almost any shape, as boomerang experts have shown by making boomerangs in the shape of letters of the alphabet just to demonstrate how versatile the ability to return can be. The point is that our traditional, innate sense of what something should be and where it belongs is not always right.

One can also say that unstructured data is really structured data that hasn't been defined correctly yet. Because of the exceptions to the rule, it might not even be valid to break data up into structured and unstructured. Yet by breaking it up and identifying each set, one can associate rules with it, understand its limitations and formulate new concepts around it. So it is useful to be able to do this.

When we look at the situation of a digital image being stored in a relational database like Oracle, we actually see two different situations. We see the digital image, which is binary data conforming to a well-defined standard, but it's being stored in a structured system. We can treat what the data represents and where it is stored as two separate things.
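
A minimal sketch of this separation follows. It uses SQLite through Python's built-in `sqlite3` module as a stand-in for a relational database such as Oracle, which is an assumption made purely so the example is self-contained. The table name `photos`, its columns and the placeholder image bytes are all hypothetical.

```python
import sqlite3

# Placeholder "image": the eight-byte PNG signature followed by filler bytes.
# It is not a displayable picture; it only stands in for binary content.
image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 32

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " content BLOB NOT NULL)"
)

# The insert runs inside a transaction, so the database applies its usual
# guarantees to the row even though it knows nothing about the image format.
with conn:
    conn.execute(
        "INSERT INTO photos (title, content) VALUES (?, ?)",
        ("sunset", image_bytes),
    )

title, stored = conn.execute(
    "SELECT title, content FROM photos WHERE id = 1"
).fetchone()

print(title, len(stored), stored == image_bytes)
```

The database can query the title and guarantee that the bytes come back unchanged, but the meaning of those bytes is defined by the image standard, not by the table.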

So let's look at this further. If we now separate the storage mechanism from the data itself, we can have unstructured data stored in a relational database. The unstructured data is a separate entity, and even though it's handled using ACID, that is not important, as the data itself is unstructured. Of course, that raises some new issues. What about some of the text elements stored in a structured database: are they structured or unstructured? If we store a date value in a date column, it behaves as structured, as its format is fixed and conforms to a mathematical standard. If the date is stored in a varchar field then it's not structured, as any value can be stored in it. If we store an address in a varchar field, is that structured or unstructured? If we store the values in an abstract data type then the address can be classed as structured, as methods can be applied to it and the structure is well defined and controlled. If the address is stored in only a varchar field, then any value can be added in free form, and it's unstructured. A similar situation holds for names and a raft of other values. So it appears that a lot of the individual data items in a structured database might actually be unstructured. This issue is well known in data warehouses, where a lot of time is spent cleaning the data into a structured format.
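
The contrast between a free-form field and a typed one can be sketched in a few lines. This is only an analogy in Python, not database code: `FreeFormRecord` plays the role of a row with a varchar date, `TypedRecord` the role of a row with a real date type, and both names, along with `parse_joined`, are invented for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FreeFormRecord:
    """Like a varchar column: any string is accepted as the date."""
    name: str
    joined: str


@dataclass
class TypedRecord:
    """Like a date column: only a real calendar date is accepted."""
    name: str
    joined: date

    def __post_init__(self):
        if not isinstance(self.joined, date):
            raise TypeError("joined must be a datetime.date")


def parse_joined(text: str) -> date:
    """Convert ISO text (YYYY-MM-DD) to a date; raises ValueError otherwise."""
    return date.fromisoformat(text)


# The free-form record accepts a value that is not a date at all.
loose = FreeFormRecord("Ann", "next Tuesday")

# The typed record only exists once the value has passed validation.
strict = TypedRecord("Ann", parse_joined("2024-02-29"))

for candidate in ("2024-02-30", "next Tuesday"):
    try:
        parse_joined(candidate)
    except ValueError:
        print("rejected:", candidate)
```

The cleaning work done in a data warehouse is, in effect, the step of moving values from the first kind of record to the second.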

So again we come to a situation where trying to clearly define structured and unstructured always brings up inconsistencies and exceptions to the rule. At this point we realise that this isn't an issue at all, and we come to a better understanding of how one has to rethink the whole strategy of working with unstructured data. A document can contain only photos. Is it a document or a photo album? If a video has only an audio track and no picture, is it still a video? Is an animated GIF image a video? Even when looking at two images and comparing them, how can we say they are the same? If one image differs from the other by one byte, is it still the same image? If we compare two seemingly identical videos, but one is missing only the final frame, which has no audio or picture, are they the same or different? The world of unstructured data introduces us to a world where our traditional rules for dealing with commonly held concepts break down and don't make sense any more. The strict definitions we are used to and comfortable with for defining relational data fall apart when dealing with unstructured data.
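
The one-byte question can be made concrete with a short sketch. It compares raw byte strings only and does not decode any image format; the function names `same_bytes` and `differing_positions` and the sample data are assumptions made for illustration.

```python
import hashlib


def same_bytes(a: bytes, b: bytes) -> bool:
    """Strict equality: the SHA-256 digests match only for identical bytes."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def differing_positions(a: bytes, b: bytes) -> int:
    """Count positions that differ, plus any difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


# Stand-in for image content: 1024 bytes of sample data.
original = bytes(range(256)) * 4

# Flip a single bit in a single byte.
altered = bytearray(original)
altered[100] ^= 0x01
altered = bytes(altered)

print(same_bytes(original, original))          # True
print(same_bytes(original, altered))           # False
print(differing_positions(original, altered))  # 1
```

A strict comparison says the two are different, while the count shows they differ in one byte out of 1024. Whether that counts as "the same image" is a decision the application has to make; neither function can answer it.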

