The quickest way to identify what constitutes unstructured data is to first determine what it is not.
Structured and Semi-Structured Data
Some of you reading this may already know what unstructured data is, some of you may think you know, and perhaps some of you have heard the term but were never sure exactly what it meant. Regardless of which camp you fall into, I would like to take a minute to explain what I mean when I say “unstructured data.”
As with many other terms in the data center industry, the definition of unstructured data varies widely, which can make things a bit confusing. Some will say that the difference between unstructured data and structured data simply has to do with data types. For instance, they would classify all audio and video files as unstructured data and all Excel files and databases as structured. I have heard others say that it is more about how the data itself is stored and accessed by various users and applications. They consider all the data residing on their NAS filers and shares to be unstructured, and all the data residing on their block-based storage arrays to be structured.
Neither of these definitions is 100% wrong, but in my opinion, they tend to be a bit vague and fail to capture the true definition of unstructured data. Truth be told, the quickest way to begin to identify what constitutes unstructured data is to first determine what it IS NOT, and that, of course, is structured data. Structured data refers to data that contains hard and objective facts: think numbers, dates, names, and so on. Structured data is typically found in an expected and predefined format, with common fixed fields. Another key characteristic of structured data is that, due to its fixed and enforceable format, it is much more easily stored in a searchable format where either human- or computer-driven analysis can be performed against it.
A good real-world example of structured data that everyone should be familiar with is what happens when you do any type of online shopping. You select a product from some e-commerce website, you enter your credit card information and your delivery address, and then the e-commerce site commits to some kind of delivery ETA. The e-commerce site is undoubtedly storing all this information in some kind of relational database in a very structured manner.
e-Commerce Site – Structured Data Examples
- Product SKU #: 123456ABC
- Credit Card #: 1234-5678-9123-4567
- Delivery Address: 12345 Smith Ave. Wisconsin, USA
- Committed Delivery ETA: January 1, 2020
This is all considered to be structured data.
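To make the "fixed fields" idea concrete, here is a minimal sketch of how such an order might sit in a relational table. It uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and ISO date format are my own assumptions for illustration, not how any particular e-commerce site actually stores its data.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
# A real site would never store a raw card number like this; it is kept
# here only to mirror the example fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        sku              TEXT NOT NULL,
        card_number      TEXT NOT NULL,
        delivery_address TEXT NOT NULL,
        delivery_eta     TEXT NOT NULL  -- stored as an ISO 8601 date string
    )
    """
)
conn.execute(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    ("123456ABC", "1234-5678-9123-4567",
     "12345 Smith Ave. Wisconsin, USA", "2020-01-01"),
)

# Because every row has the same predefined fields, searching is trivial.
rows = conn.execute(
    "SELECT sku, delivery_eta FROM orders WHERE delivery_eta <= ?",
    ("2020-01-31",),
).fetchall()
print(rows)  # [('123456ABC', '2020-01-01')]
```

The point is simply that every value has a known place and type, so a query can find it without any interpretation.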
Before we get to unstructured data, there is another term, semi-structured data, that we should first demystify as well. Semi-structured data tends to be much more ambiguous and subjective than structured data. You cannot easily store semi-structured data in a relational database. However, this type of data does tend to have certain properties, attributes, and data fields that allow it to be stored in a searchable format for analysis. Typically, there are either inherent metadata fields (information about the underlying data) or perhaps manually assigned custom tags that can be applied and used to assist with organization, search, and analysis, but the underlying data itself still lacks any default structured content.
Going back to our previous e-commerce example, let’s imagine that the product was delivered but it arrived damaged. You check the e-commerce FAQ on what to do if your product arrives damaged, and the instructions tell you to take a photo of the damaged product and submit the photo via the e-commerce website’s Damaged Product Assistance page. The e-commerce site is not going to be able to store your photo in a traditional relational database the way the previous structured data most assuredly was. It is certainly going to save the photo itself and retain specific metadata about it, though, and it most likely will assign some specific custom attributes to that photo that can be searched or analyzed.
e-Commerce Site – Semi-Structured Data Examples
- The Uploaded Photo itself plus various metadata:
- Date Photo was Taken: January 2, 2020
- File Size of the Photo: 5 MB
- Custom Attribute/Tags: “Potential Unsatisfied Customer”
Though the photo itself is considered unstructured, certain metadata, attributes, or custom tags can be used or applied to create some kind of order so that search and analysis (however limited) can take place.
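As a rough illustration of that idea, the sketch below wraps each uploaded photo in a small record holding its metadata and tags, then searches by tag. The file names, field names, and the slug-style tag spelling are all hypothetical; the photo bytes themselves stay opaque, and only the surrounding fields are searchable.

```python
# Hypothetical upload records: the photo content itself is unstructured,
# but each record carries metadata and optional custom tags.
uploads = [
    {
        "file": "damaged_product.jpg",
        "metadata": {"taken": "2020-01-02", "size_mb": 5},
        "tags": ["potential-unsatisfied-customer"],
    },
    {
        # Records do not have to share the same fields: this one has no
        # file size recorded and no tags at all.
        "file": "unboxing.jpg",
        "metadata": {"taken": "2020-01-05"},
    },
]


def find_by_tag(records, tag):
    """Return the file names of all records carrying the given tag."""
    return [r["file"] for r in records if tag in r.get("tags", [])]


flagged = find_by_tag(uploads, "potential-unsatisfied-customer")
print(flagged)  # ['damaged_product.jpg']
```

Note that the second record is missing fields the first one has. That looseness is what separates this from the fixed-field table a relational database would enforce.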
What, Exactly, is Unstructured Data?
Now that we have discussed what characteristics make up both structured and semi-structured data, we can finally discuss what unstructured data truly is. Unstructured data comes in many types and formats and can be stored on countless types of media. The main distinction with unstructured data is that it completely lacks any type of structure or definition. It is not organized in any predefined way and does not follow any typical data model.
As a result, unstructured data is not a good fit for a relational database the way structured data is. Each piece of data differs so widely from the next that applying custom attributes or tags, or even attempting any kind of manual or automated analysis of its metadata, typically isn’t as helpful as it is with semi-structured data.
Again, using our same e-commerce example as before, let’s say that you next receive an email from the e-commerce site indicating that a replacement product is on its way. You receive the replacement product, and this time there is no damage and the product arrives in perfect working order. However, the e-commerce site also sends you a $25 gift card for being a great customer and to make up for the initial inconvenience. You are so impressed with this company’s actions that you log on to its website’s comment section and give it a rave review. You also contact the company via telephone and leave a detailed message on its automated voicemail explaining how satisfied you were with how it handled the initial situation, as well as the product itself. You then refer five social media friends who also end up purchasing products from this same e-commerce site based on your referral. All of these actions end up creating different types of unstructured data:
e-Commerce Site – Unstructured Data Examples
- Comments left on the company’s own website
- Voicemail left with company’s automated voicemail service
- Positive social media referrals
As you can imagine, these examples of unstructured data are going to be very difficult to organize into any type of searchable hierarchy or structure. Coming up with custom tags for this type of data is going to be a painstaking and manual effort that, in all likelihood, will provide very little insight to the business. Does this make the unstructured data any less valuable than the structured or semi-structured data? In our theoretical e-commerce example, one could argue that since only the unstructured data captured the fact that this customer potentially generated new business for this company via referral, it is actually the most important data set of all three and should be considered a key resource for analysis.
It becomes obvious that any organization that can better store, search, and analyze unstructured data will have a significant advantage over competitors that are not doing the same.
Certainly the structured data is being looked at and analyzed:
- Which product SKUs are best-sellers?
- Which parts of the country are this company’s products not selling very well in?
- Who are the repeat customers?
All this structured data is undoubtedly being imported into some type of relational database, searched, and analyzed in order to come up with various financial forecasts and marketing reports and provide enhanced value to the business.
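Questions like the ones above map almost directly onto database queries, which is part of why structured data gets analyzed so readily. Here is a small sketch, again using Python's built-in sqlite3 module. The table layout, customer IDs, second SKU, and state codes are invented sample data, not real figures.

```python
import sqlite3

# Invented sample data: one row per order, with hypothetical column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, sku TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("c1", "123456ABC", "WI"),
        ("c1", "123456ABC", "WI"),
        ("c2", "123456ABC", "MN"),
        ("c3", "789XYZ", "WI"),
    ],
)

# Which product SKUs are best-sellers?
best_sellers = conn.execute(
    "SELECT sku, COUNT(*) AS units FROM orders "
    "GROUP BY sku ORDER BY units DESC, sku"
).fetchall()

# Who are the repeat customers?
repeat_customers = conn.execute(
    "SELECT customer_id FROM orders "
    "GROUP BY customer_id HAVING COUNT(*) > 1 ORDER BY customer_id"
).fetchall()

print(best_sellers)      # [('123456ABC', 3), ('789XYZ', 1)]
print(repeat_customers)  # [('c1',)]
```

There is no equivalent one-line query for a voicemail or a free-text comment, which is the gap the rest of this discussion is about.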
Since the unstructured data cannot be organized as easily, many businesses simply keep it in storage, never to be searched or analyzed again. Others do not store or capture it at all! Why is this? One key reason is a belief that a company would need to invest in high-end and expensive data science teams or consultants in order to extract the business value from its unstructured data. This was certainly true in the early days of big data and data analytics, but you may be surprised at some of the solutions that exist out there and are being used by businesses to help simplify and partially automate the search and analysis of unstructured data. We could go down a serious rabbit hole here discussing big data analytics applications, so I will save that for another time.