
Dark data may be tamed with natural language processing

Humans don’t have the time to sift through all the unused data available. NLP could be the answer.


Image: Iaremenko, Getty Images/iStockphoto

Some serious medical conditions, such as sepsis, might be more easily predicted thanks to artificial intelligence (AI). Sepsis, which results from a massive immune response to bacterial infection, can often lead to organ failure and death.


Mary Beth Moore, an AI and language analytics strategist at SAS, wrote about this and other serious medical conditions and how AI can help: “Early diagnosis and rapid intervention is critical in sepsis treatment, but symptoms aren’t always apparent for its early onset stages. Mortality rates increase 8% for every hour treatment is delayed. With heavy caseloads and possibly asymptomatic patients in the early stages of sepsis, the human eye may not notice the correlation between data in medical records and early indicators of a deadly condition. But the application of natural language processing to data in those electronic health records is a key input for predictive models that trigger alert systems, notifying doctors and nurses that a patient may need medical intervention.”

The value of being able to decipher and navigate digitized but unstructured medical records goes well beyond sepsis.

One challenge is that some doctors still don’t use electronic medical records, some medical research hasn’t been digitized, and other forms of medical documentation are missing from the digital repositories where they could be more easily stored and queried.


Healthcare is by no means unique. Auto care, oil and gas, retail, government, law, manufacturing, and many other industry sectors all have their share of dark data that could contain valuable information but remains untamed.

What is dark data?

Dark data is any data that isn’t being used, whether it’s in computer log files, social media customer sentiment, or paper documents and video tapes stowed away in corporate store rooms. Digitizing this data so it can ultimately be probed for insights has been a major thrust of the digitalization movement that has gripped most businesses for the past five years.

Unfortunately, digitization has limits 

You still have to apply analytics to the data that is digitized if you’re going to successfully mine it for insights. This is where natural language processing (NLP) comes in, because it has the potential to mine much of the dark data that organizations have been reluctant to approach.


NLP can dissect language into individual word-entity elements. It can tag parts of speech and derive meaning from a block of text, based on the rules and algorithms that human subject matter experts endow it with. The NLP process can be further automated and expedited with the help of machine learning (ML), which uses the algorithms to process text and to recognize repetitive patterns on its own, learning from what it discovers, and rapidly working through data in a fraction of the time it would take human operators to do it.
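
To make the idea of breaking text into elements and tagging parts of speech concrete, here is a minimal sketch in Python. It is not how any particular NLP product works: the tiny lexicon, the suffix rules, and the function names `tokenize` and `tag` are all hypothetical, standing in for the kind of rules a human subject matter expert might supply. Production systems typically rely on trained statistical models rather than hand-written rules like these.

```python
import re

# Hypothetical hand-written lexicon: words whose part of speech is fixed by rule.
LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "of": "ADP", "in": "ADP", "to": "ADP",
    "is": "VERB", "was": "VERB", "has": "VERB",
    "patient": "NOUN", "fever": "NOUN",
}

# Hypothetical fallback rules: guess a tag from the word ending.
SUFFIX_RULES = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("tion", "NOUN")]

NUMBER = r"\d+(?:\.\d+)?"


def tokenize(text):
    """Split text into words, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|" + NUMBER + r"|[^\w\s]", text)


def tag(tokens):
    """Attach a coarse part-of-speech label to each token using the rules above."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            label = LEXICON[lower]
        elif re.fullmatch(NUMBER, token):
            label = "NUM"
        elif re.fullmatch(r"[^\w\s]", token):
            label = "PUNCT"
        else:
            # First matching suffix wins; unknown words default to NOUN.
            label = next((t for suffix, t in SUFFIX_RULES if lower.endswith(suffix)), "NOUN")
        tagged.append((token, label))
    return tagged


if __name__ == "__main__":
    note = "The patient has a fever of 39.5."
    for token, label in tag(tokenize(note)):
        print(token, label)
```

The rule-based approach is easy to read and audit, but every word outside the lexicon is only a guess. That gap is what machine learning is meant to close, by learning tagging patterns from examples instead of relying on a fixed list.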

Can NLP and ML solve every dark data problem?

No, but they can help companies make headway into the troves of text- and voice-based data that have accumulated but remained dark and undiscovered because there simply wasn’t time for humans to go through them. And they can provide relief to the 77% of global business respondents who feel that finding and capturing dark data should be a top priority. Perhaps they can even save a few people from dying of sepsis.

This is exactly why CIOs should have NLP on their roadmaps, if they aren’t using these technologies already. NLP is one of the best ways to ensure the unstructured data you’ve accumulated doesn’t become a wasted asset.
