What the Darkweb knows about you

#OSINT guide on how to use breach data more effectively

13 min read, Oct 24, 2023

Part 1: Automate it!

AI-Generated art with a German fugitive J.M. in the background

There is a lot of talk about how to use the full scale of breach data on the darkweb, but few open-source investigators really know how. “Pivoting” on data points opens entirely new avenues for investigations. Fairly new is “Datalake” by an inconspicuous US company. It offers #OSINT investigators the largest breach data collection on the planet. This is Part 1 of 2 on how investigative journalists can use it.

How to #OSINT with breach data?

Let's be honest. Open data on the web only gets you so far. Social media platforms, which are often the basis of person investigations, are on their guard, and so are the perpetrators using them. That’s why many colleagues increasingly look at breaches published on the darkweb.

Just a big disclaimer right at the beginning: using such data in investigations requires the highest level of integrity. It also requires gathering context beyond what's in the breach data and being clear about why searching it is warranted. With that out of the way, breach data is currently the big new goldmine of information when it comes to chasing bad people operating on the internet.

As an #OSINT journalist, only public interest warrants that, which journalism defines as “Detecting or exposing crime or a serious misdemeanour….”. And breach data is only one way to go about it.

(Another disclaimer: there is something called the presumption of innocence, which holds until someone is prosecuted/sentenced. Any data found in breaches can only be the start of an investigation, used as a means of verification, never as an end in itself. I urge extreme caution about what conclusions are drawn from such data.) Back to the fun…

Using breach data for investigations isn't new. However, it's becoming increasingly popular and widespread. It’s also free and, for those in the know, accessible with minimal effort. Some netizens maintain updated lists with links to the latest breaches by ransomware groups. Many social media platforms have tightened their personal data security and restrict what data on other users can be scraped. All of that fuels the hype around breach data.

For people investigations, breach data can often be creepily detailed, but it can also magically fill holes and draw enchanting new connections. We leave all sorts of information on websites. Sometimes the mere existence of our data can raise questions (think leaks of a certain porn site, membership platforms of right-wing extremists, sites that support money laundering, or breaches of public institutions such as law enforcement).

“The atmosphere around data breaches is a little bit like back in the early 2000s, when secret lists of tax-avoiding individuals were making their rounds”, anonymous

If you work in OSINT, you might know how researchers operate with data breaches in 2023. They usually start by checking for a data point linked to an identity, which allows them to check whether that person has been part of a breach.

When big data breaches take place, the data is leaked online. Aggregators collect it, sell it or sell subscriptions to search in it (or work with journalists to improve transparency and support serious investigations). As the bad apples know how to produce it (hacking/RW groups) and some of them how to exploit it later on (Doxxing, scams, impersonation fraud), we investigators try to use it only for good.

That means following the trail of perpetrators of confirmed crimes, or of serious allegations that make this work worth my time. Often a confirmed e-mail address can be a starting point. Checking on “Have I Been Pwned”, or throwing queries at platforms such as DeHashed, can allow picking up a trail that a person of interest left unintentionally online (there are more such platforms).

But manual cross-checking of data points in breaches can be cumbersome. Pivoting on some data points might not be possible. The benefit of breach data is that it has so many facets, and linking them is key. From a professional standpoint, that is what I expect OSINT investigation platforms to innovate on.

Nonetheless, the output can be creepy as fuck. Breach data may reveal what languages a person is studying on DuoLingo (Breach), what online courses have been accessed (DataCamp breach), to what address a target has ordered food (Yandex Food — Worth reading: “Food Delivery Leak Unmasks Russian Security Agents” by the OSINT colleagues at Bellingcat). For investigators and journalists defending principles of public interest, that faceted data can open new avenues for research. And, let's be honest, for journalists, some of it, if verifiable, is just gold for storytelling.

Now, while I can admit that taking the manual route of scavenger-hunting breach data has its advantages, I ultimately expect new systems to be built that automate the collection of data and speed up searching it. Or in short, in ten years' time, OSINT investigative work might look veeeery different.

Crosschecking, pivoting: we need the whole process to be scalable and fast. Instead of collecting, downloading and virus-checking various breached datasets, I have long yearned for ways to search for details on the fly and “interconnect” data points in a way that benefits fast investigations.

A hypothetical example

By “interconnected”, I mean a way in which any data point can be pivoted on. It should allow cross-referencing billions of indexed breach data entries. If I lost you there, let me explain with an example. You search for the e-mail of a suspect, perhaps one listed with an Interpol red notice. You find a number of breached online profiles. The person was sneaky. He maintained hundreds of fake sock puppet accounts to mislead and stalk victims. It was a big set-up for a financial crime. The email links to a breach. You find a password that the person used not once but across hundreds of other sites and accounts. The suspect was careless enough to reuse passwords. We would now have access to new leads, and perhaps new ways to find victims of his crimes…
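To make the idea of pivoting a bit more tangible, here is a minimal sketch in Python. It is not how any particular platform works. The records, the field names (`source`, `email`, `pw_hash`) and the helper functions are all made up for illustration, and the data is synthetic. The sketch compares password fingerprints rather than plaintext, on the assumption that an investigator never needs to see the password itself.

```python
import hashlib
from collections import defaultdict


def fingerprint(password: str) -> str:
    """Hash a password so records can be compared without keeping plaintext."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Synthetic, invented records standing in for indexed breach entries.
RECORDS = [
    {"source": "forum-a", "email": "suspect@example.com", "pw_hash": fingerprint("Tr0ub4dor&3")},
    {"source": "shop-b", "email": "sock1@example.org", "pw_hash": fingerprint("Tr0ub4dor&3")},
    {"source": "game-c", "email": "sock2@example.net", "pw_hash": fingerprint("Tr0ub4dor&3")},
    {"source": "shop-b", "email": "bystander@example.com", "pw_hash": fingerprint("123456")},
]


def build_index(records, field):
    """Group records by the value of one field, e.g. all records per email."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record)
    return index


def pivot_email_to_emails(records, seed_email):
    """Email -> password fingerprints -> other emails sharing a fingerprint."""
    by_email = build_index(records, "email")
    by_hash = build_index(records, "pw_hash")
    linked = set()
    for record in by_email.get(seed_email, []):
        for other in by_hash[record["pw_hash"]]:
            if other["email"] != seed_email:
                linked.add(other["email"])
    return sorted(linked)


leads = pivot_email_to_emails(RECORDS, "suspect@example.com")
print(leads)
```

Every address in `leads` is only a lead. A very common password would link thousands of unrelated people, so the result still needs the verification steps discussed further down.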

A platform that has that methodology and data on offer is Constella Intelligence. It’s a US tech firm that works with law enforcement. But it also does a lot of cool stuff, such as partnering with cybercrime investigative journalists and organizations fighting cybercrime such as the Anti-Human Trafficking Intelligence Initiative and the UK’s Cyber Defence Alliance. Linking breach data effectively and in an automated fashion is, in my mind, without doubt one of the most powerful open-source intelligence approaches for person investigations at present. Its capabilities are vast. The ways to track and visualize the results are also important. In this first part, we will only scratch the surface and discuss some OSINT automation features.

Why automate at the beginning of investigations?

If you have worked with platforms that allow cross-referencing data, then you know that at the beginning of a search you aim to come up with questions to put to the data, and to reduce false positives.

A reliable data point appearing in a breach at the beginning is key. Perhaps an email address. It will allow automating “the pivoting”, meaning the crosschecking across dozens of billions of breach entries across the dark net.

In later parts, we will get into the nitty-gritty of cross-checking botnet addresses, hacked emails, machine IDs, IPs and more. But for now, we want to effectively produce reliable connections that we can use at the start of an investigation.

Quick disclaimer: Over the past weeks, I had the pleasure of using Constella's system. I tested it as a reviewer. It is a browser-based intelligence system. Because my answer on how investigative journalists might use it turned out to be more faceted than I thought, I decided to produce this short two-part OSINT training series. I am neither paid by the company, nor do I receive anything else beyond access to the platform for a few weeks. So, without further ado:

The Crypto Queen: Ruja Ignatova

The reward for Ignatova, aka the crypto queen, currently stands at 250,000 USD. The penalty facing the fugitive is roughly 90 years in prison for her Ponzi scheme. We turn to breach data and quickly arrive at a legitimate-looking email address of hers. After cross-verifying it with some other entries, we treat it as such. We pivot on a few data points associated with her email address and find links to the red-flag alert platform Bitcoinabuse.

In crypto investigations, I appreciate platforms such as Bitcoinabuse, which now partners with another platform, Chainabuse. Crypto addresses are mostly only numbers. The ledger contains no names. Platforms such as Bitcoinabuse can give users of crypto a hunch, or detailed indications, of a risk of fraudulent activity.

Network analysis of breach data on the crypto queen

An email leaked from BitcoinTalk.org, a crypto forum founded in 2009

Next comes the automation part. Constella Datalake lets you “generate an ID Graph” for this email address. It automates the search for common connections and even provides a score of maliciousness. That needs a bit more explanation. Maliciousness is rated by machine learning from 0 to 100: 0 means probably OK, 100 means definitely involved in malicious or illegal platforms.

Many of the platforms on the dark web are known to support transactions of criminals. Some feature both dodgy and harmless stuff. There, the maliciousness score is lower. I find it helpful to be alerted where connections or accounts might “stink”.

Below is an example of an “ID-Graph” for the respective email address, redacted of course. We could do this for any other indexed breach data point. For the start of an investigation, this seems immediately super helpful.
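How one might work with such a 0 to 100 score can be sketched in a few lines of Python. To be clear, this is not Constella's scoring and not its API. The `Hit` class, the source names, the scores and the threshold of 70 are all invented for illustration. The sketch only shows the triage step: sort the hits so the ones that “stink” the most get looked at first.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source: str  # name of the breach source (invented here)
    score: int   # maliciousness rating, 0 = probably OK, 100 = definitely malicious


def triage(hits, threshold=70):
    """Return hits at or above the threshold, highest score first.

    The threshold of 70 is an arbitrary assumption, not a platform default.
    """
    for hit in hits:
        if not 0 <= hit.score <= 100:
            raise ValueError(f"score out of range for {hit.source}: {hit.score}")
    flagged = [hit for hit in hits if hit.score >= threshold]
    return sorted(flagged, key=lambda hit: hit.score, reverse=True)


hits = [
    Hit("mainstream-payment-site", 35),
    Hit("carding-forum", 95),
    Hit("crypto-discussion-board", 72),
]

for hit in triage(hits):
    print(hit.source, hit.score)
```

A high score says something about the source, not about the person. An account on a flagged platform is a reason to look closer, nothing more.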

A high maliciousness score for breach entries associated with the cryptoqueen

“Jan Marsalek” “Wirecard”

Similarly, I wondered if breach data has something to offer on the international fugitive Jan Marsalek. We start from a confirmed breach data point, in this case his old Wirecard business email. If we dig a bit deeper, the email Jamba@XXX.XX in particular opens more avenues of investigation. That is because pivoting on the breached password (which, by the way, we cannot see on the platform) allows us to uncover more than 60 other email addresses linked to breach data of other platforms. We are still verifying, but it smells like a dead-strong indication that Mr Marsalek orchestrated something.

BR Reporting on Wirecard fugitive Marsalek — not a Martin Weissbruck but a Martin Weiss appears to have aided in the escape (BR, 6. September, 2022)

The interconnections with an email address featuring Martin WXXXXXXck or another email featuring the word “furball” (that raises some red flags on maliciousness) could also lead to further questions in the reconstruction of how and where Marsalek maintained operations online.

Quick explainer on the breach mentions of Moneybookers (today rebranded as Skrill) and similar platforms: Constella's system gives a score to the breach sources. Some of those sources are known to be used by criminals, too. That doesn't mean everyone who has an account there is per se a criminal. Instead, it gives investigators some guidance on how to use that data in the ongoing research.

Lesson learned from this section: generally, if a password pivots across only a few email addresses that, in addition, seem very similar in nature, then something smells. It’s something to watch out for. If you can, try to analyze each of the other email addresses, verify their origin, and check if any of their attributes link back to the original target. If you manage that, then you avoid the risk of a false positive match.

Also, ask yourself why, in 190 billion data points across 15 years' worth of breach data, only a handful of email addresses link to the password that your target used. In the case of Marsalek, there is little doubt now that his emails and credentials exposed a wide network of highly suspicious activity.
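The link-back check can be sketched in Python as follows. This is a simplified illustration with synthetic data. The attribute names (`phone`, `username`, `ip`) are assumptions about what a breach entry might contain, and real verification involves far more than exact string matches.

```python
# Invented profile of the original target, built from already verified entries.
TARGET = {
    "email": "target@example.com",
    "phone": "+70000000000",
    "username": "jdoe",
    "ip": "203.0.113.7",
}

# Invented entries found by pivoting on the shared password.
CANDIDATES = [
    {"email": "alt1@example.org", "phone": "+70000000000", "username": "other", "ip": "198.51.100.2"},
    {"email": "alt2@example.net", "phone": None, "username": "someone", "ip": "198.51.100.9"},
]

LINK_FIELDS = ("phone", "username", "ip")


def shared_attributes(target, candidate, fields=LINK_FIELDS):
    """List the fields where the candidate carries the same non-empty value as the target."""
    return [
        field
        for field in fields
        if target.get(field) and target.get(field) == candidate.get(field)
    ]


def link_back_report(target, candidates):
    """Map each candidate email to the attributes that tie it back to the target."""
    return {c["email"]: shared_attributes(target, c) for c in candidates}


report = link_back_report(TARGET, CANDIDATES)
for email, shared in report.items():
    status = "links back via " + ", ".join(shared) if shared else "no link back, treat as possible false positive"
    print(email, "->", status)
```

An empty list does not prove the address belongs to someone else. It only means the shared password is, so far, the sole connection, which is exactly the situation where a false positive is most likely.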

Investigating VIP72

VIP72 operated for more than a decade as a “cybercrime anonymity service”. In simple terms, it acted as a hiding place for many criminals and fraudsters. With VIP72, they were able to mask their true location online by routing their traffic through millions of malware-infected systems. In 2021, the forum shut its doors. Constella allowed us to show the redacted data map of breach data linking to VIP72. One pivot on a unique password shows a name and a residence in Hayward, California for a member of the VIP72 forum.

US citizen linking to fraudster forum VIP72

Investigating Stormfront-user’s link to British police

Stormfront is an Internet forum for right-wing extremists. Members also show up in breach data. In one instance, Constella research hints at how a Stormfront user might have applied for a job with the police in the UK.

Both accounts used the same password. Of course, if this were a real investigation, it would need to withstand multiple verification checks. It is also for this reason that we do not publish any more details. Data is only as good as the intelligence it provides. Breach data rarely provides full context. It requires further research.

Luiza Rozova (Луиза Розова)

In March 2022, the Yandex leak made it possible to track a food delivery to the daughter of Putin’s ex-lover Svetlana Krivonogikh. The breach contained the address where the delivery was dropped, according to Putin opposition writer and investigator Lyubov Sobol, who is on Alexei Navalny's team. The home address is a spacious 400-square-meter apartment in the center of Saint Petersburg.

Usually, we would not expose this kind of people investigation. Rozova herself gives no reason to open up her details online. But since the case has been widely discussed and reported, including on TG channels and by Bellingcat's Aric Toler, we can try cross-checking this information.

Searching for Rozova's name in Russian gives us a handful of results. None of the breach instances contain a physical address. There is an IP address that can be geolocated to St. Petersburg. We are also left with a Russian phone number. Such a number can often be a pretty decent link to identities. It consists of eleven digits but doesn't match a St. Petersburg area code (perhaps it is a mobile number).
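A quick sanity check of such a number can be sketched in Python. The sketch relies on a few facts about Russian numbering: a full number has eleven digits, starting with the country code 7 (or the domestic prefix 8), followed by a three-digit code, where 812 is St. Petersburg and codes starting with 9 are mobile. The function names and the example numbers are made up, and the classification is deliberately coarse.

```python
import re


def normalize_ru_number(raw):
    """Strip formatting and return the number as 11 digits starting with 7, or None."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 11 or digits[0] not in "78":
        return None
    # The domestic prefix 8 and the country code 7 refer to the same number.
    return "7" + digits[1:]


def classify_ru_number(raw):
    """Coarse classification by the three digits after the country code."""
    number = normalize_ru_number(raw)
    if number is None:
        return "invalid"
    code = number[1:4]
    if code == "812":
        return "St. Petersburg landline"
    if code.startswith("9"):
        return "mobile"
    return "other"


# Invented example numbers, not taken from any breach.
for raw in ["+7 (812) 000-00-00", "8 921 000 00 00", "12345"]:
    print(raw, "->", classify_ru_number(raw))
```

Normalizing first matters for pivoting, because the same number can appear as `+7…`, `8…` or with brackets and dashes depending on the breach it sits in.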

Now we pivot on that very number. We test in which other breaches it is present. We receive nine hits. Breaches include “cdek.ru”, a Russian courier delivery service for goods and documents. Two results in the breach data refer to Russia Database 11M. Armed with those, we crosscheck quality. One entry immediately states the address that we recognize from the OSINT blog posts of last year.

But there is more. One detail concerns her tech. She used two specific operating systems for her communication apps. Another detail: instead of her first name, the user of the matching entries called herself by a different name, perhaps to disguise her real name. We don't know. There are many more details, but we stop here. It’s dangerous and harmful to draw conclusions without context.

Case of Carsten L.

The identity of the presumed Russian-German agent is known: Carsten L. comes from the picturesque city of Weilheim on the Ammer, where he trained youth at a local football club and was known by some as a father figure, and referred to by others as “soldier”. Whatever that means. By now, you know the drill for a simple probe. We search, then we pivot. We find information that we can follow in more detail.

Suffice it to say that the exercise finds that even alleged spies don't get around without using online platforms. If any of us want to lead a life, then that means our data may end up online. We surrender personal data to platform providers every day: passwords, addresses, and whatever else you can imagine. If these providers are then so incompetent as to surrender their system data to perpetrators on the dark web, then even double agents have to be on their guard.

This was part 1, stay tuned for part 2….

A bit about Constella Intelligence Datalake

(I am not paid to advertise Constella, I am only testing it)

Constella is the next new kid on the block of professional OSINT platforms. Social media platforms such as Twitter and Facebook continue to curtail open-source intel with paywalls and ever tighter limits on scraping data. Add end-to-end encryption on the platforms and in the dark web, and it simply is getting harder to do #OSINT, one member of staff tells me.

How does the platform collect the breaches?

There are hundreds of billions of identity data points sitting online, thanks to a decade of data breaches, botnet leaks and pastes. Every day, websites, businesses and digital platforms all over the world get breached and customer data is spilled online. Every day, there are more scams and exposures affecting people. The data sits online, up for grabs. Constella takes it, cleans it, and feeds it to its database.

Likewise, they affect threat actors and criminals, too. “A treasure trove of identity data can be exposed in these same breaches”, explains Lindsay Whyte, who works for Constella Intelligence (and is a part-time “Hunter” on UK’s Channel 4 'Hunted'). They represent digital footprints and monikers that can’t be easily removed by threat actors wanting to cover their tracks.

This breached identity data can be harnessed for good, Whyte thinks. He says that OSINT movers and shakers like Michael Bazzell and Brian Krebs, some of the world’s leading cybercrime investigators, both endorse breach data as critical for digital investigations into threat actors and fraudulent businesses.

“Although breached credentials are a worrying entry point for cyber criminals and activists, it can actually be leveraged to give our intelligence analysts and strategists the upper hand”, says Whyte.

When breach data is normalized and processed, it provides a ‘super computer’ for investigators to uncover identities behind IP addresses, emails, names, addresses and phone numbers. The real identity of criminals can be unmasked…without (increasingly unreliable) social media or browser results, he says.

Why did we redact personal details in the screenshots? Anyone who doesn't fall under the journalistic public interest principle should not be named, nor should their information be stated.