How to automate leak data OSINT verification: I-Soon

Since February 16, the cybersecurity world has been in disarray. The leak on the hacking-for-hire firm I-Soon gives an unprecedented look inside the Chinese state-sponsored hacking sphere. Here are the most relevant aspects for OSINT investigators and, above all, how to automate leak data analysis with large language models (LLMs).

18 min read · Mar 1, 2024

Patient zero: the I-Soon data dump on GitHub, now taken off the platform. We give you the top OSINT takeaways on how to verify and work with the leak.

Dissatisfaction inside small tech companies is nothing new and rather common. But when a data leak of such international importance shakes the world of cybersecurity and politics, every breadcrumb counts.

I-Soon employee wxid_icges6alg8cl21 writes in March 2022: “…There is no salary increase for three years”. Is that, combined with long hours and questionable motives, enough of a reason for whistleblowing? Perhaps. Perhaps not. The fact remains that literally anyone could have dropped the leak on GitHub. The analysis suggests the documents were curated and cleaned. In the first seven days after the leak was discovered, several questions evolved. First it was: “Is this leak genuine?” Then: Is the company relevant? Then: Are the targets mentioned in the leak real, and were they hacked? Now: What can we learn from the novel approach researchers took to analyse the leak? This is where this post comes in. But first:

TL;DR on what threats the leak contains and what picture it draws of the company: I-Soon is a private hacker-for-hire company. Its offering includes Western-recognized cybersecurity and hacking tools. US cybersec giant CrowdStrike thinks I-Soon is basically Aquatic Panda, a highly aggressive advanced persistent threat (APT) actor that has committed many crimes and wreaked havoc since at least mid-2020.

BushidoToken, a cybersec expert from the UK, writes in his excellent blog post that the tools and campaigns run by their operators “highlight how both the Chinese MPS and Chinese Ministry of State Security (MSS) outsource their intelligence gathering to commercial surveillance vendors”.

He also concludes that this leak showed, like never before, how interconnected Chinese hacking campaigns are: “The links already uncovered between multiple long-running APT campaign and I-Soon as a single entity has essentially taken a hammer and smashed the notion of neatly defined “threat groups” conducting campaigns in a siloed manner”, he writes.

Another subsequent analysis of the leak, by Harfang Lab, a French cybersecurity outfit, concludes that the main takeaway, above all others, is how the analysis went. In the case of I-Soon, it has never been easier to analyze such a leak, “thanks to the progress of AI technologies”. At the end of this post, we go deep on what this means in practice. Especially relevant are the machine translations quickly provided by the community and the use of public large language models (LLMs) to transcribe, translate and summarize conversations and documents. I would even go as far as claiming that it’s a new era for leak data work.

OSINT Verification

Within hours of the discovery by Taiwanese IT security expert @AzakaSekai_ (his first post on Twitter received more than 15,000 reactions), crowds of cybersecurity OSINTers downloaded the data (from GitHub, now apparently disabled), machine-translated it, coordinated the analysis of the leak and shared first insights. I got in touch with @AzakaSekai_ early on Monday. He mentioned that, based on the number of chat logs and the sheer amount of information, it was highly “unlikely that this is staged”.

Just last November, I produced a lengthy guide for OSINT investigators on how to verify a leak such as the Vulkan Files. Little did I know that three months later we would apply it to what seems to be the Chinese equivalent of the Vulkan Files.

One item that AzakaSekai_ mentioned is indeed complexity. At 188MB, the I-Soon data dump is much smaller than other leaks. Most of the files are images, roughly 489 PNGs with no notable metadata. A check on metadata2go shows that the PNGs were “cleaned” and their file names carefully randomized or anonymized. There are also images, most likely from a smartphone, sandwiched in with screenshots from a personal computer and chat messages in Markdown format. So, first hint here: this likely didn’t come straight from a single device. Above all, it seems to be “curated” and “composed”. One expert we interviewed said that “one has to ask what isn’t in the data” and that this leak, at best, reflects “mere fragments”, however genuine it is.
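The metadata check described above can also be scripted instead of uploading files to a web service. The following is a minimal sketch using only the Python standard library; it is not the tool used for this article. It walks the chunk structure of a PNG and reports ancillary chunks that can carry metadata. The function names and the in-memory demo image are illustrative assumptions.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunk types that can carry metadata (text, EXIF, timestamps)
METADATA_CHUNKS = {"tEXt", "zTXt", "iTXt", "eXIf", "tIME"}


def list_png_chunks(raw: bytes) -> list[tuple[str, int]]:
    """Return (chunk_type, data_length) for every chunk in a PNG byte string."""
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(raw):
        length, ctype = struct.unpack(">I4s", raw[pos:pos + 8])
        chunks.append((ctype.decode("ascii"), length))
        pos += 8 + length + 4  # chunk header + data + CRC
    return chunks


def metadata_chunks(raw: bytes) -> list[str]:
    """Names of chunks that may hold metadata; an empty list hints at a cleaned file."""
    return [name for name, _ in list_png_chunks(raw) if name in METADATA_CHUNKS]


def _chunk(ctype: bytes, data: bytes) -> bytes:
    """Build one PNG chunk (only used to create the tiny demo image below)."""
    crc = zlib.crc32(ctype + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + ctype + data + struct.pack(">I", crc)


# Hypothetical 1x1 grayscale PNG with a single text chunk, built in memory
DEMO_PNG = (
    PNG_SIGNATURE
    + _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _chunk(b"tEXt", b"Software\x00demo")
    + _chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + _chunk(b"IEND", b"")
)
```

For a real file, read it with `open(path, "rb").read()` and pass the bytes to `metadata_chunks`. A file that only contains `IHDR`, `IDAT` and `IEND` carries no embedded text or EXIF data, which is consistent with a cleaned image but does not prove deliberate cleaning.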

Metadata check of PNG doc, “cleaned up” showing no further intel

The first impulse last Monday was to check whether the CEO of the company was mentioned somewhere. AzakaSekai_ pointed out that the CEO’s name, Wu Haibo (public registration data platform Dingtalk), was listed on the salary list and could be cross-checked with the public company registry in China.

OSINT Learnings for Investigative Work

To avoid repeating everything others have already said in their analyses, let’s concentrate on seven main takeaways.

1. Use OSINT on GitHub

The data dump occurred on GitHub, which makes it easy to leak data. It’s the world’s largest code repository, with millions of users and repositories covering a wide range of topics and industries, according to the OSINT guide by fellow ethical hacker and OSINTer @Cuncis. This makes the platform an excellent resource for conducting OSINT, he writes, and he lists GitHub Search, Gitrob, TruffleHog, GitDorker, Shhgit, GistSearch, GitHub Recon, GitGraber, GitHub CLI and GitLeaks as tools (you can find his post here).

Every user has an email address. With a bit of knowledge, it used to be easy to find it (link). That isn’t possible anymore for this GitHub user. Subsequently, we learned that on February 23, when much of the newspaper coverage came out (including ours), GitHub disabled the I-Soon leak’s repo page.

Searching on GitHub is possible via the in-platform search: “https://github.com/search?q=i-soon&type=repositories”. There are types of content on GitHub other than repos, but let’s start here. The search unearths one Axun account with 51 followers, the cybersec profile D0g3-Lab, based in China’s state hacker capital Chengdu. We also see code for Axun’s CTF (Capture the Flag events; if you are not familiar, basically hackathons for hackers) for every year from 2018 to 2023 (updated on December 25), suggesting that I-Soon employees might have used GitHub in the past. The last CTF was an international one, and some colleagues mentioned it in their posts (link).
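The same repository search can be repeated programmatically, which helps when you want to re-run it later and compare results. Below is a minimal sketch against GitHub's public REST search endpoint using only the standard library. The helper names and the `User-Agent` string are illustrative assumptions, and unauthenticated requests are rate limited.

```python
import json
import urllib.parse
import urllib.request

API = "https://api.github.com/search/repositories"


def build_search_url(query: str, per_page: int = 30) -> str:
    """Build the REST search URL for a repository query."""
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{API}?{params}"


def search_repositories(query: str) -> list[dict]:
    """Run the search (network call) and keep a few fields useful for triage."""
    request = urllib.request.Request(
        build_search_url(query),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "osint-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [
        {
            "full_name": repo["full_name"],
            "updated_at": repo["updated_at"],
            "url": repo["html_url"],
        }
        for repo in payload.get("items", [])
    ]
```

Calling `search_repositories("i-soon")` returns one dictionary per hit. Since repositories can be taken down, as happened here, it is worth saving the raw output together with the date of the query.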

GitHub post by cybersec analysts on *D0g3-Lab*, presumably an I-Soon employee’s GitHub account: Malware and ransomware found.

The search on GitHub also brought other things to light. Since the leak is public, its contents confirmed and the hacking nature of I-Soon revealed, members of the GitHub community have started creating blacklists of I-Soon’s relevant domains. dns-blocklists started this and added, among others, the following domains:

i-soon.net

i-soon.com.cn (IP Address: 112.126.82.38)
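Once such a list exists, checking whether a host seen elsewhere in the leak falls under it is a few lines of code. This is a minimal sketch; the hosts-style input format is an assumption about how a blocklist might be laid out, not a description of the dns-blocklists project, and the function names are illustrative.

```python
def parse_blocklist(text: str) -> set[str]:
    """Parse a blocklist in plain ("domain") or hosts ("0.0.0.0 domain") format."""
    domains = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        domains.add(line.split()[-1].lower().rstrip("."))
    return domains


def is_blocked(host: str, blocklist: set[str]) -> bool:
    """True if the host itself or any of its parent domains is on the list."""
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


# Illustrative input built from the two domains named above
SAMPLE_BLOCKLIST = """
# I-Soon related entries
0.0.0.0 i-soon.net
i-soon.com.cn
"""
```

Matching on parent domains means that a subdomain such as `www.i-soon.net` is caught as well, while an unrelated host under the same public suffix is not.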

2. The Company

Once we have some URLs, we can do some website digging. Some reporting has already been done in the NATTO blog post published in October 2023, which takes a deep dive into the relationship to APT41. Let’s concentrate on what techniques there are to assess the company network, investors, funding, etc.

At first, the company appears small and isolated. But thanks to Chinese bureaucratic company registries, including Dingtalk, pitchhub.36kr.com, datauseful.com, and QQC.com, it’s possible to draw a company network of entities, some of them founded or funded by Wu Haibo, once a hacker and now a big-shot businessman. Next to Wu Haibo, three other partners keep showing up in the registry data: Li Ping, Chen Cheng, and Yuan Jie.

There is also some PR material on the company on Pitchhub from September 2022, which mentioned the Ministry of Public Security (MPS) as a partner, targeting a “cross-border gambling group”:

The established links to the MPS are essential because they tie the company to operations on public and political security. It is also no secret that, as early as 2019, the company obtained certified supplier status from the MPS’s Security and Defense Bureau, which allows it to partner with government and security agencies on high-level security work.

In 2020, I-Soon received clearance for “Class II secrecy qualification for weapons and equipment research and production company”, “武器装备科研生产单位二级保密资格”, certified by the Ministry of Industry and Information Technology (MIIT).

Other certifications are listed on DingTalk, including “Information security service qualification certification” but also fairly uncomplicated ISO9000.

On the company registry Dingtalk, there are also traces of a project in Aksu, Xinjiang, for which I-Soon Shanghai was shortlisted. To understand the subsidiary structure, take into account the following network, extracted from online registry information:

With the help of the business registry data, we can also identify individuals listed as “confidential personnel” in the data (6d7fc7b3-c892-4cb5-bd4b-a5713c089d88_0.png).

Chen Chen and a bunch of other candidates that show up in the leak data

Dingtalk is also a useful resource for information on intellectual property (IP). The relevant tab lists at least 64 entries around trademark and IP related items, some registered and granted, some not. There are some interesting details to cross-check with the leak data.

The patents listed under I-Soon’s profile are also telling. They include a “Wifias proximity attack method”, “An anonymous anti-tracing method and system based on blockchain networking” (hence the hiring of blockchain skills, further down) and a “Non-intrusive website remote detection system and detection method”. Also worth mentioning are the software copyrights listed here. There is one for a “Gmail email forensics platform”. In the leak we have a big data platform for analyzing email account information. Funnily, there is one entry for “Software for efficient processing of leaked data from multiple sources”. There is also a “Divine operator password cracking platform software” that appears ominous. There could be some trouble with translation or misinterpretation, but with words in the titles such as “WiFi proximity attack system”, “Soldier Toolbox Software” or “forensics platform”, one can see, without being a cybersecurity expert, that this company has for years had the profile of a firm that could potentially contribute to building tools for offensive warfare.

The company is also growing, as HR postings show. In this case it is looking for at least five more employees, all with technical development skills.

HR post by I-Soon in Sichuan, released end of September 2022

In a 2019 press release on a technical university campus site, the company attempts to fill 33 positions. Such campus recruitments seem very common. Besides listing what they are looking for, which in turn might hint at what they were gearing up for (Big data engineer, Safety research engineer, Penetration testing engineer, Security service engineer, Code audit engineer and lots of sales staff, to sell it all), they are particularly keen on offering benefits: “Good working atmosphere and development space. Five insurances and one housing fund: Parents can feel at ease”. In another recruitment post on LinkedIn, the company hires people with “English reading ability.”

Best to search with Google Lens images and Baidu Maps

Geolocating the address on Baidu Maps isn’t as comfortable as it is with a US address on Google Maps. The Sichuan branch officially operates out of “Building B, Cuifeng International, №366 Baicao Road, High-tech West District, Chengdu, Sichuan Province”.

3. Did others have I-Soon on their radar? Web Archive

When companies move into the crosshairs of investigators, and perhaps of commercial firms and laymen snooping around, the web archive records of the firm’s website may blow up. For I-Soon, web archive records ramped up on Feb 18, after the GitHub data dump was discovered.

But we see entries as early as 2010, when the company was founded (funnily, using a webpage theme by an American web designer called nickifaulk), and in 2016 and 2019, when the company was on someone’s radar. Some automated analytics might reveal sites that one should have on one’s radar. In comparison, the NATTO blog post was only archived after the data leaked.
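One way to automate this kind of check is the Wayback Machine's CDX server, which returns the capture history of a URL as JSON. The sketch below builds the query URL and counts captures per month, so a sudden spike in archiving interest becomes visible. The helper names are illustrative, and the sample rows are invented placeholders in the CDX output shape, not real capture data for I-Soon.

```python
import urllib.parse
from collections import Counter

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain: str, date_from: str = "", date_to: str = "") -> str:
    """Build a CDX query that returns timestamp, original URL and status as JSON."""
    params = {"url": domain, "output": "json", "fl": "timestamp,original,statuscode"}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return f"{CDX_ENDPOINT}?{urllib.parse.urlencode(params)}"


def captures_per_month(rows: list[list[str]]) -> Counter:
    """Count captures per YYYYMM; the first CDX JSON row is the header."""
    header, *records = rows
    ts_index = header.index("timestamp")
    return Counter(record[ts_index][:6] for record in records)


# Invented placeholder rows that mimic the CDX JSON layout
SAMPLE_ROWS = [
    ["timestamp", "original", "statuscode"],
    ["20100312000000", "http://example.org/", "200"],
    ["20240218101500", "http://example.org/", "200"],
    ["20240218113000", "http://example.org/", "200"],
    ["20240219090000", "http://example.org/", "200"],
]
```

Fetching `build_cdx_url("i-soon.net")` with any HTTP client and passing the decoded JSON to `captures_per_month` gives the real distribution; months with unusually many captures are the ones worth looking at.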

4. Chats

The chats are an incredible resource to understand the vibe inside this little company. The company officially helps the government to fight gambling crime with something the leaked documents call the Falcon anti-gambling platform, “providing comprehensive gambling-related user data, provides law enforcement with a large amount of gambling-related PII and regularly updates the platform to ensure the accuracy […] of the gambling data”.

The chats reveal that staff are not opposed to gambling, as one exchange shows: “Play Mahjong”, “I just lost 1300 yesterday (angry)”, “Let’s lose some more today”. Such detail is of prime interest for judging the people behind the state hacking and their morals.

Also interesting is the question of whether the chats show any conscience or remorse among employees regarding the state-funded hacking attacks.

Most of the chats are between Shutd0wn and Lengmo, searchable here on GitHub

The first challenge with such a leak, which contains Markdown files in Chinese, is to translate it. There were some initial instances of users auto-translating the files. Michael Taggart, aka mttaggart on GitHub, provided one such translation. He included the original and the English translation in a neat structure.

One of the best (and quickest) data analyses came from the account soufianetahiri, who looked at the metadata of the chat messages (above). Chat senders are only revealed by their aliases. This is weird. Why would they maintain aliases internally? Did I-Soon have a security protocol that required using aliases? We don’t know. But it was possible to unmask some of the senders behind those usernames.

Overview of messengers Shutd0wn and Lengmo:

@Soufianetahiri found the most common chat senders and receivers. “lengmo” accounted for the bulk of messages, with 4981 records, followed by “Shutd0wn”, who sent 4635 chat messages to lengmo. Now who is who? Soufianetahiri concludes that both must fill some sort of leadership role: “…potentially indicating a key relationship or hierarchy within the group. lengmo’s high level of activity could suggest a leadership or central role in the conversation dynamics”.
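A count like the one above is straightforward to reproduce once the chat logs are parsed into records. The following is a minimal sketch, not Soufianetahiri's code; the record keys `sender` and `receiver` and the sample messages are hypothetical and have to be adapted to however the Markdown chat logs were parsed.

```python
from collections import Counter


def top_senders(messages: list[dict], n: int = 5) -> list[tuple[str, int]]:
    """Most active senders by number of messages."""
    return Counter(m["sender"] for m in messages).most_common(n)


def top_pairs(messages: list[dict], n: int = 5) -> list[tuple[tuple[str, str], int]]:
    """Most frequent (sender, receiver) pairs, which hints at key relationships."""
    return Counter((m["sender"], m["receiver"]) for m in messages).most_common(n)


# Hypothetical records for illustration only
SAMPLE_MESSAGES = [
    {"sender": "lengmo", "receiver": "Shutd0wn"},
    {"sender": "lengmo", "receiver": "Shutd0wn"},
    {"sender": "lengmo", "receiver": "Shutd0wn"},
    {"sender": "Shutd0wn", "receiver": "lengmo"},
    {"sender": "Shutd0wn", "receiver": "lengmo"},
    {"sender": "other", "receiver": "lengmo"},
]
```

Counting directed pairs rather than single senders is a deliberate choice here: it separates someone who broadcasts to many people from someone who mostly talks to one counterpart.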

Shutd0wn was identified as the CEO of I-Soon, Wu Haibo, partly because there was a WeChat account, which later disappeared, and a Skype account. Wu Haibo also used his personal email address, shutdown@139.com, which again points to the alias, to register I-Soon’s website i-soon[.]net in 2010, according to Nathan Patin in his Twitter post.

Baptiste Robert made an interesting discovery on LinkedIn, where he found a profile linked to lengmo, though it may have confused more than it initially helped.

Next to Wu, the most visible presence is the alias “Lengmo”, possibly co-founder and COO Chen Cheng. He is also listed as a shareholder of the parent company I-Soon Shanghai and is a minor shareholder of “Shanghai Nacan Information Technology Partnership”, of which Wu is also a representative.

WHOIS registry data is fruitful here too: @fs0c131y noted that lengmo created http://crst.com[.]cn, a website that appears to be affiliated with the Chinese hacking group C.Rufus Security Team (CRST). In an early WHOIS record for his personal blog, he listed CRST.

In the chats, Lengmo is the one expressing sincere doubt about his value to the company. Lengmo writes about his dissatisfaction:

For verification, others analyzed when the chats took place. The chat metadata includes timestamps, which suggest that sending and receiving peaked in the middle of the night, between 2am and 3am, as well as between 8am and 10am, with the lowest traffic at 4pm, Sichuan time. The morning traffic is normal. The midnight chats are not. They could mean that someone was overseas, or that founder Wu Haibo made a habit of working in the middle of the night, assuming that his counterpart was also awake.
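The peak-hour reading depends heavily on which time zone the logged timestamps are in, so it is worth making that assumption explicit in code. Here is a minimal sketch that buckets timestamps by hour in China Standard Time (UTC+8). The timestamp format and the function name are assumptions; the real chat export may look different.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# China uses a single time zone, UTC+8, without daylight saving time
CHINA_TZ = timezone(timedelta(hours=8))


def hourly_histogram(timestamps: list[str], source_tz: timezone) -> Counter:
    """Count messages per hour of day in China time.

    source_tz is the zone the raw timestamps are assumed to be recorded in.
    """
    histogram = Counter()
    for raw in timestamps:
        parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(tzinfo=source_tz)
        histogram[parsed.astimezone(CHINA_TZ).hour] += 1
    return histogram
```

Running the same data once with `source_tz=timezone.utc` and once with `source_tz=CHINA_TZ` shows how far the peaks move. A night-time peak that turns into a normal working-hours peak under the other assumption would weaken the "someone is overseas" hypothesis considerably.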

5. PNGs

PNGs in leak documents are a nightmare to deal with. But most of the content in this leak consists of screenshots and images. Thomas Roccia (@fr0gger_) found 489 PNGs. First, they have to be OCRed (Optical Character Recognition turns pixel images into text using machine learning), then translated, then made searchable. One early attempt came from the GitHub account Soufianetahiri, who exported a JSON file with machine-generated OCR translations of all PNGs.

On the OCR side, Soufianetahiri shared their approach on GitHub. It consists of a function, ocr_to_json, that uses the Python packages pytesseract and Image (from Pillow) to recognise the Chinese text in the entire leak folder (anxun_leak/I-S00N/0) and write it out to a JSON file. Pretty neat!
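To give an idea of what such a function can look like, here is a minimal sketch written for this article. It is not Soufianetahiri's original code. It assumes that Tesseract with the simplified Chinese language pack (`chi_sim`) plus the `pytesseract` and `Pillow` packages are installed; the function names and the output layout are illustrative. The OCR function is passed in as a parameter so that the folder logic can be tried without the OCR stack.

```python
import json
import tempfile
from pathlib import Path


def tesseract_ocr(path: Path) -> str:
    """OCR one image as simplified Chinese (needs Tesseract, pytesseract, Pillow)."""
    # Imported lazily so the rest of the sketch runs without the OCR stack
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang="chi_sim")


def ocr_folder_to_json(folder, out_file, ocr=tesseract_ocr) -> list[dict]:
    """OCR every PNG in a folder and write the results to one JSON file."""
    records = []
    for png in sorted(Path(folder).glob("*.png")):
        records.append({"file_name": png.name, "text": ocr(png).strip()})
    Path(out_file).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records


def demo() -> list[dict]:
    """Exercise the folder logic on empty placeholder files with a fake OCR function."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("b.png", "a.png"):
            (Path(tmp) / name).touch()
        out = Path(tmp) / "ocr.json"
        ocr_folder_to_json(tmp, out, ocr=lambda p: f" text of {p.name} ")
        return json.loads(out.read_text(encoding="utf-8"))
```

Writing with `ensure_ascii=False` keeps the Chinese characters readable in the JSON file, which makes manual spot checks of the OCR quality much easier before anything is sent to a translation step.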

6. Automated summary: Using generative AI

But Thomas Roccia stepped up the game of leak analysis by leveraging AI to summarise leak data from PNGs in a foreign language. That’s impressive. He applied data science techniques to analyze data in PNG format and in the Chinese language, concerning a leak related to a government contractor with offensive capabilities. Roccia also explains the purpose behind the use of AI: generative AI helps to summarize some of the information available in the leak, and he “finally created a RAG (Retrieval Augmented Generation) to enable the exploration of and specific data requests without manually digging through the vast amount of information”.

A RAG does the following: it’s a technique that enhances language model generation by incorporating external knowledge, as Tejpal Kumawat explains here. It retrieves relevant information from a large corpus of documents, in our case OCRed and translated PNGs, and uses that information to inform the generation process.
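The retrieval half of that idea can be illustrated without any external service. The sketch below is a deliberately simplified stand-in: it ranks documents by bag-of-words cosine similarity instead of the embedding search a vector database would provide, and then assembles a prompt from the best matches. All names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; good enough for English translations."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[dict], k: int = 2) -> list[dict]:
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(query, Counter(tokenize(d["text"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[dict], k: int = 2) -> str:
    """Put the retrieved passages in front of the question for the language model."""
    context = "\n\n".join(
        f"[{d['file_name']}] {d['text']}" for d in retrieve(question, documents, k)
    )
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical translated snippets for illustration only
SAMPLE_DOCS = [
    {"file_name": "a.png", "text": "salary discussion in chat"},
    {"file_name": "b.png", "text": "countries listed as targets"},
    {"file_name": "c.png", "text": "product white paper for email platform"},
]
```

Because each retrieved passage keeps its file name, an answer produced from the prompt can be traced back to the original PNG, which matters for the manual verification step.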

Roccia warns of caveats when summarizing data like this, namely that there are “…possible inaccuracies in the translation or limitations inherent to LLM technologies”. After reading the leak’s PNG contents into a JSON file, he uses generative AI (GPT-4) to summarise the data with the following piece of code (mind the messages):

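import json
import os

from openai import OpenAI

# Load the OCRed, translated PNG contents. The file name is an assumption; use your own export.
with open("translated_texts.json", encoding="utf-8") as f:
    data = json.load(f)

file_name = data[0]['file_name']  # Adjusted to access the first item in the list
translated_text = data[0]['translated_text']

# Assumed definition (not shown in the original snippet): all translations joined into one string
all_translated_texts = " ".join(item['translated_text'] for item in data)

# The client reads the API key from the OPENAI_API_KEY environment variable
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running")

client = OpenAI()

completion = client.chat.completions.create(
  model="gpt-4-0125-preview",
  max_tokens=4096,
  messages=[
    {"role": "system", "content": "You are a Cyber Threat Intelligence analyst specialized in China operation. You are dedicated to analyzing leaked sensitive information in relation to Chinese espionage capabilities. The data contains multiple format documents, chat conversion, screenshot of products."},
    # Only the first 10,000 characters are sent to stay within the context limit
    {"role": "user", "content": f"Make me a summary of this information: {all_translated_texts[:10000]}" }
  ]
)

print(completion.choices[0].message.content)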
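Note that the snippet above only sends the first 10,000 characters of the translated text to the model, so most of the leak is never seen by that single summary. One common workaround, sketched here as an assumption and not as part of Roccia's notebook, is to split the text into overlapping chunks, summarise each chunk, and then summarise the summaries. The helper below only does the splitting; the name and default sizes are illustrative.

```python
def chunk_text(text: str, size: int = 10000, overlap: int = 500) -> list[str]:
    """Split text into chunks of at most `size` characters that overlap by `overlap`."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk reached the end of the text
        start += size - overlap
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both chunks, at the price of some duplicated tokens. Splitting on characters is crude; for chat logs, splitting on message boundaries would likely give cleaner summaries.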

Roccia shows how to build a Retrieval Augmented Generation (RAG) pipeline with a database approach in mind, at one stage using ChromaDB and LangChain, which make it possible to construct LLM-powered apps. To the question “Can you give me details about intelligence capabilities from this data leak?” he receives the following output.

He then works on the assumption that the documents could contain specific cybersecurity targets and pivots to the following prompt: “Which countries might be a target according to the documents?”.

You can find Roccia’s Jupyter Notebook with all the packages and code here or check out his thread on X.

7. Hacker Group Connections: IP addresses and domains

Finally, and for cybersec experts perhaps most important, come the connections to threat actors (APTs). Here the leak kept on giving. By analyzing the IP addresses or domains mentioned in chat logs or in images of advertising material, loose ties to APTs could be established. A connection was claimed to the Fishmonger APT, as it used Winnti and ShadowPad malware. There is a connection to Earth Lusca (the same as AQUATIC PANDA). An employee apparently shared credentials to a machine whose IP address matches an entry on a blacklist.
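Matching addresses from chat logs against known indicator lists is a step that lends itself to automation. The following is a minimal sketch using the standard library; the function names are illustrative, and the usage example relies on documentation-only IP ranges rather than real indicators from the leak.

```python
import ipaddress
import re

# Four dot-separated groups of 1 to 3 digits; validity is checked afterwards
IP_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text: str) -> set[str]:
    """Extract valid IPv4 addresses from free text, including defanged ones like 1.2.3[.]4."""
    found = set()
    for candidate in IP_CANDIDATE.findall(text.replace("[.]", ".")):
        try:
            found.add(str(ipaddress.ip_address(candidate)))
        except ValueError:
            continue  # looks like an IP but is not one, e.g. 999.1.1.1
    return found


def match_blacklist(text: str, blacklist: set[str]) -> set[str]:
    """Addresses that appear both in the text and on the blacklist."""
    return extract_ips(text) & set(blacklist)
```

As a usage example, `match_blacklist("login at 203.0.113[.]7, backup 192.0.2.10", {"203.0.113.7"})` returns `{"203.0.113.7"}`. A hit like this is only a lead: IP addresses get reassigned, so the date of the chat and the date of the blacklist entry need to be compared by hand.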

One IP address matched those used in attacks that took place between November 2018 and May 2019, when senior members of Tibetan groups were sent malicious links in socially engineered WhatsApp messages, with operators posing as NGO workers, journalists, and other fake personas. Harfang Lab points out that a screenshot of the Linux RAT (12756724-394c-4576-b373-7c53f1abbd94_17.png) shows the internal name “Treadstone”, a Winnti controller mentioned in the FBI’s 2020 indictment of Chengdu 404. There are also chats between Lengmo and Shutd0wn that show a connection to Chengdu 404.

From the indictment of Chinese APT41 hackers

To conclude: to link to threat actors, it’s not only useful to check IP addresses and domain names. Sometimes the link is buried in the discussions and debates between employees. That is what makes the large language model analysis powerful. In the case of Treadstone, it helped to analyze the programs used in the demos. In short, much can be automated, but the forensic part should be done by hand.

Targets

The offering of I-Soon is vast and covers several “verticals”, ranging from PC permissions for government personnel and file server permissions, to power networks and medical networks (public infrastructure), to national surveillance (Xinjiang, HK and Taiwan related), gambling and pyramid scheme websites. Harfang Lab points out that a cooperation proposal with the Bazhou (Xinjiang) province security bureau stated that, based on I-Soon’s APT work over more than ten years, the company has controlled various types of server permissions and intranet permissions in multiple countries.

Certainly worrying are some of the unconfirmed targets dropped in there. One list, the Markdown file 1cdc26f-e773-4ad7-8808-d04abf16aae7.md, contains at least 78 target names with 50 domain addresses, all labelled as “targets” (目标名称). Of course, there could be an issue with translation. But given the row labelled, in translation, “sample data size”, sometimes listing more than 600GB of data and in one instance 2 terabytes, there is a high chance that this data represents hacked loot from targets. This is substantiated by the category filetype.
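A target list in Markdown table form can be parsed and totalled with a few lines of code, which makes it easier to sanity-check claims about data volume. This is a minimal sketch: the column names and the rows in the sample table are placeholders invented for illustration, not content from the leaked file, and only sizes written like `600GB` or `2 TB` are recognised.

```python
import re

# Conversion factors to gigabytes (decimal units)
UNIT_TO_GB = {"KB": 1e-6, "MB": 1e-3, "GB": 1.0, "TB": 1000.0}


def parse_markdown_table(markdown: str) -> list[dict]:
    """Turn a pipe-delimited Markdown table into a list of row dictionaries."""
    header = None
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # separator row such as |---|---|
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def size_in_gb(value: str):
    """Convert strings like '600GB' or '2 TB' to gigabytes; None if not parseable."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]B)\s*", value, re.IGNORECASE)
    if not match:
        return None
    return float(match.group(1)) * UNIT_TO_GB[match.group(2).upper()]


def total_gb(rows: list[dict], column: str) -> float:
    """Sum all parseable sizes in the given column."""
    sizes = (size_in_gb(row.get(column, "")) for row in rows)
    return sum(size for size in sizes if size is not None)


# Placeholder table for illustration only
SAMPLE_TABLE = """
| target | domain | sample data size |
|---|---|---|
| Placeholder Ministry | example.org | 600GB |
| Placeholder Telecom | example.net | 2TB |
| Placeholder Hospital | example.com | unknown |
"""
```

Rows whose size cannot be parsed are skipped rather than guessed, so the total is a lower bound. It is worth printing the skipped rows as well, since unusual notations in a machine translation are exactly where errors hide.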

A quick summary of the data collected by the GitHub account (data here) identifies the main targets by sheer frequency.

The target list includes pension funds, hospitals, police stations. In total out of this file, there is “sample data” worth (everything has been leaked, and even translated).

OCRed and translated by Google Lens/Google Translate

To fact-check some of the email addresses contained in the list, let us pull out one of them. Take the email “maXXXXX@kkr.gov.my” (intentionally redacted). Using the Epieos email OSINT reverse search tool, we can find the full name of the person who owns the email address. His name is MXXXX BXXXX BXXXX, and he is listed in the directory of the Kementerian Kerja Raya of Malaysia, where he works at the “Policy and International Division”. With that, we can be sure that the (possible) victim fits the bill.

The cybersec account @stealthmole_int worked out several highlights in the data leak and visualized them. Essentially, it worked out the target countries mentioned in I-Soon’s PNGs and how they link to each other. They visualized the connections with a platform I didn’t know before (platform.stealthmole). I think this is a fairly good way to assess which nations the work of I-Soon’s contractors may have concentrated on. We find the most connections for…

1. MN (Mongolia): suffering instances of oppression of the Chinese surveillance apparatus (WAP, 2023): Doc, Doc
2. TH (Thailand): see the study on the “expanding sphere of influence wielded by the People’s Republic of China (PRC) within Thailand” (2023, Air Force University). 0-c5f1d959-39d1-4176-9cb1-1fb6e8baedc3.png: Thailand’s Digital Government Development Agency; 07f179c5-5705-4dbd-94a7-66eed1e066b0_2.png; Doc, Doc, Doc with Thai Railway
3. VN (Vietnam): 0-6848748d-2881-4c26-b153-fcd5373d2f1c.png with Vietnam Airlines; dbc9c90e-a3e6-4d71-bb93-5fb8394095ac_0.png: Vietnamese Ministry of Economy
4. HK (Hong Kong): I-Soon claims to have secured back-end access to higher education institutions in Hong Kong and self-ruled Taiwan, which China claims as part of its territory, according to Guardian coverage (178e3898-903d-47cf-bfbe-061e7dc18895_5.png)

From the Leak: “Scientific Internet Box-Desktop Version Product White Paper”

5. India: I-Soon mentioned the government of India, a geopolitical rival of China, as a key target for infiltration. In 2022, new reports emerged that China had targeted India’s public infrastructure.

32eb7662-f212-4811-a7c1-1cfeb121cd99.png
48fd4c79-41ca-459e-a5a5-a3738e7a4af3_0.png: “BSNL Operator - Airindia Airlines has 100,000+ daily check-in users”
64bba692-d430-440c-9f1e-2575f45770af_6.png: “Electric power network @ Involving India, Nepal, and Tibet”
64bba692-d430-440c-9f1e-2575f45770af_10.png: “…With a professional APT penetration research team, rich APT penetration experience and mature APT implementation process, we are oriented to domestic public security Department, based on the business needs of the public security department, performs APT implementation tasks on specific targets to obtain key intelligence data on specific targets….hit on…involved in India, involved in Nepal, involved in Tibet”
64bba692-d430-440c-9f1e-2575f45770af_11.png: Intelligence Services-India-related
178e3898-903d-47cf-bfbe-061e7dc18895_5.png: Automatic switching
5387a301-0af8-4e24-a197-20189f87b9ef_10.png: “provide information on the direction of India”
eda5b003-9250-4913-b724-74cca86240af_7.png: Data of Indians

I-Soon: Advertising on hacking key targets in India

Hot tip on how to search: Go to the OCRed, English-translated .json file by Anxun-isoon/OCRd_images and use Ctrl-F to search for keywords. Then pick the file and find it in the mttaggart/I-S00N GitHub repo.
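The same keyword search can be scripted, which pays off when you want to run a whole list of keywords at once. The sketch below assumes the JSON file is a list of records with the keys `file_name` and `translated_text`, as in the summarisation snippet earlier; the function names and the sample records are illustrative.

```python
def search_records(records: list[dict], keyword: str) -> list[str]:
    """File names of all records whose translated text contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [
        record["file_name"]
        for record in records
        if needle in record.get("translated_text", "").casefold()
    ]


def search_many(records: list[dict], keywords: list[str]) -> dict[str, list[str]]:
    """Run several keyword searches and map each keyword to its matching files."""
    return {keyword: search_records(records, keyword) for keyword in keywords}


# Hypothetical records for illustration only
SAMPLE_RECORDS = [
    {"file_name": "one.png", "translated_text": "Electric power network involving India"},
    {"file_name": "two.png", "translated_text": "Product white paper"},
    {"file_name": "three.png", "translated_text": "Data of Indians"},
]
```

Load the real file with `json.load` and pass the list to `search_many`. Plain substring matching also catches partial words (`india` matches `Indians`), which is useful for recall but means every hit still needs to be read in context.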

Document 0-32eb7662-f212-4811-a7c1-1cfeb121cd99.png, for instance, picks out a dozen countries. However, it fails to mention what purpose these targets fulfill. Have they been hacked? Or is this a kind of wish list? We don’t know.

For Europe, Harfang Lab picked out the targets and categorised them:

I-Soon’s web presence also shows a page themed “international”. By that, the company means helping companies and China build up domestic network security and resistance against international threats (archive link here).

There is no forensic evidence yet on who leaked it. Also, as experts suggest, such an exposure won’t necessarily hurt hacking groups and hacking-for-hire companies. Chengdu 404 still operated years after being exposed. But what the leak unmistakably shows is low morale among employees. It can be read between the lines of the leak.