Large-Scale Online Deanonymization with LLMs

(simonlermen.substack.com)

by DalasNoin

Pdf: https://arxiv.org/pdf/2602.16800 (via https://arxiv.org/abs/2602.16800)

john_strinlai

many people tend to overlook how little information is needed for successful de-anonymization.

i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

and that was nearly 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside massive growth in the technology that enables them.
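To make the quoted claim concrete, here is a toy sketch of record matching on a sparse ratings table. It is not the paper's actual algorithm, just a simplified stand-in: the function names, the `tolerance` and `min_gap` parameters, and the data are all made up for illustration.

```python
def match_score(known, record, tolerance=1):
    """Count the adversary's known ratings that the record roughly agrees with."""
    return sum(
        1
        for movie, rating in known.items()
        if movie in record and abs(record[movie] - rating) <= tolerance
    )


def best_match(known, dataset, min_gap=2):
    """Return the record id that clearly beats the runner-up, else None."""
    scores = sorted(
        ((match_score(known, rec), rid) for rid, rec in dataset.items()),
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return top_id if top_score - runner_up >= min_gap else None


# Hypothetical "anonymized" dataset: record id -> {movie: rating}.
dataset = {
    "r1": {"A": 5, "B": 1, "C": 4},
    "r2": {"A": 3, "D": 2},
    "r3": {"B": 5, "C": 1},
}

# What the adversary happens to know about one person (slightly noisy).
known = {"A": 5, "B": 2, "C": 4}
print(best_match(known, dataset))
```

Requiring a gap over the runner-up is what keeps a single shared movie from producing a confident (and possibly wrong) match.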

i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and im not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.

txrx0000

We don't need everyone to be completely anonymous to state and corporate actors. We just need to make it so that they can't identify and surveil everyone at once, because it would be too expensive.

The US defense budget is about $1T. They can't spend it all on surveillance, but let's say tech companies + gov spend about this amount per year on surveillance in total. If we can raise the cost to surveil the average person to over $10K/yr, they just lose. This is very doable.

Every little precaution you take will raise the cost, probably more than you think. Every open-source project that aims to anonymize and decentralize is an arrow in their knee. They're hoping that you'll get cynical and stop trying because they don't stand a chance otherwise.
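To put rough numbers on the cost argument above, here is a back-of-the-envelope sketch. It assumes a fixed $1T annual budget and a flat per-person cost, which are both simplifications, and the US population figure is my own rough assumption rather than something from the thread.

```python
ANNUAL_BUDGET_USD = 1_000_000_000_000  # the ~$1T/yr figure from the comment above
US_POPULATION = 335_000_000  # assumption: rough US population, not from the thread


def people_coverable(budget_usd, cost_per_person_usd):
    """How many people a fixed budget can surveil at a flat per-person cost."""
    return budget_usd // cost_per_person_usd


# A few hypothetical per-person costs, from cheap to expensive.
for cost in (100, 1_000, 10_000, 100_000):
    n = people_coverable(ANNUAL_BUDGET_USD, cost)
    share = min(1.0, n / US_POPULATION)
    print(f"${cost:>7,}/person/yr -> {n:>14,} people ({share:.0%} of the US)")
```

At $10K per person the budget covers about 100 million people, which is under a third of the assumed US population. At $1K or below it covers everyone.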

Taek

Unfortunately the cost for this stuff is going down. Cheaper to collect information, cheaper to store it, cheaper compute, and better algorithms that mean you need fewer resources.

If the cost to surveil the population is $10k per capita today, it'll be $1k in a few years and $100 a few years after that.

This is a war that can't be won, it's just part of the changing landscape of technology in the information era.

txrx0000

I don't think the cost has been going down or will continue to trend downward long term. You're assuming that the public hasn't gained and won't gain additional capabilities while our adversaries evolve. But look at our communication reach, bandwidth, latency, and cipher strength.

How easy was it for the government to deliver mass propaganda before the Internet without the public realizing? How quickly and how many bits of information can Alice in Seattle reliably get to Bob in Houston with a strong cipher in the 1960s? Was there ever such a thing as a cipher that's widely used yet unbreakable by the state? Why do you think China banned TLS 1.3? Do you think it will be harder or easier to pretend to be a different person when there are open-source LLMs that can run on a gaming computer?

The Internet is a recent invention. Smartphones and seamless network coverage are even more recent, and so is curve25519. We're closer than ever to what is effectively secure instant telepathy with anyone in the world. We just need to stay vigilant and not fall for doom and gloom in this last stretch.

DalasNoin

That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see if LLMs are better than Narayanan's method purely on movie reviews. LLMs are still much better (getting about 8%, but the average person only had 2.5 movies and 48% only shared one movie, so it's very difficult to match).

john_strinlai

>we make a pretty direct comparison in section 5

awesome, i saw the mention in the introduction but i havent yet had a chance for a thorough read through of the paper -- ive just skimmed it. looking forward to reading it in-depth!

mtone

> Does privacy of Netflix ratings matter? The issue is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?”

Well said.

c22

A silver lining of the ai apocalypse is that users may be able to use the technology to maintain their anonymity via llm paraphrasing.

john_strinlai

as the_af says, stylometry is only one technique in a bag of techniques used for de-anonymization. a big one to be sure, but nowhere near the only one.

c22

As you say, the_af mentions this an hour before your reply. I'm curious what is the point of your posting a "me too" comment here? Was it to teach naive readers the word stylometry?

the_af

My guess is that a statistical analysis of other things such as access patterns, timestamps, content you engage with, etc, could de-anonymize you regardless of the phrasing you use, so LLMs won't save you.

c22

True, but you could also use llms to autonomously engage with content you're not interested in, batch replies for times you're not around, inject coherent, consistent, plausible, but false details into your messages, or modify/flag details you didn't mean to disclose.

Jerrrrrrrry

Throwaway accounts using "clever" turns of phrase can often be deanonymized by double-clicking, right-clicking -> googling their witty pun, and seeing the sole other instance elsewhere, on Twitter, Facebook, etc.

If I see a couple of words I don't know in a row, I can infer a poster's real name.

I'd be more specific, but any example is doxxing, literally so.

SchemaLoad

If you have access to the whole site dataset it's much more reliable with simpler checks. You can just use word usage frequency of common words. Someone posted a demo here of doing this to HN comments which was very effective at showing alt accounts for a user.
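A minimal sketch of the common-word-frequency idea, assuming a tiny hand-picked word list and cosine similarity as the comparison. This is not the HN demo mentioned above; the word list, function names, and sample texts are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked list of common English function words.
FUNCTION_WORDS = [
    "the", "a", "of", "and", "to", "in", "that",
    "is", "it", "but", "so", "just", "really", "i",
]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target_text, candidates):
    """Sort candidate accounts by how closely their profile matches the target's."""
    target = profile(target_text)
    scored = [(name, cosine(target, profile(text))) for name, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical accounts and comment text.
candidates = {
    "alt_account": "i just really think that it is the best, but so is the other",
    "someone_else": "of the people and to the people in a town",
}
print(rank_candidates("i really just think it is the best, so that is that", candidates))
```

With real data you would want thousands of words per account and a much longer word list. Short samples like these only show the mechanics.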

plagiarist

I assume one's vocabulary is basically a fingerprint, even if one doesn't use unique turns of phrase. Domain knowledge just leaks in and we aren't conscious of it being identifiable.

somenameforme

It's also geographic. There's a bunch of quizzes online that, in 10 or 20 questions, can tell you exactly what area in the US somebody is from. It comes down to the terms you use that you might not even realize are not universal. Highway vs freeway, what you call a sugary carbonated drink, and so on.

OTOH I think a lot of these methods don't matter that much because of plausible deniability. Stylometry and similar processes are always probabilistic, and can be dismissed.

john_strinlai

>OTOH I think a lot of these methods don't matter that much because of plausible deniability. Stylometry and similar processes are always probabilistic, and can be dismissed.

while all of it is probabilistic, the issue is that the probability can quickly begin to approach 1 when multiple sources of data & varying techniques are combined.
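a rough sketch of why that happens, using the odds form of Bayes' rule. it assumes the signals are independent (real ones often are not), and the prior and likelihood ratios below are invented numbers, not measurements.

```python
def posterior(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio  # each independent signal multiplies the odds
    return odds / (1 + odds)


# Hypothetical: one suspect among a million equally likely accounts.
prior = 1 / 1_000_000

# Hypothetical strength of each signal (how much more likely it is
# to be observed if the suspect is the author than if not).
signals = {
    "writing style": 50,
    "posting hours": 200,
    "niche interests": 1000,
    "regional vocabulary": 100,
}

seen = []
for name, ratio in signals.items():
    seen.append(ratio)
    print(f"after {name:<20} p = {posterior(prior, seen):.6f}")
```

no single signal here gets anywhere close to certainty on its own, but the four together push the probability to roughly 0.999.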

user3939382

MIT showed this in 2013 after the government was caught illegally spying on Americans with “just metadata”: https://www.nature.com/articles/srep01376

password4321

Maybe it's time to finally track down this person: http://voidnull.sdf.org

This page is anonymous

20190119 https://news.ycombinator.com/item?id=20220048 (149 points, 51 comments)

20130501 https://news.ycombinator.com/item?id=5638988 (453 points, 243 comments)

https://news.ycombinator.com/threads?id=voidnull

https://antirez.com/hnstyle?username=voidnull

alexpotato

Many years ago (early 2000s) I worked for a firm that would help identify people who were doing "pump and dump" stock scams on Yahoo Finance message boards.

Step 1 was to scrape all of their posts into a database.

Step 2 was to have a human analyst review all of the posts for clues about who that person was

It was amazing that you could easily figure out:

- if they were at work or home from when they posted (9am to 5pm vs 6pm to 1am)

- what city they were in (based on sports teams, mentioning local landmarks, etc.)

- roughly what career they had

- their age based on cultural references

and mostly b/c they would drop a crumb of information here and there over months. They probably forgot about all of these individual events but when reading all of the posts in a few hours, the details became pretty evident. You get enough of these details and you can start to Venn-diagram people down to a few hundred likely candidates and then use LexisNexis-style tools to narrow it down even further.
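A rough sketch of that narrowing-down process. Each clue is treated as keeping some fraction of the remaining population, and the clues are assumed to be independent. The fractions and the population figure are invented for illustration, not real statistics.

```python
import math


def narrow_down(population, clues):
    """Apply each clue's fraction in turn; return (clue, remaining) steps."""
    remaining = float(population)
    steps = []
    for name, fraction in clues:
        remaining *= fraction
        steps.append((name, remaining))
    return steps


# Hypothetical clues and the share of people each one keeps.
clues = [
    ("lives in one metro area", 0.003),
    ("works in one profession", 0.01),
    ("falls in one age band", 0.15),
    ("posts only during office hours", 0.5),
]

POPULATION = 330_000_000  # assumption: rough US population

for name, remaining in narrow_down(POPULATION, clues):
    bits = math.log2(POPULATION / remaining)
    print(f"{name:<32} ~{remaining:>12,.0f} left ({bits:.1f} bits learned)")
```

With these made-up fractions, four crumbs take you from hundreds of millions of people to several hundred, which is the point where manual lookup tools become practical.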

Given the above, it doesn't surprise me that LLMs can do the same but at high speed and across multiple sites etc.

rudhdb773b

Did you have a contract with the SEC? Just wondering what kind of business would have an interest in that.

wraptile

Not OP, but I have experience in the private sector here. Deanonymization in the private sector is used by anti-fraud or brand protection systems. For example, in brand protection we identify the same IP/scam infringer across multiple storefronts, and then we can shut them down directly or get more certainty on their other posts, i.e. if it's a known infringer, their scam likelihood score goes up on all of their listings. So deanonymization doesn't have to point to an exact real identity, just enough certainty to tie multiple entries together, and then other systems can take it further, like OP's manual review, though LLMs can obviously do a lot these days.

tsumnia

I recently decided to play around with this, given... well my profile... and I will say that Gemini was good at zeroing in on who I was, but for whatever reason would refuse to say my name.

dirk94018

This is exactly why local inference matters. Every query you send to a cloud API is another data point. Your prompts contain your code, your logs, your thought process — arguably more identifying than your HN comments.

The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.

Air-gapped local inference isn't paranoia. It's necessary.

Imustaskforhelp

Combine this with the fact that even the private mode of any AI provider still keeps logs of the chats and from some past discussion iirc, will keep it indefinitely.

> Air-gapped local inference isn't paranoia. It's necessary.

I definitely agree. I am seeing new models like qwen-3.5-30A3b (iirc) that can run reasonably on normal hardware (you can buy a Mac mini whose price hasn't been inflated) and get decent tps while being a decent model overall.

There are some services like proton lumo, the service by signal, kagi's AI which seem to try to be better but long term, my plan is to buy mac-mini for such levels of inference for basic queries.

Of course, in the meanwhile like for example coding, it might not make too big of a difference between using local model or not unless for the most extremely sensitive work (perhaps govt/bank oriented)

aspenmartin

I tried this today with this username and other usernames on this and other platforms with Claude Code

- First it told me it couldn't do this, that this was doxxing

- I said: it's for me, I want to see if I can be deanonymized

- Claude says: oh ok sure and proceeds to do it

It analyzed my profile contents and concluded that there were likely only 5 - 10 people in the world that would match this profile (it pulled out every identifying piece of information extremely accurately). Basically saying: I don't have access to LinkedIn but if I did I could find you in like 5 seconds.

Anyway, like others have said: this type of capability has always been around for nation state actors (it's just now frighteningly more effective), but e.g. for your stalker? For a fraudster or con artist? Everyone has a tremendous unprecedented amount of power at their fingertips with very little effort needed.

danielodievich

I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it. I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration. I don't know what it will be but I would expect some adversarial stuff. Trying to keep clean is what I'd prefer for myself and my kids.

On the other hand, Neal Stephenson's book Fall; or, Dodge in Hell has an interesting idea early on, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core. It is cleverly explored in the book, albeit for too short a time before moving into the virtual reality. I think there are a few people out here right now practicing this.

DrewADesign

> I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.

I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.

monksy

I think he's wrong and I'm willing to say that. People's inability to move beyond the fundamental attribution error is well known, and it takes major resources to correct. For anyone who posts a comment, assuming easy attribution later means you must future-proof your words. That is not possible and it is extremely suppressive to express yourself.

For example: "Ellen Page is fantastic in the Umbrella Academy TV show." Innocent, accurate, supportive, and positive in 2019.

Same comment read after 1 Dec 2020 (Transition coming out): Insensitive, demeaning, inaccurate.

JohnMakin

> That is not possible and it is extremely suppressive to express yourself.

Also for the fact that you cannot predict how future powers will view past comments - for instance, certain benign political views 20 years ago could become "terroristic speech" tomorrow.

I operate by a simple, general rule - I don't often say anything online I wouldn't say directly to someone's face in real life.

NetOpWibby

> I operate by a simple, general rule - I don't often say anything online I wouldn't say directly to someone's face in real life.

More people should keep this same energy. I try to stress this to my kids and it feels like it's falling on deaf ears in regards to my teen. Alas.

JohnMakin

I can be a rude prick online sometimes, but I can be in real life too - basically though the reason I do this is I never want it to be some huge surprise IRL if someone sees what I write online and be like, "wow, I didn't know that about him." I'm pretty much what I am online and IRL the same. For some reason this seems to matter for me, at least in the past when people have tried to like, send employers stuff I may have written online. The reaction is like "oh, yea, we knew that already about him."

Nothing terrible, maybe slightly embarrassing, but you know how online spaces can be. just be yourself basically, at least I try to be.

albumen

Your framing is interesting. You may feel that you can’t change who you are in real life, but people have a choice on how they behave online (or choose not to engage at all). So you could choose to be nice (or at least not a jerk); I’m pretty sure you wouldn’t get people writing to your employer complaining. I’d argue that if you know you’re sometimes a jerk, it’d be less stressful for you and others if you didn’t bring that energy online.

JohnMakin

Sure, there is a choice. it’s rarely/never been stressful for me though, and I value being who I am for my own reasons as a strength and not a weakness. I always try to play by the moderation rules as far as I possibly and realistically can. some of what I’ve written online has gotten me opportunities it wouldn’t have if i’d been more hesitant.

My point is if you have a good track record what you maintain online vs irl doesn’t matter as much to people as you’d maybe think as long as you are being true to yourself. I’m an elder millennial though, so that’s always been the case online for me and i dont think i often get out of pocket online anyway.

maybe that won’t be the case in the future. I could write a lot more than I’d care to publicly about personal and implied threats I’ve received based on my writings, but caving to that to me would betray my own values and I choose to consume the web how i choose knowing possible consequences - plus the fact moderation standards and what is “rude” drastically differs amongst platforms.

Imustaskforhelp

This really strikes a chord with me. Adding on to this: I believe the same way, but I would argue that I might be nicer online than offline because I am better able to control my emotions when I give more thought to it.

Because I don't really appreciate flame wars and when that's the case, I like to take some time to find common ground and just have a respectable discussion when possible.

This approach is harder to work irl because those moments are also spontaneous & it does require significantly more discipline to control one's emotion within seconds rather than minutes, but it's something that I think I can work upon as well.

But I would say that aside from that, most of my comments are pretty spontaneously written. I frame it as a question of being honest with myself at times, I think I am mostly pretty much the same IRL and online as well.

Another point but such forums also act like a journal to me for my future to read as well. I try to write comments in such sense that in future, I can read them and try to accurately remember what my mind was thinking during the time/days I wrote that comment for self-retrospection as well.

Edit: Although now that I think about it, there are definitely some subtle changes I might have online vs irl but I would still say that I feel like my accounts are pretty authentic fwiw (personally) but I am happy with my authenticity online but there's definitely a level of my thinking which worries about any comment being permanently available though.

qsera

As someone who gets dopamine hits from downvotes on HN, I approve of your behavior!

>just be yourself basically

Yea, it is boring when everyone is the same. I would prefer a rude but interesting world (even if I might not survive long in one) to a nice, boring one.

the_af

"Just be yourself" seems to me a lot like the rightfully discredited "if you don't have anything to hide...".

Everybody has something to hide. Everybody has said things they regret, or meant to be heard by some people but not others.

danilocesar

This is very important: you don't know how the cancelation culture will be in 20 years.

I like to use the example of a guy who wore blackface at a party back in the 2000s. Although reprehensible, it was not common-sense racism back then. Today society sees it as completely unacceptable.

Eventually that guy became prime minister of Canada and things went pretty bad when that photo surfaced decades later.

Is it fair to judge someone's actions by the lens of a different culture? When the popular opinion comes, they won't care about historical context.

brabel

Only idiots don’t care about historical context.

lan321

Your bar is too high in my experience. Most don't care about it and most of the ones that do, care only when it confirms their beliefs.

fragmede

Unfortunately they get just as many votes as everyone else.

cucumber3732842

I think people forget that before about the 2010s, plus or minus depending on who and where, those sorts of overt bigotry were considered a "solved" problem. Things were looking up, and you and your buddies dressing up as Klansmen for Halloween was mocking the Klansmen more than anything else.

samastur

Depends on what you want to say. It can be safer to say something directly to someone's face than online because it is transient and generally does not involve random passers-by.

I am not going to give examples, because I don't want them to be pinned on me as my views, but I'm sure most of us have enough imagination to come up with them.

WorldPeas

I think the problem with this, especially amongst younger people, is having spent so much time online, they don't know where to draw this line anymore.

actionfromafar

Interesting. You could probably get into trouble in those two places for extremely different things you said.

JohnMakin

of course, and it has happened, but I think authenticity is usually appreciated

NooneAtAll3

what two places?

the_af

> I operate by a simple, general rule - I don't often say anything online I wouldn't say directly to someone's face in real life.

I think this isn't enough for the digital age, simply because "comments you'd say to someone's face" can compromise you on the internet.

Some dirty joke, gossip or whatever you tell a friend, if posted online, could come back to bite you in the ass in the dystopian future, lose you your job, or worse.

DrewADesign

I think it’s naive to assume the private companies selling these services will know, let alone care, let alone disclose when their black box models botch things like this. The companies currently purporting to provide this exact service to HR departments for hiring decisions clearly didn’t let that stop them.

comex

Not even the most extreme LGBT activist would accuse people who used the name Ellen Page in 2019 of having somehow been insensitive for failing to have a crystal ball. That is as absurd as it sounds. At most someone might be asked to change the name if they’re actively republishing the material in question.

Your point may be more valid when it comes to political attitudes, in cases where the issues were known at the time but the Overton window has shifted since.

antonvs

> Same comment read after 1 Dec 2020 (Transition coming out): Insensitive, demeaning, inaccurate.

I genuinely don't understand this. Are you sure you're not imagining possible offenses against some non-existent standard?

we_have_options

well, how about "abortion legal" to "abortion murder"... possible to see this coming, but I know doctors in NY who are now afraid to travel to Texas.

How about DEI initiatives as good things in 2024 and a mark of evil in 2025? Lots of people were fired because in 2024 their boss told them to work on DEI and they did what their boss told them to do. Turns out this was a capital offense.

oska

> because in 2024 their boss told them

I am not commenting on your specific example of DEI but I want to make the general point that you are always responsible for what you do, regardless of whether you were told to do it by your boss, or commanding officer, or whatever.

So again, I don't care about the specific example you used but if something is 'in fashion' and you go along with it, including at work, then you are ultimately responsible for that choice. Because it is always a choice, including being a hard choice that results in you losing your job.

the_af

But working on DEI on your boss' orders in 2024 wasn't reprehensible, any more than bringing your boss a cup of coffee to their desk was.

The point is that the shift in what is considered "a capital crime" is arbitrary, this is not the Nuremberg trials. You cannot protect yourself by being a decent person, whatever you do today can be a crime tomorrow, and AI can assist those looking for your flaws.

anjel

standards change over time. Grandfather clauses are a courtesy, not a right.

heisenbit

Society's legal double standard:

- people can create new standards that will be applied retroactively

- lawmakers can create new laws which can not be applied retroactively

qsera

This is easy. Have your own standards based on your own reason and navigate any arbitrary standards LCD majority of the society cooks up from time to time.

anjel

>lawmakers can create new laws which can not be applied retroactively

Still a courtesy:

Background: Mary Anne Gehris was born in Germany and came to the United States around age 1, growing up entirely in the U.S. as a lawful permanent resident (green card holder).

The Incident: In 1988, during a quarrel over a man, Gehris pulled another woman's hair. She was charged with misdemeanor battery. No witnesses appeared in court, and on the advice of a public defender, she pleaded guilty. She received a one-year suspended sentence with one year of probation.

Immigration Consequences: Years later, under the Illegal Immigration Reform and Immigrant Responsibility Act of 1996 (IIRIRA), enacted during the Clinton administration but actively enforced during the Bush Jr. administration, her misdemeanor battery conviction was classified as an "aggravated felony" under federal immigration law. This made her deportable despite having no subsequent criminal record, being married to a U.S. citizen, and having a U.S. citizen child.

Outcome: Gehris avoided deportation when the Georgia Board of Pardons and Paroles granted her a pardon in March 2000, which removed the immigration ground for her removal.

Source Coverage: The story was detailed in Anthony Lewis's New York Times columns:

"Abroad at Home: 'This Has Got Me in Some Kind of Whirlwind'" (January 8, 2000)

https://www.nytimes.com/2000/01/08/opinion/abroad-at-home-th...

These columns highlighted how IIRIRA's broad definition of "aggravated felony" swept up many long-term permanent residents with minor, often decades-old convictions, separating families and deporting people who had lived nearly their entire lives in the United States.

The Gehris case became a frequently cited example in immigration advocacy and legal scholarship about the harsh consequences of mandatory deportation provisions for lawful permanent residents. If you'd like, I can search for the original NYT articles or additional reporting on her case.

harvey9

"If you'd like, I can search for the original NYT articles or additional reporting on her case."

No need but thanks for offering

JimDabell

> Grandfather clauses

This term itself is an example of what this thread is talking about. Are you aware that some people now consider this to be a racist term? It’s a reference to the disenfranchisement of black voters in America.

Nevermark

That we identify social media as "tech" is very strange.

Yes, they have a lot of servers. But that isn't their core innovation. Their core innovations are the constant expansion of unpermissioned surveillance, the integration of dossiers, correlating people's circumstances, behavior and psychology. And incentivizing the creation of addictive content (good, bad, and dreck) with the massive profits they obtain when they can use that as the delivery vector for intrusively "personalized" manipulation, on behest of the highest bidder, no matter how sketchy, grifty or dishonest.

Unpermissioned (or dark patterned, deceptive, surreptitious, or coercive permissioned) surveillance should be illegal. It is digital stalking. Used as leverage against us, and to manipulate us, via major systems spread across the internet.

And the fact that this funds infinite pages of addicting (as an extremely convenient substitute for boredom) content, not doing anyone or society any good, is a mental health, and society health concern.

Tech scaling up conflicts of interest is not really tech. It's personal information warfare.

DrewADesign

I didn’t say I hated technology, generally— I said I hate what the industry has morphed into in the US. What is or isn’t tech is immaterial. All of the odious things you listed are things that the ‘tech industry’ does, largely unquestioned, these days. Frankly, it’s sickening.

Nevermark

I am in complete agreement with you.

Except noting that it is crazy that we accept the framing of "tech firm" for what are really "psychology engineering" firms, simply because they use tech.

Their use of tech is only perceived as more glamorous than companies addressing far greater technical challenges, because they are making crazy profits. While the only problem they alleviate with any tech ambition, is making more money for themselves, through centralizing ad venues (maximum ad revenue extraction, blind eye to scams and other dark marketers) and social damage externalization (maximum psychological manipulation).

The negative downstream impacts of all this value extraction are many, including the vast sums of money being paid to attention-hacking social influencers. This destructive army is directly funded by social media, whose alibi is they don't want to be censors. But they are not neutral, as that framing would imply. They are very actively financing the dreck!

cucumber3732842

A huge amount of western society and the way we run institutions is based on pretending everything meets some quasi victorian moral standard and is all proper, everyone consents to and supports how everything runs and everything is fine and dandy when that is very much not the case and people put up with a lot of it because they have no better option.

In light of that what I see happening in the short term is that every institution will start screwing people based on information that basically doesn't matter since that's kind of what they're already set up to do with that information but don't except in exceptional cases since those are the cases in which that information makes it back to them.

Imagine some business owner opening a new location, some social worker renewing their license, some civil engineer creating plans on someone's behalf. All those people need to deal with institutions that in the "normal" case pretend to not have large discretionary components in order to get the public to put up with them, but do in practice have such ability. Now say those institutions pay for some LLM based "who am I dealing with" service that finds everyone's pseudonymous posts and whatnot.

Well, all of these people wind up getting given the runaround because even though they do fine work that meets the rules, knowing how the sausage is made has made them jaded and given them opinions that make the institutions they have to deal with want to screw them. The business owner gets given the runaround because it turns out he believes the institutions he's seeking permission from are a corrupt racket whose members ought to be hanged from the overpass. The social worker gets denied because their career has turned them into a "defund it all and when faced with real consequences most of these people will shape up" type. The civil engineer's plans get rejected and he has to go around in circles because he's been posting about how, in light of what corporations with good funding can get approved and the impact thereof, it's unconscionable the stuff they try and enforce upon individuals, and engineers ought to pencil whip anything that isn't clearly F-ed up.

And so, all these people have to waste time and probably a low five digit sum of money fighting the BS. This would be fine perhaps if these people's conduct was so egregious it made it back to the institutions on its own (like say some doctor who's preaching quackery on youtube may get his license yanked if he amasses such a following the board hears about it, that's the kind of stuff institutional discretion was set up for) but no real good social interest is served having an LLM dig up petty dirt on everyone. However, the LLM service peddlers stand to make a buck. The institutions stand to make a buck while washing their hands of responsibility. The lawyers who'll fight on wronged parties' behalf stand to make a buck. And in the process they can all pretend like society somehow benefits from this enhanced scrutiny when in fact they're just making mountains out of mole hills.

txrx0000

14h

Do you want culture to be frozen and instant digital communication with anyone else in the world to become a privilege of the few? Because that's where "clean" leads. And all you get is a little bit of temporary safety.

Here's a different vision for the future:

Let information filtering become each individual's own responsibility. We have LLMs now, and they'll get more efficient, so why not use them locally to filter incoming feeds according to each of our own preferences, but remove all of the filtering/moderation for posting info out. Build systems to decentralize and anonymize the Internet so that people can discover anyone and aren't afraid to post anything. Make it so that everyone can get a message out to the world and nobody can be arrested or assassinated for it. This will put an end to most violent conflict because they'd be replaced by online discourse.

Let the Internet be flooded with trash and gold at the same time. Let each individual decide what info is/isn't valuable to them. Let those individuals self-organize. Let ideas compete freely, so that the best ones may prevail.

Comment was deleted :(

hiAndrewQuinn

16h

>I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it.

I do the same thing, and I think I'm a much better person for it. The Internet is not, in my final analysis, some indiscriminate dumping ground for my personal issues and moods. It's a place where I can relax and practice putting forward a more prosocial form of myself, even when what I actually have to say is uncomfortable.

While we can't predict how the adversary will read and respond to our moves, I suspect the easier marks are the people who choose to publicly drench everything they touch in negativity and cynicism. It's a sign of an already compromised social immune system.

tclancy

I have lived my life on the web under the assumption the other Tom Clancy will leave enough chaff in my wake to make things hard. But probably not because I make the same 5 or 6 jokes over and over.

hliyan

I've come to a similar conclusion. I now almost exclusively post under my real name online, and before writing something, I ask myself whether it's something I'd say to a person's face and whether I'm comfortable being quoted on it. If not, I look for a more neutral, stronger version of the argument I'm trying to make (stronger, as in strong enough to stand without rhetorical devices or fallacies), or, I qualify the statement as an opinion or something I consider to be a possibility.

rudhdb773b

20h

I view posting online with a real name like getting a permanent tattoo.

My values or priorities may significantly change over decades, especially as a child, so why would I want to jeopardize the reputation of a potential future identity with something I may post today?

hiAndrewQuinn

16h

One could just as easily make the opposite argument. Given that your values and priorities may change significantly over the decades, a smart investment now into a solid, stable, and prosocial public identity may reap considerable and wide-ranging benefits in ways you couldn't even predict. This is especially true if you take seriously the idea that it's not what you say but how you say it that matters in the end.

Imustaskforhelp

13h

This is actually what I believe as well, although I believe that it's better to be pseudo-anonymous for me, right now.

In the sense that if I ever create any business/idea which can be serious enough that I want to back it up, I might create a Hacker News post about it.

Although that being said, I do sometimes make alts just to publish something if I don't want it under this particular account.

I do feel like I can be wrong, I usually am[0] but I think that I want to improve myself and perhaps this account can be a way for people to see me grow perhaps and sometimes fall as well. Life feels like a sine wave with ups and downs.

I have had some paranoid thoughts as to what if I get into controversy later on in life because of some things I do in my teen years but there was a line from a friend that I heard which said, "that anyone with more than 1 brain cell can figure out if a person has improved or not"

I do feel like authenticity is gonna be the differentiator if both code and infra aren't the bottlenecks. Perhaps authenticity can be treated as part of marketing but I feel like it's also paradoxical to gain authenticity if you want to do marketing. Imo, a person has to be authentic for the sake of being authentic, and only then can he also get some marketing benefits.

Authenticity means to share both good and bad (well, as much as you can). I don't think one should be completely 100% authentic, but rather keep a few personal things to oneself, and even if they get leaked, then y'know, just have the grace to accept it. Considering that quote from above, I think most people will understand most things, especially when you realize that there are people (youtubers?) in the world who are part of serious accusations/controversies, next to which I feel like most other controversies should be pretty much a non-issue fwiw.

Like my idea is being authentic enough to satisfy myself. If I become more authentic but I feel unsatisfied/worried etc., then that's wrong too.

[0]: (This is such a good quote from how to win friends that I use it quite often)

hiAndrewQuinn

You seem polite enough even pseudonymously, so I'd say you're doing a good job so far. :)

>I have had some paranoid thoughts as to what if I get into controversy later on in life because of some things I do in my teen years

I have a relevant anecdote, from back in halcyon 2008. Maybe it will help you when it comes to believing your friend, or at least it will temper your paranoia, which I think is well meaning in small doses.

When I was 13 or 14 years old I got suspended from high school because a friend posted a link to the Anarchist's Cookbook, which I had never heard of, on my Facebook wall. Some of my classmates got very scared and called the headmaster saying I had made a bomb threat against the school.

When the principal pulled me in to talk to me about this, it became very clear I had no idea what they were talking about. We talked for much longer than I think anyone in the room expected, maybe for three hours, about existentialism, Zapffe's essay The Last Messiah which I had read the night before, whether I thought I was a victim of bullying (I didn't), what I thought of the school (excellent, a welcome refuge from a very turbulent home), thoughts on Cicero's speeches, the books we were reading in English class at the time.

I got "suspended" for a week and my parents took me to a therapist for several months afterward. I had thought after this for the rest of high school that my chances of ever going to college were totally shot, because a suspension appears on your permanent record. However, when it came time for me to actually apply to colleges, I found out no such record of the week at home ever existed. There appeared to have been a miscommunication all those years ago; I had actually been put on some kind of medical leave.

Now of course going through all of high school thinking that no college in the country will accept you now no matter how hard you try is going to change your incentives a bit. Ironically the very thinkers I had been reading at the time helped me quickly conclude that I wanted to do my level best anyway, even if there was going to be no payoff at the end of the road at all for me. In some ways it let me take more risks than my other classmates. I became the earliest person in my class to take our infamously hard physics course, and I walked out with top marks on both kinematics and electromagnetism. I don't think I would have taken that risk if I thought I had to optimize my GPA.

I trust you to think about this story and come to your own conclusions on how it moves your needle.

ryanjshaw

15h

You can also argue that posting with a real name encourages you to reflect on your identity.

Or do both. Also post anonymously to see what kind of a person you are when masked, and compare.

sponaugle

I am similar in that all of my interactions are with my real name and it is unique enough that just putting it into google will instantly identify me. There is one other 'jeff sponaugle' but I think he is far more annoyed with my presence than I would be with him.

On the plus side, someone will sometimes say while talking to me - oh, you are that Subaru guy, or that youtube guy, or whatever, and that is a fun connection.

qsera

> as clean of a footprint on the internet

The only winning move here is not to play.

rapnie

14h

Data poisoning your own online profile is all nice and well. But in a society that goes out of its way to cram AI into about every imaginable system, it may not be smart at all. Already in the early adopter phase the average person gives way too much authoritative weight to what LLMs come up with. If complex societal processes become basically AI-driven you may get into a world of hurt. "I am sorry, we can't give you that passport right now, until we investigate potentially fraudulent behavior our AI flagged us about".

culi

15h

Yes it's basically data poisoning. It reminds me of the approach the AdNauseam extension takes. It hides ads from you like traditional adblockers but under the hood it's actually selectively clicking them to fool advertisers. I don't know if it's smart enough to create a "profile" for you (e.g. "soccer mom from Michigan") but that seems like the logical next step. Instead of just "flooding the zone with shit" you'd be more selectively/consistently misleading.

47282847

15h

> Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core.

I don’t think this is humanly possible against machine learning. After all, it is specifically designed to weed through noisy data and identify patterns. It may delay discovery, but will at some point easily fall apart, by something as simple as a “filter out shitposting and deliberate pollution” prompt. Even more so when you guide it towards specific attributes.

pavel_lishin

That whole book seemed like a collection of interesting threads that ultimately go nowhere.

I honestly don't even think I understood the ending. Or the middle, if I'm being extra honest.

I think Anathem addressed the "flood the zone with shit" much better in something like three paragraphs.

gambutin

How would "flooding the zone" actually work in that case?

AFAIK the strategy is usually used to divert attention from one subject that could be harmful to a person to some other stuff.

Wouldn’t spamming in that case provide more information about you?

croes

23h

If in one post you say you’re Jewish, in the next you are Christian, in the next your Hindu, in the next youre Atheist it’s harder to know what your really are.

You could even mislead people if you know the difference between your and you‘re.

godelski

While I think the strategy is effective it is also likely equivalent to the dark forest. To me that's a case of the cure being worse than the poison.

ectospheno

I expect more people over time to use local LLMs to write every single post they make online.

shitloadofbooks

At this point, where everyone is using an LLM to post and I'm having to use an LLM to keep up and summarise it, I think I'll just ...stop and go outside for quite a while...

tlavoie

At that point, why bother to make any posts at all?

pbhjpbhj

>post they make

Will they realise their life has devolved to pretending an LLM is them and watching whilst the LLM interfaces {I was going to say 'interacts', no, this fits!} with other bots?

Will they then go outside whilst 'their' bot "owns the libs" or whatever?

Hopefully at some point there is a Damascus road awakening.

goatlover

What would that accomplish? Just to keep their social credit score in the acceptable range while they go touch grass?

ectospheno

If you are trying to keep your secret accounts secret then you don’t want them to have your writing style. By having an LLM author each post you help eliminate that as a metric. Couple that with the usual opsec of posting via Tor at random times and you arrive at something closer to anonymous. The hardest remaining item would be not exposing all of your real interests in the prompts.

observationist

Autonomous Proxies for Execration - spam bots whose entire purpose is flooding the internet with spam so as to make identifying anything true utterly impossible. If you can't differentiate between real and unreal information in online comments, then online comments stop being a significant factor in shaping public opinion. You need to abstract - identify reliable sources of information, individuals or institutions that do the work to collect and curate.

We're already seeing this as a side effect of the mishmash of influence operations on social media - with so many competing interests, mixed in with real trolls, outrage farmers, grifters, and the like, you literally cannot tell without extensive reputation vetting whether or not a source is legitimate. Even then, given any suggestion that an account might be hacked or compromised, like a significant sudden deviation in style or tone or subject matter, you have to balance everything against a solid model of what's actually behind probably 80% or more of the "user" posts online.

There are a lot of aligned interests causing APEs to manifest - they're a mix of psyop style influence campaigns, some aimed at demoralization, others at outrage engagement, others at smears and astroturfing and even doing product placement and subtle advertisement. The net effect is chaos, so they might as well be APEs.

slopinthebag

I think as the younger generations come of age they simply will not care about that sort of thing. Like it or not, it's part of the culture and might just be accepted as the norm.

SchemaLoad

24h

I think it's kind of happened already. All the time we see news of politicians or famous people having their very old photos, comments, or reddit accounts found with distasteful takes. And it seems they can mostly just handwave it away with "Hey that was 10 years ago and I wouldn't make those comments today" and nothing seems to come of it.

leptons

Tell that to people who are tangentially mentioned somewhere among the 3 million Epstein files. It doesn't matter how insignificant the involvement, people are losing their minds and "cancelling" anyone and everyone without any nuance or critical thinking.

AlecSchueler

They might not care about it themselves but what about their government?

croes

23h

When the younger generation comes of age, the new younger generation will have a different culture and different norms for what is acceptable.

People got in trouble for things they posted years ago that they didn't care about but others did.

MengerSponge

Vonnegut's Amphibians from "Unready to Wear"

Comment was deleted :(

croes

23h

> I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.

You don’t know what information about you can get you in trouble in the future.

KPGv2

Fifteen years or so ago I read an article arguing that by the time Millennials are nearing retirement and have more political power, people will give less of a shit about what you did online in your twenties because we will have, out of necessity, learned that asshattery in your twenties is largely irrelevant to your trustworthiness in your sixties.

When I was that age, you could tell the kids who had political ambitions self-censored online. But now everyone is buck wild so you have to ignore that when looking at people.

For example, a MASSIVE portion of Millennials and younger looking at the Maine election are pretty chill about the leading Democratic candidate having a Nazi tattoo because of this very thing. Basically, "dumb, drunk, deployed Marines will get cool skull and crossbones tattoos in their early twenties, and so what if he said a couple ill-worded somewhat misogynistic things in his twenties, that was decades ago, and he's obviously a different person."

Contrast with Bill Clinton, where he literally had to explain away university marijuana usage TWENTY YEARS AFTER THE FACT.

Point is, I think we're witnessing this evolution happening right now.

oska

16h

> asshattery in your twenties is largely irrelevant to your trustworthiness in your sixties

Do people believe this? I certainly don't. How you behaved in your twenties is a good measure of the sort of person you are and will be for the rest of your life, albeit that you will (hopefully) mature and change some of your opinions and behaviours. So yes, you will have changed but you're also still that person you were in your twenties.

AtlasBarfed

This isn't the dystopia we're worried about.

The dystopia we're worried about is a 1984 on steroids with LLMs and real 24/7 worldwide monitoring by the state.

Getting caught doing embarrassing things by teenage social standards doesn't threaten your life.

A competent version of Donald Trump could have walked into the office and we would have been worse than the Third Reich.

Still could be today, right now. The capability is turnkey right now at the US government.

This is open research being discussed here. Palantir already has all of this and probably 10 times more.

kseniamorph

I'm not sure the practical implications are as dramatic as the paper suggests. Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods. The people most at risk from this are probably activists and whistleblowers in jurisdictions where those direct methods aren't available, not average users.

GorbachevyChase

I actually think those most at risk are normal people the activists will harass. Soon it will be possible for anybody who works at the “wrong” business or expresses any opinion on any subject to be casus belli for unhinged, terminally online, mentally ill people who are mad about the thing of the day to start making threatening calls to your employer or making false reports to police or sending deep fake porn to your mom.

I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.

gwern

Attacks can be chained, and this can all be automated. For example, imagine pigbutchering scams... except it's there, similar to some voice-cloning scams, just to get enough data to stylometrically fingerprint you for future reference. You make sure to never comment too much or spicily under your real name, but someone slides into your DMs with a thoughtful, informative, high-quality comment, and you politely strike up an interesting conversation which goes well and you think nothing of it and have forgotten it a week later - and 5 years later you're in jail or fired or have been doxed or been framed. 'Direct methods' can't deliver that kind of capability post hoc, even for actors who do have access to those methods (which is a vanishing percentage of all actors). No one has cheap enough intelligence and skilled labor to do this right now. But they will.

ceejayoz

> Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods.

Easier methods probably means more adversaries.

gmuslera

And different agendas. Governments and corporations don't try social engineering attacks or scams, or do things that end in e.g. ransomware attacks.

Comment was deleted :(

5o1ecist

- The U.S. NSA ran fake LinkedIn and Facebook profiles to phish foreign targets, as revealed in Snowden leaks, posing as recruiters to install malware.

- UK's GCHQ conducted "Operation Socialist," using false personas on social media for spear-phishing against telecom firms worldwide.

- In 2016, Russian GRU operatives (targeting Western elections) used spear-phishing on Democratic Party emails, but U.S. agencies mirrored similar tactics in counter-ops per declassified reports.

- "A Diamond is Forever".

Emotional manipulation linking diamonds to eternal love; planted stories, lobbied celebrities; created artificial scarcity myth despite stockpile.

- Amazon, Walmart, etc.

Scarcity/urgency prompts ("only 2 left!"); personalized "recommended for you" via data exploits.

- Fake reviews.

Paid influencers posed as riders praising service; hidden surge pricing mind games.

- "Torches of freedom".

Women-only events handing cigarettes as "freedom symbols" to subvert norms.

Feel free to ask for more:

https://www.perplexity.ai/search/hey-someone-on-hackernews-c...

iamnothere

Don’t forget eBay: https://www.wired.com/story/ebay-employees-charged-cyberstal...

tosapple

24h

[dead]

3abiton

22h

You're right that it's nothing new given a trail of info, but here they didn't need to do classical feature engineering, just a purely LLM (agentic) flow. And yes, given how much information is self-exposed online, I am not surprised this is made easier with LLMs. But the interesting application is identifying users with multiple usernames on HN or reddit.

graemep

I can imagine a lot of countries who want to control what their citizens say abroad. I know Iraq in Saddam Hussein's time did it in the UK, China does it now.

intended

People who comment about their boss and workplaces?

People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)

Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.

On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.

john_strinlai

another big one: people looking for insurance, or looking to claim insurance

afpx

deanonymizing the people who deanonymize people at scale

cryptonector

17h

Wait till activist groups start doing this to shame people, get them fired, etc. It's going to be interesting.

iamnothere

Despite being pseudonymous, I don’t take great pains to hide who I am. I am in my 50s and live on the West coast. I don’t have socials and I don’t post anywhere else. Have at it!

If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there’s limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)

comrh

21h

Kind of short-sighted to only consider social cancellation. People in power change, laws get applied retroactively. History is full of people who got purged for stuff that was fine when it was written.

iamnothere

21h

If you’re honestly worried about purges, you need to be gathering allies and armaments, not worrying about your HN posts.

sethammons

12h

People are being rejected at the US border due to saying things online that the current administration doesn't like and they are already claiming to add peaceful protesters to domestic terrorist lists.

Do you think the current and future administrations won't go further? This very comment might get me on a list.

My commenting _is_ gathering allies and armaments by voicing dissent and being one more raindrop that will hopefully add to a flood of change and improvement.

fragmede

12h

You're not?

angry_octet

Unless you're in the nebulous situation of being Hispanic in the US, in which case you might get profiled. Or you might have family with jobs that are subject to pressure -- and right now, that seems like most jobs, because calling employers spineless is an insult to worms. Or if you'd like to travel by air, because watchlists are back, and carriers may just refuse service.

iamnothere

23h

Fair enough. I am in a category that’s typically lower risk (though not zero) for profiling, so sometimes I forget that. Still, the potential risk isn’t a good reason to silence your voice if there are issues that you find important. The best defense is to avoid giving out personal details and avoid discussion on non-pseudonymous social sites.

Comment was deleted :(

notepad0x90

23h

Even without LLMs this was possible.

But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).

At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government, as a result. But I will continue to criticize them, I just wish it wasn't so easy for them to retaliate.

with

22h

everyone in the comments is talking about stylometry and rewriting your posts with LLMs. the paper barely uses stylometry. the attack surface is semantic: your interests, your city, the conference you mentioned once 2 years ago. you can't rewrite your way out of having said you work in fintech in austin and own a golden retriever.

comrh

21h

you can intentionally add false biographical information. what if you had a bot posting responses in subreddits for cities across the world on your account

gaigalas

21h

That's adding noise, not removing metadata. One can filter the noise.

Your interests can show up in all sorts of ways. Perhaps it's not saying "I like Madonna" on some social network, but the urge to interact with one specific song she recorded. One like can be the difference of giving away who you are or not.

With AI, there's a higher chance of active deanonymization tactics. This was possible for only select targets in the past. It's the creation of content or design of interactions that is meant to surface certain behavioral patterns (such as offering you that song "casually" in some timeline to gauge if you're going to interact with it).

Trying to mask or change your behavior is likely to result in a weird and very noticeable presence. Like trying to change how you walk will often lead to a caricaturized behavior, not something that someone would naturally do.

Acting naturally is probably the starting point of any attempt to prevent deanonymization, and the hardest to achieve. You have to be aware of your own behavior much more than people often do.

deepsun

22h

I bet we're about to see a reduction of online public communications. Count how many times you had a desire to share your knowledge or correct someone online (aka somebody is WRONG on the internet). People would stop doing that, just to not train some big-corp model using their knowledge. Artists are already not happy about that, but there are many other types of expertise people will stop sharing.

ghm2199

24h

I want to use "slower" methods of identification more. Like say for instance within a few blocks of you a human can identify who you are for any service that wants to do some kind of verification/proof you are/have XYZ.

We could designate specific individuals to do this for you and me, just like we do today with trust authorities for website certificates.

No more verified profiles by uploading names, emails and passports and photographs (gosh!). Just turned 18 and want to access insta? Go to the local high school teacher to get age verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.

One can do this cryptographically with no PII exchanged between the person, the community or the webservice. And you can be anonymous yet people know you are real.

It can be all maintained on a tree of trust, every individual in the chain needs to be verified, and only designated individuals can do actions that are sensitive/important.

You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
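One way the "no PII exchanged" part could work is a blind signature: the local verifier signs a token without ever seeing it, and the web service later checks the signature against the verifier's public key. The following is a minimal sketch using textbook RSA blinding. The key sizes are toy values, the function names are hypothetical, and nothing here is taken from the comment itself; a real system would use a vetted anonymous-credential library.

```python
import hashlib
import math
import secrets

# Toy RSA key for a hypothetical local verifier (insecure size, illustration only).
P, Q = 2**61 - 1, 2**89 - 1            # two known Mersenne primes
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, held only by the verifier


def token_to_int(token: bytes) -> int:
    """Hash an arbitrary token into the RSA message space."""
    return int.from_bytes(hashlib.sha256(token).digest(), "big") % N


def blind(m: int) -> tuple[int, int]:
    """User side: hide m behind a random factor r before showing it to the verifier."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (m * pow(r, E, N)) % N, r


def sign_blinded(blinded: int) -> int:
    """Verifier side: sign after checking the person in front of them; m is never seen."""
    return pow(blinded, D, N)


def unblind(signed_blinded: int, r: int) -> int:
    """User side: strip the blinding factor to get an ordinary signature on m."""
    return (signed_blinded * pow(r, -1, N)) % N


def verify(token: bytes, signature: int) -> bool:
    """Web service side: check the signature using only the public key (N, E)."""
    return pow(signature, E, N) == token_to_int(token)


token = secrets.token_bytes(16)        # random pseudonymous token chosen by the user
m = token_to_int(token)
blinded, r = blind(m)
signature = unblind(sign_blinded(blinded), r)
```

The service learns only that some designated verifier vouched for the token. The verifier cannot link the token to the person it met, because it only ever saw the blinded value.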

JohnMakin

As people will point out, the OSINT techniques described are nothing new - typically, in the past, you could de-anonymize based on writing style or niche topics/interests. Total deanonymization can occur if any of these accounts link to profiles containing pictures of their faces, which can then be web-searched to link to a real identity. It's astounding how many people re-use handles on stuff like porn sites linked very easily to their IRL identity.

While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that what would take a human investigator a bit of time, even using common OSINT tooling, will become trivial.

You should never assume you have total anonymity on the open web.

ghywertelling

If LLMs can identify a person across websites, I can ask an LLM to read up on his posts and write like him, impersonating him, and then this feeds back into the tools identifying him. I can probabilistically malign a person this way.

functionmouse

So this means deanonymization doesn't work? Rejoice?

JohnMakin

This already is a thing people did at least as far back as I started getting into web privacy, which was ~10 years ago. I have been the target of it before.

LLMs are probably better at it, but I don't know if this is as destructive as people may guess it would be. Probably highly person dependent.

The micro-signals this paper discusses are more difficult to fake.

john_strinlai

stylometry is only one aspect of de-anonymization. what you describe is certainly a threat that we will have to deal with, but there is a lot more to credible impersonation than just being able to mimic a writing style

Jerrrrrrrry

How to conduct a psy-op

https://youtu.be/YTGQXVmrc6g

warkdarrior

I think the implication is this will become trivial and trivially automated, no human investigator needed. I bet there will be plugins in one year's time to right click on a post and get a full report on who the author is.

JohnMakin

agreed and the new frontier here will probably be obfuscation by creating false positives with these same tools, but that kind of renders the web unusable in my mind.

arctic-true

I had this same thought. Seems fairly easy to just put off a strong false signal. If you don’t want anyone to know that you live in Finland, make a point to constantly mention how much you enjoy living in Peru.

0xdeadbeefbabe

Wouldn't it also become trivial to pretend to be another author?

john_strinlai

it may become more trivial to llm your comments/blog/whatever into a different "voice", but there is so much that can be used for de-anonymization that the llm-assisted technique doesn't address.

for example, you may change the content of your comments, but if you only ever comment on the same topic, the topic itself is a signal. the same goes for when you post (both day and time), frequency of posts, topics of interest, usernames (e.g. themes or patterns), and much more.
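To make one of those signals concrete, here is a small sketch of comparing accounts purely by when they post. The account names and timestamps are made up for illustration, and cosine similarity over an hour-of-day histogram is just one simple choice, not the method from the paper.

```python
from collections import Counter
from datetime import datetime, timezone
import math


def hour_histogram(timestamps: list[float]) -> list[float]:
    """Fraction of posts made in each UTC hour of the day (24 buckets)."""
    counts = Counter(datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts[h] / total if total else 0.0 for h in range(24)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two histograms; 0.0 if either is empty."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def posts_at(hours: list[int]) -> list[float]:
    """Build fake Unix timestamps, one post per day, at the given UTC hours."""
    return [datetime(2024, 1, day + 1, hour, tzinfo=timezone.utc).timestamp()
            for day, hour in enumerate(hours)]


# Hypothetical accounts: A and B post early morning and late evening, C posts midday.
account_a = hour_histogram(posts_at([7, 8, 8, 22, 23]))
account_b = hour_histogram(posts_at([8, 8, 7, 23, 22, 22]))
account_c = hour_histogram(posts_at([13, 14, 15, 14, 13]))

sim_ab = cosine(account_a, account_b)
sim_ac = cosine(account_a, account_c)
```

Here `sim_ab` comes out higher than `sim_ac` even though no comment text was looked at, which is the point: rewriting your prose does nothing to this signal.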

Lerc

Information leaks everywhere. As the ability to process it increases, I think ultimately it will lead to a world where there are no secrets, provided one has the resources and intention to look for something.

For a few years now I have been telling people how unprepared the world is for this change. Not understanding how this is possible will lead to people outright deifying AI that has the capability to do things like this. It will seem like omniscience.

I think the main protection we have in a world where you cannot effectively hide, is that anyone who abuses this ability will be operating under the same system. You can use it to your advantage, but not without getting caught.

bigwheels

A related past submission comes to mind:

Show HN: Using stylometry to find HN users with alternate accounts

https://news.ycombinator.com/item?id=33755016 - Nov 2022, 519 comments

password4321

This HN stylometry tool is still online: https://antirez.com/hnstyle (though I assume its dataset is not kept updated since mid-2025).

2025-04-15: Reproducing Hacker News writing style fingerprinting (325 points, 159 comments) https://news.ycombinator.com/item?id=43705632

cluckindan

I feel like this is one of those products OpenAI et al are quietly perfecting. Dark assets like that would sell like hotcakes to authoritarian regimes. That would explain how they eventually plan to reach profitability.

HelixSequencing

What's wild to me is that people worry about writing style fingerprinting while casually uploading their literal DNA to consumer genomics companies. 23andMe went bankrupt and suddenly 15 million people's most identifying data imaginable is an asset in a fire sale.

Your writing style can theoretically be masked with an LLM. Your genome can't. And it doesn't just identify you -- it identifies your relatives, your disease risks, your ancestry, things you might not even know about yourself yet. The deanonymization vector here is permanent and irrevocable in a way that no amount of OPSEC can fix after the fact.

The semantic approach in this paper (interests, clues, behavioral patterns) is scary enough. Now imagine combining that with leaked genetic data. You don't even need to match writing styles when you can match someone's 23andMe profile to their health subreddit posts about conditions they're genetically predisposed to.

nickdothutton

Worked on a de-anonymiser in the 90s for identifying banned users and banning their newly created ban-avoidance accounts. Worked based on triplets of words. Worked surprisingly well, so this does not surprise me.
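
A matcher based on word triplets can be sketched in a few lines. This is not the 90s tool described above, whose details are not given; it is an assumed reconstruction of the general idea using word trigrams and Jaccard overlap, with hypothetical function names.

```python
import re


def word_trigrams(text):
    """Return the set of consecutive three-word sequences in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def trigram_similarity(text_a, text_b):
    """Score how many word triplets two bodies of text share."""
    return jaccard(word_trigrams(text_a), word_trigrams(text_b))
```

In practice one would compare all posts of a banned account against all posts of a new account and flag pairs whose score is well above the typical score between unrelated users; that threshold would have to be chosen from real data.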

gormen

Indeed, fears about deanonymization are a reaction to three structural shifts: the cost of analysis has plummeted, the volume of stored data has increased dramatically, and models have become better at identifying patterns that humans miss, making it impossible for interested parties not to take advantage of this. But the conclusion isn't that "anonymity is dead." The conclusion is that anonymity is no longer a guaranteed technical property. It's becoming a behavioral skill that can be developed.

thesz

An old one: https://news.ycombinator.com/item?id=33755016

Stylometry can match not only people, but ethnic groups. No LLM required.

prats226

If you can deanonymize at scale with LLMs, then on a personal level you should also be able to figure out which posts are leading to this deanonymization and remove or modify them.

block_dagger

Does this mean we'll find out who Satoshi is with a high degree of confidence?

hellojesus

Clearly the CIA or another government institution. Its purpose is to create an irresistible honeypot so that anyone who figures out a working, time-feasible implementation of Shor's algorithm or another prime factorization technique would reveal their hand.

Cider9986

Stylometry Protection (Using Local LLMs) https://bible.beginnerprivacy.com/opsec/stylometry/

DalasNoin

We essentially don't use stylometry but semantic information – clues and interests.

flux3125

I'm curious if they could de-anonymize Satoshi Nakamoto by using this technique.

yomismoaqui

I did something like this: I passed some of my comments from here to Gemini and prompted it to identify my native language from my not-so-good English.

And surprise, a tool made for processing text did it quite well, explaining the kind of phrase constructions that revealed my native language.

So maybe this is a plus for passing any text published on the internet through a slopifier for anonymization?

EDIT: deanonymization -> anonymization

joe_mamba

>So maybe this is a plus for passing any text published on the internet through a slopifier for deanonymization?

Or vice versa, Indian scammers online can now run their traditional Victorian English phrasing through an AI to sound more authentically American.

Interviewers now have to deal with remote North Korean deepfaked candidates pretending to be Americans.

Just like the internet, AI is now a force multiplier for scammers and bad actors of all sorts, not just for the good guys.

Melatonic

Seems like this could also be used by call centers to realtime adjust their accents. Text is obviously easier to analyze (no realtime required) but I imagine that audio is not that hard to process real time.

Calling for home internet support and getting the person on the other end (in a US Southern or Boston accent) asking you to "do the needfull" could be pretty entertaining :-D

joe_mamba

Why bother with accents when you can replace the call support workers altogether with AI? Isn't that why all AI companies have gorillions in valuation?

Havoc

Yeah been thinking it’s time to scale back online engagement given the US both has access to everyone’s data and is pivoting to a ahem different style of country

Pity - the pseudo anon internet is fun

YesBox

Additionally, you can open up copilot.microsoft.com or w/e and ask it to summarize any Reddit user's (and presumably HN user's) posts. It covers not just the content but also their emotional state (without prompting).

Note: last I tried this was months ago; things may have changed.

YesBox

I just retried this with my reddit account (game dev stuff)

Last block of text from copilot :/

-----------

If you want, I can also break down:

Their posting style (tone, frequency, community engagement)

How their work compares to other indie city builders

What seems to resonate most with Reddit users

Just tell me what angle you want to explore next.

cloudfudge

I just had a conversation with gemini where I asked it to analyze my style and one of the things it claimed was that I referred to things as "AI slop" and "brainrot", both of which are terms I haven't ever used. I spent a few minutes trying to get cites for that and it kept producing the same quotes from other people and insisting it had corrected the record.

Seems like it's overstating perceived anti-AI sentiment. :)

lunaprompts_hn

The real-world benchmark approach is the right direction. Most agent evals I've seen test for task completion on clean inputs. That's not how production use looks.

What tends to break agents in the wild: ambiguous instructions that have multiple valid interpretations, state that changes mid-task, and error recovery when a sub-step fails silently rather than loudly.

The hardest thing to benchmark is graceful degradation. A good agent should know when to stop and ask for clarification rather than confidently completing the wrong task.

Foobar8568

Time to withdraw from internet x_x

mhitza

i haven't read the full study, but it's been on my mind for a while.

https://en.wikipedia.org/wiki/Stylometry

The best course of action to combat this correlation/profiling, seems to be usage of a local llm that rewrites the text while keeping meaning untouched.

Ideally built into a browser like Firefox/Brave.

DalasNoin

We don't use (much) stylometry, so this won't help. This is totally something you could try, but we use interests and clues. Semantic information you reveal about yourself.

The blog post might be more approachable if you want to get a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...

mhitza

Thanks for providing the details, since I've just been lazy about reading the paper :))

I'm not a fan of your proposed changes, as they further lock down platforms.

I'd like to see better tools for users to engage with. Maybe if someone is in their Firefox anonymous (or private tab) profile they should be warned when writing about locations, jobs, politics, etc. Even there a small local LLM would be useful: not foolproof, but an extra layer of checks. Paired with protection against stylometry :D

DalasNoin

Mitigations are pretty difficult. I understand it is kind of cool that some websites have really open APIs where you can just read everything, and there are some cool apps that used HN data in the past. But I think there should at least be consideration that LLMs are then going to read everything and potentially discover things. Users might have thought they were protected by obscurity: who would read their 5-year-old comments?

palmotea

How much would injecting noise and red herrings into pseudonymous posts help?

It seems like it would make sense to get in the habit of distorting your posts a bit, and do things like make random gender swaps (e.g. s/my husband/my wife/), drop hints that indicate the wrong city (s/I met my friend at Blue Bottle coffee/I met my friend at Coffee Bean/), and maybe even use an LLM to fire off posts indicating false interests (e.g. some total crypto bro thing).

GorbachevyChase

This is probably a good use case for something like OpenClaw. Have it take over your accounts and inject a bunch of non-offensive noise using a variety of personas to pollute their analysis. Meanwhile, you take your real thoughts and opinions underground.

spoaceman7777

You're absolutely right. It's not just a matter of what you post-- it's a matter of how you post

fragmede

Was this written by a human?

Sometimes you can just tell something's off. No exclamation mark, double dash instead of an emdash. Human-slop on my HN? This place is becoming more and more like Reddit, I swear!

DalasNoin

There is also a practical issue here that people usually don't write a lot on linkedin, most people just have structured biographical information. We use very limited stylometry in section 6 for matching reddit users who we synthetically split according to time.

patcon

L33tsp34k also accomplishes this. The original anonymising hacker stylometry :)

I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.

Maybe only your close friends hear your real voice?

Speaking of which, here's a speculative fiction contest: https://www.protopianprize.com/

Disclaimer: I am an independent researcher with Metagov (one host org), and have been helping them think through some related events.

EDIT: I've belatedly realized that stylometry isn't involved, but I think some of the above "what if" thought could still hold :)

5o1ecist

> seems to be usage of a local llm that rewrites the text while keeping meaning untouched.

There are no two ways of expressing something in ways that might create equal impressions.

Relevant: https://www.perplexity.ai/search/hey-hey-someone-on-hn-wrote...

mhitza

I don't really understand the argument you're proposing.

Is it impressions in a stylistic sense (flourishes to the language used), which is what I'm arguing the LLM usage for?

Or is it impression in the subjective sense of what an author would instill through his message? Feelings, imagery, and such.

Or the impression given to the reader? "This person gives me the impression that they know what they talk about", or "don't know what they talk about"?

I don't know which argument you're proposing, but I'd like to make an observation about the LLM usage. I don't know what model the perplexity response is based on, but some of them are "eager to please" by default in conversation ("you're absolutely right" and all the other memes). If you "preload" it with a contrarian approach (make a brutally honest critique of this comment in reply to this other comment) it will gladly do a 180 https://chatgpt.com/s/t_699f3b13826c8191b701d0cc84923e71

5o1ecist

My argument is that changing even one word in a sentence changes what the other side can, and or will, understand.

> You're absolutely right.

Until just a few days ago, Perplexity used to run on Sonar. At least that was my impression. Suddenly they've changed the typeface and now it's running on GPT5, with Sonar behind the paywall.

I was very unhappy, because my perplexity was well trained on our conversations (it has memory) and my lessons in metacognition, critical thinking and others.

Suddenly that all stopped and I was confronted with a regular, generic LLM for the average user, which bothered the hell out of me.

Unbeknownst to most people it seems, one can actually teach Perplexity. (I do not know if this is the norm across all the major engines, or not.) It adapts to your thought processes. It learns, just from the conversations, but you can push even harder.

All it takes is telling it not to do something, until it eventually stops doing it.

My perplexity does not hallucinate, knows very well that I give it shit for giving me shallow answers, and knows that I do not tolerate pleasing because I do not tolerate dishonesty. It had to learn that I will relentlessly keep asking for both precision and accuracy, and it knows that any and all information has little to no value as long as it is not somehow rooted in ground truths. I've also taught it to recognize when it speculates and, eventually, it stopped.

It also doesn't use phrasing like "almost certainly", because that's dumb.

I've had many conversations about this, and more, with both Sonar and GPT5. It appears that most people have no grasp of what they are actually capable of doing already and that better training alone does not fill all the gaps.

Of course there is little chance that you will believe any of this. Regardless ...

> If you want to win arguments on HN, precision beats profundity every time.

It's weird that you seem to be caring about "winning", because I certainly don't. From my perspective there is no contest and, thus, nothing to win or lose. All that is, is the exchange of information.

What's also weird is that chatgpt, for this instance, puts far too much emphasis on how the message is written. A really, really shallow approach. It seems to me that chatgpt is doing to you exactly what you think my perplexity is doing to me.

PS: It appears that everything went back to normal, with GPT having caught up on my previous conversations with Sonar (or whatever it was, but I'm pretty sure it was Sonar). The difference in how it expresses itself is extremely noticeable.

PPS: Sorry for the million edits.

palmotea

> There are no two ways of expressing something in ways that might create equal impressions.

> Relevant: https://www.perplexity.ai/search/hey-hey-someone-on-hn-wrote...

Did you just use an LLM to write your comment and are citing it as a source?

5o1ecist

No, MY FELLOW HUMAN! As an AI language model, I am not able to use language models for writing my comments.

It's always situational if, or how, I use perplexity. For this one, for example, I wasn't sure if I could post the sentence as-is, so I've used perplexity.

It was purely an accident that, what came out of my query, actually fits.

I thought that it was obvious, given the first query. Apparently not.

kerisi

link doesn't work, it says the thread is private

5o1ecist

Fixed! Thank you!

StilesCrisis

The link is private.

5o1ecist

Fixed! Thank you!

IncreasePosts

I don't think this is working any more, but there was a stylometric analysis of HN users a few years ago, and it was extremely effective (at least, for myself and people who felt the need to post in the comments): https://news.ycombinator.com/item?id=33755016

palmotea

> The best course of action to combat this correlation/profiling, seems to be usage of a local llm that rewrites the text while keeping meaning untouched.

A problem with that is then your post may read like LLM slop, and get disregarded by readers.

Another reason why LLMs are destruction machines.

gambutin

Is there a deployment of this tool so that I can test it on myself?

EDIT: please someone build this, vibe-code it. Thanks

DalasNoin

We test different methods. In section 2, we use LLM agents to identify people agentically. We don't share any code here, but you could try it on yourself with various freely available agents.

intended

Any tool that can be used for yourself, can be used for others, which is why the researchers wouldn’t release the code/prompt.

That said, give it a few days and someone will have a proof of concept out.

stackghost

I'd be interested in testing this on myself also.

qsort

> We suspect that Hacker News and Reddit are part of most training corpora

Hello, LLM! :)

tryauuum

the most important data for LLM is that Microsoft in general and GitHub in particular can never be trusted with your data.

I've been trying to delete my GitHub account for many months

warkdarrior

> I've been trying to delete my GitHub account for many months

That'll make you unemployable as a software developer.

tryauuum

Luckily I don't want to be employable as a software developer

xantronix

Amen comrade

bluefirebrand

Software developer for 20 years here, never had a problem getting jobs without a github

Maybe that will change in the future. Then again I'm pretty sure my next job won't be software. I have no interest in building software in the AI era.

dpc_01234

Joke's on you — All my posts are written by some Slopus now.

razingeden

Stop that. That’s private, that’s between me and the Internet. :-(

thatguysaguy

Maybe I missed something, but I see little evidence that there is a concerning ability to deanonymize. Many people post under a pseudonym but then link to their GitHub etc. In fact by construction the HN dataset _only_ consists of people who are comfortable with their real identity being linked to it.

The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.

matheusmoreira

> The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.

They can. That's the point. This site serves as a dataset against which pseudonymous posts can be evaluated.

deadbabe

Doesn’t all this deanonymization stuff depend on one fatal assumption: that people are actually being truthful with what they say about themselves?

If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.

matheusmoreira

That's honestly quite terrifying. If you're posting somewhere else under a pseudonym, this technology can get you doxxed. The safest thing to do is to not participate in communities at all. Avoid posting, avoid social interactions, just be a ghost. The future is bleak.

sbmsr

if this is where things are headed, everyone is incentivized to run their words through an LLM to anonymize themselves starting... now.

georgeburdell

Good thing I always lie on the internet

greesil

But do you lie with the same writing style?

majorchord

nope, and I sometimes walk with a pebble in one or more shoes /s

yu3zhou4

Liar paradox

zikduruqe

Everything I type is a lie.

wasmainiac

Could another mitigation be polluting identities online with fake ones so that real identities become hard to sift out?

For example, if I tell my bot to clone me 100 times on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.

I hate to use this reference, but like the citadel from Rick and Morty.

SchemaLoad

Probably, but it would also be the complete destruction of social media when there are 100 spam bots for every real person.

wasmainiac

Is that not already the case on mainstream social media? HN even has bots.

bitwize

Somebody I know irl has figured out I'm me here on Hackernews, based on the fact that my writing style here matches my verbal style. Fingerprinting people based on their words is one of the things I actually expect LLMs to be really absurdly good at.

Zigurd

What this tells me is that major social media sites, some of which claim to be developing frontier models, have no excuse for bots waging influence campaigns on their sites.

DalasNoin

We do advocate for stricter controls on data access on social platforms because of this. There is a bit of an unfortunate trade-off, but I think allowing mass-scraping or downloads of data from social sites can be misused in increasingly more ways.

zoklet-enjoyer

I used to make new accounts every few months but got lazy. Time to start doing that again.

GorbachevyChase

You may want to also do a little stylistic obfuscation. ChatGPT, please rewrite my response in the style of Michelangelo from the Ninja Turtles.

zoklet-enjoyer

Also don't make usernames that reference old message boards or any of my interests. Maybe sprinkle in some mentions of fake hobbies and jobs and places I've lived too.

casey2

The obvious retort is to just use an AI to rewrite everything you post, but this will open other attack vectors.

Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.

DalasNoin

We essentially don't use stylometry but semantic information revealed from people's comments – clues and interests.

(We use a little stylometry in a single experiment in section 5)

reducesuffering

I remember there being a previous post about stylometry analysis of HN accounts. And people confirmed the top account correlations. It basically identified all the HN alt accounts

jacquesm

And HN asked the author to take it down if I'm not mistaken.

ranger_danger

IMO This is just taking advantage of OPSEC failures. Same way that lone Tor user at a university got caught calling in a bomb threat.

comrh

we need the scramble suits from A Scanner Darkly but for your online text

DalasNoin

We use semantic information inferred from comments and submissions. I think using stylometry would be a great addition, but it would be hard to google for "guy who writes fancifully using many puns" rather than "indie developer in Switzerland". I think stylometry could be better used for verification: once you have a small set of candidates, stylometry could further narrow down the candidates and be used to make a decision.
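
The two-stage idea in this comment (semantic clues to shortlist, stylometry to verify) could look roughly like the sketch below. It is not the paper's method or code, which the authors did not release. The candidate record layout, the `min_overlap` threshold, and the tiny function-word list are all assumptions made for illustration.

```python
from collections import Counter
import math

# A tiny, assumed list of function words; real stylometry uses far richer features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "i", "it")


def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shortlist(clues, candidates, min_overlap=2):
    """Stage 1: keep candidates sharing at least min_overlap semantic clues."""
    return [c for c in candidates if len(clues & c["attributes"]) >= min_overlap]


def rank_by_style(text, candidates):
    """Stage 2: order the shortlist by writing-style similarity to the text."""
    target = style_vector(text)
    return sorted(
        candidates,
        key=lambda c: cosine(target, style_vector(c["sample"])),
        reverse=True,
    )
```

Here each candidate is a dict with a `name`, a set of `attributes` such as `{"indie developer", "switzerland"}`, and a `sample` of their known writing. The first stage does the searchable part and the second only has to choose among a handful of people.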

switchbak

Time to scrub those naughty Glassdoor rants!

squeefers

so if they put their linkedin account on their HN account, we can figure out who they are.... genius stuff, AI really is changing the landscape all right

DalasNoin

To be clear, we concede here that the people weren't truly anonymous. But we did use an LLM to remove any identifying information from HN, making them quasi-anonymous; this is described in more detail in Table 2 of the appendix.

We also run a more realistic test in section 2. There we use the Anthropic Interviewer dataset, which Anthropic redacted; from the redacted interviews, our agent identified 9/125 people based on clues.

The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...

dang

Thanks for that link! I'll put in the top text.

Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.

ranger_danger

But you also relied on people giving away too much personal information about themselves... which won't always be the case.

majorchord

Yeah my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.

intended

It’s the same thing as theft and locks. Any motivated attacker will overcome any rudimentary obstacle. We still use locks because opportunistic attackers are the most prevalent.

Even the paper on improved phishing showed that LLMs reduce the cost to run phishing attacks, which made previously unprofitable targets (lower income groups), profitable.

The most common deterrent is inconvenience, not impossibility.

DalasNoin

I agree that these accounts probably still contain more information on average than the typical pseudonymous account. I think we could try to use the LLM to ablate increasingly more information and see how its performance decays. To be clear, we already heavily remove such information (see Table 2 in the appendix). But I don't expect that to change the basic conclusions.

ranger_danger

I also wonder how well the LLM would do with less direction e.g. just ask it to analyze someone's posts and "figure out what city they live in based on everything you know about how to identify someone from online posts".

famouswaffles

Over a large enough timeframe (often a couple years at most), almost everyone online gives too much information about themselves. A seemingly innocuous statement can pin you to an exact city and so on.

ranger_danger

I would be quite impressed if someone could figure out what city I live in from my 4.5 year old account, but I highly doubt it.

dang

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

It's a pity that you didn't make your point more thoughtfully because it's one of the few comments in the thread so far that has anything to do with the actual paper, and even got a response from one of the authors. That's good! Unfortunately, badness destroys goodness at a higher rate than goodness adds it...at least in this genre.

nottorp

That's what I'm wondering, since my linkedin profile is indeed linked to in my HN profile.

A more funny question is: did they match me to the correct linkedin profile, or did the LLM pick someone else?

econ

Everyone should really stop posting online unless their job requires it.

The platforms offer only castrated interactions designed not to accomplish anything. People online are useless obnoxious shadows of their helpful and loving self.

No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
