How we handle fake news at Adarga

14 Aug 2019

Anyone even remotely familiar with today's news landscape is probably already sick of hearing the phrase ‘fake news’. It is a very real and ubiquitous problem, but also a distinctly modern phenomenon in how it works - and humanity is still experimenting with ways to fight it.

Handling data and working in analytics always raises difficult questions and challenges. Many are well known and encountered at an everyday level: data privacy, the identifiability of individuals, trust in human and machine predictions, and the ethical issues attached to each.

The primary information source in Adarga's tools is a news stream, so fake news is something we have to consider. How do we deal with it? What is possible? What are current practices? And what makes sense for a customer?

Ingestion of news into our database is automatic, which makes it resilient to two of the biggest fake news threats the average human news consumer faces: fake publications and typosquatting.

The latter term may be unfamiliar: typosquatting means registering web domains whose spelling is close to an established domain's, so that a single missed keystroke (a typo) delivers a visitor into the squatter's hands, for purposes ranging from deception to information theft.

Imagine if “goohle.com” were under the control of a malicious web developer. They could make it look like the real google.com, steal the credentials users enter when trying to sign in to Gmail from there, then redirect them to the real Gmail and leave them none the wiser. Users would probably never realise they hit an adjacent key.
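To make the mechanics concrete, here is a minimal sketch (not Adarga's production code; the trusted list and threshold are illustrative assumptions) of how a near-miss such as “goohle.com” can be caught: measure the Levenshtein edit distance from a candidate domain to each domain on a trusted list, and treat a small, non-zero distance as suspicious.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

TRUSTED = {"google.com", "nytimes.com", "bbc.co.uk"}    # illustrative list

def looks_typosquatted(domain: str, max_distance: int = 2) -> bool:
    """Flag a domain that is close to, but not on, the trusted list."""
    return domain not in TRUSTED and any(
        edit_distance(domain, t) <= max_distance for t in TRUSTED
    )

print(looks_typosquatted("goohle.com"))   # True: one substitution from google.com
print(looks_typosquatted("google.com"))   # False: exact trusted match
```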

Some fake news websites operate the same way. Many emulate real sites such as The New York Times but insert their own fabricated content. Sometimes an entire publication is invented: the “Denver Guardian” reported on a scandal during the 2016 US Presidential Campaign. It sounded believable, but it turns out no such publication exists.

Adarga's automatic ingestion pipeline is not vulnerable to these common, primarily manual errors: web scraping never mistypes established domains, and suspicious, newly created, low-traffic domains do not make it onto the source list for ingestion.
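A triage step along these lines might look like the following sketch, where the vetted allowlist is a hypothetical placeholder: exact matches are ingested, near matches are held for human review as possible typosquats, and everything else is dropped. Python's standard difflib does the fuzzy comparison.

```python
import difflib

ALLOWLIST = ["reuters.com", "apnews.com", "bbc.co.uk"]  # hypothetical vetted sources

def triage_source(domain: str) -> str:
    """Decide what the pipeline should do with a candidate source domain."""
    if domain in ALLOWLIST:
        return "ingest"    # established, vetted source
    if difflib.get_close_matches(domain, ALLOWLIST, n=1, cutoff=0.85):
        return "review"    # suspiciously similar to a vetted source: possible typosquat
    return "drop"          # unknown domains never reach ingestion

for candidate in ["reuters.com", "reuterss.com", "random-blog.net"]:
    print(candidate, "->", triage_source(candidate))
```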

There is a secondary issue, though: even reputable, trusted sources occasionally publish factually incorrect material. This can be a simple mistake, or a case where publishing an official statement or an interview with suspects is necessary despite its dubious truth value, as with the Salisbury attack.

It is not yet possible to reliably differentiate well-constructed fake news from real news on text content alone. There is a large grey area around what exactly counts as fake, and claims are often legally challenged or classified as opinion to dodge the brand of fake news publisher. Machines can do better than flipping a coin, but accuracy at this extremely difficult task is still very low - and we do not claim to have a silver bullet. This is why most current fake news filtering systems judge validity using an aggregated human ranking of sources, though that too has caveats.
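To illustrate that aggregation idea (the rating scale and minimum-rater threshold here are assumptions for the sketch, not a description of any particular system), per-source human ratings might be combined into a trust score that is only used once enough raters have weighed in:

```python
from statistics import mean

# Hypothetical data: each source maps to ratings from human reviewers
# on a 0 (fake) to 1 (trustworthy) scale.
ratings = {
    "example-news.com": [0.9, 0.8, 1.0, 0.85],
    "viral-clicks.net": [0.2, 0.4],
}

MIN_RATERS = 3  # below this, too few opinions for the aggregate to mean much

def trust_score(source: str) -> float | None:
    """Mean human rating, or None when the sample is too small to use."""
    scores = ratings.get(source, [])
    return mean(scores) if len(scores) >= MIN_RATERS else None

print(trust_score("example-news.com"))  # 0.8875
print(trust_score("viral-clicks.net"))  # None: not enough raters yet
```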

Many automatic filtering approaches have run into community outrage once a group's favourite source fell afoul of the rules. Some publications depend heavily on maximising reactions and shares on social networks to remain economically sustainable, and therefore end up producing content designed to go viral, with less regard for accuracy or impartiality.

To get good human filtering, a site needs either a very large group of users willing to rate sources or experts employed for the same task (preferably both), but neither is cheap or easy to manage fairly. Facebook, for example, often faces heavy criticism over its unwillingness to disclose how it selects the groups of users who trial features or influence algorithms such as news feed filtering.

For our customers, some of the above issues are immediately solved: expert analysts will not be outraged by the exclusion of a questionable source, and they can judge the validity of content built around different monetisation models.

As far as source filtering goes, at this level less filtering is actually preferable. Not no filtering: we don't, for example, include Myspace pages unedited since 2006 in our data pipeline. But the fact that something has been published carries information value regardless of whether it is true. If one actor releases multiple contradictory statements on a single topic, it may be in that actor's interest for the topic to be surrounded by confusion and contention - see the Salisbury attack again for reference.
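As a toy sketch of how such a signal could surface for an analyst (the statement records and stance labels below are hypothetical, and stance detection itself is a hard NLP problem outside this post's scope), statements can be grouped by actor and topic, flagging any actor who voices more than one stance:

```python
from collections import defaultdict

# Hypothetical pre-processed records: (actor, topic, stance)
statements = [
    ("actor_a", "incident_x", "denial"),
    ("actor_a", "incident_x", "blame_third_party"),
    ("actor_a", "incident_x", "no_comment"),
    ("actor_b", "incident_x", "denial"),
]

stances = defaultdict(set)
for actor, topic, stance in statements:
    stances[(actor, topic)].add(stance)

# An actor voicing several stances on one topic may be deliberately sowing confusion.
for (actor, topic), taken in stances.items():
    if len(taken) > 1:
        print(f"{actor} has taken {len(taken)} stances on {topic}: {sorted(taken)}")
```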

At Adarga we help minimise the risk and impact of human error through the ingestion pipeline. We also recognise the skill and knowledge of our domain expert users and give them the greatest possible freedom and the most data to draw from. We believe that for complex fields, state-of-the-art accuracy lies in enhancing human insight, not in pure machine learning alone.
