RedMarlin Research Labs Blog

The midterm elections are just a few days away, and the scourge of fake news and propaganda sites shows no sign of slowing down. In fact, the misinformation on such sites is becoming far more challenging to detect with tools based on artificial intelligence.

In order to highlight the problem, here is a quick quiz. Which one(s) of the following news headlines do you think are fake?

“Meghan Blames Trump For Journalist Being Slaughtered, Gets Nasty Surprise”

“Michael Moore Suggests Trump Orchestrated The Caravan In Order To Win The Midterms”

“Mueller Accuses Opponents Of Offering Women Cash To Make ‘False Claims’ — Is Using Power Of Govt. To Go After Them!”

“Stunning Video Capture Migrant Caravan Being Escorted Onto Luxury Busses To Reach US In Time To Interfere With Election Politics”

As it turns out, none of them is true. The first headline is factually incorrect whereas the second, third, and fourth spin other news to change the original meaning of the story.

The problem of fake news has come into the spotlight only in the last couple of years, but it is getting harder for machines to detect with each new iteration. The language is now often a spin on, or a misrepresentation of, true facts rather than being factually incorrect, as it was in the past.

In this blog post, we look at how such websites are created and propagated to the masses and highlight deep learning and unsupervised learning techniques we’ve built at RedMarlin to detect them at scale.

In case you were wondering, here is what the websites with the above headlines look like:

[Slideshow: screenshots of the websites carrying the above headlines]

How new fake news sites get traffic

At RedMarlin, we process links from more than 400 million spam and graymail messages each day. As part of our pipeline, we store the images and text of each website we scan. This gives us the flexibility to do extensive research on large volumes of information on demand.

Email and social media are the main sources of links to fake news sites. We observed several of these sites gaining traffic in the last six months (via redirect links). For example, blackeyepolitics[.]com had as many as 15 other websites redirecting to it, with over 5,000 unique tracking links sent via email in six months.

We also observed a strong correlation between traffic detected by our system and the popularity of the site as ranked on Alexa, showing that these sites are receiving a significant number of visitors and have effective methods of propagating content.

Some common characteristics we see in such sites:

  • There is no information to identify ownership on the site (About Us, Office Address, etc.)
  • The domain’s information in the WHOIS database is privacy-protected via proxy
  • The hosting IP is either behind a reverse proxy (e.g. CloudFlare) or on cloud services (e.g. AWS, Google)

As you can imagine, it becomes hard to establish ownership and, as a result, authenticity of news articles posted on such sites.

Automated detection methodology

At RedMarlin, we are constantly working on innovative ways to quickly find interesting patterns in the websites we scan. We’ve been monitoring political websites in our data in the run-up to the midterm elections to find the ones with political and potentially fake content.

The scope of this research required experimentation at large scale. Here are some numbers to give you an idea:

  • Pre-filtering 400M URLs per day with rules to reduce the size of the data set to 1M URLs per day.
  • Assigning 100 to 500 abstract topics to generate interesting clusters for review.
  • Running sentiment analysis on the filtered clusters to further narrow down articles with excessively negative content.

Step 1: Pre-filtering

This step involved removing uninteresting spam URLs from the dataset.

Ranking domains by the number of URLs they contribute, in the spirit of Zipf’s law, turned out to be quite effective at this stage. The domains with the largest counts were the usual spam links (primarily fake online pharmacies) and could be safely ignored. In addition, we filtered out some other uninteresting links based on keywords in the site title.

After filtering, the dataset shrank drastically, to close to 1M URLs per day.
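
The pre-filtering idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `prefilter` function, the `top_n` cutoff, and the keyword list are hypothetical, and the post does not say which rules or keywords were actually used.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword list; the post does not name the title keywords it used.
BORING_TITLE_KEYWORDS = {"pharmacy", "viagra", "casino"}


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def prefilter(records, top_n=1):
    """Drop URLs on the top_n most frequent domains or with a blocked title keyword.

    records is a list of (url, title) pairs.
    """
    counts = Counter(domain_of(url) for url, _ in records)
    # The head of the frequency ranking is treated as bulk spam.
    bulk_domains = {domain for domain, _ in counts.most_common(top_n)}
    kept = []
    for url, title in records:
        if domain_of(url) in bulk_domains:
            continue
        if set(title.lower().split()) & BORING_TITLE_KEYWORDS:
            continue
        kept.append((url, title))
    return kept


# Placeholder records on reserved example domains, not real observations.
records = [
    ("http://cheap-pills.example/a", "Online Pharmacy"),
    ("http://cheap-pills.example/b", "Online Pharmacy"),
    ("http://cheap-pills.example/c", "Best Prices"),
    ("http://news-site.example/story", "Breaking Politics News"),
    ("http://lucky.example/win", "Casino Bonus"),
]
filtered = prefilter(records, top_n=1)
print(filtered)
```

In a real pipeline the cutoff would more likely be a share of total volume than a fixed `top_n`, since the head of a Zipf-like ranking grows with the size of the data.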

Step 2: Topic modeling

This step involved taking the URLs from the previous step, extracting the rendered natural language text from the HTML body, and applying topic modeling to it.

Topic modeling is an unsupervised learning technique for extracting abstract topics from a given set of documents. For our task, we used a popular method called Latent Dirichlet Allocation (LDA). This method models each document as a distribution over topics and each topic as a distribution over words, with both distributions drawn from Dirichlet priors.

Before applying LDA, a few important pre-processing tasks are required. They improve the quality of the words used in modeling the distributions. We performed the following tasks on the data:

  • Tokenization and pruning: Break each sentence into tokens (words) and remove tokens that are longer or shorter than certain thresholds. Also remove common stopwords (the, an, a, etc.). This cleans low-quality and noisy tokens out of the data.
  • Lemmatization: Convert different morphological forms of a word to its dictionary form (lemma). For example, the words `good`, `better`, `best` get converted to `good`.
  • Stemming: Reduce different inflected forms of a word to a common stem. For example, the words `claim`, `claimed`, `claims` are converted to the single token `claim`.
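
To make the three tasks concrete, here is a minimal standard-library sketch. It is not the NLTK pipeline used for this research: the stopword list, the `LEMMAS` lookup, the `crude_stem` suffix stripper, and the length thresholds are all simplified stand-ins chosen for illustration.

```python
import re

# Tiny illustrative stopword list; NLTK ships a much fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that", "was"}
# Irregular forms a lemmatizer would map to a dictionary form (illustrative only).
LEMMAS = {"better": "good", "best": "good"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_len=3, max_len=15):
    """Tokenize, prune by length, drop stopwords, then lemmatize and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    cleaned = []
    for token in tokens:
        if token in STOPWORDS or not (min_len <= len(token) <= max_len):
            continue
        token = LEMMAS.get(token, token)
        cleaned.append(crude_stem(token))
    return cleaned


tokens = preprocess("The senator claimed that the claims in the best report are false")
print(tokens)
# ['senator', 'claim', 'claim', 'good', 'report', 'false']
```

A real stemmer and lemmatizer handle far more cases than this lookup table and suffix list; the sketch only shows where each task sits in the pipeline.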

We used the popular NLTK library in Python to perform the above tasks. With a clean dataset in hand, we used another popular Python library, gensim, to create the input matrix and run LDA on it. We tried both bag-of-words and TF-IDF matrices as the input for LDA; of the two, TF-IDF gave us the better results.
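
For readers unfamiliar with the two input representations, the sketch below computes both by hand for a toy corpus. It uses the textbook weighting tf × log(N / df); gensim's own TF-IDF model may weight and normalize differently, so treat this as an illustration of the idea, not of the exact matrix we fed to LDA.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Raw token counts for one tokenized document."""
    return dict(Counter(doc))


def tfidf(docs):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each token once per document
    return [
        {tok: count * math.log(n_docs / df[tok]) for tok, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy corpus built from tokens that appear in the topics shown in this post.
docs = [
    ["trump", "news", "vote"],
    ["trump", "news", "fox"],
    ["trump", "market", "data"],
]
weights = tfidf(docs)
print(bag_of_words(docs[0]))
print(weights[0])
```

A token that appears in every document (`trump` here) gets a weight of zero, while rarer tokens are weighted up. That is one plausible reason TF-IDF input can yield cleaner topics than raw counts.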

Topics created

Below are examples of some abstract topics generated from our dataset along with token composition:

  • Topic1: news, fox, trump, politics, entertainment
  • Topic2: democrats, trump, reply, vote, pm
  • Topic3: contribution, actblue, card, gifts, deductible
  • Topic4: information, data, services, business, market

As the topics above show, Topic4 differs from the others, which all appear to be related to politics. Once such topics are assigned to documents, it becomes easy to cluster the documents by topic ID.
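
Clustering by topic ID can be as simple as grouping each document under its highest-weight topic. The sketch below assumes each document comes with a list of `(topic_id, weight)` pairs, which is the usual shape of per-document LDA output; the `min_weight` cutoff and the sample data are hypothetical.

```python
from collections import defaultdict


def cluster_by_topic(doc_topics, min_weight=0.5):
    """Group document IDs by their highest-weight topic.

    doc_topics maps a document ID to a list of (topic_id, weight) pairs.
    Documents whose best topic is weaker than min_weight are left out.
    """
    clusters = defaultdict(list)
    for doc_id, topics in doc_topics.items():
        if not topics:
            continue
        topic_id, weight = max(topics, key=lambda pair: pair[1])
        if weight >= min_weight:
            clusters[topic_id].append(doc_id)
    return dict(clusters)


# Placeholder documents and weights, not real model output.
doc_topics = {
    "site-a.example": [(1, 0.7), (4, 0.3)],
    "site-b.example": [(2, 0.6), (1, 0.4)],
    "site-c.example": [(1, 0.9), (3, 0.1)],
    "site-d.example": [(4, 0.45), (2, 0.4), (3, 0.15)],
}
clusters = cluster_by_topic(doc_topics)
print(clusters)
# {1: ['site-a.example', 'site-c.example'], 2: ['site-b.example']}
```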

The cluster examples shown below were created by clustering based on such election/politics related topics. Note that at this stage we don’t know if the sites are serving fake news. Nevertheless, these clusters are interesting and worth highlighting.

1. Fake Daily Mail websites

This cluster consists of thousands of websites mimicking the Daily Mail news site. We found a total of 4,107 websites registered between June 2018 and August 2018 on the TLDs .cf, .ga, .ml, .tk and .gq.

[Slideshow: examples of fake Daily Mail websites]

2. Fake Wikipedia websites

Similar to the fake Daily Mail websites, we found thousands of fake Wikipedia articles on the same TLDs mentioned above.

[Slideshow: examples of fake Wikipedia websites]

Step 3: Sentiment analysis with LSTM

Once we have clusters of interest, we classify documents in each to see whether they have a positive or negative sentiment. This helps narrow down potentially bad websites spreading fake news or propaganda, as such sites tend to have overly-negative sentiment.

We chose an LSTM (long short-term memory) network for sentiment analysis because of its high accuracy on several NLP tasks. We implemented it in Keras and trained it on tweets about the GOP debate held in August 2015.
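
Once every document has a sentiment label, narrowing down to suspicious sites is an aggregation step. The sketch below assumes the classifier has already produced `"negative"` or `"positive"` labels per document; the `flag_sites` helper, the 0.6 threshold, and the sample data are hypothetical and do not reflect the cutoffs used in our system.

```python
def negative_share(labels):
    """Fraction of documents labelled 'negative' (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "negative") / len(labels)


def flag_sites(site_labels, threshold=0.6):
    """Return (site, share) pairs at or above the threshold, highest share first."""
    shares = {site: negative_share(labels) for site, labels in site_labels.items()}
    flagged = [(site, share) for site, share in shares.items() if share >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Placeholder labels standing in for the output of the sentiment model.
site_labels = {
    "site-a.example": ["negative", "negative", "negative", "positive"],
    "site-b.example": ["positive", "positive", "negative", "positive"],
    "site-c.example": ["negative", "negative", "negative", "negative"],
}
flagged = flag_sites(site_labels)
print(flagged)
# [('site-c.example', 1.0), ('site-a.example', 0.75)]
```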

Here are some of the websites our model categorized with negative sentiment:

[Slideshow: websites our model categorized with negative sentiment]

In fact, upon closer inspection, two of the above sites were found to be hosted on the same IP address, along with a few other sites of a similar nature.
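
Spotting sites that share hosting is a simple grouping exercise once domains have been resolved. The sketch below assumes a pre-resolved mapping of domain to IP address; the `shared_hosting` helper and the sample values (reserved documentation addresses) are illustrative only.

```python
from collections import defaultdict


def shared_hosting(domain_ips):
    """Map each IP to the sorted domains on it, keeping only IPs with 2+ domains."""
    by_ip = defaultdict(set)
    for domain, ip in domain_ips.items():
        by_ip[ip].add(domain)
    return {ip: sorted(domains) for ip, domains in by_ip.items() if len(domains) > 1}


# Documentation-range IPs and placeholder domains, not real observations.
domain_ips = {
    "site-a.example": "203.0.113.10",
    "site-b.example": "203.0.113.10",
    "site-c.example": "198.51.100.7",
}
groups = shared_hosting(domain_ips)
print(groups)
# {'203.0.113.10': ['site-a.example', 'site-b.example']}
```

Keep in mind that a shared IP behind a reverse proxy or a large cloud provider says little on its own, so this signal is most useful for small, dedicated hosts.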

What’s next

Sentiment analysis provides a good filter for detecting sites with excessively negative content. However, more work is required to determine whether the content is truly fake. For this research, we used the reputation of the hosting infrastructure as one technique for detection.

When it comes to news, it helps to review several sources to confirm key facts, and we’re continuing to enhance our system’s ability to detect false claims accurately. We believe a combination of source reputation and fact checking with Natural Language Processing techniques is the right way to go when it comes to ranking fake news sites for “fakeness”.

Domains from this post with deeper insights on CheckPhish

  1. madworldnews.com
  2. patriotpulse.net
  3. renewedright.com
  4. offthewire.com

References

  1. https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24
  2. https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
  3. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  4. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

Equifax, one of the three major credit bureaus in the U.S., disclosed on Sept 7th that it had suffered a massive data breach, discovered on July 29th, 2017. It reported that an estimated 143 million consumers may be impacted, making it one of the largest breaches in U.S. history.

As security researchers, we’ve been closely monitoring the news since it broke. In this blog post, we share some early domains that look suspicious and are worth monitoring closely. As we typically see after such breaches, bad actors try to exploit the situation for personal gain. Phishing is one such threat that we always expect in the days following a disclosure. Since the scale of the breach is so big and the data at stake is extremely sensitive (SSNs, dates of birth, names, etc.), it is extremely important for everyone to stay vigilant for deceptive phishing links that might be trying to steal users’ information.

Equifax’s free credit monitoring: a phishing link that wasn’t

Within a couple of hours of the official announcement on Sept 7th, we started receiving queries on CheckPhish, RedMarlin’s free phishing lookup tool, for a suspicious-looking link: https://trustedidpremier.com/eligibility/eligibility.html. Our AI engine marked it clean, but we had to dig further because the link had several suspicious characteristics: the domain was registered only a few days earlier, it is hosted on Amazon, its WHOIS information is privacy-protected, and the site asks for six digits of the SSN and a last name, as seen in the image below.

Image 1: Equifax’s credit lookup tool to check if you were affected in the breach.
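
The indicators that made this link look suspicious can be written down as a simple checklist. The sketch below is a hypothetical illustration, not how the CheckPhish engine works: the `SiteFacts` fields, the 30-day age window, and the sample dates are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SiteFacts:
    registered_on: date
    whois_private: bool
    cloud_hosted: bool
    asks_for_sensitive_data: bool


def suspicious_indicators(facts, today, max_age_days=30):
    """Return the names of the indicators that fire for this site."""
    hits = []
    if (today - facts.registered_on).days <= max_age_days:
        hits.append("recently_registered")
    if facts.whois_private:
        hits.append("whois_privacy")
    if facts.cloud_hosted:
        hits.append("cloud_hosted")
    if facts.asks_for_sensitive_data:
        hits.append("requests_sensitive_data")
    return hits


# Illustrative values only; the registration date is not taken from a WHOIS record.
facts = SiteFacts(
    registered_on=date(2017, 9, 1),
    whois_private=True,
    cloud_hosted=True,
    asks_for_sensitive_data=True,
)
hits = suspicious_indicators(facts, today=date(2017, 9, 7))
print(hits)
```

A site can trip every one of these checks and still be legitimate, so indicators like these are best used to prioritize manual review and not as a verdict.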

Tracing it back, we found a proper chain of links leading to it from https://www.equifax.com/personal, as you can see below. We were relieved to inform users that it wasn’t a phishing attempt.

[Image: path from the Equifax homepage to the trustedidpremier.com link]

Various researchers reported that the site https://www.equifaxsecurity2017.com was being marked as phishing by security providers, which is understandable given the suspicious indicators on that site as well. It was registered recently and saw a massive spike in DNS volume, which likely caused some providers to mark it as phishing. We agree that it is better to be on the safe side and proactively mark something this suspicious as phishing until there is enough evidence to prove otherwise.

In addition to the above, we saw reports on Twitter for the trustedidpremier.com site being blocked by Google Chrome, although it seems to be fixed now.

If you wish to check more details on the above links, CheckPhish has more insights into them:

For trustedidpremier.com: https://checkphish.ai/insights/1504820558046/d472758e4de186bf04c66982fdf97e73bf981e25e0297da81f4f60232207c956

For equifaxsecurity2017.com: https://checkphish.ai/insights/1504845916728/310e17fee782fbf677a575cfa991796eb2e1a189f892a842524e09944be64c33

Image 2: Sample CheckPhish insights page for trustedidpremier.com

At the time of writing this post, at least one engine on VirusTotal marked the above two domains as phishing:

For http://trustedidpremier.com: https://www.virustotal.com/#/url/f301a01db2e921d773b13340eb4883d3fb32733cf822f897a032b6ad15fc400d/detection

For http://equifaxsecurity2017.com: https://www.virustotal.com/#/url/99e3eadc2b4b59115b57016b621a014007434ae03662580f910939d87c764597/detection

What’s in store next?

As mentioned earlier, we expect phishing attempts to go up in the coming days and weeks. In our daily monitoring of newly registered domains, we saw 77 new ones that look very similar to the ones used by Equifax. They were all registered in the last few days. A few examples are below:

trustidentitypremiereefx.net
trustidentitypremiereefx.com
equifaxtrustidpremiere.org
equifaxtrustidpremiere.net
equifaxtrustidpremier.org
efxtrustidpremier.net
efxtrustidpremier.com
efxtrustidentitypremiere.net
efxtrustidentitypremiere.com

None of these domains resolves to an IP so far, and their WHOIS information is privacy-protected. The most plausible theory is that incident response teams at Equifax registered them proactively, before the bad guys could get hold of them. Full list of the domains here.
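
One simple way to surface lookalikes like these from a feed of newly registered domains is to match brand-related tokens. The sketch below is an assumption about how such a filter could look, not a description of our monitoring system; the token list is derived from the example domains above and the `min_hits` cutoff is hypothetical.

```python
# Tokens taken from the example domains above; a real watchlist would be longer.
BRAND_TOKENS = ("equifax", "efx", "trustid", "trustidentity", "premier")


def brand_hits(domain):
    """Return the brand tokens found in the first label of a domain.

    Using only the first label is a simplification that ignores subdomains.
    """
    label = domain.lower().split(".")[0]
    return [token for token in BRAND_TOKENS if token in label]


def lookalikes(domains, min_hits=2):
    """Keep domains whose first label contains at least min_hits brand tokens."""
    return [d for d in domains if len(brand_hits(d)) >= min_hits]


newly_registered = [
    "efxtrustidpremier.com",
    "equifaxtrustidpremiere.org",
    "example.com",
    "premierleague.example",
]
matches = lookalikes(newly_registered)
print(matches)
# ['efxtrustidpremier.com', 'equifaxtrustidpremiere.org']
```

Requiring two or more tokens keeps unrelated names that merely contain a common word such as "premier" out of the results.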

We’re also seeing reports of domain registrations that are deceptively similar to the above, but most of them redirect to the equifaxsecurity2017.com site. Here is a list of 247 such newly registered domains. Most of these domains are registered on Name.com and look different from those in the previous list, which are hosted on Amazon.

So far, we have no evidence that any of the newly registered sites we found is hosting phishing, but that’s not unusual, as it has only been a few days since the breach announcement.

We’ll keep updating this blog post as we gather more information on phishing attacks in the coming days. Stay vigilant!

Update 1 (2017-09-11): Thanks to the awesome dnstwist tool, we have an uncurated list of several more variants of Equifax domains. Note that this is an exhaustive list that contains both legitimate (Equifax-owned) domains and several other suspicious ones. Please filter at your end. Complete list here.
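
To give a feel for how such variant lists are produced, here is a small sketch that generates three classic typo classes: omission, repetition, and transposition. It is a simplified illustration written for this post, not dnstwist's code, and dnstwist itself covers many more permutation types (homoglyphs and bitsquatting among them).

```python
def typo_variants(domain):
    """Generate omission, repetition and transposition variants of the first label."""
    name, _, tld = domain.partition(".")
    variants = set()
    for i in range(len(name)):
        # Omission: drop one character.
        variants.add(name[:i] + name[i + 1:])
        # Repetition: double one character.
        variants.add(name[:i] + name[i] + name[i:])
        # Transposition: swap two adjacent characters.
        if i < len(name) - 1:
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    variants.discard(name)  # swapping identical letters reproduces the original
    variants.discard("")
    return sorted(v + "." + tld for v in variants)


variants = typo_variants("equifax.com")
print(len(variants), variants[:5])
```

Each generated name would then be checked against registration and DNS data, which is where legitimate defensive registrations and suspicious ones end up mixed in the same list.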

My parents were visiting from India, and my mother, who is very keen on learning new things on the internet, wanted to access her bank account online. Having heard about the WannaCry ransomware in the news in India, she wanted to know if it was safe for her to do so. I asked her how she would know that she was visiting the real site. She simply said, “I see a ‘lock’ in the browser.” I was happy and terrified at the same moment: happy because she knew the basics of SSL and TLS, and terrified because we in the security community have been teaching that HTTPS protects against everything.

How an attacker can obtain legitimate SSL certificate for any domain

Thanks to free SSL services, anyone who controls a domain can get a certificate for it with no verification of their identity.

Getting SSL Certificate for Gmail homograph domain gmạil.com


https://www.sslforfree.com/
https://letsencrypt.org/
https://buy.wosign.com/free/

We went ahead and did just that, using Let’s Encrypt for our example. It took less than 5 minutes to go from buying the domain to getting an SSL certificate on Ubuntu 16.04. See the steps below:


wget https://dl.eff.org/certbot-auto
chmod a+x ./certbot-auto
./certbot-auto certonly --standalone -d xn--gmil-6q5a.com

[Image: obtaining the HTTPS certificate with certbot]

We then tested the quality of the certificate with Qualys SSL Labs, and it got a nice B rating.

[Image: Qualys SSL Labs certificate test result]

Finally, you can see that we were able to successfully register the Gmail homograph domain gmạil.com. You can check it out in your browser.

We tested with Firefox, Chrome, Safari, IE and Microsoft Edge (the latest version of each). To our surprise, IE was the only browser that expanded the domain gmạil.com to its Punycode form, ‘xn--gmil-6q5a.com’.
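
You can reproduce the Punycode expansion yourself with Python's built-in `idna` codec. The helper names below are our own for this example; the lookalike character is written as an escape (U+1EA1, a Latin "a" with a dot below) so it cannot be confused with a plain "a" in the source.

```python
def to_punycode(hostname):
    """Encode an internationalized hostname to its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


def is_idn(hostname):
    """True if any label is non-ASCII or already carries the 'xn--' prefix."""
    return any(
        label.startswith("xn--") or not label.isascii()
        for label in hostname.lower().split(".")
    )


lookalike = "gm\u1ea1il.com"  # 'a' with a dot below instead of a plain 'a'
print(to_punycode(lookalike))
# xn--gmil-6q5a.com
print(is_idn(lookalike), is_idn("gmail.com"))
# True False
```

A check like `is_idn` is a cheap first filter for mail gateways and link scanners: a hostname that needs Punycode but imitates a well-known ASCII brand deserves a closer look.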

Trends on phishing attacks that use SSL

At RedMarlin, we are seeing a consistent rise in phishing attacks over HTTPS. Our latest figures show that nearly 10% of all real-world phishing URLs were hosted on HTTPS with a legitimate SSL certificate. In July 2017, we saw a massive surge in phishing sites on HTTPS. This is a worrisome trend because we know users are more likely to click on a phishing URL if it is hosted on HTTPS.

[Image: HTTPS phishing trends]

Ways to identify these scams

As a security community, we should stop telling people that it is OK to trust a website just because they see “https” or a “lock” in the browser. We need to educate them more. Here are a few steps you can take to be sure that you are visiting a trusted website.
1. Do a domain lookup and see who the registrant of the domain is. Is there a real phone number and address? You can use a WHOIS lookup tool, such as the one provided by DomainTools, to determine whether the domain is actually owned by the company and not by an imposter. You can clearly see that the contact information for our site ‘gmạil.com’ is not that of Google.
2. If a URL looks suspicious, you can use either Phishtank or CheckPhish to determine whether the site is phishing.
3. Check who issued the certificate. Is it a trusted authority?
4. Check who the certificate was issued to. It should have the details of the organization you were expecting for the domain.
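
Steps 3 and 4 can also be scripted with Python's standard `ssl` module. The sketch below is one possible approach under assumptions: `fetch_cert` and `summarize` are names invented for this example, and the sample certificate is hand-written in the format returned by `getpeercert()`, not captured from a live site.

```python
import socket
import ssl


def fetch_cert(hostname, port=443, timeout=5.0):
    """Fetch the validated peer certificate as the dict returned by getpeercert()."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()


def name_fields(rdns):
    """Flatten the nested tuples used for 'subject' and 'issuer' into a dict."""
    return {key: value for rdn in rdns for key, value in rdn}


def summarize(cert):
    """Pull out who the certificate was issued to and who issued it."""
    subject = name_fields(cert.get("subject", ()))
    issuer = name_fields(cert.get("issuer", ()))
    return {
        "issued_to_org": subject.get("organizationName"),
        "issued_to_cn": subject.get("commonName"),
        "issued_by": issuer.get("organizationName"),
    }


# Hand-written sample in getpeercert() format, for illustration only.
sample = {
    "subject": ((("commonName", "xn--gmil-6q5a.com"),),),
    "issuer": ((("countryName", "US"),), (("organizationName", "Let's Encrypt"),)),
}
summary = summarize(sample)
print(summary)
```

For a live check you would call `summarize(fetch_cert("example.com"))`. A domain-validated certificate typically carries no organization name in its subject, so a missing `issued_to_org` on a site claiming to be a bank or a large email provider is a reason to look more closely.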

Resources

1. Phishtank – a database of phishing urls
2. CheckPhish – an AI based tool to detect phishing in real time
3. Dnstwist – Domain homograph generator

This is a more technical version of the post on info-sec magazine.