RedMarlin Research Labs Blog


A chicken dinner in Las Vegas used to cost less than $2.00, and the usual bet at that time was $2.00, so when you won you had enough for the chicken dinner. Hence “Winner, winner, chicken dinner!” Source: Quora


Winner winner chicken dinner


When gambling started in Vegas, it could only be done at casinos. With the proliferation of the internet, people can now gamble anywhere: at work, at home, or on their phones when they’re on the go. According to Statista US, the online gambling market is worth $52B, which makes it really lucrative for bad actors.

The U.S. Department of Justice issued a decision yesterday declaring all online gambling illegal. This is a big step toward curtailing the online criminal activity perpetrated through gambling sites. In the past, the FBI has reported that such sites are rife with criminal activity such as money laundering, wire fraud, and various other scams.

Just to give you an idea, in the last six months, our AI engine detected tens of thousands of new gambling sites.


Here are a few examples of gambling sites offering casino games online.

[Slideshow: screenshots of gambling sites offering casino games]


Our Prediction: Online gambling sites will see a massive increase

With the DOJ’s decision to declare all online gambling illegal in the US, we predict that a huge underserved market will be created. This market will be taken over by bad actors, who will continue to create such sites by the thousands every month.

How AI can help

We believe that trying to block tens of thousands of gambling sites per month with blacklists would be a losing battle. With deep learning (NLP), AI can analyze the content of a website instead of just looking at its links.

At RedMarlin, we’re already a step ahead with our solution to detect these sites in real time.

If your organization needs to detect gambling websites in real time to protect your employees, RedMarlin’s APIs can help.

In our effort to help all internet users, we also provide a free scanning tool at https://checkphish.ai to detect more than 10 different types of online scams.

The midterm elections are just a few days away, and the scourge of fake news and propaganda sites shows no sign of slowing down. In fact, the misinformation on such sites is becoming far more challenging to detect with Artificial Intelligence based tools.

In order to highlight the problem, here is a quick quiz. Which one(s) of the following news headlines do you think are fake?

“Meghan Blames Trump For Journalist Being Slaughtered, Gets Nasty Surprise”

“Michael Moore Suggests Trump Orchestrated The Caravan In Order To Win The Midterms”

“Mueller Accuses Opponents Of Offering Women Cash To Make ‘False Claims’ — Is Using Power Of Govt. To Go After Them!”

“Stunning Video Capture Migrant Caravan Being Escorted Onto Luxury Busses To Reach US In Time To Interfere With Election Politics”

As it turns out, none of them is true. The first headline is factually incorrect whereas the second, third, and fourth spin other news to change the original meaning of the story.

The problem of fake news has come into the spotlight only in the last couple of years, but it is getting harder for machines to detect with each new iteration. The language is now often a spin on or misrepresentation of true facts rather than being factually incorrect, as it was in the past.

In this blog post, we look at how such websites are created and propagated to the masses and highlight deep learning and unsupervised learning techniques we’ve built at RedMarlin to detect them at scale.

In case you were wondering, here is what the websites with the above headlines look like:

[Slideshow: screenshots of the websites carrying the above headlines]


How new fake news sites get traffic

At RedMarlin, we process links from 400 million+ spam and gray mail each day. As part of our pipeline, we store images and text of each website we scan. This provides us with the flexibility of doing extensive research on large volumes of information on demand.

Email and social media are the main sources of links to fake news sites. We observed several of these sites gaining traffic in the last six months (via redirect links). For example, blackeyepolitics[.]com had as many as 15 other websites redirecting to it, with over 5,000 unique tracking links sent via email in six months.

We also observed a strong correlation between traffic detected by our system and the popularity of the site as ranked on Alexa, showing that these sites are receiving a significant number of visitors and have effective methods of propagating content.

Some common characteristics we see in such sites:

  • There is no information to identify ownership on the site (About Us, Office Address, etc.)
  • The domain’s information in the WHOIS database is privacy-protected via proxy
  • The hosting IP is either behind a reverse proxy (e.g. CloudFlare) or on cloud services (e.g. AWS, Google)

As you can imagine, it becomes hard to establish ownership and, as a result, authenticity of news articles posted on such sites.
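The three characteristics above can be combined into a simple checklist. The sketch below is only an illustration: the `SiteProfile` fields and the `opacity_indicators` function are hypothetical names, and it assumes the underlying facts (WHOIS status, hosting type, presence of ownership pages) have already been collected by some other part of a pipeline.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    # All fields are hypothetical inputs gathered by a crawler or WHOIS lookup.
    has_ownership_info: bool       # About Us page, office address, etc.
    whois_privacy_protected: bool  # WHOIS record hidden behind a proxy
    behind_reverse_proxy: bool     # served through a reverse proxy
    on_cloud_hosting: bool         # hosted on a public cloud provider


def opacity_indicators(site):
    """Return the list of ownership-hiding traits the site exhibits."""
    indicators = []
    if not site.has_ownership_info:
        indicators.append("no ownership information")
    if site.whois_privacy_protected:
        indicators.append("WHOIS privacy-protected")
    if site.behind_reverse_proxy or site.on_cloud_hosting:
        indicators.append("hosting hides the origin")
    return indicators


site = SiteProfile(
    has_ownership_info=False,
    whois_privacy_protected=True,
    behind_reverse_proxy=True,
    on_cloud_hosting=False,
)
found = opacity_indicators(site)
print(found)
```

None of these traits is proof of bad intent on its own, since many legitimate sites share them. The checklist is better treated as a signal that ownership is hard to establish than as a verdict.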

Automated detection methodology

At RedMarlin, we are constantly working on innovative ways to quickly find interesting patterns in the websites we scan. In the run-up to the midterm elections, we’ve been monitoring political websites in our data to find the ones with political and potentially fake content.

The scope of this research required experimentation at a large scale. Here are some numbers to give you an idea:

  • Pre-filtering 400M URLs per day with rules to reduce the size of the dataset to 1M URLs per day.
  • Assigning 100 – 500 abstract topics to generate interesting clusters for review.
  • Running sentiment analysis on the filtered clusters to further narrow down articles with excessive negative content.

Step 1: Pre-filtering

This step involved removing uninteresting spam URLs from the dataset.

Ranking domains by how many URLs they account for, in the spirit of Zipf’s law, turned out to be quite effective at this stage. The domains with the largest URL counts were the usual spam links (primarily fake online pharmacies) and could be safely ignored. In addition, we filtered out other uninteresting links based on keywords in the site title.

After filtering, the dataset shrank drastically, to close to 1M URLs per day.
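As a rough illustration of this pre-filtering step, the sketch below counts URLs per domain, discards the highest-count domains, and then drops pages whose titles contain spam keywords. The keyword set, the `top_n_domains` cutoff, and the sample records are assumptions made for the example, not the rules used in our pipeline.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical title keywords that mark a page as uninteresting spam.
SPAM_TITLE_KEYWORDS = {"pharmacy", "viagra", "pills"}


def prefilter(records, top_n_domains=1):
    """Drop URLs from the most frequent domains and URLs with spammy titles.

    records: list of (url, title) tuples.
    top_n_domains: how many of the highest-count domains to discard.
    """
    domains = [urlparse(url).netloc.lower() for url, _ in records]
    counts = Counter(domains)
    # Rank domains by URL count; the head of this ranking is bulk spam.
    noisy = {domain for domain, _ in counts.most_common(top_n_domains)}

    kept = []
    for (url, title), domain in zip(records, domains):
        if domain in noisy:
            continue
        if set(title.lower().split()) & SPAM_TITLE_KEYWORDS:
            continue
        kept.append((url, title))
    return kept


sample = [
    ("http://cheap-meds.example/a", "Online pharmacy deals"),
    ("http://cheap-meds.example/b", "Online pharmacy deals"),
    ("http://cheap-meds.example/c", "Online pharmacy deals"),
    ("http://news.example/story", "Election news today"),
    ("http://pills.example/x", "Buy pills now"),
]
filtered = prefilter(sample)
print(filtered)
```

In this toy input only the `news.example` URL survives: the high-volume domain is removed by the frequency ranking, and the remaining spam page is removed by the title keywords.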

Step 2: Topic modeling

This step involved taking the URLs from the previous step, extracting the rendered natural language text from the HTML body, and applying topic modeling to it.

Topic modeling is an unsupervised learning technique for extracting abstract topics from a given set of documents. For our task, we used a popular method called Latent Dirichlet Allocation (LDA). This method models each document as a distribution over topics and each topic as a distribution over words, with both distributions drawn from Dirichlet priors.

Before applying LDA, a few important pre-processing tasks are required. They improve the quality of the words used in modeling the distributions. We performed the following tasks on the data:

  • Tokenization and pruning: Break each sentence into tokens (words) and remove tokens that are longer or shorter than certain thresholds. Also remove common stopwords (the, an, a, etc.). This cleans up low-quality and noisy tokens from the data.
  • Lemmatization: Convert different morphological forms of a word to its dictionary form (lemma). For example, the words `good`, `better`, `best` get converted to `good`.
  • Stemming: Strip affixes so that different forms of a word reduce to a common stem. For example, the words `claim`, `claimed`, `claims` are converted to the single token `claim`.
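To make the three tasks concrete, here is a minimal sketch that uses only the Python standard library. The stopword set, the lemma table, the suffix rules, and the length thresholds are all toy assumptions chosen for the example; a real pipeline would rely on a full NLP library for lemmatization and stemming rather than these stand-ins.

```python
import re

# Toy stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "an", "a", "is", "are", "and", "of", "to", "in", "than"}
# Toy lemma table (assumption); a real lemmatizer uses a full dictionary.
LEMMAS = {"better": "good", "best": "good"}
# Toy suffix rules (assumption); a real stemmer has many more rules.
SUFFIXES = ("ed", "s")


def tokenize_and_prune(text, min_len=3, max_len=15):
    """Split text into lowercase word tokens, dropping stopwords and outliers."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and t not in STOPWORDS]


def lemmatize(token):
    """Map a token to its dictionary form when the toy table knows it."""
    return LEMMAS.get(token, token)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [stem(lemmatize(t)) for t in tokenize_and_prune(text)]


tokens = preprocess("The best claims are better than a claimed win")
print(tokens)
```

The order matters in this sketch: lemmatization runs before stemming so that irregular forms such as `better` are mapped through the lemma table before any suffix is stripped.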

We used the popular NLTK library in Python to perform the above tasks. Once we had a clean dataset, we used another popular Python library, gensim, to create the input matrix and perform LDA on it. We tried both bag-of-words and TF-IDF matrices as the input for LDA; of the two, TF-IDF gave us better results.
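For readers unfamiliar with TF-IDF, the sketch below computes one common variant of the weighting by hand, using only the standard library. It is meant to show why TF-IDF can help: tokens that appear in every document receive zero weight, while rarer tokens are emphasized. Library implementations differ in details such as the logarithm base and normalization, so the exact numbers here are illustrative only, and the `tfidf` function and sample documents are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            # Term frequency times inverse document frequency.
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return weighted


docs = [
    ["trump", "vote", "news"],
    ["trump", "data", "market"],
    ["trump", "vote", "democrats"],
]
weights = tfidf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus `trump` occurs in all three documents, so its weight is zero everywhere, while `news` (one document) outweighs `vote` (two documents).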

Topics created

Below are examples of some abstract topics generated from our dataset along with token composition:

  • Topic1: news, fox, trump, politics, entertainment
  • Topic2: democrats, trump, reply, vote, pm
  • Topic3: contribution, actblue, card, gifts, deductible
  • Topic4: information, data, services, business, market

As can be seen above, Topic4 is different from the rest of the topics, which are likely related to politics. Once such topics are assigned to documents, it becomes easy to cluster the documents based on topic IDs.
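A minimal sketch of clustering by topic ID follows. It assumes each document already has a list of (topic ID, probability) pairs from the topic model, and it assigns the document to its highest-probability topic. The site names, probabilities, and the set of "political" topic IDs are invented for the example.

```python
from collections import defaultdict


def cluster_by_topic(doc_topics):
    """Group document IDs by their highest-probability topic.

    doc_topics: {doc_id: [(topic_id, probability), ...]}
    """
    clusters = defaultdict(list)
    for doc_id, topics in doc_topics.items():
        if not topics:
            continue  # no topic assigned, nothing to cluster on
        top_topic, _ = max(topics, key=lambda pair: pair[1])
        clusters[top_topic].append(doc_id)
    return dict(clusters)


doc_topics = {
    "site-a.example": [(1, 0.7), (4, 0.3)],
    "site-b.example": [(2, 0.6), (1, 0.4)],
    "site-c.example": [(4, 0.9), (3, 0.1)],
    "site-d.example": [(1, 0.8), (2, 0.2)],
}
clusters = cluster_by_topic(doc_topics)

# Hypothetical choice: topics 1-3 look political, topic 4 does not.
political_topics = {1, 2, 3}
for_review = [doc for topic in sorted(clusters)
              if topic in political_topics
              for doc in clusters[topic]]
print(for_review)
```

Using only the dominant topic keeps the clusters disjoint. A softer assignment (for example, every topic above a probability threshold) is an equally reasonable design when documents mix several themes.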

The cluster examples shown below were created by clustering based on such election/politics related topics. Note that at this stage we don’t know if the sites are serving fake news. Nevertheless, these clusters are interesting and worth highlighting.

1. Fake Daily Mail websites

This cluster consists of thousands of websites mimicking the Daily Mail news site. We found a total of 4107 websites registered between June 2018 and August 2018 on TLDs .cf, .ga, .ml, .tk and .gq.

[Slideshow: screenshots of fake Daily Mail websites]

2. Fake Wikipedia websites

Similar to the fake Daily Mail websites, we found thousands of fake Wikipedia articles on the same TLDs mentioned above.

[Slideshow: screenshots of fake Wikipedia websites]


Step 3: Sentiment analysis with LSTM

Once we have clusters of interest, we classify the documents in each cluster as having positive or negative sentiment. This helps narrow down potentially bad websites spreading fake news or propaganda, as such sites tend to have overly negative sentiment.

We chose an LSTM for sentiment analysis because of its high accuracy in several NLP-related tasks. We implemented it in Keras and trained it on tweets about the GOP debate held in August 2015.

Here are some of the websites our model categorized with negative sentiment:

[Slideshow: screenshots of websites the model categorized with negative sentiment]

In fact, upon closer inspection, two of the above sites were found to be hosted on one IP along with a few other sites of a similar nature.

What’s next

It is worth mentioning that sentiment analysis provides a good filter to detect sites with excessive negative content. However, more work is required to determine whether the content is truly fake. For this research, we used the reputation of the hosting infrastructure as one detection technique.

When it comes to news, it helps to review several sources to confirm key facts, and we’re continuing to enhance the abilities of our system to detect false claims accurately. We believe a combination of source reputation and fact checking with Natural Language Processing techniques is the right way to go when it comes to ranking fake news sites for “fakeness”.


Domains from this post with deeper insights on CheckPhish

  1. madworldnews.com
  2. patriotpulse.net
  3. renewedright.com
  4. offthewire.com



Equifax, one of the three major credit bureaus in the U.S., disclosed on Sept 7th that it suffered a massive data breach on July 29th, 2017. It reported that an estimated 143 million consumers may be impacted, making it one of the largest breaches in U.S. history.

As security researchers, we’ve been closely monitoring the news since it broke. In this blog post, we share some early domains that look suspicious and are worth monitoring closely. As we typically see after such breaches, bad actors try to exploit the situation for personal gain. Phishing is one such threat that we always expect in the days following a disclosure. Since the scale of the breach is so big and the data at stake is extremely sensitive (SSNs, DOBs, names, etc.), it is extremely important for everyone to stay vigilant against deceptive phishing links that might be trying to steal users’ information.

Equifax’s free credit monitoring: a phishing link that wasn’t

Within a couple of hours of the official announcement on Sept 7th, we started receiving queries on RedMarlin’s free phishing lookup tool CheckPhish for a suspicious-looking link: https://trustedidpremier.com/eligibility/eligibility.html. Our AI engine marked it clean, but we had to dig further because the link had several suspicious characteristics: the domain was registered only a few days earlier, it is hosted on Amazon, its WHOIS information is privacy protected, and the site asks for six digits of the SSN and the last name, as seen in the image below.


Image 1: Equifax’s credit lookup tool to check if you were affected in the breach.

Upon tracing it back, we found a proper chain of links leading to it from https://www.equifax.com/personal, as you can see below. We were relieved to inform users that it wasn’t a phishing attempt on them.

Path from Equifax homepage to the trustedidpremier.com link

Various researchers reported that the site https://www.equifaxsecurity2017.com was being marked as phishing by security providers, which is understandable given the suspicious indicators on that site as well. It was registered recently and saw a massive spike in DNS volume, which likely caused some providers to mark it as phishing. We agree that it is better to be on the safe side and proactively mark something so suspicious as phishing until there is enough evidence to prove otherwise.

In addition to the above, we saw reports on Twitter of the trustedidpremier.com site being blocked by Google Chrome, although that seems to be fixed now.

If you wish to check more details on the above links, CheckPhish has more insights into them:

For trustedidpremier.com: https://checkphish.ai/insights/1504820558046/d472758e4de186bf04c66982fdf97e73bf981e25e0297da81f4f60232207c956

For equifaxsecurity2017.com: https://checkphish.ai/insights/1504845916728/310e17fee782fbf677a575cfa991796eb2e1a189f892a842524e09944be64c33


Image 2: Sample CheckPhish insights page for trustedidpremier.com

At the time of writing this post, at least one engine marked the above two domains as phishing on VirusTotal:

For http://trustedidpremier.com: https://www.virustotal.com/#/url/f301a01db2e921d773b13340eb4883d3fb32733cf822f897a032b6ad15fc400d/detection

For http://equifaxsecurity2017.com: https://www.virustotal.com/#/url/99e3eadc2b4b59115b57016b621a014007434ae03662580f910939d87c764597/detection


What’s in store next?

As mentioned earlier, we expect phishing attempts to go up in the coming days and weeks. In our daily monitoring of newly registered domains, we saw 77 new ones that look very similar to the ones used by Equifax. They were all registered in the last few days. A few examples are below:

equifaxtrustidpremier.org
efxtrustidpremier.net

None of these domains resolves to an IP so far, and their WHOIS is privacy protected. The most plausible theory is that they were registered proactively by incident response teams at Equifax before the bad guys could get hold of them. Full list of the domains here.
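One simple way to flag newly registered domains that resemble a brand is to check for brand keywords and fuzzy string similarity. The sketch below illustrates that idea with Python's standard `difflib` module; it is not the method our monitoring uses, and the `BRAND_LABELS` tuple, the `looks_similar` function, and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Labels of domains known to be used by the brand (taken from this post).
BRAND_LABELS = ("equifax", "trustedidpremier", "equifaxsecurity2017")


def looks_similar(domain, threshold=0.8):
    """Return True if the domain's first label resembles a brand label."""
    label = domain.lower().split(".")[0]
    for brand in BRAND_LABELS:
        # Exact keyword embedded in the label.
        if brand in label:
            return True
        # Fuzzy match for near-misses and abbreviations.
        if SequenceMatcher(None, label, brand).ratio() >= threshold:
            return True
    return False


new_domains = [
    "equifaxtrustidpremier.org",
    "efxtrustidpremier.net",
    "example.com",
]
flagged = [d for d in new_domains if looks_similar(d)]
print(flagged)
```

A match only means the domain deserves a closer look. As the examples in this post show, look-alike domains can be defensive registrations by the brand itself.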

We’re also seeing reports of domain registrations that are deceptively similar to the above, but most of them redirect to the equifaxsecurity2017.com site. Here is a list of 247 such newly registered domains. Most of these domains are registered with Name.com and look different from those in the previous list, which are hosted on Amazon.

So far, we have no evidence that any of the newly registered sites we found is hosting phishing, but that’s not unusual, as it has only been a few days since the breach announcement.

We’ll keep making updates to this blog post as we gather more information on phishing attacks that we find in the following days. Stay vigilant!


Update 1 (2017-09-11): Thanks to the awesome dnstwist tool, we have an un-curated list of several more variants of Equifax domains. Note that this is an exhaustive list that contains both legitimate (Equifax-owned) domains and several other suspicious ones. Please filter at your end. Complete list here.
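To give a feel for how such variant lists come about, here is a small sketch that generates a few classes of typo variants (omission, transposition, repetition) for a domain label. It is a simplified illustration of the general idea and not dnstwist's actual implementation; the function names and the TLD list are assumptions made for the example.

```python
def typo_variants(name):
    """Generate simple look-alike variants of a domain label."""
    variants = set()
    # Omission: drop one character.
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:])
    # Transposition: swap two adjacent characters.
    for i in range(len(name) - 1):
        variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    # Repetition: double one character.
    for i in range(len(name)):
        variants.add(name[:i] + name[i] + name[i:])
    # Never report the original label as a variant of itself.
    variants.discard(name)
    return variants


def candidate_domains(name, tlds=("com", "net", "org")):
    """Combine every variant with every TLD, sorted for stable output."""
    return sorted(f"{v}.{tld}" for v in typo_variants(name) for tld in tlds)


candidates = candidate_domains("equifax")
print(len(candidates), candidates[:5])
```

A generated list like this mixes registered and unregistered names, so each candidate still has to be checked against DNS and WHOIS data before drawing any conclusion.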