The majority of news published online presents one or more images or videos, which make the news more easily consumed and therefore more attractive to huge audiences. As a consequence, news with catchy multimedia content can spread and go viral extremely quickly. Unfortunately, the availability and sophistication of photo editing software are erasing the line between pristine and manipulated content. Given that images have the power to bias and influence the opinion and behavior of readers, the need for automatic techniques to assess the authenticity of images is evident. This paper aims at detecting images published within online news that have either been maliciously modified or that do not accurately represent the event the news is reporting. The proposed approach combines image forensic algorithms for detecting image tampering with textual analysis as a verifier of images that are misaligned with the textual content. Furthermore, textual analysis can be considered a complementary source of information supporting image forensics techniques when they falsely detect or falsely ignore image tampering due to heavy image postprocessing. The devised method is tested on three datasets. The performance on the first two shows interesting results, with an F1-score generally higher than 75%. The third dataset has an exploratory intent; in fact, although it shows that the methodology is not ready for completely unsupervised scenarios, it makes it possible to investigate problems and controversial cases that might arise in real-world scenarios.
Images and video-audio sequences have traditionally been considered a gold standard of truth, as the process of altering or creating fake content was restricted to researchers and skilled users. With the development of tools and editing software that make the forgery process almost automatic and easy even for nonprofessionals, this is no longer true. Not only did the process of altering digital content become easier in recent years, but so did the process of creating and sharing it. With more than 3 billion users active on social media, it has recently been estimated that 3.2 billion images are shared every day, and 300 hours of video are uploaded to YouTube every minute.
In the case of high-impact events, such as terrorist attacks or natural disasters, the uploaded images and videos become publicly visible and spread within seconds. This phenomenon, in which ordinary men and women are able to document events that were once the domain of professional documentary makers [
This is a serious threat, as it has been proven that visual content can affect public opinion and sentiments [
Examples of fake images of concern of this work.
Miscontextualized image that appeared on the cover of the New York Post
Tampered image that appeared on the Hamevaser front page
Given all the negative consequences that the distribution of harmful and/or fake content can cause, the need for dedicated techniques to preserve the dependability of digital media is evident. To this end, the research community recently proposed a novel task within the MediaEval benchmarking initiative (
Starting from a preliminary version of the approach of detecting fake content in tweets presented at MediaEval 2016 [
Our main contribution is twofold: (i) the development of a methodology for discriminating between real and fake images, consisting of image forensics techniques and textual analysis, and (ii) the collection of realistic datasets dedicated to testing the applicability of the proposed method in unsupervised scenarios.
It has been acknowledged that no single image forensics method works universally, because each method is designed to detect specific traces based on its own assumptions; it is therefore wise to fuse the outputs of multiple forensics techniques [
The rest of this paper is structured as follows. Section
In the literature, there is prominent research on the detection of fake news, especially in social media [
Nevertheless, attempts to verify multimedia content in online news have also been made. In [
In the following, we present a brief overview of the image forensic and textual analysis techniques that have been used in the literature.
Image forensic techniques traditionally employed by journalists [
Therefore, automatic techniques are needed that can assess whether or not a multimedia content is original and identify which regions are most likely to have been modified. Image manipulation is typically classified as either splicing (transferring an object from one image and injecting it into another) or copy-move (copying an object within the same image to a different position). These manipulations normally leave digital traces that forensics methods try to detect. Image retouching, for instance contrast enhancement, edge sharpening, or color filtering, is not considered in this paper, since these modifications do not alter semantic content; techniques targeting such modifications are thus not included in our study.
Since JPEG is one of the most common formats of digital images, vast research has focused on several ways to exploit traces left by the JPEG compression process. For instance, different methods have been proposed to determine whether an image was previously JPEG compressed [
Image manipulations also disrupt the Photo Response Non-Uniformity (PRNU), a sort of camera fingerprint that is supposed to be present in every pristine image. The PRNU can therefore be used as a useful clue to detect image forgeries [
With the advent of deep learning and the amount of available data, image manipulation detection can be addressed through Deep Neural Networks (DNNs). A separate feature extraction task is no longer required, since DNNs perform feature extraction and classification in an end-to-end process. Highly promising results have recently been achieved; for instance [
Nevertheless, many of the aforementioned techniques are not always suitable for real cases, where altered images are strongly processed by the social media sites to which they are uploaded [
Natural language processing techniques exploit search engines, text corpus visualizations, and a variety of applications in order to filter, sort, retrieve, and generally handle text. Such techniques are typically used to tackle the challenging problem of modeling the semantic similarity between text documents. This task relies fundamentally on similarity measures, for which a variety of different approaches have been developed. Some simpler techniques include word-based, keyword-based, and n-gram measures [
Traditionally, text similarity measurements leverage Term Frequency-Inverse Document Frequency (TF-IDF) to model text documents as term frequency vectors. The similarity between text documents is then computed using cosine similarity or Jaccard’s similarity. However, it seems unlikely that many occurrences of a term in a document always carry the same significance as a single occurrence. A common modification is therefore to use the logarithm of the term frequency as the weight [
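The TF-IDF vectorization and cosine comparison described above can be sketched in a few lines of Python (a minimal illustration; the function names and toy tokenized documents are ours, not part of any specific implementation):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents.
    Term frequency is normalized by document length; IDF uses the
    standard log(N / df) weighting."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: in how many docs a term occurs
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two documents about the same event share high-weight terms and therefore score a high cosine similarity, while unrelated documents score near zero.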
A related textual technique is the so-called sentiment analysis, which is used to systematically identify, extract, quantify, and study affective states and subjective information. In general, this technique aims to determine the attitude of a writer with respect to some topic, or the overall contextual polarity or emotional reaction to a document, interaction, or event. Sentiment polarity text classification is a challenging task, as determining the right set of keywords is not trivial, although attempts to determine the sentiment polarity of web pages and news articles have achieved a precision varying between 75% and 95%, depending on the nature of the data [
The method discussed here was developed to discriminate between real and fake images associated with news articles. The proposed approach uses the framework presented in [
Images of concern in this paper are not only those that have been somehow altered, but also images that do not accurately reflect the event described in the news, such as pictures taken at a different time and/or place than the one described, or wrongly depicting other facets of the event.
Given the duality of the discrimination task, we isolate two subproblems and solve them separately before experimenting with different techniques to merge the two methodologies.
The first problem consists of deciding whether any manipulation has been performed on the multimedia content from an image forensics point of view. Three different strategies were applied to tackle this problem, namely, classical image forensics techniques, a method based on statistical features of rich models [
The first devised approach applies each of the classical image forgery detection algorithms listed in Figure
Outline of the algorithm for image tampering detection for classical methods.
Each heatmap generated by the described algorithms is then fed to an algorithm that computes the Region of Interest (ROI) of the map, i.e., the region that is more likely to contain tampering, by dividing the image in blocks and finding the one with the maximum variation.
Meaningful statistics (e.g., mean, variance, minima and maxima) are then extracted for that region. These values are finally combined to generate a
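The ROI search and statistics extraction could look roughly as follows (a sketch assuming the heatmap is a 2-D NumPy array at least one block in size; the block size and the particular statistics kept are our illustrative choices):

```python
import numpy as np

def roi_features(heatmap, block=32):
    """Scan a tampering heatmap with non-overlapping blocks, keep the
    block with the highest variance (the candidate tampered region),
    and summarize it with simple statistics."""
    h, w = heatmap.shape
    best, best_var = None, -1.0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = heatmap[y:y + block, x:x + block]
            v = patch.var()
            if v > best_var:
                best_var, best = v, patch
    # mean, variance, minimum, and maximum of the most suspicious block
    return np.array([best.mean(), best.var(), best.min(), best.max()])
```

The resulting fixed-length vector can then be concatenated with features from the other detectors and fed to a classifier.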
The second approach to identify tampered images is an adaptation of Splicebuster [
Splicebuster works by extracting local features related to the cooccurrences of quantized high-pass residuals of the image. These features are then modeled under two classes, i.e., pristine and tampered, by Expectation-Maximization. The final result is a probability map indicating the likelihood of each pixel under the pristine model. However, we do not have a ground truth indicating which area of an image is tampered, and our aim is simply to provide a “yes/no” answer with an associated probability that can then be combined with textual analysis. Therefore, we devised the methodology in Figure
Outline of the algorithm for image tampering detection for Splicebuster-based method.
The third approach to identify tampered images is based on CNNs, since they have recently been proven extremely effective at solving this type of problem. However, CNNs generally require large labeled datasets for training and are thus hard to apply in our particular case. Therefore, we leverage the pretrained network in [
Outline of the algorithm for image tampering detection for CNN based method.
The second problem that we analyze in this paper is how to understand whether an image is coherent with the topic described in the text of the article in which it is inserted.
The approach chosen to tackle this problem extracts meaningful values from texts associated with the image under test. As can be seen in Figure , these values come both from texts extracted from the news articles related to the event supposedly depicted by the image and from texts retrieved online using each image connected to an event as a pivot.
Outline of the algorithm for textual analysis.
The former texts are extracted from manually retrieved news articles, which are meant to contain all the words describing the event at stake. By comparing these words with the ones extracted from texts automatically retrieved using the image as a pivot, we should be able to detect discrepancies between the event in the news and the story the image is truthfully telling.
To retrieve text by image, we adopt Google Reverse Search. This search engine allows retrieving all online resources that are supposed to contain a given image. Therefore, if the image was taken before the event to which it is associated, it is likely that articles or resources connected to the first appearance of the image will be collected. Similarly, for tampered images it might be possible to pinpoint pages stating that the image is suspicious.
After retrieving all the texts, we proceed to textual feature extraction. First of all, the texts associated with the image or retrieved through Google Reverse Search are analyzed to extract the most important words using either the TF-IDF or STF-IDF technique, or a simple counter.
TF-IDF (Equation (
A common improvement to the previous technique is STF-IDF (sublinear term frequency-inverse document frequency), which assigns a weight given by (
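Under the standard STF-IDF definition, the sublinear weight is 1 + log(tf) for a positive term frequency and 0 otherwise. A minimal sketch (function names are ours):

```python
import math

def sublinear_tf(tf):
    """Sublinear term-frequency weight: 1 + log(tf) for tf > 0, else 0.
    Dampens the contribution of very frequent terms."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def stf_idf(tf, df, n_docs):
    """STF-IDF weight for a term with raw frequency tf that appears
    in df of the n_docs documents."""
    return sublinear_tf(tf) * math.log(n_docs / df)
```

Note that a term occurring ten times gets a weight of roughly 3.3 rather than 10, reflecting the intuition that repeated occurrences carry diminishing significance.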
In this task, we also consider using a simple counter instead of the two commonly used techniques. This was done to evaluate the performance of a rather naïve technique in comparison with more sophisticated ones.
The result of this step, irrespective of which of the three described techniques is used, is a vector of words and word frequencies, as the number of occurrences is normalized by the total number of words. Part of this list of frequencies is used to form the final vector used for classification, to which similarity and possibly sentiment analysis are concatenated.
The similarity is computed by either Cosine or Jaccard’s similarity between frequency vectors (
Cosine and Jaccard’s similarity can be computed either for the whole vector or only for a subset of the vector, e.g., the top 100 highly rated words.
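Jaccard’s similarity and the restriction to the top-rated words might be sketched as follows (an illustration over frequency dicts; the function names and the top-k cutoff are ours):

```python
def jaccard_similarity(u, v):
    """Jaccard similarity between the word sets of two frequency dicts:
    |intersection| / |union| of the vocabularies."""
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0

def top_k(freqs, k=100):
    """Keep only the k highest-weighted words of a frequency dict."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])
```

Restricting both vectors to their top 100 words before computing the similarity focuses the comparison on the most characteristic vocabulary of each text collection.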
Finally, basic sentiment analysis techniques are used to analyze people’s reaction to the image, which can possibly imply that the image is fake. This is done by analyzing documents retrieved for each image to detect keywords that highlight the feelings toward that image. The computation of sentiment analysis allows extracting, for each text folder associated with each image belonging to an event, three measures: (i) the number of positive words in the text; (ii) the number of the negative words in the text; and (iii) the number of words that are likely to be associated with fake images.
The first two measures are computed by comparing the image’s vector of words to a list of words for positive and negative sentiments proposed in [
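The three sentiment counts can be sketched as simple lexicon lookups (the tiny word lists below are illustrative placeholders for the much larger lexicons used in practice):

```python
# Illustrative lexicons; real sentiment word lists contain thousands of entries.
POSITIVE = {"great", "good", "amazing", "beautiful"}
NEGATIVE = {"bad", "horrible", "tragic", "awful"}
FAKE_CUES = {"fake", "hoax", "photoshopped", "debunked", "doctored"}

def sentiment_counts(words):
    """Count positive, negative, and fake-indicative words in a token list."""
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    fake = sum(w in FAKE_CUES for w in words)
    return pos, neg, fake
```

The three counts are then appended to the textual feature vector alongside the frequency and similarity values.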
We conducted a series of tests to determine which combinations of the aforementioned features are worth investigating. During this phase, seven feature sets (FS1 to FS7) were identified. Details on how these feature sets were computed can be seen in Table
Best performing feature sets for textual analysis.
Feature set | Relevant words extraction | Number of relevant words | Similarity measure | Sentiment analysis | Top frequencies |
---|---|---|---|---|---|
FS1 | TF-IDF | All | Cosine | ||
FS2 | TF-IDF | All | Cosine | X | X |
FS3 | STF-IDF | All | Cosine | X | X |
FS4 | STF-IDF | Top 100 | Cosine | X | X |
FS5 | Counter | Top 100 | Cosine | X | |
FS6 | TF-IDF | Top 100 | Jaccard’s | X | X |
In FS3, for instance, STF-IDF scores of all the words are extracted from the retrieved text to form the vector
Sections
First of all, we train two classifiers separately, using image forensics features and textual features, respectively. The probability outputs of the two classifiers are as follows:
These two probabilities are assigned a weight (
Algorithm for image forensics and textual analysis features combination.
The values of
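A sketch of the weighted combination of the two classifiers’ probabilities (the variable names and the 0.5 decision threshold are our assumptions; w is the weight given to the image forensics classifier):

```python
def fuse(p_forensics, p_textual, w):
    """Weighted combination of the two classifiers' fake-probabilities.
    w in [0, 1] is the weight given to the image forensics classifier."""
    return w * p_forensics + (1.0 - w) * p_textual

def predict(p_forensics, p_textual, w, threshold=0.5):
    """Label an image as fake when the fused probability exceeds the threshold."""
    return fuse(p_forensics, p_textual, w) > threshold
```

Sweeping w from 0 (textual features only) to 1 (forensics features only) yields the weight curves discussed in the results.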
In this section we present the results obtained for each of the three datasets described in the following. The evaluation is performed in terms of Precision, Recall, and F1-score (see Equation (
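For reference, the three metrics over binary fake/real labels can be computed as follows (a standard implementation, not tied to the paper’s code; label 1 denotes fake):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = fake)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```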
The approaches discussed in Section
In general, as can be seen in detail in Table
Statistics of the collected datasets.
Dataset | Events | Articles | Images | Real images | Fake images |
---|---|---|---|---|---|
MediaEval2016 (training) | 15 | 170 | 380 | 186 | 194 |
MediaEval2016 (test) | 25 | 91 | 98 | 50 | 48 |
BuzzFeedNews | 6 | 56 | 49 | 31 | 18 |
CrawlerNews | 13 | 130 | 246 | 223 | 23 |
A second dataset,
Finally, a third dataset was created to investigate the performance and the weaknesses of the devised approach when applied to an unsupervised real case. This dataset, from now on referred to as
Outline of the approach used to create the CrawlerNews dataset.
To create this dataset, crawls are performed on Google News, a platform that provides useful and timely news in an aggregated fashion, allowing content to be reached from many sources simultaneously. The rationale for choosing Google News is that many people nowadays turn to such platforms to retrieve information on breaking events. A survey conducted by the research firm Outsell in 2010 revealed that 57% of users turn to digital sources to read news; among these, more than half of consumers are more likely to turn to an aggregator than to a newspaper site or other sites [
Since our aim is to analyze news related to high-impact events, we designed a framework that filters news using five crawlers for five versions of Google News, namely the Australian, Irish, British, American, and Canadian ones. These crawlers are responsible for retrieving articles (all in English) assigned to the Top Stories section of the corresponding version of Google News. To group news related to the same event, we use a similarity threshold on the words in the titles.
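The title-based grouping could be approximated with a greedy word-overlap threshold (a hypothetical sketch; the paper does not specify the exact similarity measure, so Jaccard similarity over title words is our assumption):

```python
def group_by_title(titles, threshold=0.5):
    """Greedy grouping of news titles: a title joins the first group whose
    representative shares enough words (Jaccard similarity >= threshold),
    otherwise it starts a new group."""
    groups = []
    for title in titles:
        words = set(title.lower().split())
        for group in groups:
            rep = set(group[0].lower().split())
            if len(words & rep) / len(words | rep) >= threshold:
                group.append(title)
                break
        else:
            groups.append([title])
    return groups
```

Events reported by all five national crawlers can then be kept as high-impact candidates.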
Given the news grouped by event extracted by each national crawler, high-impact events were identified by assuming that such events are reported worldwide and thus belong to the intersection of the news acquired by the single crawlers. After a cleaning phase that removes duplicated or broken news links, the text of the articles is extracted. Likewise, the news page is parsed to detect and extract only images that are relevant to the article, and not advertisements, logos, or images related to other suggested news. This is done by analyzing the position of the image within the text through the
In the following sections, results obtained for the three datasets are presented.
This dataset was decisive for evaluating the general performance of the devised approach, as well as for running preliminary tests that verified whether the image forensic features were appropriate and which textual features performed best.
For image forensic features, the three methods also used in [
Results obtained for forensic features on the
This new ground truth was also used to evaluate the performance of the methods based on Splicebuster and CNN. As can be seen in Figure
The similar trend of the three methods is probably caused by two factors. Firstly, Splicebuster and the CNN were originally designed to provide a tampering map, and making a prediction based on the tampering map is still an open problem. We reduce this problem to local feature extraction on the suspected region, which probably discards global information from the tampering map. Secondly, results are strongly affected by the quality of online images, which are subject to strong compression and low resolution that might prevent the algorithms from finding tampering traces.
Various tests were also run for textual analysis by combining different textual features. In general, with the analysis of text it was possible to reach an F1-score higher than 70% in most cases, which suggests that this type of feature might be more suitable for detecting fake images.
The results obtained for some of the best performing sets of textual features (listed in Table
Results obtained with classical image forensics features combined with different sets of textual features. Results are reported for different weights for
Random Forest, as can be seen, starts to produce results better than 70% in terms of F1-score from the point at which 70% of the weight is given to the image forensic features and 30% to the textual features. The F1-score then keeps rising until the
For logistic regression, the F1-score tends to rise more slowly for most of the textual feature sets. As can be seen in Figure
Similar observations can be made for the combination of textual features with Splicebuster and CNN based image forensics. In fact, the trend of the curves is analogous to the ones in Figure
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 64% | 65% | 66% |
Textual Analysis | 73% | 73% | 73% |
Combination 40%-60% | 74% | 74% | 73% |
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 68% | 68% | 68% |
Textual Analysis | 73% | 73% | 73% |
Combination 10%-90% | 75% | 75% | 74% |
The obtained results suggest that, although textual analysis is more suitable for discriminating the authenticity of images, its combination with image forensics features is in general able to outperform their disjoint usage.
The main purpose of tests run on
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 75% | 70% | 73% |
Textual Analysis | 64% | 64% | 64% |
Combination 40%-60% | 75% | 66% | 75% |
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 77% | 76% | 78% |
Textual Analysis | 84% | 84% | 84% |
Combination 10%-90% | 86% | 87% | 86% |
Even though the feature sets did not behave exactly the same over the two datasets, it is still possible to say that the results are good enough to prove that they are not the result of overfitting, as also for this dataset the F1-score is frequently higher than 70% for image forensics, textual analysis, and their combination.
Finally, the methodologies described are applied on an experimental dataset collected through a web crawler as described in Section
To better understand these results, part of the dataset (the 13 events in Table
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 92% | 88% | 92% |
Textual Analysis | 67% | 67% | 67% |
Combination 10%-90% | 85% | 80% | 85% |
Results obtained on
Classical | Splicebuster | CNN | |
---|---|---|---|
Image Forensics | 93% | 93% | 93% |
Textual Analysis | 87% | 87% | 87% |
Combination 40%-60% | 88% | 87% | 88% |
It appears that some events are harder to analyze than others. Among these it is possible to list events related to technology, movies, politics, or gossip. For instance, for the launch of new products or movies, or conferences about technological topics, the news might contain renderings, graphs, or even logos of the firms at stake that can be misclassified as fake both by the image and by the textual forensic algorithms. In these cases the misclassification resulting from image forensics is due to the fact that most images contained in this type of event are computer generated, and not actual pictures. Although some techniques to discriminate between computer generated and natural pictures exist [
Political events and gossip might also lead to misclassifications, as they might use satirical images or stock photos of the politicians or people involved. Although this analysis is beyond the scope of this work, it is interesting to note that for political events choosing a particular picture over another can noticeably bias the opinion of a reader. Some of the examples above can also be said to belong to the class of problems related to extracted images since, as already said, the three classes can overlap. Other types of images frequently misclassified are schemas, maps, and images related to the places where the events at stake occurred.
The last class of problems is caused by the quality of the extracted texts, which might lead to misclassification during the textual analysis. In fact, during this phase noise can be introduced, as Google Reverse Search interprets an image with the general concepts of
Other problems might be due to other types of noise introduced by the search. For instance, one of the images (
On the contrary, an image that was correctly predicted as fake is a demonstrative image, frequently used in association with articles related to the prevention of blood clots during flights. This image has been used in an article (
In general, although most of the images were correctly predicted as real, the performance of the devised methodology on this dataset is hard to evaluate, since the labeling process is not trivial. Therefore, the methodology cannot be said to be ready for real-world scenarios, but this experiment is still important to gain insight into possible issues and controversies related to the problem, which might be used in future works to improve the state of the art.
The objective of this work is to exploit state-of-the-art techniques to assess image authenticity and relevancy with respect to the news article to which the image is associated. This task is extremely important due to the amount of images uploaded online every day; those images can go viral within seconds in the case of high-impact events. The devised methodology performs rather well on this task, thanks to a combination of image forensics and textual analysis techniques, reaching an F1-score frequently higher than 70%.
Moreover, the analysis performed on a dataset created through a web crawler allows gaining insight into a number of problems that might arise when, instead of using ad hoc datasets, we look into more complex, unsupervised scenarios. Some of these observations suggest the need for more sophisticated techniques to extract the text associated with the images, as such texts are crucial for correct classification.
In general, the analysis of the last dataset highlighted that state-of-the-art methodologies, including the one presented here, present some critical issues when applied to real-world scenarios. A possible solution for overcoming these issues might be the creation of new and bigger datasets with careful labeling, which would also favor a better exploitation of the power of deep learning based approaches.
Part of the data can be found at the following link:
The authors declare that they have no conflicts of interest.