Case study: Insight4News

Project Title: Insight4News

Project Description: Insight4News is a system that connects news articles to social conversations, as echoed in microblogs such as Twitter. It tracks feeds from mainstream media (e.g., BBC, Irish Times), extracts relevant topics that summarize the tweet activity around each article, recommends relevant hashtags, and presents complementary views and statistics on the tweet activity, related news articles, and the timeline of the story with regard to Twitter reaction. While many systems tap into the social knowledge of Twitter to help users stay on top of the information wave, none is available for connecting news to relevant Twitter content on a large scale, in real time, with high precision and recall. Insight4News builds on our award-winning Twitter topic detection approach and several machine learning components to deliver news in a social context.
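The hashtag-recommendation step described above can be illustrated with a toy sketch. The function name, the stop-word list and the simple frequency heuristic below are illustrative assumptions for this case study, not the project's actual machine-learning pipeline:

```python
import re
from collections import Counter

# Toy stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "at"}

def recommend_hashtags(headline, tweets, k=3):
    """Rank candidate hashtags for a headline by how often its
    keywords appear in related tweet text (a toy frequency model)."""
    keywords = [w for w in re.findall(r"[a-z']+", headline.lower())
                if w not in STOPWORDS]
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for w in keywords:
            if w in text:
                counts[w] += 1
    return ["#" + w for w, _ in counts.most_common(k)]

# Example: keywords shared between the headline and the tweet stream
# surface as hashtag candidates.
tags = recommend_hashtags("Storm batters Irish coast",
                          ["Storm hits coast", "Coast flooding after storm"])
```

In the real system this ranking is learned rather than hand-coded, but the sketch shows the shape of the task: align article vocabulary with tweet activity and surface the overlap as hashtags.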


Social media networks are an increasingly important resource for researchers. The content posted on social media offers insights into a wide range of areas such as public health, information flow, political developments and social behaviour. Content posted on social media networks has the added attraction of accessibility: since the material has already been published, it falls outside the remit of higher education ethics boards and is largely unfettered by issues relating to consent, reuse, third party sharing, storage and other constraints that apply to data collected in more traditional ways. To date, there is little or no regulation in this area, and many researchers are working in the dark, hoping that their research will not infringe user rights, become vulnerable to emerging legislation or fall foul of the courts.

In their work on the Insight4News project, researchers identified three areas where ethics come into play and where researchers and research subjects may be vulnerable.


  1. Collecting public data from Twitter and news articles

A proportion of public statements on social media are later deleted. What happens when the deleted data has already been captured and republished by data researchers? Unless the individual poster approaches the research team directly and requests deletion, the material is likely to survive; deletion is at the discretion of the researcher in any case. Insight4News has received, and complied with, a number of requests for deletion of material. If the volume of requests increased, it would become impractical for the team to service every request. The result? It is likely that the research, currently available for the public and other researchers to access, would have to be taken offline.

Other types of content that compromise research output include posts that identify or have personal implications for the poster, ‘right to be forgotten’ cases, articles that were later retracted by the publisher and potentially libellous content.

However, if research outputs are taken offline, then the results and techniques are not available to be tested and verified by other researchers, and such verification is central to the scientific method.

Do the social media networks themselves have a role to play? Quora, an information-sharing website, doesn’t allow researchers to use the data it generates. This offers protection to users, but starves the research community of vital data. Social media analysis has many potentially positive applications from disease prevention to earthquake detection.

A framework that protects the user and the researcher would be welcome.

Key question: There is a conflict of rights between social media users who have a right to delete content and a right to be forgotten, and researchers who archive and use social media data for research purposes – research that could be of great societal benefit. Can this conflict be resolved? If not, which party should be prioritised?


  2. The problem of A/B testing

A/B testing (or ‘split testing’) is the practice of running two versions of a web page to compare user reactions. For example, if the page with the larger font gets a better conversion rate (the proportion of visitors who complete a desired action, such as clicking a link), that is the version that will be used. A/B tests are common in commercial market research.

Typically, when introducing a new research system, researchers will look for a change in user interaction between the old and new interface. Naturally, the test depends on user ‘ignorance’ – if users are informed of the change, their interaction will be influenced by that foreknowledge, skewing the result.
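The mechanics of such a test can be sketched in a few lines. This is a minimal illustration assuming a deterministic hash-based bucketing scheme; the function names are hypothetical, not Insight4News code:

```python
import hashlib

def assign_variant(user_id):
    """Deterministically bucket a user into variant 'A' or 'B' by
    hashing their id, so the same user always sees the same version."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "A" if h % 2 == 0 else "B"

def conversion_rates(events):
    """Compute per-variant conversion rates from (variant, converted)
    pairs, e.g. [("A", True), ("A", False), ("B", True)]."""
    totals = {"A": 0, "B": 0}
    hits = {"A": 0, "B": 0}
    for variant, converted in events:
        totals[variant] += 1
        hits[variant] += int(converted)
    return {v: hits[v] / totals[v] if totals[v] else 0.0 for v in totals}
```

The hash-based assignment is the piece that depends on user ‘ignorance’: users are silently split between versions, which is precisely what raises the ethical questions discussed below.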

Researchers suggest that there is an ethical dimension to A/B testing in data science research involving live websites or social media accounts, like Insight4News. When the team uses a ‘bot’ account (an account that is automated, rather than managed by a real person) to test responses to, for example, the inclusion of hashtags, it is necessary for the user to believe that he or she is interacting with a real person. When people follow, like and comment on content generated by a bot, could this represent an infringement of the rights of the social media user? The insertion of a disclaimer would, however, change the behaviour of the user and devalue the data from a research perspective. There are currently no rules to prevent ‘bot’ sites or accounts masquerading as ‘real’. Would the introduction of such a rule impede social media research?

Currently the public are being ‘misled’ on a grand scale – it is estimated that up to 50 per cent of all Twitter accounts are automated. Researchers are aware of this; the public is not. Is public education the answer? Would a social media literacy programme for schools help to give social media users more agency?

Key question: Twitter bots are a valuable research tool, but many Twitter users are unaware that they are interacting with a bot rather than a real person. This lack of awareness is an important aspect of the research, as the researchers are studying user interactions and reactions. However, is it ethically acceptable to mislead users in this way? Commercial entities are likely to continue the practice as long as it is within the law – shouldn’t researchers be able to do likewise if the research can contribute to the public good?


  3. Social media analysis and ‘fake news’

The Insight4News project captures, categorises and disseminates news. There is no verification process for news published online. Google, Twitter, Facebook and other social media providers are all moving to verify news output but it is not known when these verification processes will emerge or how reliable they will be.

In the meantime, social media analysts and news media analysts run the risk of republishing false news. Researchers raise the concern that data researchers might unwittingly contribute to the spread of propaganda, false information, clickbait and the like. A verified-content stamp applied by the social media provider would be very useful. For now, researchers typically rely on the guidelines of the information provider. However, it is likely that many researchers are not in fact familiar with the terms and conditions of data use laid out by the social media networks from which they mine much of their data.

Key question: Without a content verification tool, Insight4News runs the risk of widening the reach of fake news stories. Is there an obligation on the part of social media companies to provide verification tools?

Key question: Should social media networks be obliged to make their terms and conditions more user friendly for both users and researchers?
