Context Cards
ML to create and distribute short-form context for long-lasting news cycles alongside articles.
Find us at https://context-cards.com

Why

Problem
● Misinformation spreads faster than the debunks 🚀
● There is an under-used, forgotten database of already-existing mythbusters and explainers ☝
● Without formal institutional memory of how issues have evolved over the years, stories can lack historical context.

Opportunity
● News packaged with context and/or data can increase trust in the newsroom as a source of reliable information 🤝
● Proactive debunking rather than reacting to mis/disinformation 🚫
● Context Cards offers internal timelines and context that have been approved by an editor, speeding up editorial processes ✅

Solution
Extract context from the topic or news cycle an article belongs to, and serve it to the audience alongside the article.
Detailed case study on the discovery of the solution at https://context-cards.com

What is context?
In the previous iteration of JournalismAI, Clwstwr, Deutsche Welle, Il Sole 24 Ore, and Maharat Foundation identified sixty user-need questions that audiences have of journalism. Find them at ModularJournalism.com

What: User Experience
Each context card answers one or more of those user-need questions.

Nudges
Get the audience to ask questions.

Introduction
Q-1017: Can you tell me what happened in very few words?
● Headline of the topic
● Three-line description of the topic
● Follow button

Timeline
Q-1010: What has got us here?
● List of related stories in descending order

Expert Speak
Q-1008: What do key people say?
Q-1029: How many points of view are there on this topic?
● Pull quotes from articles
● Viewpoints of the different parties
● List of opinion articles

Data
Q-1010: What has got us here?
● Data snippets
● Charts from related articles
● Articles tagged as data dives

FAQs
Modelled on Google's 'People Also Ask'.

Mentions
Q-1004: Who is it about?
Q-1005: Where did it happen?
Q-1028: Who is involved?

How
Step 1: Find relevant articles from the archive (topic modeling)
Step 2: Generate text for the context cards
● Option #1: Task-specific models
● Option #2: GPT-3
● Option #3: Google's T5 and FLAN-T5
Step 3: Prepare to publish to audiences (Newscards)

How, Step 1: Topic Modeling
Data for training: 70,730 articles from TOI.
Data for testing: related articles that editors had historically tagged by hand in our CMS.
Algorithms we tried: Top2Vec and BERTopic. Out of the box, BERTopic gave higher accuracy (52.97%) than Top2Vec (50%).

Refining accuracy
● We tried new embedding methods for BERTopic and Top2Vec; Doc2Vec embeddings with Top2Vec gave us higher accuracy.
● Most of the failed cases belonged to one sub-product (ETimes): articles in the Entertainment section usually can't be clustered into specific news cycles or meaningful topics.
● Implementing Top2Vec with joint word-and-document embeddings from a Doc2Vec model, and excluding ETimes stories, achieved an accuracy of 73.55% (a sketch of this configuration follows).
Detailed case study on topic modeling at https://context-cards.com
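As a rough illustration of that winning Step 1 configuration, here is a minimal sketch: Top2Vec over the archive with its built-in joint Doc2Vec word-and-document embeddings. `load_articles` and its `exclude_sections` parameter are hypothetical stand-ins for however the archive is actually queried.

```python
# Minimal sketch of Step 1: cluster archive articles into news cycles
# with Top2Vec, using its joint Doc2Vec word-and-document embeddings --
# the configuration that reached 73.55% accuracy once ETimes stories
# were excluded. `load_articles` is a hypothetical archive loader.
from top2vec import Top2Vec

articles = load_articles(exclude_sections=["ETimes"])  # hypothetical helper

model = Top2Vec(
    documents=articles,
    embedding_model="doc2vec",  # joint word + document embeddings
    speed="learn",
    workers=8,
)

# Each discovered topic approximates one news cycle; surface the most
# representative stories for the editorial dashboard to review.
for topic_num in range(model.get_num_topics()):
    docs, scores, doc_ids = model.search_documents_by_topic(
        topic_num=topic_num, num_docs=10
    )
```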
How, Step 1: Editorial Dashboard
● We've set up an editorial team to review the output of the algorithm.
● Editors can mark false positives, i.e. stories the algorithm placed in a topic that don't belong there.
● Editors can also fix false negatives, i.e. stories they feel are part of a topic but the algorithm did not catch, by adding the story's ID.

How, Step 2, Option #1: Task-specific small language models
● Mentions: Use spaCy for named entity recognition, with Wikipedia lookups to reduce noise (a sketch follows at the end of this section).
● FAQs: Questgen.ai is a ready-made API for this, but it accepts only 1,000 words per input, so we'll build our own.
● Data: Train spaCy to pull out data snippets; list all charts used within the stories in a topic.
● Timeline: List the stories in a topic in descending order, or run topic modeling to find events and summarize them.
● Expert Speak: In a previous iteration of JournalismAI, the Guardian built a model to extract quotes from articles.

How, Step 2, Option #2: GPT-3
● Data 🤔 It can extract data snippets, but the output requires sufficient fact-checking.
● FAQs 👍 It is able to generate good FAQs and answer them.
● Mentions ✅ It can extract entities and write bios for them; however, in its attempt to be contextual, the bios are parochial.
● Expert Speak ✅ It can extract quotes and the viewpoints of the various parties involved.
● Timeline 🤔 While it generates a timeline, it tends to get dates wrong and requires sufficient fact-checking.

Conclusion: GPT-3 is a good option for getting a proof of concept of Context Cards up. We'll need to build in editorial review for fact-checking and refinement before anything is published to audiences!
Detailed evaluation of GPT-3 at https://context-cards.com
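Picking up the Mentions extractor from Option #1, here is a minimal sketch, assuming spaCy's off-the-shelf English NER model and the community `wikipedia` package. The Wikipedia noise filter shown is a hypothetical stand-in, since this write-up doesn't detail that step.

```python
# Minimal sketch of the Mentions card under Option #1: spaCy NER pulls
# people, places, and organisations from a story; a Wikipedia lookup
# drops noisy entities. The lookup is a hypothetical stand-in for
# whatever noise filter the team actually used.
import spacy
import wikipedia  # the community `wikipedia` PyPI package

nlp = spacy.load("en_core_web_sm")

def extract_mentions(article_text):
    """Return entities answering 'who is it about / where / who is involved'."""
    doc = nlp(article_text)
    mentions = []
    for ent in doc.ents:
        if ent.label_ not in {"PERSON", "GPE", "ORG"}:
            continue
        try:
            # Keep only entities that resolve to a Wikipedia page.
            wikipedia.page(ent.text, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue
        mentions.append((ent.text, ent.label_))
    return mentions
```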
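And to make Option #2 concrete, a minimal sketch of drafting one card (Timeline) with the Completions API that GPT-3 exposed at the time. The prompt wording and the `draft_timeline_card` helper are illustrative assumptions, not the project's actual prompts; as the evaluation above notes, drafts go to editorial review because GPT-3 tends to get dates wrong.

```python
# Minimal sketch: drafting a Timeline context card with GPT-3 via the
# era-appropriate Completions API. The prompt is an illustrative
# assumption, not the project's actual prompt. Output is a draft for
# the editorial dashboard, never published directly to audiences.
import openai

openai.api_key = "sk-..."  # in practice, read from an environment variable

def draft_timeline_card(article_texts):
    """Ask GPT-3 for a dated timeline of events across related stories."""
    prompt = (
        "Build a chronological timeline of the key events in the articles "
        "below. One line per event, formatted as 'DATE - event'.\n\n"
        + "\n---\n".join(article_texts)
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=400,
        temperature=0.2,  # keep drafts conservative
    )
    # GPT-3 tends to get dates wrong: an editor must fact-check this draft.
    return response.choices[0].text.strip()
```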