PDRI/USAID-DRG/ISSER Policy Brief 2: High Frequency Tracking of Civic Space Utilizing Domestic News Scraping and Large Language Model Classification
August 30, 2024
This policy brief was prepared by Donald A. Moratz, Jeremy Springman, Serkant Adiguzel, Zung-Ru Lin, Diego Romero, Hanling Su, Jitender Swami, Rethis Togbedji Gansey, Mateo Villamizar-Chaparro, and Erik Wibbels from the University of Pennsylvania, Sabanci University, Utah State University, and Duke University. It forms part of a series of policy briefs released by PDRI in cooperation with these institutions.
Key Takeaways
- Enhanced Tracking of Civic Space: Traditional indicators, such as those provided by the Varieties of Democracy (V-Dem) project, are updated annually, which is too infrequent to capture rapid changes in civic space. The Machine Learning for Peace (MLP) infrastructure, using customized scrapers and large language model (LLM) classification, provides monthly measures of civic space, enabling a more nuanced and timely understanding of civic dynamics.
- Challenges with International Media: International media often fails to cover the full spectrum of civic space activities due to selective reporting and a focus on high-profile events. Domestic news sources, while offering a more complete picture, require rigorous curation to ensure data reliability and accuracy.
- Importance of Human Curation: Automated tools like GDELT and the Internet Archive offer broad coverage but tend to be less accurate and less comprehensive than well-curated domestic news scraping. Effective data collection requires significant human oversight to manage inconsistencies and ensure accuracy.
Introduction
In the context of rapid political and social changes, including protests, legal changes, and political unrest, timely and accurate tracking of civic space is essential. While traditional indicators such as those from V-Dem provide annual updates, they may not reflect the fast-paced nature of civic space developments. The MLP infrastructure aims to address this gap by using customized news scrapers and large language model (LLM) classification to measure civic space monthly.
The Challenge
- Inadequate International Media Coverage: International media sources often present an incomplete view of civic space due to limited reporting on less dramatic but significant events. This incomplete coverage can skew perceptions of the state of civic space.
- Complexities in Domestic News Scraping: Domestic news scraping involves collecting data from national media outlets, which requires addressing issues such as varying publication formats, website structures, and the frequency of news updates. Automated tools can struggle with these complexities, making human supervision crucial for maintaining data quality.
Our Approach
- Customized News Scraping: Our approach involves developing custom scrapers to collect articles from widely read national media sources in 62 countries, spanning nearly 40 languages. This method is designed to provide a broad and accurate representation of civic space activities across different contexts.
- Human Supervision and Curation: Human oversight plays a critical role in managing the variability in publication volumes and website structures. This supervision helps identify and correct errors, such as misclassified articles or technical issues, so that the data collected is both accurate and reliable.
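One way human supervision of publication volumes could be supported is a simple check that flags months where a source's article count departs sharply from its recent baseline, so a curator can inspect the scraper or the website. The sketch below is illustrative only: the function name, the six-month window, the thresholds, and the sample counts are all assumptions, not the MLP team's actual procedure or data.

```python
from statistics import median


def flag_volume_anomalies(monthly_counts, window=6, drop_ratio=0.5, spike_ratio=2.0):
    """Return (month, count, baseline) tuples that a human curator should inspect.

    monthly_counts: list of (month_label, article_count) in chronological order.
    A month is flagged when its count is below drop_ratio * baseline or above
    spike_ratio * baseline, where baseline is the median of the preceding
    `window` months. All thresholds here are hypothetical defaults.
    """
    flagged = []
    for i in range(window, len(monthly_counts)):
        month, count = monthly_counts[i]
        baseline = median(c for _, c in monthly_counts[i - window:i])
        if baseline == 0:
            continue  # no usable baseline; leave this month to manual inspection
        if count < drop_ratio * baseline or count > spike_ratio * baseline:
            flagged.append((month, count, baseline))
    return flagged


# Synthetic counts for one hypothetical source (not real MLP data).
example_counts = [
    ("2023-01", 400), ("2023-02", 420), ("2023-03", 390),
    ("2023-04", 410), ("2023-05", 405), ("2023-06", 415),
    ("2023-07", 120),  # sudden drop, e.g. a site redesign breaking the scraper
    ("2023-08", 400),
]

for month, count, baseline in flag_volume_anomalies(example_counts):
    print(f"{month}: {count} articles vs. baseline {baseline}")
```

A median baseline is used here so that a single broken month does not distort the comparison for the months that follow it. A flag is only a prompt for review; deciding whether the drop reflects a scraper failure or a real change in publishing remains a human judgment.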
Findings
- Performance Comparison with Automated Tools: Our analysis shows that the MLP approach offers superior coverage and accuracy compared to automated tools like GDELT and the Internet Archive. For example, our data collection on Bangladeshi news sources yields more comprehensive and precise coverage than these automated sources.
- Discrepancies Between Domestic and International Coverage: We observe a low correlation between domestic and international media coverage of civic space events. International media tends to emphasize high-profile incidents such as acts of violence, while domestic media provides a more comprehensive view, including legal actions, elections, and everyday civic activities.
- Volume of International Coverage: Although countries with substantial international media coverage, such as Turkey and India, show a higher degree of correlation between domestic and international reports, significant discrepancies persist. This indicates that even in well-covered countries, domestic reporting may still offer unique insights not captured by international media.
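The comparison between domestic and international coverage can be illustrated with a correlation between two monthly series for the same country and event type. The brief does not state which correlation statistic was used; the sketch below assumes a Pearson correlation, and the function name and input layout are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series.

    xs and ys are assumed to be monthly values (for example, the share of
    articles reporting a given civic space event) from domestic and
    international sources, aligned month by month.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Called as `pearson(domestic_series, international_series)`, a value near 1 would mean the two sources rise and fall together, while a value near 0 would correspond to the low correlation described above.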
Policy Implications
- Data-Driven Policy Making: Policymakers should consider the limitations of international media data and utilize comprehensive, human-curated domestic news data for a more accurate assessment of civic space. The MLP approach offers a more detailed and timely view of civic dynamics, which is crucial for informed decision-making.
- Investment in Enhanced Data Collection: To improve the accuracy and comprehensiveness of civic space tracking, future investments should focus on methods that combine automated tools with human supervision. This approach will help address the limitations of existing data sources and provide a more reliable basis for policy and analysis.
Acknowledgements
This study was funded under the Swift Expertise and Grounded Analytics Task Order by the United States Agency for International Development (USAID) Democracy, Human Rights, and Governance Bureau. We acknowledge the support and collaboration of PDRI, ISSER, and the participating research institutions.