**Session Date/Time:** 23 Jul 2025 07:30

# maprg

## Summary

This MapRG session focused on the growing problem of crawler traffic, particularly from AI crawlers, and its impact on website infrastructure and content access. Presentations covered measurements of crawler behavior, the effectiveness of blocking mechanisms, the experiences of infrastructure operators like the IETF and Wikimedia, and forward-looking approaches like IndexNow for more efficient crawling. The discussion highlighted the need for better standards, cooperation between crawlers and content providers, and sustainable models for knowledge sharing.

## Key Discussion Points

* **Crawler Traffic Volume and Impact:** Automated traffic accounts for a large share of web traffic (30-50%), with AI crawlers a major driver. This traffic is increasingly impacting website infrastructure and user experience, especially for high-interest events. Wikimedia noted a 50% increase in bandwidth usage since January 2024, largely due to bots.
* **Blocking Mechanisms and Their Limitations:** Content creators use robots.txt and active blocking to control crawler access, but both have limitations. Robots.txt is voluntary and easily circumvented, while active blocking can impact legitimate users. Cloudflare's AI crawler blocking feature relies on user agent strings, which can be spoofed.
* **Misuse of HTTP Status Codes:** Websites often misuse HTTP status codes to signal crawler refusals, making it difficult for crawlers to interpret and react appropriately.
* **Crawler Respect for Robots.txt:** Most well-known crawlers respect robots.txt, but exceptions exist, particularly among third-party AI assistant crawlers.
* **Centralization and Its Impact:** Large hosting providers enforcing unified blocking policies can create blind spots for crawlers, affecting a large number of domains.
* **IndexNow Protocol:** IndexNow, a protocol for real-time push notifications of content updates, aims to reduce redundant crawling and improve content freshness. 50% of newly indexed URLs on Bing originate from IndexNow.
* **The Dichotomy of Open Access and Sustainability:** Maintaining open access to content while ensuring sustainable infrastructure is a key challenge for organizations like Wikimedia.
* **Need for Sustainable Knowledge-as-a-Service Model:** The meeting discussed the need for systemic, dynamic approaches to maintain preferential access for humans and mission-oriented traffic, while managing the impact of bots.

## Decisions and Action Items

* **Follow-up on IndexNow:** Encourage more discussion about IndexNow and its potential for wider adoption and standardization within the community.
* **Continue discussions:** Participants agreed to continue discussions about crawler behavior, blocking effectiveness, and alternative solutions for content access and sustainability.

## Next Steps

* Further explore standardization efforts for crawler identification and communication of crawling preferences.
* Investigate best practices for content providers to manage crawler traffic and protect their infrastructure.
* Promote collaboration between crawlers and content providers to develop sustainable models for knowledge sharing and content reuse.

---

**Session Date/Time:** 25 Jul 2025 07:30

# maprg

## Summary

The Measurement and Analysis for Protocols Research Group (maprg) meeting featured six presentations on measurement studies related to network protocols, including Starlink performance, Anycast routing, and QUIC implementation. The discussions covered performance metrics, measurement methodologies, and potential optimizations.

## Key Discussion Points

* **Starlink Performance:** Analysis of latency and packet loss in Starlink networks, revealing geographical variations and temporal trends.
  Ground station placement and user numbers were considered as contributing factors. The discussion highlighted the black-box nature of Starlink and the need for further investigation into its internal workings. Ping measurements showing packet loss were debated for their representativeness of overall traffic.
* **Starlink Content Delivery:** Examination of content delivery over Starlink, focusing on the interaction between terrestrial CDNs and LEO networks. Inconsistencies in ground infrastructure and satellite deployment were identified as potential sources of performance variations. A proposal for space-based CDNs was mentioned.
* **Anycast Routing Measurement:** Presentation of a tool for measuring Anycast RTTs and inferring optimal routing paths. The tool combines catchment analysis with unicast probing to identify suboptimal routing and potential performance improvements.
* **QUIC Handshake Optimization (Instant Acknowledgement):** Investigation into the impact of instant acknowledgements on QUIC connection performance. The results showed that instant acknowledgements can improve performance under certain conditions, but can also lead to worse performance in the presence of packet loss. Real-world CDN deployment of this optimization was also presented.
* **QUIC Performance Comparison:** Comparative analysis of QUIC performance with and without HTTP. The results revealed significant differences in throughput and CPU utilization depending on the QUIC implementation, HTTP version, and traffic generator. Offloading features were also found to have a significant impact on performance.
* **QUIC Pacing Strategies:** Study of the pacing behavior of different QUIC implementations. The results showed significant variations in inter-packet gaps and burst lengths. A kernel patch was proposed to improve pacing when using Generic Segmentation Offload (GSO). The ideal pacing strategy in wired vs. wireless networks was debated.
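
The two pacing metrics named in the QUIC pacing item, inter-packet gaps and burst lengths, can be illustrated with a minimal Python sketch. It is not the presenters' methodology: the function names, the microsecond timestamps, and the 100 µs threshold used to decide that two packets belong to the same burst are all assumptions made for illustration.

```python
from typing import List


def inter_packet_gaps(timestamps_us: List[int]) -> List[int]:
    """Return the gaps (in microseconds) between consecutive packet timestamps."""
    return [later - earlier
            for earlier, later in zip(timestamps_us, timestamps_us[1:])]


def burst_lengths(timestamps_us: List[int], burst_gap_us: int = 100) -> List[int]:
    """Group packets into bursts and return the number of packets in each.

    Assumption: a packet belongs to the current burst when it follows the
    previous packet by less than burst_gap_us. The threshold is hypothetical
    and would need tuning for a real capture.
    """
    if not timestamps_us:
        return []
    bursts = [1]  # the first packet always opens a burst
    for gap in inter_packet_gaps(timestamps_us):
        if gap < burst_gap_us:
            bursts[-1] += 1   # close enough: extend the current burst
        else:
            bursts.append(1)  # large gap: start a new burst
    return bursts


# Hypothetical send timestamps (microseconds) from a packet capture.
capture = [0, 10, 20, 1000, 1010, 2000]
print(inter_packet_gaps(capture))  # [10, 10, 980, 10, 990]
print(burst_lengths(capture))      # [3, 2, 1]
```

A well-paced sender would show mostly single-packet bursts with evenly spaced gaps, whereas batching packets through segmentation offload tends to produce longer bursts separated by larger gaps.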

## Decisions and Action Items

* **Hackathon Champion:** The group is seeking a champion to participate in the hackathon in Montreal in November.
* **Tool Availability:** The Anycast RTT measurement tool is available upon request, with the intention of making it public this year.
* **QUIC Implementers Feedback:** Encourage future speakers to reach out to implementers when researching protocol implementations.
* **Future Presentations:** Encourage the community to present data produced with the Anycast RTT measurement tool at future sessions.

## Next Steps

* Further investigation into Starlink performance variations, including the impact of ground station placement, user numbers, and technology changes.
* Continued development and evaluation of content delivery strategies for LEO networks, including the space-based CDN proposal.
* Public release of the Anycast RTT measurement tool.
* Further research into QUIC implementation performance, including the impact of different congestion control algorithms, traffic generators, and hardware offloading features.
* Explore collaboration opportunities to refine the proposed kernel patch for QUIC pacing.