CSS Custom Cursor .cursorrules prompt file
About .cursorrules prompt file
What you can build
Web Scraper Template Generator: Develop a tool that generates Python web scraping templates based on user input. It should allow users to specify the target website, data to extract, and the desired format, then generate a modular script using requests, BeautifulSoup, and aiohttp with best practices in error handling and performance optimization.
Async Scraping Scheduler: Create an application that schedules asynchronous web scraping tasks using aiohttp to handle data extraction from multiple websites concurrently. It should allow users to configure scraping frequency, retries with exponential backoff, and respect for rate limits (a retry-with-backoff sketch appears after this list).
Error-Resilient Scraping Framework: Build a Python framework that incorporates robust error handling and logging for web scraping. The framework should automatically manage invalid URLs, timeouts, and missing data elements, using a logging system to track issues and suggest fixes (see the early-return error-handling sketch after this list).
Scraping Dashboard with Monitoring: Develop a web application that provides a live dashboard for monitoring the status and performance of web scraping operations. It should visualize metrics like request rates, errors, and data volumes, and allow users to manage and configure individual scraping tasks.
Personalized Web Scraping Service: Offer a subscription-based service where users define specific scraping tasks; the service performs these tasks asynchronously, caches results to minimize redundant data requests, and delivers the data in a user-friendly format.
Modular Data Extraction Tool: Create a Python package providing a suite of utility functions for modular data extraction, such as functions for extracting links, text, and images. It should integrate seamlessly with BeautifulSoup and lxml for efficient data parsing (see the extraction-helper sketch after this list).
Web Scraping Learning Platform: Construct an educational platform that teaches users how to write modular, efficient web scraping scripts using Python. The platform should provide interactive tutorials covering libraries like requests, BeautifulSoup, aiohttp, and lxml, with real-world examples and exercises.
Dynamic Content Scraper: Develop a tool that scrapes dynamically loaded content with asynchronous requests, using aiohttp to call the JSON/XHR endpoints from which JavaScript-heavy websites load their data. The tool should include a feature for reproducing the user interactions needed to trigger that content.
Automated Rate Limiter for Web Scraping: Create a utility that automatically respects websites' robots.txt files and implements rate limits and request-throttling strategies to avoid being banned while scraping (see the robots.txt-aware throttling sketch after this list).
Cached Web Scraping API: Offer an API service that performs web scraping tasks with built-in caching to provide fast response times for repeated requests. The API should allow users to specify the URL and the data they wish to extract, returning the cleaned and parsed data (a simple caching sketch appears after this list).
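For the async scheduler idea, a minimal sketch of the retry-with-backoff and concurrency-capping pattern using aiohttp might look like the following; the function names, timeout, retry count, and concurrency limit are illustrative assumptions rather than anything prescribed by the prompt file.

```python
import asyncio
import aiohttp

MAX_CONCURRENT_REQUESTS = 5  # illustrative cap; tune per target site
MAX_RETRIES = 3

async def fetch_with_backoff(session, url, semaphore, max_retries=MAX_RETRIES):
    """Fetch a URL, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            async with semaphore:  # limit how many requests run at once
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                    response.raise_for_status()
                    return await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                return None  # give up after the final attempt
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...

async def scrape_all(urls):
    """Scrape many URLs concurrently while respecting the concurrency cap."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_backoff(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    pages = asyncio.run(scrape_all(["https://example.com"]))
```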
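For the error-resilient framework, the early-return-plus-logging style could be sketched as follows, assuming requests and the standard logging module; the names and thresholds are hypothetical.

```python
import logging
import requests
from urllib.parse import urlparse

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def fetch_page(url, timeout_seconds=10):
    """Return the page HTML, or None on any failure, using early returns."""
    parsed_url = urlparse(url)
    if parsed_url.scheme not in ("http", "https"):
        logger.error("Invalid URL skipped: %s", url)
        return None  # early return: malformed URL
    try:
        response = requests.get(url, timeout=timeout_seconds)
    except requests.Timeout:
        logger.warning("Timeout fetching %s", url)
        return None  # early return: request timed out
    except requests.RequestException as exc:
        logger.warning("Request failed for %s: %s", url, exc)
        return None  # early return: connection or protocol error
    if response.status_code != 200:
        logger.warning("Unexpected status %s for %s", response.status_code, url)
        return None  # early return: non-200 response
    return response.text
```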
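For the modular extraction tool, a few pure helper functions built on BeautifulSoup (with the lxml parser installed for speed) might look like this sketch; the function names are illustrative.

```python
from bs4 import BeautifulSoup

def parse_html(html):
    """Parse an HTML string with the lxml backend (requires the lxml package)."""
    return BeautifulSoup(html, "lxml")

def extract_links(soup):
    """Return every href found in anchor tags."""
    return [anchor["href"] for anchor in soup.find_all("a", href=True)]

def extract_text(soup):
    """Return the visible text, collapsed to single spaces."""
    return " ".join(soup.get_text(separator=" ").split())

def extract_image_sources(soup):
    """Return the src attribute of every <img> tag."""
    return [image["src"] for image in soup.find_all("img", src=True)]
```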
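For the rate-limiter utility, one possible sketch combines the standard library's urllib.robotparser with a simple time-based throttle; the user-agent string and minimum delay below are placeholder values.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse, urljoin

import requests

USER_AGENT = "MyScraperBot"          # placeholder bot name
MIN_SECONDS_BETWEEN_REQUESTS = 1.0   # placeholder politeness delay

_robot_parsers = {}    # cache one parser per site origin
_last_request_time = 0.0

def is_allowed(url, user_agent=USER_AGENT):
    """Check the site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    parser = _robot_parsers.get(origin)
    if parser is None:
        parser = robotparser.RobotFileParser(urljoin(origin, "/robots.txt"))
        parser.read()
        _robot_parsers[origin] = parser
    return parser.can_fetch(user_agent, url)

def polite_get(url):
    """Fetch a URL only if robots.txt allows it, throttling request frequency."""
    global _last_request_time
    if not is_allowed(url):
        return None  # disallowed by robots.txt
    elapsed = time.monotonic() - _last_request_time
    if elapsed < MIN_SECONDS_BETWEEN_REQUESTS:
        time.sleep(MIN_SECONDS_BETWEEN_REQUESTS - elapsed)
    _last_request_time = time.monotonic()
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```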
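For the cached scraping API, an in-memory cache keyed by URL is the simplest starting point, sketched here with functools.lru_cache; a production service would typically add expiry and persistent storage.

```python
import functools
import requests

@functools.lru_cache(maxsize=256)
def fetch_cached(url):
    """Fetch a URL once and serve repeat requests for the same URL from memory."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```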
Benefits
- Emphasizes modular, functional programming with pure functions rather than classes for scraping logic, which enhances code reusability and readability.
- Encourages error handling through early returns and logging, improving robustness by managing common errors like invalid URLs and timeouts.
- Advocates for performance optimizations using asynchronous libraries and caching strategies, ensuring high efficiency and scalability in web scraping tasks.
Synopsis
Developers building Python web scrapers can use this prompt to create efficient, scalable, and maintainable scraping scripts that adhere to best practices and modular design principles.
Overview of .cursorrules prompt
The .cursorrules file provides guidelines and best practices for Python web scraping, emphasizing modular, concise, and efficient code design. It covers key principles such as using descriptive variable names, organizing scripts, and favoring functional programming. It suggests using specific libraries like requests for HTTP requests and BeautifulSoup for HTML parsing. The file also highlights error handling strategies, dependencies, and performance optimization techniques, including the use of asynchronous libraries like aiohttp for improved performance. Additionally, it sets out key conventions for creating scalable and maintainable web scraping scripts.