This technical article details the development and optimization of a web crawler capable of processing over 15,000 web pages per minute using NodeJS, Puppeteer, and BullMQ. We explain how the Remove Unused CSS (RUCSS) feature works and how it improves web performance by eliminating unnecessary CSS, shortening load times, and boosting key performance metrics. We also describe how we addressed the initial challenges, including inefficient processing and stability issues, by leveraging Puppeteer for resource collection, customizing JavaScript libraries, and implementing a queuing system with BullMQ. Lastly, we outline the operational practices that let us maintain high-quality standards, innovate rapidly, and support customers efficiently.
What is RUCSS?
The Remove Unused CSS (RUCSS) feature is designed to eliminate all CSS and stylesheets not used on a webpage, retaining only the CSS necessary for each page. Once the optimization is applied, your website will only deliver the needed CSS for the specific page a user requests, making the page load much quicker!
On average, this optimization reduced the size of the delivered CSS by 76%!
Benefits of RUCSS
- Reduced Page Size: Minimizes the overall size of the web page.
- Fewer HTTP Requests: Reduces the number of CSS stylesheets that need to be loaded.
- Faster Load Times: Improves the loading speed of the page.
- Enhanced Core Web Vitals: Improves key performance metrics such as Largest Contentful Paint (LCP) and Cumulative Layout Shift (CLS), along with First Contentful Paint (FCP).
- Elimination of Render-Blocking CSS: Prevents CSS from delaying the rendering of the page.
Identifying Unused CSS: a Surgical Strike
Removing unused CSS is a delicate process: for every CSS rule we remove, we must ensure it will never be needed on that page; otherwise, the layout could break! Several factors must be taken into account to identify unused CSS correctly:
- JavaScript Interactions: JavaScript dynamically changes the DOM and CSS. For example, a popup might appear after a button click, introducing new DOM elements and CSS rules. The same applies to sliders, off-canvas menus, galleries, and many other components. Handling these interactions is crucial for accurately identifying used CSS.
- CSS and HTML Variations: WordPress themes and plugins introduce endless variations of CSS rules. Handling nested CSS, edge cases, and even syntax errors is essential. Accurate HTML parsing is required to identify and remove unused CSS effectively.
- Viewport Sizes: Some CSS rules apply only at certain screen sizes. To serve the correct CSS to both mobile and desktop users, we generate used CSS for several screen sizes and deliver only the relevant version for each visit. Additionally, specific optimization rules apply to responsive CSS styles.
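As a rough illustration of the viewport point above, the following sketch picks one used-CSS variant for a visitor. It is not WP Rocket's actual delivery logic: the `UsedCssVariant` shape, the `pickVariant` helper, and the 480-pixel breakpoint are all assumptions made for the example.

```typescript
// Hypothetical shape: one used-CSS result per viewport the page was rendered at.
interface UsedCssVariant {
  name: string;     // illustrative label such as "mobile" or "desktop"
  maxWidth: number; // widest visitor viewport (CSS pixels) this variant covers
  css: string;      // used CSS computed at that viewport
}

// Returns the narrowest variant that still covers the visitor's viewport,
// falling back to the widest variant when none covers it.
function pickVariant(variants: UsedCssVariant[], viewportWidth: number): UsedCssVariant {
  let best: UsedCssVariant | undefined;
  let widest: UsedCssVariant | undefined;
  for (const variant of variants) {
    if (widest === undefined || variant.maxWidth > widest.maxWidth) {
      widest = variant;
    }
    const covers = viewportWidth <= variant.maxWidth;
    if (covers && (best === undefined || variant.maxWidth < best.maxWidth)) {
      best = variant;
    }
  }
  const chosen = best ?? widest;
  if (chosen === undefined) {
    throw new Error("no used-CSS variants available");
  }
  return chosen;
}

// Illustrative data: the breakpoint value is an assumption.
const exampleVariants: UsedCssVariant[] = [
  { name: "desktop", maxWidth: Infinity, css: ".menu{display:flex}" },
  { name: "mobile", maxWidth: 480, css: ".menu{display:none}" },
];

console.log(pickVariant(exampleVariants, 390).name);  // prints "mobile"
console.log(pickVariant(exampleVariants, 1440).name); // prints "desktop"
```

Keeping one variant per viewport means a mobile visitor never downloads rules that only matter on wide screens, at the cost of storing several results per page.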
RUCSS as a SaaS Solution
Processing page resources and generating used CSS is resource-intensive. Performing these tasks directly on customer websites can slow them down and may not be feasible on all servers and hosting environments. Therefore, some of WP Rocket’s most advanced features, such as Remove Unused CSS, are powered by our SaaS. These optimizations are performed on our servers upon request from the WP Rocket plugin, and the results are then applied to websites automatically. This approach has some great benefits for customers:
- Perform resource-intensive optimizations on our servers, reducing the load on users’ servers.
- Deliver enhancements as soon as possible and rapidly tackle feedback.
- Proactively observe and fix issues and errors without waiting for a support ticket.
- Ensure all users benefit from the latest versions without any client-side update.
- Iterate quickly and deploy improvements without needing to update the WP Rocket plugin.
What Does it Look Like Behind the Scenes?
The WP Rocket SaaS visits and optimizes up to 20k pages per minute, processing their CSS and above-the-fold images! To do so, we constantly operate thousands of web browsers across ~40 different servers to serve all our user requests within 2 minutes. Our team achieves this with state-of-the-art technologies including NodeJS, Django, Redis, CockroachDB & Kubernetes, operated together with the group.One teams.
The Key Technical Challenges
1. Collecting Page Resources
We extensively use Puppeteer, a Node library that provides a high-level API to control headless Chrome or Chromium browsers. This tool is essential for our resource collection strategy thanks to its ability to render web pages just like a real user would, ensuring that all dynamic elements are captured. It offers many powerful features that we rely on to optimize the process, such as viewport size control and request interception.
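To make request interception more concrete, here is a minimal sketch of a handler that records stylesheet URLs and aborts requests assumed to be irrelevant. It is an illustration rather than our production code: `InterceptedRequest` is a hand-written subset of Puppeteer's `HTTPRequest` so the snippet runs without a browser, and both `createRequestHandler` and the blocked resource types are assumptions.

```typescript
// Minimal subset of Puppeteer's HTTPRequest that this sketch relies on.
interface InterceptedRequest {
  url(): string;
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Assumption: these resource types are not needed to determine used CSS.
const BLOCKED_TYPES = new Set<string>(["media", "font"]);

// Builds a handler that collects stylesheet URLs into the given set
// and skips downloads of the blocked resource types.
function createRequestHandler(stylesheetUrls: Set<string>) {
  return (request: InterceptedRequest): Promise<void> => {
    const type = request.resourceType();
    if (BLOCKED_TYPES.has(type)) {
      return request.abort();
    }
    if (type === "stylesheet") {
      stylesheetUrls.add(request.url());
    }
    return request.continue();
  };
}
```

With a real browser, such a handler would typically be wired up after calling `page.setViewport({ width, height })` and `page.setRequestInterception(true)`, using `page.on("request", handler)`.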
2. Processing Page Resources
Since WordPress themes and plugins utilize a wide variety of CSS and JavaScript, we needed a robust solution to process these files.
After testing many methods to process the page resources, such as Webpack, PostCSS, and CleanCSS, we finally decided to fork and maintain our own customized JavaScript library to better suit our specific requirements. This includes handling edge cases, nested CSS rules, and syntax problems found in numerous WordPress themes and plugins to ensure the optimization can be performed reliably for all our users.
3. Fine-tuning With Our Team’s Expertise
Processing CSS can be very tricky, and the risk of breaking the page layout is high if the process is not handled with enough care. Our team has extensive knowledge of popular WordPress themes and page builders, as well as constant feedback from hundreds of thousands of users about compatibility with the latest WordPress trends. Therefore, we developed a dynamic safelisting system that lets our teammates update the CSS processing rules of our SaaS directly, in real time, so that we can continuously adapt our optimizations and increase compatibility every day.
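The idea behind safelisting can be sketched as follows. This is a simplified illustration, not our actual rule engine: the `CssRule` shape, the `keepUsedRules` helper, and the use of regular expressions as safelist entries are assumptions.

```typescript
// Hypothetical, simplified representation of a parsed CSS rule.
interface CssRule {
  selector: string;
  declarations: string;
}

// Keeps a rule when its selector was seen as used on the page,
// or when any safelist pattern matches the selector.
// Patterns should not use the "g" flag, since RegExp.test is stateful with it.
function keepUsedRules(
  rules: CssRule[],
  usedSelectors: Set<string>,
  safelist: RegExp[],
): CssRule[] {
  return rules.filter(
    (rule) =>
      usedSelectors.has(rule.selector) ||
      safelist.some((pattern) => pattern.test(rule.selector)),
  );
}

const exampleRules: CssRule[] = [
  { selector: ".hero", declarations: "margin:0" },
  { selector: ".popup-open .modal", declarations: "display:block" },
  { selector: ".unused-banner", declarations: "color:red" },
];

// ".popup-open" only appears after a click, so it is safelisted
// instead of being detected during rendering.
const kept = keepUsedRules(exampleRules, new Set([".hero"]), [/^\.popup-/]);
console.log(kept.map((rule) => rule.selector)); // prints [ '.hero', '.popup-open .modal' ]
```

Because the safelist is passed in at call time in this sketch, it could be loaded from a data store and changed without redeploying the processing code.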
4. Managing a Flow of Thousands of Pages Per Minute
To handle the flow of 15,000 web pages per minute, we implemented a queuing system using BullMQ. This system offers:
- Asynchronous Processing: Offloads the optimization process from the plugin to an asynchronous backend.
- Load Balancing and Concurrency Management: Distributes tasks across multiple servers, allowing the SaaS to handle a large number of URLs efficiently and to scale easily.
- Reliability and Fault Tolerance: All submitted tasks and their results are saved once they are available. This provides resiliency in case one of our servers fails or if the user’s website is not able to retrieve the results immediately.
- Prioritization: BullMQ queues can prioritize tasks so that more important ones are processed first. For instance, home pages are optimized first so that our users can immediately see the improvement on their homepage and test it right away with PageSpeed Insights!
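The prioritization behaviour described above can be sketched with a small in-memory queue. This is a stand-in for illustration, not BullMQ itself: `PriorityJobQueue`, `isHomepage`, and the priority values are assumptions. In BullMQ, a comparable effect would come from the `priority` job option, where a lower number means a higher priority.

```typescript
interface OptimizationJob {
  url: string;
  priority: number; // lower number = processed earlier
}

// Illustrative values, not the ones used in production.
const HOMEPAGE_PRIORITY = 1;
const DEFAULT_PRIORITY = 10;

// Treats a URL as a homepage when nothing but an optional "/" follows the host
// (a query string or fragment is ignored).
function isHomepage(url: string): boolean {
  return /^https?:\/\/[^\/?#]+\/?(?:[?#].*)?$/i.test(url);
}

class PriorityJobQueue {
  private jobs: OptimizationJob[] = [];

  add(url: string): void {
    const priority = isHomepage(url) ? HOMEPAGE_PRIORITY : DEFAULT_PRIORITY;
    this.jobs.push({ url, priority });
  }

  // Removes and returns the job with the lowest priority number;
  // jobs with equal priority come out in insertion order.
  next(): OptimizationJob | undefined {
    let best: OptimizationJob | undefined;
    for (const job of this.jobs) {
      if (best === undefined || job.priority < best.priority) {
        best = job;
      }
    }
    if (best !== undefined) {
      const chosen = best;
      this.jobs = this.jobs.filter((job) => job !== chosen);
    }
    return best;
  }
}

const queue = new PriorityJobQueue();
queue.add("https://example.com/blog/post-1");
queue.add("https://example.com/");
queue.add("https://example.com/about");

console.log(queue.next()?.url); // prints "https://example.com/"
console.log(queue.next()?.url); // prints "https://example.com/blog/post-1"
```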
Operational Excellence
Building a scalable and reliable SaaS is essential to keep the service simple and efficient for customers while handling a growing number of users and an increasing volume of page processing requests. Here’s an in-depth look at the steps that allowed us to get there.
1. Scalable Architecture
WP Rocket is used on millions of websites worldwide. Therefore, our SaaS must remain available and perform as expected at all times. The service is hosted on more than 40 physical servers located in different regions. They are orchestrated with Kubernetes to ease deployment, scaling, and management of container lifecycles. We rely on powerful features such as liveness probes, rolling updates, and automatic scaling to bring resiliency to the service and maintain uptime in all circumstances. Additionally, a custom gateway service performs job management and enforces API security (rate-limiting, authentication, etc.).
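As an illustration of the rate-limiting mentioned above, here is a token bucket, one common way to implement it. The article does not specify which algorithm the gateway uses, so this is only a sketch under that assumption; the `TokenBucket` class and its parameters are hypothetical. Time is passed in explicitly to keep the example deterministic.

```typescript
// Token bucket: each request consumes one token; tokens refill at a fixed rate
// up to a maximum capacity, which bounds the size of a burst.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true when the request is allowed, false when it should be rejected.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// A bucket allowing bursts of 2 requests, refilled at 1 request per second.
const bucket = new TokenBucket(2, 1, 0);
console.log(bucket.tryConsume(0));    // prints true
console.log(bucket.tryConsume(0));    // prints true
console.log(bucket.tryConsume(0));    // prints false
console.log(bucket.tryConsume(1000)); // prints true
```

In a gateway, one bucket would usually be kept per API key or per site, so that a single busy client cannot exhaust the capacity shared by everyone else.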
2. Monitoring & Alerting
All those services and servers are constantly monitored with technical metrics through Prometheus and Grafana, as well as with functional metrics, to ensure the service is stable and provides the best performance to our users. In-depth monitoring allows us to keep the success rate high, the job duration low, and the optimization efficiency best-in-class. Alerts from tools like Prometheus and Metabase allow us to react in real time when something is not behaving as expected.
3. Observability to Help Our Customers
While we work hard to make our SaaS easy to use for our customers, difficulties can occur in very complex setups involving, for instance, firewalls, security rules, or WordPress websites with many conflicting plugins or themes. Thanks to the SaaS approach, our support team is equipped with internal tooling through Metabase to access data related to a given user and observe their optimization jobs and results, which allows us to quickly identify possible issues. They are then able to repeat, tweak, and fine-tune jobs and get real-time feedback with Postman and Metabase. As a result, they can help struggling customers efficiently and even adjust the SaaS optimization rules directly to unblock users.
4. Automated Testing & Continuous Integration
Maintaining high-quality standards while keeping a significant delivery rate can be a challenge for engineers. We solve this problem with reliable automation that each code change goes through. From automated testing to one-button deployment in production, we also leverage mirroring, shadow release mechanisms, and progressive release rollouts. All those powerful approaches allow our engineering team to thrive and keep innovating without putting the quality of the service at risk. Most of this is automated so that we don’t even need to think about it and can focus on what matters: building the best features for our users!
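One common way to implement a progressive rollout is to map each site to a stable bucket and enable the new version for a growing share of buckets. The sketch below illustrates that general technique only; the article does not describe our actual mechanism, and `bucketFor`, `isInRollout`, and the site identifiers are hypothetical.

```typescript
// Maps an identifier to a stable bucket in [0, 100) using an FNV-1a style
// 32-bit hash over the string's UTF-16 code units.
function bucketFor(siteId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < siteId.length; i++) {
    hash ^= siteId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// A site is included once the rollout percentage exceeds its bucket, so raising
// the percentage only ever adds sites and never removes one already included.
function isInRollout(siteId: string, percentage: number): boolean {
  return bucketFor(siteId) < percentage;
}

const siteIds = ["site-a.example", "site-b.example", "site-c.example"];
for (const siteId of siteIds) {
  console.log(siteId, isInRollout(siteId, 25));
}
```

Because the bucket depends only on the identifier, a given site sees a consistent version across requests while the percentage is gradually raised from 0 to 100.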
Wrapping up
To summarize, developing and optimizing the web crawler using NodeJS, Puppeteer, and BullMQ has led to a highly efficient and scalable solution for processing over 15,000 web pages per minute. By addressing initial challenges and leveraging advanced tools and methodologies, we have created a robust system that significantly enhances web performance through the Remove Unused CSS (RUCSS) feature. Continuous integration, automated testing, and a focus on scalability and reliability ensure that our service remains top-notch, providing users with faster load times and improved overall web experiences.