Handcrafted RSSHub Route Failed

A few days ago, I snagged a free .news domain from Namecheap and registered hy2.news. Since I didn't have much use for it, I decided to set up a personal news feed page. Because the page needed to be dynamic, I brought out WordPress and used an RSS plugin to make it happen. The page is live at: Hyruo News

The page is up, but sourcing the RSS feeds turned out to be the problem. I started by trying to scrape the NodeSeek forum, where I've been active recently; after hours of effort, I still failed.


Handcrafting an RSSHub Route

The official RSSHub development tutorial is a bit jumpy, but the main process is as follows.

Preliminary Work

  1. Clone the RSSHub repository (might take over half an hour on a slow machine or connection): git clone https://github.com/DIYgod/RSSHub.git
  2. Install the latest version of Node.js (version 22 or newer): Node.js Official Site
  3. Install the dependencies: pnpm i
  4. Start the dev server: pnpm run dev
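
RSSHub uses pnpm; if it isn't installed yet, npm install -g pnpm (or corepack enable on recent Node.js versions) should get it. Once pnpm run dev is running, the dev server listens on http://localhost:1200 by default.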

Developing the Route

Developing the route is relatively simple. Open the lib/routes directory in the cloned repo, create a new folder under it named after the site, such as nodeseek, and then add two files to that folder: namespace.ts and custom.ts.
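
The resulting layout looks something like this (custom.ts is just the name used in this post; the file can be called anything):

lib/routes/nodeseek/
    namespace.ts    // metadata shared by every route under this namespace
    custom.ts       // the route file itself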

  1. namespace.ts File

This file can be copied from the official tutorial, or lifted from another folder under lib/routes and modified. Example:

import type { Namespace } from '@/types';

export const namespace: Namespace = {
    name: 'nodeseek',
    url: 'nodeseek.com',
    lang: 'zh-CN',
};
  2. custom.ts File

This is the main file of the route. The filename is up to you; pick something that matches the target site's structure, and browse the other folders to get the idea. The hard part is the actual scraping logic. Example:

import { Route } from '@/types';
import { load } from 'cheerio';
import { parseDate } from '@/utils/parse-date';
import logger from '@/utils/logger';
import puppeteer from '@/utils/puppeteer';
import cache from '@/utils/cache';

export const route: Route = {
    path: '/user/:userId',
    categories: ['bbs'],
    example: '/nodeseek/user/1',
    parameters: { userId: 'User ID, e.g., 1' },
    features: {
        requireConfig: false,
        requirePuppeteer: true, // Enable Puppeteer
        antiCrawler: true, // Enable anti-crawler
        supportBT: false,
        supportPodcast: false,
        supportScihub: false,
    },
    radar: [
        {
            source: ['nodeseek.com/space/:userId'],
            target: '/user/:userId',
        },
    ],
    name: 'NodeSeek User Topics',
    maintainers: ['Your Name'],
    handler: async (ctx) => {
        const userId = ctx.req.param('userId');
        const baseUrl = 'https://www.nodeseek.com';
        const userUrl = `${baseUrl}/space/${userId}#/discussions`;

        // Initialize a browser instance via RSSHub's Puppeteer utility
        const browser = await puppeteer();
        // Open a new tab
        const page = await browser.newPage();

        // Set request headers
        await page.setExtraHTTPHeaders({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
            Referer: baseUrl,
            Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        });

        // Visit the target link
        logger.http(`Requesting ${userUrl}`);
        await page.goto(userUrl, {
            waitUntil: 'networkidle2', // Wait for the page to fully load
        });

        // Simulate scrolling the page (if needed)
        await page.evaluate(() => {
            window.scrollBy(0, window.innerHeight);
        });

        // Wait for the post list to load
        await page.waitForSelector('a[href^="/post-"]', { timeout: 7000 });

        // Get the HTML content of the page
        const response = await page.content();
        const $ = load(response);

        // Extract the post list
        let items = $('a[href^="/post-"]')
            .toArray()
            .map((item) => {
                const $item = $(item);
                const title = $item.find('span').text().trim();
                const link = `${baseUrl}${$item.attr('href')}`;
                return {
                    title,
                    link,
                };
            });

        // Exclude two fixed links in the footer
        const excludedLinks = ['/post-6797-1', '/post-6800-1'];
        items = items.filter((item) => !excludedLinks.includes(new URL(item.link).pathname));

        // Extract up to 15 posts
        items = items.slice(0, 15);

        // Print the extracted post list
        console.log('Extracted post list:', items); // Debug info

        // If the post list is empty, the dynamic content might not have loaded
        if (items.length === 0) {
            throw new Error('Failed to retrieve post list, please check the page structure');
        }

        // Get the content of each post
        items = await Promise.all(
            items.map((item) =>
                cache.tryGet(item.link, async () => {
                    // Open a new tab
                    const postPage = await browser.newPage();

                    // Set request headers
                    await postPage.setExtraHTTPHeaders({
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
                        Referer: baseUrl,
                    });

                    // Visit the post link
                    logger.http(`Requesting ${item.link}`);
                    await postPage.goto(item.link, {
                        waitUntil: 'networkidle2', // Wait for the page to fully load
                    });

                    // Get the post content
                    const postHtml = await postPage.content();
                    const $post = load(postHtml);

                    item.description = $post('article.post-content').html();
                    item.pubDate = parseDate($post('time').attr('datetime'));

                    // Close the post page
                    await postPage.close();

                    return item;
                })
            )
        );

        // Close the browser instance
        await browser.close();

        // Return RSS data
        return {
            title: `NodeSeek User ${userId} Topics`,
            link: userUrl,
            item: items,
        };
    },
};
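
If everything works, the route should be reachable while pnpm run dev is running at http://localhost:1200/nodeseek/user/1, which is exactly the example declared in the route definition above.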

Summary of Main Issues

In short, the handcrafted NodeSeek route failed because I couldn't get past NodeSeek's anti-scraping measures and Cloudflare's protection.

The most drastic measure RSSHub offers is simulating real browser behavior with Puppeteer. However, machine-simulated behavior is still easily detected by platforms like Cloudflare.
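
For completeness: the usual workaround outside of RSSHub's bundled Puppeteer utility is the third-party puppeteer-extra package with its stealth plugin, which patches the most common headless-browser fingerprints. Below is a minimal standalone sketch for local experiments, assuming the puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed and the script runs as an ES module; this is not part of RSSHub and would not pass its review as-is:

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Patch the fingerprints (navigator.webdriver, WebGL vendor, etc.)
// that anti-bot services commonly probe for
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// The same NodeSeek user page the route above targets
await page.goto('https://www.nodeseek.com/space/1#/discussions', {
    waitUntil: 'networkidle2',
});

// If Cloudflare let the request through, the post links should be present
const postCount = await page.$$eval('a[href^="/post-"]', (links) => links.length);
console.log(`Found ${postCount} post links`);

await browser.close();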

During local testing I had roughly a 50% success rate, and that was with the latest version of Puppeteer; with the Puppeteer version pinned in RSSHub's official dependencies, it dropped below 10%. Considering that a submission to RSSHub has to pass a double review, those odds are just too low.

For now, I have to reluctantly give up.

RSSHub unable to retrieve post list

Blocked by anti-crawling measures, unable to access page information

Blocked by Cloudflare

When It Rains, It Pours

This morning, I woke up to find that the sky had fallen: six of my free *.us.kg domains were down because the parent domain had stopped resolving.

Then, at noon, the sky fell again: the .news domain I had just set up was suspended by the registrar. They sent me an email asking me to explain why my personal information had changed during registration, and then forcibly redirected the NS to an IP address nobody recognizes.

Well, that’s my own fault. I messed up during registration.When I directly used the browser’s auto-fill form program, I accidentally filled in all the real information. Then, when I tried to change it back, an error occurred as soon as I made the changes.

PS: In the afternoon, *.us.kg returned to normal. But I feel like I won’t love it anymore.