Optimize The Title And Description Extraction Method #29

Closed
opened 2024-12-14 15:41:27 +00:00 by owner · 1 comment
Owner

Right now, to extract the title and description, the whole HTML is downloaded and passed around between the various methods. The worst part is that the title and description extraction works by running the WHOLE HTML through regex. That's not great, as the screenshot below shows: when the scanner starts extracting the title and description, the CPU gets quite toasty, and the extraction can take some time, depending on the size of the HTML.

image

So I propose that the HTML be cut off at the point where the title or description tag ends, so that we only process the necessary HTML.
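A minimal sketch of the proposed cut-off, written in Python purely for illustration (the project's actual language, method names, and regex patterns are not shown in this issue, so every name below is hypothetical). It assumes the title and the `<meta name="description">` tag appear near the top of the document, looks for them only inside a capped window, and cuts the HTML right after whichever of the two ends later:

```python
import re

# Hypothetical cap: never scan more than this many characters,
# even if neither tag is found.
MAX_SCAN_CHARS = 64 * 1024

# Simplified patterns for illustration only; real-world HTML can defeat them
# (for example a ">" inside an attribute value).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
META_DESC_RE = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']description[\"'][^>]*>", re.IGNORECASE
)
CONTENT_RE = re.compile(r"\bcontent\s*=\s*([\"'])(.*?)\1", re.IGNORECASE | re.DOTALL)


def truncate_html(html, max_chars=MAX_SCAN_CHARS):
    """Return the prefix of html ending where the later of the title or
    description tag ends; fall back to the capped window if neither is found."""
    window = html[:max_chars]
    cut = 0
    for pattern in (TITLE_RE, META_DESC_RE):
        match = pattern.search(window)
        if match:
            cut = max(cut, match.end())
    return window[:cut] if cut else window


def extract_title_and_description(html):
    """Run the regexes on the truncated prefix only, not on the whole page."""
    head = truncate_html(html)
    title_match = TITLE_RE.search(head)
    title = title_match.group(1).strip() if title_match else None

    description = None
    meta_match = META_DESC_RE.search(head)
    if meta_match:
        content_match = CONTENT_RE.search(meta_match.group(0))
        if content_match:
            description = content_match.group(2).strip()
    return title, description
```

With this shape, the methods downstream only ever receive the short prefix, so a multi-megabyte body is never passed around or matched against. The cap also bounds the work for pages that have no title or description at all. Truncating while the response is still being downloaded would save even more, but that depends on the HTTP client in use and is not sketched here.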

Author
Owner

More screenshots to showcase the effect.

image image
owner added the enhancement label 2024-12-14 15:43:00 +00:00
owner added this to the Minimize GC Pressure milestone 2024-12-14 15:43:03 +00:00
owner self-assigned this 2024-12-14 15:43:07 +00:00
owner added reference OptimizeTitleAndDescriptionExtraction 2024-12-23 13:10:15 +00:00
owner closed this issue 2024-12-23 13:10:17 +00:00
Reference: owner/RSE#29