Optimize The Title And Description Extraction Method #29


owner · 2024-12-14T15:41:27Z

owner commented

2024-12-14 15:41:27 +00:00

Right now, to extract the title and description, the whole HTML document is downloaded and passed around to the various methods. Worse, the extraction itself works by running the WHOLE HTML through regex. That's not great, as the screenshot below shows: when the scanner starts extracting the title and description, the CPU gets quite toasty, and extraction can take a while depending on the size of the HTML.

So I propose that the HTML be cut off at the point where either the title or description tag ends, so that we only process the HTML we actually need.
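A minimal sketch of the idea in Python, purely for illustration. The project's actual language, method names, and regular expressions are not shown in this issue, so every name and pattern below is an assumption. Rather than cutting at the exact end of the title or description tag, this variant bounds the input at `</head>` or at a fixed size cap, whichever comes first, and only then runs the extraction patterns on that prefix.

```python
import re

# Hypothetical patterns; the scanner's real expressions are not part of this issue.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
DESCRIPTION_RE = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']description[\"'][^>]*>", re.IGNORECASE
)
CONTENT_RE = re.compile(r"\bcontent\s*=\s*([\"'])(.*?)\1", re.IGNORECASE | re.DOTALL)
HEAD_END_RE = re.compile(r"</head\s*>", re.IGNORECASE)

MAX_SCAN = 64 * 1024  # assumed cap on how many characters are ever examined


def extract_title_and_description(html, max_scan=MAX_SCAN):
    """Return (title, description), looking only at the start of the document."""
    # Never look past the cap, no matter how large the page is.
    window = html[:max_scan]

    # If the head section closes inside the window, drop everything after it.
    head_end = HEAD_END_RE.search(window)
    if head_end:
        window = window[: head_end.end()]

    # The extraction patterns now run on the short prefix only.
    title = TITLE_RE.search(window)
    meta = DESCRIPTION_RE.search(window)
    content = CONTENT_RE.search(meta.group(0)) if meta else None

    return (
        title.group(1).strip() if title else None,
        content.group(2).strip() if content else None,
    )
```

The trade-off is that a page which places its title or description after `</head>`, or beyond the cap, yields `None` for that field. If such pages matter, the cap could be raised or a fallback pass over the full document could be added.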


image.png (17 KiB)

owner commented

2024-12-14 15:42:29 +00:00

More screenshots to showcase the effect.


image.png (20 KiB)

image.png (15 KiB)

owner added the enhancement label 2024-12-14 15:43:00 +00:00

owner added this to the Minimize GC Pressure milestone 2024-12-14 15:43:03 +00:00

owner self-assigned this 2024-12-14 15:43:07 +00:00

owner added reference OptimizeTitleAndDescriptionExtraction 2024-12-23 13:10:15 +00:00

owner closed this issue