This paper introduces FoCUS (Forum Crawler Under Supervision), a supervised web-scale forum crawler designed to gather relevant forum content with minimal overhead. Although forums differ in layout and run on many different software packages, they share similar implicit navigation paths: specific URL types that connect entry pages to thread pages. FoCUS therefore reduces the forum crawling problem to recognizing these URL types. It learns accurate regular expression patterns of the navigation paths from an automatically created training set built from the results of weak page type classifiers. The page type classifiers are robust: trained with as few as five annotated forums, they can be applied to a large set of unseen forums. In tests, FoCUS achieved over 98% effectiveness and 97% coverage across forums using over 150 different software packages.
Jiang, Jingtian, Nenghai Yu, and Chin-Yew Lin. "FoCUS: learning to crawl web forums." Proceedings of the 21st International Conference on World Wide Web. 2012.
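
To make the idea of recognizing URL types concrete, the sketch below classifies links with regular expressions and keeps only those that lie on a navigation path toward thread pages. It is a minimal illustration, not the FoCUS implementation: the three type names, the URL layout, and the patterns themselves are hand-written assumptions for a hypothetical forum, whereas FoCUS learns such patterns from automatically created training data.

```python
import re

# Hypothetical patterns for one imaginary forum. In FoCUS the patterns are
# learned; these are hand-written assumptions for illustration only.
URL_PATTERNS = {
    "page_flipping": re.compile(r"^/(?:forum|thread)/\d+/page/\d+/?$"),
    "index": re.compile(r"^/forum/\d+/?$"),
    "thread": re.compile(r"^/thread/\d+/?$"),
}


def classify_url(path):
    """Return the name of the first URL type whose pattern matches, else None."""
    for url_type, pattern in URL_PATTERNS.items():
        if pattern.match(path):
            return url_type
    return None


def select_links(paths):
    """Keep only links that match a known URL type, paired with that type."""
    selected = []
    for path in paths:
        url_type = classify_url(path)
        if url_type is not None:
            selected.append((path, url_type))
    return selected


if __name__ == "__main__":
    links = ["/forum/12", "/thread/345", "/thread/345/page/2", "/member/7"]
    for path, url_type in select_links(links):
        print(path, "->", url_type)
```

A crawler built this way would follow only the selected links and skip the rest, such as the member profile link in the usage example, which is how recognizing URL types keeps crawling overhead low.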