JC's ABC search bot


This directory contains the search bot for JC's Tune Finder. The bot is a program that looks for web sites with files in ABC music notation. When it finds one, it extracts assorted interesting musical information from the file and adds it to the Tune Finder's database. Meanwhile, people are connecting and asking about tunes, downloading them, or requesting them in any of the several output formats that the Finder supplies.

Some of the interesting things here:

abcbot
is the search bot itself. It's a Perl program that uses the URLs file as a list of starting points and places to avoid. abcbot also expects a list of hosts on the command line, and it scans only those sites. It keeps data for each host in a file in the hst/ directory.
ABCsearch
queries several big search sites for "ABC music notation". This is the best set of keywords that we've found to locate ABC sites. About 20% of our sites were found this way. It writes information about each site it finds into a file in the add/ directory.
Hosts2html
rebuilds the index files in ndx/. These are the files used by the Tune Finder. The index files are rebuilt after a search run. They may also be rebuilt at any time if there are problems, or if we've done a special scan of one or more sites.
Makefile
is a conventional Unix makefile. We use it to drive the search process, which is started by hand once a month or so.
NewURL
takes a URL and does a scan for ABC files with the URL as a starting point. If the scan is successful, the URL should usually be added to URLs.
Summary.txt
is a file showing statistics from the most recent scans, one line per host. This usually includes "false positive" hosts from the big search sites, so we can see how successful ABCsearch was.
UpdateHosts
is a script that drives the search run. We normally call abcbot for a single host. UpdateHosts has the task of keeping track of which hosts have been scanned, and starting up abcbot for each host. It uses both the hst/ and add/ directories to decide which hosts to scan.
cgi/abc/
is a copy of the Tune Finder's CGI directory. In the copy, nothing is executable, so you can look at the code.
findrobotstxt
is a little program to list the URLs of all robots.txt files on our ABC hosts. It may be slow, because it queries every host and some of them don't respond, so run it in the background.
hst/
contains one file per host, listing all the interesting files and ABC tunes on that host.
lck/
contains lockfiles for hosts, so that we don't get two programs trying to scan the same host.
ndx/
is our copy of the index files used by the Tune Finder. The Tune Finder actually works out of ../ndx/, so the files are linked there after we've verified them.
sh/
contains assorted small scripts that are useful here.
stat/
contains past statistics files of various types. It should probably be purged occasionally.
webcat
is a program that does downloads. This is a separate program because of an intractable problem: TCP connections to web servers sometimes block permanently in the connect() call. This happens on several operating systems, and there seems to be no solution. So abcbot calls webcat as a subprocess. If webcat doesn't respond or exit within the timeout period, we kill it and go on. It's a bit of a waste of CPU time, but it solves the problem.
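
The webcat approach described above (run the download in a child process and kill it if it hangs) can be sketched roughly as follows. This is an illustration only, not the bot's actual code: abcbot and webcat are Perl programs, while this sketch uses Python; the function name `fetch_with_timeout` is hypothetical; and because webcat's command-line arguments are not documented here, the caller supplies the whole command as a list.

```python
import subprocess
import sys


def fetch_with_timeout(cmd, timeout_s):
    """Run a downloader command as a child process.

    Returns the child's standard output as bytes, or None if the child
    did not finish within timeout_s seconds or exited with an error.
    (Hypothetical helper; the real abcbot/webcat interface may differ.)
    """
    try:
        # subprocess.run waits at most timeout_s seconds. If the child is
        # still running then, it is killed and TimeoutExpired is raised.
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The download hung (for example in connect()); give up on it.
        return None
    if result.returncode != 0:
        # The downloader itself reported a failure.
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a hung downloader: a child that sleeps far longer
    # than the timeout we allow it.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    if fetch_with_timeout(hung, 1.0) is None:
        print("download timed out; moving on to the next URL")
```

The point of the design is that the parent never makes the blocking call itself, so a connection that never returns costs only one killed child process rather than a stalled scan.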

This is an ABC directory listing. Any ABC file encountered is expanded to show its tunes. Clicking on a file name simply returns the file as usual. Clicking on the other links converts the file and sends the results. Links next to a tune name return only that one tune.
Sel downloads multiple tunes (if Sel above is checked).
Get downloads entire file from remote system.
TXT returns selected ABC tune as type "text/plain".
ABC returns selected ABC tune as type "text/vnd.abc".
PS returns tune in PostScript format.
EPS returns tune in Encapsulated PostScript format.
PDF returns tune in Portable Document Format.
GIF returns tune in Graphics Interchange Format.
PNG returns tune in Portable Network Graphics format.
MIDI returns tune in Musical Instrument Digital Interface format.
Get --- -- --- --- --- --- ---- ABCbot
Get --- -- --- --- --- --- ---- ABCdirs
Get --- -- --- --- --- --- ---- ABChosts
Get --- -- --- --- --- --- ---- ABCsearch
Get --- -- --- --- --- --- ---- ABCsearch-trillian
Get --- -- --- --- --- --- ---- ABCsummary
ABC TXT PS --- PDF GIF PNG ---- Angle_Cain.abc
ABC TXT PS EPS PDF GIF PNG MIDI 1: Angle Cain Tune
Get --- -- --- --- --- --- ---- Avoid
Get --- -- --- --- --- --- ---- Backup.pm
Get --- -- --- --- --- --- ---- BadSite
Get --- -- --- --- --- --- ---- BadURLs
ABC TXT PS --- PDF GIF PNG ---- Bourree_dEgletons_F.abc
ABC TXT PS EPS PDF GIF PNG MIDI 1: Bourrée d'Egletons
Get --- -- --- --- --- --- ---- CGI_Lite.pm
Get --- -- --- --- --- --- ---- CTitle.pm
Get --- -- --- --- --- --- ---- Changes
Get --- -- --- --- --- --- ---- CheckNulls
Get --- -- --- --- --- --- ---- Cleanup
Get --- -- --- --- --- --- ---- DT.pm
Get --- -- --- --- --- --- ---- DelNulls
Get --- -- --- --- --- --- ---- Download
ABC TXT PS --- PDF GIF PNG ---- ErikaDamianisBirthdayReelThang_D.abc
ABC TXT PS EPS PDF GIF PNG MIDI 25: Erika Damiani's Birthday Reel thang
Get --- -- --- --- --- --- ---- FranksSites.msg
Get --- -- --- --- --- --- ---- FranksURLs
ABC TXT PS --- PDF GIF PNG ---- Giant_steps_1.abc
ABC TXT PS EPS PDF GIF PNG MIDI 1: Giant steps
Get --- -- --- --- --- --- ---- GraphLog
Get --- -- --- --- --- --- ---- H
Get --- -- --- --- --- --- ---- HTMLdir.pm
Get --- -- --- --- --- --- ---- HTMLenc.pm
Get --- -- --- --- --- --- ---- HTTPcon.pm
Get --- -- --- --- --- --- ---- Hangups
Get --- -- --- --- --- --- ---- HolyroodHouse.abccat
Get --- -- --- --- --- --- ---- HolyroodHouse.abcext
Get --- -- --- --- --- --- ---- HostInit
Get --- -- --- --- --- --- ---- HostStatDiffs
Get --- -- --- --- --- --- ---- HostStatDiffs-20060911
Get --- -- --- --- --- --- ---- HostStatDiffs-20061009
Get --- -- --- --- --- --- ---- HostStatDiffs-20061121
Get --- -- --- --- --- --- ---- HostStatDiffs-20061216
Get --- -- --- --- --- --- ---- HostStatDiffs-20070118
Get --- -- --- --- --- --- ---- HostStatDiffs-20070219
Get --- -- --- --- --- --- ---- HostStatDiffs-20070317
Get --- -- --- --- --- --- ---- HostStatDiffs-20070416
Get --- -- --- --- --- --- ---- HostStatDiffs-20070512
Get --- -- --- --- --- --- ---- HostStatDiffs-20070611
Get --- -- --- --- --- --- ---- HostStatDiffs-20070711
Get --- -- --- --- --- --- ---- HostStatDiffs-20070810
Get --- -- --- --- --- --- ---- HostStatDiffs-20070918
Get --- -- --- --- --- --- ---- HostStatDiffs-20071012
Get --- -- --- --- --- --- ---- HostStatDiffs-20071111
Get --- -- --- --- --- --- ---- HostStatDiffs-20071210
Get --- -- --- --- --- --- ---- HostStatDiffs-20080111
Get --- -- --- --- --- --- ---- HostStatDiffs-20080226
Get --- -- --- --- --- --- ---- HostStatDiffs-20080323
Get --- -- --- --- --- --- ---- HostStatDiffs-20080410
Get --- -- --- --- --- --- ---- HostStatDiffs-20080510
Get --- -- --- --- --- --- ---- HostStatDiffs-20080511
Get --- -- --- --- --- --- ---- HostStats
Get --- -- --- --- --- --- ---- HostStatsData-20060713
Get --- -- --- --- --- --- ---- HostStatsData-20060812
Get --- -- --- --- --- --- ---- HostStatsData-20060911
Get --- -- --- --- --- --- ---- HostStatsData-20061009
Get --- -- --- --- --- --- ---- HostStatsData-20061121
Get --- -- --- --- --- --- ---- HostStatsData-20061216
Get --- -- --- --- --- --- ---- HostStatsData-20070118
Get --- -- --- --- --- --- ---- HostStatsData-20070220
Get --- -- --- --- --- --- ---- HostStatsData-20070317
Get --- -- --- --- --- --- ---- HostStatsData-20070416
Get --- -- --- --- --- --- ---- HostStatsData-20070512
Get --- -- --- --- --- --- ---- HostStatsData-20070611
Get --- -- --- --- --- --- ---- HostStatsData-20070711
Get --- -- --- --- --- --- ---- HostStatsData-20070810
Get --- -- --- --- --- --- ---- HostStatsData-20070918
Get --- -- --- --- --- --- ---- HostStatsData-20071012
Get --- -- --- --- --- --- ---- HostStatsData-20071111
Get --- -- --- --- --- --- ---- HostStatsData-20071210
Get --- -- --- --- --- --- ---- HostStatsData-20080111
Get --- -- --- --- --- --- ---- HostStatsData-20080226
Get --- -- --- --- --- --- ---- HostStatsData-20080323
Get --- -- --- --- --- --- ---- HostStatsData-20080410
Get --- -- --- --- --- --- ---- HostStatsData-20080510
Get --- -- --- --- --- --- ---- HostStatsData-20080511
Get --- -- --- --- --- --- ---- Hosts2html
Get --- -- --- --- --- --- ---- InitHosts
Get --- -- --- --- --- --- ---- KillSite
Get --- -- --- --- --- --- ---- KnownProblems
Get --- -- --- --- --- --- ---- Lfind
Get --- -- --- --- --- --- ---- ListSplit
Get --- -- --- --- --- --- ---- Ln
Get --- -- --- --- --- --- ---- Makefile
Get --- -- --- --- --- --- ---- NTT
Get --- -- --- --- --- --- ---- NewURL
Get --- -- --- --- --- --- ---- NewURLs
Get --- -- --- --- --- --- ---- NewUpdateHost
Get --- -- --- --- --- --- ---- Next
Get --- -- --- --- --- --- ---- Previous
Get --- -- --- --- --- --- ---- SPO.htm
Get --- -- --- --- --- --- ---- Scan
Get --- -- --- --- --- --- ---- Scan2004
Get --- -- --- --- --- --- ---- Scan2005
Get --- -- --- --- --- --- ---- Scan2006
Get --- -- --- --- --- --- ---- Scan2007
Get --- -- --- --- --- --- ---- SlowSites
Get --- -- --- --- --- --- ---- Summary.txt
Get --- -- --- --- --- --- ---- SummaryDiffs
Get --- -- --- --- --- --- ---- TODO
Get --- -- --- --- --- --- ---- TTcount
Get --- -- --- --- --- --- ---- TestSearch1
Get --- -- --- --- --- --- ---- ToDo
Get --- -- --- --- --- --- ---- TuneBotRun
Get --- -- --- --- --- --- ---- TuneListForm.html
Get --- -- --- --- --- --- ---- TuneSearch
Get --- -- --- --- --- --- ---- TuneSearch1
Get --- -- --- --- --- --- ---- TuneURLs
Get --- -- --- --- --- --- ---- U
Get --- -- --- --- --- --- ---- UP
Get --- -- --- --- --- --- ---- URL
Get --- -- --- --- --- --- ---- URLdata.pm
Get --- -- --- --- --- --- ---- URLhref.pm
Get --- -- --- --- --- --- ---- URLopen.pm
Get --- -- --- --- --- --- ---- URLs
Get --- -- --- --- --- --- ---- URLtrim.pm
Get --- -- --- --- --- --- ---- Uadd
Get --- -- --- --- --- --- ---- Uhst
Get --- -- --- --- --- --- ---- UpdateAllHosts
Get --- -- --- --- --- --- ---- UpdateHost
Get --- -- --- --- --- --- ---- UpdateHosts
Get --- -- --- --- --- --- ---- V.pm
Get --- -- --- --- --- --- ---- Vopt.pm
Get --- -- --- --- --- --- ---- aa em
Get --- -- --- --- --- --- ---- abc-dir.html
Get --- -- --- --- --- --- ---- abcCode.pm
Get --- -- --- --- --- --- ---- abcCode.pm-oban
Get --- -- --- --- --- --- ---- abcCode.pm-trillian
Get --- -- --- --- --- --- ---- abcbot
Get --- -- --- --- --- --- ---- abccat
Get --- -- --- --- --- --- ---- abcextract
Get --- -- --- --- --- --- ---- abcinhtml
Get --- -- --- --- --- --- ---- add/
Get --- -- --- --- --- --- ---- bad/
Get --- -- --- --- --- --- ---- badfiles
Get --- -- --- --- --- --- ---- badsite
Get --- -- --- --- --- --- ---- badsites
Get --- -- --- --- --- --- ---- cache22333.data
Get --- -- --- --- --- --- ---- cache8837.data
Get --- -- --- --- --- --- ---- cfg/
Get --- -- --- --- --- --- ---- cfghost.pm
Get --- -- --- --- --- --- ---- cfgload.pm
Get --- -- --- --- --- --- ---- cgi/
Get --- -- --- --- --- --- ---- cgilocal.pm
Get --- -- --- --- --- --- ---- cmd/
Get --- -- --- --- --- --- ---- del/
Get --- -- --- --- --- --- ---- ffind
Get --- -- --- --- --- --- ---- findrobotstxt
Get --- -- --- --- --- --- ---- gmtime
Get --- -- --- --- --- --- ---- grepbot
Get --- -- --- --- --- --- ---- greplogs
Get --- -- --- --- --- --- ---- greptunes
Get --- -- --- --- --- --- ---- gsl
Get --- -- --- --- --- --- ---- gt
Get --- -- --- --- --- --- ---- hosts-20070611
Get --- -- --- --- --- --- ---- hst/
Get --- -- --- --- --- --- ---- hstadrs
Get --- -- --- --- --- --- ---- hstadrs.txt
Get --- -- --- --- --- --- ---- hstat
Get --- -- --- --- --- --- ---- ht
Get --- -- --- --- --- --- ---- htmlfiles
Get --- -- --- --- --- --- ---- htmlsubs.pm
Get --- -- --- --- --- --- ---- htmltext
Get --- -- --- --- --- --- ---- http/
Get --- -- --- --- --- --- ---- http_size
Get --- -- --- --- --- --- ---- httpcat
Get --- -- --- --- --- --- ---- httpget
Get --- -- --- --- --- --- ---- httpnew
Get --- -- --- --- --- --- ---- httptest
Get --- -- --- --- --- --- ---- httpurge
Get --- -- --- --- --- --- ---- hzcat
Get --- -- --- --- --- --- ---- kendy-cgilocal.pm
Get --- -- --- --- --- --- ---- lck/
Get --- -- --- --- --- --- ---- listbadsites
Get --- -- --- --- --- --- ---- log/
Get --- -- --- --- --- --- ---- minya-cgilocal.pm
Get --- -- --- --- --- --- ---- namesubs.pm
Get --- -- --- --- --- --- ---- ndx/
Get --- -- --- --- --- --- ---- new/
Get --- -- --- --- --- --- ---- newsites
Get --- -- --- --- --- --- ---- nul/
Get --- -- --- --- --- --- ---- obs/
Get --- -- --- --- --- --- ---- old/
Get --- -- --- --- --- --- ---- outtune.pm
Get --- -- --- --- --- --- ---- pf
Get --- -- --- --- --- --- ---- rename
Get --- -- --- --- --- --- ---- renamecachefiles
Get --- -- --- --- --- --- ---- rmTT
Get --- -- --- --- --- --- ---- rmwc
Get --- -- --- --- --- --- ---- rsdir
Get --- -- --- --- --- --- ---- rsdir.kendy
Get --- -- --- --- --- --- ---- save/
Get --- -- --- --- --- --- ---- scandata
Get --- -- --- --- --- --- ---- sh/
Get --- -- --- --- --- --- ---- showbadsites
Get --- -- --- --- --- --- ---- stat/
Get --- -- --- --- --- --- ---- tags
Get --- -- --- --- --- --- ---- tmp/
Get --- -- --- --- --- --- ---- todel
Get --- -- --- --- --- --- ---- toold
Get --- -- --- --- --- --- ---- trabc
Get --- -- --- --- --- --- ---- transpose_abc.pl
Get --- -- --- --- --- --- ---- trillian-cgilocal.pm
Get --- -- --- --- --- --- ---- ts
Get --- -- --- --- --- --- ---- ut
Get --- -- --- --- --- --- ---- utf8test
Get --- -- --- --- --- --- ---- w3cat
Get --- -- --- --- --- --- ---- webcat
Get --- -- --- --- --- --- ---- zapdir