compiling data from databases...

herbycanopy

Aloha, I am attempting to make a small web page where I can enter a list of URLs and it will take certain data about those pages and compile it all into a list.

In more detail...

I am an avid reader of fanfiction and have a list of about 500 books that I have read or need to read. I am looking to be able to take that list of URLs and have it grab the summary information about each of those stories and compile a list for me.

I am not asking anyone to do this for me, but since I only know HTML, it would be great if someone was willing to point me in the right direction and give me some helpful information. It would help if I knew what "language" would be the easiest to do this in. It would be really great if someone would give me an idea of the process the code would need to follow, and it would be nice to have a few key terms to research, so I can get this done without having to learn ALL the functions of said "language".

Example:
I want to take links like this...
Code:
http://www.fanfiction.net/s/5185204/1/Harry_Potter_and_the_Untold_Story
http://www.fanfiction.net/s/5806561/1/Give_Yourself_Away
etc.

And have it print out a list with things like story name, author, word count, rating, etc. Here is an example of what it would look like...
Code:
http://www.fanfiction.net/book/Harry_Potter/

Any help and ideas would be great...
Herby
 
Thanks for the tips on things to research, people. As it turns out, iMacros seems like it might work best.
 
You have awoken me from my posting slumber, so thanks.


What you are asking about could be extremely complicated depending on how robust you want the solution to be.

The core functionality you are looking for is referred to as "Screen-scraping". There is a right way and a wrong way to do this (the wrong way will get the IP address you run it from blocked in some cases).
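The "right way" usually starts with honoring the site's robots.txt and pacing your requests. As a minimal sketch (the paths and rules here are invented for illustration, not fanfiction.net's actual robots.txt), Python's standard library can check whether a crawler is allowed to fetch a page:

```python
# Sketch: checking robots.txt rules before scraping. A real crawler would
# download the site's actual /robots.txt; here a hypothetical file is parsed
# inline so the example is self-contained.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Story pages are allowed; the /private/ area is not.
print(rp.can_fetch("*", "http://www.example.com/s/5185204/1/"))  # True
print(rp.can_fetch("*", "http://www.example.com/private/x"))     # False
```

Respecting a Crawl-delay (sleeping a few seconds between requests) is what keeps your IP address off the blocked list.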

Essentially what you are asking about is a rudimentary web crawler.

Here is my advice to you:

1.) Don't make the script that searches the web pages itself a web page (if that makes sense).
2.) Instead, make this part of your "project" a program (executable) that runs on a computer, goes out to those websites, grabs the info, and updates a database (Java is free and there are many powerful IDEs that will help you get started quickly).
Other possible alternatives include C++, .NET (IDE is not free), PHP, and Ruby. There are others, but these are the languages I would recommend, not necessarily in any order.
3.) Once you have the database populated, you SHOULD use a web interface to display the findings that have been placed in the database.
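To make steps 2 and 3 concrete, here is a minimal sketch in Python (rather than Java, just to keep it short): extract a few fields from a page's HTML and store them in a database. The sample markup and regex patterns are invented for illustration; the real site's HTML is different and will change over time.

```python
# Sketch of the scrape-then-store step: pull story metadata out of HTML and
# insert it into a SQLite database. The HTML below is a made-up stand-in for
# a real story page.
import re
import sqlite3

sample_html = """
<b>Harry Potter and the Untold Story</b> by <a href="/u/123/">SomeAuthor</a>
<span>Words: 85,000 - Rated: T</span>
"""

def extract(html):
    # Fragile by design: these patterns only match the sample markup above.
    title = re.search(r"<b>(.*?)</b>", html).group(1)
    author = re.search(r'<a href="/u/\d+/">(.*?)</a>', html).group(1)
    words = re.search(r"Words: ([\d,]+)", html).group(1)
    return title, author, words

conn = sqlite3.connect(":memory:")  # use a file path instead to persist data
conn.execute("CREATE TABLE stories (title TEXT, author TEXT, words TEXT)")
conn.execute("INSERT INTO stories VALUES (?, ?, ?)", extract(sample_html))

for row in conn.execute("SELECT * FROM stories"):
    print(row)
```

The web interface from step 3 would then just query this database, never touching the source site directly.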


As for HOW to do it, that is far beyond the scope of my post, but here are some links to get you pointed in the right direction.
http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping

http://www.4guysfromrolla.com/webtech/070601-1.shtml <-- Love this one.

http://devcity.net/Articles/48/1/screen_scrape.aspx

http://www.ibm.com/developerworks/xml/library/j-jtp03225.html

Keep in mind Screen Scraping is NOT a basic task to perform.

Here is a list of already written screen-scrapers that you could potentially use to populate your database.

http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java

Hope this helps.

!!!
Please keep in mind that in some cases, this can be considered an "unfavorable" activity at best when it is done incorrectly or for malicious reasons.
 
I have been using iRobot for my screen scraping, exporting that to an XML file, then using XSL to convert all that information into a readable format. Once that is done, I copy it into MS Word, then print to a PDF file.

It may not be the fastest way of doing this, but it really is not that bad, because iRobot was a free and easy-to-learn program, and XSL took me about 10 minutes to learn using the tutorials at w3schools.

This has worked so well for me that my list has gone from 500+ books to 50,000+ books, though I am looking for a nice way to save them all as PDFs right now...lol. There is a program called Ficfiction downloader that works for that, but you have to do them one at a time, and I cannot find a non-buggy macro program; plus, you cannot use your computer while it is running...lol. All in all, this is a lot of work just so that I can read them on my mp3 player.

This is all I had to code...
Code:
<xsl:for-each select="books/Book">
    <!-- One block per story: title/author line, summary, then stats -->
    <font color="#0000FF"><xsl:value-of select="Story"/></font> - <font color="#FF0000"><xsl:value-of select="Author"/></font><br />
    <xsl:value-of select="Summary"/><br />
    <font color="#CCCCCC"><xsl:value-of select="Crossover"/> - <xsl:value-of select="Genre"/> - <xsl:value-of select="Ships"/><br />
    Chapters: <xsl:value-of select="Chapters"/> Word Count: <xsl:value-of select="Words"/> Rating: <xsl:value-of select="Rated"/> Reviews: <xsl:value-of select="Reviews"/><br />
    Updated: <xsl:value-of select="Updated"/> Published: <xsl:value-of select="Published"/></font><br />
    <br />
</xsl:for-each>
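For anyone who outgrows XSL, the same per-book transformation can be sketched in Python with only the standard library. The XML below is a guess at the file iRobot would export, using the element names from the XSL snippet above:

```python
# Sketch: the same "one formatted block per <Book>" transform, in Python.
# The XML here is invented to match the element names used by the XSL above.
import xml.etree.ElementTree as ET

xml_data = """
<books>
  <Book>
    <Story>Give Yourself Away</Story>
    <Author>SomeAuthor</Author>
    <Summary>An example summary.</Summary>
    <Words>12345</Words>
    <Rated>T</Rated>
  </Book>
</books>
"""

root = ET.fromstring(xml_data)
for book in root.findall("Book"):
    get = lambda tag: book.findtext(tag, default="")
    print(f"{get('Story')} - {get('Author')}")
    print(get("Summary"))
    print(f"Word Count: {get('Words')} Rating: {get('Rated')}")
```

From there it is a small step to write HTML or plain text straight to a file, skipping the Word-to-PDF detour.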

Though I will admit that I still do not really understand what XSL-FO is for...lol.
 
Perhaps jumping in a bit late, but:

.NET(IDE is not free)
There are free IDEs out there - the Express edition of Visual Studio is free, and there are third-party ones like SharpDevelop which are also free.

Another thing to note is that screen scraping is notoriously unreliable - fine if you just want to grab things once, but don't expect to, say, run the program again next year and have the same books turn up. If the HTML changes even slightly, depending on how you've coded the thing, it could throw everything off enough to make the information unusable...
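One way to soften that failure mode is to make the scraper fail loudly instead of silently producing garbage: return a sentinel when a field no longer matches, then check for it. The markup and pattern below are illustrative assumptions, not any real site's HTML:

```python
# Sketch: defensive extraction. When the site's markup changes, the pattern
# stops matching and you get None back instead of wrong data.
import re

def safe_extract(pattern, html):
    m = re.search(pattern, html)
    return m.group(1) if m else None

old_html = '<span class="words">85,000</span>'
new_html = '<div data-words="85000"></div>'  # imagine the site redesigned

pattern = r'<span class="words">([\d,]+)</span>'
print(safe_extract(pattern, old_html))  # 85,000
print(safe_extract(pattern, new_html))  # None -> time to update the pattern
```

Counting the Nones per run gives you an early warning that the site changed, rather than discovering it in the output later.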
 
Yeah, that is very true. I found that out the hard way already when the site I was using changed the HTML for its search results. Though with iRobot, it only took me about 10 minutes to find the change and fix it.
 