#1
Baseband Member
Join Date: Feb 2010
Posts: 22
Aloha, I am attempting to make a small web page where I can enter a list of URLs and have it pull certain data from those pages and compile it all into one list.
In more detail... I am an avid reader of fanfiction and have a list of about 500 books that I have or need to read. I would like to take that list of URLs, have it grab the summary information about each of those stories, and compile a list for me. I am not asking anyone to do this for me but, since I only know HTML, it would be great if someone was willing to point me in the right direction and give me some helpful information. It would help if I knew what "language" would be the easiest to do this in... it would be really great if someone would give me an idea of the process that I would need to code to do this... and it would be nice to have a few key terms to research so that I can get this done without having to learn ALL the functions of said "language".
Example: I want to take a list of links like this...
Code:
http://www.fanfiction.net/s/5185204/1/Harry_Potter_and_the_Untold_Story
http://www.fanfiction.net/s/5806561/1/Give_Yourself_Away
etc.
Code:
http://www.fanfiction.net/book/Harry_Potter/
Herby
#2
Baseband Member
Join Date: Feb 2010
Posts: 22
Thanks for the tips on things to research, people. As it turns out, iMacros seems like it might work best.
#3
In Runtime
You have awoken me from my posting slumber, so thanks.
What you are asking about could be extremely complicated, depending on how robust you want the solution to be. The core functionality you are looking for is referred to as "screen scraping". There is a right way and a wrong way to do this (the wrong way will, in some cases, get the IP address you run it from blocked). Essentially, what you are asking about is a rudimentary web crawler. Here is my advice to you:
1.) Don't make the script that searches the web pages a web page itself (if that makes sense).
2.) Instead, make this part of your "project" a program (executable) that runs on a computer, goes out to those websites, grabs the info, and updates a database. Java is free, and there are many powerful IDEs that will help you get started quickly. Other possible alternatives include C++, .NET (IDE is not free), PHP, and Ruby. There are others, but these are the languages I would recommend, not necessarily in any order.
3.) Once you have the database populated, you SHOULD use a web interface to display the findings that have been placed in the database.
As for HOW to do it: that is far beyond the scope of my post, but here are some links to get you pointed in the right direction.
http://en.wikipedia.org/wiki/Data_sc...creen_scraping
http://www.4guysfromrolla.com/webtech/070601-1.shtml <-- Love this one.
http://devcity.net/Articles/48/1/screen_scrape.aspx
http://www.ibm.com/developerworks/xm...-jtp03225.html
Keep in mind that screen scraping is NOT a basic task to perform. Here is a list of already written screen scrapers that you could potentially use to populate your database.
http://www.manageability.org/blog/st...ritten-in-java
Hope this helps.
!!! Please keep in mind that, in some cases, this can be considered an "unfavorable" activity at best when it is done incorrectly or for malicious reasons.
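To make step 2 a little more concrete, here is a rough sketch in Python (chosen only for brevity; it is not one of the languages recommended above) of a standalone program that fetches each URL, pulls out a summary, and stores it in a SQLite database. Everything specific in it is an assumption for illustration: the "summary" class name, the "stories" table, the User-Agent string, and the five-second pause are all hypothetical, and the real pages would need to be inspected to find where the summary actually lives.

```python
import sqlite3
import time
import urllib.request
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collects the <title> text and the text of the first element whose
    class list contains "summary" (a hypothetical class name)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.summary = ""
        self._in_title = False
        self._summary_tag = None  # tag name of the currently open summary element
        self._depth = 0           # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif self._summary_tag is None and not self.summary:
            classes = (dict(attrs).get("class") or "").split()
            if "summary" in classes:
                self._summary_tag = tag
                self._depth = 1
        elif tag == self._summary_tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == self._summary_tag:
            self._depth -= 1
            if self._depth == 0:
                self._summary_tag = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._summary_tag is not None:
            self.summary += data


def extract(html):
    """Return (title, summary) with whitespace normalized."""
    parser = SummaryParser()
    parser.feed(html)
    return parser.title.strip(), " ".join(parser.summary.split())


def save(conn, url, title, summary):
    """Insert or update one story row in the (hypothetical) stories table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories "
        "(url TEXT PRIMARY KEY, title TEXT, summary TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO stories VALUES (?, ?, ?)", (url, title, summary)
    )
    conn.commit()


def crawl(urls, conn, delay_seconds=5.0):
    """Fetch each URL in turn, pausing between requests."""
    for url in urls:
        request = urllib.request.Request(
            url, headers={"User-Agent": "personal-reading-list/0.1"}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            html = response.read().decode("utf-8", errors="replace")
        save(conn, url, *extract(html))
        time.sleep(delay_seconds)  # wait before the next request
```

Calling `crawl(list_of_urls, sqlite3.connect("stories.db"))` would populate the database; the web interface from step 3 would then only read from that file and never touch the remote site.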
__________________
Official Self-proclaimed glorified excessive (insert additional adjectives here) post editor/modifier. Edit = Best feature ever http://www.twitter.com/xDaevax
#4
Baseband Member
Join Date: Feb 2010
Posts: 22
I have been using iRobot for my screen scraping, exporting that to an XML file, then using XSL to convert all that information into a readable format. Once that is done, I copy the result into MS Word and print it to a PDF file.
It may not be the fastest way of doing this, but it really is not that bad: iRobot was a free and easy-to-learn program, and XSL took me about 10 minutes to learn using the tutorials at W3Schools. This has worked so well for me that my list has gone from 500+ books to 50,000+ books, though I am looking for a nice way to save them all as PDFs right now...lol. There is a program called Ficfiction downloader that works for that, but you have to do them one at a time, and I cannot find a non-buggy macro program to automate it; plus, you cannot use your computer while a macro is running...lol. All in all, this is a lot of work just so that I can read them with my mp3 player. This is all I had to code...
Code:
<xsl:for-each select="books/Book">
  <font color="#0000FF"><xsl:value-of select="Story"/></font> - <font color="#FF0000"><xsl:value-of select="Author"/></font><br />
  <xsl:value-of select="Summary"/><br />
  <font color="#CCCCCC"><xsl:value-of select="Crossover"/> - <xsl:value-of select="Genre"/> - <xsl:value-of select="Ships"/><br />
  <!-- Reviews originally selected "Rated" a second time; "Reviews" is an assumed element name. -->
  Chapters: <xsl:value-of select="Chapters"/> Word Count: <xsl:value-of select="Words"/> Rating: <xsl:value-of select="Rated"/> Reviews: <xsl:value-of select="Reviews"/><br />
  Updated: <xsl:value-of select="Updated"/> Published: <xsl:value-of select="Published"/></font><br />
  <br />
</xsl:for-each>
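For anyone who wants to skip the Word step, the same loop over the exported XML can be sketched in Python and written straight to a text file. This is only an illustration: the XML layout (a books root with Book children) is inferred from the select paths in the fragment above, the sample data is made up, and only a handful of the fields are rendered.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the layout the XSL fragment appears to expect.
SAMPLE_XML = """<books>
  <Book>
    <Story>Example Story</Story>
    <Author>Example Author</Author>
    <Summary>A placeholder summary.</Summary>
    <Genre>Adventure</Genre>
    <Chapters>12</Chapters>
    <Words>34567</Words>
    <Rated>T</Rated>
  </Book>
</books>"""


def field(book, name):
    """Return the stripped text of a child element, or "" if it is absent."""
    return (book.findtext(name) or "").strip()


def render_books(xml_text):
    """Return one plain-text entry per Book element."""
    root = ET.fromstring(xml_text)  # root is the <books> element
    entries = []
    for book in root.findall("Book"):
        entries.append(
            f"{field(book, 'Story')} - {field(book, 'Author')}\n"
            f"{field(book, 'Summary')}\n"
            f"Chapters: {field(book, 'Chapters')} "
            f"Word Count: {field(book, 'Words')} "
            f"Rating: {field(book, 'Rated')}"
        )
    return entries


if __name__ == "__main__":
    print("\n\n".join(render_books(SAMPLE_XML)))
```

Missing elements simply render as empty strings, which mirrors how `xsl:value-of` behaves when a select path matches nothing.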
#5
Site Team
Join Date: Jul 2009
Posts: 2,629
Perhaps jumping in a bit late, but:
Quote:
Another thing to note is that screen scraping is notoriously unreliable: fine if you just want to grab things once, but don't expect to, say, run the program again next year and have the same books turn up. If the HTML changes even slightly, depending on how you've coded the thing, it could throw everything off enough to make the information unusable...
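One way to soften this is to have the scraper fail loudly rather than quietly store empty data. The sketch below is only an illustration in Python: the required field names are borrowed from the XSL earlier in the thread, and the `LayoutChangedError` name and the dictionary record format are hypothetical.

```python
# Fields every scraped record is expected to carry (names are assumptions).
REQUIRED_FIELDS = ("Story", "Author", "Summary")


class LayoutChangedError(Exception):
    """Raised when a scraped record is missing fields it should always have."""


def check_record(record, url):
    """Return the record unchanged, or raise if any required field is empty."""
    missing = [
        name for name in REQUIRED_FIELDS
        if not (record.get(name) or "").strip()
    ]
    if missing:
        raise LayoutChangedError(
            f"{url}: missing {', '.join(missing)}; the page layout may have changed"
        )
    return record
```

Running every record through a check like this before saving it means a changed page layout shows up as an error on the first affected URL instead of as a list full of blanks.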
__________________
Save the whales, feed the hungry, free the mallocs.
#6
Baseband Member
Join Date: Feb 2010
Posts: 22
Yeah, that is very true. I found that out the hard way already when the site I was using changed the HTML for the search results. Though with iRobot it only took me about 10 minutes to find the change and fix it.