This howto explains how to get Nutch, Nutch-Gui, the Sun JDK and Tomcat 6.0.16 working on CentOS 5.x or 6.x while keeping the rest of the CentOS system functioning normally. CentOS 5.x currently ships with Tomcat 5.5. It does run, but the default install of that version produces persistent errors that are undocumented at this time. If you know that these errors have been addressed and can point to a fix, please use the contact form on this website to let us know. Any software installed by following this howto can be removed easily with either “rpm -e foo” or “rm -rf /opt/foo”, returning your system to its original state.
Applicable to CentOS Versions:
- CentOS 5.x
- CentOS 6.x
Requirements
- Root or sudo access with appropriate privileges to the system you intend to install on.
- A server preferably on a high-speed network.
- Sun JDK rpm.bin.
- Tomcat 6 rpm.
- Nutch 1.0 tar.gz.
- Nutch-Gui 0.2 tar.gz.
Doing the Work
- Install a few dependencies:
- Download & install the latest Sun JDK rpm.bin:
- Download & install Tomcat 6:
- Download & install Nutch 1.0:
- Edit /etc/profile:
- Configure Nutch to fetch URLs:
- Nutch “deepcrawler” script:
- Fetch URLs with Nutch via command line:
- Download & install Nutch-Gui 0.2:
    sudo yum install ant xml-commons-apis ant-trax
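A quick way to confirm that the tools this howto relies on are actually on your PATH is a small helper like the one below. This is only a sketch: `missing_commands` is a hypothetical name, not something shipped with CentOS. Note that of the three yum packages only `ant` provides a command; `xml-commons-apis` and `ant-trax` are libraries, so check those with `rpm -q` instead.

```sh
#!/bin/sh
# Hypothetical helper: print the name of every listed command that is not
# on PATH. Returns non-zero if at least one command is missing.
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_commands ant rpm screen` prints nothing when all three are installed, and prints one name per line for anything that is missing.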
Go here: http://java.sun.com/javase/downloads/index.jsp
Get the following: Java SE Development Kit (JDK), 32-bit (approx. 73.98 MB), JDK 6 Update 16 (or the latest update; the version matters when you set your JAVA_HOME path variable).
Once downloaded, install it using the following:
    chmod +x jdk-6u16-linux-i586-rpm.bin; ./jdk-6u16-linux-i586-rpm.bin
Answer "yes" to the EULA, then install the extracted packages:
    sudo rpm -ivh jdk*.rpm sun*.rpm
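Because the JDK update number ends up in the JAVA_HOME path, it can help to derive the path from the installer file name instead of typing it by hand. The sketch below assumes the Sun naming scheme `jdk-<major>u<update>-linux-<arch>-rpm.bin` and the `/usr/java/jdk1.<major>.0_<update>` install directory used in this howto; the zero-padding of single-digit updates is also an assumption, and `java_home_for_installer` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: derive the JDK install directory from the installer
# file name. The naming scheme and directory layout are assumptions based
# on the paths used in this howto.
java_home_for_installer() {
    name=$(basename "$1")
    pattern='^jdk-\([0-9][0-9]*\)u\([0-9][0-9]*\)-.*$'
    major=$(printf '%s\n' "$name" | sed -n "s/$pattern/\1/p")
    update=$(printf '%s\n' "$name" | sed -n "s/$pattern/\2/p")
    if [ -z "$major" ] || [ -z "$update" ]; then
        echo "unrecognised installer name: $name" >&2
        return 1
    fi
    # Assumption: updates below 10 are zero-padded in the directory name.
    case "$update" in
        [0-9]) update="0$update" ;;
    esac
    printf '/usr/java/jdk1.%s.0_%s\n' "$major" "$update"
}
```

Running `java_home_for_installer jdk-6u16-linux-i586-rpm.bin` prints `/usr/java/jdk1.6.0_16`. Always confirm the result with `ls /usr/java` after the install, since the real directory name is what counts.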
Download the Tomcat 6 rpm:
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.noarch.rpm
http://www.webdroid.org:8080/archives/tomcat-package/tomcat-6.0.16-0.src.rpm (provided for reference)
Once downloaded, install the rpm with the following command:
    sudo rpm -ivh tomcat-6.0.16-0.noarch.rpm
(this installs entirely into /opt/tomcat and can be removed with: rpm -e tomcat)
    sudo vi /opt/tomcat/conf/tomcat-env.sh
(set: JAVA_HOME="/usr/java/jdk1.6.0_16")
Download Nutch 1.0 from a mirror here: http://www.apache.org/dyn/closer.cgi/lucene/nutch/
    sudo cp nutch-1.0.tar.gz /opt; cd /opt && sudo tar xvfz nutch-1.0.tar.gz; cd nutch-1.0
    sudo ant
    sudo ant war
(this creates the "build" directory)
    sudo ln -s /opt/nutch-1.0/build/nutch.xml /opt/tomcat/conf/Catalina/localhost/nutch.xml
(modify the property "searcher.dir" to: /opt/nutch-1.0/crawl/ and the docBase attribute to the full path of your nutch-1.0 war file: docBase="nutch.war" path="/opt/tomcat/webapps/")
    sudo ant
    sudo ant war
(this compiles with the new build/nutch.xml file)
    sudo cp build/nutch-1.0.war /opt/tomcat/webapps/nutch.war
(a .war file, or "web archive", is a zip/jar file; Tomcat unpacks it when it starts)
Add these lines just above: # ksh workaround
    sudo vi /etc/profile
    ##Tomcat 6 / Java##
    JAVA_HOME="/usr/java/jdk1.6.0_16"
    export JAVA_HOME
    CATALINA_HOME="/opt/tomcat"
    export CATALINA_HOME
    NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
    export NUTCH_JAVA_HOME
    ##End Tomcat 6 / Java##
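If you would rather script the /etc/profile edit than do it in vi, something along these lines could work. It is a sketch, not a tested installer: `add_env_block` is a hypothetical function name, the block is inserted above the first line starting with `# ksh workaround` (or appended if no such line exists), and a second run does nothing because the opening marker is already present. Try it on a copy of the file first.

```sh
#!/bin/sh
# The lines to add, exactly as listed in this howto.
ENV_BLOCK='##Tomcat 6 / Java##
JAVA_HOME="/usr/java/jdk1.6.0_16"
export JAVA_HOME
CATALINA_HOME="/opt/tomcat"
export CATALINA_HOME
NUTCH_JAVA_HOME="/usr/java/jdk1.6.0_16"
export NUTCH_JAVA_HOME
##End Tomcat 6 / Java##'

# Hypothetical helper: insert ENV_BLOCK above "# ksh workaround" in the
# given profile file, or append it if that line is not found.
add_env_block() {
    profile=$1
    # Already added on a previous run: nothing to do.
    if grep -qF '##Tomcat 6 / Java##' "$profile"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # The block is passed through the environment so awk prints it verbatim.
    ENV_BLOCK=$ENV_BLOCK awk '
        /^# ksh workaround/ && !inserted { print ENVIRON["ENV_BLOCK"]; inserted = 1 }
        { print }
        END { if (!inserted) print ENVIRON["ENV_BLOCK"] }
    ' "$profile" > "$tmp" && cat "$tmp" > "$profile"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

A cautious way to use it is `cp /etc/profile /tmp/profile.test` followed by `add_env_block /tmp/profile.test`, then compare the two files with `diff` before touching the real one as root.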
    cd /opt/nutch-1.0; sudo mkdir urls
(Make a flat text file in here called "seed" and list the URLs to be crawled, one URL per line: http://www.example.com)
    sudo vi conf/nutch-default.xml
Edit the following:
    http.agent.name <value>My Spider</value>
    http.robots.agents <value>My Spider</value>
    http.agent.description <value>My Bot</value>
    http.agent.url <value>http://www.example.com</value>
    http.agent.email <value>admin@example.com</value>
All other values remain as default; do not attempt to alter them unless you have a backup and/or you know what you're doing.
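The seed file is just plain text with one URL per line, so it is easy to generate. The following sketch wraps that in a small function; `write_seed_file` is a hypothetical name, and the example URLs are placeholders.

```sh
#!/bin/sh
# Hypothetical helper: create DIR if needed and write each URL argument
# on its own line to DIR/seed, replacing any existing seed file.
write_seed_file() {
    dir=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: write_seed_file DIR URL..." >&2
        return 1
    fi
    mkdir -p "$dir" || return 1
    printf '%s\n' "$@" > "$dir/seed"
}
```

For example, `write_seed_file /opt/nutch-1.0/urls http://www.example.com http://www.example.org` produces a two-line seed file. Writing under /opt needs root, so run it from a root shell.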
Put the deepcrawler script (/scripts/deepcrawler) in /opt/nutch-1.0/bin
    chmod +x deepcrawler
Note: This script assumes the URLs you plan to inject are stored in /opt/nutch-1.0/urls/seed and will create a new dir in /opt/nutch-1.0/crawl1 to store the new crawl.
If you do not alter the deepcrawler script, it will most likely run for many days or weeks, depending on the number of URLs you inject, so you'll want to run it in screen so you can detach and reattach to check progress.
    screen -S nutch
    sudo service tomcat start
    cd /opt/nutch-1.0; su -c "bin/deepcrawler"
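Since a crawl can run for days, a short pre-flight check before you start can save a wasted run. The sketch below only tests the locations named in the deepcrawler note (urls/seed, bin/deepcrawler and crawl1 under the Nutch directory). `crawl_preflight` is a hypothetical name, and treating an existing crawl1 directory as a warning rather than an error is an assumption, since this howto does not say how the script handles it.

```sh
#!/bin/sh
# Hypothetical helper: check that a Nutch directory looks ready for a crawl.
# Returns non-zero if the seed file or the deepcrawler script is not usable.
crawl_preflight() {
    nutch_dir=$1
    if [ ! -s "$nutch_dir/urls/seed" ]; then
        echo "seed file missing or empty: $nutch_dir/urls/seed" >&2
        return 1
    fi
    if [ ! -x "$nutch_dir/bin/deepcrawler" ]; then
        echo "missing or not executable: $nutch_dir/bin/deepcrawler" >&2
        return 1
    fi
    # Assumption: an existing crawl1 directory is worth a warning only.
    if [ -e "$nutch_dir/crawl1" ]; then
        echo "warning: $nutch_dir/crawl1 already exists" >&2
    fi
    return 0
}
```

Inside the screen session you could then run `crawl_preflight /opt/nutch-1.0 && su -c "bin/deepcrawler"` from /opt/nutch-1.0.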
Note: if you use the script provided above, you can skip the GUI altogether.
Download Nutch-Gui 0.2 from: http://github.com/101tec/nutch/downloads
    sudo cp nutch-gui-0.2.tar.gz /opt; cd /opt && sudo tar xvfz nutch-gui-0.2.tar.gz; cd nutch-gui-0.2
    sudo ant clean package
    cd build/nutch-gui-0.2
    sudo cp nutch-gui-0.2.war /opt/tomcat/webapps/nutch-gui.war
Unsecured quick test method, to make sure it's working:
    su -c "bin/nutch admin /opt/nutch-1.0 50060"
    http://example.com:50060/general
More secure password protection:
    sudo vi conf/nutchguiUsers.properties
(edit the following information: user=password, admin, where user is the username, password is the password you want, and admin is the role)
    screen -S nutch-gui
(since we'll probably run it for a while)
    su -c "bin/nutch admin /opt/nutch-1.0 50060 --secure"
    http://example.com:50060/general
Troubleshooting / How To Test
- Make sure the required packages are installed and JAVA_HOME path variable is set in /etc/profile:
- Set Tomcat to start on boot:
    rpm -q tomcat jdk ant xml-commons-apis ant-trax; echo $JAVA_HOME
    tomcat-6.0.16-0
    jdk-1.6.0_16-fcs
    ant-1.6.5-2jpp.2
    xml-commons-apis-1.3.02-0.b2.7jpp.10
    ant-trax-1.6.5-2jpp.2
    /usr/java/jdk1.6.0_16
Replace "localhost" with your machine's IP address.
Try accessing Tomcat here: http://localhost:8080/
Try accessing Nutch here: http://localhost:8080/nutch/
Try accessing Nutch-Gui here: http://localhost:50060/general
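Printing $JAVA_HOME shows that the variable is set, but not that the directory exists. A slightly stricter check might look like this sketch; `check_dir_var` is a hypothetical name, and it takes the variable name and its value as two separate arguments to keep the code portable.

```sh
#!/bin/sh
# Hypothetical helper: succeed only if the given value is non-empty and
# names an existing directory. NAME is used for the messages only.
check_dir_var() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set" >&2
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value" >&2
        return 1
    fi
    echo "$name=$value"
}
```

After logging in again so that /etc/profile is re-read, run `check_dir_var JAVA_HOME "$JAVA_HOME"`, and do the same for CATALINA_HOME and NUTCH_JAVA_HOME.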
    sudo chkconfig --level 2345 tomcat on; chkconfig --list | grep tomcat
    tomcat 0:off 1:off 2:on 3:on 4:on 5:on 6:off
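If you want to verify the chkconfig result from a script instead of reading it by eye, a check along these lines is one option. It is a sketch: `enabled_on_boot` is a hypothetical name, and it simply looks for "on" at runlevels 2 through 5 in a line of `chkconfig --list` output like the one shown in this howto.

```sh
#!/bin/sh
# Hypothetical helper: succeed when a "chkconfig --list" line shows
# runlevels 2, 3, 4 and 5 as "on".
enabled_on_boot() {
    line=$1
    for level in 2 3 4 5; do
        case "$line" in
            *"${level}:on"*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}
```

On the server you would call it as `enabled_on_boot "$(chkconfig --list tomcat)" && echo "tomcat starts on boot"`.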
Disclaimer
We test this stuff on our own machines, really we do. But you may run into problems; if you do, come to #centoshelp on irc.freenode.net.
Last Modified: 22 Apr, 2020 at 16:44:56