Prerequisites:
- Install Spark 1.5.x (see http://prasi82.blogspot.com/2015/11/setup-spark-on-windows.html for details)
- Install ZooKeeper and Solr 5.3.x
- If you have Solr 4.x, you will need to upgrade your Solr 4.x indexed data to Solr 5.x.
Steps:
1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME:
https://github.com/LucidWorks/spark-solr
set SS_HOME=d:\spark-solr-master
2) Download Maven and set the following environment variables:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
3) Open a command prompt, make sure M2_HOME is set and added to PATH, then run:
cd %SS_HOME%
mvn install -DskipTests
This builds spark-solr-1.0-SNAPSHOT.jar in the %SS_HOME%\target folder.
4) Download the Spark-Solr Java sample from the link below (you will need Eclipse with the 'm2eclipse' Maven plugin):
https://github.com/freissmann/SolrWithSparks
Import it into Eclipse via Import... > Existing Maven Projects > Browse, selecting the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.
The SparkSolrJobApp.java sample needs JDK 1.8 because it uses lambda expressions.
Open pom.xml and add the following if not already specified (a sketch follows):
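For example, something like this (a sketch: the compiler level matches the JDK 1.8 requirement above, and the spark-solr coordinates are an assumption based on the jar built in step 3; adjust to your sample's actual pom):

    <!-- compile with JDK 1.8 for the lambda expressions used in SparkSolrJobApp -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>

    <!-- assumption: the spark-solr jar built and installed locally in step 3 -->
    <dependency>
      <groupId>com.lucidworks.spark</groupId>
      <artifactId>spark-solr</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>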
If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "1.6 tools.jar missing" error. To resolve it, add a -vm entry to your eclipse_home\eclipse.ini as sketched below, then restart Eclipse.
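For example (a sketch; the JDK path reuses the one from step 6 and is an assumption, so adjust it to your install; the -vm entry must span two lines and appear before any -vmargs line):

    -vm
    D:\apps\Java\jdk1.8.0_65\bin\javaw.exe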
See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1
package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);

        // Query all shards of the collection and get the results back as an RDD of documents
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.queryShards(javaSparkContext, solrQuery);

        // Extract the (optional) title field from each document, dropping empty ones
        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());

        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();

        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}
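The ContextMaker helper ships with the SolrWithSparks sample and is not shown above; if you need to adapt it, a minimal equivalent sketch (assuming a local-mode master, which the actual sample may configure differently) looks like:

    package de.blogspot.qaware.spark.common;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ContextMaker {
        // Minimal sketch: build a local-mode JavaSparkContext with the given app name.
        public static JavaSparkContext makeJavaSparkContext(String appName) {
            SparkConf conf = new SparkConf()
                    .setAppName(appName)
                    .setMaster("local[*]"); // assumption: run locally using all cores
            return new JavaSparkContext(conf);
        }
    }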
5) Build an assembly jar of your code:
Modify pom.xml to build an assembly jar of your example code, i.e. a jar with all dependencies included, as sketched below.
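A typical way to do this is the standard maven-assembly-plugin (a sketch; your sample's pom may already contain an equivalent):

    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <!-- produces the *-jar-with-dependencies.jar used in step 6 -->
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>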
Open a command prompt, make sure M2_HOME is set as in step 2) and %M2_HOME%\bin is included in the PATH.
cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests
This builds the jar of your code, with all its dependency classes included, in the target folder.
6) Run the Spark-Solr example:
Before you run this program, you will need to start your ZooKeeper ensemble and SolrCloud shards, for example:
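(The install paths below are illustrative assumptions; adjust them to your own ZooKeeper and Solr locations.)

    REM start ZooKeeper in one prompt
    cd /d D:\apps\zookeeper-3.4.6\bin
    zkServer.cmd

    REM start a SolrCloud node in a second prompt, pointing it at ZooKeeper
    cd /d D:\apps\solr-5.3.1\bin
    solr.cmd start -c -z localhost:2181 -p 8983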
Then go to a command prompt and run the following:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd
%SPARK_HOME%\bin\spark-submit.cmd --class de.blogspot.qaware.spark.SparkSolrJobApp file://d:/SolrWithSparks-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Note that you must specify the path to your Spark job assembly jar starting with "file://", using '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.
7) Troubleshooting :
If you get a connection timeout error, make sure SPARK_LOCAL_IP is set in your %SPARK_HOME%\conf\spark-env.cmd:
set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1
If you get an error of the following type:
Exception Details:
  Location:
    org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
  Reason:
    Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
  Current Frame:
    bci: @62
    flags: { }
    locals: { 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
    stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
  Bytecode:
    0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
    0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
    0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
    0000030: 0011 592b b700 124e 2d2c b800 102d b0
  Stackmap Table:
    append_frame(@47,Object127)
Download the following jars (just google for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder:
httpclient-4.4.1.jar
httpcore-4.4.1.jar
See here for details : https://issues.apache.org/jira/browse/SOLR-7948
8) FAQ
How to build Solr queries in a Spark job (a sketch follows the link):
https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
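A minimal SolrJ sketch (the field names here are illustrative):

    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("title:*");            // illustrative filter query
    query.setFields("id", "title");             // fields to return
    query.setStart(0);
    query.setRows(100);
    query.setSort("id", SolrQuery.ORDER.asc);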
How to sort a JavaRDD (a sketch follows the link):
http://stackoverflow.com/questions/27151943/sortby-in-javardd
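For example, to sort the titleNumbers RDD from the sample above (sortBy takes a key function, an ascending flag, and a partition count):

    JavaRDD<String> sortedTitles = titleNumbers.sortBy(title -> title, true, 1);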
How to invoke a Spark job from a Java program:
https://github.com/spark-jobserver/spark-jobserver
http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html