Prasad's Technical Diary
Spark Solr Integration on Windows
Prerequisites:
- Install Spark 1.5.X (see here for details http://prasi82.blogspot.com/2015/11/setup-spark-on-windows.html )
- Install ZooKeeper and Solr 5.3.X
- If you have Solr 4.X, you will need to upgrade your Solr 4.X index data to Solr 5.X (see the migration section below).
1) Download the spark-solr sources from the following link and unzip them to a location, e.g. d:\spark-solr-master. Let's call this location SS_HOME.
https://github.com/LucidWorks/spark-solr
set SS_HOME=d:\spark-solr-master
2) Download Maven and set the following environment variables:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
3) Open a command prompt, make sure M2_HOME is set and added to PATH, then run:
cd %SS_HOME%
mvn install -DskipTests
This will build spark-solr-1.0-SNAPSHOT.jar in the %SS_HOME%\target folder.
4) Download the Spark Solr Java sample from the link below (you will need Eclipse with the Maven 'm2eclipse' plugin):
https://github.com/freissmann/SolrWithSparks
Import it into Eclipse via Import... > Existing Maven Projects and browse to the downloaded and unzipped example directory, e.g. D:\SolrWithSparks-master.
The SparkSolrJobApp.java sample needs JDK 1.8 for its lambda expressions.
Open the pom.xml and add the following if not already specified (see the sketch below):
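A minimal sketch of a typical setting, assuming the standard maven-compiler-plugin (the original snippet was not preserved in this post; adjust the plugin version to your build):

<build>
  <plugins>
    <!-- compile the sample with JDK 1.8 so the lambda expressions work -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.3</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>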
If you are running Eclipse with JAVA_HOME pointing to a JRE path instead of a JDK path, you will get a "1.6 tools.jar missing" error. To resolve this error, restart Eclipse after making the following change in your eclipse_home\eclipse.ini file:
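The exact eclipse.ini change was not preserved in this post; the usual fix is a -vm entry (it must appear before -vmargs) pointing to a JDK's javaw.exe, for example (the JDK path below is the one used later in this post and is only an example):

-vm
D:\apps\Java\jdk1.8.0_65\bin\javaw.exe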
See the link below for additional info:
http://stackoverflow.com/questions/11118070/buiding-hadoop-with-eclipse-maven-missing-artifact-jdk-toolsjdk-toolsjar1
package de.blogspot.qaware.spark;

import com.lucidworks.spark.SolrRDD;
import de.blogspot.qaware.spark.common.ContextMaker;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import java.util.Arrays;

public class SparkSolrJobApp {

    private static final String ZOOKEEPER_HOST_AND_PORT = "zkhost:zkport";
    private static final String SOLR_COLLECTION = "collection1";
    private static final String QUERY_ALL = "*:*";

    public static void main(String[] args) throws Exception {
        String zkHost = ZOOKEEPER_HOST_AND_PORT;
        String collection = SOLR_COLLECTION;
        String queryStr = QUERY_ALL;

        JavaSparkContext javaSparkContext = ContextMaker.makeJavaSparkContext("Querying Solr");

        SolrRDD solrRDD = new SolrRDD(zkHost, collection);
        final SolrQuery solrQuery = SolrRDD.toQuery(queryStr);

        // NOTE: the generic types on the next two lines were swallowed by the blog's HTML;
        // queryShards(...) is assumed here from the spark-solr 1.x SolrRDD API.
        JavaRDD<SolrDocument> solrJavaRDD = solrRDD.queryShards(javaSparkContext, solrQuery);
        JavaRDD<String> titleNumbers = solrJavaRDD.flatMap(doc -> {
            Object possibleTitle = doc.get("title");
            String title = possibleTitle != null ? possibleTitle.toString() : "";
            return Arrays.asList(title);
        }).filter(s -> !s.isEmpty());
        System.out.println("\n# of found titles: " + titleNumbers.count());

        // Now use schema information in Solr to build a queryable SchemaRDD
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // Pro Tip: SolrRDD will figure out the schema if you don't supply a list of field names in your query
        DataFrame tweets = solrRDD.asTempTable(sqlContext, queryStr, "documents");

        // SQL can be run over RDDs that have been registered as tables.
        DataFrame results = sqlContext.sql("SELECT * FROM documents where id LIKE 'one%'");

        // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
        // The columns of a row in the result can be accessed by ordinal.
        JavaRDD<Row> resultsRDD = results.javaRDD();
        System.out.println("\n\n# of documents where 'id' starts with 'one': " + resultsRDD.count());

        javaSparkContext.stop();
    }
}
5) Build an assembly jar of your code:
Modify pom.xml to build the assembly jar of your example code, i.e. a jar with all its dependencies included (a sketch follows):
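The pom change is not preserved in this post; a minimal sketch using the standard maven-assembly-plugin with its jar-with-dependencies descriptor (the mainClass assumes this post's sample class):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.6</version>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>de.blogspot.qaware.spark.SparkSolrJobApp</mainClass>
      </manifest>
    </archive>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>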
Open a command prompt; make sure M2_HOME is set as indicated in step 2) and "%M2_HOME%\bin" is included in the PATH.
cd D:\SolrWithSparks-master
mvn clean
mvn package -DskipTests
This will build the jar of your code, with all its dependency classes included, in the "target" folder.
6) Run the Solr Spark example:
Before you run this program, you will need to start your ZooKeeper and SolrCloud shards.
Open a command prompt and run the following:
set M2_HOME=D:\apps\apache-maven-3.0.5
set path=%M2_HOME%\bin;%PATH%;
set java_home=d:\apps\Java\jdk1.8.0_65
set jre_home=%java_home%\jre
set jdk_home=%JAVA_HOME%
set path=%java_home%\bin;%path%
set SPARK_HOME=D:\udu\hk\spark-1.5.1
set SPARK_CONF_DIR=%SPARK_HOME%\conf
call %SPARK_HOME%\bin\load-spark-env.cmd
%spark_home%\bin\spark-submit.cmd --class org.sparkexample.SparkSolrJobApp file://d:/spark-solr-master/target/spark-test-id-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Note that you need to specify the path to your Spark job assembly jar starting with "file://" and use '/' as the path separator instead of '\'.
After the jar path you may specify any additional parameters required by your Spark job.
7) Troubleshooting :
If you get a connection timeout error, make sure SPARK_LOCAL_IP is set in your SPARK_HOME\conf\spark-env.cmd:
set SPARK_LOCAL_IP=127.0.0.1
REM set ip address of spark master
REM set SPARK_MASTER_IP=127.0.0.1
If you get an error of the following type:
Exception Details:
Location:
org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient; @62: areturn
Reason:
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, stack[0]) is not assignable to 'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
Current Frame:
bci: @62
flags: { }
locals:
{ 'org/apache/solr/common/params/SolrParams', 'org/apache/http/conn/ClientConnectionManager', 'org/apache/solr/common/params/ModifiableSolrParams', 'org/apache/http/impl/client/DefaultHttpClient' }
stack:
{ 'org/apache/http/impl/client/DefaultHttpClient' }
Bytecode:
0000000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0000010: 0099 001e b200 05bb 0007 59b7 0008 1209
0000020: b600 0a2c b600 0bb6 000c b900 0d02 00bb
0000030: 0011 592b b700 124e 2d2c b800 102d b0
Stackmap Table:
append_frame(@47,Object127)
To fix this, download the following jars (just Google for them) and copy them to your %HADOOP_HOME%\share\hadoop\common\lib folder:
httpclient-4.4.1.jar
httpcore-4.4.1.jar
See here for details : https://issues.apache.org/jira/browse/SOLR-7948
8) FAQ
How to build Solr queries in a Spark job:
https://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/client/java/solrj/test/org/apache/solr/client/solrj/SolrExampleTests.java
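For quick reference, a minimal SolrJ sketch of building a query programmatically (the field names are just examples):

import org.apache.solr.client.solrj.SolrQuery;

public class QueryBuilderExample {
    public static void main(String[] args) {
        // match all docs, filter on a field, pick the stored fields, then sort and page
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("title:spark");
        query.setFields("id", "title");
        query.setSort("id", SolrQuery.ORDER.asc);
        query.setStart(0);
        query.setRows(10);
        System.out.println(query); // prints the URL-encoded query parameters
    }
}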
How to sort a JavaRDD:
http://stackoverflow.com/questions/27151943/sortby-in-javardd
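A minimal self-contained sketch of the sortBy call discussed in that link (the local master and sample data are only for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SortByExample {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
                new SparkConf().setAppName("sortBy example").setMaster("local[2]"));
        JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(3, 1, 2));
        // sortBy takes a key-extractor function, an ascending flag and the number of partitions
        JavaRDD<Integer> sorted = numbers.sortBy(x -> x, true, 1);
        System.out.println(sorted.collect()); // prints: [1, 2, 3]
        jsc.stop();
    }
}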
How to invoke a Spark job from a Java program:
https://github.com/spark-jobserver/spark-jobserver
http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-td13426.html
Migrating Solr 4.X index data to a Solr 5.X index
To upgrade your Solr 4.X index data to Solr 5.X, run the following command:
java -cp D:\solr-5.3.1\server\solr-webapp\webapp\WEB-INF\lib\* org.apache.lucene.index.IndexUpgrader D:\solr-4.4.0\example\solr\collection1\data\index
Here "D:\solr-4.4.0\example\solr\collection1\data\index" is assumed to be the directory that contains the index data you want to upgrade to Solr 5.3.X.
After this you can copy your Solr 4.4.X collection / core directory (e.g. D:\solr-4.4.0\example\solr\collection1 in the above command) to the Solr 5.3.X home directory.
After copying to the Solr 5.3.X home directory, you will need to make a few changes in the schema.xml and solrconfig.xml of your collection:
In your collection\conf\solrconfig.xml, comment out the following:
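The exact elements were not preserved in this post; one typical candidate when moving a 4.x solrconfig.xml to Solr 5.x is the AdminHandlers registration, which Solr 5 no longer ships, e.g.:

<!-- solr.admin.AdminHandlers no longer exists in Solr 5.x; comment it out if your 4.x config registers it -->
<!--
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
-->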
In your collection\conf\schema.xml, change the following fieldType classes to include the "Trie" prefix:
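The original list was not preserved in this post; the usual change is that the pre-Trie numeric and date classes must become their Trie equivalents, for example:

<!-- Solr 5.x removed the old numeric/date field classes; use the Trie versions instead -->
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>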
Setup Apache Spark On Windows
Download Spark and unzip it, e.g.:
SPARK_HOME=D:\udu\hk\spark-1.5.1
Spark needs Hadoop jars. Download Hadoop binaries for Windows (Hadoop 2.6.0) from
http://www.barik.net/archive/2015/01/19/172716/
Unzip Hadoop at some location, e.g.:
HADOOP_HOME=D:\udu\hk\hadoop-2.6.0
If your java_home or hadoop_home path contains space characters (' '), you will need to convert the path to its short (8.3) form:
- Create a batch script with following contents
@ECHO OFF
echo %~s1
- Run the above batch script, passing your java_home path as the argument, to get the short path for java_home
- Run the above batch script, passing your hadoop_home path as the argument, to get the short path for hadoop_home
set java_home=<short path obtained from the above command>
set hadoop_home=<short path obtained from the above command>
Run the following command and copy the classpath it prints; you will need it in the next step:
%HADOOP_HOME%\bin\hadoop classpath
Under spark_home\conf, create a file named "spark-env.cmd" like the one below:
@echo off
set HADOOP_HOME=D:\Utils\hadoop-2.7.1
set PATH=%HADOOP_HOME%\bin;%PATH%
set SPARK_DIST_CLASSPATH=<paste the classpath copied from the "hadoop classpath" command>
On the command prompt:
cd %spark_home%\bin
set SPARK_CONF_DIR=%SPARK_HOME%\conf
load-spark-env.cmd
spark-shell.cmd //To start spark shell
spark-submit.cmd //To submit a spark job (see the word count example below)
Refer below to create a spark word count example
http://www.robertomarchetto.com/spark_java_maven_example
To run the word count Spark job (written in Java) from the above URL:
spark-submit --class org.sparkexample.WordCount --master local[2] your_spark_job_jar Any_additional_parameters_needed_by_your_job_jar
References :
http://stackoverflow.com/questions/30906412/noclassdeffounderror-com-apache-hadoop-fs-fsdatainputstream-when-execute-spark-s
https://blogs.perficient.com/multi-shoring/blog/2015/05/07/setup-local-standalone-spark-node/
http://nishutayaltech.blogspot.com/2015/04/how-to-run-apache-spark-on-windows7-in.html
How to split a Solr core into multiple shards on different machines
download zookeeper
make a copy of conf\zoo_sample.cfg as zoo.cfg
create a zkdata folder.
in zoo.cfg, set dataDir to the path of the zkdata folder (replace '\' in the path with '/'), as in the sketch below.
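The zoo.cfg values are not preserved in this post; a minimal sketch, assuming ZooKeeper is unzipped at D:\zk1 and the zkdata folder was created there:

# minimal zoo.cfg sketch (standard sample values; only dataDir is specific to this setup)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=D:/zk1/zkdata
clientPort=2181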
Modify zk_home\bin\zkEnv.cmd to replace the %~dp0.. references with actual paths, like below:
REM assuming Zookeeper is installed at : D:\zk1
set ZOOCFGDIR=D:\zk1\conf
set ZOO_LOG_DIR=D:\zk1
set ZOO_LOG4J_PROP=INFO,CONSOLE
set CLASSPATH=%ZOOCFGDIR%
SET CLASSPATH=D:\zk1\*;D:\zk1\lib\*;%CLASSPATH%
SET CLASSPATH=D:\zk1\build\classes;D:\zk1\build\lib\*;%CLASSPATH%
set ZOOCFG=%ZOOCFGDIR%\zoo.cfg
Let's assume you have a standalone Solr with a "collection1" core, and we want to convert this into a SolrCloud cluster and split the core into multiple shards on different machines.
1) First start ZooKeeper by running the following command from the zk_home/bin path:
zkServer.cmd
2) Next, upload & link your standalone core config to ZooKeeper (assuming Solr is installed at D:\solr-4.4.0\mySolr1).
Make sure to use '/' as the path separator instead of '\' for the confdir path below:
cd D:\solr-4.4.0\mySolr1\cloud-scripts
zkcli.bat -zkhost localhost:2181 -cmd upconfig -confdir D:/solr-4.4.0/mySolr1/solr/collection1/conf -confname collection1cfg
zkcli.bat -zkhost localhost:2181 -cmd linkconfig -collection collection1 -confname collection1cfg
3) Next, start Solr with the following command (assuming ZooKeeper is running on localhost:2181 and Solr is installed at D:\solr-4.4.0\mySolr1):
java -Xmx2g -Djetty.port=8983 -DzkHost=localhost:2181 -Dsolr.solr.home=D:\solr-4.4.0\mySolr1\solr -jar start.jar
4) Go to localhost:8983/solr and verify the # of documents in your core with the following URL:
http://localhost:8983/solr/collection1/select?q=*:*&rows=0
Make a note of the numFound value in the response returned.
Also verify that the following URL shows our "collection1" core as "shard1":
http://localhost:8983/solr/#/~cloud
5) Next, in the browser, run the following command to split your core "collection1" into multiple shards (assuming shard1 is the name of your shard and collection1 is the name of your core):
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1&async=5
This should generate 2 shards / cores:
collection1_shard1_1_replica1
collection1_shard1_0_replica1
6) Next, on another machine (e.g. machine2), start Solr with the following command, assuming ZooKeeper is running on machine1:2181 (replace "machine1" with the name or IP address of machine1)
and Solr on machine2 is installed at D:/solr-4.4.0/mySolr2:
java -Xmx2g -Djetty.port=7574 -DzkHost=machine1:2181 -Dsolr.solr.home=D:/solr-4.4.0/mySolr2/solr -jar start.jar
This will start solr on port 7574 on machine 2
7) On machine2, run the following command to create a new core (collection1_shard1_1_core2), specifying collection=collection1 (same as the original core name on machine1 before splitting into shards)
and shard=shard1_1 (same as shown on the http://localhost:8983/solr/#/~cloud page for the shard that you want to move to machine2):
http://localhost:7574/solr/admin/cores?action=CREATE&collection=collection1&shard=shard1_1&name=collection1_shard1_1_core2
8) Now shut down Solr (not ZooKeeper) on machine1 by pressing Ctrl + C in its running console window.
Move the collection1_shard1_1_replica1 folder out of the Solr installation to some location outside the SOLR_HOME\solr directory,
and re-run Solr on machine1:
java -Xmx2g -Djetty.port=8983 -DzkHost=localhost:2181 -Dsolr.solr.home=D:\solr-4.4.0\mySolr1\solr -jar start.jar
Verify that the shard on machine2 is now shown as the shard leader for shard1_1.
We have now successfully converted a single core named 'collection1' into a distributed collection 'collection1', with shard1_0 on the original machine1 and shard1_1 moved to machine2.
Java Regex to parse RGBA Color Values
String keywords_color_regex = "^[a-z]*$";
String hex_color_regex = "^#[0-9a-f]{3}([0-9a-f]{3})?$";
String rgb_color_regex = "^rgb\\(\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*\\)$";
String rgba_color_regex = "^rgba\\(\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,\\s*((0\\.\\d+)|[01])\\s*\\)$";
String hsl_color_regex = "^hsl\\(\\s*(0|[1-9]\\d?|[12]\\d\\d|3[0-5]\\d)\\s*,\\s*((0|[1-9]\\d?|100)%)\\s*,\\s*((0|[1-9]\\d?|100)%)\\s*\\)$";
Source : http://stackoverflow.com/questions/12385500/regex-pattern-for-rgb-rgba-hsl-hsla-color-coding
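A minimal sketch of using the rgba pattern above with java.util.regex to pull out the channel values (the sample input string is hypothetical):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RgbaParseExample {
    // same rgba pattern as above (alpha group with the escaped dot)
    private static final Pattern RGBA = Pattern.compile(
            "^rgba\\(\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,"
          + "\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,"
          + "\\s*(0|[1-9]\\d?|1\\d\\d?|2[0-4]\\d|25[0-5])\\s*,"
          + "\\s*((0\\.\\d+)|[01])\\s*\\)$");

    public static void main(String[] args) {
        Matcher m = RGBA.matcher("rgba(255, 0, 128, 0.5)");
        if (m.matches()) {
            int r = Integer.parseInt(m.group(1));
            int g = Integer.parseInt(m.group(2));
            int b = Integer.parseInt(m.group(3));
            double a = Double.parseDouble(m.group(4)); // group 4 is the whole alpha group
            System.out.println("r=" + r + ", g=" + g + ", b=" + b + ", a=" + a);
        }
    }
}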
Mock it..
If you try to mock a method that returns a primitive type, Eclipse gives the following compiler error:
The expression of type boolean is boxed into Boolean
To get rid of this error, change the "Boxing and unboxing conversions" setting from Error to Warning or Ignore:
Globally: Window -> Preferences -> Java -> Compiler -> Errors/Warnings -> Potential programming problems -> Boxing and unboxing conversions.
For the project: Properties -> Java Compiler -> Errors/Warnings, then the same path.
Source :
http://translate.google.co.in/translate?hl=en&sl=ru&u=http://www.sql.ru/forum/1059132/mockito-strannaya-konstrukciya-vnutri-when&prev=search
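For context, a minimal Mockito sketch of the situation (the Validator interface below is hypothetical): the when(...) call autoboxes the primitive boolean return value, which is exactly what the compiler setting above flags.

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class MockPrimitiveReturnExample {
    // hypothetical collaborator with a primitive return type
    interface Validator {
        boolean isValid(String input);
    }

    public static void main(String[] args) {
        Validator validator = mock(Validator.class);
        // the primitive boolean returned by isValid() is boxed into Boolean here,
        // which Eclipse flags when "Boxing and unboxing conversions" is set to Error
        when(validator.isValid("x")).thenReturn(true);
        System.out.println(validator.isValid("x")); // prints: true
    }
}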