<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I'll take this conversation private.</p>
<p>--Ned.<br>
</p>
<div class="moz-cite-prefix">On 4/26/16 8:53 PM, Kun Chen wrote:<br>
</div>
<blockquote
cite="mid:CAPTVxyShGtbBnhnNKdcm1m5wtGQckyKvf7t9EJ0Ce+XY35gTiA@mail.gmail.com"
type="cite">
<div dir="ltr">Hi,
<div><br>
<div>Thanks for the quick response.</div>
<div><br>
</div>
<div>I tried the .pth way of starting coverage; it works for the
driver process, but still not for the worker process.</div>
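For reference, the ".pth way" is coverage.py's documented mechanism for measuring subprocesses: a one-line file dropped into site-packages. The exact path below is an assumption for this Python 2.7 setup, not something stated in the thread:

```
# coverage.pth -- place in the interpreter's site-packages directory,
# e.g. /usr/lib/python2.7/dist-packages/ (path is an assumption).
# site.py executes .pth lines that start with "import" at every
# interpreter startup, so each new python process starts coverage itself.
import coverage; coverage.process_startup()
```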
<div><br>
</div>
<div>I also tried patching the coverage source code to print a message,
or to write something to a local file, when it constructs the Coverage
instance (from coverage.process_startup(), of course), and the results
were:</div>
<div><br>
</div>
<div>1. printing got a Java exception after spark-submit, like the
following (my guess: the extra text on stdout is read by Spark where
the daemon's port number is expected, hence the nonsense port
value):</div>
<div>
<pre>java.lang.IllegalArgumentException: port out of range:1668247142
	at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
	at java.net.InetSocketAddress.&lt;init&gt;(InetSocketAddress.java:185)
	at java.net.Socket.&lt;init&gt;(Socket.java:241)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)</pre>
</div>
<div><br>
</div>
<div>2. writing to a local file has no effect at all, though writing a
file from my customized rdd map function does work, like the
following</div>
<div><br>
</div>
<div>
<pre>import os
from multiprocessing import current_process

# note: pid is evaluated at import time, in whichever process
# imports this module
pid = current_process().pid

def handle(sc, file, ofile):
    rd = sc.textFile(file)
    rd.map(mysub).saveAsTextFile(ofile)

def mysub(row):
    print 'from mapper process {0}'.format(pid)
    print 'env: {0}'.format(os.getenv('COVERAGE_PROCESS_START'))
    f = open('/home/kunchen/{0}.txt'.format(pid), 'a')
    f.writelines([row])
    f.close()
    return row.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower()</pre>
<div><br>
</div>
<div class="gmail_extra">I'm quite new to Spark and not sure
how the worker process are executed. Anyone ever tried to
tackle this problem?</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Apr 27, 2016 at 3:00 AM,
<span dir="ltr"><<a moz-do-not-send="true"
href="mailto:testing-in-python-request@lists.idyll.org"
target="_blank">testing-in-python-request@lists.idyll.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px
0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Send
testing-in-python mailing list submissions to<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web,
visit<br>
<a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
or, via email, send a message with subject or body
'help' to<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python-request@lists.idyll.org"
target="_blank">testing-in-python-request@lists.idyll.org</a><br>
<br>
You can reach the person managing the list at<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python-owner@lists.idyll.org"
target="_blank">testing-in-python-owner@lists.idyll.org</a><br>
<br>
When replying, please edit your Subject line so it is
more specific<br>
than "Re: Contents of testing-in-python digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. how to generate coverage info for pyspark
applications (Kun Chen)<br>
2. Re: how to generate coverage info for pyspark
applications<br>
(Ned Batchelder)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Tue, 26 Apr 2016 21:01:02 +0800<br>
From: Kun Chen <<a moz-do-not-send="true"
href="mailto:kunchen@everstring.com" target="_blank">kunchen@everstring.com</a>><br>
Subject: [TIP] how to generate coverage info for
pyspark applications<br>
To: <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
Message-ID:<br>
<<a moz-do-not-send="true"
href="mailto:CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com"
target="_blank">CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hi, all<br>
<br>
I tried to run a simple pyspark application on spark
in local mode, and was<br>
hoping to get the coverage data file generated
somewhere for future use.<br>
<br>
0. I put the following lines at the head of<br>
/usr/lib/python2.7/sitecustomize.py<br>
import coverage<br>
coverage.process_startup()<br>
<br>
1. I set the following env variable in ~/.bashrc<br>
export
COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>
<br>
2. the config file
'/home/kunchen/git/es-signal/.coveragerc' has the
following<br>
content<br>
[run]<br>
parallel = True<br>
concurrency = multiprocessing<br>
omit =<br>
*dist-packages*<br>
*pyspark*<br>
*spark-1.5.2*<br>
cover_pylib = False<br>
data_file = /home/kunchen/.coverage<br>
<br>
3. I put ci3.py and test.py both<br>
in /home/kunchen/Downloads/software/spark-1.5.2 ( my
spark home )<br>
<br>
4. in my spark home, I ran the following command to
submit and run the code.<br>
spark-submit --master local --py-files=ci3.py test.py<br>
<br>
<br>
5. after the application finished, I got two coverage
files in /home/kunchen<br>
.coverage.kunchen-es-pc.31117.003485<br>
.coverage.kunchen-es-pc.31176.826660<br>
<br>
but according to the process id in the file names and
the content of those<br>
files, none of them was generated by the spark worker
process(or thread?<br>
not sure here).<br>
<br>
My question is what I have to do to get the coverage
data of the code being<br>
executed by the spark workers?<br>
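As a side note not raised in the thread: because the config sets parallel = True, per-process data files like the two shown above have to be merged before reporting. With coverage.py that is the `combine` step; the working directory below comes from the data_file setting in the messages:

```shell
# Run where the data files were written (data_file = /home/kunchen/.coverage)
cd /home/kunchen
coverage combine   # merges the .coverage.kunchen-es-pc.* files into .coverage
coverage report
```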
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm</a>><br>
-------------- next part --------------<br>
A non-text attachment was scrubbed...<br>
Name: ci3.py<br>
Type: text/x-python<br>
Size: 369 bytes<br>
Desc: not available<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py</a>><br>
-------------- next part --------------<br>
A non-text attachment was scrubbed...<br>
Name: test.py<br>
Type: text/x-python<br>
Size: 231 bytes<br>
Desc: not available<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py</a>><br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Tue, 26 Apr 2016 11:32:38 -0400<br>
From: Ned Batchelder <<a moz-do-not-send="true"
href="mailto:ned@nedbatchelder.com" target="_blank">ned@nedbatchelder.com</a>><br>
Subject: Re: [TIP] how to generate coverage info for
pyspark<br>
applications<br>
To: <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
Message-ID: <<a moz-do-not-send="true"
href="mailto:c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com"
target="_blank">c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com</a>><br>
Content-Type: text/plain; charset="windows-1252";
Format="flowed"<br>
<br>
I don't know anything about spark, so I'm not sure how
it starts up its<br>
workers. My first suggestion would be to use the .pth
method of<br>
starting coverage in subprocesses rather than the
sitecustomize<br>
technique, and see if that works better.<br>
<br>
--Ned.<br>
<br>
<br>
On 4/26/16 9:01 AM, Kun Chen wrote:<br>
> Hi, all<br>
><br>
> I tried to run a simple pyspark application on
spark in local mode,<br>
> and was hoping to get the coverage data file
generated somewhere for<br>
> future use.<br>
><br>
> 0. I put the following lines at the head of<br>
> /usr/lib/python2.7/sitecustomize.py<br>
> import coverage<br>
> coverage.process_startup()<br>
><br>
> 1. I set the following env variable in ~/.bashrc<br>
> export
COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>
><br>
> 2. the config file
'/home/kunchen/git/es-signal/.coveragerc' has the<br>
> following content<br>
> [run]<br>
> parallel = True<br>
> concurrency = multiprocessing<br>
> omit =<br>
> *dist-packages*<br>
> *pyspark*<br>
> *spark-1.5.2*<br>
> cover_pylib = False<br>
> data_file = /home/kunchen/.coverage<br>
><br>
> 3. I put ci3.py and test.py both<br>
> in /home/kunchen/Downloads/software/spark-1.5.2 (
my spark home )<br>
><br>
> 4. in my spark home, I ran the following command
to submit and run the<br>
> code.<br>
> spark-submit --master local --py-files=ci3.py
test.py<br>
><br>
><br>
> 5. after the application finished, I got two
coverage files in<br>
> /home/kunchen<br>
> .coverage.kunchen-es-pc.31117.003485<br>
> .coverage.kunchen-es-pc.31176.826660<br>
><br>
> but according to the process id in the file names
and the content of<br>
> those files, none of them was generated by the
spark worker process(or<br>
> thread? not sure here).<br>
><br>
> My question is what I have to do to get the
coverage data of the code<br>
> being executed by the spark workers?<br>
><br>
><br>
><br>
> _______________________________________________<br>
> testing-in-python mailing list<br>
> <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
> <a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm</a>><br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
testing-in-python mailing list<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
<a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
<br>
<br>
End of testing-in-python Digest, Vol 111, Issue 9<br>
*************************************************<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
testing-in-python mailing list
<a class="moz-txt-link-abbreviated" href="mailto:testing-in-python@lists.idyll.org">testing-in-python@lists.idyll.org</a>
<a class="moz-txt-link-freetext" href="http://lists.idyll.org/listinfo/testing-in-python">http://lists.idyll.org/listinfo/testing-in-python</a>
</pre>
</blockquote>
<br>
</body>
</html>