<div dir="ltr">Hi, <div><br><div>Thanks for the quick response.</div><div><br></div><div>I tried the pth way of starting coverage, it&#39;s working for the driver process, and still not the worker process.</div><div><br></div><div>And I tried to patch the coverage source code into printing message or writing something into a local file when it construct the Coverage instance ( of course from the coverage.process_startup() ), and the result is:</div><div><br></div><div>1. printing will got a java exception after spark-submit like the following</div><div><div>java.lang.IllegalArgumentException: port out of range:1668247142</div><div><span class="" style="white-space:pre">        </span>at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)</div><div><span class="" style="white-space:pre">        </span>at java.net.InetSocketAddress.&lt;init&gt;(InetSocketAddress.java:185)</div><div><span class="" style="white-space:pre">        </span>at java.net.Socket.&lt;init&gt;(Socket.java:241)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.scheduler.Task.run(Task.scala:88)</div><div><span class="" style="white-space:pre">        </span>at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)</div><div><span class="" style="white-space:pre">        </span>at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)</div><div><span class="" style="white-space:pre">        </span>at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)</div><div><span class="" style="white-space:pre">        </span>at java.lang.Thread.run(Thread.java:745)</div></div><div><br></div><div>2. writing a local file will have no effect at all, though writing file in my customized rdd map function will work. like the following</div><div><br></div><div><div>import os</div><div>from multiprocessing import *</div><div>pid = current_process().pid</div><div><br></div><div>def handle(sc, file, ofile):</div><div>    rd = sc.textFile(file)</div><div>    rd.map(mysub).saveAsTextFile(ofile)</div><div><br></div><div>def mysub(row):</div><div>    print &#39;from mapper process {0}&#39;.format(pid)</div><div>    print &#39;env : {0}&#39;.format(os.getenv(&#39;COVERAGE_PROCESS_START&#39;))</div><div>    f = open(&#39;/home/kunchen/{0}.txt&#39;.format(pid), &#39;a&#39;)</div><div>    f.writelines([row])</div><div>    f.close()</div><div><br></div><div>    return row.replace(&#39;,&#39;,&#39; &#39;).replace(&#39;.&#39;,&#39; &#39;).replace(&#39;-&#39;,&#39; &#39;).lower()</div><div><br></div><div class="gmail_extra">I&#39;m quite new to Spark and not sure how the worker process are executed. Anyone ever tried to tackle this problem?</div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 27, 2016 at 3:00 AM,  <span dir="ltr">&lt;<a href="mailto:testing-in-python-request@lists.idyll.org" target="_blank">testing-in-python-request@lists.idyll.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Send testing-in-python mailing list submissions to<br>

        <a href="mailto:testing-in-python@lists.idyll.org" target="_blank">testing-in-python@lists.idyll.org</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="http://lists.idyll.org/listinfo/testing-in-python" rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>

or, via email, send a message with subject or body &#39;help&#39; to<br>

        <a href="mailto:testing-in-python-request@lists.idyll.org" target="_blank">testing-in-python-request@lists.idyll.org</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:testing-in-python-owner@lists.idyll.org" target="_blank">testing-in-python-owner@lists.idyll.org</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than &quot;Re: Contents of testing-in-python digest...&quot;<br>

<br>

<br>

Today&#39;s Topics:<br>

<br>

   1. how to generate coverage info for pyspark applications (Kun Chen)<br>

   2. Re: how to generate coverage info for pyspark applications<br>

      (Ned Batchelder)<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Tue, 26 Apr 2016 21:01:02 +0800<br>

From: Kun Chen &lt;<a href="mailto:kunchen@everstring.com" target="_blank">kunchen@everstring.com</a>&gt;<br>

Subject: [TIP] how to generate coverage info for pyspark applications<br>

To: <a href="mailto:testing-in-python@lists.idyll.org" target="_blank">testing-in-python@lists.idyll.org</a><br>

Message-ID:<br>

        &lt;<a href="mailto:CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com" target="_blank">CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com</a>&gt;<br>

Content-Type: text/plain; charset=&quot;utf-8&quot;<br>

<br>

Hi, all<br>

<br>

I tried to run a simple pyspark application on spark in local mode, and was<br>

hoping to get the coverage data file generated somewhere for future use.<br>

<br>

0. I put the following lines at the head of<br>

/usr/lib/python2.7/sitecustomize.py<br>

import coverage<br>

coverage.process_startup()<br>

<br>

1. I set the following env variable in ~/.bashrc<br>

export COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>

<br>

2. the config file &#39;/home/kunchen/git/es-signal/.coveragerc&#39; has following<br>

content<br>

[run]<br>

parallel = True<br>

concurrency = multiprocessing<br>

omit =<br>

    *dist-packages*<br>

    *pyspark*<br>

    *spark-1.5.2*<br>

cover_pylib = False<br>

data_file = /home/kunchen/.coverage<br>

<br>

3. I put ci3.py and test.py both<br>

in /home/kunchen/Downloads/software/spark-1.5.2 ( my spark home )<br>

<br>

4. in my spark home, I ran the following command to submit and run the code.<br>

spark-submit --master local --py-files=ci3.py test.py<br>

<br>

<br>

6. after the application finished, I got two coverage files in /home/kunchen<br>

.coverage.kunchen-es-pc.31117.003485<br>

.coverage.kunchen-es-pc.31176.826660<br>

<br>

but according to the process id in the file names and the content of those<br>

files, none of them was generated by the spark worker process(or thread?<br>

not sure here).<br>

<br>

My question is what I have to do to get the coverage data of the code being<br>

executed by the spark workers?<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm" rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm</a>&gt;<br>

-------------- next part --------------<br>

A non-text attachment was scrubbed...<br>

Name: ci3.py<br>

Type: text/x-python<br>

Size: 369 bytes<br>

Desc: not available<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py" rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py</a>&gt;<br>

-------------- next part --------------<br>

A non-text attachment was scrubbed...<br>

Name: test.py<br>

Type: text/x-python<br>

Size: 231 bytes<br>

Desc: not available<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py" rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py</a>&gt;<br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Tue, 26 Apr 2016 11:32:38 -0400<br>

From: Ned Batchelder &lt;<a href="mailto:ned@nedbatchelder.com" target="_blank">ned@nedbatchelder.com</a>&gt;<br>

Subject: Re: [TIP] how to generate coverage info for pyspark<br>

        applications<br>

To: <a href="mailto:testing-in-python@lists.idyll.org" target="_blank">testing-in-python@lists.idyll.org</a><br>

Message-ID: &lt;<a href="mailto:c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com" target="_blank">c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com</a>&gt;<br>

Content-Type: text/plain; charset=&quot;windows-1252&quot;; Format=&quot;flowed&quot;<br>

<br>

I don&#39;t know anything about spark, so I&#39;m not sure how it starts up its<br>

workers.  My first suggestion would be to use the .pth method of<br>

starting coverage in subprocesses rather than the sitecustomize<br>

technique, and see if that works better.<br>

<br>

--Ned.<br>

<br>

<br>

On 4/26/16 9:01 AM, Kun Chen wrote:<br>

&gt; Hi, all<br>

&gt;<br>

&gt; I tried to run a simple pyspark application on spark in local mode,<br>

&gt; and was hoping to get the coverage data file generated somewhere for<br>

&gt; future use.<br>

&gt;<br>

&gt; 0. I put the following lines at the head of<br>

&gt; /usr/lib/python2.7/sitecustomize.py<br>

&gt; import coverage<br>

&gt; coverage.process_startup()<br>

&gt;<br>

&gt; 1. I set the following env variable in ~/.bashrc<br>

&gt; export COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>

&gt;<br>

&gt; 2. the config file &#39;/home/kunchen/git/es-signal/.coveragerc&#39; has<br>

&gt; following content<br>

&gt; [run]<br>

&gt; parallel = True<br>

&gt; concurrency = multiprocessing<br>

&gt; omit =<br>

&gt;     *dist-packages*<br>

&gt;     *pyspark*<br>

&gt;     *spark-1.5.2*<br>

&gt; cover_pylib = False<br>

&gt; data_file = /home/kunchen/.coverage<br>

&gt;<br>

&gt; 3. I put ci3.py and test.py both<br>

&gt; in /home/kunchen/Downloads/software/spark-1.5.2 ( my spark home )<br>

&gt;<br>

&gt; 4. in my spark home, I ran the following command to submit and run the<br>

&gt; code.<br>

&gt; spark-submit --master local --py-files=ci3.py test.py<br>

&gt;<br>

&gt;<br>

&gt; 6. after the application finished, I got two coverage files in<br>

&gt; /home/kunchen<br>

&gt; .coverage.kunchen-es-pc.31117.003485<br>

&gt; .coverage.kunchen-es-pc.31176.826660<br>

&gt;<br>

&gt; but according to the process id in the file names and the content of<br>

&gt; those files, none of them was generated by the spark worker process(or<br>

&gt; thread? not sure here).<br>

&gt;<br>

&gt; My question is what I have to do to get the coverage data of the code<br>

&gt; being executed by the spark workers?<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; _______________________________________________<br>

&gt; testing-in-python mailing list<br>

&gt; <a href="mailto:testing-in-python@lists.idyll.org" target="_blank">testing-in-python@lists.idyll.org</a><br>

&gt; <a href="http://lists.idyll.org/listinfo/testing-in-python" rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>

<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: &lt;<a href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm" rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm</a>&gt;<br>

<br>

------------------------------<br>

<br>

_______________________________________________<br>

testing-in-python mailing list<br>

<a href="mailto:testing-in-python@lists.idyll.org" target="_blank">testing-in-python@lists.idyll.org</a><br>

<a href="http://lists.idyll.org/listinfo/testing-in-python" rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>

<br>

<br>

End of testing-in-python Digest, Vol 111, Issue 9<br>

*************************************************<br>

</blockquote></div><br></div></div></div></div>