<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I'll take this conversation private.</p>
<p>--Ned.<br>
</p>
<div class="moz-cite-prefix">On 4/26/16 8:53 PM, Kun Chen wrote:<br>
</div>
<blockquote
cite="mid:CAPTVxyShGtbBnhnNKdcm1m5wtGQckyKvf7t9EJ0Ce+XY35gTiA@mail.gmail.com"
type="cite">
<div dir="ltr">Hi,
<div><br>
<div>Thanks for the quick response.</div>
<div><br>
</div>
<div>I tried the .pth way of starting coverage; it works for the
driver process, but still not for the worker process.</div>
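For reference, the ".pth way" is coverage.py's documented mechanism for measuring subprocesses: a one-line file dropped into site-packages. The exact path below is an assumption for this Python 2.7 setup, not something stated in the thread:

```
# coverage.pth -- place in the interpreter's site-packages directory,
# e.g. /usr/lib/python2.7/dist-packages/ (path is an assumption).
# site.py executes .pth lines that start with "import" at every
# interpreter startup, so each new python process starts coverage itself.
import coverage; coverage.process_startup()
```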
<div><br>
</div>
<div>I also tried patching the coverage source code to print a message,
or to write something to a local file, when it constructs the Coverage
instance (from coverage.process_startup(), of course), and the results
were:</div>
<div><br>
</div>
<div>1. printing got a Java exception after spark-submit, like the
following (my guess: the extra text on stdout is read by Spark where
the daemon's port number is expected, hence the nonsense port
value):</div>
<div>
<pre>java.lang.IllegalArgumentException: port out of range:1668247142
	at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
	at java.net.InetSocketAddress.&lt;init&gt;(InetSocketAddress.java:185)
	at java.net.Socket.&lt;init&gt;(Socket.java:241)
	at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
	at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)</pre>
</div>
<div><br>
</div>
<div>2. writing to a local file has no effect at all, though writing a
file from my customized rdd map function does work, like the
following</div>
<div><br>
</div>
<div>
<pre>import os
from multiprocessing import current_process

# note: pid is evaluated at import time, in whichever process
# imports this module
pid = current_process().pid

def handle(sc, file, ofile):
    rd = sc.textFile(file)
    rd.map(mysub).saveAsTextFile(ofile)

def mysub(row):
    print 'from mapper process {0}'.format(pid)
    print 'env: {0}'.format(os.getenv('COVERAGE_PROCESS_START'))
    f = open('/home/kunchen/{0}.txt'.format(pid), 'a')
    f.writelines([row])
    f.close()
    return row.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower()</pre>
<div><br>
</div>
<div class="gmail_extra">I'm quite new to Spark and not sure
how the worker process are executed. Anyone ever tried to
tackle this problem?</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Apr 27, 2016 at 3:00 AM,
<span dir="ltr"><<a moz-do-not-send="true"
href="mailto:testing-in-python-request@lists.idyll.org"
target="_blank">testing-in-python-request@lists.idyll.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px
0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Send
testing-in-python mailing list submissions to<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web,
visit<br>
<a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
or, via email, send a message with subject or body
'help' to<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python-request@lists.idyll.org"
target="_blank">testing-in-python-request@lists.idyll.org</a><br>
<br>
You can reach the person managing the list at<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python-owner@lists.idyll.org"
target="_blank">testing-in-python-owner@lists.idyll.org</a><br>
<br>
When replying, please edit your Subject line so it is
more specific<br>
than "Re: Contents of testing-in-python digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. how to generate coverage info for pyspark
applications (Kun Chen)<br>
2. Re: how to generate coverage info for pyspark
applications<br>
(Ned Batchelder)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Tue, 26 Apr 2016 21:01:02 +0800<br>
From: Kun Chen <<a moz-do-not-send="true"
href="mailto:kunchen@everstring.com" target="_blank">kunchen@everstring.com</a>><br>
Subject: [TIP] how to generate coverage info for
pyspark applications<br>
To: <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
Message-ID:<br>
<<a moz-do-not-send="true"
href="mailto:CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com"
target="_blank">CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hi, all<br>
<br>
I tried to run a simple pyspark application on spark
in local mode, and was<br>
hoping to get the coverage data file generated
somewhere for future use.<br>
<br>
0. I put the following lines at the head of<br>
/usr/lib/python2.7/sitecustomize.py<br>
import coverage<br>
coverage.process_startup()<br>
<br>
1. I set the following env variable in ~/.bashrc<br>
export
COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>
<br>
2. the config file
'/home/kunchen/git/es-signal/.coveragerc' has the
following<br>
content<br>
[run]<br>
parallel = True<br>
concurrency = multiprocessing<br>
omit =<br>
*dist-packages*<br>
*pyspark*<br>
*spark-1.5.2*<br>
cover_pylib = False<br>
data_file = /home/kunchen/.coverage<br>
<br>
3. I put ci3.py and test.py both<br>
in /home/kunchen/Downloads/software/spark-1.5.2 ( my
spark home )<br>
<br>
4. in my spark home, I ran the following command to
submit and run the code.<br>
spark-submit --master local --py-files=ci3.py test.py<br>
<br>
<br>
5. after the application finished, I got two coverage
files in /home/kunchen<br>
.coverage.kunchen-es-pc.31117.003485<br>
.coverage.kunchen-es-pc.31176.826660<br>
<br>
but according to the process id in the file names and
the content of those<br>
files, none of them was generated by the spark worker
process(or thread?<br>
not sure here).<br>
<br>
My question is what I have to do to get the coverage
data of the code being<br>
executed by the spark workers?<br>
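As a side note not raised in the thread: because the config sets parallel = True, per-process data files like the two shown above have to be merged before reporting. With coverage.py that is the `combine` step; the working directory below comes from the data_file setting in the messages:

```shell
# Run where the data files were written (data_file = /home/kunchen/.coverage)
cd /home/kunchen
coverage combine   # merges the .coverage.kunchen-es-pc.* files into .coverage
coverage report
```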
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0001.htm</a>><br>
-------------- next part --------------<br>
A non-text attachment was scrubbed...<br>
Name: ci3.py<br>
Type: text/x-python<br>
Size: 369 bytes<br>
Desc: not available<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0002.py</a>><br>
-------------- next part --------------<br>
A non-text attachment was scrubbed...<br>
Name: test.py<br>
Type: text/x-python<br>
Size: 231 bytes<br>
Desc: not available<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/a4ea1c92/attachment-0003.py</a>><br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Tue, 26 Apr 2016 11:32:38 -0400<br>
From: Ned Batchelder <<a moz-do-not-send="true"
href="mailto:ned@nedbatchelder.com" target="_blank">ned@nedbatchelder.com</a>><br>
Subject: Re: [TIP] how to generate coverage info for
pyspark<br>
applications<br>
To: <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
Message-ID: <<a moz-do-not-send="true"
href="mailto:c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com"
target="_blank">c4f7fd43-60e8-b863-fb1c-862d301ae9a0@nedbatchelder.com</a>><br>
Content-Type: text/plain; charset="windows-1252";
Format="flowed"<br>
<br>
I don't know anything about spark, so I'm not sure how
it starts up its<br>
workers. My first suggestion would be to use the .pth
method of<br>
starting coverage in subprocesses rather than the
sitecustomize<br>
technique, and see if that works better.<br>
<br>
--Ned.<br>
<br>
<br>
On 4/26/16 9:01 AM, Kun Chen wrote:<br>
> Hi, all<br>
><br>
> I tried to run a simple pyspark application on
spark in local mode,<br>
> and was hoping to get the coverage data file
generated somewhere for<br>
> future use.<br>
><br>
> 0. I put the following lines at the head of<br>
> /usr/lib/python2.7/sitecustomize.py<br>
> import coverage<br>
> coverage.process_startup()<br>
><br>
> 1. I set the following env variable in ~/.bashrc<br>
> export
COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc<br>
><br>
> 2. the config file
'/home/kunchen/git/es-signal/.coveragerc' has the<br>
> following content<br>
> [run]<br>
> parallel = True<br>
> concurrency = multiprocessing<br>
> omit =<br>
> *dist-packages*<br>
> *pyspark*<br>
> *spark-1.5.2*<br>
> cover_pylib = False<br>
> data_file = /home/kunchen/.coverage<br>
><br>
> 3. I put ci3.py and test.py both<br>
> in /home/kunchen/Downloads/software/spark-1.5.2 (
my spark home )<br>
><br>
> 4. in my spark home, I ran the following command
to submit and run the<br>
> code.<br>
> spark-submit --master local --py-files=ci3.py
test.py<br>
><br>
><br>
> 5. after the application finished, I got two
coverage files in<br>
> /home/kunchen<br>
> .coverage.kunchen-es-pc.31117.003485<br>
> .coverage.kunchen-es-pc.31176.826660<br>
><br>
> but according to the process id in the file names
and the content of<br>
> those files, none of them was generated by the
spark worker process(or<br>
> thread? not sure here).<br>
><br>
> My question is what I have to do to get the
coverage data of the code<br>
> being executed by the spark workers?<br>
><br>
><br>
><br>
> _______________________________________________<br>
> testing-in-python mailing list<br>
> <a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
> <a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a moz-do-not-send="true"
href="http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm"
rel="noreferrer" target="_blank">http://lists.idyll.org/pipermail/testing-in-python/attachments/20160426/faa13f8a/attachment-0001.htm</a>><br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
testing-in-python mailing list<br>
<a moz-do-not-send="true"
href="mailto:testing-in-python@lists.idyll.org"
target="_blank">testing-in-python@lists.idyll.org</a><br>
<a moz-do-not-send="true"
href="http://lists.idyll.org/listinfo/testing-in-python"
rel="noreferrer" target="_blank">http://lists.idyll.org/listinfo/testing-in-python</a><br>
<br>
<br>
End of testing-in-python Digest, Vol 111, Issue 9<br>
*************************************************<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
testing-in-python mailing list
<a class="moz-txt-link-abbreviated" href="mailto:testing-in-python@lists.idyll.org">testing-in-python@lists.idyll.org</a>
<a class="moz-txt-link-freetext" href="http://lists.idyll.org/listinfo/testing-in-python">http://lists.idyll.org/listinfo/testing-in-python</a>
</pre>
</blockquote>
<br>
</body>
</html>