<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>I don't know anything about Spark, so I'm not sure how it
      starts up its workers.  My first suggestion would be to use the
      .pth method of starting coverage in subprocesses rather than the
      sitecustomize technique, and see if that works better.</p>
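    <p>For example, something like this (a minimal sketch: the file
      name coverage-startup.pth is arbitrary, and the directory
      depends on where your Python looks for site-packages, e.g.
      dist-packages on Debian/Ubuntu).  Python executes lines
      beginning with "import" in .pth files at interpreter startup, so
      every process using that site-packages directory starts coverage
      automatically:</p>
    <pre># /usr/lib/python2.7/site-packages/coverage-startup.pth  (name is illustrative)
# Lines starting with '#' are skipped; lines starting with 'import' are
# executed at interpreter startup, which is what makes this trick work.
import coverage; coverage.process_startup()</pre>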
    <p>--Ned.<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 4/26/16 9:01 AM, Kun Chen wrote:<br>
    </div>
    <blockquote
cite="mid:CAPTVxySrrmtV7kqYap0JJUnrctR5ifspXCwjtPFL1TCodjDdcQ@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <div dir="ltr">Hi, all
            <div><br>
            </div>
            <div>I tried to run a simple PySpark application on Spark
              in local mode, and was hoping to get the coverage data
              file generated somewhere for future use.</div>
            <div><br>
            </div>
            <div>0. I put the following lines at the top of
              /usr/lib/python2.7/sitecustomize.py:</div>
            <pre>import coverage
coverage.process_startup()  # start collecting coverage in this new process</pre>
            <div><br>
            </div>
            <div>1. I set the following env variable in ~/.bashrc:</div>
            <pre>export COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc</pre>
            <div><br>
            </div>
            <div>2. the config file
              '/home/kunchen/git/es-signal/.coveragerc' has the
              following content:</div>
            <pre>[run]
parallel = True
concurrency = multiprocessing
omit =
    *dist-packages*
    *pyspark*
    *spark-1.5.2*
cover_pylib = False
data_file = /home/kunchen/.coverage</pre>
            <div><br>
            </div>
            <div>3. I put both ci3.py and test.py in
              /home/kunchen/Downloads/software/spark-1.5.2 (my Spark
              home).</div>
            <div><br>
            </div>
            <div>4. In my Spark home, I ran the following command to
              submit and run the code:</div>
            <pre>spark-submit --master local --py-files=ci3.py test.py</pre>
            <div><br>
            </div>
            <div>5. After the application finished, I got two coverage
              data files in /home/kunchen:</div>
            <pre>.coverage.kunchen-es-pc.31117.003485
.coverage.kunchen-es-pc.31176.826660</pre>
            <div><br>
            </div>
            <div>But according to the process IDs in the file names
              and the contents of those files, neither of them was
              generated by a Spark worker process (or thread? I'm not
              sure which).</div>
            <div><br>
            </div>
            <div>My question is: what do I have to do to get coverage
              data for the code executed by the Spark workers?</div>
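            <div><br>
            </div>
            <div>(One detail that might matter: variables exported in
              ~/.bashrc are only inherited by processes descended from
              a shell that read it, so depending on how Spark launches
              its workers they may never see COVERAGE_PROCESS_START.
              A sketch using Spark's documented spark.executorEnv.*
              settings to forward the variable to worker processes
              explicitly; whether that reaches the Python workers in
              local mode is an assumption worth testing:)</div>
            <pre>spark-submit --master local \
  --conf spark.executorEnv.COVERAGE_PROCESS_START=/home/kunchen/git/es-signal/.coveragerc \
  --py-files=ci3.py test.py</pre>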
          </div>
        </div>
        <br>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
testing-in-python mailing list
<a class="moz-txt-link-abbreviated" href="mailto:testing-in-python@lists.idyll.org">testing-in-python@lists.idyll.org</a>
<a class="moz-txt-link-freetext" href="http://lists.idyll.org/listinfo/testing-in-python">http://lists.idyll.org/listinfo/testing-in-python</a>
</pre>
    </blockquote>
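    <p>One follow-up note: with parallel = True, each measured process
      writes its own .coverage.* data file, so once the worker files
      do show up they still have to be merged before reporting.  A
      minimal sketch with coverage.py's own command line, run from the
      directory holding the data files:</p>
    <pre>coverage combine    # merge the .coverage.* files into a single .coverage
coverage report -m  # summarize coverage, listing the missing line numbers</pre>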
    <br>
  </body>
</html>