Armen Zambrano's battlefield: 2016

Friday, October 28, 2016

Usability improvements for Firefox automation initiative - Status update #8

NEW:
* Debugging on remote workers is no longer a main priority for this quarter since it has been completed

* We've added an investigation into hyper chunking

In this update we will look at the progress made in the last two weeks.

This quarter’s main focus is on improving end to end times on Try (Thunder Try project).

For all bugs and priorities you can check out:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Thunder Try - Improve end to end times on try

---------------------------------------------

Project #1 - Artifact builds on automation

##########################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

All --artifact build jobs now provide crash-reporter symbols. This means test jobs scheduled against linux64 debug work with --artifact.

Upcoming:

Patch in review - debug artifact builds on try. (Right now --artifact always results in an opt artifact build.)

Project #2 - S3 Cloud Compiler Cache

####################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:

Green builds on all platforms on Try. We are now planning to compare build times against the existing Python sccache.

Project #3 - Metrics

####################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Stabilized dashboard http://people.mozilla.org/~klahnakoski/MoBuildbotTimings/End-to-End.html
TaskCluster ingestion of Mozharness steps is now working, but not shown on charts yet.

Upcoming:

Still has the “small population” problem, where outliers make the 90th percentile look big.
Add a TaskCluster view of End-to-End times
Start knocking off bugs in the dashboard

Project #4 - Build automation improvements

##########################################

Nothing new for this edition

Project #5 - Hyper chunking

###########################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262834

Nothing new for this edition

Project #6 - Run Web platform tests from the source checkout

############################################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286900

Feature blocked on misconfigured EBS volumes in AWS (bug 1305174)

This work by Zambrano Gasparnian, Armen is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Wednesday, October 12, 2016

Usability improvements for Firefox automation initiative - Status update #7

In this update we will look at the progress made in the last two weeks.

A reminder that this quarter’s main focus is on:

Debugging tests on interactive workers (only Linux on TaskCluster)
Improve end to end times on Try (Thunder Try project)

For all bugs and priorities you can check out the project management page for it:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Status update:

Debugging tests on interactive workers

---------------------------------------------------

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262260

Accomplished recently:

No new progress

Upcoming:

Android xpcshell
Blog/newsgroup post

Thunder Try - Improve end to end times on try

---------------------------------------------

Project #1 - Artifact builds on automation

##########################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

The following platforms are now supported: linux, linux64, macosx64, win32, win64
An option was added to download symbols for our compiled artifacts during the artifact build

Upcoming:

Debug artifact builds on try. (Right now --artifact always results in an opt artifact build.)
Android artifact builds on try, thanks to nalexander.

Project #2 - S3 Cloud Compiler Cache

####################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Some of the issues found last quarter for this project were around NSS, which was also in need of replacing. This project was put on hold until the NSS work was completed. We’re going to resume this for Q4.

Project #3 - Metrics

####################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Brittle running example here: http://people.mozilla.org/~klahnakoski/temp/End-to-End.html
Problem with low populations: the 90th percentile is effectively everything, and a couple of outliers impact the End-to-End time shown.

Upcoming:

Figure out what to do with these small populations:

Ignore them - too small to be statistically significant
Aggregate them - All the rarely run suites can be pushed into an “Other” category
Show some other statistic: Maybe median is better?
Show median of past day, and 90% for the week: That can show the longer trend, and short term situation, for better overall feel.
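
To make the small-population problem concrete, here is a minimal sketch with made-up numbers. The durations and the `percentile` helper are illustrative assumptions, not data or code from the dashboard, and the helper uses the simple nearest-rank method.

```python
import math
import statistics


def percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of the data
    # at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end-to-end times (minutes) for a rarely run suite.
# Four ordinary runs and one outlier.
small_population = [30, 32, 35, 31, 180]

p90 = percentile(small_population, 90)
median = statistics.median(small_population)

print("90th percentile:", p90)
print("median:", median)
```

With only five samples the 90th percentile lands on the single outlier, while the median stays close to the typical run. That is the trade-off behind the "maybe median is better?" option above.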

Project #4 - Build automation improvements

##########################################

Accomplished recently:

Bug 1306167 - Updated build machines to use SSD. Linux PGO builds now take half the time

https://bugzilla.mozilla.org/show_bug.cgi?id=1306167#c11

Project #5 - Run Web platform tests from the source checkout

############################################################

Nothing to add on this edition.

Wednesday, September 28, 2016

[NEW] Added build status updates - Usability improvements for Firefox automation initiative - Status update #6

[NEW] Starting with this newsletter we will report on build automation improvements, since they help with the end to end time of your pushes

In this update we will look at the progress made in the last two weeks.

A reminder that this quarter’s main focus is on:

Debugging tests on interactive workers (only Linux on TaskCluster)
Improve end to end times on Try (Thunder Try project)

For all bugs and priorities you can check out the project management page for it:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Status update:

Debugging tests on interactive workers

---------------------------------------------------

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262260

Accomplished recently:

Fixed regression that broke the interactive wizard
Support for Android reftests landed

Upcoming:

Support for Android xpcshell
Video demonstration

Thunder Try - Improve end to end times on try

---------------------------------------------

Project #1 - Artifact builds on automation

##########################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

Windows and Mac artifact builds are soon to land
|mach try| now supports --artifact option
Compiled-code test jobs error out early when run with --artifact on try

Upcoming:

Windows and Mac artifact builds available on Try
Fix triggering of test jobs on Buildbot with artifact build

Project #2 - S3 Cloud Compiler Cache

####################################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Nothing new in this edition.

Project #3 - Metrics

####################

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Drill-down charts:

Which lead to a detailed view:

With optional wait times included (missing 10% outliers, so looks almost the same):

Upcoming:

Iron out interactivity bugs
Show outliers
Post these (static) pages to my people page
Fix ActiveData production to handle these queries (I am currently using a development version of ActiveData, but that version has some nasty anomalies)

Project #4 - Build automation improvements

##########################################

Upcoming:

We identified an interaction with EBS in AWS that is likely making several parts of automation slower than they should be (https://bugzilla.mozilla.org/show_bug.cgi?id=1305174)

Project #5 - Run Web platform tests from the source checkout

############################################################

Accomplished recently:

WPT is now running from the source checkout in automation

Upcoming:

There are still parts in automation relying on a test zip. The next step is to minimize those so you can get a loner, pull any revision from any repo, and test WPT changes in an environment that is exactly what the automation tests run in.

Other

#####

Bug 1300812 - Make Mozharness downloads and unpacks actions handle better intermittent S3/EC2 issues

This adds retry logic to reduce intermittent oranges
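
As a rough illustration of what retry logic for flaky downloads can look like, here is a minimal sketch with exponential backoff. It is not the patch from bug 1300812: the `retry` helper, its parameters and `flaky_download` are hypothetical names used only for this example.

```python
import time


def retry(action, attempts=3, initial_delay=1.0, backoff=2.0,
          retry_exceptions=(IOError,), sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_exceptions:
            if attempt == attempts:
                raise  # Out of attempts: surface the last failure.
            sleep(delay)
            delay *= backoff  # Wait longer before each new attempt.


# Simulated download that fails twice before succeeding.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated S3 hiccup")
    return "payload"


# Record the delays instead of really sleeping, to keep the demo fast.
delays = []
result = retry(flaky_download, sleep=delays.append)
print(result, calls["count"], delays)
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets a test check the backoff schedule without waiting.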

Tuesday, September 13, 2016

Increasing test coverage

Last quarter I spent some time increasing mozci's test coverage. Here are some notes I took to help me remember in the future how to do it.

Here's some of what I did:

Read Python's page about increasing test coverage

I wanted to learn what core Python recommends
They recommend using coverage.py

Quick start with coverage.py

"coverage run --source=mozci -m py.test test" to gather data
"coverage html" to generate an html report
"/path/to/firefox firefox htmlcov/index.html" to see the report

NOTE: We have coverage reports from automation in coveralls.io

https://coveralls.io/github/mozilla/mozilla_ci_tools

If you find code that needs to be ignored, read this.

Use "# pragma: no cover" in specific lines
You can also create rules of exclusion
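
Here is a small sketch of the pragma in use. The module below is invented for illustration (it is not mozci code); the point is only where the comment goes.

```python
import sys


def describe_platform(platform=sys.platform):
    """Return a short label for the given platform string."""
    if platform.startswith("linux"):
        return "linux"
    if platform.startswith("win"):
        return "windows"
    return "other"


def main():  # pragma: no cover
    # Entry point glue that the unit tests do not exercise.
    print(describe_platform())


if __name__ == "__main__":  # pragma: no cover
    main()
```

Placing the pragma on a `def` or `if` line excludes the whole block under it from the report. For a pattern that repeats across files, an `exclude_lines` entry in the `[report]` section of `.coveragerc` can do the same job without a comment on every line.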

Once you get closer to 100% you might want to consider increasing branch coverage instead of line coverage

Read more in here.
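
The difference between the two is easiest to see on a tiny function. This is a made-up example (`normalize_path` is hypothetical, not part of mozci): one test call runs every line, yet the path where the `if` is skipped is never taken.

```python
def normalize_path(path, strip_slash=True):
    """Return path, optionally without a trailing slash."""
    if strip_slash:
        path = path.rstrip("/")
    return path


# This single call executes every line, so line coverage reports 100%.
assert normalize_path("tests/") == "tests"

# Branch coverage also wants the case where the `if` body is skipped.
assert normalize_path("tests/", strip_slash=False) == "tests/"
```

Running with "coverage run --branch" instead of plain "coverage run" makes the report flag the missing case when only the first assertion is present.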

Once you pick a module to increase coverage

Make changes, then re-run "coverage run" and "coverage html".
Reload the html page to see the new results

After some work on this, I realized that my preferred way to improve tests is to focus on the simplest unit tests. I say this because integration tests require real work and thought about how to test them properly, rather than *just* increasing coverage for the sake of it.

Tuesday, August 02, 2016

Usability improvements for Firefox automation initiative - Status update #2

In this update, we will look at the progress made since our initial update.

A reminder that this quarter’s main focus is on:

Debugging tests on interactive workers (only Linux on TaskCluster)
Improve end to end times on Try (Thunder Try project)

For all bugs and priorities you can check out the project management page for it:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

Debugging tests on interactive workers

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1262260

Accomplished recently:

Bug 1285582 - Fixed Xvfb startup issue
Bug 1288827 - Improved mochitest UX (no longer need --appname, paths normalized)
Bug 1289879 - Uses mozharness venv if available

Upcoming:

Support for smaller test harnesses (Cpp, Mn, wpt, etc)
Improved one-click-loaner UX

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1285582

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1288827

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1289879

Thunder Try - Improve end to end times on try

Project #1 - Artifact builds on automation

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

No news for this edition and probably the next one.

Project #2 - S3 Cloud Compiler Cache

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:

Working on testing sccache re-write on Try
More news on following update

Project #3 - Metrics

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Bug 1242017 - Metrics team will configure ingestion point into Telemetry

Upcoming:

Bug 1258861 - Working on underlying data model at the moment:

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1242017

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1258861

Other

Bug 1287604 - Experiment with different AWS instance types for TC linux64 builds

Some initial experiments have shown we can shave 20 minutes off an average linux64 build by using more powerful AWS instances, with a reasonable cost tradeoff. We’ll start the work of migrating to these new instances soon.

Bug 1272083 - Downloading and unzipping should be performed as data is received

Project for investigation has been started: https://github.com/armenzg/download_and_unpack
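
To show the general idea of unpacking while data is still arriving, here is a minimal sketch. It is not taken from the download_and_unpack repository, and it swaps in a gzip-compressed tar archive because that format can be read strictly front to back; ZIP keeps its central directory at the end of the file, so it needs a different approach. The `unpack_stream` helper is a hypothetical name.

```python
import io
import os
import tarfile
import tempfile


def unpack_stream(stream, destination):
    """Extract a gzip-compressed tar archive while reading it front to back."""
    names = []
    # Mode "r|gz" reads the input as a forward-only stream (no seeking), so
    # each member can be written to disk as soon as its bytes have arrived.
    # Only do this with archives you trust.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            archive.extract(member, path=destination)
            names.append(member.name)
    return names


# Build a tiny archive in memory to stand in for a network response.
payload = b"hello"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="tests/hello.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tempfile.TemporaryDirectory() as workdir:
    extracted_names = unpack_stream(buffer, workdir)
    with open(os.path.join(workdir, "tests", "hello.txt"), "rb") as handle:
        extracted_payload = handle.read()

print(extracted_names)
```

In practice the `stream` argument could be any object with a `read()` method, such as the response returned by `urllib.request.urlopen`, so nothing has to be written to disk before extraction starts.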

Bug 1286336 - Improve interaction of automation with version control

Buildbot AMIs now seeded with mozilla-unified repo (Bug 1232442)
TaskCluster decision and various lint/test tasks now use `hg robustcheckout` and share caches more optimally (Bug 1247168)

Flake8 tasks now complete in as little as 9s (~3m before)
Decision tasks now complete in <60s on average

Some TaskCluster tasks now share VCS checkouts on Try (Bug 1289643)

Tasks will complete faster on Try due to not having to perform full VCS checkout

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1287604

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1272083

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1247168

Thursday, July 21, 2016

Mozci and pulse actions contributions opportunities

We've recently finished a season of feature development on pulse_actions, adding TaskCluster support for adding new jobs to Treeherder.

I'm now looking at what optimizations or features are left to complete. If you would like to contribute feel free to let me know.

Here's some highlighted work (based on pulse_actions issues and bugs):

Use Treeherder as the source for querying jobs

This will help us save money in Heroku since using Buildapi + buildjson files is memory hungry and requires us to use bigger Heroku nodes.

Allow controlling pulse_actions production behaviour with environment variables

This is important to help us change the behaviour of the Heroku app without having to commit any code. I've used this in the past to modify the logging level when debugging an issue.

This is also useful if we want to have different pipelines in Heroku.
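
As a sketch of the logging-level case mentioned above, the snippet below reads the level from an environment variable. The variable name `LOG_LEVEL` and the helper `log_level_from_env` are assumptions made for this example, not what pulse_actions actually uses.

```python
import logging
import os


def log_level_from_env(environ=os.environ, variable="LOG_LEVEL",
                       default="INFO"):
    """Map an environment variable to a logging level constant."""
    name = environ.get(variable, default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        raise ValueError("Unknown log level: %s" % name)
    return level


# A plain dict stands in for the process environment in this demo.
# In an app you would call log_level_from_env() with no arguments.
debug_level = log_level_from_env({"LOG_LEVEL": "debug"})
default_level = log_level_from_env({})

logging.basicConfig(level=debug_level)
logging.getLogger(__name__).debug("Debug logging is enabled")
```

On Heroku the value would come from a config var, so the level can be changed without committing any code.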

Create a staging pipeline

Having Heroku pipelines helps us test different versions of the software.

This is useful if we want to have a version running from 'master' against the staging version of Treeherder.

It would also help contributors to have a version of their pull requests running live.

Create test plan

We don't have any tests running. We need to determine how to run a minimum set of tests to have some confidence in the product.

This needs integration tests of Pulse messages.

Nightly jobs started with Treeherder do not schedule properly

The comment in the bug is rather accurate and it shows that there are many small things that need fixing.

Manual backfilling does not take advantage of TC/BBB scheduling

Manual backfilling uses Buildapi to schedule jobs. If we switched to scheduling via TaskCluster/Buildbot-bridge we would get better results since we can guarantee proper scheduling of a build + associated dependent jobs. Buildapi does not give us this guarantee. This is mainly useful when backfilling PGO test and talos jobs.

If instead you're interested in contributing to mozci you can have a look at the issues.

Tuesday, July 19, 2016

Usability improvements for Firefox automation initiative - Status update #1

The developer survey conducted by Engineering Productivity last fall indicated that debugging test failures that are reported by automation is a significant frustration for many developers. In fact, it was the biggest deficit identified by the survey. As a result, the Engineering Productivity Team (aka A-Team) is working on improving the user experience for debugging test failures in our continuous integration and speeding up the turnaround for Try server jobs.

This quarter’s main focus is on:

Debugging tests on interactive workers (only Linux on TaskCluster)
Improve end to end times on Try (Thunder Try project)

For all bugs and priorities you can check out the project management page for it:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Debugging_UX_improvements

In this email you will find the progress we’ve made recently. In future updates you will see a delta from this email.

PS: These status updates will be fortnightly

Debugging tests on interactive workers

Accomplished recently:

Landed support for running reftest and xpcshell via tests.zip
Many UX improvements to the interactive loaner workflow

Upcoming:

Make sure Xvfb is running so you can actually run the tests!
Mochitest support + all other harnesses

Thunder Try - Improve end to end times on try

Project #1 - Artifact builds on automation

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1284882

Accomplished recently:

Landed prerequisites for Windows and OS X artifact builds on try.
Identified which tests should be skipped with artifact builds

Upcoming:

Provide a try syntax flag to trigger only artifact builds instead of full builds; starting with opt Linux 64.

Project #2 - S3 Cloud Compiler Cache

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1280641

Accomplished recently:

Sccache’s Rust re-write has reached feature parity with Python’s sccache
Now testing sccache2 on Try

Upcoming:

We want to roll out a two-tier sccache for Try, which will enable it to benefit from cache objects from integration branches

Project #3 - Metrics

Tracking bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1286856

Accomplished recently:

Preliminary analytics / research based on job data from Treeherder found at: http://nbviewer.jupyter.org/url/people.mozilla.org/%7Ewlachance/try%20analysis.ipynb

Which jobs finish last?
Which jobs have the highest wait times?
Which jobs have the longest total wall clock time (i.e. are the largest consumers of resources)
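
The three questions above can be sketched on toy data as follows. The `Job` record, its field names and the numbers are hypothetical; they are not Treeherder's schema or results from the notebook.

```python
from collections import namedtuple

Job = namedtuple("Job", ["name", "submitted", "started", "finished"])

# Hypothetical jobs for one push; times are minutes since the push landed.
jobs = [
    Job("build-linux64", 0, 2, 42),
    Job("mochitest-1", 42, 70, 85),
    Job("xpcshell", 42, 43, 60),
    Job("talos", 42, 60, 95),
]


def wait_time(job):
    """Minutes the job spent queued before a worker picked it up."""
    return job.started - job.submitted


def run_time(job):
    """Minutes of wall clock time the job spent running."""
    return job.finished - job.started


finishes_last = max(jobs, key=lambda job: job.finished)
longest_wait = max(jobs, key=wait_time)
longest_run = max(jobs, key=run_time)

print("finishes last:", finishes_last.name)
print("longest wait:", longest_wait.name)
print("longest run:", longest_run.name)
```

In this toy data each question has a different answer, which is why it helps to look at all three rather than a single end-to-end number.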

Upcoming:

Putting Mozharness steps’ data inside Treeherder’s database for aggregate analysis

Other

Upcoming:

TaskCluster Linux builds are currently built using a mix of m3/r3/c3 2xlarge AWS instances, depending on pricing and availability. We’re going to be looking to assess the effects on build speeds of using more powerful AWS instance types, as one potential way of reducing e2e Try times.

https://bugzilla.mozilla.org/show_bug.cgi?id=1287604

Thursday, June 30, 2016

Adding new jobs to Treeherder is now transparent

As of today, when you add new jobs to a push on Treeherder (among other actions), a job named 'Sch' (for scheduling) will get scheduled.

This gives two main advantages:

You can have access to the output of how the request went
A link to file a bug under the right component and CC me

You can see in this screenshot what the job looks like:

In this push you can see three 'Sch' jobs. Each one is for a different action taken.

Backfill a job - Link to log
Add new jobs - Link to log
Trigger all talos jobs - Link to log

You can find the link to filing a bug after you do this:

Click on job (the job's panel will load at the bottom of the page)
Click on "Job details"
You will see "File bug" (see the screenshot below)

Thanks for reading and feel free to file bugs to make the output more understandable!
