Eric Kow | 20 Apr 09:11 2012

Parallel Haskell Digest 9

Parallel Haskell Digest 9 
=========================

Hello Haskellers!

The Google Summer of Code is upon us and students have already submitted
their proposals. There are a couple potential projects on concurrent
data structures, which we'll have a look at below.

We will also be continuing our tour of Haskell concurrency abstractions
with our word month, *transaction*. This digest is brought to you by the
Parallel GHC project, an MSR-sponsored effort to push parallel Haskell
technologies out into the real world.  Check our project news below to
see how we're doing in that front.

Finally, you may heard Functional Propaganda from a Simon or two.
But how would the same message affect you if it came from a hardcore C++
hacker?  If you haven't seen it making the rounds, have a quick look at
Bartosz Milewski's [The Downfall of Imperative Programming][b0], and
maybe tell your imperative friends? The FP monster is hatching from its
academic egg; best be prepared!

[HTML version](http://www.well-typed.com/blog/65)

News
----------------------------------------------------------------------
Let's have a quick look at some of those GSoC proposals, particularly
those with a parallel Haskell theme.  It's all about performance this
year. Two of the proposals involve using or improving parallellism
in Haskell, and four are related to high-performance concurrency.

*    [Parallel CSG engine][g2]

     Constructive Solid Geometry (CSG) is the common approach to define
     complex bodies in engineering applications, ray tracing engines
     and physical simulators. Dmitry Dzhus' proposal is to deliver
     a fast parallel CSG engine using the Accelerate/Repa libraries
     for vectorised computations on GPU. 

*    [NUMA supporting features for GHC][g7]

     Sajith Sasidharan wants to reach into the GHC RTS, with the
     aim of “extracting that last ounce of performance from NUMA
     systems, by firing all CPUs if/when necessary and by ensuring a
     suitably NUMA-aware memory allocation behavior.”  Work in this
     area would benefit folks doing scientific computing, who may
     need great amounts of task and data parallelism.

*    [Windows support for the new GHC I/O manager][g6]

     The new I/O manager (GHC 7) brings great scalability improvements
     to Haskell — 10k simultaneous connections to your web server? no
     sweat.  Unfortunately for Haskellers on Windows, these improvements
     are currently only available on Unix.  Mikhail Glushenkov
     (who worked on GSoC last year to make improvements to Cabal)
     proposes to remedy this, making the new manager available on
     Windows. The challenge is that Windows I/O completion ports have
     slightly different semantics than their epoll/kqueue analogues on
     Unix.

*    [Lock-free hash table and priority queue][g3]

     More along the theme of high performance concurrency, Florian
     Hartwig's project aims to use GHC's new-ish atomic compare-and-swap
     (CAS) primitive to implement a high-performance lock-free hash
     table and a lock-free priority queue in Haskell.  The CAS
     instruction is a bit of hardware support for concurrency: it
     compares a value in memory to an expected value, and iff they
     match, replaces it with a new value.

*    [Implement Concurrent Hash-table / Hashmap][g5]

     Loren Davis also proposes a thread-safe mutable hash table
     implementation in Haskell. Loren is still weighing some of the
     alternative approaches suggested in the Haskell community. He is
     currently leaning towards a lock-stripping approach as it would
     fulfill an immediate need in the community.

*    [Concurrent Datastructures with good Performance][g4]

     Mathias Bartl aims to implement two concurrent data types 
     in Haskell, along with the usual battery of automated unit tests
     and benchmarks. The two that Mathias has in mind are a lock-free
     concurrent bag, and a concurrent priority queue.

Parallel GHC project update
----------------------------------------------------------------------
We have been continuing our work to make ThreadScope more helpful and
informative in tracking down your parallel and concurrent Haskell
performance problems.  We now have the ability to collect heap
statistics from the GHC runtime system and present them in ThreadScope.
These features will be available for users of a recent development GHC
(7.5.x) or the eventual 7.6 release. In addition to heap statistics,
we have been working on collecting information from hardware performance
counters, more specifically adding [support for Linux Perf
Events][hs-perf]. This could be useful for studying IO-heavy programs,
the idea being to visualise system calls as being distinct from
actual execution of Haskell code.

Speaking of performance, we are also continuing work on the new Cloud
Haskell implementation (see Duncan Coutts' [Fun in the Afternoon
Talk][t4]), and have lately been focused on reducing message latency.
This consists of work in three areas: improving binary serialisation,
investigating the implications of using `Chan` and `MVar` to pass
messages between threads, and perhaps improving the Haskell network
library implementation to compete better with a direct C implementation.

Word of the month
----------------------------------------------------------------------
Lately, we've been investigating the various ways Haskell helps us to
get to grips with concurrency. We talked about how the `MVar`,
the Haskell variant on locks,  allows us to share mutable variables
between threads, with some safeguards to help ensure consistency.
`MVar`'s may provide a nice high-level packaging around locks, but as we
mentioned in the last digest, they can still go horrifically wrong,
just like locks and synchronized methods in other languages.

We could go through the usual litany of reasons why locks are bad news,
but maybe a healthier approach would be for us to focus on the positive.
What do we want as programmers?  One possibility is what Simon PJ
([Beautiful Concurrency][stm-bc]) calls “modular programming”, the
ability to “[build] large programs by gluing together smaller programs”.
Locks fall short of helping us to meet this desire. First, because the
mere act of combining two locky programs may be inherently incorrect;
`withdraw acct1 amt >> deposit acct2 amt` is bad because of the gap
between the two actions where the money is in neither account.  Second,
because they seal off programs that we may otherwise like to moosh
together; if process p1 waits for input on a pipe, process p2 waits for
input on another pipe, how do wait for either of p1 or p2?  So how do we
wrestle back this modularity from our locky masters? And how do we make
programming fun again?

Our word of the month today is “transaction”. Software transactional
memory (STM) takes this idea of a transaction (a sequence of operations
that can be treated as a single atomic block) from database design.  The
Haskell implementation of STM was introduced in the 2005 paper
[Composable Memory Transactions][stm-05] by Harris *et. al.*  If
programming fun is what you're after, this is a paper that comes with
its own war-cry: “compositionality: a programmer can control atomicity
and blocking behaviour in a modular way that respects abstraction
barriers.”

Here are some quick highlights of the stm library. You may notice a couple of
things, first that this library introduces its own notion of variable, the
`TVar` and second that STM involves a new monad of its own.  Unlike the `MVar`
that we saw in the last digest, `TVar`'s do not have the same notion of being
full or empty; they just hold values plain and simple.  As for the STM monad,
we will see why it matters when we first try to do some IO.

     -- Control.Concurrent.STM
     data STM a
     instance Monad STM

     atomically :: STM a -> IO a

     data TVar a
     newTVar   :: a -> STM (TVar a)
     readTVar  :: TVar a -> STM a
     writeTVar :: TVar a -> a -> STM ()

     retry  :: STM a
     orElse :: STM a -> STM a -> STM a

To get a rough idea how some of this is used, let's look at the
transactional hello world, safely wiring money from one bank account
to another. For the purposes of our example, a bank account is just a
balance.  To get some money from an account, we read the balance,
subtract the amount, and write the new balance.  Making a deposit is
just withdrawing negative-money.

     type Account = TVar Int

     withdraw :: Account -> Int -> STM ()        
     withdraw acc amount = do
         bal <- readTVar acc
         writeTVar acc (bal - amount)

     deposit :: Account -> Int -> STM ()
     deposit acc amount = withdraw acc (- amount)

These primitive operations (withdraw and deposit) bring us to the
question of modularity. How do we know that it's safe to combine these
mini-programs into a bigger one?  In other words, if we write something
like `withdraw from 42 >> deposit to 42`, how do we avoid the
possibility of running into some twilight zone state where the money is
neither here nor there?  If people do strange things like simultaneously
transfering money in the other direction, will our program still work?

The answer lies in the distinction between `STM` (transactions) and `IO`
(actions). So long as we remain in STM, we are simply assembling
transactions, piecing smaller ones (“withdraw from a”) into larger ones
(“withdraw from a and deposit it to b”), but not actually performing
them! Having composed our transactions, we can use the function
`atomically` to turn them into IO actions.

     -- still just a transaction
     transfer :: Account -> Account -> Int -> STM ()
     transfer from to amount = do
         deposit to amount
         withdraw from amount

     -- now we have an action!
     doTransfer :: Account -> Account -> Int -> IO ()
     doTransfer from to amount =
         atomically $ transfer from to amount

And `atomically` does what it says on the tin: it runs the transaction
in a way that renders it indivisible, no twlight zones. Lest there is
any confusion, even though the transaction is indivisible, we can
*still* have concurrency during the course of the transaction, even
simultaneously read the affected `TVar`s if we want to.  The
indivisibility simply means that we never catch our transactions with
their pants down.  We neither read nonsense mid-transactional values
(simultaneous reads would either get the before or after value), nor
injecting values into a transaction mid-stream.

To get a feel for how these guarantees are possible, it could be useful
to take a peek under the hood. For each transaction that is run, GHC
maintains a thread-local log with an entry for each `TVar` accessed
in that transaction.  Each entry contains both the old value and the
new value that would be committed if the transaction is succesful.
This may be easier to see with a silly example:

    main = do
        v1 <- atomically $ newTVar "Joe"
        v2 <- atomically $ newTVar "Bob"
        done <- atomically $ newTVar 0
        -- thread A (you can just pretend forkDelayIO == forkIO)
        forkDelayIO . atomically $ do
                                  -- transaction log if A runs first
            x <- readTVar v1      -- v1: Joe -> Joe
            y <- readTVar v2      -- v1: Joe -> Joe, v2: Sue -> Sue 
            writeTVar v1 "Sue"    -- v1: Joe -> Sue
            writeTVar v2 x        -- v1: Joe -> Sue, v2: Bob -> Joe 
            writeTVar v1 y        -- v1: Joe -> Bob, v2: Bob -> Joe
            modifyTVar done (+1)  -- (stm 2.3 but easy to define)
        -- thread B 
        forkDelayIO . atomically $ do
                                  -- (if A runs first)
            writeTVar v1 "Jean"   -- v1: Bob -> Jean
            writeTVar v2 "Paul"   -- v1: Bob -> Jean, v2: Joe -> Paul
            modifyTVar done (+1)
        waitThreads 2 done
        people <- atomically $ do -- (if A runs first)
            p1 <- readTVar v1     -- v1: Jean -> Jean
            p2 <- readTVar v2     -- v1: Jean -> Jean, v2: Paul -> Paul
            return (p1, p2)
        print people -- if A runs first, (Jean, Paul)
                     -- if B runs first, (Paul, Jean).

    -- boring details just for this example
    forkDelayIO job = forkIO $
        randomRIO (1, 1000000) >>= threadDelay >> job
    waitThreads n v = atomically $
        do { d <- readTVar v;  when (d < n) retry }

In the above, we fork off two threads, A which swaps a pair of names
and, B which overwrites them with other names. Atomicity here means that
other threads never see any intermediary states and state changes from
other threads don't affect the current thread. For example, thread B
should never see `v1` being set to "Sue".  Likewise, if thread A should
still read "Joe" from v1 even if B simultaneously writes "Jean".

This is made possible by validation of the transaction logs. Validation
normally occurs at the end of a transaction (we won't cover the two
other cases here: exceptions, and thread wake-ups).  It consists of
checking that all the expected “before” values for `TVar`s still match
reality.  If the logs are good, we commit the new values; if not, we
simply discard them and try the transaction again, taking the new
reality into account. This validate-and-commit model allows us to run
transactions simultaneously, safely, but with the occasional rollback
and retry to ensure atomicity.

The notion of a transaction log brings us to the notion of cost. Good
things don't always come cheap, and using a good thing like STM may
require a little familiarity with the cost model behind it. Basically,
it's important to keep in mind that the values we write to `TVar`'s may
come from some arbitrary expression, and that arbitrary expressions may
be arbitrarily expensive.  So being forced to retry transactions may
involve redoing something expensive.  If the transactions affect many
variables, the chances of hitting a retry go up. Likewise, if the
transaction takes a long time to run, the chance goes up of some other
thread making a change that triggers a retry.  In the pathological worst
case, you can have some transactional behemoth that never manages to
commit; because some smaller faster transaction keeps stealing its
thunder. So keep an eye out for starvation and the more general problem
for retries being expensive.

Cost may be a bit of a bummer, but there's also a Haskell-related silver
lining behind all this. Because we have a purely functional language and
the enforced separation between pure functions and side-effecting
actions, STM is actually quite practical in Haskell. The number of
things we need to track in a transaction log is limited to handful of
explicit `TVar`s rather that just about everything. If you are coming
from other languages, you may have a memory of STM as being nice, but
wildly impractical.  Not so in Haskell. Eminently viable.

Aside from making STM practical, this sort of separation is also good
for general peace of mind. Suppose for example that we coded up a
feature in our banking software to send our users a text message alert
whenever their balances fall below a threshold. If we were in the
middle of a complicated transaction, we might be tempted to just slap
that logic right in the middle of the transaction; however, the Haskell
implementation makes this deliberately impossible. This can be a bit
frustrating at first (and new Haskellers are sometimes faced with the
“how do I get this out of the monad” puzzle), but saves us the
greater danger of bombarding our users with spurious retry-induced text
messages.

The guarantees that STM offer make it a great place to get started with
Haskell concurrency.  After all, why make software any buggier than it
needs to be?  If you do want to get started, have a look at Simon Peyton
Jones' [Beautiful Concurrency][stm-bc].  It's a particularly good idea
to do so, because there's some really interesting ground that we've not
covered here (briefly, blocking, the `retry` function aborts the current
transaction, and causes it to be retried when appropriate; and choice:
<code>a `orElse` b</code> tries `a`, and if that should retry, then `b`,
and if that should also retry, the whole expression again).  Other great
STM resources are Simon Marlow's [tutorial on parallelism and
concurrency][stm-sm] and the [Real World Haskell chapter on
STM][stm-rwh].  With the four resources combined, you'll see a nice
range of examples from the usual bank-account one to concurrently
shuffling windows between desktops. 

Blogs
----------------------------------------------------------------------
*    [The Downfall of Imperative Programming][b0] (9 Apr)

     Take a hardcore C++ veteran with imperative programming in his
     bloodstream and loads of experience under his belt. Now give him
     a passion for concurrency and what do you get? First, you get a
     keen awareness that the future is massively multicore. Second,
     you get a hard-won appreciation for how difficult concurrent
     programming can be; for all jokes we make in the Haskell community
     about firing the missiles, the consequences of data races can
     sometimes be [deadly][therac-25]. Third, you get the conviction
     that functional programming is the inevitable way forward.

     Bartosz Milewski sums up the situation thus: Sooner or later
     you’ll have to face the multicore reality. You will be forced to
     learn functional methods to the extent to which your imperative
     language supports them.  Despite that, data races will infest your
     code and leak into released products. So you might as well take a
     shortcut and embrace a full blown functional language now.

     See what you make of his blog post if you have not done so already.
     There's quite a bit of buzz about this post, so you may also be
     interested in the [programming reddit discussion][r2] around it
     as well.

*    [Building A Concurrent Web Scraper With Haskell][b1] (10 Mar)

     Let's make a concurrent web scraper! This blog post by Aditya
     Bhargava presents a hands-on introduction to both arrows (via hxt)
     and concurrency (via parallel-io). Aditya builds from the
     bottom-up, showing us little pieces of program that we might cobble
     together, culminating in a 52 line Haskell program that crawls web
     sites and fetches images within their pages.  The parallel-io
     library used in this tutorial provides a thread pool which
     minimises contention by guaranteeing a limit to the number of
     unblocked threads running at the same time.  It uses lock based
     concurrency with `MVar`'s under the hood.

*    [0MQ and Haskell][b2] (6 Mar)

     Magnus Therning could not find any excuses to look into 0MQ.  But
     “to hell with reason”, Magnus ended up deciding to just poke around
     without any specific goal in mind. He found a [nice
     tutorial][pyzero] based on Python and translated its mini examples
     into Haskell. Magnus wonders why the API for `subscribe` is
     `String` rather than `ByteString` based.  Also, he's finding that
     his client mysteriously dies after receiving a few messages.
     Any comments?

*    [SIMD Support for the vector library][b3] (27 Mar)

     Single instruction, multiple data (SIMD) is the sort of thing
     you might be interested in if you're into data parallelism:
     hardware that can perform the same instruction on multiple data
     simultaneously. Geoffrey Mainland posted about his efforts to
     bring SIMD support to GHC, the bigger picture being that you ought
     to be able to write nice high-level Haskell and have it work as
     fast the low-level Haskell or C that you might otherwise crank out.
     To try things out, Geoff benchmarks taking the dot product of two
     vectors in various Haskell and C versions. No happy ending yet,
     unfortunately: while the low-level Haskell version is competitive
     with C, the high-level is not. Check the post out for dissection of
     the results down the Core and assembly level.  Hopefully better
     news in a follow-up posting.

*    [Adding SIMD Support to Data Parallel Haskell][b3b] (18 Apr)

     Hopefully better news? Not as such, but perhaps something more
     interesting. In the previous post, we could make use of SIMD
     support by issuing some explicit instructions from the vector
     library. OK, but what about people who writing parallel code,
     say, by using a parallel arrays framework? And what if you could
     get this SIMD support virtually for free — no syntax attached?
     Geoff makes this possible by extending the [Data Parallel
     Haskell][dph] framework so you would only have to tweak a single
     import statement, and exploitation of SIMD instructions would be
     automatic.  See the posting for some nice benchmarks and also a
     brief introduction to Data Parallel Haskell.

*    [Work Efficient Higher-Order Vectorisation][b4] (24 Mar)

     Found on Manuel Chakravarty's tumblr: Our new draft paper on Work
     Efficient Higher-Order Vectorisation introduces a novel
     representation for nested, irregular parallel arrays that enables
     the work-efficient SIMD-ification of nested data parallelism —
     i.e., nested parallelism is transformed into flat parallelism,
     while maintaining the work complexity of a naive pointer-based
     representation of nested arrays. This solves a long standing
     problem that dates back to the original implementation of the
     language NESL.

Talks, tutorials, and packages
----------------------------------------------------------------------
*    [stm-chans 1.3.1][t1] (1 Mar)

     The stm-chans package offers a collection of channel types, similar
     to `Control.Concurrent.STM.TChan` but with additional features.
     This latest update by wren ng thornton takes advantage of
     optimisations in the newly released stm-2.3.  It's highly
     recommended that all users bump their minimum stm-chans requirement
     to version 1.3.1

*    [accelerate 0.10.1][t2] (12 Apr)

     Manuel Chakravarty has just released version 0.10.0.0 of
     Accelerate, an embedded language for GPU-accelerated array
     computations in Haskell that targets NVIDIA's CUDA framework and
     also has an experimental (and partial) OpenCL backend.  A
     considerable amount of example code is in the companion package
     accelerate-examples. The main user-visible changes in this release
     are frontend bug fixes.

*    [Parallel Functional Programming course][t3]

     Students at Chalmers and Gothenburg University are currently
     6 lectures into a course on parallel functional programming.
     The course has so far covered parallelism with `par`/`pseq`,
     Strategies and monad-par, using ThreadScope, and skeletons as
     a means to structure parallel computations. The course page
     has lecture notes and exercises which could be of interest
     even if you aren't currently following the course.

*    [Cloud Haskell][t4] (Fun in the Afternoon)

     Well-Typed's Duncan Coutts was at the recent [Fun in the
     Afternoon][fun-aft], a termly seminar on Functional Programming in
     the UK).  Duncan presented some of the motivation behind Cloud
     Haskell (”Erlang for Haskell”) and distributed programming, along
     with the Cloud Haskell design, and our work on a new implementation
     to follow on the initial protoype by Jeff Epstein. Our new
     implementation adds a swappable network transport layer. If you're
     happy with TCP/IP, don't wait for the new implementation; just
     `cabal install remote` and give Jeff's prototype a try.

Mailing lists
----------------------------------------------------------------------

### Concurrency

*    [Transactional memory going mainstream with Intel Haswell][m1] (9 Feb)

     Ben was wondering if any STM experts would comment on this recent
     [Ars Technica article][arswell] on the Intel Haswell chip. Austin
     Seipp pointed us to a comment by Duncan Coutts in the Reddit
     discussion (unfortunately not; the new extension would sledgehammer
     all instructions between the XBEGIN/XEND instructions).  Ryan
     Ingram suggests that maybe the extension could be used to optimise
     the existing implementation, perhaps by wrapping transaction
     commits with XBEGIN/XEND)

*    [Behavior of -threaded in GHC 7.4.1?][m2] (14 Feb)

     Mike Craig just debugging a recent issue with GHC 7.4.1 and the
     zeromq3-haskell library, which provides a provides an FFI binding
     to libzmq.  Unfortunately, code which used to run when compiled on
     GHC 7.0.4 dies with “operation on non-socket” when built with GHC
     7.4.1.  With the latter GHC, Mike can only run the code if he
     omits `-threaded`, or if he uses the `-V0` flag to turn off the RTS
     clock and associated signals.  After more debugging, he tracked the
     problem down to a `addFinalizer` on a `Socket` tuple.  The
     finalizer was being run prematurely, perhaps because the Socket
     type was being optimised away.  Putting the finalizer on the
     `Ptr ()` in the tuple seems to solve the problem.

*    [Question about concurrency, threads and GC][m6] (2 Mar)

     Paul Graphov is trying to implement a networked application that
     supports bidirectional conversations, ie. not just
     request/response, but also sending notifications to clients.
     Paul is particularly interested in STM, but he's stuck on a bit
     of a design problem. His thinking so far is that he'll need to
     start 3 threads for each client, one to read data from the socket,
     one that sends queued messages to that socket, and one for the
     main behaviour loop.

     Joey Adams noticed that this was the same sort of problem we
     reported in the [previous digest][ph-8]. Joey was grappling
     with making making asynchronous I/O composable and safe.  He
     wound up not using the stm-channelize package that he wrote, and
     recommends instead a 3-thread solution, using a thread each for
     receiving, sending, and coordination.  Check out the small
     Haskell [chat server][chat-1] that Joey wrote to illustrate the
     idea.

     Alexander V Vershilov suggests a data-driven behaviour based on
     conduits and stm channels. He's also provided an [example chat
     server][chat-2], which you can compare against Joey's version.
     The two examples take a similar approach, and `could perhaps be
     combined to good effect.

*    ["killThread" hangs! (ironic)][p1] (25 Feb)

     Ryan Newton is gathering information in preparation for a possible
     bug report.  He's testing the new network transport layer in
     distributed-process (Cloud Haskell) and gets hangs in `killThread`.
     Strangely, the pattern for hanging goes: GHC 6.12.3 [OK], 7.0.2
     [HANGS], 7.2.1 [HANGS], 7.4.1 [OK].  Any ideas? Simon Marlow
     suggests it may be a bug in the RTS asychronous exception handling
     code, fixed with commit `fa71e6c`.

*    [Synchronizations in memory allocation?][p2] (21 Mar)

     Following up on the recent scaling bottleneck thread, Ryan Newton
     wondered: “What is the reason for GHC managing all this pinned
     memory for foreign pointers itself rather than using an external C
     malloc/free implementation and thus keeping disjoint Haskell and C
     heaps?” Simon Marlow says it's basically because GHC's
     `mallocForeignPtrBytes` is much faster than `malloc()`/`free()`

     Ryan was asking because he is looking to how to do better on
     [NUMA][numa] platforms. “We've got a NUMA-aware work-stealing
     scheduler now for monad-par, but it isn't really helping much yet.
     So we need to answer the question of how well our memory is being
     localized to socket-preferred physical addresses.”  NUMA isn't
     something the GHC team have looked into for the RTS yet. He has
     some ideas for improvements to the block allocator; more details
     in the thread.

*    [Haskell for BigData][p3] (16 Mar)

     Andrei Varanovich observes that while Haskell has a lot to offer
     in the world of parallel/concurrent programming (from DPH to
     Cloud Haskell), it still lacks two important components for working
     with Big Data:

     * Integration with a distributed file system, such as
       Hadoop distributed file system
     * A data aggregration framework (eg. MapReduce, but of course
       something much richer; this being Haskell and all)

     Andrei was interested in submitting a Google Summer of
     Code proposal to build a big data framework for Haskell on
     top of Cloud Haskell. I didn't see a proposal this year,
     but maybe next time? See the thread for technical suggestions,
     supportive comment, and pointers for a succesful Haskell GSoC
     project.  

### Parallelism

*    *Help wanted!* [Parallelism causes space leaks][m9] (23 Mar)

     Yavuz Yetim posted a small chunk of code using Strategies for
     parallelism.  When he enables his `parList rdeepseq` strategy,
     though, he gets a stack overflow on smallish input (1 MB file),
     even if he allows GHC to use a 1GB stack.  Switching to
     `parMap`, `parListChunk` and other strategies don't seem to help
     either.

*    [Data.Array.Accelerate initialization timings][m5] …[more][m5b] (20 Feb)

     Paul Sujkov is finding that array initialisation in
     Data.Array.Accelerate takes 10x the amount of time than either
     Data.Array and bare C++ CUDA array initialisation. Is there
     anything Paul might be doing wrong in particular?  The accelerate
     package currently provides two backends, an interpreter (reference
     implementation) and a CUDA backend generating code for CUDA-capable
     NVDIA GPUs.  Martin Dybdal comments that Paul should use
     `Data.Array.Accelerate.use` to generate hints to transfer arrays to
     GPU, and `Data.Array.Accelerate.CUDA.run` to actually perform the
     transfer. Manuel Chakravarty adds that the the `fromList` function
     is really just meant for testing, or for initialising small arrays.
     For anything bigger, going from vanilla lists is a bad idea, so
     have a look at `Data.Array.Accelerate.IO`.

*    [Reasons for Super-Linear Speedup][m7] (5 Mar)

     Burak Ekici has parallelized RSA decryption and encryption schemes
     by using second generation strategies.  He's getting 10 times
     performance improvements… on a quad-core CPU (with an 8MB cache).
     Is this just mismeasurement, or are there some differences in how
     GHC handles serial/parallel of computation, say with respect to
     cache usage? Bardur Arantsson replies that the usual explanation
     for this sort of thing is that the working data suddenly fits
     within the per-CPU L2 cache when split up.

StackOverflow and Reddit
----------------------------------------------------------------------
* [Is there a way in Haskell to query thread state using ThreadID after a forkIO?][s1]
* [Are haskell channels `Control.Concurrent.Chan` safe for multiple readers/producers?][s2]
* [How to take F# measurements to get speedups][s3]
* [Using Haskell to program the GPU : haskell][r1]
* [The Downfall of Imperative Programming][r2]
* [Examples of easy parallelism in Haskell?][r3]

Help and Feedback
----------------------------------------------------------------------
If you'd like to make an announcement in the next Haskell Parallel
Digest, then get in touch with me, Eric Kow, at
<parallel <at> well-typed.com>. Please feel free to leave any comments and
feedback!

[arswell]: http://arstechnica.com/business/news/2012/02/transactional-memory-going-mainstream-with-intel-haswell.ars
[chat-1]: https://github.com/joeyadams/haskell-chat-server-example
[chat-2]: https://github.com/qnikst/chat-server/blob/master/src/Main.hs
[dph]: http://ghc-simd.blogspot.co.uk/2012/04/adding-simd-support-to-data-parallel.html
[fun-aft]: http://sneezy.cs.nott.ac.uk/fun/2012-02
[numa]:   https://en.wikipedia.org/wiki/Non-Uniform_Memory_Access
[hs-perf]: http://www.berniepope.id.au/linuxPerfEvents.html
[pyzero]: http://nichol.as/zeromq-an-introduction
[stm-bc]: http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/beautiful.pdf
[stm-05]: http://research.microsoft.com/en-us/um/people/simonpj/papers/stm/stm.pdf 
[stm-sm]: http://community.haskell.org/~simonmar/par-tutorial.pdf
[stm-rwh]: http://book.realworldhaskell.org/read/software-transactional-memory.html
[ph-8]: http://www.well-typed.com/blog/64
[simd-ghc]: http://hackage.haskell.org/trac/ghc/wiki/SIMD
[therac-25]: http://en.wikipedia.org/wiki/Therac-25

[g1]: http://parfunk.blogspot.co.uk/2012/02/potential-gsoc-haskell-lock-free-data.html
[g2]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dmitrydzhus/1002
[g3]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/florianhartwig/1
[g4]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/mathiasbartl/1
[g5]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/lorehead/1
[g6]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/refold/35002
[g7]: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/sajith/1

[b0]: http://fpcomplete.com/the-downfall-of-imperative-programming/
[b1]: http://adit.io/posts/2012-03-10-building_a_concurrent_web_scraper_with_haskell.html
[b2]: http://therning.org/magnus/archives/1009
[b3]: http://ghc-simd.blogspot.co.uk/2012/03/simd-support-for-vector-library.html
[b3b]: http://ghc-simd.blogspot.co.uk/2012/04/adding-simd-support-to-data-parallel.html
[b4]: http://tumblr.justtesting.org/post/19825349916/work-efficient-higher-order-vectorisation

[t1]: http://www.haskell.org/pipermail/haskell-cafe/2012-March/099805.html
[t2]: http://justtesting.org/gpu-computing-in-haskell-version-010-of-dataa
[t3]: http://www.cse.chalmers.se/edu/course/pfp/
[t4]: http://sneezy.cs.nott.ac.uk/fun/2012-02/coutts-2012-02-28.pdf

[m1]: http://www.haskell.org/pipermail/haskell-cafe/2012-February/099300.html
[m2]: http://www.haskell.org/pipermail/haskell-cafe/2012-February/099456.html
[m3]: http://www.haskell.org/pipermail/haskell-cafe/2012-February/099456.html
[m5]: http://www.haskell.org/pipermail/haskell-cafe/2012-February/099578.html
[m5b]: http://www.haskell.org/pipermail/haskell-cafe/2012-February/099584.html
[m6]: http://www.haskell.org/pipermail/haskell-cafe/2012-March/099830.html
[m7]: http://www.haskell.org/pipermail/haskell-cafe/2012-March/099890.html
[m9]: http://www.haskell.org/pipermail/haskell-cafe/2012-March/100349.html
[p1]: https://groups.google.com/d/msg/parallel-haskell/NI5qxYw-5RA/fTljjfXYWYUJ
[p2]: https://groups.google.com/d/msg/parallel-haskell/fsHrxz3ei70/FKe0kgV5mqcJ
[p3]: https://groups.google.com/d/msg/parallel-haskell/nxDoibiuGWE/N68MR19uu1kJ

[s1]: http://stackoverflow.com/questions/9475392/is-there-a-way-in-haskell-to-query-thread-state-using-threadid-after-a-forkio
[s2]: http://stackoverflow.com/questions/9616515/are-haskell-channels-control-concurrent-chan-safe-for-multiple-readers-produce
[s3]: http://stackoverflow.com/questions/9678284/how-to-take-f-measurements-to-get-speedups
[r1]: http://www.reddit.com/r/haskell/comments/qnq4y/using_haskell_to_program_the_gpu/
[r2]: http://www.reddit.com/r/programming/comments/s112h/the_downfall_of_imperative_programming_functional/
[r3]: http://www.reddit.com/r/haskell/comments/sdx5v/examples_of_easy_parallelism_in_haskell/
--

-- 
Eric Kow <http://erickow.com>

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Gmane