Google Cloud Datastore - Dataflow not acking PubSub messages
A simple gcloud Dataflow pipeline:

PubsubIO.readStrings().fromSubscription -> Window -> ParDo -> DatastoreIO.v1().write()
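For reference, a minimal Beam 2.x Java sketch of a pipeline shaped like the one above (the subscription path, project ID, window size, entity kind, and the message-to-Entity conversion are placeholders, not the poster's actual code):

    import static com.google.datastore.v1.client.DatastoreHelper.makeKey;
    import static com.google.datastore.v1.client.DatastoreHelper.makeValue;

    import com.google.datastore.v1.Entity;
    import java.util.UUID;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    public class PubsubToDatastore {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromPubsub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
            .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply("ToEntity", ParDo.of(new DoFn<String, Entity>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Placeholder conversion: store the raw message body under a random key.
                Entity.Builder entity = Entity.newBuilder();
                entity.setKey(makeKey("Message", UUID.randomUUID().toString()));
                entity.putProperties("body", makeValue(c.element()).build());
                c.output(entity.build());
              }
            }))
            .apply("WriteToDatastore", DatastoreIO.v1().write().withProjectId("my-project"));

        p.run();
      }
    }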
When load is applied to the PubSub topic, the messages are read but not acked:
Jul 25, 2017 4:20:38 PM org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource$PubsubReader stats
INFO: Pubsub projects/my-project/subscriptions/my-subscription has 1000 received messages, 950 current unread messages, 843346 current unread bytes, 970 current in-flight msgs, 28367ms oldest in-flight, 1 current in-flight checkpoints, 2 max in-flight checkpoints, 770B/s recent read, 1000 recent received, 0 recent extended, 0 recent late extended, 50 recent acked, 990 recent nacked, 0 recent expired, 898ms recent message timestamp skew, 9224873061464212ms recent watermark skew, 0 recent late messages, 2017-07-25T23:16:49.437Z last reported watermark
Which pipeline step is supposed to ack the messages?
- The Stackdriver dashboard shows acks, but the number of unacked messages stays stable.
- There are no error messages in the trace indicating that message processing failed.
- Entries show up in Datastore.
Dataflow acknowledges PubSub messages only after they have been durably committed somewhere else. In a pipeline that consists of PubSub -> ParDo -> one or more sinks, acknowledgement may be delayed by any of the sinks having problems (even if the writes are being retried, the retries slow things down). This is part of ensuring that results appear to be processed effectively-once. See the earlier question on when Dataflow acknowledges a message for more details.
One (easy) option to change this behavior is to add a GroupByKey (using a randomly generated key) after the PubSub source and before the sinks. That will cause messages to be acknowledged earlier, but it may perform worse, since PubSub is generally better at holding on to unprocessed inputs than a GroupByKey is. A sketch of this is shown below.
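As a sketch of that suggestion, assuming the transform is inserted after the Window step (a GroupByKey on an unbounded source needs a non-global window or a trigger); the key range of 100 and the transform names are arbitrary:

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class RandomKeyCheckpoint {
      /** Checkpoints the (already windowed) PubSub input via a random-key GroupByKey. */
      public static PCollection<String> viaRandomKey(PCollection<String> windowedMessages) {
        return windowedMessages
            // Assign a small random key so elements spread evenly across workers.
            .apply("AddRandomKey",
                WithKeys.of((String msg) -> ThreadLocalRandom.current().nextInt(100))
                    .withKeyType(TypeDescriptors.integers()))
            // The GroupByKey durably commits its input, so the PubSub source can ack
            // messages here instead of waiting for the Datastore write to finish.
            .apply("GroupByRandomKey", GroupByKey.<Integer, String>create())
            // Drop the keys and flatten the grouped iterables back into single elements.
            .apply("DropKeys", Values.<Iterable<String>>create())
            .apply("FlattenGroups", Flatten.<String>iterables());
      }
    }

In the pipeline above this would sit between the Window step and the ParDo / DatastoreIO.v1().write() steps; Beam's built-in Reshuffle.viaRandomKey() transform packages up essentially the same pattern.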