This is a short follow-up to the previous post, in which I described how to incrementally export data from Datastore to BigQuery. Here, I discuss how that solution is extended to a situation where you have Datastore databases in multiple projects. The goal remains the same: we would like the data to end up in BigQuery.
Overall, the problem can be pictured as follows:
The Dataflow job can run either in one of the source projects or in a separate project – I placed it in a separate project. The results can be stored in BigQuery located either in the same project as the Dataflow job or in yet another project.
Let’s start with the generalization. First, I’ve extended the configuration file with two new fields:
SourceProjectIDs, which is nothing more than a list of source GCP projects, and
Destination, which determines where the destination BigQuery dataset resides.
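A minimal sketch of what the extended configuration file could look like; the exact layout and the project/dataset names are illustrative assumptions, only the two field names come from the text:

```yaml
# Hypothetical shape of the extended configuration file.
SourceProjectIDs:
  - project-a
  - project-b
  - project-c
Destination: analytics-project.datastore_export  # project and dataset for the output tables
```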
The extended pipeline is defined as follows:

```python
rows = (
    p  # the Pipeline object
    | 'projects' >> beam.Create(project_ids)
    | 'get all kinds' >> beam.ParDo(GetAllKinds(prefix_of_kinds_to_ignore))
    | 'create queries' >> beam.ParDo(CreateQuery(entity_filtering))
    | 'read from datastore' >> beam.ParDo(ReadFromDatastore._QueryFn())
    | 'convert entities' >> beam.Map(entity_to_json)
)
```
Compared with the single-project version, the pipeline gains one additional step, projects, which produces a PCollection with the list of source projects (taken from the configuration file). A small change was also needed in the get all kinds step: GetAllKinds was changed so that, for each project, it creates a list of (project_id, kind_name) tuples.
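A sketch of that change, assuming a hypothetical helper list_kinds(project_id) in place of the real Datastore kind query, so the tuple-building and prefix-filtering logic stands on its own:

```python
def list_kinds(project_id):
    """Hypothetical stand-in for a Datastore kind query; stubbed for illustration."""
    fake_kinds = {
        'project-a': ['User', 'Order', '_internal_Stat'],
        'project-b': ['User', '_internal_Log'],
    }
    return fake_kinds.get(project_id, [])

def get_all_kinds(project_id, prefix_of_kinds_to_ignore):
    """Yield (project_id, kind_name) tuples, skipping kinds with the ignored prefix."""
    for kind_name in list_kinds(project_id):
        if not kind_name.startswith(prefix_of_kinds_to_ignore):
            yield (project_id, kind_name)
```

In the pipeline this logic lives inside GetAllKinds, so the downstream steps receive a PCollection of (project_id, kind_name) tuples rather than bare kind names.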
The DoFn’s process method accepts tuples like any other element to be processed. The next step, create queries, creates queries that retrieve entities of a given kind_name living in Datastore in a given project_id:
```python
def process(self, project_kind_name, **kwargs):
    """
    :param project_kind_name: a tuple with project_id, kind_name
    """
    project_id, kind_name = project_kind_name
```
The query generated by this step already contains the project_id, so we no longer need to pass the project ID downstream separately.
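A simplified sketch of what create queries produces. The real pipeline builds Datastore query objects; a plain dictionary stand-in (field names assumed) is used here just to show how the project ID travels inside the query itself:

```python
def create_query(project_kind_name):
    """Build a simplified query description for one (project_id, kind_name) pair."""
    project_id, kind_name = project_kind_name
    return {
        'project': project_id,  # the query itself carries the project ID
        'kind': kind_name,
    }
```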
The schema of the JSON objects stored in BigQuery additionally includes a project property in the __key__ field. The names of the BigQuery output tables are formed using a prefix.
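An illustrative example of such a JSON object; the overall field layout is an assumption, the point is only the project property inside __key__:

```json
{
  "__key__": {
    "project": "project-a",
    "kind": "User",
    "id": 5629499534213120
  },
  "name": "Alice"
}
```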
The catch in the presented solution is permissions. Dataflow uses two service accounts (SAs): one used during job creation and one used by the workers to access resources.
We are interested in the SA used by the workers. By default, this service account is created automatically when the Compute Engine API is enabled in your project, and it has a well-known name.
Therefore, to allow the Dataflow pipeline to access Datastore in a different project, in each source project grant the account
<project-number>-email@example.com a role that allows reading from Datastore (for example, Cloud Datastore Viewer).
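Granting that role can be sketched with gcloud; the source project name and the SA address are placeholders, and roles/datastore.viewer is one role that permits reading entities:

```shell
# Run once per source project; the member is the Dataflow worker SA.
gcloud projects add-iam-policy-binding source-project-a \
  --member="serviceAccount:<project-number>-email@example.com" \
  --role="roles/datastore.viewer"
```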
That’s it – the pipeline works as expected: