I just wanted to introduce part of the work that Insight is contributing as part of our P-Rep application. We are building a modular terraform deployment deployed through terragrunt. If you aren’t familiar with terraform, it is the leading infrastructure provisioning tooling around today allowing users to script the state of a cluster on any cloud. If you aren’t familiar with terragrunt, it is a wrapper around terraform that makes it easier to use and manage in complex, multi-environment deployments.
Here is a link to the repo I am working out of right now.
I will update the README with the current state, but briefly: nodes still need configuration via SSH. IAM permissions are being locked down now and worked into a multi-account security pattern where all IAM roles are stored in a single account (more on this later). Security groups can easily be statically locked down to the IP whitelist, but I am looking for input on the best way to automatically respond to changes to the whitelist. Multi-host configurations (i.e. multi-P-Rep, multi-citizen) are not started yet. An Ansible configuration and security hardening step will be included soon.
I am hoping that over time, this style of deployment will:
Facilitate the deployment of both P-Rep and citizen nodes
Enable P-Rep node operators to test different node configurations on TestNet
Support a variety of different firewalls giving the network a decentralized grid of security features
I take copious notes and will have documentation to support much of what I do up later this week (another post on that coming shortly). In the meantime, if you have any comments / suggestions, please hit me up or lay them out in this thread. I am looking for any insights into what people think would have the most impact soonest. If anybody has experience with terraform and is running a P-Rep, please hit me up.
Right now only supporting AWS. GCP will be next. Also developing in parallel a supporting services / logging cluster. Will start separate thread on what is needed there soon.
Please don’t hesitate to ask questions. Looking for as much community involvement / use as possible.
Just a little update on our work: we are in the design process, considering whether or not to centralize identity (IAM + IDP) in one account and have services in another. Segregating permissions with multiple accounts is the right way of doing things for security reasons, as we can enforce least privilege with both service control policies and role-based access control. On the other hand, it creates issues, as you have more connections to deal with, and it relies on very modern, enterprise security patterns that take a lot to manage. Ultimately I think we will want to use this pattern, but it takes a lot of work to get it going.
This presentation will give you a little bit of background on what we are doing, but essentially we will be running individual AWS accounts for each environment and as such will need roles to assume to make changes inside each of those accounts, plus a central place to manage RBAC for each user.
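To make the multi-account idea a bit more concrete, here is a rough Python sketch of the two pieces involved: the ARN of a role that lives in an environment account, and a trust policy that lets the central identity account assume it. The account IDs, the role name and the helper names are all made up for illustration, and this is not taken from the repo. It only shows the general shape of the pattern.

```python
import json
from typing import Dict

# Hypothetical 12-digit account IDs, one per environment plus a central
# identity account. None of these come from the actual deployment.
ACCOUNTS: Dict[str, str] = {
    "identity": "111111111111",
    "dev": "222222222222",
    "prod": "333333333333",
}


def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of an IAM role that lives in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def trust_policy(identity_account_id: str) -> Dict:
    """Trust policy that lets principals in the identity account assume a role.

    Trusting the account root delegates the decision about *which* users may
    assume the role to IAM policies inside the identity account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{identity_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # "deployer" is a hypothetical role name.
    for env in ("dev", "prod"):
        print(env, role_arn(ACCOUNTS[env], "deployer"))
    print(json.dumps(trust_policy(ACCOUNTS["identity"]), indent=2))
```

Each environment account would carry a role with this trust policy attached, and RBAC then comes down to which users in the identity account are allowed to call sts:AssumeRole on which role ARNs.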
Going to use our best judgement for now and if anyone has any inputs / questions, please leave them on this thread as we are hoping the community will benefit from this work as we standardize security best practices.
It’s a great idea, however I think you are overcomplicating things for a relatively straightforward use case.
I also took a brief look at the repo. There is a lot of duplication that’s not necessary.
What I would suggest instead is to use CloudFormation and on top of that some scripting or terraform.
I wouldn’t go multi-account and/or AWS Organizations. I think you should just put the nodes in separate VPCs and that should pretty much do it.
For the IP whitelisting, what you could do is have an AWS Lambda pull the whitelist once a day and update the SG accordingly. Pretty straightforward.
So my overall feedback is keep it simple, start MVP. Don’t overengineer.
Thank you for the feedback. I was hoping someone would ask me about this.
To address your points (and please comment accordingly, I really want any feedback)…
As for the duplication, this is the strategy for managing terraform with terragrunt. The actual terraform modules are kept in separate repos with versioned releases that are pinned in the terragrunt configuration files. For instance, in this file you can see it being pinned to the main AWS VPC module. Copying and pasting these terragrunt files is effortless compared to maintaining multiple different copies of the raw terraform, and it also encourages best practices by allowing developers to predictably promote changes across environments. Basically, duplication in the terragrunt files is a sacrifice, but a VERY small one compared to managing terraform with lots of different versions of terraform modules.
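As a rough illustration of what the pinning buys you, here is a small Python sketch that reads the module source string each environment points at and reports which release it is pinned to. The `?ref=` suffix is the usual way a git module source is pinned to a tag. The repo URL, tags and helper names below are invented for the example, not copied from the actual configuration.

```python
import re
from typing import Dict, Optional

# Matches the "ref=<tag>" query parameter that pins a git module source.
_REF = re.compile(r"[?&]ref=([^&]+)")


def pinned_ref(source: str) -> Optional[str]:
    """Return the tag or branch a module source is pinned to, or None if unpinned."""
    match = _REF.search(source)
    return match.group(1) if match else None


def pin_report(sources: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each environment name to the release its module source is pinned to."""
    return {env: pinned_ref(src) for env, src in sources.items()}


if __name__ == "__main__":
    # Hypothetical sources: dev has been promoted to a newer release than prod.
    sources = {
        "dev": "git::https://github.com/example-org/terraform-aws-vpc.git?ref=v0.2.0",
        "prod": "git::https://github.com/example-org/terraform-aws-vpc.git?ref=v0.1.0",
    }
    for env, ref in pin_report(sources).items():
        print(f"{env}: {ref or 'UNPINNED'}")
```

Promoting a change is then just bumping the ref in the next environment's file, which is why the near-identical files per environment are cheap to keep around.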
On multiple accounts complicating things, there are two points. First off, this is a very valid concern of mine, that it will overcomplicate things for basic users, though for Insight I think we will need a little more control. For instance, I don’t want any users to have access to our production environment, and thus don’t want to worry about locking down access to the production VPC within one account. Insight will have several developers working on things who only need access to our dev and testing environments. Using AWS Organizations is the best way to do this, though I could be convinced to make special efforts for users who only want to run one account. It should be simpler in practice, but from a governance perspective you will want to lock down controls as much as possible on the production nodes and give users a playground to test new features. Organizations give you separation out of the box. Second, the scripts right now can all be run out of one account, as each environment needs initialization. The differences really come from the IAM permissions and how to discover / connect services. I’ll keep this in mind though, as you are totally right: the average user will probably only want to use one account, though I would highly encourage at least two if you want to develop new features. Also, the setup of the accounts themselves will be automated eventually, if that was a concern.
For the IP whitelisting, using terraform to deploy serverless lambda functions is very popular and I was planning on building what you suggested. I was talking with @thelionshire about this and it all comes down to how often you want to run this function (i.e. on a cron schedule or responsively through an API). I’ll start a new thread on that, but I think we’re on the same page.
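For what it’s worth, the core of such a function is just a set difference between what the security group currently allows and what the whitelist says. Below is a minimal Python sketch of that step only. The function names and addresses are hypothetical (the IPs are from the documentation ranges), and fetching the whitelist and applying the result are left out. In a real Lambda the two sets would presumably be fed to the EC2 API calls that authorize and revoke security group ingress.

```python
from typing import Iterable, Set, Tuple


def to_cidr(ip: str) -> str:
    """Turn a bare IPv4 address into a /32 CIDR; leave existing CIDRs untouched."""
    return ip if "/" in ip else f"{ip}/32"


def plan_whitelist_sync(
    current_cidrs: Iterable[str], desired_cidrs: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (to_authorize, to_revoke) so the group ends up matching the whitelist."""
    current = set(current_cidrs)
    desired = set(desired_cidrs)
    to_authorize = desired - current  # on the whitelist, missing from the group
    to_revoke = current - desired     # in the group, no longer on the whitelist
    return to_authorize, to_revoke


if __name__ == "__main__":
    # Hypothetical data standing in for the security group and the whitelist.
    in_group = ["203.0.113.10/32", "203.0.113.11/32"]
    whitelist = [to_cidr(ip) for ip in ["203.0.113.11", "198.51.100.7"]]
    add, remove = plan_whitelist_sync(in_group, whitelist)
    print("authorize:", sorted(add))
    print("revoke:", sorted(remove))
```

Whether this runs on a daily cron or in response to an API event only changes what triggers it. The diff itself stays the same, and running it twice in a row is harmless because the second run produces two empty sets.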
As for our tech choices (terraform and terragrunt), a lot of this just boils down to what you are most comfortable with. Happy to incorporate CloudFormation within the terraform deployment, but the core of this project is going to stay in terragrunt for the reasons cited above.
Thank you very much for the suggestions and please, let me know if anything I am saying does not jibe. I have been flying solo on this and am hoping more people can comment. This project just started, so I am really looking for as much input as possible. I will be releasing a video very soon describing how to use this setup, but I am always looking to make things better / more usable.
I am not an expert in Terraform, and specifically terragrunt, but to me it looks overly complicated for the task at hand.
You are just deploying nodes, so basically EC2 instances. So, using IAM policies with one account should suffice. No one could access, stop or terminate the instances if they didn’t have access to them.
IAM as it is, allows for very fine controls on almost all resources. I would enforce MFA on the users though.
What your internal needs are and what you are developing should not be assumed for the community.
If someone wants to develop apps I think it should be beyond the scope of this.
As for the tooling, it doesn’t really matter, although my rule of thumb is to choose what’s right for the job rather than fixing on a tool and working around it.
I don’t know what that “logging cluster” is supposed to mean but it sounds complicated. CloudWatch, anyone?
I think I should be a little more clear about what we are doing. Running a node on MainNet is not going to be that hard. You are right. But we need to harden the network, and we only got guidance on how to set up more robust node configurations 2 weeks ago or so. That means we will need to get the best MainNet setup possible before we go live, while also testing different node configurations in a TestNet and promoting those changes into MainNet as we make improvements. Automation is obviously needed for that, and then the next challenge is managing those promotions into MainNet. I have a couple of years of terraform experience and I can tell you with total confidence that terragrunt will make this all MUCH easier. I spoke with the original author of terragrunt about a month ago and showed him snippets of what we are doing, and he gave me the thumbs up on my understanding of the optimal file structure layout for what we are doing.
For the logging cluster, you are right, this really doesn’t need to be a beefy cluster; it could just be CloudWatch. This is simple with something like this repo, or we can set up something better like Elasticsearch. CloudWatch is very bare bones, so it comes down to use case. Insight is going to be doing work on double signing protection and will need a better log querying system than what CloudWatch provides. If using terragrunt, this is easy to swap out as the whole thing is modular, but I do think that we all need to be on the same page so that we can aggregate logs easily. Open to suggestions, but the devs are officially recommending Elasticsearch, so in that case whether it is a node or a cluster is semantics.
Another note on using terragrunt: running a given stack is as simple as cd-ing into the proper directory, running one command to initialize sensitive account information (i.e. just the account number), and then running one command to deploy all the infrastructure. So three steps. Other terraform topologies require you to stuff all the node configuration into one large terraform file and pray that it all works. With terragrunt, modularization is all done internally and called from one script. If ANYONE wants to work on this, please hit me up and I will support. I personally find terragrunt easier to manage.
On one account vs multiple, it all depends on what scale you have and what you are comfortable with. I’m down to make it work in both cases, but to support our security posture, with as many people as Insight will have working on our project, we need a little more segregation. Writing policies that are assumed from another account or all within the same one should be a small tweak.
You obviously know AWS. Do you have any IaC that you are using that you can share? Been trying to find anyone who has anything that would help with automation.
I’m curious if anyone has ever looked into running p-rep and citizen nodes in a Kubernetes cluster. I’d be interested in the pros/cons. If it looks like a good idea I may put something together.
We are running a services cluster on k8s but not the P-Rep or citizen. Would love to run citizen but P-Rep should be run as close to metal as possible. There is a great resource that discusses running P-Rep on k8s found at https://kb.certus.one/systems.html?highlight=kubernetes#why-not-kubernetes.
Larger point, P-Rep can be mutable for prod deployments and everything else (including P-Rep for testnets) should be made immutable. I love k8s for everything and would have liked to run P-Rep there but unless it is for testing, I don’t think there is much of an advantage. Our one-click deployment right now configures the host with Ansible and we are pre-baking images with packer for most other things.
Our k8s services cluster is running Prometheus / grafana, an ELK stack (to be contributed by iBriz), and soon we’ll have a test bench running with all kinds of different testing tools (penetration, load, etc) that will run as a bunch of docker containers.
Happy to collab on this and share our setup. The services cluster we are building is going to be as close to one-click as possible. Hit me up on Telegram if you would like to discuss. “@robcio”