F*** Microservices

This is a rant. If you have some experience with microservice architecture then you will probably find nothing new here. I focus here on the architectural choice where a single team works on multiple deployment units, especially where there are more deployment units than developers on the team. For me a team consists of up to 6 developers. If you have more, then your organization has other, more pressing, problems.

First of all, if you do not know what your business process is going to look like, especially if your product is not in production yet, do not make it a distributed process. I am not talking about orchestration vs choreography here. I am all for orchestration and a clearly visible process defined in one place, but that discussion is for another post. I take issue with a business process that has to communicate with multiple deployment units in order to perform its task, regardless of the communication protocol (REST, event driven, message driven, etc.). Apparently some believe that developing a process that changes weekly (or even daily) is easier when it spans multiple services/repositories.

– We are not in production, we can break our platform for a short period of time to save time. No worries.

– Sure, but then why are you applying microservices architecture right now in the first place? 

– So that the team can learn how to work with microservices before we are in production.

– That is great, but if you are deploying services with breaking changes then how are they supposed to learn?

– Ok, let’s do it properly then. With backward compatibility and versioning.

And then you fail to meet deadlines arbitrarily set by the business.

Here is a simple calculation. We have two aggregates that live in two different services and a business process that interacts with them. New requirements force us to change a single field in each of the aggregates. If the process and aggregates were in a single deployment unit, this would probably result in a single 20–30 line PR (this includes business logic, storage changes, unit and integration tests, etc.). Now consider the case where the aggregates and the process are in separate deployment units. If we were to introduce breaking changes, we would need 3 PRs of 20–30 lines of code each. That is at least 3 times as much work, plus a higher chance of getting it wrong at the service boundaries. If we were to adhere to backward compatibility, this would result in at least 5 PRs. As to why, I will leave that to the reader.

Alternatively, although some consider it heresy, you could keep all your code in a single repository. This would limit the amount of work needed, while providing the additional benefit of keeping the code in a consistent state. The trade-off is the swelling of the repository (think indexing in your IDE).

I once heard that no breaking changes would be introduced since only additive changes would be made to the integration model. This was for a system that was not even live yet. I have no comment. Even for a live system this would seem highly irresponsible. Will it result in a huge mess? Probably.

Next, continuous delivery. By continuous delivery I mean that, after a PR is merged to master, it gets built (e.g. as a Docker image), then goes through automated testing and finishes in a state where it could be deployed to production (e.g. as a canary deployment) at any time, without the need for further testing. If your organisation does not have a reliable system for continuous delivery, you should not do microservices. This guideline is not something I came up with, but I wholeheartedly support it. To be honest, I would go as far as to say that if you have more than one deployment unit per team, then you should have continuous deployment. Continuous integration is not enough. If for some reason you can’t, be it due to some weird regulations or whatever, then do not do microservices.

Continuing, the size of the microservices. This has been debated to death. If you follow the DDD approach, then this should be straightforward and you should not go wrong with a bounded context as a boundary. However. If you think that DDD is not for you or your product does not have a domain (do not ask me how), you might end up with one service per database table, or smaller. I draw the line where half of the microservice code is infrastructure. My preference is one service per team of 6 developers (this includes QAs, since I treat them as developers). If the service is too big for them to handle, split it. If there is not enough work to go around, reduce the team.

Microservices are not an excuse for poor design. Neither strategic (i.e. spanning the whole application and multiple services) nor tactical (within the bounds of a single service). I once heard Greg Young talk about treating services as small classes that could easily be written in a week. If you can write it in a week, then it shouldn’t be a problem to rewrite it from scratch if needed. This is an awesome concept, but it does not mean that the services are supposed to be unreadable. Small classes are still subject to clean code principles. When the time comes to rewrite a service in six months, you have to be able to understand what it does. After all, the code is the documentation, isn’t it?

Monitoring. This is another thing that has been talked about many times. When applying microservice architecture, your monitoring has to be top class. You will not be able to test all the integration variations that happen between services, neither in an automated nor in a manual manner. If your organisation cannot supply metrics in an easy-to-access way (and searching through individual AWS ECS instances does not count), then get this straightened out before diving into multiple deployment units.

IMPORTANT NOTE: I believe determining what should be monitored is the first thing that should be done when starting work on a new product, both from a technical and a business point of view. The reason is that if you know what you need to measure, then you know what your product should deliver. Treat it as a specification of the fitness function for your product.

Whenever you start working on a new service, always run it with two nodes from the start, regardless of the size of the service. If you cannot develop a multi-node service, then your service is not horizontally scalable, you cannot provide at least some sensible measure of availability, doing zero-downtime or canary deployments becomes very hard, and so on. If I see such a service in development, that is a huge red flag.

Regarding performance. I am not that sold on the performance benefits of microservices. Sure, you can horizontally scale the services that are under heavy use (although I am yet to see a Spring application with a short startup time), or make sure that a service that needs low response times doesn’t get killed by that report generation. But when you reach the point where these become real problems that can’t be fixed by just adding another machine, it will most likely mean that you have succeeded in making a valid product and have traffic that is probably in the top 1%. Rejoice.

As a closing remark, people who warn against microservices often mention that you are not like Netflix. I say the opposite. You are like Netflix and you should behave like them. They had a monolithic application until well over 100 developers were working on it. That is when they decided to split it. So be like Netflix.


On Product Development Team

Disclaimer: I am writing only my own opinions, which in no way reflect those of my previous, current or future employers. They are based on my experiences and reflections.

The purpose of this piece is mostly to organise my thoughts and ideas regarding the functioning of teams. If anyone finds something interesting or something they totally disagree with, then great!

I have been pondering the idea of a product development team and how it functions for a while now. I have been part of a few and have seen a few more, be it in the workplace or talked about at conferences. This led me to a number of conclusions as to what, in my view, works well and what does not.

Team size

This feels fairly straightforward. I have worked across the full spectrum when it comes to the number of people on a team. I have been the only team member and I have been part of a team of 20+ people. In my experience Scrum (not saying this is the origin of the idea), regardless of whether you like it or not, got this right. Four to six team members is more than enough to get the work done, while not so many that the communication overhead gets too large. This, however, comes with a few caveats.

Team member competencies

The first caveat is team member competencies. The most performant teams I have seen were those where each member was capable of working on every area of the product. What do I mean by this? Let’s take a look at a typical web application that has both a frontend and a backend. Oftentimes the product development team will naturally split into those that work only on the front end and those that work only on the back end of the app. Of course there can be other splits, but I find this to be the most common one. My opinion is that allowing such divisions is highly irresponsible.

Let’s start with the ability to replace people. Eventually, only people that know one part of the application will be available, be it due to holidays or sick leave, or because some people change teams or even employers. Development, bug fixes and planning will effectively halt. Some work will still be possible, but this makes the employer vulnerable to unexpected circumstances. It’s surprising that these situations are still permitted to occur. You could argue that for products that are not yet in production this shouldn’t be such a problem. On the contrary. This cripples development and the ability to plan further work, which in effect drains the product's funds.

We can go even further. Consider a modern continuous deployment pipeline. I would say that having only one person doing code reviews in such a setup is unwise. Now, to be able to deploy new code, you need a person that publishes a PR and two people that review it. In a team of six this means that you need half of the people working on the front end and half on the back end. This might not be optimal given the characteristics of the project. Maybe the UI is very simple and one person would normally suffice? But you need three. What happens if one leaves on a two-week vacation? Deployments stop. No hotfixes. No bug fixes. You have two-week sprints? Yeah, no new features. Even if you dial down to one reviewer, you still need to have two team members in each area present at all times.

Another huge problem with this approach is that such a split effectively creates two or more sub-teams that need to communicate and integrate with each other. Each additional point of integration adds complexity. This is costly. Instead of limiting the need for communication and overhead, we’re increasing it.

If developers know only a part of the product and how it works, then the solutions they come up with will be sub-par. They will be operating with limited knowledge when designing the product. This is unavoidable in a larger corporation where multiple teams have to cooperate, but on the level of a single product it is unacceptable.

You might be tempted to split the team into, for example, two teams: front and back. Ignoring the fact that this goes against the Scrum (if you are using it) idea of a team being able to deliver a whole business functionality from start to finish, you will now need to deal with the overhead of having two teams: double the number of meetings, double the amount of work for the business side, and so on. In my opinion it is far more efficient to have a single team of six, where each member can handle every area of the product at least in a basic capacity, than to have two, or even three, teams that focus on separate parts.

That is not to say that each member cannot have a preference as to what they like to work on. Some might prefer optimising the database indexes while others prefer configuring the network routing. This is OK, as long as each member has a working knowledge of, and has done some work in, every area. The team is then resistant to team member changes, while limiting the communication overhead and cost of development.

There might be a rare case where you will need someone with highly specialised knowledge, e.g. some detailed behaviour of Oracle Database. To handle this you do not swap out an existing member for one that only works with Oracle Databases. Instead, use an external expert either as your team's consultant or add them to the team as a temporary member. We will talk more about this later.

The only reason I see for such a team to exist is a lacking budget. Excuses like "we can’t find the right people" mean that you can’t find the right people at this salary level. If that is the case, then maybe you should consider the notion that you can’t afford to build such a product.

Team roles

Generally speaking, each team member should be considered an equal in terms of competences and simply referred to as a developer. However, I would distinguish two other roles that I think are critical for efficient performance.

Side note: Regarding team leads. I once heard that their role is to make the team better. The team makes the team better. Not one person. There should be no need for a team lead.

Architect

The team should have a dedicated Architect who is part of the development team and counted as a developer. I don’t want to get into the minutiae of titles (whether this is a system, enterprise or solution architect, etc.), so for the sake of the argument I will refer to the role simply as the Architect. The purpose of the Architect in a development team is to be responsible for (obviously) the whole architecture of the product and its documentation. What I mean by architecture is very broad. It includes not only what you would get from C4 diagrams (or whatever) but also package structures, design patterns, etc. Of course the Architect is not the only one responsible for it, and the decisions should be made by the complete team, as they are responsible for the product as a whole. It is, however, his main purpose.

Another responsibility of the Architect is the team's best practices: developing them and making sure they are followed. Actually, the whole team should be responsible for that, but the Architect should lead by example.

The Architect must also write code on a day-to-day basis. Not only the fun, cool code that flexes the brain matter (or, even worse, just POCs) but also the boring and mundane kind. This is so they can truly understand the ins and outs of their system and get a feel for which architectural and technical decisions were correct and which were not. The Architect is a technological leader and should be the most experienced developer on the team, but he is not a team leader and not the boss of the other developers.

There is one more reason for having an Architect on the team. It is hard to swallow, but it is the reality of our industry. In a corporation, a person with "Architect" in his email signature will have more sway and be able to get things that the team needs, which other team members would have a much harder time getting (or would not get at all).

The role of an Architect is a difficult one. It requires them to be humble enough to let the whole team make decisions, while being assertive enough to convince the team to go another way if needed. They need to be able to understand how the product works as a whole, from a greater perspective, and how it interacts with other systems, but also be able to work on the implementation details.

Specialists

We have touched upon specialists earlier. Specialists are people outside your core product development team who bring with them very narrow but deep knowledge in some area. This can range from databases, through networks and security, to UX design. Core team members can work on all of these by themselves, but as the saying goes, "Jack of all trades, master of none". In the vast majority of projects this is OK. Core team members do not need to be technology experts. They need to be product experts. When the situation comes where you do need that additional knowledge, you can bring a specialist onto the team. Those specialists can take two forms. As mentioned before, either a consultant (not meaning the type of contract they signed; they can be salaried workers from within the company) whom you ask for help with a specific problem, or a temporary "visiting" team member. For example, a DevOps engineer joins a team for a month when work on a new product starts and all the pipelines and environments need to be set up. He joins all the meetings, sits with the team and, for all intents and purposes, is a part of it. Until his services are no longer needed.

Regardless of which form the cooperation takes, they add to the team member count and communication overhead. This is why teams of six should be the maximum. Add two experts, like a DevOps engineer and a UX designer, and now there are eight.

Their role is not only to support but also to teach the core team members. Each member has the same access and privileges; they only lack the knowledge. Once the specialists pass that knowledge on, they shouldn’t need access to any resources. There might be some exceptions to this rule for security reasons. Care must be taken that those are truly security reasons and not some imaginary managerial hogwash.

In larger organisations those specialists can form their own teams. Just to make it clear, those are not product teams. They might even be developing some tools (be it internal or, better, open source) for their work, but their main job is to provide specialised support to the product teams.

Team structure

Teams should have a good mix of experience. Of course this is very dependent on the project difficulty (although in my experience most of the complexity in projects comes from interactions and lack of communication rather than actual technical difficulty). Please do not get hung up on the titles that follow. I use them only to distinguish relative experience levels.

Ideally, a team of six would consist of an Architect, a Senior developer, two Regular developers and two Junior developers. This is beneficial for both the employer and the employees. The employer gets enough capacity from the team to get the job done while not spending a fortune on only experienced developers. The less experienced developers get a chance to learn, while the more experienced ones can delegate most of the mundane work. I know that this seems harsh, but that is the truth. Someone has to write those mappers :).

Scaling Down

When scaling a team down I would first remove one Junior, then one Regular. Making the team even smaller might make it too small to be effective except in the most unique cases. If you must, I would remove the last Junior in order to have a very small but highly performant expert team. Any less is not a team; it's a pair. If you face a situation where you need to develop a small piece of software where one or two people should be sufficient, then consider doing the development as part of one of the existing products. This will give the piece of software an immediate client that will validate (or not) its purpose. There is no need to create an artificial team. If other products could use it as well, then make it available to them in any form you see fit.

Scaling Up

This has been repeated multiple times, but let’s repeat it again: you shouldn’t need to scale the team up, because the communication overhead will become significant. If you must do so, I would only add Regular developers. You will need someone who can get up to speed quickly, and this disqualifies less experienced developers. On the other hand, by the law of diminishing returns, adding a Senior will not get you the best bang for your buck. If you need to implement something difficult, then it would probably work better to assign an existing team member to it than a new one, even a very experienced one.

Conclusion

These are my thoughts. You could say my dream team. It would seem that this should be easy to achieve. There really aren’t any hard elements to it. It all seems like common sense to me, but my view is obviously subjective. I once had a chance to work in an environment that somewhat resembled this, but only for a short time. And it was glorious.


Loading large CSV file using Pandas

I have recently encountered a problem when trying to load a large CSV file into Python using Pandas. When I tried to load the file I was getting a message stating that the computer had run out of memory. This was surprising, because the file was just over 2 GB and my machine has well over 10 GB of RAM. I was never able to figure out which part of pandas was causing this (and to be fair I didn’t want to waste my time on it), but I found a workaround that might be helpful to others who don't want to split the file into smaller chunks and still want to load it using pandas. Simply put, load the file several rows at a time, i.e. instead of splitting the file, split the loading. Here is an example:

[sourcecode language="python"]
import pandas as pd

# load the first 50,000 rows, then skip the header plus those rows and load the rest
data = pd.read_csv("train_data.csv", nrows=50000)
data = data.append(pd.read_csv("train_data.csv", skiprows=50001, header=None, names=['a', 'b', 'c']))
[/sourcecode]

The file we load is called "train_data.csv". I have noticed that the computer can handle loads of around 1 GB at a time, which in this case corresponded to around 500K records. We load the first 500K records, then load the rest and append the resulting DataFrames. If your CSV is even larger, then just put the read in a loop like this:

[sourcecode language="python"]
chunks = []
for i in range((NUMBER_OF_RECORDS // MAX_RECORD_NUMBER) + (1 if NUMBER_OF_RECORDS % MAX_RECORD_NUMBER != 0 else 0)):
    # skip the header row plus all previously loaded records, then read the next chunk
    chunks.append(pd.read_csv("your_file.csv", skiprows=MAX_RECORD_NUMBER * i + 1, nrows=MAX_RECORD_NUMBER, header=None, names=['a', 'b', 'c']))
data = pd.concat(chunks, ignore_index=True)
[/sourcecode]

Here NUMBER_OF_RECORDS is the number of records in the file and MAX_RECORD_NUMBER is the number of records that your computer can handle loading at a time.

You also need to remember that if you skip lines, the headers will be wrong. You can’t point pandas at the header row by index, because the index refers to the already cut slice, not to the original file. Just specify the headers yourself. This might be problematic if you have a large number of columns, but then you can simply save the column names from the first slice and use them in the following loads.
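A minimal sketch of that last idea, reusing the file name and row counts from the snippets above:

[sourcecode language="python"]
import pandas as pd

# first slice keeps the header; save the column names for the later slices
data = pd.read_csv("train_data.csv", nrows=50000)
columns = data.columns

# later slices skip the header, so reuse the saved column names explicitly
rest = pd.read_csv("train_data.csv", skiprows=50001, header=None, names=columns)
data = pd.concat([data, rest], ignore_index=True)
[/sourcecode]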


Searching for a car

Introduction

Over the past year I have been searching for a new car. Actually, my first car ever. I am a huge supporter of public transport and a good bike, but the long commute and low bus frequency have started to get to me. This meant it was time for research. Normally I am quick to make a decision. It took me around a month between deciding to buy a flat and signing the deal. However, buying a car scares me a little. Maintenance costs can get large, especially with newer cars. I had a talk with a friend who is a car mechanic and he recommended that I look into the Toyota Corolla, especially the pre-2005 models, or even as far back as pre-2000. Apparently they are as failure-free as you can get. I am totally green when it comes to cars, so I took his word for it and got cracking on the search. While doing the research I quickly got lost in the sheer amount of data that is available and came to the realisation that this would be a great data mining exercise! This was also a good opportunity to learn some R. I had heard that it has some very nice visualisation tools and wanted to check them out.

Data Collection

For data gathering I picked otomoto.pl, one of the bigger sites that list used cars. I live in Poland and that is where I am searching for the car. For website scraping I like to use a Python library called Scrapy. It is simple to use and has good documentation. I will not get into the details of the implementation, since the program was relatively simple. I filtered the list to show Toyota Corolla cars built prior to 2005, looped through the pages and accessed each offer. From the offer page I retrieved the information most important to me and stored the results in a CSV file. Since there were only about 400 offers, there really was no reason to use anything more complicated for storage than a simple CSV file. Of course, the data comes from running the script only once. If we were to collect the data over a longer period, there would be more.

CODE
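For readers who don't want to open the linked code, here is a rough sketch of what such a spider could look like. The start URL and CSS selectors below are placeholders (otomoto.pl's markup has surely changed since then), not the ones actually used:

[sourcecode language="python"]
import scrapy

class CorollaSpider(scrapy.Spider):
    name = "corolla"
    # placeholder listing URL, filtered for Toyota Corolla built before 2005
    start_urls = ["https://www.otomoto.pl/osobowe/toyota/corolla/"]

    def parse(self, response):
        # visit every offer on the current listing page
        for href in response.css("a.offer-title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_offer)
        # follow pagination until there are no more pages
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_offer(self, response):
        # grab the fields of interest; selectors are again placeholders
        yield {
            "price": response.css(".price::text").get(),
            "year": response.css(".year::text").get(),
            "mileage": response.css(".mileage::text").get(),
        }
[/sourcecode]

Running it with something like scrapy runspider corolla_spider.py -o offers.csv dumps the scraped items straight into a CSV file.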

Data Analysis

As mentioned in the introduction, I will be using R to analyse the data. Be ready for a lot of graphs. I like graphs. Let's first find out where the cars come from:

country_of_origin

As expected, most cars come from Poland. The fact that a lot of cars come from Germany is also to be expected, since they are our neighbours and usually have better-quality cars. However, I did not expect there to be almost as many as from Poland. What could explain it is the huge portion of cars whose origin is unknown. I would guess that most of them (if not all) come from Poland.

While we are on the subject of locations, let's check out where the offers are located:

location_vs_price

Again no surprises. Most of them are around big cities like Warsaw or Krakow, but quite a few are spread around. On a side note, I must say creating this map with R was pleasantly quick and easy.

Next, let's look at the production year.

production_year

Hmm… I wonder what happened in 2002. The number of offers skyrocketed in that year. I was expecting a more gradual rise. Now the question is: are the cars from before 2002 better, and people don't want to sell them, or worse, and no one wants to buy them?

Another important car feature is its mileage.

mileage

There is quite a spread. There was even a car with over 2 000 000 km on the clock! I decided to cut the histogram to show only the mileage that would potentially be interesting to me, i.e. less than 500 000 km. Still, the result is interesting, especially the 1 km mileage cars. I chose to collect information on used cars, so I highly doubt that those cars really have 1 km on the clock! It is not the first time people have tried to scam on the mileage numbers, but that is quite bold. Apart from that, it seems that most of the cars fall in the 100 000 km range.

Now let's look at the most important aspect: price (PLN).

price_histogram

It seems that the price is spread out fairly evenly. I would say that I should be looking at the 10K–15K price range. Not much more to say here.

Here is another one: the price vs the production year.

price_vs_production_year

I was hoping there would be some indication of the quality of the cars reflected in the mean price for a given production year, but the trend seems consistent and no anomalies are visible, except for a small spike in 1999 (most probably caused by the small sample size).

I was curious whether there was a way to predict the price of a car based on the location where it is being sold. It would allow me to see how some machine learning tools work in R, as well as create an interesting map. I didn’t think there was a need to create an exact heatmap within the Polish contours, so I decided to just fit it as best as possible. The code is HERE, while the result is below:

location_vs_price_predicted

From the looks of it, there are some areas where the cars could potentially be cheaper. The interesting fact is that most white areas are next to big cities but not precisely within them.
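As a side note, the gist of the idea can be sketched in Python (the language used elsewhere on this blog), even though the actual analysis linked above is in R. The file name, column names and choice of a random forest here are my own illustrative assumptions:

[sourcecode language="python"]
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# illustrative file and column names for the scraped offers
offers = pd.read_csv("corolla_offers.csv")
X = offers[["latitude", "longitude"]]
y = offers["price"]

# fit a simple model that maps seller location to price
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# evaluating it on a grid of locations gives the heat map; one point shown here
warsaw = pd.DataFrame([[52.23, 21.01]], columns=["latitude", "longitude"])
print(model.predict(warsaw))
[/sourcecode]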

Here is the full REPO with R code and the collected data.

Conclusion

Overall, I think this was a very fun exercise. I will definitely need to look around more. The final decision is always made after you actually see the car, but I now have some idea of what price and mileage range, as well as what seller locations, to look at. The most fun was of course playing with R. I will definitely come back to it, since the map tools are truly great.


Starting out: The Question

In the first few posts I decided it would be a good idea to build a machine learning model. I have seen several tutorials on this topic, but rarely do they show the iterative process. It will be an interesting journey for both the reader and the writer, because I have no idea if there is a correlation between the features and the output, and therefore no idea if the final model will be of any use.

First we have to start with a problem definition: what question are we trying to answer? One of the most interesting problems out there is predicting stock market movements. A lot of work has already been done in this area. You have day traders trying to make money on small price changes (a lot of automated trading systems work on this time scale). There are swing traders that try to feel out the momentum of the stock market. Then there are the long-term investors, trying to gauge the real value of a company and catch those that are currently undervalued.

However, this is not the type of prediction that we will be interested in. I believe that the value of the stock market in many ways reflects the current mood of the population. If there is a crisis, stocks go down; if there is good news, stocks go up. By analysing the news we can try to work out the effect it will have on the stock market.

The analysis of the mood of a text is called sentiment analysis. There has already been a lot of work done in this field, and we will not be reinventing the wheel. It is a different and vast topic that would require a whole other project. We will use an existing library created by Stanford. They have created a command-line program written in Java that uses movie review comments to generate its models.

There has also been a lot of work done on combining sentiment analysis with stock market prediction. There are companies that monitor Twitter for information and notify traders of any relevant events. We will not be interested in Twitter in our case. We don’t have free access to historical data and I would like to try something a little different. We will use the Reddit comments on the /r/news subreddit. Hopefully the comments on the news will be a little more relevant to our problem space than 140-character tweets about Bieber's new album. I know it is possible to define filters on Twitter, but it would be a lot of work to specify all the keywords we would like to monitor. Besides, maybe there IS a correlation between Bieber fans and the stock market! We will leave that for another day. There are other possible subreddits that are even more finance-specific, but /r/news has by far the biggest activity, so we will start here. I will use PRAW to access the Reddit API.
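As a taste of what that looks like, here is a minimal sketch of pulling /r/news comments with a current version of PRAW. The credentials are placeholders you get by registering an application with Reddit; the actual download code will come in the next post:

[sourcecode language="python"]
import praw

# placeholder credentials obtained by registering an app at reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sentiment-experiment by /u/your_username",
)

# grab the comments of the current top submissions on /r/news
for submission in reddit.subreddit("news").hot(limit=10):
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    for comment in submission.comments.list():
        print(comment.body[:80])
[/sourcecode]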

Finally, we have to decide which machine learning toolkit to use. I will not go into a comparison of all the options, since this is highly dependent on personal preferences. I like Python with sklearn, and this is what I will use.

To sum up:

Question: Can we predict stock market (in our case NASDAQ) movements using sentiment analysis of /r/news comments?

In the next post we will implement a way to download the needed data.
