An open source Python library for automated feature engineering

Overview

Featuretools

"One of the holy grails of machine learning is to automate more and more of the feature engineering process." ― Pedro Domingos, A Few Useful Things to Know about Machine Learning

Featuretools is a Python library for automated feature engineering. See the documentation for more information.

Installation

Install with pip

python -m pip install featuretools

or from the conda-forge channel on conda:

conda install -c conda-forge featuretools

Add-ons

You can install add-ons individually or all at once by running

python -m pip install featuretools[complete]

Update checker - Receive automatic notifications of new Featuretools releases

python -m pip install featuretools[update_checker]

TSFresh Primitives - Use 60+ primitives from tsfresh within Featuretools

python -m pip install featuretools[tsfresh]

Example

Below is an example of using Deep Feature Synthesis (DFS) to perform automated feature engineering. In this example, we apply DFS to a multi-table dataset consisting of timestamped customer transactions.

>> import featuretools as ft
>> es = ft.demo.load_mock_customer(return_entityset=True)
>> es.plot()

Featuretools can automatically create a single table of features for any "target entity"

>> feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
>> feature_matrix.head(5)
            zip_code  COUNT(transactions)  COUNT(sessions)  SUM(transactions.amount) MODE(sessions.device)  MIN(transactions.amount)  MAX(transactions.amount)  YEAR(join_date)  SKEW(transactions.amount)  DAY(join_date)                   ...                     SUM(sessions.MIN(transactions.amount))  MAX(sessions.SKEW(transactions.amount))  MAX(sessions.MIN(transactions.amount))  SUM(sessions.MEAN(transactions.amount))  STD(sessions.SUM(transactions.amount))  STD(sessions.MEAN(transactions.amount))  SKEW(sessions.MEAN(transactions.amount))  STD(sessions.MAX(transactions.amount))  NUM_UNIQUE(sessions.DAY(session_start))  MIN(sessions.SKEW(transactions.amount))
customer_id                                                                                                                                                                                                                                  ...
1              60091                  131               10                  10236.77               desktop                      5.60                    149.95             2008                   0.070041               1                   ...                                                     169.77                                 0.610052                                   41.95                               791.976505                              175.939423                                 9.299023                                 -0.377150                                5.857976                                        1                                -0.395358
2              02139                  122                8                   9118.81                mobile                      5.81                    149.15             2008                   0.028647              20                   ...                                                     114.85                                 0.492531                                   42.96                               596.243506                              230.333502                                10.925037                                  0.962350                                7.420480                                        1                                -0.470007
3              02139                   78                5                   5758.24               desktop                      6.78                    147.73             2008                   0.070814              10                   ...                                                      64.98                                 0.645728                                   21.77                               369.770121                              471.048551                                 9.819148                                 -0.244976                               12.537259                                        1                                -0.630425
4              60091                  111                8                   8205.28               desktop                      5.73                    149.56             2008                   0.087986              30                   ...                                                      83.53                                 0.516262                                   17.27                               584.673126                              322.883448                                13.065436                                 -0.548969                               12.738488                                        1                                -0.497169
5              02139                   58                4                   4571.37                tablet                      5.91                    148.17             2008                   0.085883              19                   ...                                                      73.09                                 0.830112                                   27.46                               313.448942                              198.522508                                 8.950528                                  0.098885                                5.599228                                        1                                -0.396571

[5 rows x 69 columns]

We now have a feature vector for each customer that can be used for machine learning. See the documentation on Deep Feature Synthesis for more examples.
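
As a rough illustration of that last point (this is not taken from the official docs), the feature matrix could be one-hot encoded and handed to a scikit-learn estimator; the labels below are invented purely for demonstration:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical binary labels for the five mock customers; in a real project
# these would come from your prediction problem definition.
labels = pd.Series([1, 0, 1, 0, 1], index=feature_matrix.index)

# One-hot encode categorical feature columns so scikit-learn can consume them.
X = pd.get_dummies(feature_matrix).fillna(0).astype("float64")
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))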

Featuretools contains many different types of built-in primitives for creating features. If the primitive you need is not included, Featuretools also allows you to define your own custom primitives.
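
For illustration, here is a minimal sketch of a custom transform primitive, assuming the Featuretools 1.x convention of subclassing TransformPrimitive with Woodwork column schemas (the absolute-value computation is chosen only as a simple example):

import numpy as np
from woodwork.column_schema import ColumnSchema

from featuretools.primitives import TransformPrimitive


class AbsoluteValue(TransformPrimitive):
    """Example custom primitive: absolute value of a numeric column."""

    name = "absolute_value"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def absolute_value(column):
            return np.abs(column)

        return absolute_value

The class can then be passed to dfs through the trans_primitives argument alongside built-in primitive names.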

Demos

Predict Next Purchase

Repository | Notebook

In this demonstration, we use a multi-table dataset of 3 million online grocery orders from Instacart to predict what a customer will buy next. We show how to generate features with automated feature engineering and build an accurate machine learning pipeline using Featuretools, which can be reused for multiple prediction problems. For more advanced users, we show how to scale that pipeline to a large dataset using Dask.

For more examples of how to use Featuretools, check out our demos page.

Testing & Development

The Featuretools community welcomes pull requests. Instructions for testing and development are available here.

Support

The Featuretools community is happy to provide support to users of Featuretools. Project support can be found in four places depending on the type of question:

  1. For usage questions, use Stack Overflow with the featuretools tag.
  2. For bugs, issues, or feature requests, start a GitHub issue.
  3. For discussion regarding development on the core library, use Slack.
  4. For everything else, the core developers can be reached by email at [email protected].

Citing Featuretools

If you use Featuretools, please consider citing the following paper:

James Max Kanter, Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. IEEE DSAA 2015.

BibTeX entry:

@inproceedings{kanter2015deep,
  author    = {James Max Kanter and Kalyan Veeramachaneni},
  title     = {Deep feature synthesis: Towards automating data science endeavors},
  booktitle = {2015 {IEEE} International Conference on Data Science and Advanced Analytics, DSAA 2015, Paris, France, October 19-21, 2015},
  pages     = {1--10},
  year      = {2015},
  organization={IEEE}
}

Built at Alteryx Innovation Labs

Comments
  • Spark Example for Featuretools

    Bug/Feature Request Description

    In notebooks such as here: https://github.com/Featuretools/predict-next-purchase/blob/master/Tutorial.ipynb and documentation: https://docs.featuretools.com/usage_tips/scaling.html

    It mentions the ability to scale to Spark. Could an example be provided like it was for dask here: https://github.com/Featuretools/predict-next-purchase?
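
    A minimal sketch of the kind of example being requested, assuming the pandas API on Spark (pyspark.pandas) and the EntitySet.add_dataframe API from Featuretools 1.x; the table, column names, and logical types below are made up and may need adjusting:

    import pyspark.pandas as ps
    from woodwork.logical_types import Double, Integer

    import featuretools as ft

    # Hypothetical transactions table as a pandas-on-Spark DataFrame.
    spark_df = ps.DataFrame({
        "transaction_id": [1, 2, 3],
        "customer_id": [1, 1, 2],
        "amount": [10.5, 20.0, 5.25],
    })

    es = ft.EntitySet(id="spark_example")
    # Spark-backed dataframes need explicit logical types, since type inference is skipped.
    es = es.add_dataframe(
        dataframe_name="transactions",
        dataframe=spark_df,
        index="transaction_id",
        logical_types={
            "transaction_id": Integer,
            "customer_id": Integer,
            "amount": Double,
        },
    )

    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="transactions",
        trans_primitives=["absolute"],
    )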


    Issues created here on Github are for bugs or feature requests. For usage questions and questions about errors, please ask on Stack Overflow with the featuretools tag. Check the documentation for further guidance on where to ask your question.

    opened by charliec443 26
  • Refactor LatLong and Datetime Primitives into Separate Files

    Pull Request Description

    • Fixes #1855

    Changes: I decided to split all classes containing Lat/Long functions into their own file as well as classes containing date/time into their own file. In each file I also organized classes in alphabetical order. I don't believe there are any conflicts with the new files as I was able to run the tests.

    Comments: Whenever someone is able to review my changes I would also appreciate some input/advice on the testing. I am running them as described on Ubuntu. They run to the end but I do have some failed tests, not sure if this is due to my changes or if it is just part of the process.

    As an aside I apologize for all of the unnecessary commits. I'm still getting the hang of it and understand now I may have gone overboard. Also, I accidentally deleted my original branch which is why I am submitting a second pull request.

    opened by jacobboney 21
  • “IndexError: Too many levels” when running Featuretools dfs after upgrade

    Featuretools' dfs() method fails to run on my entity set after upgrading from v0.1.21 to v0.2.x and v0.3.0.

    The error is raised when the Pandas backend tries to calculate the aggregate features _calculate_agg_features(). In particular:

    --> 442 to_merge.reset_index(1, drop=True, inplace=True) ... IndexError: Too many levels: Index has only 1 level, not 2

    This is working fine in v0.1.x and the entity set hasn't changed after the upgrade. The entity set is composed of 7 entities and 6 relationships. Each entity (dataframe) is added via entity_from_dataframe.

    opened by jrkinley-zz 20
  • Memory crashing when using featuretools/dask

    I'm not sure what I'm doing wrong, but basically I'm taking a fairly large dataframe (11GB) and converting it to dask before running featuretools on it. During DFS my system is running out of memory, which is strange to me because I thought it should be writing to disk?

    from dask.distributed import Client, progress
    
    client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
    client
    
    import featuretools as ft
    import dask.dataframe as dd
    dt = {}
    dt.update(dict.fromkeys(categoricalValues, ft.variable_types.Categorical))
    dt.update(dict.fromkeys(NumericColumns, ft.variable_types.Numeric))
    dask_df = dd.from_pandas(Main[NumericColumns + categoricalValues], npartitions=50000)
    dask_df  # this works

    # Make an entityset and add the entity
    es = ft.EntitySet(id='Test')
    es = es.entity_from_dataframe(entity_id="dask_entity", dataframe=dask_df, make_index=True, index="index", variable_types=dt)

    # primitives to use
    default_agg_primitives = ["sum", "std", "max", "min", "mean", "count", "percent_true", "num_unique"]
    default_trans_primitives = ["add_numeric", "multiply_numeric"]

    feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='dask_entity',
                                          trans_primitives=default_trans_primitives,
                                          agg_primitives=default_agg_primitives,
                                          max_depth=2, features_only=False, verbose=True)
    

    My session crashes at this point from using all the memory. I followed various tutorials, but I'm not sure what I'm doing wrong. My goal is that after DFS is done, I can save the results to a file that I can then pass on to TF/Keras.

    opened by gautambak 16
  • How is `DIFF` calculated?

    I read the docs but can't understand how DIFF calculates its value.

    This part of my example:

    [screenshot of the generated feature matrix omitted]

    I generated this dataframe using dfs(..., time_window=None)

    (the time in the index refers to the cutoff_time)

    What I can't understand is this: DIFF(MAX(sales.amount)) should be calculated by applying DIFF to MAX(sales.amount), but since MAX(sales.amount) is an aggregated value, which would be only a single value (the max value before the cutoff time), how does DIFF calculate its value? I think DIFF requires at least 2 values to calculate.

    If I missed something, please let me know how the first value of DIFF(MAX(sales.amount)), 25714.287, is calculated.

    Thanks

    opened by rightx2 16
  • Calculating direct features use default value if parent missing

    Pull Request Description

    (replace this text with your description)


    After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request.

    opened by seriallazer 15
  • Support/approach for sliding window/multiple snapshots in time

    Hi there! (first of all huge thx for dfs, vision & tools, superb work)

    My question: the predict_next_purchase sample uses a single cutoff time, right? But doesn't that remove a lot of data that could help with the purchase prediction? And we're only using a single day for reference, right?

    only this data/users -> "Using users who had activity during training_window days before the cutoff_time, we look to see if they purchase the product in the prediction_window."

    I would like to use all data in a single final ML table for the models. Is there support for the cutoff being a sliding window (e.g. for each customer) of features from the last x days, predicting purchase (yes/no) up to x days in the future? So each customer would appear multiple times, depending on the chosen sliding window.

    I think it's a typical pattern in predicting future events (predictive maintenance, churn, healthcare). It usually applies to any kind of event prediction. (e.g. for every user or machine, predict the probability of event E for the next x days at a specific point in time; obviously the training dataset has proper timestamps so that we can "recalculate" feature values for a user/machine at any point in time.)

    The dataset becomes non-IID, obviously, so some cautions apply.

    Makes sense? What's the approach to use DFS with these scenarios? thx!
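
    A minimal sketch of how this can be approached with a cutoff_time table that has multiple rows per customer, assuming the mock customer EntitySet from the README example above (the snapshot times below are made up):

    import pandas as pd

    import featuretools as ft

    es = ft.demo.load_mock_customer(return_entityset=True)

    # One row per (customer, snapshot time); each customer gets one feature row per cutoff.
    cutoff_times = pd.DataFrame({
        "customer_id": [1, 1, 2, 2],
        "time": pd.to_datetime(["2014-01-01 04:00", "2014-01-01 08:00",
                                "2014-01-01 04:00", "2014-01-01 08:00"]),
    })

    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",
        cutoff_time=cutoff_times,
        training_window="4 hours",  # only use data from this window before each cutoff
    )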

    opened by rquintino 15
  • LatLong type

    The issue in testing comes from mock_ds.py, where the mock retail entityset is made with es.entity_from_csv(entity, ...) (line 292). This makes the latlong type in that entityset a string rather than a tuple. The options as I understand them are:

    1. Modify Latitude and Longitude to check if the latlong is a string
    2. Modify entity_from_csv to convert certain strings to tuples
    3. Change the test to do the pandas _from_csv, modify the dataframe and then make entity_from_dataframe.
    4. Leave Latitude and Longitude with no real tests for now.

    My gut is to go with 3. Do you have a preference @kmax12?

    opened by Seth-Rothschild 15
  • Bug with parallel feature matrix calculation within sklearn cross-validation

    Bug Description

    Hello, guys! Thank you for the quick release of featuretools 1.1.0!

    During my research I have faced the following bug: I have an estimator which is actually an imblearn Pipeline. The estimator consists of several steps, including my custom transformer which calculates a feature matrix with featuretools. And I want to check the quality of the model with the sklearn cross_validate function. If I set n_jobs > 1 both in featuretools.calculate_feature_matrix and in sklearn.cross_validate, then I get an unexpected error ValueError: cannot find context for 'loky'. When either one of the n_jobs values is set to 1, everything works fine.

    I googled for some time and I understood that such an error might happen when parallelization is used without if __name__ == '__main__' - but that's the best information I've got, nothing more valuable. So for me it looks like there is some conflict in how parallelization is used in sklearn and featuretools. And since both of the libraries are essential, as is parallelization when working with big data, I really hope you will be able to find a way to fix it :)

    P.S. this problem existed before the 1.0.0 release - previously I used 0.24.0 and still faced it

    Output of featuretools.show_info()

    Featuretools version: 1.1.0

    SYSTEM INFO

    python: 3.7.5.final.0 python-bits: 64 OS: Darwin OS-release: 19.6.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: ru_RU.UTF-8 LOCALE: ru_RU.UTF-8

    INSTALLED VERSIONS

    numpy: 1.21.1 pandas: 1.3.2 tqdm: 4.62.2 PyYAML: 5.4.1 cloudpickle: 1.6.0 dask: 2021.10.0 distributed: 2021.10.0 psutil: 5.8.0 pip: 19.2.3 setuptools: 41.2.0

    opened by VGODIE 14
  • Add include_cutoff_time arg to control whether data at cutoff times a…

    Add include_cutoff_time arg to control whether data at cutoff times are included in feature calculations and prevent training_window overlapping

    Pull Request Description

    There was a data overlapping problem when calculating the feature matrix: the data at the cutoff time might be used both in calculating features and in calculating target values (#918). This could cause data leakage and affect the results as well. There was an attempt to solve the issue (#930), but it still didn't solve the leakage problem. So, we decided to parameterize it to control whether data at cutoff times is included in feature calculations or not (#942), and this PR solves it.
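
    For context, a rough usage sketch of the new argument, reusing the mock customer example from the README (the cutoff times are made up):

    import pandas as pd

    import featuretools as ft

    es = ft.demo.load_mock_customer(return_entityset=True)

    cutoff_times = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "time": pd.to_datetime("2014-01-01 05:00:00"),
    })

    feature_defs = ft.dfs(entityset=es, target_entity="customers", features_only=True)

    # With include_cutoff_time=False, rows stamped exactly at the cutoff time are
    # excluded from the feature calculations; only data strictly before it is used.
    feature_matrix = ft.calculate_feature_matrix(
        features=feature_defs,
        entityset=es,
        cutoff_time=cutoff_times,
        include_cutoff_time=False,
    )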

    opened by rightx2 14
  • Fixed #297 update tests to check error strings

    • On the Windows platform, there is currently an open issue in pandas where it raises an error when reading a file with accents in the file path (e.g. régions.csv). So, I resolved it with the following:
    # featuretools\tests\testing_utils\mock_ds.py:334
    df = pd.read_csv(open(filenames[entity], 'r', encoding='utf8'), encoding='utf-8')
    
    • This snippet np.dtype((np.integer, np.floating)).type was causing this issue. So, I resolved it by changing it to the following:
    np.issubdtype(time, np.integer) or np.issubdtype(time, np.floating)
    
    • Not sure how to get the error text for test_not_enough_memory
    opened by jeff-hernandez 14
  • Fix `base_of_exclude` handling

    • closes #2323
    • currently, the check_stacking function is only called during the generation of aggregation features. This means that attributes such as base_of_exclude and stack_on_exclude do not work when creating transform features. This PR addresses that bug. It also does some refactoring by moving the stack_on_exclude attribute and related member data off of AggregationPrimitiveBase and onto PrimitiveBase, seeing as the functionality provided does not need to be unique to AggregationPrimitives.
    opened by sbadithe 1
  • ValueError: Sample larger than population or is negative, and then Fatal Python error

    When I tried to calculate_feature_matrix by chunks, I kept encountering a ValueError, after which a fatal Python error usually occurred. Please note this error only occurred after some chunks had been calculated, and no error showed up if I continued from where it failed by restarting the Python script. Please see below for the full trace info.

    2022-11-10 15:31:53,351 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.04 GiB -- Worker memory limit: 3.79 GiB
    Traceback (most recent call last):
      File "/home/zzz/python/test.py", line 306, in ft_test
        feature_matrix_ = ft.calculate_feature_matrix(
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 316, in calculate_feature_matrix
        feature_matrix = parallel_calculate_chunks(
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/calculate_feature_matrix.py", line 792, in parallel_calculate_chunks
        client.replicate([_es, _saved_features])
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/client.py", line 3481, in replicate
        return self.sync(
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 338, in sync
        return sync(
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 405, in sync
        raise exc.with_traceback(tb)
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/utils.py", line 378, in f
        result = yield future
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/client.py", line 3439, in _replicate
        await self.scheduler.replicate(
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1153, in send_recv_from_rpc
        return await send_recv(comm=comm, op=key, **kwargs)
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 943, in send_recv
        raise exc.with_traceback(tb)
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 769, in _handle_comm
        result = await result
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/scheduler.py", line 5781, in replicate
        for ws in random.sample(tuple(workers - ts.who_has), count):
      File "/home/zzz/.conda/envs/test/lib/python3.9/random.py", line 449, in sample
        raise ValueError("Sample larger than population or is negative")
    ValueError: Sample larger than population or is negative
    2022-11-10 15:31:53,461 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.99 GiB -- Worker memory limit: 3.79 GiB
    Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
    Traceback (most recent call last):
      File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
        self.run()
      File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 917, in run
        self._target(*self._args, **self._kwargs)
      File "/home/zzz/.conda/envs/test/lib/python3.9/site-packages/distributed/process.py", line 236, in _watch_process
        assert exitcode is not None
    AssertionError
    Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
    Traceback (most recent call last):
      File "/home/zzz/.conda/envs/test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    Using EntitySet persisted on the cluster as dataset EntitySet-a3d41f24f216a89dd794828f2871b580
        self.run()
    Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads
    Python runtime state: finalizing (tstate=0x17a50a0)

    Current thread 0x00007f992d262280 (most recent call first):

    Also, something that may be relevant to the previous fatal error: there are tons of DataFrame fragmentation and unmanaged memory use warnings in the log:

    /home/zzz/.conda/envs/test/lib/python3.9/site-packages/featuretools/computational_backends/feature_set_calculator.py:938: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
      return data.assign(**new_cols)
    
    2022-11-11 09:39:14,505 - distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.15 GiB -- Worker memory limit: 3.79 GiB
    

    Any ideas would be highly appreciated! Best regards!

    opened by dehiker 1
  • Revert changes for local docs build once related sphinx issue is closed

    In MR #2367, changes were made in docs/Makefile to allow the docs to be built locally using the make html command. This was needed due to errors that occurred when attempting to build the docs with Featuretools installed in editable mode. Docs builds failing in editable mode might be related to an issue with sphinx.

    When sphinx issue 10943 (https://github.com/sphinx-doc/sphinx/issues/10943) has been closed and resolved, we should revert the changes that were made to the Makefile, as indicated by the comments here:

    .PHONY: html
    html:
    	# Remove the following line when sphinx issue (https://github.com/sphinx-doc/sphinx/issues/10943) is closed
    	python -m pip install .. --quiet --no-dependencies
    	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html $(SPHINXOPTS)
    	# Remove the following line when sphinx issue (https://github.com/sphinx-doc/sphinx/issues/10943) is closed
    	python -m pip install -e .. --quiet --no-dependencies
    	@echo
    	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
    Labels: good first issue, documentation
    opened by thehomebrewnerd 8
  • Add Primitive attribute to indicate importance of ordering

    • As a user of Featuretools, I want to know which primitives are affected by the ordering of the data.
    • We can add ordering_is_important to the base Primitive class.
    • Default to False
    • Make it True for:
      • AbsoluteDiff, CumMax, GreaterThanPrevious, IsMaxSoFar, RollingMean, Lag, NumericBin.
    opened by gsheni 0
Releases(v1.18.0)
  • v1.18.0(Nov 15, 2022)

    v1.18.0 Nov 15, 2022

    • Enhancements
      • Add RollingOutlierCount primitive (#2129)
      • Add RateOfChange primitive (#2359)
    • Fixes
      • Sets uses_full_dataframe for Rolling* and Exponential* primitives (#2354)
      • Updates for compatibility with upcoming Woodwork release 0.21.0 (#2363)
      • Updates demo dataset location to use new links (#2366)
      • Fix test_holiday_out_of_range after holidays release 0.17 (#2373)
    • Changes
      • Remove click and CLI functions (list-primitives, info) (#2353, #2358)
    • Documentation Changes
      • Build docs in parallel with Sphinx (#2351)
      • Use non-editable install to allow local docs build (#2367)
      • Remove primitives.featurelabs.com website from documentation (#2369)
    • Testing Changes
      • Replace use of pytest's tmpdir fixture with tmp_path (#2344)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.17.0(Oct 31, 2022)

    v1.17.0 Oct 31, 2022

    • Enhancements

      • Add featuretools-sklearn-transformer as an extra installation option (#2335)
      • Add CountAboveMean, CountBelowMean, CountGreaterThan, CountInsideNthSTD, CountInsideRange, CountLessThan, CountOutsideNthSTD, CountOutsideRange (#2336)
    • Changes

      • Restructure primitives directory to use individual primitives files (#2331)
      • Restrict 2022.10.1 for dask and distributed (#2347)
    • Documentation Changes

      • Add Featuretools-SQL to Install page on documentation (#2337)
      • Fixes broken link in Featuretools documentation (#2339)

      Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.16.0(Oct 24, 2022)

    • Enhancements
      • Add ExponentialWeighted primitives and DateToTimeZone primitive (#2318)
      • Add 14 natural language primitives from nlp_primitives library (#2328)
    • Documentation Changes
      • Fix typos in aggregation_primitive_base.py and features_deserializer.py (#2317) (#2324)
      • Update SQL integration documentation to reflect Snowflake compatibility (#2313)
    • Testing Changes
      • Add Windows install test (#2330)

    Thanks to the following people for contributing to this release: @gsheni, @sbadithe, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.15.0(Oct 6, 2022)

    v1.15.0 Oct 6, 2022

    • Enhancements
      • Add series_library attribute to EntitySet dictionary (#2257)
      • Leverage Library Enum inheriting from str (#2275)
    • Changes
      • Change default gap for Rolling* primitives from 0 to 1 to prevent accidental leakage (#2282)
      • Updates for pandas 1.5.0 compatibility (#2290, #2291, #2308)
      • Exclude documentation files from release workflow (#2295)
      • Bump requirements for optional pyspark dependency (#2299)
      • Bump scipy and woodwork[spark] dependencies (#2306)
    • Documentation Changes
      • Add documentation describing how to use featuretools_sql with featuretools (#2262)
      • Remove featuretools_sql as a docs requirement (#2302)
      • Fix typo in DiffDatetime doctest (#2314)
      • Fix typo in EntitySet documentation (#2315)
    • Testing Changes
      • Remove graphviz version restrictions in Windows CI tests (#2285)
      • Run CI tests with pytest -n auto (#2298, #2310)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @thehomebrewnerd

    Breaking Changes

    • The EntitySet schema has been updated to include a series_library attribute
    • The default behavior of the Rolling* primitives has changed in this release. If this primitive was used without defining the gap value, the feature values returned with this release will be different than feature values from prior releases.
    Source code(tar.gz)
    Source code(zip)
  • v1.15.0.dev0(Oct 5, 2022)

  • v1.14.0(Sep 1, 2022)

    v1.14.0 Sep 1, 2022

    • Enhancements
      • Replace NumericLag with Lag primitive (#2252)
      • Refactor build_features to speed up long running DFS calls by 50% (#2224)
    • Fixes
      • Fix compatibility issues with holidays 0.15 (#2254)
    • Changes
      • Update release notes to make clear conda release portion (#2249)
      • Use pyproject.toml only (move away from setup.cfg) (#2260, #2263, #2265)
      • Add entry point instructions for pyproject.toml project (#2272)
    • Documentation Changes
      • Fix to remove warning from Using Spark EntitySets Guide (#2258)
    • Testing Changes
      • Add tests/profiling/dfs_profile.py (#2224)
      • Add workflow to test featuretools without test dependencies (#2274)

    Thanks to the following people for contributing to this release: @cp2boston, @gsheni, @ozzieD, @stefaniesmith, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.13.0(Aug 18, 2022)

    v1.13.0 Aug 18, 2022

    • Fixes
      • Allow boolean columns to be included in remove_highly_correlated_features (#2231)
    • Changes
      • Refactor schema version checking to use packaging method (#2230)
      • Extract duplicated logic for Rolling primitives into a general utility function (#2218)
      • Set pandas version to >=1.4.0 (#2246)
      • Remove workaround in roll_series_with_gap caused by pandas version < 1.4.0 (#2246)
    • Documentation Changes
      • Add line breaks between sections of IsFederalHoliday primitive docstring (#2235)
    • Testing Changes
      • Update create feedstock PR forked repo to use (#2223, #2237)
      • Update development requirements and use latest for documentation (#2225)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @sbadithe, @tamargrey

    Source code(tar.gz)
    Source code(zip)
  • v1.12.1(Aug 4, 2022)

    v1.12.1 Aug 4, 2022

    • Fixes
      • Update Trend and RollingTrend primitives to work with IntegerNullable inputs (#2204)
      • camel_and_title_to_snake handles snake case strings with numbers (#2220)
      • Change _get_description to split on blank lines to avoid truncating primitive descriptions (#2219)
    • Documentation Changes
      • Add instructions to add new users to featuretools feedstock (#2215)
    • Testing Changes
      • Add create feedstock PR workflow (#2181)
      • Add performance tests for python 3.9 and 3.10 (#2198, #2208)
      • Add test to ensure primitive docstrings use standardized verbs (#2200)
      • Configure codecov to avoid premature PR comments (#2209)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.12.0(Jul 19, 2022)

    v1.12.0 Jul 19, 2022

    warning: This release of Featuretools will not support Python 3.7

    • Enhancements
      • Add IsWorkingHours and IsLunchTime transform primitives (#2130)
      • Add periods parameter to Diff and add DiffDatetime primitive (#2155)
      • Add RollingTrend primitive (#2170)
    • Fixes
      • Resolves Woodwork integration test failure and removes Python version check for codecov (#2182)
    • Changes
      • Drop Python 3.7 support (#2169, #2186)
      • Add pre-commit hooks for linting (#2177)
    • Documentation Changes
      • Augment single table entry in DFS to include information about passing in a dictionary for dataframes argument (#2160)
    • Testing Changes
      • Standardize imports across test files to simplify accessing featuretools functions (#2166)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @ozzieD, @rwedge, @sbadithe

    Source code(tar.gz)
    Source code(zip)
  • v1.11.1(Jul 5, 2022)

    v1.11.1 Jul 5, 2022

    • Fixes
      • Remove 24th hour from PartOfDay primitive and add 0th hour (#2167)

    Thanks to the following people for contributing to this release: @tamargrey

    Source code(tar.gz)
    Source code(zip)
  • v1.11.0(Jun 30, 2022)

    v1.11.0 Jun 30, 2022

    • Enhancements
      • Add datetime and string types as valid arguments to dfs cutoff_time (#2147 )
      • Add PartOfDay transform primitive (#2128)
      • Add IsYearEnd, IsYearStart transform primitives (#2124)
      • Add Feature.set_feature_names method to directly set output column names for multi-output features (#2142)
      • Include np.nan testing for DayOfYear and DaysInMonth primitives (#2146)
      • Allow dfs kwargs to be passed into get_valid_primitives (#2157)
    • Fixes
    • Changes
      • Improve serialization and deserialization to reduce storage of duplicate primitive information (#2136, #2127, #2144)
      • Sort core requirements and test requirements in setup cfg (#2152)
    • Documentation Changes
    • Testing Changes
      • Fix pandas warning and reduce dask .apply warnings (#2145)
      • Pin graphviz version used in windows tests (#2159)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @rwedge, @sbadithe, @tamargrey, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.10.0(Jun 23, 2022)

    v1.10.0 June 23, 2022

    • Enhancements
      • Add DayOfYear, DaysInMonth, Quarter, IsLeapYear, IsQuarterEnd, IsQuarterStart transform primitives (#2110, #2117)
      • Add IsMonthEnd, IsMonthStart transform primitives (#2121)
      • Move Quarter test cases (#2123)
      • Add summarize_primitives function for getting metrics about available primitives (#2099)
    • Changes
      • Changes for compatibility with numpy 1.23.0 (#2135, #2137)
    • Documentation Changes
      • Update contributing.md to add pandoc (#2103, #2104)
      • Update NLP primitives section of API reference (#2109)
      • Fixing release notes formatting (#2139)
    • Testing Changes
      • Latest dependency checker installs spark dependencies (#2112)
      • Fix test failures with pyspark v3.3.0 (#2114, #2120)

    Thanks to the following people for contributing to this release: @gsheni, @ozzieD, @rwedge, @sbadithe, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.9.2(Jun 10, 2022)

    v1.9.2 June 10, 2022

    • Fixes
      • Add feature origin information to all multi-output feature columns (#2102)
    • Documentation Changes
      • Update contributing.md to add pandoc (#2103)

    Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.9.1(May 27, 2022)

    v1.9.1 May 27, 2022

    • Enhancements
      • Update DateToHoliday and DistanceToHoliday primitives to work with timezone-aware inputs (#2056)
    • Changes
      • Delete setup.py, MANIFEST.in and move configuration to pyproject.toml (#2046)
    • Documentation Changes
      • Update slack invite link to new (#2044)
      • Add slack and stackoverflow icon to footer (#2087)
      • Update dead links in docs and docstrings (#2092)
    • Testing Changes
      • Skip test for normalize_dataframe due to different error coming from Woodwork in 0.16.3 (#2052)
      • Fix Woodwork install in test with Woodwork main branch (#2055)
      • Use codecov action v3 (#2039)
      • Add workflow to kickoff EvalML unit tests with Featuretools main (#2072)
      • Rename yml to yaml for GitHub Actions workflows (#2073, #2077)
      • Update Dask test fixtures to prevent flaky behavior (#2079)
      • Update Makefile with better pkg command (#2081)
      • Add scheduled workflow that checks for broken links in documentation (#2084)

    Thanks to the following people for contributing to this release: @gsheni, @rwedge, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.9.0(Apr 27, 2022)

    v1.9.0 Apr 27, 2022

    • Enhancements
      • Improve UnusedPrimitiveWarning with additional information (#2003)
      • Update DFS primitive matching to use all inputs defined in primitive input_types (#2019)
      • Add MultiplyNumericBoolean primitive (#2035)
    • Fixes
      • Fix issue with Ordinal inputs to binary comparison primitives (#2024, #2025)
    • Changes
      • Updated autonormalize version requirement (#2002)
      • Remove extra NaN checking in LatLong primitives (#1924)
      • Normalize LatLong NaN values during EntitySet creation (#1924)
      • Pass primitive dictionaries into check_primitive to avoid repetitive calls (#2016)
      • Remove Boolean and BooleanNullable from MultiplyNumeric primitive inputs (#2022)
      • Update serialization for compatibility with Woodwork version 0.16.1 (#2030)
    • Documentation Changes
      • Update README text to Alteryx (#2010, #2015)
    • Testing Changes
      • Update unit tests with Woodwork main branch workflow name (#2033)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @rwedge, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.8.0(Mar 31, 2022)

    • Changes
      • Removed make_trans_primitive and make_agg_primitive utility functions (#1970)
    • Documentation Changes
      • Update project urls in setup cfg to include Twitter and Slack (#1981)
      • Update nbconvert to version 6.4.5 to fix docs build issue (#1984)
      • Update ReadMe to have centered badges and add docs badge (#1993)
      • Add M1 installation instructions to docs and contributing (#1997)
    • Testing Changes
      • Updated scheduled workflows to only run on Alteryx owned repos (#1973)
      • Updated minimum dependency checker to use new version with write file support (#1975, #1976)
      • Add black linting package and remove autopep8 (#1978)
      • Update tests for compatibility with Woodwork version 0.15.0 (#1984)

    Thanks to the following people for contributing to this release: @gsheni, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.7.0(Mar 16, 2022)

    v1.7.0 Mar 16, 2022

    • Enhancements
      • Add support for Python 3.10 (#1940)
      • Added the SquareRoot, NaturalLogarithm, Sine, Cosine and Tangent primitives (#1948)
    • Fixes
      • Updated the conda install commands to specify the channel (#1917)
    • Changes
      • Update error message when DFS returns an empty list of features (#1919)
      • Remove list_variable_types and related directories (#1929)
      • Transition to use pyproject.toml and setup.cfg (moving away from setup.py) (#1941, #1950, #1952, #1954, #1957, #1964 )
      • Replace Koalas with pandas API on Spark (#1949)
    • Documentation Changes
      • Add time series guide (#1896)
      • Update minimum nlp_primitives requirement for docs (#1925)
      • Add GitHub URL for PyPi (#1928)
      • Add backport release support (#1932)
      • Update instructions in release.md (#1963)
    • Testing Changes
      • Update test cases to cover main.py file (#1927)
      • Upgrade moto requirement (#1929, #1938)
      • Add Python 3.9 linting, install complete, and docs build CI tests (#1934)
      • Add CI workflow to test with latest woodwork main branch (#1936)
      • Add lower bound for wheel for minimum dependency checker and limit lint CI tests to Python 3.10 (#1945)
      • Fix non-deterministic test in test_es.py (#1961)

    Thanks to the following people for contributing to this release: @andriyor, @gsheni, @jeff-hernandez, @kushal-gopal, @mingdavidqi, @rwedge, @tamargrey, @thehomebrewnerd, @tvdboom

    Source code(tar.gz)
    Source code(zip)
  • v1.7.0.dev2(Mar 16, 2022)

  • v1.7.0.dev1(Mar 15, 2022)

  • v1.7.0.dev0(Mar 15, 2022)

  • v1.6.0(Feb 17, 2022)

    v1.6.0 Feb 17, 2022

    • Enhancements
      • Add IsFederalHoliday transform primitive (#1912)
    • Fixes
      • Fix to catch new NotImplementedError raised by holidays library for unknown country (#1907)
    • Changes
      • Remove outdated pandas workaround code (#1906)
    • Documentation Changes
      • Add in-line tabs and copy-paste functionality to docs (#1905)
    • Testing Changes
      • Fix URL deserialization file (#1909)

    Thanks to the following people for contributing to this release: @jeff-hernandez, @rwedge, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.5.0(Feb 14, 2022)

    v1.5.0 Feb 14, 2022

    warning: Featuretools may not support Python 3.7 in the next non-bugfix release.

    • Enhancements
      • Add ability to use offset alias strings as inputs to rolling primitives (#1809)
      • Update to add support for pandas version 1.4.0 (#1881, #1895)
    • Fixes
      • Fix featuretools_primitives entry point (#1891)
    • Changes
      • Allow only snake, camel, and title case for primitives (#1854)
      • Add autonormalize as an add-on library (#1840)
      • Add DateToHoliday Transform Primitive (#1848)
      • Add DistanceToHoliday Transform Primitive (#1853)
      • Temporarily restrict pandas and koalas max versions (#1863)
      • Add __setitem__ method to overload add_dataframe method on EntitySet (#1862)
      • Add support for woodwork 0.12.0 (#1872, #1897)
      • Split Datetime and LatLong primitives into separate files (#1861)
      • Null values will not be included in index of normalized dataframe (#1897)
    • Documentation Changes
      • Bump ipython version (#1857)
      • Update README.md with Alteryx link (#1886)
    • Testing Changes
      • Add check for package conflicts with install workflow (#1843)
      • Change auto approve workflow to use assignee (#1843)
      • Update auto approve workflow to delete branch and change on trigger (#1852)
      • Upgrade tests to use compose version 0.8.0 (#1856)
      • Updated deep feature synthesis and feature serialization tests to use new primitive files (#1861)

    Thanks to the following people for contributing to this release: @dvreed77, @gsheni, @jacobboney, @jeff-hernandez, @rwedge, @tamargrey, @thehomebrewnerd, @tuethan1999

    Source code(tar.gz)
    Source code(zip)
  • v1.4.1(Jan 28, 2022)

    v1.4.1 Jan 28, 2022

    • Changes
      • Set upper bound for compatible Woodwork version (#1872)
      • Restrict pandas and koalas max versions (#1863)
    • Testing Changes
      • Upgrade tests to use compose version 0.8.0 (#1856)

    Thanks to the following people for contributing to this release: @dvreed77, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.4.0(Jan 11, 2022)

    • Enhancements
      • Add LatLong transform primitives - GeoMidpoint, IsInGeoBox, CityblockDistance (#1814)
      • Add issue templates for bugs, feature requests and documentation improvements (#1834)
    • Fixes
      • Fix bug where Woodwork initialization could fail on feature matrix if cutoff times caused null values to be introduced (#1810)
    • Changes
      • Skip code coverage for specific dask usage lines (#1829)
      • Increase minimum required numpy version to 1.21.0, scipy to 1.3.3, koalas to 1.8.1 (#1833)
      • Remove pyyaml as a requirement (#1833)
    • Documentation Changes
      • Remove testing on conda forge in release.md (#1811)
    • Testing Changes
      • Enable auto-merge for minimum and latest dependency merge requests (#1818, #1821, #1822)
      • Change auto approve workflow to use PR number and run every 30 minutes (#1827)
      • Add auto approve workflow to run when unit tests complete (#1837)
      • Test deserializing from S3 with mocked S3 fixtures only (#1825)
      • Remove fastparquet as a test requirement (#1833)

    Thanks to the following people for contributing to this release: @davesque, @gsheni, @rwedge, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.3.0(Dec 2, 2021)

    • Enhancements
      • Add NumericLag transform primitive #1797
    • Changes
      • Update pip to 21.3.1 for test requirements #1789
    • Documentation Changes
      • Add Docker install instructions and documentation on the install page. #1785
      • Update install page on documentation with correct python version #1784
      • Fix formatting in Improving Computational Performance guide #1786

    Thanks to the following people for contributing to this release: @gsheni, @HenryRocha, @tamargrey, @thehomebrewnerd

    Source code(tar.gz)
    Source code(zip)
  • v1.3.0.dev0(Dec 2, 2021)

  • v1.2.0(Nov 15, 2021)

    • Enhancements
      • Add Rolling Transform primitives with integer parameters (#1770)
    • Fixes
      • Handle new graphviz FORMATS import (#1770)
    • Changes
      • Add new version of featuretools_tsfresh_primitives as an add-on library (#1772)
      • Add load_weather as demo dataset for time series (#1777)

    Thanks to the following people for contributing to this release: @gsheni, @tamargrey

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0.dev0(Nov 15, 2021)

  • v1.1.0(Nov 2, 2021)

    v1.1.0 Nov 2, 2021

    • Fixes
      • Check base_of_exclude attribute on primitive instead of feature class (#1749)
      • Pin upper bound for pyspark (#1748)
      • Fix get_unused_primitives only recognizes lowercase primitive strings (#1733)
      • Require newer versions of dask and distributed (#1762)
      • Fix bug with pass-through columns of cutoff_time df when n_jobs > 1 (#1765)
    • Changes
      • Add new version of nlp_primitives as an add-on library (#1743)
      • Change name of date_of_birth (column name) to birthday in mock dataset (#1754)
    • Documentation Changes
      • Upgrade Sphinx and fix docs configuration error (#1760)
    • Testing Changes
      • Modify CI to run unit test with latest dependencies on python 3.9 (#1738)
      • Added Python version standardizer to Jupyter notebook linting (#1741)

    Thanks to the following people for contributing to this release: @bchen1116, @gsheni, @HenryRocha, @jeff-hernandez, @ridicolos, @rwedge

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0.dev0(Nov 1, 2021)
